This is an Eval Central archive copy; find the original at depictdatastudio.com.
In 1998, NASA launched the unmanned Mars Climate Orbiter to study the atmosphere of Mars.
However, the spacecraft never finished its mission. In fact, upon reaching Mars the next year, the $125 million spacecraft flew too low into the Martian atmosphere and disintegrated.
What could have caused such a crash landing?
Was it a freak meteor strike?
Faulty equipment?
ALIENS, perhaps?!?
The answer, surprisingly, is that the crash was caused by a classic case of BAD DATA.
That’s right–this spacecraft, this wonder of science, was rendered useless by bad data being fed into its otherwise flawless system. The Mars Climate Orbiter was designed to work in metric units, but unfortunately commands for the spacecraft were being sent from Earth in English units.
The result was a $125 million conversion error.
Collecting Survey Data at HOPE International
So what exactly does this have to do with spreadsheets? I’m glad you asked. I work with the nonprofit HOPE International as a Listening, Monitoring, and Evaluation Analyst. The mission of HOPE is to invest in the dreams of families in the world’s underserved communities as we proclaim and live the Gospel.
My team contributes to that by facilitating listening to those we serve, primarily through administering surveys and analyzing the data. Our surveys focus on many things–impact, experience, satisfaction, etc.–but regardless of the focus area, I can’t WAIT to dive into the results.
When you spend so long crafting a questionnaire, translating it just right, and training enumerators to administer the survey, it’s nearly impossible to resist jumping into analysis once the results are in.
However, I’ve found that this is precisely what I must do–resist the urge to jump straight into analysis.
This is because, just as in the case of the Mars Climate Orbiter, a perfectly designed analysis system with flawless pivot tables will amount to nothing (or worse, a $125 million error) without proper data flowing into the system.
That’s right–I’m talking about DATA CLEANING.
Why Data Cleaning?
Data cleaning is an essential part of our survey process.
There have been many real-world situations where the results would have been biased or even completely incorrect had we not first taken the time to clean the data.
Here are a few situations we’ve encountered in the past:
- Duplicate survey responses caused by system error, or a respondent accidentally taking a survey twice.
- Pretest/training responses being included with the actual data from survey administration.
- Surveys being completed in an extremely short amount of time, where most, if not all, of the answers were left blank.
- Data entry errors, such as accidentally copying a response in Excel across multiple rows and erasing original responses.
As you can see, the issues above could drastically skew results if not corrected through a data cleaning process.
As tempting as it is to jump straight into crafting pivot tables and analyzing the results of the survey, engaging in a thorough cleaning and recoding of the data is vital to ensuring accurate results.
6 Data Cleaning Steps to Save You Millions
I’d like to show you what we do for our data cleaning process, and how Simple Spreadsheets helped to make this process even stronger.
In this article, you’ll learn:
- How to check for duplicates (for example, if someone accidentally took the same survey twice);
- How to check the survey for changes (for example, if translation typos were found after going live);
- How to check for outliers in survey duration (how long it takes someone to complete a survey);
- How to Use COUNTA and COUNTBLANK;
- How to Recode Variables with IF Statements; and
- How to Combine Datasets Together with VLOOKUP.
Yes, all of these data cleaning steps can be completed in Microsoft Excel.
(1) How to Look for Duplicates
One of the most important steps in our data cleaning process is to look for “duplicates.”
Duplicates are two (or more) entries that are either exactly the same, or match on a critical piece of information (like ID number or name).
It’s crucial that we identify these duplicates and resolve them before starting analysis. Otherwise, our results will not be accurate, and will instead overrepresent the duplicated entries.
Which Variable(s) Should Be Unique?
To check for duplicates, first identify the key variables in your data set that should be unique for each respondent.
For instance, our clients have an identification number which is unique to them. This field should not be duplicated in a data set.
Highlight the Duplicates in a Different Color
Once you determine your key variables, there is a simple Excel process that you can follow in order to identify and sort through your duplicates:
- Step 1- Highlight the column of interest.
- Step 2- In the Excel ribbon, select “Home” > “Conditional Formatting” > “Highlight Cells Rules” > “Duplicate Values.”
- Step 3- In the pop-up window, choose a highlight color of your choice and press “OK.” This will highlight all of the cells in the selected column that contain duplicate values.
Once these steps have been followed, any duplicates for the criteria you selected will be highlighted.
Manually Examine Each of the Duplicate Entries
I then like to filter the column to show only the duplicate values, sort in ascending order, and manually go down the list to examine each duplicate pair (or trio, and so on).
Doing this manually really helps you to get a feel for the data, and understand whether the duplicates are truly duplicates, or whether there is some other systematic issue at play.
If the duplicates match exactly in all fields in the survey, then they are “true duplicates.” We usually keep the response that was entered first and remove the other response.
If they don’t match exactly in all of the fields, then we connect with our team that administered the survey and try to determine together how to handle the entries, whether removing them entirely, keeping some, or keeping all.
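The same duplicate check can be sketched outside Excel. Here is a minimal Python version (the client IDs and responses are made up for illustration): count how often each key variable appears, then flag every row sharing a duplicated ID for manual review.

```python
from collections import Counter

# Hypothetical survey rows: (client_id, response); values are illustrative.
rows = [
    ("C-1001", "Very satisfied"),
    ("C-1002", "Satisfied"),
    ("C-1001", "Very satisfied"),   # exact duplicate of the first row
    ("C-1003", "Unsatisfied"),
]

# Count how often each key variable (here, the client ID) appears.
id_counts = Counter(client_id for client_id, _ in rows)
duplicate_ids = {cid for cid, n in id_counts.items() if n > 1}

# Flag every row that shares a duplicated ID for manual review.
flagged = [row for row in rows if row[0] in duplicate_ids]
print(duplicate_ids)   # {'C-1001'}
```

As in the manual process above, the flagged rows still need a human decision: keep the first entry if they match exactly, or escalate to the survey team if they don't.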
(2) How to Check the Survey for Version Changes
Another important step in the process is to check survey versions for any notable changes.
When we administer a survey, we do everything we can to test it beforehand so that no changes are needed during administration.
However, unforeseen changes to translation, wording, or even whole questions sometimes need to be made during the administration process, and it’s important to check if any of these changes could impact how data is interpreted.
For instance, if the first 10 respondents to a survey saw this question:
“How satisfied are you with the training curriculum?”
- Very satisfied
- Satisfied
- Neither satisfied nor unsatisfied
- Very unsatisfied
- Very unsatisfied
And the rest of the respondents saw this question:
“How satisfied are you with the training curriculum?”
- Very satisfied
- Satisfied
- Neither satisfied nor unsatisfied
- Unsatisfied
- Very unsatisfied
Then the fourth answer would mean two different things, depending on when the survey was taken.
In a large survey that is being translated into multiple languages, it is quite possible that small details like this go unnoticed, even through quality checks and testing.
Compare Spreadsheets with the “Compare Files” Add-In for Excel
To avoid meticulously comparing each version of the survey row by row in Excel, we use the “Compare Files” function.
It lives on the “Inquire” tab, an Excel add-in that isn’t enabled by default, and I highly recommend turning it on.
It saves a considerable amount of time when comparing two spreadsheets.
To use this function:
- Simply open the spreadsheets you want to compare at the same time.
- Click “Compare Files.”
- Choose the files you would like to compare.
- Press the “Compare” button.
Excel will then open a third document which lists all the differences (and their categories).
Our team then goes through this document to see if any critical changes were made to the survey during administration, and we account for these changes accordingly in the analysis.
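The core of what Compare Files reports, a cell-by-cell list of differences, can be sketched in a few lines of Python. This example reuses the satisfaction-scale typo from above; the option lists are illustrative.

```python
# Two hypothetical versions of the survey's answer options.
v1 = ["Very satisfied", "Satisfied", "Neither satisfied nor unsatisfied",
      "Very unsatisfied", "Very unsatisfied"]   # pre-fix version with the typo
v2 = ["Very satisfied", "Satisfied", "Neither satisfied nor unsatisfied",
      "Unsatisfied", "Very unsatisfied"]        # corrected version

# Report every position (1-based) where the two versions disagree.
differences = [(i + 1, old, new)
               for i, (old, new) in enumerate(zip(v1, v2))
               if old != new]
print(differences)  # [(4, 'Very unsatisfied', 'Unsatisfied')]
```

A difference list like this tells you exactly which answer choices changed mid-administration, so you know which responses need special handling.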
(3) How to Check for Outliers in Survey Duration
Lastly, a simple but important step in our own process is to check how long each survey took to complete.
Usually, we determine the average time it took to complete the survey, and then manually investigate any responses that were much faster or much slower than that average length.
These could just be outliers, or they could be surveys that weren’t finished, system errors, data entry errors, etc.
We also look for “straightlining,” which is when a respondent gives the same answer to every question (usually just to get the survey over with faster).
Removing erroneous responses and accounting for straightlining are important parts of our analysis.
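Both checks are easy to prototype. The sketch below flags durations more than two standard deviations from the mean (the threshold and the minute values are illustrative assumptions, not our fixed rule) and tests a response row for straightlining.

```python
from statistics import mean, stdev

# Hypothetical completion times in minutes, one per respondent.
durations = [12.0, 14.5, 13.0, 2.0, 15.0, 13.5, 14.0]

avg, sd = mean(durations), stdev(durations)
# Flag anything more than two standard deviations from the mean.
outliers = [d for d in durations if abs(d - avg) > 2 * sd]
print(outliers)  # the 2-minute survey stands out

# Straightlining check: every answer in the row is identical.
def is_straightlined(answers):
    return len(set(answers)) == 1

print(is_straightlined([3, 3, 3, 3]))  # True
```

Flagged rows aren't deleted automatically; as with duplicates, each one gets a manual look to decide whether it's a genuine outlier or an error.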
(4) How to Use COUNTA and COUNTBLANK in Excel
The Simple Spreadsheets course both affirmed the existing steps in our data cleaning process (particularly in the area of handling duplicates) and added new tools to our toolbox!
One simple pair of tools that I’ve found helpful is the COUNTA and COUNTBLANK functions.
These functions are two sides of the same coin.
- COUNTA returns the number of cells that are not empty in a specified range.
- COUNTBLANK returns the number of cells that are blank in a specified range.
We’ve used these two functions to quickly assess whether our data passes the “sniff test.”
For instance, if there is a question that we designed as mandatory for everyone in the survey but only half of the cells are populated, there is something wrong with our dataset and we need to investigate further.
Some of the possible causes could be that the question was not marked as mandatory in the survey software, the data was entered incorrectly, there was an error in translation, etc.
Basically, by using these two functions on each column in our dataset, we get a bird’s-eye view of the pattern of responses to each question in the survey.
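In Excel these are single-cell formulas such as `=COUNTA(B2:B7)` and `=COUNTBLANK(B2:B7)`. The equivalent logic in Python, on a made-up column for one mandatory question, looks like this:

```python
# Hypothetical responses to one mandatory question; "" means blank.
column = ["Yes", "", "No", "Yes", "", "Yes"]

count_a = sum(1 for cell in column if cell != "")      # like COUNTA
count_blank = sum(1 for cell in column if cell == "")  # like COUNTBLANK

print(count_a, count_blank)  # 4 2
```

If a mandatory question shows a large blank count, that's the cue to investigate the survey software settings, data entry, or translation, as described above.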
(5) How to Recode Variables with IF Statements in Excel
Recoding was a game-changer for me in the data cleaning process.
Before taking Simple Spreadsheets, I didn’t know how to make the data do what we needed it to do for our analyses.
For instance, maybe the geographical information in our database was captured in cities, but I needed to organize it into regions for our stratified random sample.
Or, maybe the data contained registration dates for clients, but I needed to organize them into different categories of tenure.
I didn’t know any method to do this besides manually going through the data and recategorizing by hand.
Needless to say–WOW did Simple Spreadsheets save me time!
The IF function allowed me to recategorize data by using a simple formula.
For a practical example, I had a list of bank branches that I needed to group together into different regions. Instead of doing this manually, I was able to use the IF formula to create different groupings for the regions all at once.
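In Excel this takes the form of nested IF statements, for example `=IF(A2="Hillside Branch","South","North")` (the branch and region names here are invented for illustration). The same recoding idea, sketched in Python with a lookup table instead of nested IFs:

```python
# Hypothetical mapping from bank branch to region; all names are made up.
branch_to_region = {
    "Central Branch": "North",
    "Riverside Branch": "North",
    "Hillside Branch": "South",
}

branches = ["Riverside Branch", "Hillside Branch", "Central Branch"]

# Recode each branch into its region; unknown branches are flagged
# rather than silently guessed.
regions = [branch_to_region.get(b, "Unknown") for b in branches]
print(regions)  # ['North', 'South', 'North']
```

The "Unknown" fallback is worth keeping in any recoding formula: it surfaces misspelled or new branch names instead of quietly miscategorizing them.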
(6) How to Combine Datasets Together with VLOOKUP in Excel
VLOOKUP has also been an extremely helpful formula for getting datasets to work together.
We often have multiple datasets that need to be merged, because our information comes from different sources.
Because most of our clients have a Client ID number, I was able to use that ID as the lookup key in the VLOOKUP function, merging datasets in minutes with confidence.
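A minimal sketch of that merge, with invented Client IDs: one dataset holds survey answers, the other holds client details, and the shared ID plays the role of VLOOKUP's lookup value.

```python
# Two hypothetical datasets keyed by Client ID (all values are made up).
survey = [("C-1001", "Very satisfied"), ("C-1002", "Satisfied")]
registry = {"C-1001": "North", "C-1002": "South"}

# Look up each survey row's region by Client ID, like
# =VLOOKUP(id, registry_range, column, FALSE) with an exact match.
merged = [(cid, answer, registry.get(cid, "Not found"))
          for cid, answer in survey]
print(merged)
```

As with VLOOKUP's exact-match mode, an ID missing from the second dataset is reported ("Not found") rather than matched approximately, which is what you want when merging by ID.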
Save Yourself $125 million
I honestly can’t count the number of times the data cleaning process has brought us helpful insights, both ensuring accurate results and helping us improve our processes so that we avoid or account for potential errors in the future.
Simple Spreadsheets was a great help in affirming and bolstering our data cleaning process, and I hope that this article gives you a jump start into creating a similar process that suits your needs.
It’s not always the most fun process (although I’ve grown to really love it and have earned the title of “Detective” on my team), but it is CRUCIAL to ensuring a good result.
Just ask NASA…a million dollar data cleaning system would still have saved them $124 million in the long run.