This is an Eval Central archive copy, find the original at evalacademy.com.
You have developed your evaluation plan, written your questions, and deployed your tools. And now, you have just received your raw data. You are excited to jump in and analyze the data.
You note several inconsistencies in the raw data. Three different date formats are used within the same column. You take a closer look and several categorical values are spelled and punctuated differently. Also, several cells contain both numerical and text data. And dispersed randomly across the spreadsheet are blank open cells.
These are a few common data inconsistencies we face as evaluators and analysts when receiving a fresh set of raw data. Data cleaning is often the first step, unless receiving the unicorn of all spreadsheets with perfect formatting, consistency, and accuracy. However, more likely than not, you will experience one or several of these data inconsistencies in your raw data.
This article addresses four common mistakes made with data entry and offers suggestions to improve data entry to increase the overall productivity and, let us not forget, happiness of your colleagues, collaborators, and evaluators.
Many may think that “a date is a date,” but this is not always the case within a spreadsheet. When several date formats are used (e.g., 2020-04-13; April 13th, 2020; Apr 13, 2020), the spreadsheet, and your evaluator, may be confused trying to make sense of these values. In Excel, dates are stored in each cell as a numerical value and converted to a pre-selected date format. However, if text values are included with your dates, Excel may be unable to accurately interpret each cell entry.
Best case scenario: a few data points are missed when analyzing or visualizing your data.
However, this best case is not ideal when you, as an evaluator, are needing to present accurate results to your clients, such that they can digest and act on the results of your evaluation.
How do we improve date entry?
Enter dates in a YYYY-MM-DD format.
If you prefer a different date format (see Ablebits article for ideas on date formatting), be consistent. Dates entered consistently, regardless of format, are much easier to work with.
Consider using data validation to prevent improperly formatted dates or text from being entered into your spreadsheet.
Communicate with data entry personnel and agree on a single date format for data entry.
For anyone that works largely with quantitative data, your heart may have just sunk as you read those two words: “Text data.” In this article, we are not talking about open-ended questions for our qualitative analyses (this may be a topic for another article). No, we are referring to text data in the form of Likert scales (e.g., Strongly agree to Strongly disagree) and categorical variables (e.g., gender, ethnicity).
But why does the thought of analyzing text data result in mild anxiety? It boils down to spelling and grammar. When cell entries vary in their spelling and grammar, your spreadsheet software may count each variant of a cell entry as a discrete value. Your five-point Likert scale now becomes a ten-point scale and your categorical response variables now has three unique variants for each potential response.
How do we ensure that text data is entered accurately?
Consider converting Likert scale data to numbers (e.g., Strongly agree = 1; Strongly disagree = 5).
Similarly, categorical variables can be coded as numbers (e.g., Female = 1; Male = 2).
If converting text data to numerical data, provide a guide to data entry personnel to follow (and for you to review when analyzing the data).
If you prefer to go the route of using text, consider using data validation. Lock cells to accept only values spelled correctly and avoid misspellings and other errors.
Multiple Values per Cell
It is tempting to include several pieces of related data in a single cell. You have created a spreadsheet and asked the person entering the data to record a few biometrics: height and blood pressure.
You ask that height be measured in centimeters. You receive the data and each result is entered with the unit “cm” at the end of each value. When glancing at the data, it is useful to understand the unit of measurement. However, this creates complications when analyzing the data. Your spreadsheet will read each height measurement as a text value. Try to complete a simple calculation and you will receive an error. Units are important but should be included in the column title rather than included with individual cell values.
Now for blood pressure. You are measuring systolic over diastolic blood pressure. A patient gets a reading of 120/80 and the value is entered under a single column titled “Blood Pressure”. While this may save a few seconds on data entry, this can complicate analysis as two values share a single cell. It is better to split the data into two columns: (1) “Systolic Blood Pressure” and (2) “Diastolic Blood Pressure”. Each cell should contain a single, discrete value. It is better to add an extra column to your spreadsheet than attempt to cram multiple data into a single cell.
How do we avoid multiple values per cell?
Communicate with data entry personnel and explain that each cell should contain a single, discrete data value.
If a value has a unit of measurement, include the measurement in the column title and only insert numerical values into your cells.
With related data, split the results into two or more columns with each column relating to a single value (see Microsoft Office Support to see how to split delimited data into multiple cells in Excel).
Keep notes separate from raw data values. If you need to clarify or explain a value, add a separate column to include your note.
Now to address missing data. Missing data can come in many guises, from blank cells to NA’s to numerical place holders (0’s or 99’s). While each method of dealing with missing data has its merits, consistency in handling missing data is the key to clean data.
Within spreadsheet software, such as Excel, blank cells will not interfere with calculations. Looking to average a column? Excel will skip the blanks and average the present data. However, insert a numerical place holder, such as 0 or 99, and your averages will now be skewed by the presence of these values. It is possible to work around these place holders, but it is easy to forget about these place holders when working with numerical data as they are not immediately identified as missing values.
An alternate is to use NA to fill in the blanks. This makes it immediately clear that data is missing from the cell. However, a hiccup occurs if calculating values in Excel. Your calculations are likely to error out due to the NA’s present in your data. This can be worked around by excluding NA’s in your formulae. An advantage of using NA’s is when exporting your data to statistical software (such as R Statistical software), which are often designed to handle NA’s effectively. If you import your raw data into such software frequently, consider replacing missing data with NA.
How do we handle missing data?
Leave no cell blank. While blank cells may work for calculating values in your spreadsheet software, it is unclear whether these cells reflect an absent value (i.e., there is no value available) or that the cell was missed during data entry (i.e., human error).
Avoid using numerical place holders to code for missing data. These are easily overlooked and will skew data calculations.
Use a code like NA. While you may be required to adjust your formulae in your spreadsheet software, it will be clear that these cells represent missing data.
Additionally, NA works well with external statistical software. If you use your spreadsheets for data storage only, use NA as it will work best with your statistical software of choice.
Regardless of your choice, be consistent with coding missing data and communicate to data entry personnel on the preferred method.
It is near impossible to eliminate all data entry mistakes; humans make mistakes. However, it is possible to be aware of common mistakes and create procedures to reduce data entry mistakes. Consistency and communication are key when sharing data between teams and collaborators. If everyone is on the same page, it is possible to reduce many of these mistakes. Address these mistakes early, and you will have much cleaner data to work with come analysis time. And clean data results in accurate, efficient, and actionable results.
Consistency. Consistently entered data will result in clean, readable data that will significantly improve the efficiency of your data analysis.
Communication. Communicate data entry protocols and codes to everyone that will interact with the spreadsheet.
Data validation. Consider using data validation to ensure that only valid data are entered into each cell.
One piece of data per cell. Each cell should contain a single, discrete value. If you have related data, consider splitting the data into multiple columns.
No blank cells. Missing data can cause issues in data analysis. Find a code to represent missing data and use it consistently throughout your data.
We respect your privacy.