Data Prep#
By Kenneth Burchfiel
Released under the MIT License
Before a dataset can be analyzed and visualized within Python, it often needs to be reformatted and cleaned. In order to demonstrate how Python can easily perform these reformatting and cleaning tasks, this script will clean and reformat a set of NVCU winter student survey results; combine those results with fall and spring survey data; and merge in information from our current enrollment table.
Our survey_results database file (which we’ll import from PFN’s Appendix folder) already contains student survey responses for the fall and spring. However, let’s say that you’ve been asked to add a set of winter results to this dataset as well, then calculate a weighted average of fall, winter, and spring survey results for each student.
If these results were in the same format as the fall and spring ones and had no missing data, this process would be very simple. Unfortunately, that’s not the case with the winter results that we’ll be processing within this script. These results feature:
Column names that differ from those in the fall/spring results
Different data formats
A missing column
Duplicate values
Missing values for certain students
In addition, to make matters even more complex, these winter results are spread out over 16 different files (one for each level within each college).
It would be cumbersome and mind-numbing to modify each of these 16 datasets within Excel, Google Sheets, or a similar program so that they could be combined with our pre-existing fall and spring data. However, the Python code shown below will make this data cleaning process much easier. And once this script is in place, if you happened to get next year’s winter results in the same format* as this year’s, you’d be able to get them cleaned up and reformatted in no time.
*You may find in your work, however, that the results are in yet another format the following year, followed by a different format the year after that. Data-related tasks are always made easier when inputs stay the same, but in the real world, you’ll often need to rework datasets in order to make them compatible with pre-existing processes.
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns = 7)
import os
import pandas as pd
from sqlalchemy import create_engine
Our first step in preparing our winter survey results will be to import the 16 files that comprise them into a DataFrame. We’ll start this process by calling os.listdir()
to create a list of all files within our winter_results folder:
file_list = os.listdir('winter_results')
file_list[0:5]
['STM_Ju_results.csv',
'STB_Se_results.csv',
'STM_Se_results.csv',
'STC_So_results.csv',
'STB_Ju_results.csv']
Note that each file represents a single college/level pair.
(Note: as an American university, NVCU has four ‘levels’: freshman, sophomore, junior, and senior. These four terms (which will often be truncated to ‘Fr’, ‘So’, ‘Ju’, and ‘Se’ within our datasets and tables) correspond to the first, second, third, and fourth years of enrollment at the university.
‘Year’ is a more common term than ‘level’ to refer to these four names, but it could easily get confused with school year values in our datasets. Therefore, I’ll normally use ‘level’ to represent these terms within PFN.)
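One caveat with os.listdir(): it returns filenames in an arbitrary, OS-dependent order and will also pick up any non-CSV files that happen to land in the folder. If you wanted a more defensive version of this step, a sketch along these lines (an optional hardening, not part of the original workflow) restricts the match to the expected filename pattern and sorts the results:

```python
from glob import glob

# Restricting the match to *_results.csv files guards against stray
# files within the folder, and sorted() makes the import order
# deterministic across operating systems:
csv_paths = sorted(glob('winter_results/*_results.csv'))
```

Note that glob() returns relative paths (e.g. 'winter_results/STM_Ju_results.csv'), so you could pass these directly to pd.read_csv() without building the path yourself.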
Next, we’ll use a for loop to (1) read each file within this list into a DataFrame, then (2) append each of those DataFrames to a list. We’ll then apply pd.concat()
to combine these file-specific DataFrames into a single DataFrame.
df_list = []
for file in file_list:
df = pd.read_csv(f'winter_results/{file}')
df_list.append(df)
# Combining these individual DataFrames into a single file:
df_winter_results = pd.concat(
[df for df in df_list]) # df for df in df_list is a list comprehension
# that contains all DataFrames in df_list.
# Replacing the original indices with a new, unified index:
df_winter_results.reset_index(drop=True, inplace=True)
df_winter_results.head()
MATRIC# | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|
0 | 1655 | W | 53.0% | 23 | 21 |
1 | 1294 | W | 73.0% | 23 | 21 |
2 | 875 | W | 67.0% | 23 | 21 |
3 | 2571 | W | 61.0% | 23 | 21 |
4 | 547 | W | 76.0% | 23 | 21 |
The following cell shows a more concise means of creating the same DataFrame. Although this approach requires fewer lines of code, it’s also less flexible (as the former method allows you to make individual updates to each DataFrame if needed).
df_winter_results = pd.concat(
[pd.read_csv(f'winter_results/{file}')
for file in os.listdir('winter_results')]).reset_index(drop=True)
df_winter_results.head()
MATRIC# | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|
0 | 1655 | W | 53.0% | 23 | 21 |
1 | 1294 | W | 73.0% | 23 | 21 |
2 | 875 | W | 67.0% | 23 | 21 |
3 | 2571 | W | 61.0% | 23 | 21 |
4 | 547 | W | 76.0% | 23 | 21 |
Reformatting and cleaning our winter results dataset#
Our next step is to combine these winter survey results with the fall and spring results in our NVCU database. Here’s what those results look like:
# Connecting to our database:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
df_fall_spring_results = pd.read_sql(
"Select * from survey_results", con=e)
df_fall_spring_results.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
If we naively tried to add our winter results to our fall/spring results, we’d end up with a very messy DataFrame with numerous blank (i.e. NaN
) cells:
df_messy_combo = pd.concat([df_fall_spring_results,
df_winter_results])
Compare the beginning of this DataFrame (which has fall/spring results) with the end (which has the new winter results we just imported):
df_messy_combo.head()
student_id | starting_year | season | ... | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|---|---|
0 | 2020-1 | 2023.0 | Fall | ... | NaN | NaN | NaN |
1 | 2020-2 | 2023.0 | Fall | ... | NaN | NaN | NaN |
2 | 2020-3 | 2023.0 | Fall | ... | NaN | NaN | NaN |
3 | 2020-4 | 2023.0 | Fall | ... | NaN | NaN | NaN |
4 | 2020-5 | 2023.0 | Fall | ... | NaN | NaN | NaN |
5 rows × 9 columns
df_messy_combo.tail()
student_id | starting_year | season | ... | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|---|---|
14617 | NaN | NaN | NaN | ... | 68.0% | 23.0 | 23.0 |
14618 | NaN | NaN | NaN | ... | 61.0% | 23.0 | 23.0 |
14619 | NaN | NaN | NaN | ... | 66.0% | 23.0 | 23.0 |
14620 | NaN | NaN | NaN | ... | 67.0% | 23.0 | 23.0 |
14621 | NaN | NaN | NaN | ... | 57.0% | 23.0 | 23.0 |
5 rows × 9 columns
This unkempt output is caused by discrepancies in column names between the two tables. To rectify this issue, we’ll need to rename our winter results fields to match their corresponding fields within the fall/spring table. Thankfully, Pandas makes this process very straightforward:
df_winter_results.rename(columns = {
'SEASON':'season','STARTINGYR':'starting_year',
'SURVEY_SCORE':'score'}, inplace=True)
df_winter_results.head()
MATRIC# | season | score | starting_year | MATRICYR | |
---|---|---|---|---|---|
0 | 1655 | W | 53.0% | 23 | 21 |
1 | 1294 | W | 73.0% | 23 | 21 |
2 | 875 | W | 67.0% | 23 | 21 |
3 | 2571 | W | 61.0% | 23 | 21 |
4 | 547 | W | 76.0% | 23 | 21 |
We’ll also need to convert our MATRIC#
and MATRICYR
fields into a single student_id
field. (This student_id field simply combines students’ matriculation years with their matriculation numbers; see nvcu_db_gen.ipynb within the Appendix for more details.) This can be done as follows:
df_winter_results['MATRICYR'] += 2000 # Converts our MATRICYR
# values from YY to YYYY format so that they'll match the format of the
# matriculation year component of the student_id values within
# df_fall_spring_results
# Converting students' MATRICYR and MATRIC# values into student IDs (which
# are based on these two values):
# (Note that both columns must be converted to strings in order for
# this code to work. If we just added the two integers together, we'd
# get their sum--which isn't what we want in this case.)
df_winter_results['student_id'] = (
df_winter_results['MATRICYR'].astype('str')
+ '-'
+ df_winter_results['MATRIC#'].astype('str'))
# Now that we've used our MATRICYR and MATRIC# columns to create
# our student IDs, we no longer need to retain them:
df_winter_results.drop(
['MATRICYR', 'MATRIC#'],
axis=1, inplace=True)
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | W | 53.0% | 23 | 2021-1655 |
1 | W | 73.0% | 23 | 2021-1294 |
2 | W | 67.0% | 23 | 2021-875 |
3 | W | 61.0% | 23 | 2021-2571 |
4 | W | 76.0% | 23 | 2021-547 |
The columns in df_winter_results
now match those within df_fall_spring_results
. That’s great! Let’s try combining the two datasets to see if we’re ready to calculate average survey scores for each season:
df_results = pd.concat([df_fall_spring_results,
df_winter_results])
df_results.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
Note the differences between the previous and following outputs (which show fall and winter data, respectively):
df_winter_results.tail()
season | score | starting_year | student_id | |
---|---|---|---|---|
14617 | W | 68.0% | 23 | 2023-5278 |
14618 | W | 61.0% | 23 | 2023-3542 |
14619 | W | 66.0% | 23 | 2023-78 |
14620 | W | 67.0% | 23 | 2023-2505 |
14621 | W | 57.0% | 23 | 2023-1312 |
These excerpts show that, unfortunately, we’re not quite ready to analyze this data just yet: there are several formatting differences that we’ll need to address.
For instance, the ‘score’ column within df_fall_spring_results
uses an integer format, whereas these same numbers are formatted as strings within df_winter_results
. This will produce errors when we attempt to perform numerical calculations on this field:
# df_results['score'].mean()
This line was commented out because it would produce the following TypeError
:
unsupported operand type(s) for +: 'int' and 'str'
The following cell resolves this issue by converting our string-formatted score values to integers:
df_winter_results['score'] = df_winter_results[
'score'].str.replace('.0%','').astype('int') # Chaining these
# column-specific operations together helps reduce the number of
# times we need to write the column name--which can help prevent
# errors related to incorrect field names.
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | W | 53 | 23 | 2021-1655 |
1 | W | 73 | 23 | 2021-1294 |
2 | W | 67 | 23 | 2021-875 |
3 | W | 61 | 23 | 2021-2571 |
4 | W | 76 | 23 | 2021-547 |
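Note that the str.replace() call above assumes every score ends in exactly ‘.0%’. If future files might contain other decimal values or stray non-numeric entries, a more defensive sketch (an optional hardening, not the approach used in this script) strips the ‘%’ sign and lets pd.to_numeric() handle the rest:

```python
import pandas as pd

# Sample strings illustrating values that a literal '.0%'
# replacement wouldn't handle ('67.5%' and 'N/A'):
scores = pd.Series(['53.0%', '73.0%', '67.5%', 'N/A'])

# errors='coerce' converts anything that still isn't numeric into
# NaN instead of raising an exception:
numeric = pd.to_numeric(scores.str.rstrip('%'), errors='coerce')
```

This version produces floats rather than integers; once you’ve checked for NaN values, you could convert the column to an integer type if needed.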
We’ll also need to reformat our winter results’ season
and starting_year
values so that they match the formats found in the fall/spring table.
The following cell replaces the ‘W’ values within the ‘season’ column with ‘Winter’ so that they’ll better match the ‘Fall’ and ‘Spring’ values in df_fall_spring_results
:
df_winter_results['season'] = (
df_winter_results['season'].replace({'W':'Winter'}))
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | Winter | 53 | 23 | 2021-1655 |
1 | Winter | 73 | 23 | 2021-1294 |
2 | Winter | 67 | 23 | 2021-875 |
3 | Winter | 61 | 23 | 2021-2571 |
4 | Winter | 76 | 23 | 2021-547 |
The following code would also have worked; however, it assumes that every row within the DataFrame is indeed a winter result. This is the case in our simulated data, but in the real world, some data from other seasons might have leaked in, causing this code to incorrectly reclassify certain results.
# df_winter_results['season'] = 'Winter'
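In that same defensive spirit, you could validate the season codes before replacing them, so that any unexpected values halt the script rather than slipping through. The sketch below uses a stand-in Series rather than df_winter_results['season'] itself:

```python
import pandas as pd

# A stand-in for df_winter_results['season']:
seasons = pd.Series(['W', 'W', 'W'])

# Fail loudly if any season code other than 'W' appears in the data:
unexpected = set(seasons.unique()) - {'W'}
assert not unexpected, f"Unexpected season codes: {unexpected}"

seasons = seasons.replace({'W': 'Winter'})
```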
Finally, we’ll add 2000 to every starting_year value so that our years will appear in YYYY format, just as they do within our fall/spring dataset.
df_winter_results['starting_year'] += 2000
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | Winter | 53 | 2023 | 2021-1655 |
1 | Winter | 73 | 2023 | 2021-1294 |
2 | Winter | 67 | 2023 | 2021-875 |
3 | Winter | 61 | 2023 | 2021-2571 |
4 | Winter | 76 | 2023 | 2021-547 |
Removing duplicates#
We’ve now successfully made our winter dataset’s field names and values compatible with those in our fall/spring dataset. However, before we can combine the two together, we’ll need to remove some duplicate results. (The NVCU administration has specified that only one survey score per student/season pair should be retained; otherwise, students who fill out a given survey multiple times would get weighted more heavily than those who complete it only once.)
Let’s first identify these duplicate rows. The following code filters df_winter_results
to include any rows whose season
, starting_year
, and student_id
columns match. (The inclusion of keep=False
instructs Pandas to return all copies of a duplicated row, not just the first one that it encounters.)
df_winter_results[df_winter_results.duplicated(
subset = ['season', 'starting_year', 'student_id'],
keep=False)]
season | score | starting_year | student_id | |
---|---|---|---|---|
24 | Winter | 56 | 2023 | 2021-2639 |
31 | Winter | 75 | 2023 | 2021-825 |
... | ... | ... | ... | ... |
14620 | Winter | 67 | 2023 | 2023-2505 |
14621 | Winter | 57 | 2023 | 2023-1312 |
1392 rows × 4 columns
These duplicate values can easily be removed using Pandas’ drop_duplicates() function. However, before removing duplicate rows, it’s a good idea to first consider which of these duplicated entries to retain. (For instance, you might choose to keep the earliest copy of a set of duplicated surveys; the latest copy; or the one with the highest/lowest score.)
In our case, we’ll keep the duplicated row with the highest score and remove all others. We can do this by (1) sorting our DataFrame to show higher scores before lower ones, then (2) keeping the first row (i.e. the one with the highest score) when removing our duplicates.
df_winter_results.sort_values(
'score', ascending=False, inplace=True)
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
2310 | Winter | 99 | 2023 | 2020-616 |
1511 | Winter | 98 | 2023 | 2020-35 |
8147 | Winter | 97 | 2023 | 2020-816 |
7804 | Winter | 97 | 2023 | 2020-395 |
523 | Winter | 97 | 2023 | 2021-468 |
Removing duplicate values:
Note: when removing duplicates, think carefully about which columns to include in your subset
argument. For instance, if we had multiple years’ worth of data in our table, using ['season', 'student_id']
as your subset would cause only one result for each student/season pair to get retained, thus removing valid data for other years from your table.
df_winter_results.drop_duplicates(
subset = ['season', 'starting_year', 'student_id'],
keep = 'first', inplace = True)
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
2310 | Winter | 99 | 2023 | 2020-616 |
1511 | Winter | 98 | 2023 | 2020-35 |
8147 | Winter | 97 | 2023 | 2020-816 |
7804 | Winter | 97 | 2023 | 2020-395 |
523 | Winter | 97 | 2023 | 2021-468 |
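As an aside, another common pattern for keeping the highest-scoring row per key combines groupby() with idxmax(). The sketch below applies it to toy data rather than our actual survey table; with tied scores, idxmax() keeps the first matching row.

```python
import pandas as pd

df = pd.DataFrame({
    'student_id': ['2021-1', '2021-1', '2021-2'],
    'season': ['Winter'] * 3,
    'score': [60, 75, 80]})

# idxmax() returns the row label of the highest score within each
# student/season group; .loc then keeps just those rows.
best = df.loc[df.groupby(['student_id', 'season'])['score'].idxmax()]
```

One advantage of this approach is that it leaves the rest of the DataFrame’s row order untouched, whereas the sort-then-drop method reorders the whole table.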
Rerunning our duplicate check code confirms that no duplicate entries remain within our dataset:
df_duplicated_winter_results = (
df_winter_results[df_winter_results.duplicated(
subset = ['season', 'starting_year', 'student_id'],
keep = False)])
len(df_duplicated_winter_results)
0
Since we may need to run this code again in the future with new sets of winter data, we should include a block of code at this point that will check whether we have any duplicates, then raise an error if so. (This will prevent the script from continuing on and possibly generating incorrect findings.)
if len(df_duplicated_winter_results) > 0:
raise ValueError("Duplicate winter survey results are still present!")
else:
print("No duplicate entries are present within the winter dataset.")
No duplicate entries are present within the winter dataset.
Combining winter survey results with our fall/spring dataset#
We’re now finally ready to combine df_winter_results with df_fall_spring_results. However, one final issue remains with this table: winter survey results are missing for a number of students. This won’t cause any issues with the following code, but we’ll need to take these missing entries into account when analyzing our survey data within the upcoming Descriptive Stats chapter.
df_results = pd.concat(
[df_fall_spring_results,
df_winter_results]).sort_values(
['starting_year', 'season']).reset_index(drop=True)
df_results.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
The following output confirms that the format of the winter results within this table matches that of the fall/spring rows:
df_results.query("season == 'Winter'").head()
student_id | starting_year | season | score | |
---|---|---|---|---|
32768 | 2020-616 | 2023 | Winter | 99 |
32769 | 2020-35 | 2023 | Winter | 98 |
32770 | 2020-816 | 2023 | Winter | 97 |
32771 | 2020-395 | 2023 | Winter | 97 |
32772 | 2021-468 | 2023 | Winter | 97 |
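If you wanted to quantify those missing winter entries, one option is a simple set comparison between seasons. The sketch below uses a toy DataFrame rather than df_results, but the same pattern would apply to the real table:

```python
import pandas as pd

df = pd.DataFrame({
    'student_id': ['2020-1', '2020-2', '2020-1'],
    'season': ['Fall', 'Fall', 'Winter']})

# Students who have a fall result but no winter result:
fall_ids = set(df.loc[df['season'] == 'Fall', 'student_id'])
winter_ids = set(df.loc[df['season'] == 'Winter', 'student_id'])
missing_winter = fall_ids - winter_ids
```

Here, missing_winter would contain '2020-2', the one student with a fall score but no winter score.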
Note that, when our seasons are sorted alphabetically, Fall will come first, followed by Spring and then Winter. In order to allow for a chronological sort (which will prove useful when creating charts and pivot tables), we’ll add in a ‘season_order’ column that maps these seasons to integers.
df_results['season_order'] = df_results['season'].map(
{'Fall':0,'Winter':1,'Spring':2})
df_results.head()
student_id | starting_year | season | score | season_order | |
---|---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 | 0 |
1 | 2020-2 | 2023 | Fall | 37 | 0 |
2 | 2020-3 | 2023 | Fall | 54 | 0 |
3 | 2020-4 | 2023 | Fall | 56 | 0 |
4 | 2020-5 | 2023 | Fall | 77 | 0 |
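As an alternative to a numeric helper column, pandas also supports ordered categoricals, which sort chronologically on their own. A brief sketch on standalone data:

```python
import pandas as pd

seasons = pd.Series(['Fall', 'Winter', 'Spring', 'Fall'])

# An ordered categorical tells pandas the sort order directly, so
# sort_values() arranges seasons chronologically rather than
# alphabetically:
season_cat = pd.Series(pd.Categorical(
    seasons, categories=['Fall', 'Winter', 'Spring'], ordered=True))
sorted_seasons = season_cat.sort_values()
```

One reason a season_order column may still be preferable for this workflow: categorical dtypes don’t survive a round trip through .to_csv(), whereas an integer column does.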
Merging college and level data into our survey results table#
In its current form, df_results
allows us to calculate survey results at the university-wide level. However, in order to determine whether these results differ by college and by level, we’ll also need to merge college and level data into our dataset.
We can accomplish this by reading data from our curr_enrollment SQL table into a DataFrame, then merging that data with df_results. (The student_id field present in both DataFrames can serve as a merge key.)
Note: if we had survey data for multiple years, our current enrollment table wouldn’t be a good candidate for our merge, since it would show only the current college and levels for each student. We’d instead want to source our enrollment data from a historical enrollment table so that students’ yearly survey results could be matched with their levels and colleges during the years that they took the survey.
# The following code reads in only the fields from df_curr_enrollment
# that we'll need for our analyses. (If we wanted to read in all fields,
# we could replace the field names in our SQL query with *.)
df_curr_enrollment = pd.read_sql(
"Select student_id, college, level, \
level_for_sorting from curr_enrollment", con=e)
df_curr_enrollment.head()
student_id | college | level | level_for_sorting | |
---|---|---|---|---|
0 | 2020-1 | STC | Se | 3 |
1 | 2020-2 | STM | Se | 3 |
2 | 2020-3 | STC | Se | 3 |
3 | 2020-4 | STC | Se | 3 |
4 | 2020-5 | STM | Se | 3 |
The next cell uses a left merge to add enrollment data into df_results. The left
argument for the how
parameter instructs the code to retain all rows in df_results even if no corresponding enrollment data were found. (If we had instead used right
as our argument, all rows in df_curr_enrollment
would have been retained whether or not we had corresponding survey data for them.)
Other options for the how
parameter include outer
, which preserves all rows in both datasets regardless of whether or not a given key was found in both of them, and inner
, which would only keep rows whose student IDs were found in both datasets.
For more on the df.merge()
function used below, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html .
df_results = df_results.merge(
df_curr_enrollment, on='student_id', how='left')
df_results.head()
student_id | starting_year | season | ... | college | level | level_for_sorting | |
---|---|---|---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | ... | STC | Se | 3 |
1 | 2020-2 | 2023 | Fall | ... | STM | Se | 3 |
2 | 2020-3 | 2023 | Fall | ... | STC | Se | 3 |
3 | 2020-4 | 2023 | Fall | ... | STC | Se | 3 |
4 | 2020-5 | 2023 | Fall | ... | STM | Se | 3 |
5 rows × 8 columns
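When auditing a left merge like this one, the indicator parameter of df.merge() can come in handy: it adds a '_merge' column recording whether each row matched. The sketch below demonstrates the idea on toy data rather than our actual tables:

```python
import pandas as pd

results = pd.DataFrame({'student_id': ['2020-1', '2020-2'],
                        'score': [88, 37]})
enrollment = pd.DataFrame({'student_id': ['2020-1'],
                           'college': ['STC']})

# indicator=True adds a '_merge' column showing whether each row
# matched ('both') or came only from the left table ('left_only'):
merged = results.merge(enrollment, on='student_id',
                       how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
```

df.merge() also accepts a validate argument (e.g. validate='many_to_one') that raises an error if the merge keys don’t have the relationship you expect, which can catch accidental duplicates in the enrollment table.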
Now that we’ve merged in college and level data, we’ll save this dataset to a .csv file so that it can be processed by the code in our Descriptive Stats chapter:
df_results.to_csv('2023_survey_results.csv', index = False)
Conclusion#
This script has provided a brief introduction to data cleaning and reformatting. Other PFN sections will provide further examples of data reformatting, as this process is often a necessary prerequisite for analysis and visualization tasks.
Next up is the Descriptive Stats chapter, in which we’ll perform various statistical tests on the new .csv file that we created within this section.