Data Prep#
By Kenneth Burchfiel
Released under the MIT License
Before a dataset can be analyzed and visualized within Python, it often needs to be reformatted and cleaned. In order to demonstrate how Python can easily perform these reformatting and cleaning tasks, this script will clean and reformat a set of NVCU winter student survey results; combine those results with fall and spring survey data; and merge in information from our current enrollment table.
Our survey_results database file (which we’ll import from PFN’s Appendix folder) already contains student survey responses for the fall and spring. However, let’s say that you’ve been asked to add a set of winter results to this dataset as well, then calculate a weighted average of fall, winter, and spring survey results for each student.
If these results were in the same format as the fall and spring ones and had no missing data, this process would be very simple. Unfortunately, that’s not the case with the winter results that we’ll be processing within this script. These results feature:
Column names that differ from those in the fall/spring results
Different data formats
A missing column
Duplicate values
Missing values for certain students
In addition, to make matters even more complex, these winter results are spread out over 16 different files (one for each level within each college).
It would be cumbersome and mind-numbing to modify each of these 16 datasets within Excel, Google Sheets, or a similar program so that they could be combined with our pre-existing fall and spring data. However, the Python code shown below will make this data cleaning process much easier. And once this script is in place, if you happened to get next year’s winter results in the same format* as this year’s, you’d be able to get them cleaned up and reformatted in no time.
*You may find in your work, however, that the results are in yet another format the following year, followed by a different format the year after that. Data-related tasks are always made easier when inputs stay the same, but in the real world, you’ll often need to rework datasets in order to make them compatible with pre-existing processes.
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns = 7)
import os
import pandas as pd
from sqlalchemy import create_engine
Our first step in preparing our winter survey results will be to import the 16 files that comprise them into a DataFrame. We’ll start this process by calling os.listdir()
to create a list of all files within our winter_results folder:
file_list = os.listdir('winter_results')
file_list[0:5]
['STM_Ju_results.csv',
'STB_Se_results.csv',
'STM_Se_results.csv',
'STC_So_results.csv',
'STB_Ju_results.csv']
Note that each file represents a single college/level pair.
(Note: as an American university, NVCU has four ‘levels’: freshman, sophomore, junior, and senior. These four terms (which will often be truncated to ‘Fr’, ‘So’, ‘Ju’, and ‘Se’ within our datasets and tables) correspond to the first, second, third, and fourth years of enrollment at the university.
‘Year’ is a more common term than ‘level’ to refer to these four names, but it could easily get confused with school year values in our datasets. Therefore, I’ll normally use ‘level’ to represent these terms within PFN.)
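One caveat with os.listdir(): it returns filenames in an arbitrary, OS-dependent order and will also pick up any non-CSV files that happen to land in the folder. If you wanted a more defensive version of this step, a sketch along these lines (an optional hardening, not part of the original workflow) restricts the match to the expected filename pattern and sorts the results:

```python
from glob import glob

# Restricting the match to *_results.csv files guards against stray
# files within the folder, and sorted() makes the import order
# deterministic across operating systems:
csv_paths = sorted(glob('winter_results/*_results.csv'))
```

Note that glob() returns relative paths (e.g. 'winter_results/STM_Ju_results.csv'), so you could pass these directly to pd.read_csv() without building the path yourself.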
Next, we’ll use a for loop to (1) read each file within this list into a DataFrame, then (2) append each of those DataFrames to a list. We’ll then apply pd.concat()
to combine these file-specific DataFrames into a single DataFrame.
df_list = []
for file in file_list:
df = pd.read_csv(f'winter_results/{file}')
df_list.append(df)
# Combining these individual DataFrames into a single file:
df_winter_results = pd.concat(
[df for df in df_list]) # df for df in df_list is a list comprehension
# that contains all DataFrames in df_list.
# Replacing the original indices with a new, unified index:
df_winter_results.reset_index(drop=True, inplace=True)
df_winter_results.head()
MATRIC# | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|
0 | 1655 | W | 53.0% | 23 | 21 |
1 | 1294 | W | 73.0% | 23 | 21 |
2 | 875 | W | 67.0% | 23 | 21 |
3 | 2571 | W | 61.0% | 23 | 21 |
4 | 547 | W | 76.0% | 23 | 21 |
The following cell shows a more concise means of creating the same DataFrame. Although this approach requires fewer lines of code, it’s also less flexible (as the former method allows you to make individual updates to each DataFrame if needed).
df_winter_results = pd.concat(
[pd.read_csv(f'winter_results/{file}')
for file in os.listdir('winter_results')]).reset_index(drop=True)
df_winter_results.head()
MATRIC# | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|
0 | 1655 | W | 53.0% | 23 | 21 |
1 | 1294 | W | 73.0% | 23 | 21 |
2 | 875 | W | 67.0% | 23 | 21 |
3 | 2571 | W | 61.0% | 23 | 21 |
4 | 547 | W | 76.0% | 23 | 21 |
Reformatting and cleaning our winter results dataset#
Our next step is to combine these winter survey results with the fall and spring results in our NVCU database. Here’s what those results look like:
# Connecting to our database:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
df_fall_spring_results = pd.read_sql(
"Select * from survey_results", con=e)
df_fall_spring_results.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
If we naively tried to add our winter results to our fall/spring results, we’d end up with a very messy DataFrame with numerous blank (i.e. NaN
) cells:
df_messy_combo = pd.concat([df_fall_spring_results,
df_winter_results])
Compare the beginning of this DataFrame (which has fall/spring results) with the end (which has the new winter results we just imported):
df_messy_combo.head()
student_id | starting_year | season | ... | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|---|---|
0 | 2020-1 | 2023.0 | Fall | ... | NaN | NaN | NaN |
1 | 2020-2 | 2023.0 | Fall | ... | NaN | NaN | NaN |
2 | 2020-3 | 2023.0 | Fall | ... | NaN | NaN | NaN |
3 | 2020-4 | 2023.0 | Fall | ... | NaN | NaN | NaN |
4 | 2020-5 | 2023.0 | Fall | ... | NaN | NaN | NaN |
5 rows × 9 columns
df_messy_combo.tail()
student_id | starting_year | season | ... | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|---|---|
14617 | NaN | NaN | NaN | ... | 68.0% | 23.0 | 23.0 |
14618 | NaN | NaN | NaN | ... | 61.0% | 23.0 | 23.0 |
14619 | NaN | NaN | NaN | ... | 66.0% | 23.0 | 23.0 |
14620 | NaN | NaN | NaN | ... | 67.0% | 23.0 | 23.0 |
14621 | NaN | NaN | NaN | ... | 57.0% | 23.0 | 23.0 |
5 rows × 9 columns
This unkempt output is caused by discrepancies in column names between the two tables. To rectify this issue, we’ll need to rename our winter results fields to match their corresponding fields within the fall/spring table. Thankfully, Pandas makes this process very straightforward:
df_winter_results.rename(columns = {
'SEASON':'season','STARTINGYR':'starting_year',
'SURVEY_SCORE':'score'}, inplace=True)
df_winter_results.head()
MATRIC# | season | score | starting_year | MATRICYR | |
---|---|---|---|---|---|
0 | 1655 | W | 53.0% | 23 | 21 |
1 | 1294 | W | 73.0% | 23 | 21 |
2 | 875 | W | 67.0% | 23 | 21 |
3 | 2571 | W | 61.0% | 23 | 21 |
4 | 547 | W | 76.0% | 23 | 21 |
We’ll also need to convert our MATRIC#
and MATRICYR
fields into a single student_id
field. (This student_id field simply combines students’ matriculation years with their matriculation numbers; see nvcu_db_gen.ipynb within the Appendix for more details.) This can be done as follows:
df_winter_results['MATRICYR'] += 2000 # Converts our MATRICYR
# values from YY to YYYY format so that they'll match the format of the
# matriculation year component of the student_id values within
# df_fall_spring_results
# Converting students' MATRICYR and MATRIC# values into student IDs (which
# are based on these two values):
# (Note that both columns must be converted to strings in order for
# this code to work. If we just added the two integers together, we'd
# get their sum--which isn't what we want in this case.)
df_winter_results['student_id'] = (
df_winter_results['MATRICYR'].astype('str')
+ '-'
+ df_winter_results['MATRIC#'].astype('str'))
# Now that we've used our MATRICYR and MATRIC# columns to create
# our student IDs, we no longer need to retain them:
df_winter_results.drop(
['MATRICYR', 'MATRIC#'],
axis=1, inplace=True)
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | W | 53.0% | 23 | 2021-1655 |
1 | W | 73.0% | 23 | 2021-1294 |
2 | W | 67.0% | 23 | 2021-875 |
3 | W | 61.0% | 23 | 2021-2571 |
4 | W | 76.0% | 23 | 2021-547 |
The columns in df_winter_results
now match those within df_fall_spring_results
. That’s great! Let’s try combining the two datasets to see if we’re ready to calculate average survey scores for each season:
df_results = pd.concat([df_fall_spring_results,
df_winter_results])
df_results.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
Note the differences between the previous and following outputs (which show fall and winter data, respectively):
df_winter_results.tail()
season | score | starting_year | student_id | |
---|---|---|---|---|
14617 | W | 68.0% | 23 | 2023-5278 |
14618 | W | 61.0% | 23 | 2023-3542 |
14619 | W | 66.0% | 23 | 2023-78 |
14620 | W | 67.0% | 23 | 2023-2505 |
14621 | W | 57.0% | 23 | 2023-1312 |
These excerpts show that, unfortunately, we’re not quite ready to analyze this data just yet: there are several formatting differences that we’ll need to address.
For instance, the ‘score’ column within df_fall_spring_results
uses an integer format, whereas these same numbers are formatted as strings within df_winter_results
. This will produce errors when we attempt to perform numerical calculations on this field:
# df_results['score'].mean()
This line was commented out because it would produce the following TypeError
:
unsupported operand type(s) for +: 'int' and 'str'
The following cell resolves this issue by converting our string-formatted score values to integers:
df_winter_results['score'] = df_winter_results[
'score'].str.replace('.0%','').astype('int') # Chaining these
# column-specific operations together helps reduce the number of
# times we need to write the column name--which can help prevent
# errors related to incorrect field names.
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | W | 53 | 23 | 2021-1655 |
1 | W | 73 | 23 | 2021-1294 |
2 | W | 67 | 23 | 2021-875 |
3 | W | 61 | 23 | 2021-2571 |
4 | W | 76 | 23 | 2021-547 |
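Note that the str.replace() call above assumes every score ends in exactly ‘.0%’. If future files might contain other decimal values or stray non-numeric entries, a more defensive sketch (an optional hardening, not the approach used in this script) strips the ‘%’ sign and lets pd.to_numeric() handle the rest:

```python
import pandas as pd

# Sample strings illustrating values that a literal '.0%'
# replacement wouldn't handle ('67.5%' and 'N/A'):
scores = pd.Series(['53.0%', '73.0%', '67.5%', 'N/A'])

# errors='coerce' converts anything that still isn't numeric into
# NaN instead of raising an exception:
numeric = pd.to_numeric(scores.str.rstrip('%'), errors='coerce')
```

This version produces floats rather than integers; once you’ve checked for NaN values, you could convert the column to an integer type if needed.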
We’ll also need to reformat our winter results’ season
and starting_year
values so that they match the formats found in the fall/spring table.
The following cell replaces the ‘W’ values within the ‘season’ column with ‘Winter’ so that they’ll better match the ‘Fall’ and ‘Spring’ values in df_fall_spring_results
:
df_winter_results['season'] = (
df_winter_results['season'].replace({'W':'Winter'}))
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | Winter | 53 | 23 | 2021-1655 |
1 | Winter | 73 | 23 | 2021-1294 |
2 | Winter | 67 | 23 | 2021-875 |
3 | Winter | 61 | 23 | 2021-2571 |
4 | Winter | 76 | 23 | 2021-547 |
The following code would also have worked; however, it assumes that every row within the DataFrame is indeed a winter result. This is the case in our simulated data, but in the real world, some data from other seasons might have leaked in, causing this code to incorrectly reclassify certain results.
# df_winter_results['season'] = 'Winter'
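In that same defensive spirit, you could validate the season codes before replacing them, so that any unexpected values halt the script rather than slipping through. The sketch below uses a stand-in Series rather than df_winter_results['season'] itself:

```python
import pandas as pd

# A stand-in for df_winter_results['season']:
seasons = pd.Series(['W', 'W', 'W'])

# Fail loudly if any season code other than 'W' appears in the data:
unexpected = set(seasons.unique()) - {'W'}
assert not unexpected, f"Unexpected season codes: {unexpected}"

seasons = seasons.replace({'W': 'Winter'})
```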
Finally, we’ll add 2000 to every starting_year value so that our years will appear in YYYY format, just as they do within our fall/spring dataset.
df_winter_results['starting_year'] += 2000
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
0 | Winter | 53 | 2023 | 2021-1655 |
1 | Winter | 73 | 2023 | 2021-1294 |
2 | Winter | 67 | 2023 | 2021-875 |
3 | Winter | 61 | 2023 | 2021-2571 |
4 | Winter | 76 | 2023 | 2021-547 |
Removing duplicates#
We’ve now successfully made our winter dataset’s field names and values compatible with those in our fall/spring dataset. However, before we can combine the two together, we’ll need to remove some duplicate results. (The NVCU administration has specified that only one survey score per student/season pair should be retained; otherwise, students who fill out a given survey multiple times would get weighted more heavily than those who complete it only once.)
Let’s first identify these duplicate rows. The following code filters df_winter_results
to include any rows whose season
, starting_year
, and student_id
columns match. (The inclusion of keep=False
instructs Pandas to return all copies of a duplicated row, not just the first one that it encounters.)
df_winter_results[df_winter_results.duplicated(
subset = ['season', 'starting_year', 'student_id'],
keep=False)]
season | score | starting_year | student_id | |
---|---|---|---|---|
24 | Winter | 56 | 2023 | 2021-2639 |
31 | Winter | 75 | 2023 | 2021-825 |
... | ... | ... | ... | ... |
14620 | Winter | 67 | 2023 | 2023-2505 |
14621 | Winter | 57 | 2023 | 2023-1312 |
1392 rows × 4 columns
These duplicate values can easily be removed using Pandas’ drop_duplicates() function. However, before removing duplicate rows, it’s a good idea to first consider which of these duplicated entries to retain. (For instance, you might choose to keep the earliest copy of a set of duplicated surveys; the latest copy; or the one with the highest/lowest score.)
In our case, we’ll keep the duplicated row with the highest score and remove all others. We can do this by (1) sorting our DataFrame to show higher scores before lower ones, then (2) keeping the first row (i.e. the one with the highest score) when removing our duplicates.
df_winter_results.sort_values(
'score', ascending=False, inplace=True)
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
2310 | Winter | 99 | 2023 | 2020-616 |
1511 | Winter | 98 | 2023 | 2020-35 |
8147 | Winter | 97 | 2023 | 2020-816 |
7804 | Winter | 97 | 2023 | 2020-395 |
523 | Winter | 97 | 2023 | 2021-468 |
Removing duplicate values:
Note: when removing duplicates, think carefully about which columns to include in your subset
argument. For instance, if we had multiple years’ worth of data in our table, using ['season', 'student_id']
as your subset would cause only one result for each student/season pair to get retained, thus removing valid data for other years from your table.
df_winter_results.drop_duplicates(
subset = ['season', 'starting_year', 'student_id'],
keep = 'first', inplace = True)
df_winter_results.head()
season | score | starting_year | student_id | |
---|---|---|---|---|
2310 | Winter | 99 | 2023 | 2020-616 |
1511 | Winter | 98 | 2023 | 2020-35 |
8147 | Winter | 97 | 2023 | 2020-816 |
7804 | Winter | 97 | 2023 | 2020-395 |
523 | Winter | 97 | 2023 | 2021-468 |
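As an aside, another common pattern for keeping the highest-scoring row per key combines groupby() with idxmax(). The sketch below applies it to toy data rather than our actual survey table; with tied scores, idxmax() keeps the first matching row.

```python
import pandas as pd

df = pd.DataFrame({
    'student_id': ['2021-1', '2021-1', '2021-2'],
    'season': ['Winter'] * 3,
    'score': [60, 75, 80]})

# idxmax() returns the row label of the highest score within each
# student/season group; .loc then keeps just those rows.
best = df.loc[df.groupby(['student_id', 'season'])['score'].idxmax()]
```

One advantage of this approach is that it leaves the rest of the DataFrame’s row order untouched, whereas the sort-then-drop method reorders the whole table.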
Rerunning our duplicate check code confirms that no duplicate entries remain within our dataset:
df_duplicated_winter_results = (
df_winter_results[df_winter_results.duplicated(
subset = ['season', 'starting_year', 'student_id'],
keep = False)])
len(df_duplicated_winter_results)
0
Since we may need to run this code again in the future with new sets of winter data, we should include a block of code at this point that will check whether we have any duplicates, then raise an error if so. (This will prevent the script from continuing on and possibly generating incorrect findings.)
if len(df_duplicated_winter_results) > 0:
raise ValueError("Duplicate winter survey results are still present!")
else:
print("No duplicate entries are present within the winter dataset.")
No duplicate entries are present within the winter dataset.
Combining winter survey results with our fall/spring dataset#
We’re now finally ready to combine df_winter_results with df_fall_spring_results. However, one final issue remains with this table: winter survey results are missing for a number of students. This won’t cause any issues with the following code, but we’ll need to take these missing entries into account when analyzing our survey data within the upcoming Descriptive Stats chapter.
df_results = pd.concat(
[df_fall_spring_results,
df_winter_results]).sort_values(
['starting_year', 'season']).reset_index(drop=True)
df_results.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
The following output confirms that the format of the winter results within this table matches that of the fall/spring rows:
df_results.query("season == 'Winter'").head()
student_id | starting_year | season | score | |
---|---|---|---|---|
32768 | 2020-616 | 2023 | Winter | 99 |
32769 | 2020-35 | 2023 | Winter | 98 |
32770 | 2020-816 | 2023 | Winter | 97 |
32771 | 2020-395 | 2023 | Winter | 97 |
32772 | 2021-468 | 2023 | Winter | 97 |
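If you wanted to quantify those missing winter entries, one option is a simple set comparison between seasons. The sketch below uses a toy DataFrame rather than df_results, but the same pattern would apply to the real table:

```python
import pandas as pd

df = pd.DataFrame({
    'student_id': ['2020-1', '2020-2', '2020-1'],
    'season': ['Fall', 'Fall', 'Winter']})

# Students who have a fall result but no winter result:
fall_ids = set(df.loc[df['season'] == 'Fall', 'student_id'])
winter_ids = set(df.loc[df['season'] == 'Winter', 'student_id'])
missing_winter = fall_ids - winter_ids
```

Here, missing_winter would contain '2020-2', the one student with a fall score but no winter score.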
Note that, when our seasons are sorted alphabetically, Fall will come first, followed by Spring and then Winter. In order to allow for a chronological sort (which will prove useful when creating charts and pivot tables), we’ll add in a ‘season_order’ column that maps these seasons to integers.
df_results['season_order'] = df_results['season'].map(
{'Fall':0,'Winter':1,'Spring':2})
df_results.head()
student_id | starting_year | season | score | season_order | |
---|---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 | 0 |
1 | 2020-2 | 2023 | Fall | 37 | 0 |
2 | 2020-3 | 2023 | Fall | 54 | 0 |
3 | 2020-4 | 2023 | Fall | 56 | 0 |
4 | 2020-5 | 2023 | Fall | 77 | 0 |
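As an alternative to a numeric helper column, pandas also supports ordered categoricals, which sort chronologically on their own. A brief sketch on standalone data:

```python
import pandas as pd

seasons = pd.Series(['Fall', 'Winter', 'Spring', 'Fall'])

# An ordered categorical tells pandas the sort order directly, so
# sort_values() arranges seasons chronologically rather than
# alphabetically:
season_cat = pd.Series(pd.Categorical(
    seasons, categories=['Fall', 'Winter', 'Spring'], ordered=True))
sorted_seasons = season_cat.sort_values()
```

One reason a season_order column may still be preferable for this workflow: categorical dtypes don’t survive a round trip through .to_csv(), whereas an integer column does.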
Merging college and level data into our survey results table#
In its current form, df_results
allows us to calculate survey results at the university-wide level. However, in order to determine whether these results differ by college and by level, we’ll also need to merge college and level data into our dataset.
We can accomplish this by reading data from our curr_enrollment SQL table into a DataFrame, then merging that data with df_results. (The student_id field present in both DataFrames can serve as a merge key.)
Note: if we had survey data for multiple years, our current enrollment table wouldn’t be a good candidate for our merge, since it would show only the current college and levels for each student. We’d instead want to source our enrollment data from a historical enrollment table so that students’ yearly survey results could be matched with their levels and colleges during the years that they took the survey.
# The following code reads in only the fields from df_curr_enrollment
# that we'll need for our analyses. (If we wanted to read in all fields,
# we could replace the field names in our SQL query with *.)
df_curr_enrollment = pd.read_sql(
"Select student_id, college, level, \
level_for_sorting from curr_enrollment", con=e)
df_curr_enrollment.head()
student_id | college | level | level_for_sorting | |
---|---|---|---|---|
0 | 2020-1 | STC | Se | 3 |
1 | 2020-2 | STM | Se | 3 |
2 | 2020-3 | STC | Se | 3 |
3 | 2020-4 | STC | Se | 3 |
4 | 2020-5 | STM | Se | 3 |
The next cell uses a left merge to add enrollment data into df_results. The left
argument for the how
parameter instructs the code to retain all rows in df_results even if no corresponding enrollment data were found. (If we had instead used right
as our argument, all rows in df_curr_enrollment
would have been retained whether or not we had corresponding survey data for them.)
Other options for the how
parameter include outer
, which preserves all rows in both datasets regardless of whether or not a given key was found in both of them, and inner
, which would only keep rows whose student IDs were found in both datasets.
For more on the df.merge()
function used below, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html .
df_results = df_results.merge(
df_curr_enrollment, on='student_id', how='left')
df_results.head()
student_id | starting_year | season | ... | college | level | level_for_sorting | |
---|---|---|---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | ... | STC | Se | 3 |
1 | 2020-2 | 2023 | Fall | ... | STM | Se | 3 |
2 | 2020-3 | 2023 | Fall | ... | STC | Se | 3 |
3 | 2020-4 | 2023 | Fall | ... | STC | Se | 3 |
4 | 2020-5 | 2023 | Fall | ... | STM | Se | 3 |
5 rows × 8 columns
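When auditing a left merge like this one, the indicator parameter of df.merge() can come in handy: it adds a '_merge' column recording whether each row matched. The sketch below demonstrates the idea on toy data rather than our actual tables:

```python
import pandas as pd

results = pd.DataFrame({'student_id': ['2020-1', '2020-2'],
                        'score': [88, 37]})
enrollment = pd.DataFrame({'student_id': ['2020-1'],
                           'college': ['STC']})

# indicator=True adds a '_merge' column showing whether each row
# matched ('both') or came only from the left table ('left_only'):
merged = results.merge(enrollment, on='student_id',
                       how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
```

df.merge() also accepts a validate argument (e.g. validate='many_to_one') that raises an error if the merge keys don’t have the relationship you expect, which can catch accidental duplicates in the enrollment table.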
Now that we’ve merged in college and level data, we’ll save this dataset to a .csv file so that it can be processed by the code in our Descriptive Stats chapter:
df_results.to_csv('2023_survey_results.csv', index = False)
Conclusion#
This script has provided a brief introduction to data cleaning and reformatting. Other PFN sections will provide further examples of data reformatting, as this process is often a necessary prerequisite for analysis and visualization tasks.
Next up is the Descriptive Stats chapter, in which we’ll perform various statistical tests on the new .csv file that we created within this section.