Descriptive Stats#
Now that we’ve learned how to retrieve, reformat, and clean data, we can finally begin analyzing it! This notebook demonstrates how to calculate descriptive statistics in Python using Pandas.
Suppose leaders at NVCU would like to know, on a daily basis, how incoming spring survey results differ from fall and winter ones. (These metrics could change each day as new spring survey data gets released, and the administration does not wish to wait until all results are in before they begin reviewing the numbers.)
One way to accomplish this task would be to manually import survey data from your database each day, paste it into Excel or Google Sheets, pivot the data, and then share the output. However, you could also accomplish these same steps in Python. While this would likely take you longer the first time around, you could then create updated analyses of your data in mere seconds. This notebook will show you how!
import time
start_time = time.time()
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns = 8,
display_max_rows = 16)
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
We'll first import our combined set of fall, winter, and spring student survey results; these results were created within data_prep.ipynb. (Note that these results also include college and level data that we merged in from NVCU's 'curr_enrollment' SQL table; that way, we can evaluate average results by level and college.)
df_survey_results = pd.read_csv('../Data_Prep/2023_survey_results.csv')
# Removing a few non-essential columns from the display so that the table
# will show up better in PDF format:
df_survey_results.drop(['season_order', 'level_for_sorting'],
axis = 1).head(5)
|   | student_id | starting_year | season | score | college | level |
|---|---|---|---|---|---|---|
| 0 | 2020-1 | 2023 | Fall | 88 | STC | Se |
| 1 | 2020-2 | 2023 | Fall | 37 | STM | Se |
| 2 | 2020-3 | 2023 | Fall | 54 | STC | Se |
| 3 | 2020-4 | 2023 | Fall | 56 | STC | Se |
| 4 | 2020-5 | 2023 | Fall | 77 | STM | Se |
Evaluating changes in average university-wide results during the school year#
Our dataset contains fall, winter, and spring survey results. To determine how the mean survey score changed over the course of the year, we can use Pandas' `pivot_table()` function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html). I consider `pivot_table()` to be one of the most useful functions in the Pandas library, and you'll see it used again and again within PFN.

The `pivot_table()` call below introduces three key arguments:

- `index`: the list of values by which to group results. Although our dataset only contains data for one year, we'll still include `starting_year` in our results in order to (1) allow the function to accommodate other school years and (2) demonstrate to the viewer that all of this data comes from 2023. We'll also add both `season_order` and `season` to our list (in that order) so as to display results by season in chronological order. (Without the `season_order` entry, our seasons would be sorted alphabetically, with spring results preceding winter ones.)
- `values`: the metric(s) to assess. We're interested in analyzing changes in average score by season, so we'll pass `score` as our argument. Multiple values can be passed to this argument if needed.
- `aggfunc`: the aggregate function to apply to our list of values. We'll use `mean` here, but we could also have chosen `median` as a measure of the average. (A median-based variant is sketched a bit further below.)

I generally like to add `reset_index()` to the result of `pivot_table()` in order to convert the grouped index values back into regular columns (this also removes the blank index cells that you'll see in the second output below).
df_results_by_season = df_survey_results.pivot_table(
index=['starting_year', 'season_order', 'season'],
values='score', aggfunc='mean').reset_index()
df_results_by_season
|   | starting_year | season_order | season | score |
|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 69.682251 |
| 1 | 2023 | 1 | Winter | 64.217076 |
| 2 | 2023 | 2 | Spring | 72.049622 |
These results show that the average score fell around 5 points from the fall to the winter, then increased nearly 8 points from the winter to the spring.
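As noted earlier, `median` can be substituted for `mean` as the aggregate function. Here's a sketch of that variant (its output isn't shown in this notebook, but it would have the same structure as the table above, with medians in place of means):

# Calculating median (rather than mean) scores by season:
df_survey_results.pivot_table(
    index=['starting_year', 'season_order', 'season'],
    values='score', aggfunc='median').reset_index()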
Here's what our earlier mean-based output looks like without the trailing `reset_index()` call:
df_survey_results.pivot_table(
index=['starting_year', 'season_order', 'season'],
values='score', aggfunc='mean')
|   |   |   | score |
|---|---|---|---|
| starting_year | season_order | season |   |
| 2023 | 0 | Fall | 69.682251 |
|   | 1 | Winter | 64.217076 |
|   | 2 | Spring | 72.049622 |
We can also find the average score across seasons by setting `margins` to True. The `margins_name` argument lets us assign a name to this row; if we don't specify it, the row will be titled 'All.'
df_survey_results.pivot_table(
index = ['starting_year', 'season_order', 'season'],
values = 'score', aggfunc = 'mean', margins=True,
margins_name='2023 Average').reset_index()
|   | starting_year | season_order | season | score |
|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 69.682251 |
| 1 | 2023 | 1 | Winter | 64.217076 |
| 2 | 2023 | 2 | Spring | 72.049622 |
| 3 | 2023 Average |   |   | 68.882983 |
Calculating response rates#
The pivot_table()
function will also let us determine survey response rates as a percentage of our current enrollment. We can import this current enrollment value from our NVCU database:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
Calculating our current enrollment by counting the number of rows in our curr_enrollment table:
enrollment_count = len(pd.read_sql(
"Select * from curr_enrollment", con=e))
enrollment_count
16384
A faster way of computing this number is to request it within the original SQL query via COUNT(*). The following line, which demonstrates this approach, took only 6 milliseconds to run on my computer, one tenth the duration of the previous line (which took 62 milliseconds). If we were dealing with millions of rows instead of thousands, this performance difference would probably be even greater.
enrollment_count = pd.read_sql(
"Select COUNT(*) from curr_enrollment", con=e).iloc[0]['COUNT(*)']
enrollment_count
np.int64(16384)
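If you'd like to reproduce this comparison yourself, here's one rough way to time the two queries. (This timing code is a sketch of my own rather than part of the original analysis; it reuses the `time` and `pandas` imports and the `e` engine defined earlier, and your exact timings will differ from mine.)

# Timing both enrollment-count approaches for comparison:
for query in ["Select * from curr_enrollment",
              "Select COUNT(*) from curr_enrollment"]:
    query_start_time = time.perf_counter()
    pd.read_sql(query, con=e)
    print(f"{query} took "
          f"{1000 * (time.perf_counter() - query_start_time):.1f} ms.")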
Now that we have the denominator* for our response rate calculations, let's go ahead and calculate the numerator (i.e. the number of responses per season). For teaching purposes, the following cell will also show how to use a dictionary to specify separate `aggfunc` values for different `values` entries. In this case, we'll pass `mean` as our aggregate function for the `score` column, but `count` as our aggregate function for the `responses` field.

Note: When calculating row counts, make sure that the column you pass to the `values` argument doesn't contain null values; otherwise, your row counts will be incorrect (as null values will get excluded from your counts). To prevent this issue, I often like to create a column that stores a value of 1 for every row. Using this column (titled `responses` in the following cell) ensures that my pivot table will show the correct row counts for each group.

*A more robust approach here would have been to calculate separate enrollment figures for the fall, winter, and spring periods in order to take new matriculations and withdrawals into account.
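To see how nulls can distort these counts, consider the following small demonstration, which uses a hypothetical DataFrame of my own rather than our survey data:

# 'count' tallies only non-null entries, so basing row counts on a
# null-containing column will understate the true totals:
df_null_demo = pd.DataFrame({'season': ['Fall', 'Fall', 'Winter'],
                             'score': [88, np.nan, 54]})
df_null_demo.pivot_table(index='season', values='score',
                         aggfunc='count')
# Fall's count will appear as 1 rather than 2, since one of its
# scores is missing.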
df_survey_results['responses'] = 1
df_response_rates = df_survey_results.pivot_table(
index=['starting_year', 'season_order', 'season'],
values=['score', 'responses'],
aggfunc={'score':'mean','responses':'count'}).reset_index()
# Because all 'responses' values are 1, we could also have made 'sum'
# the aggfunc for 'responses' rather than 'count.'
df_response_rates
|   | starting_year | season_order | season | responses | score |
|---|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 16384 | 69.682251 |
| 1 | 2023 | 1 | Winter | 13926 | 64.217076 |
| 2 | 2023 | 2 | Spring | 16384 | 72.049622 |
Calculating response rates as the quotient of survey counts and NVCU’s current enrollment:
df_response_rates['response_rate'] = 100*(
df_response_rates['responses'] / enrollment_count)
df_response_rates
|   | starting_year | season_order | season | responses | score | response_rate |
|---|---|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 16384 | 69.682251 | 100.000000 |
| 1 | 2023 | 1 | Winter | 13926 | 64.217076 | 84.997559 |
| 2 | 2023 | 2 | Spring | 16384 | 72.049622 | 100.000000 |
This table shows that our survey response rates were 100% during the fall and spring and around 85% during the winter.
Using the `columns` argument within `pivot_table()` to show seasons side by side#
Currently, the DataFrame is in 'long' format: each row shows data for one specific season. However, in order to more easily calculate the change in results from one season to another, we can also use the `columns` argument when creating a pivot table in order to show scores for each season side by side. (This will prove especially useful when we add additional index variables to our `pivot_table()` call.)

The following call is similar to our earlier `pivot_table()` calls except that the `season_order` and `season` values have been moved from the `index` argument to the `columns` argument. This change makes the seasons appear horizontally rather than vertically.
df_results_by_season_wide = df_survey_results.pivot_table(
index='starting_year', columns=['season_order', 'season'],
values='score', aggfunc='mean').reset_index()
df_results_by_season_wide
| season_order | starting_year | 0 | 1 | 2 |
|---|---|---|---|---|
| season |   | Fall | Winter | Spring |
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 |
Note that, because we passed two values to the `columns` parameter, two levels of headers are now visible. However, I'd like to show just one level of columns, one that combines the `starting_year` header from the top row with the season names from the bottom row. We can accomplish this by first calling `to_flat_index()` to 'flatten' the columns into tuples:
df_results_by_season_wide.columns = (
df_results_by_season_wide.columns.to_flat_index())
df_results_by_season_wide
|   | (starting_year, ) | (0, Fall) | (1, Winter) | (2, Spring) |
|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 |
Next, I'll use a list comprehension to replace our tuple-based columns with string-based ones. Note that I want to keep the first entry ('starting_year') in the first tuple and the second entries ('Fall', 'Winter', and 'Spring') in the others; this can be accomplished by adding an if/else statement to our list comprehension.
df_results_by_season_wide.columns = [
column_tuple[0] if column_tuple[1] not in ['Fall', 'Winter', 'Spring']
else column_tuple[1] for column_tuple in
df_results_by_season_wide.columns]
df_results_by_season_wide
|   | starting_year | Fall | Winter | Spring |
|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 |
Now that we have our seasons next to one another, we can easily calculate changes in average scores between them:
df_results_by_season_wide['Fall-Spring Change'] = (
df_results_by_season_wide['Spring']
- df_results_by_season_wide['Fall'])
df_results_by_season_wide['Fall-Winter Change'] = (
df_results_by_season_wide['Winter']
- df_results_by_season_wide['Fall'])
df_results_by_season_wide['Winter-Spring Change'] = (
df_results_by_season_wide['Spring']
- df_results_by_season_wide['Winter'])
df_results_by_season_wide
|   | starting_year | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 | 2.367371 | -5.465175 | 7.832546 |
We can simplify this code by creating a list of the seasons between which we want to calculate changes in survey scores, then looping through that list. Note that this code produces the same output as the previous cell.
for sp in [('Fall', 'Spring'), ('Fall', 'Winter'),
('Winter', 'Spring')]: # 'sp' here is short for 'season pair.'
df_results_by_season_wide[f'{sp[0]}-{sp[1]} Change'] = (
df_results_by_season_wide[sp[1]]
- df_results_by_season_wide[sp[0]])
df_results_by_season_wide
|   | starting_year | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 | 2.367371 | -5.465175 | 7.832546 |
Adding additional pivot index values#
We now know that our average NVCU student survey scores declined from the fall to the winter and then rose from the winter to the spring. Was this trend the same across colleges and levels? We can answer this question by adding our college and level fields to the index
argument of our pivot table function.
In order to make this section more efficient, we can create a function that performs the pivot table, column renaming, and growth calculations shown above for df_results_by_season_wide
. This will greatly reduce the amount of code that we need to write to perform these additional analyses.
def create_wide_table(index_values):
    '''This function creates a wide pivot table of df_survey_results, then
    performs additional column renaming steps and growth calculations.

    index_values: a list of values to pass to the index argument of
    pivot_table().'''
    df_wide = df_survey_results.pivot_table(
        index = index_values, columns = ['season_order', 'season'],
        values = 'score', aggfunc = 'mean').reset_index()
    # Flattening the two-level column index into tuples:
    df_wide.columns = df_wide.columns.to_flat_index()
    # Replacing the tuple-based columns with string-based ones:
    df_wide.columns = [column_tuple[0] if column_tuple[1] not in
                       ['Fall', 'Winter', 'Spring'] else column_tuple[1]
                       for column_tuple in df_wide.columns]
    # Calculating changes between each season pair ('sp' is short
    # for 'season pair'):
    for sp in [('Fall', 'Spring'), ('Fall', 'Winter'),
               ('Winter', 'Spring')]:
        df_wide[f'{sp[0]}-{sp[1]} Change'] = (
            df_wide[sp[1]] - df_wide[sp[0]])
    return df_wide
Evaluating changes in survey scores by season and college:#
df_results_by_season_and_college_wide = create_wide_table(
index_values = ['starting_year', 'college'])
# Saving these results to a .csv file so that they can be used within
# other parts of Python for Nonprofits:
df_results_by_season_and_college_wide.to_csv(
'survey_results_by_college_wide.csv', index = False)
df_results_by_season_and_college_wide
|   | starting_year | college | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 69.797119 | 64.319684 | 67.077551 | -2.719568 | -5.477435 | 2.757867 |
| 1 | 2023 | STC | 69.568665 | 64.029750 | 66.911444 | -2.657221 | -5.538914 | 2.881694 |
| 2 | 2023 | STL | 69.596675 | 64.160327 | 76.727809 | 7.131134 | -5.436348 | 12.567482 |
| 3 | 2023 | STM | 69.735685 | 64.315841 | 76.639004 | 6.903320 | -5.419844 | 12.323163 |
Although we found earlier that university-wide survey results grew from the fall to the spring, this table shows that results for two colleges (STB and STC) actually dropped over that time period. (Their average spring survey scores were also markedly lower than STL’s and STM’s.) It also demonstrates that every college saw both a drop in scores from the fall to the winter and an increase in scores from the winter to the spring.
The following cell creates a 'long' version of this table that features only one season per row and doesn't display changes in scores between seasons. We'll utilize a .csv copy of this table within PFN's graphing section. (We could also use the wide-formatted table shown above within our graphing script, but I wanted to make sure to demonstrate how to graph long-formatted data.)

Note that `season_order` is added before `season` within the `index` list in order to get Winter results to precede Spring ones; however, it's then dropped in order to help streamline the table.
df_results_by_season_and_college_long = df_survey_results.pivot_table(
index=['starting_year', 'college', 'season_order', 'season'],
values='score', aggfunc='mean').reset_index().drop(
'season_order', axis = 1)
df_results_by_season_and_college_long.to_csv(
'survey_results_by_college_long.csv', index = False)
df_results_by_season_and_college_long
|   | starting_year | college | season | score |
|---|---|---|---|---|
| 0 | 2023 | STB | Fall | 69.797119 |
| 1 | 2023 | STB | Winter | 64.319684 |
| 2 | 2023 | STB | Spring | 67.077551 |
| 3 | 2023 | STC | Fall | 69.568665 |
| 4 | 2023 | STC | Winter | 64.029750 |
| 5 | 2023 | STC | Spring | 66.911444 |
| 6 | 2023 | STL | Fall | 69.596675 |
| 7 | 2023 | STL | Winter | 64.160327 |
| 8 | 2023 | STL | Spring | 76.727809 |
| 9 | 2023 | STM | Fall | 69.735685 |
| 10 | 2023 | STM | Winter | 64.315841 |
| 11 | 2023 | STM | Spring | 76.639004 |
Evaluating changes in survey scores by season and level:#
For this task, we'll pivot the data by `level_for_sorting` so as to order the rows from youngest to oldest.
df_results_by_season_and_level_wide = create_wide_table(
index_values=['starting_year', 'level_for_sorting', 'level'])
df_results_by_season_and_level_wide.drop([
'starting_year', 'level_for_sorting'], axis = 1) # I chose to drop
# these columns so that none of the more important ones would get
# cut off by my notebook's 8-column limit
|   | level | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|
| 0 | Fr | 69.609774 | 64.184035 | 74.833180 | 5.223406 | -5.425740 | 10.649146 |
| 1 | So | 69.688672 | 64.104720 | 69.495124 | -0.193548 | -5.583952 | 5.390404 |
| 2 | Ju | 69.768957 | 64.426093 | 69.350671 | -0.418286 | -5.342864 | 4.924578 |
| 3 | Se | 69.698085 | 64.175332 | 73.546671 | 3.848586 | -5.522753 | 9.371339 |
This table shows that, whereas scores for sophomores and juniors did not change much from the fall to the spring, scores for freshmen and seniors increased quite a bit during that period. All levels showed a fall-to-winter drop followed by a winter-to-spring rise.
Evaluating changes in survey scores by season, level, and college:#
(I originally named the following DataFrame df_results_by_season_level_and_college_wide
, but since that’s a rather long name for a DataFrame that we’ll use quite frequently, I chose the abbreviated name df_results_slc
instead.)
df_results_slc = create_wide_table(
index_values=['starting_year', 'college',
'level_for_sorting', 'level'])
df_results_slc.to_csv('survey_results_slc_wide.csv', index = False)
df_results_slc.head()
|   | starting_year | college | level_for_sorting | level | ... | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 0 | Fr | ... | 69.177235 | -0.416348 | -5.685375 | 5.269026 |
| 1 | 2023 | STB | 1 | So | ... | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 2 | 2023 | STB | 2 | Ju | ... | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 3 | 2023 | STB | 3 | Se | ... | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
| 4 | 2023 | STC | 0 | Fr | ... | 69.079348 | -0.519565 | -5.717023 | 5.197458 |
5 rows × 10 columns
The following cell creates a ‘long’ version of this table that can be incorporated into PFN’s graphing section.
df_survey_results_slc_long = df_survey_results.pivot_table(
index=['starting_year', 'college',
'level_for_sorting', 'level', 'season'],
values='score', aggfunc = 'mean').reset_index()
df_survey_results_slc_long.to_csv(
'survey_results_slc_long.csv', index = False)
df_survey_results_slc_long.head()
|   | starting_year | college | level_for_sorting | level | season | score |
|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 0 | Fr | Fall | 69.593583 |
| 1 | 2023 | STB | 0 | Fr | Spring | 69.177235 |
| 2 | 2023 | STB | 0 | Fr | Winter | 63.908208 |
| 3 | 2023 | STB | 1 | So | Fall | 70.377306 |
| 4 | 2023 | STB | 1 | So | Spring | 65.377306 |
Creating a wide student-level version of our survey results dataset#
We’ll also create a table that shows fall, winter, and spring survey results for each individual student within the same row. (This table will get used within Part 2 of the Descriptive Stats section.)
Since we're simply condensing and reshaping `df_survey_results` here rather than calculating averages, we can use `pivot()` rather than `pivot_table()`. Note that `pivot()` doesn't take `aggfunc` as an argument (as we're not creating any aggregate statistics here).
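It's also worth knowing that `pivot()` requires every index/column combination to be unique; duplicate pairs raise a ValueError, whereas `pivot_table()` would simply aggregate them. The following quick demonstration uses a tiny hypothetical DataFrame of my own rather than our survey data:

# pivot() fails when index/column pairs repeat, since it wouldn't know
# which of the duplicate values to keep:
df_dup_demo = pd.DataFrame({'student_id': ['2020-1', '2020-1'],
                            'season': ['Fall', 'Fall'],
                            'score': [88, 90]})
try:
    df_dup_demo.pivot(index='student_id', columns='season',
                      values='score')
except ValueError as pivot_error:
    print(f"pivot() failed: {pivot_error}")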
df_student_results_wide = df_survey_results.copy().pivot(
index = ['starting_year', 'student_id'],
columns = 'season', values = 'score').reset_index()
df_student_results_wide.head()
| season | starting_year | student_id | Fall | Spring | Winter |
|---|---|---|---|---|---|
| 0 | 2023 | 2020-1 | 88.0 | 86.0 | 81.0 |
| 1 | 2023 | 2020-10 | 69.0 | 73.0 | 63.0 |
| 2 | 2023 | 2020-100 | 68.0 | 88.0 | 60.0 |
| 3 | 2023 | 2020-1000 | 58.0 | 65.0 | 55.0 |
| 4 | 2023 | 2020-1001 | 88.0 | 100.0 | NaN |
Note that 'Spring' is located to the left of 'Winter'. To move it to the right, I'll call `df.insert()`, which allows us to add a new column to the DataFrame. This 'new' column will actually be the current 'Spring' column, which I'll first remove from the DataFrame via `df.pop()`.

The first argument within `df.insert()` is the location of the new column. The code calculates this location by subtracting 1 from the number of columns within `df_student_results_wide`; this will represent the position to the right of the 'Winter' column once the 'Spring' column is removed. (I could have hardcoded this value, but this approach will better accommodate changes to `df_student_results_wide`'s structure.)

`df.insert()`'s second argument is the new column's name (which we'll keep as 'Spring'), and its third argument is the data to add to this new column (which, in this case, will be the existing 'Spring' column entries).

For more details on these functions, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html . (The idea of chaining these two functions together isn't an original one; I probably got it from a StackOverflow answer like this one from Marc Maxmeister: https://stackoverflow.com/a/77463008/13097194 )
df_student_results_wide.insert(
len(df_student_results_wide.columns)-1, 'Spring',
df_student_results_wide.pop('Spring'))
df_student_results_wide.head()
| season | starting_year | student_id | Fall | Winter | Spring |
|---|---|---|---|---|---|
| 0 | 2023 | 2020-1 | 88.0 | 81.0 | 86.0 |
| 1 | 2023 | 2020-10 | 69.0 | 63.0 | 73.0 |
| 2 | 2023 | 2020-100 | 68.0 | 60.0 | 88.0 |
| 3 | 2023 | 2020-1000 | 58.0 | 55.0 | 65.0 |
| 4 | 2023 | 2020-1001 | 88.0 | NaN | 100.0 |
I could also have reordered the columns via the following code; however, this code would then need to be updated to account for any new columns that we chose to add to the dataset, whereas the method shown above is a bit more flexible.
df_student_results_wide = df_student_results_wide[
['starting_year', 'student_id', 'Fall', 'Winter', 'Spring']].copy()
df_student_results_wide.head()
| season | starting_year | student_id | Fall | Winter | Spring |
|---|---|---|---|---|---|
| 0 | 2023 | 2020-1 | 88.0 | 81.0 | 86.0 |
| 1 | 2023 | 2020-10 | 69.0 | 63.0 | 73.0 |
| 2 | 2023 | 2020-100 | 68.0 | 60.0 | 88.0 |
| 3 | 2023 | 2020-1000 | 58.0 | 55.0 | 65.0 |
| 4 | 2023 | 2020-1001 | 88.0 | NaN | 100.0 |
Saving the table to a .csv file so that it can get imported into Part 2 of the Descriptive Stats section:
df_student_results_wide.to_csv(
'survey_results_by_student_wide.csv', index = False)
Comparing rows via `sort_values()` and `rank()`#
Which college/level pairs had the highest and lowest spring survey results? We could examine `df_results_slc` line by line to answer this question; however, two Pandas functions, `sort_values()` and `rank()`, can help us find the answer more efficiently.
First, here are the five college/level pairs with the highest average spring results:
df_results_slc.sort_values('Spring', ascending=False).drop(
['starting_year', 'level_for_sorting'], axis=1).head()
|   | college | level | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|
| 15 | STM | Se | 70.058590 | 64.355972 | 79.309831 | 9.251241 | -5.702618 | 14.953859 |
| 8 | STL | Fr | 69.585030 | 64.321347 | 79.111622 | 9.526592 | -5.263683 | 14.790275 |
| 12 | STM | Fr | 69.650503 | 64.437544 | 78.488468 | 8.837966 | -5.212959 | 14.050925 |
| 11 | STL | Se | 69.026639 | 63.369668 | 77.915984 | 8.889344 | -5.656971 | 14.546315 |
| 10 | STL | Ju | 69.878706 | 64.483619 | 74.634771 | 4.756065 | -5.395087 | 10.151152 |
And here are the five pairs with the lowest spring results:
df_results_slc.sort_values('Spring').drop(
['starting_year', 'level_for_sorting'], axis=1).head()
|   | college | level | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|
| 6 | STC | Ju | 69.180932 | 63.934755 | 64.180932 | -5.000000 | -5.246176 | 0.246176 |
| 5 | STC | So | 69.331325 | 63.480337 | 64.331325 | -5.000000 | -5.850988 | 0.850988 |
| 2 | STB | Ju | 69.950769 | 64.490820 | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 1 | STB | So | 70.377306 | 65.060021 | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 3 | STB | Se | 69.154329 | 63.765957 | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
Note that the use of `sort_values()` here did not actually change the underlying order of the DataFrame. Although the output appeared in sorted order immediately after `sort_values()` was called, the DataFrame retains its original sort order within subsequent lines of code. The following cell demonstrates this:
df_results_slc.head() # Note that the DataFrame
# is once again sorted by college and level
|   | starting_year | college | level_for_sorting | level | ... | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 0 | Fr | ... | 69.177235 | -0.416348 | -5.685375 | 5.269026 |
| 1 | 2023 | STB | 1 | So | ... | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 2 | 2023 | STB | 2 | Ju | ... | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 3 | 2023 | STB | 3 | Se | ... | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
| 4 | 2023 | STC | 0 | Fr | ... | 69.079348 | -0.519565 | -5.717023 | 5.197458 |
5 rows × 10 columns
This behavior, which is seen in many other Pandas functions, is actually quite helpful: it allows you to test out modifications without making them permanent (a mistaken permanent change could otherwise force you to restart your script).
To make a sort persistent, you can use one of the following two lines:
# First option (my preference because it often requires fewer characters):
df_results_slc.sort_values('Spring', inplace=True)
# An alternative option (which can come in handy when making multiple
# changes to a dataset at once):
df_results_slc = (df_results_slc.sort_values('Spring')).copy()
# Make sure NOT to add 'inplace = True' as an argument when using the
# second method, as your DataFrame will then get replaced with
# a 'None' object!
# For an explanation of None, see:
# https://docs.python.org/3/library/constants.html#None
Now that we’ve made our sort persistent, we will continue to see it reflected within subsequent sets of output:
df_results_slc.head()
|   | starting_year | college | level_for_sorting | level | ... | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 2023 | STC | 2 | Ju | ... | 64.180932 | -5.000000 | -5.246176 | 0.246176 |
| 5 | 2023 | STC | 1 | So | ... | 64.331325 | -5.000000 | -5.850988 | 0.850988 |
| 2 | 2023 | STB | 2 | Ju | ... | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 1 | 2023 | STB | 1 | So | ... | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 3 | 2023 | STB | 3 | Se | ... | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
5 rows × 10 columns
It's also worth mentioning that none of these changes, not even the persistent sort that we just effected, affect the underlying .csv file from which we retrieved our data. That file will only get modified if we use `to_csv()` to save our table to that same filename. (You'll usually want to save tables to a different filename anyway so that your output doesn't overwrite your input.)
Calculating percentiles and ranks#
Ranks and percentiles are alternative ways to evaluate values relative to their peers. Let's say that the NVCU administration would like you to calculate both the rank and the percentile of each college/level pair's average spring score. However, they'd also like you to convert the spring results to integers before making these calculations so that pairs with similar scores will get treated equally.
First, we’ll create a new condensed DataFrame that can store these integer-based results, ranks, and percentiles. We’ll then assign ranks to each integer.
# Creating a condensed DataFrame:
df_spring_ranks = df_results_slc.copy()[['starting_year', 'college',
'level_for_sorting', 'level', 'Spring']].sort_values(
'Spring', ascending=False)
# Converting average spring results to integers:
df_spring_ranks['Spring'] = df_spring_ranks['Spring'].astype('int')
# Calculating our ranks:
# Note: the inclusion of "method = 'min'" ensures that, in the case of
# ties, each tied row will show the lowest (i.e. best) possible rank.
# This is the ranking convention that I'm more familiar with, but Pandas
# allows for other methods also. Meanwhile, "ascending = False" assigns the
# best ranks to the highest results.
# See the df.rank() documentation for more details:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.rank.html
df_spring_ranks['Spring_Rank'] = df_spring_ranks[
'Spring'].rank(ascending = False, method = 'min')
df_spring_ranks
|   | starting_year | college | level_for_sorting | level | Spring | Spring_Rank |
|---|---|---|---|---|---|---|
| 15 | 2023 | STM | 3 | Se | 79 | 1.0 |
| 8 | 2023 | STL | 0 | Fr | 79 | 1.0 |
| 12 | 2023 | STM | 0 | Fr | 78 | 3.0 |
| 11 | 2023 | STL | 3 | Se | 77 | 4.0 |
| 10 | 2023 | STL | 2 | Ju | 74 | 5.0 |
| 14 | 2023 | STM | 2 | Ju | 74 | 5.0 |
| 9 | 2023 | STL | 1 | So | 74 | 5.0 |
| 13 | 2023 | STM | 1 | So | 73 | 8.0 |
| 7 | 2023 | STC | 3 | Se | 69 | 9.0 |
| 0 | 2023 | STB | 0 | Fr | 69 | 9.0 |
| 4 | 2023 | STC | 0 | Fr | 69 | 9.0 |
| 3 | 2023 | STB | 3 | Se | 68 | 12.0 |
| 1 | 2023 | STB | 1 | So | 65 | 13.0 |
| 2 | 2023 | STB | 2 | Ju | 64 | 14.0 |
| 5 | 2023 | STC | 1 | So | 64 | 14.0 |
| 6 | 2023 | STC | 2 | Ju | 64 | 14.0 |
Our code for calculating percentiles will also use `df.rank()`; we can instruct that function to display its output as percentiles by adding the argument `pct=True`. We'll also add (1) `ascending=True` so that the highest scores will get the highest percentiles and (2) `method='max'` so that, in the case of ties, the highest possible percentile will get displayed.

Note that, while the highest percentile in the following output is 100, the lowest percentile is not 0. This is because Pandas calculates percentiles as the percentage of results equal to or lower than the current result; even the lowest row will at least be equal to itself, so it can never receive a percentile of 0. (The larger your dataset, the closer the smallest percentile will be to 0.)
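We can confirm this behavior with a tiny standalone example (a hypothetical four-value series, not our survey data):

# Each value's percentile is the share of values less than or equal
# to it, so even the smallest value receives a nonzero percentile:
100 * pd.Series([10, 20, 30, 40]).rank(
    ascending=True, pct=True, method='max')
# This will return 25.0, 50.0, 75.0, and 100.0.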
df_spring_ranks['Spring_Percentile'] = (
100 * df_spring_ranks['Spring'].rank(
ascending=True, pct=True, method='max'))
df_spring_ranks.head()
|   | starting_year | college | level_for_sorting | level | Spring | Spring_Rank | Spring_Percentile |
|---|---|---|---|---|---|---|---|
| 15 | 2023 | STM | 3 | Se | 79 | 1.0 | 100.00 |
| 8 | 2023 | STL | 0 | Fr | 79 | 1.0 | 100.00 |
| 12 | 2023 | STM | 0 | Fr | 78 | 3.0 | 87.50 |
| 11 | 2023 | STL | 3 | Se | 77 | 4.0 | 81.25 |
| 10 | 2023 | STL | 2 | Ju | 74 | 5.0 | 75.00 |
Although our DataFrame was sorted by Spring results, `df.rank()` would still have successfully calculated ranks and percentiles regardless of how the DataFrame happened to be sorted. We can demonstrate this via the following code (which sorts the DataFrame by a different field before calling `rank()`):
df_spring_ranks['Spring_Percentile'] = (
100 * df_spring_ranks.sort_values('college')['Spring'].rank(
ascending=True, pct=True, method='max'))
df_spring_ranks.head()
|   | starting_year | college | level_for_sorting | level | Spring | Spring_Rank | Spring_Percentile |
|---|---|---|---|---|---|---|---|
| 15 | 2023 | STM | 3 | Se | 79 | 1.0 | 100.00 |
| 8 | 2023 | STL | 0 | Fr | 79 | 1.0 | 100.00 |
| 12 | 2023 | STM | 0 | Fr | 78 | 3.0 | 87.50 |
| 11 | 2023 | STL | 3 | Se | 77 | 4.0 | 81.25 |
| 10 | 2023 | STL | 2 | Ju | 74 | 5.0 | 75.00 |
end_time = time.time()
run_time = end_time - start_time
print(f"Finished running this script in {round(run_time, 3)} seconds.")
Finished running this script in 0.572 seconds.
The above line shows how long it took this notebook to run all of its Python code. As you can see, my claim at the start of this notebook that you could use Python to perform these tasks in ‘mere seconds’ wasn’t hyperbole!
Conclusion#
This notebook has provided a brief introduction to descriptive statistics calculations within Python. PFN’s graphing section will teach you how to convert some of the pivot tables created here into line and bar charts, thus making these data easier to interpret. However, before we can dive into the fun of Python-based visualizations, we should first cover a few more important descriptive stats topics.