Descriptive Stats#
Now that we’ve learned how to retrieve, reformat, and clean data, we can finally begin analyzing it! This notebook demonstrates how to calculate descriptive statistics in Python using Pandas.
Suppose leaders at NVCU would like to know, on a daily basis, how incoming spring survey results differ from fall and winter ones. (These metrics could change each day as new spring survey data gets released, and the administration does not wish to wait until all results are in before they begin reviewing the numbers.)
One way to accomplish this task would be to manually import survey data from your database each day, paste it into Excel or Google Sheets, pivot the data, and then share the output. However, you could also accomplish these same steps in Python. While this would likely take you longer the first time around, you could then create updated analyses of your data in mere seconds. This notebook will show you how!
import time
start_time = time.time()
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns = 8,
display_max_rows = 16)
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
We'll first import our combined set of fall, winter, and spring student survey results; these results were created within data_prep.ipynb. (Note that these results also include college and level data that we merged in from NVCU's 'curr_enrollment' SQL table; that way, we can evaluate average results by level and college.)
df_survey_results = pd.read_csv('../Data_Prep/2023_survey_results.csv')
# Removing a few non-essential columns from the display so that the table
# will show up better in PDF format:
df_survey_results.drop(['season_order', 'level_for_sorting'],
axis = 1).head(5)
|   | student_id | starting_year | season | score | college | level |
|---|---|---|---|---|---|---|
| 0 | 2020-1 | 2023 | Fall | 88 | STC | Se |
| 1 | 2020-2 | 2023 | Fall | 37 | STM | Se |
| 2 | 2020-3 | 2023 | Fall | 54 | STC | Se |
| 3 | 2020-4 | 2023 | Fall | 56 | STC | Se |
| 4 | 2020-5 | 2023 | Fall | 77 | STM | Se |
Evaluating changes in average university-wide results during the school year#
Our dataset contains fall, winter, and spring survey results. To determine how the mean survey score changed over the course of the year, we can use Pandas' `pivot_table()` function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html). I consider `pivot_table()` to be one of the most useful functions in the Pandas library, and you'll see it used again and again within PFN.

The `pivot_table()` call below introduces three key arguments:

- `index`: the list of values by which to group results. Although our dataset only contains data for one year, we'll still include `starting_year` in our results in order to (1) allow the function to accommodate other school years and (2) demonstrate to the viewer that all of this data comes from 2023. We'll also add both `season_order` and `season` to our list (in that order) so as to display results by season in chronological order. (Without the `season_order` entry, our seasons would be sorted alphabetically, with spring results preceding winter ones.)
- `values`: the metric(s) to assess. We're interested in analyzing changes in average score by season, so we'll pass `score` as our argument. Multiple values can be passed to this argument if needed.
- `aggfunc`: the aggregate function to apply to our list of values. We'll use `mean` here, but we could also have chosen `median` as a measure of the average. (A median-based variant is sketched a bit further below.)

I generally like to add `reset_index()` to the result of `pivot_table()` in order to convert the grouped index values back into regular columns (this also removes the blank index cells that you'll see in the second output below).
df_results_by_season = df_survey_results.pivot_table(
index=['starting_year', 'season_order', 'season'],
values='score', aggfunc='mean').reset_index()
df_results_by_season
|   | starting_year | season_order | season | score |
|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 69.682251 |
| 1 | 2023 | 1 | Winter | 64.217076 |
| 2 | 2023 | 2 | Spring | 72.049622 |
These results show that the average score fell around 5 points from the fall to the winter, then increased nearly 8 points from the winter to the spring.
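As noted earlier, `median` can be substituted for `mean` as the aggregate function. Here's a sketch of that variant (its output isn't shown in this notebook, but it would have the same structure as the table above, with medians in place of means):

# Calculating median (rather than mean) scores by season:
df_survey_results.pivot_table(
    index=['starting_year', 'season_order', 'season'],
    values='score', aggfunc='median').reset_index()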
Here's what our earlier mean-based output looks like without the trailing `reset_index()` call:
df_survey_results.pivot_table(
index=['starting_year', 'season_order', 'season'],
values='score', aggfunc='mean')
|   |   |   | score |
|---|---|---|---|
| starting_year | season_order | season |   |
| 2023 | 0 | Fall | 69.682251 |
|   | 1 | Winter | 64.217076 |
|   | 2 | Spring | 72.049622 |
We can also find the average score across seasons by setting `margins` to True. The `margins_name` argument lets us assign a name to this row; if we don't specify it, the row will be titled 'All.'
df_survey_results.pivot_table(
index = ['starting_year', 'season_order', 'season'],
values = 'score', aggfunc = 'mean', margins=True,
margins_name='2023 Average').reset_index()
|   | starting_year | season_order | season | score |
|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 69.682251 |
| 1 | 2023 | 1 | Winter | 64.217076 |
| 2 | 2023 | 2 | Spring | 72.049622 |
| 3 | 2023 Average |   |   | 68.882983 |
Calculating response rates#
The pivot_table()
function will also let us determine survey response rates as a percentage of our current enrollment. We can import this current enrollment value from our NVCU database:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
Calculating our current enrollment by counting the number of rows in our curr_enrollment table:
enrollment_count = len(pd.read_sql(
"Select * from curr_enrollment", con=e))
enrollment_count
16384
A faster way of computing this number is to request it within the original SQL query via COUNT(*). The following line, which demonstrates this approach, took only 6 milliseconds to run on my computer, one tenth the duration of the previous line (which took 62 milliseconds). If we were dealing with millions of rows instead of thousands, this performance difference would probably be even greater.
enrollment_count = pd.read_sql(
"Select COUNT(*) from curr_enrollment", con=e).iloc[0]['COUNT(*)']
enrollment_count
np.int64(16384)
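If you'd like to reproduce this comparison yourself, here's one rough way to time the two queries. (This timing code is a sketch of my own rather than part of the original analysis; it reuses the `time` and `pandas` imports and the `e` engine defined earlier, and your exact timings will differ from mine.)

# Timing both enrollment-count approaches for comparison:
for query in ["Select * from curr_enrollment",
              "Select COUNT(*) from curr_enrollment"]:
    query_start_time = time.perf_counter()
    pd.read_sql(query, con=e)
    print(f"{query} took "
          f"{1000 * (time.perf_counter() - query_start_time):.1f} ms.")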
Now that we have the denominator* for our response rate calculations, let's go ahead and calculate the numerator (i.e. the number of responses per season). For teaching purposes, the following cell will also show how to use a dictionary to specify separate `aggfunc` values for different `values` entries. In this case, we'll pass `mean` as our aggregate function for the `score` column, but `count` as our aggregate function for the `responses` field.

Note: When calculating row counts, make sure that the column you pass to the `values` argument doesn't contain null values; otherwise, your row counts will be incorrect (as null values will get excluded from your counts). To prevent this issue, I often like to create a column that stores a value of 1 for every row. Using this column (titled `responses` in the following cell) ensures that my pivot table will show the correct row counts for each group.

*A more robust approach here would have been to calculate separate enrollment figures for the fall, winter, and spring periods in order to take new matriculations and withdrawals into account.
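To see how nulls can distort these counts, consider the following small demonstration, which uses a hypothetical DataFrame of my own rather than our survey data:

# 'count' tallies only non-null entries, so basing row counts on a
# null-containing column will understate the true totals:
df_null_demo = pd.DataFrame({'season': ['Fall', 'Fall', 'Winter'],
                             'score': [88, np.nan, 54]})
df_null_demo.pivot_table(index='season', values='score',
                         aggfunc='count')
# Fall's count will appear as 1 rather than 2, since one of its
# scores is missing.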
df_survey_results['responses'] = 1
df_response_rates = df_survey_results.pivot_table(
index=['starting_year', 'season_order', 'season'],
values=['score', 'responses'],
aggfunc={'score':'mean','responses':'count'}).reset_index()
# Because all 'responses' values are 1, we could also have made 'sum'
# the aggfunc for 'responses' rather than 'count.'
df_response_rates
|   | starting_year | season_order | season | responses | score |
|---|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 16384 | 69.682251 |
| 1 | 2023 | 1 | Winter | 13926 | 64.217076 |
| 2 | 2023 | 2 | Spring | 16384 | 72.049622 |
Calculating response rates as the quotient of survey counts and NVCU’s current enrollment:
df_response_rates['response_rate'] = 100*(
df_response_rates['responses'] / enrollment_count)
df_response_rates
|   | starting_year | season_order | season | responses | score | response_rate |
|---|---|---|---|---|---|---|
| 0 | 2023 | 0 | Fall | 16384 | 69.682251 | 100.000000 |
| 1 | 2023 | 1 | Winter | 13926 | 64.217076 | 84.997559 |
| 2 | 2023 | 2 | Spring | 16384 | 72.049622 | 100.000000 |
This table shows that our survey response rates were 100% during the fall and spring and around 85% during the winter.
Using the `columns` argument within `pivot_table()` to show seasons side by side#
Currently, the DataFrame is in 'long' format: each row shows data for one specific season. However, in order to more easily calculate the change in results from one season to another, we can also use the `columns` argument when creating a pivot table in order to show scores for each season side by side. (This will prove especially useful when we add additional index variables to our `pivot_table()` call.)

The following call is similar to our earlier `pivot_table()` calls except that the `season_order` and `season` values have been moved from the `index` argument to the `columns` argument. This change makes the seasons appear horizontally rather than vertically.
df_results_by_season_wide = df_survey_results.pivot_table(
index='starting_year', columns=['season_order', 'season'],
values='score', aggfunc='mean').reset_index()
df_results_by_season_wide
| season_order | starting_year | 0 | 1 | 2 |
|---|---|---|---|---|
| season |   | Fall | Winter | Spring |
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 |
Note that, because we passed two values to the `columns` parameter, two levels of headers are now visible. However, I'd like to show just one level of columns, one that combines the `starting_year` header from the top row with the season names from the bottom row. We can accomplish this by first calling `to_flat_index()` to 'flatten' the columns into tuples:
df_results_by_season_wide.columns = (
df_results_by_season_wide.columns.to_flat_index())
df_results_by_season_wide
|   | (starting_year, ) | (0, Fall) | (1, Winter) | (2, Spring) |
|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 |
Next, I'll use a list comprehension to replace our tuple-based columns with string-based ones. Note that I want to keep the first entry ('starting_year') in the first tuple and the second entries ('Fall', 'Winter', and 'Spring') in the others; this can be accomplished by adding an if/else statement to our list comprehension.
df_results_by_season_wide.columns = [
column_tuple[0] if column_tuple[1] not in ['Fall', 'Winter', 'Spring']
else column_tuple[1] for column_tuple in
df_results_by_season_wide.columns]
df_results_by_season_wide
|   | starting_year | Fall | Winter | Spring |
|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 |
Now that we have our seasons next to one another, we can easily calculate changes in average scores between them:
df_results_by_season_wide['Fall-Spring Change'] = (
df_results_by_season_wide['Spring']
- df_results_by_season_wide['Fall'])
df_results_by_season_wide['Fall-Winter Change'] = (
df_results_by_season_wide['Winter']
- df_results_by_season_wide['Fall'])
df_results_by_season_wide['Winter-Spring Change'] = (
df_results_by_season_wide['Spring']
- df_results_by_season_wide['Winter'])
df_results_by_season_wide
|   | starting_year | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 | 2.367371 | -5.465175 | 7.832546 |
We can simplify this code by creating a list of the seasons between which we want to calculate changes in survey scores, then looping through that list. Note that this code produces the same output as the previous cell.
for sp in [('Fall', 'Spring'), ('Fall', 'Winter'),
('Winter', 'Spring')]: # 'sp' here is short for 'season pair.'
df_results_by_season_wide[f'{sp[0]}-{sp[1]} Change'] = (
df_results_by_season_wide[sp[1]]
- df_results_by_season_wide[sp[0]])
df_results_by_season_wide
|   | starting_year | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|
| 0 | 2023 | 69.682251 | 64.217076 | 72.049622 | 2.367371 | -5.465175 | 7.832546 |
Adding additional pivot index values#
We now know that our average NVCU student survey scores declined from the fall to the winter and then rose from the winter to the spring. Was this trend the same across colleges and levels? We can answer this question by adding our college and level fields to the index
argument of our pivot table function.
In order to make this section more efficient, we can create a function that performs the pivot table, column renaming, and growth calculations shown above for df_results_by_season_wide
. This will greatly reduce the amount of code that we need to write to perform these additional analyses.
def create_wide_table(index_values):
    '''This function creates a wide pivot table of df_survey_results, then
    performs additional column renaming steps and growth calculations.

    index_values: a list of values to pass to the index argument of
    pivot_table().'''
    df_wide = df_survey_results.pivot_table(
        index = index_values, columns = ['season_order', 'season'],
        values = 'score', aggfunc = 'mean').reset_index()
    # Flattening the two-level column index into tuples:
    df_wide.columns = df_wide.columns.to_flat_index()
    # Replacing the tuple-based columns with string-based ones:
    df_wide.columns = [column_tuple[0] if column_tuple[1] not in
                       ['Fall', 'Winter', 'Spring'] else column_tuple[1]
                       for column_tuple in df_wide.columns]
    # Calculating changes between each season pair ('sp' is short
    # for 'season pair'):
    for sp in [('Fall', 'Spring'), ('Fall', 'Winter'),
               ('Winter', 'Spring')]:
        df_wide[f'{sp[0]}-{sp[1]} Change'] = (
            df_wide[sp[1]] - df_wide[sp[0]])
    return df_wide
Evaluating changes in survey scores by season and college:#
df_results_by_season_and_college_wide = create_wide_table(
index_values = ['starting_year', 'college'])
# Saving these results to a .csv file so that they can be used within
# other parts of Python for Nonprofits:
df_results_by_season_and_college_wide.to_csv(
'survey_results_by_college_wide.csv', index = False)
df_results_by_season_and_college_wide
|   | starting_year | college | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 69.797119 | 64.319684 | 67.077551 | -2.719568 | -5.477435 | 2.757867 |
| 1 | 2023 | STC | 69.568665 | 64.029750 | 66.911444 | -2.657221 | -5.538914 | 2.881694 |
| 2 | 2023 | STL | 69.596675 | 64.160327 | 76.727809 | 7.131134 | -5.436348 | 12.567482 |
| 3 | 2023 | STM | 69.735685 | 64.315841 | 76.639004 | 6.903320 | -5.419844 | 12.323163 |
Although we found earlier that university-wide survey results grew from the fall to the spring, this table shows that results for two colleges (STB and STC) actually dropped over that time period. (Their average spring survey scores were also markedly lower than STL’s and STM’s.) It also demonstrates that every college saw both a drop in scores from the fall to the winter and an increase in scores from the winter to the spring.
The following cell creates a 'long' version of this table that features only one season per row and doesn't display changes in scores between seasons. We'll utilize a .csv copy of this table within PFN's graphing section. (We could also use the wide-formatted table shown above within our graphing script, but I wanted to make sure to demonstrate how to graph long-formatted data.)

Note that `season_order` is added before `season` within the `index` list in order to get Winter results to precede Spring ones; however, it's then dropped in order to help streamline the table.
df_results_by_season_and_college_long = df_survey_results.pivot_table(
index=['starting_year', 'college', 'season_order', 'season'],
values='score', aggfunc='mean').reset_index().drop(
'season_order', axis = 1)
df_results_by_season_and_college_long.to_csv(
'survey_results_by_college_long.csv', index = False)
df_results_by_season_and_college_long
|   | starting_year | college | season | score |
|---|---|---|---|---|
| 0 | 2023 | STB | Fall | 69.797119 |
| 1 | 2023 | STB | Winter | 64.319684 |
| 2 | 2023 | STB | Spring | 67.077551 |
| 3 | 2023 | STC | Fall | 69.568665 |
| 4 | 2023 | STC | Winter | 64.029750 |
| 5 | 2023 | STC | Spring | 66.911444 |
| 6 | 2023 | STL | Fall | 69.596675 |
| 7 | 2023 | STL | Winter | 64.160327 |
| 8 | 2023 | STL | Spring | 76.727809 |
| 9 | 2023 | STM | Fall | 69.735685 |
| 10 | 2023 | STM | Winter | 64.315841 |
| 11 | 2023 | STM | Spring | 76.639004 |
Evaluating changes in survey scores by season and level:#
For this task, we'll pivot the data by `level_for_sorting` so as to order the rows from youngest to oldest.
df_results_by_season_and_level_wide = create_wide_table(
index_values=['starting_year', 'level_for_sorting', 'level'])
df_results_by_season_and_level_wide.drop([
'starting_year', 'level_for_sorting'], axis = 1) # I chose to drop
# these columns so that none of the more important ones would get
# cut off by my notebook's 8-column limit
|   | level | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|
| 0 | Fr | 69.609774 | 64.184035 | 74.833180 | 5.223406 | -5.425740 | 10.649146 |
| 1 | So | 69.688672 | 64.104720 | 69.495124 | -0.193548 | -5.583952 | 5.390404 |
| 2 | Ju | 69.768957 | 64.426093 | 69.350671 | -0.418286 | -5.342864 | 4.924578 |
| 3 | Se | 69.698085 | 64.175332 | 73.546671 | 3.848586 | -5.522753 | 9.371339 |
This table shows that, whereas scores for sophomores and juniors did not change much from the fall to the spring, scores for freshmen and seniors increased quite a bit during that period. All levels showed a fall-to-winter drop followed by a winter-to-spring rise.
Evaluating changes in survey scores by season, level, and college:#
(I originally named the following DataFrame df_results_by_season_level_and_college_wide
, but since that’s a rather long name for a DataFrame that we’ll use quite frequently, I chose the abbreviated name df_results_slc
instead.)
df_results_slc = create_wide_table(
index_values=['starting_year', 'college',
'level_for_sorting', 'level'])
df_results_slc.to_csv('survey_results_slc_wide.csv', index = False)
df_results_slc.head()
|   | starting_year | college | level_for_sorting | level | ... | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 0 | Fr | ... | 69.177235 | -0.416348 | -5.685375 | 5.269026 |
| 1 | 2023 | STB | 1 | So | ... | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 2 | 2023 | STB | 2 | Ju | ... | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 3 | 2023 | STB | 3 | Se | ... | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
| 4 | 2023 | STC | 0 | Fr | ... | 69.079348 | -0.519565 | -5.717023 | 5.197458 |
5 rows × 10 columns
The following cell creates a ‘long’ version of this table that can be incorporated into PFN’s graphing section.
df_survey_results_slc_long = df_survey_results.pivot_table(
index=['starting_year', 'college',
'level_for_sorting', 'level', 'season'],
values='score', aggfunc = 'mean').reset_index()
df_survey_results_slc_long.to_csv(
'survey_results_slc_long.csv', index = False)
df_survey_results_slc_long.head()
|   | starting_year | college | level_for_sorting | level | season | score |
|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 0 | Fr | Fall | 69.593583 |
| 1 | 2023 | STB | 0 | Fr | Spring | 69.177235 |
| 2 | 2023 | STB | 0 | Fr | Winter | 63.908208 |
| 3 | 2023 | STB | 1 | So | Fall | 70.377306 |
| 4 | 2023 | STB | 1 | So | Spring | 65.377306 |
Creating a wide student-level version of our survey results dataset#
We’ll also create a table that shows fall, winter, and spring survey results for each individual student within the same row. (This table will get used within Part 2 of the Descriptive Stats section.)
Since we're simply condensing and reshaping `df_survey_results` here rather than calculating averages, we can use `pivot()` rather than `pivot_table()`. Note that `pivot()` doesn't take `aggfunc` as an argument (as we're not creating any aggregate statistics here).
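It's also worth knowing that `pivot()` requires every index/column combination to be unique; duplicate pairs raise a ValueError, whereas `pivot_table()` would simply aggregate them. The following quick demonstration uses a tiny hypothetical DataFrame of my own rather than our survey data:

# pivot() fails when index/column pairs repeat, since it wouldn't know
# which of the duplicate values to keep:
df_dup_demo = pd.DataFrame({'student_id': ['2020-1', '2020-1'],
                            'season': ['Fall', 'Fall'],
                            'score': [88, 90]})
try:
    df_dup_demo.pivot(index='student_id', columns='season',
                      values='score')
except ValueError as pivot_error:
    print(f"pivot() failed: {pivot_error}")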
df_student_results_wide = df_survey_results.copy().pivot(
index = ['starting_year', 'student_id'],
columns = 'season', values = 'score').reset_index()
df_student_results_wide.head()
| season | starting_year | student_id | Fall | Spring | Winter |
|---|---|---|---|---|---|
| 0 | 2023 | 2020-1 | 88.0 | 86.0 | 81.0 |
| 1 | 2023 | 2020-10 | 69.0 | 73.0 | 63.0 |
| 2 | 2023 | 2020-100 | 68.0 | 88.0 | 60.0 |
| 3 | 2023 | 2020-1000 | 58.0 | 65.0 | 55.0 |
| 4 | 2023 | 2020-1001 | 88.0 | 100.0 | NaN |
Note that 'Spring' is located to the left of 'Winter'. To move it to the right, I'll call `df.insert()`, which allows us to add a new column to the DataFrame. This 'new' column will actually be the current 'Spring' column, which I'll first remove from the DataFrame via `df.pop()`.

The first argument within `df.insert()` is the location of the new column. The code calculates this location by subtracting 1 from the number of columns within `df_student_results_wide`; this will represent the position to the right of the 'Winter' column once the 'Spring' column is removed. (I could have hardcoded this value, but this approach will better accommodate changes to `df_student_results_wide`'s structure.)

`df.insert()`'s second argument is the new column's name (which we'll keep as 'Spring'), and its third argument is the data to add to this new column (which, in this case, will be the existing 'Spring' column entries).

For more details on these functions, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html . (The idea of chaining these two functions together isn't an original one; I probably got it from a StackOverflow answer like this one from Marc Maxmeister: https://stackoverflow.com/a/77463008/13097194 )
df_student_results_wide.insert(
len(df_student_results_wide.columns)-1, 'Spring',
df_student_results_wide.pop('Spring'))
df_student_results_wide.head()
| season | starting_year | student_id | Fall | Winter | Spring |
|---|---|---|---|---|---|
| 0 | 2023 | 2020-1 | 88.0 | 81.0 | 86.0 |
| 1 | 2023 | 2020-10 | 69.0 | 63.0 | 73.0 |
| 2 | 2023 | 2020-100 | 68.0 | 60.0 | 88.0 |
| 3 | 2023 | 2020-1000 | 58.0 | 55.0 | 65.0 |
| 4 | 2023 | 2020-1001 | 88.0 | NaN | 100.0 |
I could also have reordered the columns via the following code; however, this code would then need to be updated to account for any new columns that we chose to add to the dataset, whereas the method shown above is a bit more flexible.
df_student_results_wide = df_student_results_wide[
['starting_year', 'student_id', 'Fall', 'Winter', 'Spring']].copy()
df_student_results_wide.head()
| season | starting_year | student_id | Fall | Winter | Spring |
|---|---|---|---|---|---|
| 0 | 2023 | 2020-1 | 88.0 | 81.0 | 86.0 |
| 1 | 2023 | 2020-10 | 69.0 | 63.0 | 73.0 |
| 2 | 2023 | 2020-100 | 68.0 | 60.0 | 88.0 |
| 3 | 2023 | 2020-1000 | 58.0 | 55.0 | 65.0 |
| 4 | 2023 | 2020-1001 | 88.0 | NaN | 100.0 |
Saving the table to a .csv file so that it can get imported into Part 2 of the Descriptive Stats section:
df_student_results_wide.to_csv(
'survey_results_by_student_wide.csv', index = False)
Comparing rows via `sort_values()` and `rank()`#
Which college/level pairs had the highest and lowest spring survey results? We could examine `df_results_slc` line by line to answer this question; however, two Pandas functions, `sort_values()` and `rank()`, can help us find the answer more efficiently.
First, here are the five college/level pairs with the highest average spring results:
df_results_slc.sort_values('Spring', ascending=False).drop(
['starting_year', 'level_for_sorting'], axis=1).head()
|   | college | level | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|
| 15 | STM | Se | 70.058590 | 64.355972 | 79.309831 | 9.251241 | -5.702618 | 14.953859 |
| 8 | STL | Fr | 69.585030 | 64.321347 | 79.111622 | 9.526592 | -5.263683 | 14.790275 |
| 12 | STM | Fr | 69.650503 | 64.437544 | 78.488468 | 8.837966 | -5.212959 | 14.050925 |
| 11 | STL | Se | 69.026639 | 63.369668 | 77.915984 | 8.889344 | -5.656971 | 14.546315 |
| 10 | STL | Ju | 69.878706 | 64.483619 | 74.634771 | 4.756065 | -5.395087 | 10.151152 |
And here are the five pairs with the lowest spring results:
df_results_slc.sort_values('Spring').drop(
['starting_year', 'level_for_sorting'], axis=1).head()
|   | college | level | Fall | Winter | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|
| 6 | STC | Ju | 69.180932 | 63.934755 | 64.180932 | -5.000000 | -5.246176 | 0.246176 |
| 5 | STC | So | 69.331325 | 63.480337 | 64.331325 | -5.000000 | -5.850988 | 0.850988 |
| 2 | STB | Ju | 69.950769 | 64.490820 | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 1 | STB | So | 70.377306 | 65.060021 | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 3 | STB | Se | 69.154329 | 63.765957 | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
Note that the use of `sort_values()` here did not actually change the underlying order of the DataFrame. Although the output appeared in sorted order immediately after `sort_values()` was called, the DataFrame retains its original sort order within subsequent lines of code. The following cell demonstrates this:
df_results_slc.head() # Note that the DataFrame
# is once again sorted by college and level
|   | starting_year | college | level_for_sorting | level | ... | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | STB | 0 | Fr | ... | 69.177235 | -0.416348 | -5.685375 | 5.269026 |
| 1 | 2023 | STB | 1 | So | ... | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 2 | 2023 | STB | 2 | Ju | ... | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 3 | 2023 | STB | 3 | Se | ... | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
| 4 | 2023 | STC | 0 | Fr | ... | 69.079348 | -0.519565 | -5.717023 | 5.197458 |
5 rows × 10 columns
This behavior, which is seen in many other Pandas functions, is actually quite helpful: it allows you to test out modifications without making them permanent (a mistaken permanent change could otherwise force you to restart your script).
To make a sort persistent, you can use one of the following two lines:
# First option (my preference because it often requires fewer characters):
df_results_slc.sort_values('Spring', inplace=True)
# An alternative option (which can come in handy when making multiple
# changes to a dataset at once):
df_results_slc = (df_results_slc.sort_values('Spring')).copy()
# Make sure NOT to add 'inplace = True' as an argument when using the
# second method, as your DataFrame will then get replaced with
# a 'None' object!
# For an explanation of None, see:
# https://docs.python.org/3/library/constants.html#None
Now that we’ve made our sort persistent, we will continue to see it reflected within subsequent sets of output:
df_results_slc.head()
|   | starting_year | college | level_for_sorting | level | ... | Spring | Fall-Spring Change | Fall-Winter Change | Winter-Spring Change |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 2023 | STC | 2 | Ju | ... | 64.180932 | -5.000000 | -5.246176 | 0.246176 |
| 5 | 2023 | STC | 1 | So | ... | 64.331325 | -5.000000 | -5.850988 | 0.850988 |
| 2 | 2023 | STB | 2 | Ju | ... | 64.950769 | -5.000000 | -5.459949 | 0.459949 |
| 1 | 2023 | STB | 1 | So | ... | 65.377306 | -5.000000 | -5.317285 | 0.317285 |
| 3 | 2023 | STB | 3 | Se | ... | 68.543287 | -0.611041 | -5.388371 | 4.777330 |
5 rows × 10 columns
It's also worth mentioning that none of these changes, not even the persistent sort that we just effected, affect the underlying .csv file from which we retrieved our data. That file will only get modified if we use `to_csv()` to save our table to that same filename. (You'll usually want to save tables to a different filename anyway so that your output doesn't overwrite your input.)
Calculating percentiles and ranks#
Ranks and percentiles are alternative ways to evaluate values relative to their peers. Let's say that the NVCU administration would like you to calculate both the rank and the percentile of each college/level pair's average spring score. However, they'd also like you to convert the spring results to integers before making these calculations so that pairs with similar scores will get treated equally.
First, we’ll create a new condensed DataFrame that can store these integer-based results, ranks, and percentiles. We’ll then assign ranks to each integer.
# Creating a condensed DataFrame:
df_spring_ranks = df_results_slc.copy()[['starting_year', 'college',
'level_for_sorting', 'level', 'Spring']].sort_values(
'Spring', ascending=False)
# Converting average spring results to integers:
df_spring_ranks['Spring'] = df_spring_ranks['Spring'].astype('int')
# Calculating our ranks:
# Note: the inclusion of "method = 'min'" ensures that, in the case of
# ties, each tied row will show the lowest (i.e. best) possible rank.
# This is the ranking convention that I'm more familiar with, but Pandas
# allows for other methods also. Meanwhile, "ascending = False" assigns the
# best ranks to the highest results.
# See the df.rank() documentation for more details:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.rank.html
df_spring_ranks['Spring_Rank'] = df_spring_ranks[
'Spring'].rank(ascending = False, method = 'min')
df_spring_ranks
|   | starting_year | college | level_for_sorting | level | Spring | Spring_Rank |
|---|---|---|---|---|---|---|
| 15 | 2023 | STM | 3 | Se | 79 | 1.0 |
| 8 | 2023 | STL | 0 | Fr | 79 | 1.0 |
| 12 | 2023 | STM | 0 | Fr | 78 | 3.0 |
| 11 | 2023 | STL | 3 | Se | 77 | 4.0 |
| 10 | 2023 | STL | 2 | Ju | 74 | 5.0 |
| 14 | 2023 | STM | 2 | Ju | 74 | 5.0 |
| 9 | 2023 | STL | 1 | So | 74 | 5.0 |
| 13 | 2023 | STM | 1 | So | 73 | 8.0 |
| 7 | 2023 | STC | 3 | Se | 69 | 9.0 |
| 0 | 2023 | STB | 0 | Fr | 69 | 9.0 |
| 4 | 2023 | STC | 0 | Fr | 69 | 9.0 |
| 3 | 2023 | STB | 3 | Se | 68 | 12.0 |
| 1 | 2023 | STB | 1 | So | 65 | 13.0 |
| 2 | 2023 | STB | 2 | Ju | 64 | 14.0 |
| 5 | 2023 | STC | 1 | So | 64 | 14.0 |
| 6 | 2023 | STC | 2 | Ju | 64 | 14.0 |
Our code for calculating percentiles will also use `df.rank()`; we can instruct that function to display its output as percentiles by adding the argument `pct=True`. We'll also add (1) `ascending=True` so that the highest scores will get the highest percentiles and (2) `method='max'` so that, in the case of ties, the highest possible percentile will get displayed.

Note that, while the highest percentile in the following output is 100, the lowest percentile is not 0. This is because Pandas calculates percentiles as the percentage of results equal to or lower than the current result; even the lowest row will at least be equal to itself, so it can never receive a percentile of 0. (The larger your dataset, the closer the smallest percentile will be to 0.)
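We can confirm this behavior with a tiny standalone example (a hypothetical four-value series, not our survey data):

# Each value's percentile is the share of values less than or equal
# to it, so even the smallest value receives a nonzero percentile:
100 * pd.Series([10, 20, 30, 40]).rank(
    ascending=True, pct=True, method='max')
# This will return 25.0, 50.0, 75.0, and 100.0.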
df_spring_ranks['Spring_Percentile'] = (
100 * df_spring_ranks['Spring'].rank(
ascending=True, pct=True, method='max'))
df_spring_ranks.head()
|   | starting_year | college | level_for_sorting | level | Spring | Spring_Rank | Spring_Percentile |
|---|---|---|---|---|---|---|---|
| 15 | 2023 | STM | 3 | Se | 79 | 1.0 | 100.00 |
| 8 | 2023 | STL | 0 | Fr | 79 | 1.0 | 100.00 |
| 12 | 2023 | STM | 0 | Fr | 78 | 3.0 | 87.50 |
| 11 | 2023 | STL | 3 | Se | 77 | 4.0 | 81.25 |
| 10 | 2023 | STL | 2 | Ju | 74 | 5.0 | 75.00 |
Although our DataFrame was sorted by Spring results, `df.rank()` would still have successfully calculated ranks and percentiles regardless of how the DataFrame happened to be sorted. We can demonstrate this via the following code (which sorts the DataFrame by a different field before calling `rank()`):
df_spring_ranks['Spring_Percentile'] = (
100 * df_spring_ranks.sort_values('college')['Spring'].rank(
ascending=True, pct=True, method='max'))
df_spring_ranks.head()
|   | starting_year | college | level_for_sorting | level | Spring | Spring_Rank | Spring_Percentile |
|---|---|---|---|---|---|---|---|
| 15 | 2023 | STM | 3 | Se | 79 | 1.0 | 100.00 |
| 8 | 2023 | STL | 0 | Fr | 79 | 1.0 | 100.00 |
| 12 | 2023 | STM | 0 | Fr | 78 | 3.0 | 87.50 |
| 11 | 2023 | STL | 3 | Se | 77 | 4.0 | 81.25 |
| 10 | 2023 | STL | 2 | Ju | 74 | 5.0 | 75.00 |
end_time = time.time()
run_time = end_time - start_time
print(f"Finished running this script in {round(run_time, 3)} seconds.")
Finished running this script in 0.572 seconds.
The above line shows how long it took this notebook to run all of its Python code. As you can see, my claim at the start of this notebook that you could use Python to perform these tasks in ‘mere seconds’ wasn’t hyperbole!
Conclusion#
This notebook has provided a brief introduction to descriptive statistics calculations within Python. PFN’s graphing section will teach you how to convert some of the pivot tables created here into line and bar charts, thus making these data easier to interpret. However, before we can dive into the fun of Python-based visualizations, we should first cover a few more important descriptive stats topics.