Descriptive Stats: Part 2#
By Kenneth Burchfiel
This second part of Python for Nonprofits’ descriptive stats section covers several potential data analysis pitfalls. Specifically, it will explore:
How to adjust for missing values when calculating weighted averages
The risk of relying on column index positions
Why column-wise operations should be preferred over for loops
Challenges with finding average values for fields whose rows are themselves averages
Methods of accounting for missing data when creating pivot tables
Issues with using np.where() to create derivatives of columns with missing data (and why map() and np.select() are better fits)
I have to admit that, even by programming textbook standards, you may not find this to be the most exciting chapter. You may well be eager to skip ahead to the graphing, mapping, and online dashboard sections of PFN–i.e. the ‘fun’ stuff.
However, in order to be confident that your graphs, maps, and dashboards will provide an accurate view of your underlying data, it’s crucial to exercise caution when pivoting and transforming tables. This chapter is meant to help you be more cautious–and thus more successful–in your data analysis adventures.
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns=6,
display_max_rows=8)
Calculating weighted average results by student (and dealing with missing values)#
Suppose that the NVCU administration also wishes to see what percentage of students had a weighted average annual survey score below 60. Because they are more interested in students’ most recent survey results, they would like you to assign a weight of 0.2 to the fall results; 0.3 to the winter results; and 0.5 to the spring results. (Thus, students’ weighted survey averages will equal 0.2*F + 0.3*W + 0.5*S, with F, W, and S referring to students’ fall, winter, and spring results, respectively.)
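To make the formula concrete, here’s a quick sanity check that uses made-up scores of 80 (fall), 90 (winter), and 100 (spring); these values are purely illustrative and don’t come from the actual dataset.
# Confirming the weighting formula with hypothetical scores:
fall, winter, spring = 80, 90, 100
0.2 * fall + 0.3 * winter + 0.5 * spring
# 0.2*80 + 0.3*90 + 0.5*100 = 16 + 27 + 50 = 93, which (as we'd expect,
# given spring's larger weight) sits closer to the spring score than to
# the fall one.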
We’ll begin this analysis by loading in ‘survey_results_by_student_wide.csv’, which shows fall, winter, and spring scores side by side for each student.
df_student_results_wide = pd.read_csv(
'survey_results_by_student_wide.csv')
# Creating a copy of this DataFrame that includes only students with
# valid winter survey results:
# (We'll make use of this copy later within this notebook.)
df_valid_survey_results = (
df_student_results_wide.query("Winter.notna()").copy().drop(
'starting_year', axis=1))
df_student_results_wide.head()
starting_year | student_id | Fall | Winter | Spring | |
---|---|---|---|---|---|
0 | 2023 | 2020-1 | 88.0 | 81.0 | 86.0 |
1 | 2023 | 2020-10 | 69.0 | 63.0 | 73.0 |
2 | 2023 | 2020-100 | 68.0 | 60.0 | 88.0 |
3 | 2023 | 2020-1000 | 58.0 | 55.0 | 65.0 |
4 | 2023 | 2020-1001 | 88.0 | NaN | 100.0 |
Because not all students took the winter survey, some winter results have NaN (not a number) values. We can count the number of NaN results in each column by (1) calling the isna() function, which returns True for each NaN cell and False otherwise, then (2) chaining .sum() onto that call in order to add up those True values (and thus count the NaN entries).
df_student_results_wide.isna().sum()
starting_year 0
student_id 0
Fall 0
Winter 2458
Spring 0
dtype: int64
These missing results will make our weighted average calculations a bit more complicated. For instance, suppose we tried to create our weighted averages using the following code:
avg_score_cols_to_display = [
'student_id', 'Fall', 'Winter', 'Spring', 'weighted_avg_score']
# Displaying only these columns in the following cells will help
# condense the output
df_student_results_wide['weighted_avg_score'] = (
df_student_results_wide['Fall'] * 0.2
+ df_student_results_wide['Winter'] * 0.3
+ df_student_results_wide['Spring'] * 0.5)
df_student_results_wide[avg_score_cols_to_display].head()
student_id | Fall | Winter | Spring | weighted_avg_score | |
---|---|---|---|---|---|
0 | 2020-1 | 88.0 | 81.0 | 86.0 | 84.9 |
1 | 2020-10 | 69.0 | 63.0 | 73.0 | 69.2 |
2 | 2020-100 | 68.0 | 60.0 | 88.0 | 75.6 |
3 | 2020-1000 | 58.0 | 55.0 | 65.0 | 60.6 |
4 | 2020-1001 | 88.0 | NaN | 100.0 | NaN |
This code works fine for students with valid scores for all 3 seasons, but those with a NaN winter value will also end up with a NaN average score:
df_student_results_wide.query("Winter.isna()")[
avg_score_cols_to_display].head()
student_id | Fall | Winter | Spring | weighted_avg_score | |
---|---|---|---|---|---|
4 | 2020-1001 | 88.0 | NaN | 100.0 | NaN |
6 | 2020-1003 | 69.0 | NaN | 90.0 | NaN |
10 | 2020-1007 | 93.0 | NaN | 96.0 | NaN |
12 | 2020-1009 | 74.0 | NaN | 69.0 | NaN |
14 | 2020-1010 | 76.0 | NaN | 92.0 | NaN |
A naive approach to compensate for these NaN results would be to call fillna()
to replace all NaN values with 0, as shown in the following block of code. However, this approach will result in inaccurately low average score values for students with missing winter results. This is because our valid score weights for these students (0.2 for fall and 0.5 for spring) add up to only 0.7.
In the following output, note how the weighted averages for students with missing winter scores are lower than both their fall and spring scores, which doesn’t make sense.
df_student_results_wide['weighted_avg_score'] = (
df_student_results_wide['Fall'].fillna(0) * 0.2
+ df_student_results_wide['Winter'].fillna(0) * 0.3
+ df_student_results_wide['Spring'].fillna(0) * 0.5)
df_student_results_wide.query("Winter.isna()")[
avg_score_cols_to_display].head()
student_id | Fall | Winter | Spring | weighted_avg_score | |
---|---|---|---|---|---|
4 | 2020-1001 | 88.0 | NaN | 100.0 | 67.6 |
6 | 2020-1003 | 69.0 | NaN | 90.0 | 58.8 |
10 | 2020-1007 | 93.0 | NaN | 96.0 | 66.6 |
12 | 2020-1009 | 74.0 | NaN | 69.0 | 49.3 |
14 | 2020-1010 | 76.0 | NaN | 92.0 | 61.2 |
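To see where these values come from, consider student 2020-1001 in the output above: 0.2 * 88 + 0.3 * 0 + 0.5 * 100 = 17.6 + 0 + 50 = 67.6. Because the weights attached to this student’s valid scores add up to only 0.7 (rather than 1.0), the result lands well below both the fall score of 88 and the spring score of 100.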
Here’s a better approach that, while a bit more complex, successfully adjusts for missing values. First, we’ll create a weight column for each season that displays either our predetermined weight (if a student has a valid score for that season) or 0 (if the student does not). We’ll also create a column that adds all of these weights together.
The following code applies np.where()
to determine whether or not to assign a given student/season pair a weight of 0. If a score is missing for a given season, that season will receive a weight entry of 0; otherwise, it will be assigned the usual weight. (See https://numpy.org/doc/stable/reference/generated/numpy.where.html for more information about np.where().)
np.where() is a great option for filling in DataFrame fields when there are only two possible outcomes (e.g. a season’s survey data is either missing or present). However, if you ever need to handle three or more conditions, consider using np.select() instead. (We’ll cover this function later in this chapter.)
season_weight_dict = {'Fall':0.2,'Winter':0.3,'Spring':0.5}
# Using a for loop to create these columns makes our code a bit more
# concise.
for season in ['Fall', 'Winter', 'Spring']:
    df_student_results_wide[season+'_weight'] = np.where(
        df_student_results_wide[season].isna(), 0,
        season_weight_dict[season])
# adding axis=1 as an argument to df.sum() ensures that the calculations
# will be made row-wise rather than column-wise.
df_student_results_wide['weight_sum'] = df_student_results_wide[[
'Fall_weight', 'Winter_weight', 'Spring_weight']].sum(axis=1)
df_student_results_wide.head()
starting_year | student_id | Fall | ... | Winter_weight | Spring_weight | weight_sum | |
---|---|---|---|---|---|---|---|
0 | 2023 | 2020-1 | 88.0 | ... | 0.3 | 0.5 | 1.0 |
1 | 2023 | 2020-10 | 69.0 | ... | 0.3 | 0.5 | 1.0 |
2 | 2023 | 2020-100 | 68.0 | ... | 0.3 | 0.5 | 1.0 |
3 | 2023 | 2020-1000 | 58.0 | ... | 0.3 | 0.5 | 1.0 |
4 | 2023 | 2020-1001 | 88.0 | ... | 0.0 | 0.5 | 0.7 |
5 rows × 10 columns
We can now accurately calculate average scores for all students by (1) multiplying each score by its corresponding weight value (which will be 0 in the case of missing scores); (2) adding these products together; and then (3) dividing the sum by the weight_sum
column. If a student has missing values for a given season, his or her weight_sum
value will also be lower, thus compensating for his/her lower sum of scores.
(Note that we’ll still fill in missing scores with 0 in order to prevent final NaN outputs.)
df_student_results_wide['weighted_avg_score'] = (
df_student_results_wide['Fall'].fillna(0)
* df_student_results_wide['Fall_weight']
+ df_student_results_wide['Winter'].fillna(0)
* df_student_results_wide['Winter_weight']
+ df_student_results_wide['Spring'].fillna(0)
* df_student_results_wide['Spring_weight']) / (
df_student_results_wide['weight_sum'])
df_student_results_wide.query("Winter.isna()").head()
starting_year | student_id | Fall | ... | Winter_weight | Spring_weight | weight_sum | |
---|---|---|---|---|---|---|---|
4 | 2023 | 2020-1001 | 88.0 | ... | 0.0 | 0.5 | 0.7 |
6 | 2023 | 2020-1003 | 69.0 | ... | 0.0 | 0.5 | 0.7 |
10 | 2023 | 2020-1007 | 93.0 | ... | 0.0 | 0.5 | 0.7 |
12 | 2023 | 2020-1009 | 74.0 | ... | 0.0 | 0.5 | 0.7 |
14 | 2023 | 2020-1010 | 76.0 | ... | 0.0 | 0.5 | 0.7 |
5 rows × 10 columns
The following output confirms that students’ weighted average scores are no longer being dragged down by missing winter results:
df_student_results_wide.query("Winter.isna()")[
avg_score_cols_to_display].head()
student_id | Fall | Winter | Spring | weighted_avg_score | |
---|---|---|---|---|---|
4 | 2020-1001 | 88.0 | NaN | 100.0 | 96.571429 |
6 | 2020-1003 | 69.0 | NaN | 90.0 | 84.000000 |
10 | 2020-1007 | 93.0 | NaN | 96.0 | 95.142857 |
12 | 2020-1009 | 74.0 | NaN | 69.0 | 70.428571 |
14 | 2020-1010 | 76.0 | NaN | 92.0 | 87.428571 |
A few additional notes:
This approach would also successfully compensate for students with missing fall or spring scores. It would only fail to work for students who had no survey results at all–but such students should be excluded from these calculations to begin with.
Note that, for students with missing winter weights, this code uses a fall score weight of 0.2/0.7 (28.6%) and a spring score weight of 0.5/0.7 (71.4%). Thus, fall and spring results counted more for these students than they did for students with valid winter results.
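As a quick check of this math, we can recompute the adjusted average for student 2020-1001 (who had a fall score of 88, a missing winter score, and a spring score of 100) by hand:
# Recomputing student 2020-1001's adjusted weighted average:
(0.2 * 88 + 0.3 * 0 + 0.5 * 100) / 0.7
# This returns roughly 96.57, which matches the weighted_avg_score value
# shown for this student in the output above.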
Now that we’ve calculated weighted average results for all students, we can finally answer the administrators’ original question: what percentage of students had a weighted average survey score below 60?
df_student_results_wide['weighted_avg_below_60'] = np.where(
df_student_results_wide['weighted_avg_score'] < 60, 1, 0)
# Calling value_counts(normalize=True), then multiplying the results
# by 100, allows us to calculate the percentage of students with weighted
# averages below 60.
100*df_student_results_wide['weighted_avg_below_60'].value_counts(
normalize=True)
weighted_avg_below_60
0 81.304932
1 18.695068
Name: proportion, dtype: float64
It turns out that around 18.7% of students had a weighted average score below 60.
The danger of relying on column index positions when analyzing datasets#
There are several different ways to specify the DataFrame columns on which you would like to perform operations. So far, I have been selecting columns using their names (e.g. df_student_results_wide['Spring']
). However, it’s also possible to select them via their index positions. In this section, I’ll explain why this is often not a good idea.
df_valid_survey_results
, a copy of our student-level results table that excludes missing winter results, has four columns: ‘student_id’, ‘Fall’, ‘Winter’, and ‘Spring’. These columns’ index positions are 0, 1, 2, and 3, respectively.
df_valid_survey_results.head()
student_id | Fall | Winter | Spring | |
---|---|---|---|---|
0 | 2020-1 | 88.0 | 81.0 | 86.0 |
1 | 2020-10 | 69.0 | 63.0 | 73.0 |
2 | 2020-100 | 68.0 | 60.0 | 88.0 |
3 | 2020-1000 | 58.0 | 55.0 | 65.0 |
5 | 2020-1002 | 62.0 | 54.0 | 59.0 |
If I wanted to create a non-weighted average of students’ winter and spring scores, I could use .iloc[]
(which allows columns to be selected via these index positions) to retrieve the scores for these two seasons.
In the following code, .iloc[:,2:4]
selects all rows for the columns between index positions 2 (inclusive) and 4 (exclusive)–or, in other words, the columns with index positions 2 and 3.
df_valid_survey_results[
'Winter/Spring Avg.'] = df_valid_survey_results.iloc[
:,2:4].mean(axis=1)
df_valid_survey_results
student_id | Fall | Winter | Spring | Winter/Spring Avg. | |
---|---|---|---|---|---|
0 | 2020-1 | 88.0 | 81.0 | 86.0 | 83.5 |
1 | 2020-10 | 69.0 | 63.0 | 73.0 | 68.0 |
2 | 2020-100 | 68.0 | 60.0 | 88.0 | 74.0 |
3 | 2020-1000 | 58.0 | 55.0 | 65.0 | 60.0 |
... | ... | ... | ... | ... | ... |
16380 | 2023-996 | 62.0 | 58.0 | 69.0 | 63.5 |
16381 | 2023-997 | 47.0 | 42.0 | 42.0 | 42.0 |
16382 | 2023-998 | 77.0 | 67.0 | 74.0 | 70.5 |
16383 | 2023-999 | 64.0 | 62.0 | 65.0 | 63.5 |
13926 rows × 5 columns
This code accurately calculates the average of each student’s winter and spring survey scores. What makes it dangerous, though, is that a minor change to the table could end up filling this field with incorrect data.
For instance, suppose that a new column gets added to the left of our survey scores in the future:
df_valid_survey_results.insert(1, 'University', 'NVCU')
df_valid_survey_results.head()
student_id | University | Fall | Winter | Spring | Winter/Spring Avg. | |
---|---|---|---|---|---|---|
0 | 2020-1 | NVCU | 88.0 | 81.0 | 86.0 | 83.5 |
1 | 2020-10 | NVCU | 69.0 | 63.0 | 73.0 | 68.0 |
2 | 2020-100 | NVCU | 68.0 | 60.0 | 88.0 | 74.0 |
3 | 2020-1000 | NVCU | 58.0 | 55.0 | 65.0 | 60.0 |
5 | 2020-1002 | NVCU | 62.0 | 54.0 | 59.0 | 56.5 |
If we run the same code on this modified DataFrame, our ‘Winter/Spring Avg.’ column will now show the average of the fall and winter scores, since the new ‘University’ column has shifted which columns correspond to index positions 2 and 3.
df_valid_survey_results[
'Winter/Spring Avg.'] = df_valid_survey_results.iloc[
:,2:4].mean(axis=1)
df_valid_survey_results.head()
student_id | University | Fall | Winter | Spring | Winter/Spring Avg. | |
---|---|---|---|---|---|---|
0 | 2020-1 | NVCU | 88.0 | 81.0 | 86.0 | 84.5 |
1 | 2020-10 | NVCU | 69.0 | 63.0 | 73.0 | 66.0 |
2 | 2020-100 | NVCU | 68.0 | 60.0 | 88.0 | 64.0 |
3 | 2020-1000 | NVCU | 58.0 | 55.0 | 65.0 | 56.5 |
5 | 2020-1002 | NVCU | 62.0 | 54.0 | 59.0 | 58.0 |
Thus, a much safer approach is to explicitly name the columns that you wish to incorporate into a calculation. The following code will work fine regardless of the index positions of the ‘Winter’ and ‘Spring’ columns.
df_valid_survey_results[
'Winter/Spring Avg.'] = df_valid_survey_results[
['Winter', 'Spring']].mean(axis=1)
df_valid_survey_results.head()
student_id | University | Fall | Winter | Spring | Winter/Spring Avg. | |
---|---|---|---|---|---|---|
0 | 2020-1 | NVCU | 88.0 | 81.0 | 86.0 | 83.5 |
1 | 2020-10 | NVCU | 69.0 | 63.0 | 73.0 | 68.0 |
2 | 2020-100 | NVCU | 68.0 | 60.0 | 88.0 | 74.0 |
3 | 2020-1000 | NVCU | 58.0 | 55.0 | 65.0 | 60.0 |
5 | 2020-1002 | NVCU | 62.0 | 54.0 | 59.0 | 56.5 |
Note that the code that explicitly states the name of each column is also more intuitive, as it eliminates the need to cross-reference which column corresponds to each index position. And the more intuitive you can keep your code, particularly within complex projects, the better.
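If you ever do need to work with index positions (because a function you’re relying on requires them, for instance), a somewhat safer pattern, sketched below, is to look those positions up by column name at runtime via columns.get_loc() rather than hard-coding them. (This sketch assumes that the ‘Winter’ and ‘Spring’ columns are still present under those names.)
# Deriving index positions from column names just before using them keeps
# those positions accurate even if other columns get added or reordered:
winter_loc = df_valid_survey_results.columns.get_loc('Winter')
spring_loc = df_valid_survey_results.columns.get_loc('Spring')
df_valid_survey_results['Winter/Spring Avg.'] = (
    df_valid_survey_results.iloc[:, [winter_loc, spring_loc]].mean(axis=1))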
Why column-wise operations should be preferred over for loops#
If you’re a newcomer to Pandas, you may initially be tempted to use for loops to update values within DataFrames (especially if you’re used to writing such loops in languages like C or C++). However, I recommend that you use column-wise operations (i.e. code that applies to all rows within a column) rather than loops whenever possible, simply because the former are much, much faster than the latter.
Let’s say that we wanted to create an unweighted average of our fall, winter, and spring scores within df_valid_survey_results
. One option would be to initialize an empty ‘Average’ column, then fill it in by looping through all rows in our dataset.
Creating our ‘Average’ column:
# Removing columns that we no longer need:
df_valid_survey_results.drop(
['University', 'Winter/Spring Avg.'], axis=1, inplace=True)
df_valid_survey_results['Average'] = np.nan
df_valid_survey_results.head()
student_id | Fall | Winter | Spring | Average | |
---|---|---|---|---|---|
0 | 2020-1 | 88.0 | 81.0 | 86.0 | NaN |
1 | 2020-10 | 69.0 | 63.0 | 73.0 | NaN |
2 | 2020-100 | 68.0 | 60.0 | 88.0 | NaN |
3 | 2020-1000 | 58.0 | 55.0 | 65.0 | NaN |
5 | 2020-1002 | 62.0 | 54.0 | 59.0 | NaN |
Determining the index position of our ‘Average’ column:
(If we tried to use this column’s name within our for loop rather than its index position, the averages wouldn’t actually get added in, and we’d end up with a SettingWithCopyWarning
.)
avg_index = df_valid_survey_results.columns.get_loc('Average')
avg_index
4
Calculating this average for each row:
for i in range(len(df_valid_survey_results)):
    df_valid_survey_results.iloc[i, avg_index] = (
        # Note: the following alternative to the previous line would not work,
        # as noted in the documentation above:
        # df_valid_survey_results.iloc[i]['Average'] = ( # Incorrect!
        df_valid_survey_results.iloc[i]['Fall'] +
        df_valid_survey_results.iloc[i]['Winter'] +
        df_valid_survey_results.iloc[i]['Spring']) / 3
df_valid_survey_results.head()
student_id | Fall | Winter | Spring | Average | |
---|---|---|---|---|---|
0 | 2020-1 | 88.0 | 81.0 | 86.0 | 85.000000 |
1 | 2020-10 | 69.0 | 63.0 | 73.0 | 68.333333 |
2 | 2020-100 | 68.0 | 60.0 | 88.0 | 72.000000 |
3 | 2020-1000 | 58.0 | 55.0 | 65.0 | 59.333333 |
5 | 2020-1002 | 62.0 | 54.0 | 59.0 | 58.333333 |
I ran the previous cell 5 times on my computer and found that it took an average of 1.72 seconds to execute. There are 13,926 rows within df_valid_survey_results
, so even on a fast laptop, Python needs a noticeable amount of time to work through them one at a time.
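If you’d like to measure this on your own machine, one option (a rough sketch rather than the exact timing method I used) is to wrap the loop in time.perf_counter() calls; within Jupyter, the %%timeit cell magic is another choice.
import time

start = time.perf_counter()
for i in range(len(df_valid_survey_results)):
    df_valid_survey_results.iloc[i, avg_index] = (
        df_valid_survey_results.iloc[i]['Fall'] +
        df_valid_survey_results.iloc[i]['Winter'] +
        df_valid_survey_results.iloc[i]['Spring']) / 3
elapsed = time.perf_counter() - start
print(f"The loop-based approach took {elapsed:.2f} seconds.")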
The following cell, in contrast, uses a column-wise operation to compute this average:
df_valid_survey_results['Average'] = (
df_valid_survey_results['Fall'] +
df_valid_survey_results['Winter'] +
df_valid_survey_results['Spring']) / 3
df_valid_survey_results.head()
student_id | Fall | Winter | Spring | Average | |
---|---|---|---|---|---|
0 | 2020-1 | 88.0 | 81.0 | 86.0 | 85.000000 |
1 | 2020-10 | 69.0 | 63.0 | 73.0 | 68.333333 |
2 | 2020-100 | 68.0 | 60.0 | 88.0 | 72.000000 |
3 | 2020-1000 | 58.0 | 55.0 | 65.0 | 59.333333 |
5 | 2020-1002 | 62.0 | 54.0 | 59.0 | 58.333333 |
I ran this cell 5 times as well and found that it needed only 5.4 milliseconds (on average) to execute. In other words, it was over 300 times faster than the for loop-based approach!
The difference between 1.72 seconds and 0.006 seconds may not seem like much in nominal terms. However, when you’re dealing with a dataset that contains 13 million rows rather than 13 thousand, or a script that performs dozens of these sorts of calculations, the performance advantage of column-wise calculations will become much clearer.
There are many ways to execute even complex calculations in column-wise form; tools like np.where()
, np.select()
, Series.map()
, and Series.apply()
can help you do so.
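As a brief (and purely illustrative) example of one of these tools, the following sketch uses Series.apply() to run a function on every value within the ‘Average’ column. The ‘Average_rounded’ column that it creates isn’t used elsewhere in this chapter.
# Series.apply() calls the supplied function once per value; it's still a
# column-wise operation, and generally much faster than a row-by-row
# iloc-based loop (though not as fast as pure vectorized arithmetic):
df_valid_survey_results['Average_rounded'] = (
    df_valid_survey_results['Average'].apply(round))
df_valid_survey_results['Average_rounded'].head()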
Microdata, pre-baked data, and the ‘average of averages’ problem#
df_transactions
, a table of NVCU dining hall transactions that the following cell loads in, is an excellent candidate for creating a diverse set of pivot tables and charts because each transaction has its own row. (In other words, the table contains microdata; for more on this subject, reference https://en.wikipedia.org/wiki/Microdata_(statistics) .) Microdata-based tables allow you to easily calculate statistics for your choice of comparison variables by (1) grouping rows into unique combinations of these variables, then (2) applying one or more statistical functions (mean
, median
, etc.) to each group. (As you’ve already seen, pivot tables make this process very simple.)
df_transactions = pd.read_sql(
'select * from dining_transactions', con=e)
df_transactions['transactions'] = 1
df_transactions.head(5)
starting_year | weekday | level | amount | transactions | |
---|---|---|---|---|---|
0 | 2023 | We | Fr | 13.36 | 1 |
1 | 2023 | Tu | Se | 12.22 | 1 |
2 | 2023 | We | Se | 23.10 | 1 |
3 | 2023 | Su | Fr | 18.78 | 1 |
4 | 2023 | Tu | Fr | 11.36 | 1 |
If you wanted to calculate the average dining hall transaction amount, you could simply find the mean of this table’s ‘amount’ column:
df_transactions['amount'].mean()
np.float64(12.65563577250256)
If you instead wanted to find the average amount spent by weekday and level, you could easily do so via Pandas’ pivot_table()
function:
df_transactions.pivot_table(
index=['weekday', 'level'], values='amount',
aggfunc='mean').reset_index().sort_values(
by='amount', ascending=False).head()
weekday | level | amount | |
---|---|---|---|
14 | Su | Se | 22.426887 |
26 | We | Se | 21.550155 |
6 | Mo | Se | 21.346364 |
22 | Tu | Se | 21.268938 |
18 | Th | Se | 21.072613 |
However, you may not always have access to this type of data. Instead, you might receive an aggregated dataset in which averages for different groups are pre-baked–i.e. already calculated for you. As you’ll soon see, this makes the data much less flexible.
‘pre-baked’ is not, to my knowledge, a common statistical term, but I find that it works pretty well for describing this type of data. For instance, once you’ve baked a pie using apples, wheat, and sugar (or whatever goes into a pie–I’m not a baker!), it’s pretty hard to convert that dish into a caramelized apple. Similarly, as the following examples will show, once you’ve ‘baked’ a list of transactions into separate sets of averages by level and by weekday, it will be impossible to use that data to calculate total spending amounts by level and weekday–as we no longer know how each level value relates to each weekday value.
To illustrate the issues caused by pre-baked datasets, let’s generate two DataFrames: the first will show average transaction amounts by level, and the second will show average amounts by weekday.
df_average_spending_by_level = df_transactions.pivot_table(
index='level', values=[
'amount', 'transactions'], aggfunc={
'amount':'mean', 'transactions':'count'}).reset_index()
df_average_spending_by_level
level | amount | transactions | |
---|---|---|---|
0 | Fr | 9.911620 | 12410 |
1 | Ju | 17.133287 | 3289 |
2 | Se | 21.294252 | 2046 |
3 | So | 13.006451 | 4708 |
df_average_spending_by_weekday = df_transactions.pivot_table(
index='weekday', values=[
'amount', 'transactions'], aggfunc={
'amount':'mean', 'transactions':'count'}).reset_index()
df_average_spending_by_weekday
weekday | amount | transactions | |
---|---|---|---|
0 | Fr | 12.721265 | 2466 |
1 | Mo | 12.632190 | 3608 |
2 | Sa | 12.253912 | 708 |
3 | Su | 12.775699 | 1123 |
4 | Th | 12.592931 | 3964 |
5 | Tu | 12.747500 | 5267 |
6 | We | 12.624990 | 5317 |
We can see that lower levels (e.g. freshmen and sophomores) tend to spend less per dining hall visit than upper levels; in addition, we can see that spending doesn’t vary too much by day of the week. However, because these tables are pre-baked, we don’t have any way of calculating average spending by level and weekday like we could with df_transactions.
In addition, let’s say that we wanted to use df_average_spending_by_level to calculate the average dining hall transaction amount across all students. A naive approach would be to simply take the mean of the table’s ‘amount’ column:
# The following approach is incorrect:
df_average_spending_by_level['amount'].mean()
np.float64(15.336402324109617)
This average is way off the actual average that we calculated earlier in this notebook using df_transactions. Why is this the case? As df_average_spending_by_level shows, seniors and juniors spend much more per transaction than freshmen and sophomores, yet they also use the dining hall less often (as shown by their smaller transaction counts). As a result, if we simply average the mean transaction amounts for each level, we’ll overrepresent upperclassmen and thus skew our average transaction amount upward.
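To put some numbers on this: the unweighted mean gives each of the four levels an equal 25% weight, even though freshmen account for 12,410 of the table’s 22,453 transactions (roughly 55%) and seniors for only 2,046 (roughly 9%).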
In order to avoid this ‘average of averages’ issue, which arises whenever differences in group sizes skew an average that’s in turn based on averages for those groups, we’ll need to create a weighted average. We can do so by multiplying each row’s ‘amount’ value by its ‘transactions’ value; adding these products together; and then dividing that sum by the sum of the ‘transactions’ column.
df_average_spending_by_level['amount_x_transactions'] = (
df_average_spending_by_level['amount']
* df_average_spending_by_level['transactions'])
df_average_spending_by_level
level | amount | transactions | amount_x_transactions | |
---|---|---|---|---|
0 | Fr | 9.911620 | 12410 | 123003.20 |
1 | Ju | 17.133287 | 3289 | 56351.38 |
2 | Se | 21.294252 | 2046 | 43568.04 |
3 | So | 13.006451 | 4708 | 61234.37 |
Here’s our weighted average, which matches the average calculated earlier:
(df_average_spending_by_level['amount_x_transactions'].sum()
/ df_average_spending_by_level['transactions'].sum())
np.float64(12.65563577250256)
The following function can be used to calculate weighted means for other datasets:
def weighted_mean(original_df, metric_col, weight_col):
    '''This function calculates, then returns, a weighted mean for
    the DataFrame passed to original_df.
    metric_col: the column storing the variable for which to calculate
    a mean.
    weight_col: the column storing weight values that will be incorporated
    into this weighted mean.
    '''
    df = original_df.copy()  # Prevents the function from modifying
    # the original DataFrame
    df['metric_x_weight'] = df[metric_col] * df[weight_col]
    weighted_mean = (df['metric_x_weight'].sum() /
                     df[weight_col].sum())
    return weighted_mean
Let’s try putting this function into action by calculating the average transaction amount using df_average_spending_by_weekday:
weighted_mean(df_average_spending_by_weekday,
metric_col='amount',
weight_col='transactions')
np.float64(12.65563577250256)
This average matches the weighted average that we calculated within df_average_spending_by_level
.
Incidentally, because average transaction amounts are relatively constant across weekdays, simply averaging all ‘amount’ values within df_average_spending_by_weekday
will get us very close to the actual average (despite the considerable variation in transaction counts by weekday). These kinds of ‘believable’, yet incorrect values are particularly insidious: they may go unnoticed for quite a while, whereas an obviously incorrect value (e.g. an average transaction amount of -5 or 83,000) would get caught right away.
df_average_spending_by_weekday['amount'].mean()
np.float64(12.621212399856491)
We were able to calculate correct averages within these pre-baked tables because we also knew the number of transactions within each row. In the real world, though, such sample size information may not be available.
From a data analysis perspective, it would be ideal to have all of your source data in microdata (rather than pre-baked) form. However, the microdata approach has its own drawbacks.
First, data privacy needs may preclude the issuance of microdata. Imagine, for instance, that NVCU released a dataset that contained individual course grades for each student along with those students’ majors. If you happened to be the only music major who took a C++ course, anyone with that knowledge could find out how you performed in the class.
A more private approach, in this case, would be to release separate ‘pre-baked’ tables that showed averages for courses and by major (but not averages for each course/major pair). (Even with this strategy, if a given pre-baked average was based on only a few students, it might be best to remove that row’s data so as to protect those students’ privacy.)
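As a rough sketch of what such a safeguard might look like with our dining data, the following code keeps only the pre-baked rows whose averages are based on at least 10 transactions. (That threshold is purely illustrative; an appropriate minimum group size would depend on your organization’s privacy requirements.)
min_group_size = 10  # A hypothetical privacy threshold
# Keeping only the rows whose averages are based on at least
# min_group_size transactions:
df_shareable = df_average_spending_by_level.query(
    "transactions >= @min_group_size").copy()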
Second, microdata can take up much more storage space than pre-baked data. To illustrate this, let’s compare the amount of memory, in kilobytes, used by df_transactions
(a microdata-based table) with that used by our two pre-baked tables:
microdata_kb = df_transactions.memory_usage(
index = True, deep = True).sum() / 1000
# For more on df.memory_usage(), see
# https://pandas.pydata.org/pandas-docs/stable/reference/api/
# pandas.DataFrame.memory_usage.html
microdata_kb
np.float64(2829.21)
pre_baked_kb = (df_average_spending_by_level.memory_usage(
index = True, deep = True).sum() +
df_average_spending_by_weekday.memory_usage(
index = True, deep = True).sum()) / 1000
pre_baked_kb
np.float64(1.033)
microdata_kb / pre_baked_kb
np.float64(2738.828654404647)
The microdata table takes up over 2,700 times more memory than our two pre-baked tables combined! Therefore, it’s understandable that data providers may prefer to share pre-calculated averages rather than original datasets.
Handling missing data when creating pivot tables#
If a field passed to the index
or columns
argument of a pivot_table()
call has a missing value, that missing value won’t get incorporated into the final table. This can cause calculation errors if you end up using the pivot table for further analyses. However, mitigating this issue isn’t too difficult.
To demonstrate this issue, I’ll create a ‘faulty’ version of df_transactions
in the following cell that has np.nan
(i.e. missing) entries for all ‘Wednesday’ weekday values and all ‘So’ level values. As you’ll see, these missing values will cause issues when we (1) create a pivot table of this data, then (2) attempt to use that pivot table to determine the sum of all transactions in our dataset.
# Passing the 'level' and 'weekday' fields as the final arguments
# within these np.where() calls allows these fields' original
# values to get retained when the conditions in the first
# arguments aren't met. (For instance, in the first np.where()
# call, Wednesday values will get replaced with missing entries,
# but non-Wednesday values will get replaced with themselves,
# thus leaving them unchanged.)
df_transactions_faulty = df_transactions.copy()
df_transactions_faulty['weekday'] = np.where(
df_transactions_faulty['weekday'] == 'We',
np.nan, df_transactions_faulty['weekday'])
df_transactions_faulty['level'] = np.where(
df_transactions_faulty['level'] == 'So',
np.nan, df_transactions_faulty['level'])
df_transactions_faulty.head()
starting_year | weekday | level | amount | transactions | |
---|---|---|---|---|---|
0 | 2023 | NaN | Fr | 13.36 | 1 |
1 | 2023 | Tu | Se | 12.22 | 1 |
2 | 2023 | NaN | Se | 23.10 | 1 |
3 | 2023 | Su | Fr | 18.78 | 1 |
4 | 2023 | Tu | Fr | 11.36 | 1 |
First, let’s try converting this dataset into a pivot table that shows total transaction amounts by weekday.
df_transactions_faulty_pivot = df_transactions_faulty.pivot_table(
index='weekday', values='amount',
aggfunc='sum').reset_index()
df_transactions_faulty_pivot
weekday | amount | |
---|---|---|
0 | Fr | 31370.64 |
1 | Mo | 45576.94 |
2 | Sa | 8675.77 |
3 | Su | 14347.11 |
4 | Th | 49918.38 |
5 | Tu | 67141.08 |
There’s no sign whatsoever of the ‘Wednesday’ rows. This shouldn’t be too surprising (since we removed those values), but it’s worth highlighting that no ‘Missing’ or ‘N/A’ row gets returned by the pivot table function.
The complete absence of the ‘Wednesday’ row makes this missing data issue easier to identify. However, many cases of missing data are far more insidious. For instance, suppose only a handful of ‘Wednesday’ entries were missing. We’d still see a ‘Wednesday’ row within the pivot table, but we might not realize that its ‘amount’ sum was incorrect. (For this reason, it’s often a good idea to check for missing entries within any fields that you pass to the index or columns arguments of a pivot_table() call.)
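One simple way to perform that check, sketched below, is to reuse the isna().sum() pattern from earlier in this chapter on the fields that will get passed to pivot_table():
# Counting missing entries within the fields that will serve as the
# pivot table's index:
df_transactions_faulty[['weekday', 'level']].isna().sum()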
If we try to use this pivot table to calculate the sum of all our transactions, we’ll end up with an incorrect result, since rows with missing ‘weekday’ values won’t factor into the calculation.
df_transactions_faulty_pivot['amount'].sum()
np.float64(217029.91999999998)
For reference, here’s the correct sum (that also includes Wednesday transactions):
df_transactions['amount'].sum()
np.float64(284156.99)
Thankfully, it’s not too hard to prevent this issue. We simply need to fill in missing values within any fields passed to the index or columns arguments of the pivot_table() call.
I’ll demonstrate two ways to replace these values with a ‘Missing’ string. First, if you don’t wish to permanently modify your original DataFrame, you can call .fillna()
on the DataFrame being passed to the pivot_table()
function. This ensures that rows with missing data get incorporated into the pivot table but leaves the original DataFrame untouched.
The following code implements this approach via a dict comprehension. (For more on this valuable tool, see https://docs.python.org/3/tutorial/datastructures.html#dictionaries .) The dict comprehension allows me to use the same index list that I’ll pass to pivot_table() to specify which columns’ missing entries should get filled in. This results in cleaner and less error-prone code: if I instead maintained one list of fields for pivot_table() and a separate one for fillna(), the two lists could easily get out of sync.
The pivot table we’ll create in the following cell will show total transactions by both level and weekday, but a similar method would work for tables that show sums for only one of these two groups.
(Another option, by the way, would be to fill in missing entries within all columns with a ‘Missing’ string; this could be accomplished via df_transactions_faulty.fillna('Missing'). However, if some columns with missing data use float or integer types, this approach could raise either a warning or an error, as you’d be attempting to place a string within a numeric column. Thus, I recommend using the more precise solution shown below.)
index = ['level', 'weekday'] # These values will get passed to the
# index argument of our pivot_table() call.
# The following dict comprehension will produce the dictionary
# {'level': 'Missing', 'weekday': 'Missing'}--which, when passed to
# fillna(), will instruct Pandas to fill in missing data with 'Missing'
# only within those two fields.
# (For more on this argument, see
# https://pandas.pydata.org/pandas-docs/stable/reference/
# api/pandas.DataFrame.fillna.html )
df_transactions_corrected_pivot = df_transactions_faulty.fillna(
{field:'Missing' for field in index}).pivot_table(
index=index, values='amount',
aggfunc='sum').reset_index()
df_transactions_corrected_pivot.head()
level | weekday | amount | |
---|---|---|---|
0 | Fr | Fr | 13183.14 |
1 | Fr | Missing | 28856.95 |
2 | Fr | Mo | 20214.29 |
3 | Fr | Sa | 3739.85 |
4 | Fr | Su | 6230.97 |
Filling in missing ‘level’ and ‘weekday’ entries with ‘Missing’ allowed those rows to appear within our updated pivot table. As a result, we can now produce an accurate sum of all transactions by adding up the ‘amount’ values within this table:
df_transactions_corrected_pivot['amount'].sum()
np.float64(284156.99)
The following approach, unlike the one shown above, permanently replaces missing weekday and level values with ‘Missing’, thus saving you the trouble of repeating this step later in your code. It also iterates through each field in the index
list via a for loop in order to inform you which columns, in particular, have missing data.
In some cases, it may be preferable for your code to halt upon finding missing data; this halt serves as an alert that missing data exists, thus prompting you to fill in those missing entries with their actual values. The commented-out code above the first print statement in the following cell would stop this script’s operation (by raising a ValueError
) if it came across any missing data.
for val in index:
    if df_transactions_faulty[val].isna().sum() > 0:
        # raise ValueError(f"{val} has NaN values; address these before \
        # running the following pivot table operation.")
        print(f"{val} has NaN values; these will be filled with 'Missing' \
so that their rows' data can still get incorporated into the following \
pivot table.\n")
        df_transactions_faulty[val] = df_transactions_faulty[val].fillna(
            'Missing').copy()
        # If a column with missing data was float-based rather than
        # string-based, it would be best to enter a float-based missing
        # data indicator rather than this string.
df_transactions_corrected_pivot = df_transactions_faulty.pivot_table(
index=index, values='amount',
aggfunc='sum').reset_index()
df_transactions_corrected_pivot
level has NaN values; these will be filled with 'Missing' so that their rows' data can still get incorporated into the following pivot table.
weekday has NaN values; these will be filled with 'Missing' so that their rows' data can still get incorporated into the following pivot table.
level | weekday | amount | |
---|---|---|---|
0 | Fr | Fr | 13183.14 |
1 | Fr | Missing | 28856.95 |
2 | Fr | Mo | 20214.29 |
3 | Fr | Sa | 3739.85 |
... | ... | ... | ... |
24 | Se | Sa | 1260.53 |
25 | Se | Su | 2377.25 |
26 | Se | Th | 7017.18 |
27 | Se | Tu | 10613.20 |
28 rows × 3 columns
This new version of df_transactions_corrected_pivot
also lets us calculate an accurate sum of all transactions within the original dataset:
df_transactions_corrected_pivot['amount'].sum()
np.float64(284156.99)
I’ll now undo my corrections to df_transactions_faulty in order to prepare the dataset for the following section.
df_transactions_faulty.replace(
'Missing', np.nan, inplace=True)
df_transactions_faulty
starting_year | weekday | level | amount | transactions | |
---|---|---|---|---|---|
0 | 2023 | NaN | Fr | 13.36 | 1 |
1 | 2023 | Tu | Se | 12.22 | 1 |
2 | 2023 | NaN | Se | 23.10 | 1 |
3 | 2023 | Su | Fr | 18.78 | 1 |
... | ... | ... | ... | ... | ... |
22449 | 2023 | Tu | Fr | 5.98 | 1 |
22450 | 2023 | Fr | NaN | 7.39 | 1 |
22451 | 2023 | Sa | Fr | 13.25 | 1 |
22452 | 2023 | Mo | Fr | 1.95 | 1 |
22453 rows × 5 columns
The advantages of map() and np.select() over np.where() when missing data are present#
Suppose we’d like to find the total number of dining hall transactions that took place on the weekend versus those that didn’t. To make this calculation easier, we can add a ‘weekend’ column that will store a value of 1 if a transaction fell on a Saturday or Sunday and 0 otherwise.
We could try initializing this column via np.where()
. The following cell uses this function to assign a ‘weekend’ value of 0 to all transactions whose ‘weekday’ value corresponds to Monday, Tuesday, Wednesday, Thursday, or Friday. Transactions that don’t have one of these weekday values are classified as weekend sales.
df_transactions_faulty['weekend'] = np.where(
df_transactions_faulty['weekday'].isin(
['Mo', 'Tu', 'We', 'Th', 'Fr']), 0, 1)
df_transactions_faulty
starting_year | weekday | level | amount | transactions | weekend | |
---|---|---|---|---|---|---|
0 | 2023 | NaN | Fr | 13.36 | 1 | 1 |
1 | 2023 | Tu | Se | 12.22 | 1 | 0 |
2 | 2023 | NaN | Se | 23.10 | 1 | 1 |
3 | 2023 | Su | Fr | 18.78 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... |
22449 | 2023 | Tu | Fr | 5.98 | 1 | 0 |
22450 | 2023 | Fr | NaN | 7.39 | 1 | 0 |
22451 | 2023 | Sa | Fr | 13.25 | 1 | 1 |
22452 | 2023 | Mo | Fr | 1.95 | 1 | 0 |
22453 rows × 6 columns
The issue with this approach, as you might have recognized already, is that it classifies all transactions with missing ‘weekday’ values as taking place during the weekend. We know that this is inaccurate: the transactions with missing weekday entries actually took place on Wednesday. (In real life, of course, we wouldn’t know this to be the case–but we still wouldn’t want to assume that they fell on the weekend.)
Here is the total number of Saturday and Sunday dining hall transactions according to our ‘weekend’ column:
len(df_transactions_faulty.query("weekend == 1"))
7148
As it turns out, this count is much higher than the correct value (which we can calculate using our original transactions table):
len(df_transactions.query("weekday in ['Sa', 'Su']"))
1831
Let’s now overwrite this faulty ‘weekend’ column by using map() instead. This function will allow us to map each individual weekday value to either 1 (for weekend days) or 0 (for non-weekend days). Importantly, it also lets us map NaN entries to a third number, -1, which will represent missing values.
df_transactions_faulty['weekend'] = (
    df_transactions_faulty['weekday'].map(
        {'Mo':0, 'Tu':0, 'We':0, 'Th':0, 'Fr':0, 'Sa':1, 'Su':1,
         np.nan:-1}))
df_transactions_faulty
starting_year | weekday | level | amount | transactions | weekend | |
---|---|---|---|---|---|---|
0 | 2023 | NaN | Fr | 13.36 | 1 | -1 |
1 | 2023 | Tu | Se | 12.22 | 1 | 0 |
2 | 2023 | NaN | Se | 23.10 | 1 | -1 |
3 | 2023 | Su | Fr | 18.78 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... |
22449 | 2023 | Tu | Fr | 5.98 | 1 | 0 |
22450 | 2023 | Fr | NaN | 7.39 | 1 | 0 |
22451 | 2023 | Sa | Fr | 13.25 | 1 | 1 |
22452 | 2023 | Mo | Fr | 1.95 | 1 | 0 |
22453 rows × 6 columns
If we apply value_counts() to this column, we’ll find that our weekend transaction count now matches the correct figure calculated above. The count for the -1 entry shows the number of transactions with missing weekday values.
df_transactions_faulty['weekend'].value_counts()
weekend
0 15305
-1 5317
1 1831
Name: count, dtype: int64
I could have also chosen to use ‘Missing’ as my marker for invalid data; however, since the other values output by map()
were integers, I wanted to make the missing data code an integer also.
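If you’d rather keep missing entries as true missing values instead of a numerical code, one alternative (not used in this chapter) is Pandas’ nullable Int64 dtype, which allows integer columns to contain missing entries. Here’s a minimal sketch of that approach:
# Leaving NaN weekday values unmapped, then converting the result to the
# nullable Int64 dtype so that those entries stay missing (as <NA>)
# rather than getting coded as -1:
weekend_nullable = df_transactions_faulty['weekday'].map(
    {'Mo':0, 'Tu':0, 'We':0, 'Th':0, 'Fr':0, 'Sa':1, 'Su':1}).astype('Int64')
weekend_nullable.value_counts(dropna=False)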
As an aside, the use of 1 for weekend transactions and 0 for work-week ones makes it easy to determine the percentage of transactions that took place over the weekend:
print(f"{
round(100*df_transactions_faulty.query(
"weekend != -1")['weekend'].mean(), 2)}% of transactions with valid \
weekday entries took place on a Saturday or Sunday.")
10.69% of transactions with valid weekday entries took place on a Saturday or Sunday.
Note the ‘with valid weekday entries’ caveat in the above print statement. Since we wouldn’t know for sure on which weekdays the missing transactions took place, we wouldn’t want our statement to make any assertions about them.
It was also crucial to exclude rows with weekend values of -1 from our above calculation. If we had kept them in, we would have ended up with a nonsensical percentage of weekend transactions:
print(f"{round(100 * df_transactions_faulty['weekend'].mean(), 2)}% \
of transactions took place on a Saturday or Sunday . . . wait, really? \
I think I need to double-check these numbers . . . ")
-15.53% of transactions took place on a Saturday or Sunday . . . wait, really? I think I need to double-check these numbers . . .
If you’re categorizing continuous rather than categorical data, the map()
approach above will be hard to implement–as you might end up having to code a ton of values. Therefore, you should consider using np.select()
instead. This function generally involves a bit more code than map()
for categorical data, but it easily accommodates continuous values as well.
Let’s say that we want to add a ‘large_purchase’ flag to each row that will equal 1 if the amount was at or above the 90th percentile and 0 otherwise. We could then use this flag to determine each level’s likelihood of making such a large purchase.
The 90th percentile can be calculated as follows:
large_purchase_threshold = df_transactions['amount'].quantile(0.9)
# See https://pandas.pydata.org/docs/reference/api/
# pandas.DataFrame.quantile.html
large_purchase_threshold
np.float64(24.067999999999994)
The following code will simplify df_transactions_faulty by removing some columns that won’t be needed for the rest of this section. It will also introduce missing numerical data by replacing every transaction amount whose final digit is 6 with NaN.
df_transactions_faulty = df_transactions.copy().drop(
['weekday', 'transactions'],
axis=1)
df_transactions_faulty['amount'] = np.where(
df_transactions_faulty['amount'].astype('str').str[-1] == '6',
np.nan, df_transactions_faulty['amount'])
df_transactions_faulty.head()
starting_year | level | amount | |
---|---|---|---|
0 | 2023 | Fr | NaN |
1 | 2023 | Se | 12.22 |
2 | 2023 | Se | 23.10 |
3 | 2023 | Fr | 18.78 |
4 | 2023 | Fr | NaN |
We wouldn’t want to use np.where() to initialize our ‘large_purchase’ column, as it would force us to classify transactions with missing amounts as either large or non-large. In addition, map() would be a poor fit, as we now have thousands of unique values to code rather than just a few:
len(df_transactions_faulty['amount'].unique())
3131
Therefore, we’ll instead use np.select()
. This function applies a condlist
(a list of possible conditions) and a choicelist
(a list of values to apply for each of those conditions) in order to initialize or update a given column. (You don’t have to name these items condlist
and choicelist
, but those are their corresponding parameter names; see https://numpy.org/doc/stable/reference/generated/numpy.select.html ). Importantly, this function also has a default
argument that lets you determine what value to enter if none of the conditions were met.
The following code applies np.select() by defining two possible conditions (a transaction amount is either at or above the large-purchase threshold, or it isn’t) and two corresponding values to enter within our ‘large_purchase’ column (1 or 0). It also supplies a default value of -1; this value will get entered into the ‘large_purchase’ column whenever a transaction amount is missing, since a NaN amount satisfies neither condition. (Note that, if no default value is specified, NumPy will enter values of 0–which would easily get mistaken here for the ‘not a large purchase’ outcome.)
condlist = [
df_transactions_faulty['amount'] >= large_purchase_threshold,
df_transactions_faulty['amount'] <= large_purchase_threshold
]
choicelist = [1,
0] # Make sure that the order of these entries matches
# that of your condlist entries!
df_transactions_faulty['large_purchase'] = np.select(
condlist, choicelist, default=-1)
df_transactions_faulty
starting_year | level | amount | large_purchase | |
---|---|---|---|---|
0 | 2023 | Fr | NaN | -1 |
1 | 2023 | Se | 12.22 | 0 |
2 | 2023 | Se | 23.10 | 0 |
3 | 2023 | Fr | 18.78 | 0 |
... | ... | ... | ... | ... |
22449 | 2023 | Fr | 5.98 | 0 |
22450 | 2023 | So | 7.39 | 0 |
22451 | 2023 | Fr | 13.25 | 0 |
22452 | 2023 | Fr | 1.95 | 0 |
22453 rows × 4 columns
Now that we’ve added in this column, we can create a pivot table that shows the likelihood, for each level, that a given transaction was at or above the 90th percentile:
df_transactions_faulty.query("large_purchase != -1").pivot_table(
index = 'level',
values = 'large_purchase', aggfunc = 'mean').reset_index()
level | large_purchase | |
---|---|---|
0 | Fr | 0.000000 |
1 | Ju | 0.297705 |
2 | Se | 0.434444 |
3 | So | 0.081307 |
This likelihood is evidently much higher for upperclassmen (juniors and seniors) than for underclassmen (freshmen and sophomores). Of course, this analysis isn’t perfect, as it’s limited to the rows for which we have valid transaction amounts.
Conclusion#
These two chapters have offered a brief introduction–and certainly not a comprehensive reference–to the use of Pandas to calculate descriptive statistics within Python. My hope is that, now that you’ve seen how easily Pandas can perform various statistical calculations, you’ll be inspired to test this library out for other tasks as well.
I have found in my own work that, while descriptive stats are less ‘trendy’ than more recent developments in the field of statistics (e.g. machine learning, neural networks, etc.), they’re still incredibly useful as a way to share, both internally and externally, how your nonprofit is performing and how those whom it serves are doing. These kinds of descriptive analyses will also serve as the foundation for many different visualizations, as we’ll see later in Python for Nonprofits.
Now that we’ve learned how to retrieve, reformat, and analyze data, we can move on to a task (importing Census data) that applies each of these skills.