Pivot and Graph Functions#

This chapter will demonstrate how to use functions to simplify the Plotly graphing process. These functions will also prove useful within the online visualizations section of Python for Nonprofits.

The motivation behind this chapter#

As shown in the Graphing chapter, pivot tables play a crucial role in generating Plotly charts. They allow you to turn a raw dataset into a table of aggregate values that can easily be rendered as a bar graph, line graph, or some other chart type.

Thanks to the powerful Pandas library, these pivot tables are easy to produce within Python. However, creating separate pivot tables for each graph still requires a decent amount of work. In addition, there are times when coding these pivot tables would be impractical.

Suppose, for instance, that you’re putting together an online dashboard that visualizes mean NVCU survey result scores in bar chart form. You would like to allow users to compare these scores by up to 6 potential variables: starting year, season, gender, graduation Year, college, and level. (In other words, they should be able to view these results by season and college; college, level, and season; all six variables together; and even no variables at all–in which case the mean survey score will be displayed within a single bar.)

If you wanted to create a pivot table for each of these comparison options, you’d end up having to construct over 60 tables. (There are 6 comparison options, each of which can either be present or absent–resulting in a total of 2^6 = 64 tables.) In order to avoid the time-consuming task of pre-rendering all of these tables, we’ll need to utilize an ‘autopivot’ function that, given a list of comparison variables, a y variable, and a color variable, returns a pivot table, x and y variables, and a color variable. We will demonstrate such a function within this notebook.

Although this pivot table function will play a pivotal (pun intended) role in our final graph, we’ll also need to create a corresponding ‘autobar’ function that converts the output of the autopivot function into a graph. (We could add this code into our main pivot table function, but separating the two allows other graph types–such as line charts–to be created rather than bar charts, thus making the code more versatile.)

import pandas as pd
import plotly.express as px
from sqlalchemy import create_engine
e = create_engine('sqlite:///../Appendix/nvcu_db.db')

import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook, wadi
display_type = config_notebook(display_max_columns = 5)
        

Creating a detailed survey results dataset#

This dataset will include data from both our survey results and enrollment tables, thus allowing for additional comparison options.

Importing each dataset:#

df_enrollment = pd.read_sql(
    'select * from curr_enrollment', con=e)
df_enrollment.head()
first_name last_name ... level level_for_sorting
0 Ashley Jordan ... Se 3
1 Lisa Dunn ... Se 3
2 Alexandra Woods ... Se 3
3 Kathleen Carter ... Se 3
4 Donna Brewer ... Se 3

5 rows × 11 columns

df_survey_results = pd.read_sql(
    'select * from survey_results', con=e)
df_survey_results.head()
student_id starting_year season score
0 2020-1 2023 Fall 88
1 2020-2 2023 Fall 37
2 2020-3 2023 Fall 54
3 2020-4 2023 Fall 56
4 2020-5 2023 Fall 77

Merging these datasets together:#

df_se = df_survey_results.merge(
    df_enrollment, on='student_id', how='left')
# The 'se' in df_se stands for 'survey and enrollment.'

# Removing columns that we're not interested in using for our 
# comparisons:
df_se.drop(
    ['student_id', 'first_name', 'last_name',
     'date_of_birth', 'matriculation_number'], 
    axis=1, inplace=True)

# Reformatting our column names so that they'll show up better
# within graphs):

df_se.columns = [
    column.replace('_', ' ').title() 
    for column in df_se.columns]

df_se['Season For Sorting'] = (
    df_se[
'Season'].map({'Fall':0,'Winter':1,'Spring':2}))
df_se.sort_values(
    ['Season For Sorting', 'Level For Sorting'],
    inplace=True)

df_se.head()
Starting Year Season ... Level For Sorting Season For Sorting
10941 2023 Fall ... 0 0
10942 2023 Fall ... 0 0
10943 2023 Fall ... 0 0
10944 2023 Fall ... 0 0
10945 2023 Fall ... 0 0

5 rows × 10 columns

df_se['Responses'] = 1
df_se_pivot = df_se.pivot_table(
    index=['Starting Year', 'Season', 'Gender', 
             'Matriculation Year', 'College', 'Class Of',
             'Level', 'Level For Sorting', 'Season For Sorting'],
    values=['Score', 'Responses'],
    aggfunc={'Score':'mean', 'Responses':'sum'}).reset_index()
df_se_pivot
Starting Year Season ... Responses Score
0 2023 Fall ... 408 69.583333
1 2023 Fall ... 519 69.697495
... ... ... ... ... ...
62 2023 Spring ... 756 79.120370
63 2023 Spring ... 858 78.083916

64 rows × 11 columns

A basic autopivot function#

The following code shows a simple version of an ‘autopivot’ function that can automatically create both pivot tables to be graphed and corresponding graphing variables. (This table and corresponding variables can then be sent to a graphing function like px.bar().) One key variable generated by this function is a field that groups together all of the individual comparison values passed to x_vars[] and can thus serve as the ‘x’ argument within a Plotly graph call.

This function has a number of limitations: for instance, it doesn’t take color options into account, and it also can’t accommodate variables that should be included in the pivot table function (e.g. for sorting purposes) but excluded from the x variable column. However, I chose to include it here in order to highlight the core concepts of this ‘autopivot’ function. (The final autopivot function will be somewhat more complicated.)

def autopivot_simple(df, y, aggfunc, x_vars=[],
                    overall_data_name='All Data'):
    '''This is a simple version of an 'autopivot' chart. Given a 
    DataFrame, a y variable, an aggregate function, and an arbitrary list 
    of variables, it will return a pivot table, an x variable, 
    and a y variable that can then get fed into a graphing function.

    For more documentation on these steps, 
    see the full autopivot function within auto_pivot_and_graph.py,
    which is located within the PFN_Dash_App_Demo section of PFN. 
    (I excluded that documentation from this code in order to 
    avoid redundancy.)
    '''

    df_for_pivot = df.copy()
    df_for_pivot

    # Handling a situation in which no comparion variables were provided:
    if len(x_vars) == 0: 
        df_for_pivot[overall_data_name] = overall_data_name
        df_pivot = df_for_pivot.pivot_table(
            index=overall_data_name, values=y, 
            aggfunc=aggfunc).reset_index()
        x_val_name = overall_data_name
        color = overall_data_name
        barmode = 'relative'
        index = [overall_data_name]         
       
    else:
        print("x_vars:",x_vars)    

        # Creating a pivot table using the arguments provided:
        df_pivot = df_for_pivot.pivot_table(
            index=x_vars, values=y, aggfunc=aggfunc).reset_index()
        
        # Creating the x variable's column name:
        # (This name can then get fed into a graphing function.)
        x_val_name = ('/').join(x_vars)

        # Creating the values for the x_val_name column:
        # (The function does so by combining each row's values for 
        # the fields represented in x_vars into a single string.)
        # It starts with the first x variable provided, then loops
        # through the rest of x_vars to add in the others.
        df_pivot[x_val_name] = df_pivot[x_vars[0]].astype('str')
        for i in range(1, len(x_vars)):
            df_pivot[x_val_name] = (
                df_pivot[x_val_name] 
                + '/' + df_pivot[x_vars[i]].astype('str'))
            
    return df_pivot, x_val_name, y, aggfunc

A basic autobar function#

The following function uses the output of autopivot_simple() as inputs for a bar chart. It’s quite limited relative to the final autobar function that will be applied later in this code, but it still demonstrates the utility of the autopivot_simple() function.

def autobar_simple(df_pivot, x_val_name, y, aggfunc):
    '''This function creates a bar graph of a pivot table (such as one
    created by autopivot_simple.
    For more documentation on this function, view the full autobar()
    function within within auto_pivot_and_graph.py.
    '''
    
    fig = px.bar(df_pivot, x=x_val_name, y=y, text_auto='.0f',
                title=f"{aggfunc.title()} {y} by {x_val_name}") 
    return fig

Testing out autopivot_simple() and autobar_simple()#

Let’s try using these functions to graph average survey results by level and season. First, we’ll call autopivot_simple() to generate inputs for autobar_simple():

df_pivot, x_val_name, y, aggfunc = autopivot_simple(
    df_se, y='Score', aggfunc='mean', x_vars=['Level', 'Season'])

print(x_val_name, y, df_pivot.head())
x_vars: ['Level', 'Season']
Level/Season Score   Level  Season      Score Level/Season
0    Fr    Fall  69.609774      Fr/Fall
1    Fr  Spring  74.833180    Fr/Spring
2    Ju    Fall  69.768957      Ju/Fall
3    Ju  Spring  69.350671    Ju/Spring
4    Se    Fall  69.698085      Se/Fall

Next, we’ll call autobar_simple() to convert these values into a bar chart:

fig = autobar_simple(
    df_pivot=df_pivot, x_val_name=x_val_name, y=y,
aggfunc=aggfunc)                      

wadi(fig=fig, file_path='output/mean_score_by_level_and_season',
    display_type=display_type)
../_images/4ca3d0d2dc06a31e177555f8f5375434c2b142134b2aaa2d612771990545bd0c.png

Note that we only needed to specify our chart parameters within autopivot_simple(); its output could then get passed to autobar_simple() without any further tweaking.

Let’s try simplifying the process of creating these charts even more by defining a function called autopivot_plus_bar_simple(). This function, given the same list of inputs that we passed to autopivot_simple(), will call autopivot_simple() to create a pivot table, then immediately call autobar_simple() to generate a corresponding bar chart. It will then return both outputs for further use.

def autopivot_plus_bar_simple(df, y, aggfunc, x_vars=[],
                    overall_data_name='All Data'):
    '''This function calls both autopivot_simple() and autobar_simple()
    in order to create both a pivot table and a bar.'''
    df_pivot, x_val_name, y, aggfunc = autopivot_simple(
        df=df, y=y, aggfunc=aggfunc, x_vars=x_vars,
                    overall_data_name=overall_data_name)
    fig = autobar_simple(df_pivot=df_pivot, x_val_name=x_val_name, y=y,
                         aggfunc=aggfunc)       
    return df_pivot, fig

We can now create the same figure shown above, along with its corresponding pivot table, using a single function call. The following output shows what this pivot table looks like:

df_pivot, fig = autopivot_plus_bar_simple(
    df_se, y='Score', aggfunc='mean', x_vars=['Level', 'Season'])
df_pivot.head()
x_vars: ['Level', 'Season']
Level Season Score Level/Season
0 Fr Fall 69.609774 Fr/Fall
1 Fr Spring 74.833180 Fr/Spring
2 Ju Fall 69.768957 Ju/Fall
3 Ju Spring 69.350671 Ju/Spring
4 Se Fall 69.698085 Se/Fall

And here’s the image–which is identical to the one we created earlier.

wadi(fig=fig, file_path='output/mean_score_by_level_and_season',
    display_type=display_type)
../_images/4ca3d0d2dc06a31e177555f8f5375434c2b142134b2aaa2d612771990545bd0c.png

Next, we’ll call autopivot_plus_bar_simple() to display average scores by season, college, and level:

df_pivot, fig = autopivot_plus_bar_simple(
    df_se, y='Score', aggfunc='mean', x_vars=[
        'Season', 'College', 'Level'])

wadi(fig=fig, file_path='output/mean_score_by_season_college_and_level',
    display_type=display_type)
x_vars: ['Season', 'College', 'Level']
../_images/4c2dc1848fefe0632c3fa4c206398823891c564d8172798dd3e1bda914220a93.png

This chart would be both easier to interpret and more aesthetically pleasing with some color. Our simple autopivot and autobar functions don’t handle color values, but their full-fledged versions, which we’ll import shortly, do.

Finally, we’ll create a chart that groups all data together:

df_pivot, fig = autopivot_plus_bar_simple(
    df_se, y='Score', aggfunc='mean', x_vars=[])     

wadi(fig=fig, file_path='output/mean_score_by_all_data',
    display_type=display_type)
../_images/1a8b2ea9fb22a220e7f7e6b1ea6745ff9bf93442c726af4665425d05f626ed4b.png

The above examples demonstrate how the autopivot_simple() and autobar_simple() functions can work together to generate all kinds of different table and chart combinations, thus eliminating the need to write pivot table and bar chart code for each set of comparison variables a user might want to analyze.

However, the basic charts that they produce leave much to be desired. There are no colors: the titles could use some work; and levels are sorted alphabetically rather than chronologically. Therefore, we’ll now import more robust versions of these functions.

Importing autopivot and autobar functions#

These functions are stored within the ‘auto_pivot_and_graph.py’ file within the PFN_Dash_App_Demo component of PFN’s Online Visualizations section. The Dash App Demo project applies them to automatically create new pivot tables–and corresponding charts–based on the input variables requested by dashboard visitors.

I had originally kept a copy of these functions within this section as well, but that resulted in two redundant Python files that I would need to manually synchronize. Therefore, I decided to delete this section’s copy and instead direct readers to the Online Visualizations section if they needed to reference the code.

At this point, I do suggest that you take some time to look through the auto_pivot_and_graph.py file; it contains detailed commentary to help you understand what’s going on within each step.

sys.path.insert(2, '../Online_Visualizations/PFN_Dash_App_Demo')
from auto_pivot_and_graph import autopivot, autobar, autopivot_plus_bar

Demonstrating these functions#

Let’s try using these functions to graph average scores by the college, level, and season fields–just as we did in our previous chart. In order to simplify our x axis values (and improve the chart’s appearance), we’ll pass Season to this function’s color parameter. In addition, although we’re passing Level For Sorting to our x_vars argument (so that levels will appear within the correct order), we won’t want that argument to show up within our x axis labels, so we’ll add that value to x_vars_to_exclude.

First, here’s a look at the output of autopivot() on its own:

(df_pivot, x_val_name, y, color, barmode, 
 x_var_count, index, aggfunc) = autopivot(
    df=df_se, y='Score', 
    x_vars=['College', 'Level For Sorting', 'Level'], 
    color='Season', 
    x_vars_to_exclude=['Level For Sorting'],
    aggfunc='mean')
print(x_val_name, y, color, barmode)
df_pivot.head(5)
x_vars: ['College', 'Level For Sorting', 'Level']
index prior to pivot_table() call: ['College', 'Level For Sorting', 'Level', 'Season']
College/Level Score Season group
College Level For Sorting ... Score College/Level
0 STB 0 ... 69.593583 STB/Fr
1 STB 0 ... 69.177235 STB/Fr
2 STB 1 ... 70.377306 STB/So
3 STB 1 ... 65.377306 STB/So
4 STB 2 ... 69.950769 STB/Ju

5 rows × 6 columns

And here’s what autobar() produces when provided the arguments generated within autopivot():

fig = autobar(
    df_pivot=df_pivot,
    x_val_name=x_val_name, y=y,
    color=color, barmode=barmode, 
    x_var_count=x_var_count, index=index,
    aggfunc=aggfunc,
    text_auto='.0f')

wadi(fig=fig, 
     file_path='output/mean_score_by_college_level_and_season_autopivot',
     display_type=display_type)
                               
../_images/dd1d39625ddfeb31fd15c961546f738e859c8216e38a1c5bcdebafbd612b3f24.png

This chart is certainly an improvement over our previous one. The levels now show up in the correct order (Fr, So, Ju, and Se), and the Season variable no longer appears within the x axis labels. (Since it’s already featured within the legend, its presence there would be redundant.) In addition, our title now features better formatting also.

We can simplify this code even further by calling autopivot_plus_bar():

fig = autopivot_plus_bar(
    df=df_se, y='Score', 
    x_vars=['College', 'Level For Sorting', 'Level'], 
    color='Season', 
    x_vars_to_exclude=['Level For Sorting', 'Season For Sorting'],
    aggfunc='mean',
    text_auto='.0f')

wadi(fig=fig, 
     file_path='output/mean_score_by_college_level_and_\
season_autopivot_v2', display_type=display_type)
x_vars: ['College', 'Level For Sorting', 'Level']
index prior to pivot_table() call: ['College', 'Level For Sorting', 'Level', 'Season']
../_images/dd1d39625ddfeb31fd15c961546f738e859c8216e38a1c5bcdebafbd612b3f24.png

Here’s an alternative chart with just 2 comparison variables: (This chart demonstrates how to use autopivot_plus_bar()’s text_auto argument to add extra precision to our text labels–though this level of detail is overkill in this case.

fig = autopivot_plus_bar(
    df=df_se, y='Score', 
    x_vars=['Level', 'Season For Sorting'], 
    color='Season', 
    x_vars_to_exclude=['Season For Sorting'],
    aggfunc='mean',
    text_auto='.3f')

wadi(fig=fig, 
     file_path='output/mean_score_by_level_and_season_autopivot', 
     display_type=display_type)
x_vars: ['Level', 'Season For Sorting']
index prior to pivot_table() call: ['Level', 'Season For Sorting', 'Season']
../_images/8cc0f25ca39657bd66885239401aa50a0488f19eee6a7f70adf0f97bd1560d95.png

Note that the above output is identical to the output of the following cell, which uses an aggregated dataframe (df_se_pivot) as its data source. This identical output was achieved by adding in a weight column (Responses).

fig = autopivot_plus_bar(
    df=df_se_pivot, y='Score', 
    x_vars=['Level'], 
    color='Season', 
    x_vars_to_exclude=['Season For Sorting'],
    aggfunc='mean', weight_col='Responses',
    text_auto='.3f')        

wadi(fig=fig, 
     file_path='output/mean_score_by_level_and_season_autopivot_v2', 
     display_type=display_type)
x_vars: ['Level']
index prior to pivot_table() call: ['Level', 'Season']
../_images/8cc0f25ca39657bd66885239401aa50a0488f19eee6a7f70adf0f97bd1560d95.png

In contrast, the following output, which uses df_se_pivot as well but does not reference a weight column, differs slightly from the above two graphs. This is because it is falling into the ‘average of averages’ error noted within Descriptive Stats: Part 2.

fig = autopivot_plus_bar(
    df=df_se_pivot, y='Score', 
    x_vars=['Level'], 
    color='Season', 
    x_vars_to_exclude=['Season For Sorting'],
    aggfunc='mean').update_layout(title='Mean Score by Level and \
Season (with incorrect averages)')

# As shown by this update_layout() call, further modifications can be 
# made as needed to figures returned by autopivot_plus_bar().

wadi(fig=fig, 
     file_path='output/mean_score_by_level_and_season_autopivot_aoa_error', 
     display_type=display_type)
x_vars: ['Level']
index prior to pivot_table() call: ['Level', 'Season']
../_images/70eccb0dd336421cebe0cb0539852678bce35e2816789280acd464ecfbb11823.png

These functions also allow the color argument to be skipped, though the result is rather plain:

fig = autopivot_plus_bar(
    df=df_se, y='Score', 
    x_vars=['College', 'Level'],
    aggfunc='mean')

wadi(fig=fig, 
     file_path='output/mean_score_by_college_and_level_\
autopivot_colorless', 
     display_type=display_type)
x_vars: ['College', 'Level']
index prior to pivot_table() call: ['College', 'Level']
../_images/35bb6f1c635c34c9e762b65be48322dbee9244a5e48169777f8e0bda159c2bca.png

They also allow an empty list of comparison values to get passed to x_vars, thus producing a single bar that represents the average of all values in the DataFrame.

fig = autopivot_plus_bar(
    df=df_se, y='Score', 
    x_vars=[], 
    color=None, 
    x_vars_to_exclude=[],
    aggfunc='mean')       

wadi(fig=fig, 
     file_path='output/overall_mean_score_autopivot', 
     display_type=display_type)
../_images/ef82727b38852191e94331fe9d57bcf55991c3dea052d69e178a9e0241ba44e2.png

At the opposite extreme, we can graph 6 comparison variables within the same chart (though the result is not the easiest to read):

fig = autopivot_plus_bar(
    df=df_se, y='Score', 
    x_vars=[
        'Starting Year', 'Season', 'Gender', 'College', 'Class Of', 
        'Level For Sorting', 'Level'], 
    color='Season', 
    x_vars_to_exclude=['Level For Sorting'],
    aggfunc='mean', text_auto=False)      

wadi(fig=fig, 
     file_path='output/mean_score_by_6_variables_autopivot', 
     display_type=display_type)
x_vars: ['Starting Year', 'Gender', 'College', 'Class Of', 'Level For Sorting', 'Level']
index prior to pivot_table() call: ['Starting Year', 'Gender', 'College', 'Class Of', 'Level For Sorting', 'Level', 'Season']
../_images/35abeb5f481ff83bda26539449ce09f35e8f883d1a6af712a2ab4f58a3caaa8e.png

Conclusion#

The autopivot(), autobar(), and autopivot_plus_bar() functions shown here can help speed up the process of creating standalone Plotly charts. However, these functions will prove even more useful within the Online Visualizations section of Python for Nonprofits–as they’ll allow us to turn a small set of user inputs into a wide variety of charts and tables.

Now that we’ve learned how to use Plotly to create various types of charts, we’ll shift our focus to mapping–an area in which Plotly also excels.