Creating Interactive and Static Charts with Plotly#

By Kenneth Burchfiel

Released under the MIT License

Now that we’ve learned how to import, clean, and analyze data, we can finally turn to one of my favorite uses of Python: visualizations. In this section, we’ll learn how to use the Plotly library to create bar and line graphs, scatter plots, histograms, and treemaps. Our main data source will be NVCU survey data, though we’ll also make use of current enrollment data for our treemaps and bookstore transaction data for our scatter plots and histograms. We’ll then save these charts to both interactive HTML files and static .png files.

When I was first learning Python, I mainly used Matplotlib for my charts. Matplotlib is a great library as well, but I switched over to Plotly as my default chart library for several reasons:

  1. Plotly makes it easy to produce both interactive and static versions of a given chart. These interactive versions, which work great on websites (but can also be opened as standalone HTML files), allow users to access more information than they could via static .png or .jpg copies.

  2. Plotly integrates very well with Dash, a Python library that allows you to build interactive web apps within Python. (A later section of Python for Nonprofits will demonstrate how to build both simple and more complex Dash web apps).

  3. My general sense is that Plotly tends to require less code to produce my desired result than does Matplotlib, although I don’t have any hard data on which to base this point.

Note: this script will create graphs using Plotly Express (https://plotly.com/python/plotly-express/), which allows you to create detailed charts with relatively few lines of code. For more complex charts, you may need to work with Plotly graph objects instead, as they allow for more customization; however, I’d recommend using Plotly Express code within your scripts when possible. After all, as the page linked to above notes, “Any figure created in a single function call with Plotly Express could be created using graph objects alone, but with between 5 and 100 times more code.”

import pandas as pd
import numpy as np
import plotly.express as px

from sqlalchemy import create_engine
e = create_engine('sqlite:///../Appendix/nvcu_db.db')

import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns=8)

from IPython.display import Image # Based on 
# a StackOverflow answer from 'zach' at
# https://stackoverflow.com/a/11855133/13097194 .

Creating bar graphs#

(For more details on bar charts in Plotly, consult the px.bar() documentation (https://plotly.com/python/bar-charts/).)

In order to create bar graphs of NVCU survey data, we’ll first import the pivot tables of this data created within PFN’s Descriptive Stats section, then feed them into our Plotly graphing script.

df_survey_results_by_college_long = pd.read_csv(
    '../Descriptive_Stats/survey_results_by_college_long.csv')
# Capitalizing all column names will allow our charts' x and y axis labels 
# to appear capitalized by default.
df_survey_results_by_college_long.columns = [
    column.title() for column in 
    df_survey_results_by_college_long.columns]
df_survey_results_by_college_long
Starting_Year College Season Score
0 2023 STB Fall 69.797119
1 2023 STB Winter 64.319684
... ... ... ... ...
10 2023 STM Winter 64.315841
11 2023 STM Spring 76.639004

12 rows × 4 columns

Simple bar graphs#

We’ll begin by creating a bar chart that shows spring survey results by college:

fig_spring_survey_results_by_college = px.bar(
    df_survey_results_by_college_long.query(
    "Season == 'Spring'"), x='College', y='Score',
      title='Spring Survey Scores by College',
      color='College', text_auto='.0f')
# See https://plotly.com/python/bar-charts/

# The text_auto argument allows you to easily add bar labels using 
# a format of your choice. '.0f' rounds labels to the nearest integer;
# to show values with 2 decimal points, use '.2f' instead. You can also
# convert proportions into percentages by replacing 'f' with '%'.

# display_type values don't include a leading period (e.g. 'png').
fig_spring_survey_results_by_college if display_type == 'html' else None
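Plotly interprets text_auto strings as d3-format specifiers, which, for simple cases like the ones mentioned above, look and behave much like Python's own format specifiers. The following sketch uses Python's built-in format() function as a stand-in so that you can preview what a given specifier will do; the share variable is a made-up proportion added for illustration.

```python
# A score taken from the survey table above, plus a made-up proportion.
score = 69.797119
share = 0.256

print(format(score, '.0f'))  # Rounded to the nearest integer
print(format(score, '.2f'))  # Two decimal places
print(format(share, '.0%'))  # Proportion shown as a whole-number percentage
print(format(share, '.1%'))  # Percentage with one decimal place
```

This is only an analogy: Plotly applies the specifier within the browser rather than in Python, so more exotic specifiers may behave differently.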

Plotly charts are generated in HTML format by default; this lets them offer a range of interactive features, such as tooltips (extra information that appears when you hover over a chart element) and panning/zooming options. You can also filter the chart’s contents by clicking once on a legend entry to hide it or double-clicking on an entry to keep only that one.

These interactive charts can easily be saved as HTML files via the write_html() function:

fig_spring_survey_results_by_college.write_html(
    'output/spring_results_by_college.html')

This .html file can now be opened in a web browser as a standalone document or embedded in a website for public display.

However, it’s also convenient to create static copies of these charts (e.g. for use within a slide deck or a web page that doesn’t allow certain HTML elements to be displayed). Thankfully, this is very simple: we just need to call write_image() instead of write_html().

That said, the default settings for write_image() produce, in my view, a relatively blurry image that isn’t suitable for presenting. Therefore, the following cell includes arguments for three optional parameters of write_image(): height, width, and scale. The settings shown below produce a chart with 4K resolution (3840 * 2160 pixels) whose text is also large enough to be easily readable.

height = 405
aspect_ratio = 16/9
width = height * aspect_ratio
scale = 2160 / height

# The above approach allows the height setting to automatically adjust
# the width and scale setting while preserving a resolution of 
# 3840 * 2160.

# The chart's height and width (in pixels) will be height * scale 
# and width * scale, respectively. Smaller height and width values will 
# result in larger (and thus more readable) text, but in order to keep 
# the resulting file sharp, I increased the scale to compensate. For 
# instance, a height of 540 will result in a scale of 4 (2160 / 540); 
# a height of 360, on the other hand, will result in a scale of 6.

# You may want to experiment with different settings in order to find the 
# result that best meets your needs.
# The 16/9 aspect ratio is used in many modern TVs and monitors, thus 
# making it ideal for full-screen displays of charts.

print("Height:",height, "\nWidth:", width, "\nScale:",scale)

fig_spring_survey_results_by_college.write_image(
    'output/spring_results_by_college.png', height=height, 
    width=width, scale=scale)
# For more information about write_image, see:
# https://plotly.com/python-api-reference/generated/plotly.io.
# write_image.html
Height: 405 
Width: 720.0 
Scale: 5.333333333333333

Note: in order for the above code to work, you’ll most likely need to have installed the kaleido library beforehand. When using the conda package manager on Windows, I’ve found that I’ve needed to request version 0.1.0 of kaleido when installing the library (e.g. by calling conda install python-kaleido=0.1.0 within my command prompt); this is because the later version appears to cause image generation scripts to hang indefinitely. (I found this solution via this GitHub post: plotly/Kaleido#134).

I haven’t encountered this problem on Linux, so on that operating system (and presumably on Mac), conda install python-kaleido should work fine.

The kaleido library makes creating static versions of HTML-based Plotly graphs very straightforward. In the Mapping section of Python for Nonprofits, we’ll go over a more manual approach for creating static versions of HTML files (as that section will utilize a library that doesn’t yet have kaleido support).

To display this image within this notebook, we can call Image() from IPython.display:

Image('output/spring_results_by_college.png', width=720)
# Based on the StackOverflow answer from 'zach'
# at https://stackoverflow.com/a/11855133/13097194 ;
# See also:
# https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.
# html#IPython.display.Image
../_images/77528acbaf02881d76deb1a552509ccce37b0964ba26a40d1b741f4fddebbb46.png
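The relationship between the height, width, and scale values used for the static export above can be wrapped in a small helper. The sketch below is just one way to do so; the function name export_settings() and its default arguments are my own assumptions rather than part of Plotly or helper_funcs.py.

```python
def export_settings(height, target_height=2160, aspect_ratio=16 / 9):
    """Return the (width, scale) pair that turns a chart with the given
    height into an image that is target_height pixels tall.
    (Hypothetical helper, shown for illustration only.)"""
    width = height * aspect_ratio
    scale = target_height / height
    return width, scale


# Smaller heights produce larger text; the scale grows to compensate,
# so the final pixel dimensions stay the same.
for height in (360, 405, 540):
    width, scale = export_settings(height)
    print(height, round(width), round(scale, 2),
          round(width * scale), round(height * scale))
```

Each height in the loop yields the same final output size (3840 * 2160 pixels); only the relative size of the text within the image changes.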

Importing a function to simplify the process of saving and displaying charts#

We can more easily save our Plotly charts to HTML and PNG files, then display a copy of the output, by importing a function that takes care of all three operations for us. This function, wadi(), will also be used within other sections of Python for Nonprofits. I highly recommend reading its source code, which can be found inside the Appendix folder’s helper_funcs.py file, so that you understand what the function does and how to modify it for your own needs.

from helper_funcs import wadi

More detailed bar graphs#

The bar graph created above shows that STL and STM had better survey results than did STB and STC. However, NVCU’s administrators will also want to know how each college’s survey results changed over the course of the year. We can visualize these changes by adding season as our argument for the color variable within px.bar(), thus creating a chart that shows separate color-coded bars for each college/season pair. (We’ll also add barmode='group' in order to display grouped bars rather than stacked ones.)

Sorting DataFrames using a custom key#

Plotly Express arranges categories in the order in which they first appear in the data. Because ‘Spring’ precedes ‘Winter’ alphabetically, alphabetically sorted data would show spring results before winter results. We can make sure that the seasons appear in chronological order instead by sorting our DataFrame so that winter rows precede spring ones.

The easiest way to accomplish this sort would be to first add a separate ‘season_sort_order’ column that contains values of 0, 1, and 2 for ‘Fall’, ‘Winter’, and ‘Spring’ rows, respectively. Once this column is in place, we could then sort the DataFrame by it. However, for demonstration purposes, the following cell shows how this sort can be accomplished without adding an additional column. The approach shown below relies on the key argument available within sort_values() together with Pandas’ map() function.

df_survey_results_by_college_long.sort_values(
    'Season', key=lambda col: col.map(
        {'Fall':0,'Winter':1,'Spring':2}), inplace=True)
# See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.sort_values.html
df_survey_results_by_college_long
Starting_Year College Season Score
0 2023 STB Fall 69.797119
3 2023 STC Fall 69.568665
... ... ... ... ...
8 2023 STL Spring 76.727809
11 2023 STM Spring 76.639004

12 rows × 4 columns
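Another option, which I haven’t used in this section, is to convert the season column into an ordered Pandas categorical. Once the column knows its own order, a plain sort_values() call will sort it chronologically. The sketch below uses a three-row stand-in DataFrame (df_demo) containing rounded STB scores from the table above.

```python
import pandas as pd

# Stand-in for the survey table, with rows deliberately out of order
df_demo = pd.DataFrame({
    'College': ['STB', 'STB', 'STB'],
    'Season': ['Spring', 'Fall', 'Winter'],
    'Score': [67.08, 69.80, 64.32]})

# Telling Pandas which order the seasons should follow:
df_demo['Season'] = pd.Categorical(
    df_demo['Season'], categories=['Fall', 'Winter', 'Spring'],
    ordered=True)

# No key argument is needed now that the column is an ordered categorical.
df_demo = df_demo.sort_values('Season').reset_index(drop=True)
print(df_demo)
```

One tradeoff of this approach is that the column’s data type changes, which may affect later operations that expect plain strings.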

Now that the DataFrame is sorted correctly, we can create our grouped bar graph:

fig_survey_results_by_college_and_season = px.bar(
    df_survey_results_by_college_long, 
    x='College', y='Score', 
    title='Survey Scores by College and Season',
      color='Season', text_auto='.0f', barmode='group')

Calling wadi to save static and interactive copies of this chart, then display the format specified by display_type:

display_type
'png'
wadi(fig=fig_survey_results_by_college_and_season,
     file_path='output/results_by_college_and_season',
     display_type=display_type)
../_images/133048145b02ddfebf59f897c7750a597b9d1618e1927deb4cab3546b0ebaeb9.png

This graph clearly shows how each school’s scores changed over time. However, in order to more easily compare results among schools for each season, we can switch the x and color arguments:

fig_survey_results_by_season_and_college = px.bar(
    df_survey_results_by_college_long, 
    x='Season', y='Score', 
    title='Survey Scores by Season and College',
      color='College', text_auto='.0f', barmode='group')

wadi(fig=fig_survey_results_by_season_and_college,
     file_path='output/results_by_season_and_college',
     display_type=display_type)
../_images/d44e783edbbd6fe8143ae9eb52816221b9045ba360281eaeb28f2321fde34510.png

This chart demonstrates that each college had an average fall score around 70 followed by a roughly 6-point drop to 64 during the winter. Although all colleges demonstrated growth from the winter to the spring, STL and STM clearly outperformed STB and STC (whose spring scores failed to match their fall scores).

Graphing wide data#

The previous two graphs used long-formatted data, meaning each row showed only one metric (in this case, one survey score). However, you may often encounter (or choose to create) tables in which multiple metrics are found within the same row. These datasets would be considered ‘wide.’ Here’s an alternative to df_survey_results_by_college_long that uses a wide format:

df_survey_results_by_college_wide = pd.read_csv(
    '../Descriptive_Stats/survey_results_by_college_wide.csv')
df_survey_results_by_college_wide.columns = [
    column.title() for column in 
    df_survey_results_by_college_wide.columns]
df_survey_results_by_college_wide.drop('Starting_Year', axis=1)
College Fall Winter Spring Fall-Spring Change Fall-Winter Change Winter-Spring Change
0 STB 69.797119 64.319684 67.077551 -2.719568 -5.477435 2.757867
1 STC 69.568665 64.029750 66.911444 -2.657221 -5.538914 2.881694
2 STL 69.596675 64.160327 76.727809 7.131134 -5.436348 12.567482
3 STM 69.735685 64.315841 76.639004 6.903320 -5.419844 12.323163

This dataset contains the same score data as df_survey_results_by_college_long, but it places each college’s seasonal scores on the same row. (This made it easier to calculate changes in scores over time, which this dataset also features.)
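If you only have one of these two formats on hand, Pandas can convert between them. The following sketch demonstrates one possible approach using pivot() and melt() on a small stand-in table (rounded STB and STL scores); it isn’t necessarily how the .csv files used in this section were created.

```python
import pandas as pd

df_long = pd.DataFrame({
    'College': ['STB', 'STB', 'STB', 'STL', 'STL', 'STL'],
    'Season': ['Fall', 'Winter', 'Spring'] * 2,
    'Score': [69.80, 64.32, 67.08, 69.60, 64.16, 76.73]})

# Long -> wide: one row per college and one column per season
df_wide = df_long.pivot(
    index='College', columns='Season', values='Score')[
    ['Fall', 'Winter', 'Spring']].reset_index()
df_wide.columns.name = None  # Removing the leftover 'Season' axis label

# With seasons side by side, changes over time are easy to calculate:
df_wide['Fall-Spring Change'] = df_wide['Spring'] - df_wide['Fall']

# Wide -> long: back to one row per college/season pair
df_long_again = df_wide.melt(
    id_vars='College', value_vars=['Fall', 'Winter', 'Spring'],
    var_name='Season', value_name='Score')
print(df_wide)
print(df_long_again)
```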

Here’s another look at df_survey_results_by_college_long for comparison:

df_survey_results_by_college_long
Starting_Year College Season Score
0 2023 STB Fall 69.797119
3 2023 STC Fall 69.568665
... ... ... ... ...
8 2023 STL Spring 76.727809
11 2023 STM Spring 76.639004

12 rows × 4 columns

Plotly makes graphing wide data like this very straightforward. Instead of passing season to the ‘color’ argument (as we did with the long-formatted table), we can simply add each season to a list that will then serve as our y argument:

fig_survey_results_by_college_and_season_wide = px.bar(
    df_survey_results_by_college_wide, x='College', 
       y=['Fall', 'Winter', 'Spring'], barmode='group',
      title='Survey Scores by College and Season',
    text_auto='.0f')
# See:
# https://plotly.com/python/bar-charts/#bar-charts-with-wide-format-data

# Because we are displaying multiple y values rather than just one, Plotly
# sets the y axis and legend titles as 'value' and 'variable' by default. 
# We can change these to 'Score' and 'Season', respectively, via the 
# following code:
fig_survey_results_by_college_and_season_wide.update_layout(
    yaxis_title='Score', legend_title='Season')
# This code was based on https://plotly.com/python/figure-labels/

wadi(fig=fig_survey_results_by_college_and_season_wide,
     file_path='output/results_by_college_and_season_wide',
     display_type=display_type)
../_images/133048145b02ddfebf59f897c7750a597b9d1618e1927deb4cab3546b0ebaeb9.png

Note that this graph, though based on a ‘wide’ rather than ‘long’ dataset, is equivalent to fig_survey_results_by_college_and_season.

Bar charts with three or more variables#

If you need to differentiate between three different variables (such as college, level, and season), you have a few different options. One would be to represent the third variable via Plotly’s pattern_shape argument, which adds different patterns to bars to represent different variable values. Another would be to use facet_row or facet_col to create a separate plot for each value within this third variable. Finally, you could consider grouping two variables into one via Pandas, then passing that grouped column to px.bar()’s x or color parameters. The following code will demonstrate each of these three options.

Reading in a dataset that shows results by season, level, and college (hence the use of ‘slc’ in the DataFrame name):

df_survey_results_slc = pd.read_csv(
    '../Descriptive_Stats/survey_results_slc_long.csv')
df_survey_results_slc.columns = [
    column.title() for column in 
    df_survey_results_slc.columns]

# In order for our graphs to display levels and seasons in the correct
# order, we'll add in a 'Season_For_Sorting' column (an equivalent for 
# levels already exists), then sort the DataFrame by these two columns.
# (I'm using this approach rather than the key-based one shown earlier
# because sort_values() applies its key function to each sort column 
# separately, so a single season mapping wouldn't suit both columns.)

df_survey_results_slc['Season_For_Sorting'] = df_survey_results_slc[
'Season'].map({'Fall':0,'Winter':1,'Spring':2})
df_survey_results_slc.sort_values(['Season_For_Sorting',
                                   'Level_For_Sorting'], inplace=True)
df_survey_results_slc.drop('Season_For_Sorting', axis = 1).head()
Starting_Year College Level_For_Sorting Level Season Score
0 2023 STB 0 Fr Fall 69.593583
12 2023 STC 0 Fr Fall 69.598913
24 2023 STL 0 Fr Fall 69.585030
36 2023 STM 0 Fr Fall 69.650503
3 2023 STB 1 So Fall 70.377306
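For what it’s worth, a key-based sort on more than one column is possible as long as the key function accounts for the fact that Pandas applies it to each sort column independently. One way to handle this (shown in the sketch below) is to look up a separate mapping for each column via the column’s name. The stand-in DataFrame (df_demo) and the sort_orders dictionary are my own illustrative additions.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Season': ['Spring', 'Fall', 'Winter', 'Fall', 'Spring', 'Winter'],
    'Level': ['So', 'So', 'Fr', 'Fr', 'Fr', 'So']})

# One mapping per sort column:
sort_orders = {
    'Season': {'Fall': 0, 'Winter': 1, 'Spring': 2},
    'Level': {'Fr': 0, 'So': 1, 'Ju': 2, 'Se': 3}}

# The key function receives each column as a Series whose name matches
# the column being sorted, so we can use that name to pick the
# right mapping.
df_sorted = df_demo.sort_values(
    ['Season', 'Level'],
    key=lambda col: col.map(sort_orders[col.name])).reset_index(
    drop=True)
print(df_sorted)
```

Whether this is clearer than adding explicit sorting columns is a matter of taste; the helper-column approach has the benefit of making the sort order visible within the table itself.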

Graphing results by college, level, and season#

I’ll first use the pattern_shape argument to represent ‘Level’ values:

fig_survey_results_slc_pattern_shape = px.bar(
    df_survey_results_slc, x='College', 
    color='Season', y='Score', pattern_shape='Level', 
    barmode='group', 
    title='Survey Scores by College, Season, and Level', 
    text_auto='.0f')

wadi(fig=fig_survey_results_slc_pattern_shape,
     file_path='output/results_slc_pattern_shape',
     display_type=display_type, height=450)
../_images/5099c90f80b88f4d639804e21f21a62c1dd3855393950ec69b0cc7f0c1ee5c56.png

This graph, as shown within the image above, is awfully cluttered and difficult to interpret. Your outcomes with pattern_shape might be better if you have, say, only two different values to pass to that argument (e.g. Fall and Spring results), but even then, your visualization may not be very intuitive.

As an alternative to pattern_shape, you might prefer to display results for a third comparison variable via the facet_col or facet_row arguments instead. Plotly will then create a separate chart for each value within the variable passed to those arguments, then display them together in the same graphic.

Here’s what we get if we pass ‘Level’ to facet_col rather than to pattern_shape:

fig_survey_results_slc_facet_col = px.bar(
    df_survey_results_slc, x='College', 
    color='Season', y='Score', facet_col='Level', 
    barmode='group', 
    title='Survey Scores by College, Season, and Level', 
    text_auto='.0f')

wadi(fig=fig_survey_results_slc_facet_col,
     file_path='output/results_slc_facet_col',
     display_type=display_type, height=450)
../_images/e71154ccf55c3ef336563a4efb603be7e601aeb5337e8519e4f5fdae1f7309da.png

Note that each of the four graphs shows data for one particular level. These graphs are displayed within separate columns inside the same row; if we instead selected facet_row, they would be shown within separate rows inside the same column.

A third option is to combine two fields (such as College and Level) into one, then pass that combined field to our graphing code:

df_survey_results_slc['College and Level'] = (
    df_survey_results_slc['College'] 
    + ' ' + df_survey_results_slc['Level'])
df_survey_results_slc.head()
Starting_Year College Level_For_Sorting Level Season Score Season_For_Sorting College and Level
0 2023 STB 0 Fr Fall 69.593583 0 STB Fr
12 2023 STC 0 Fr Fall 69.598913 0 STC Fr
24 2023 STL 0 Fr Fall 69.585030 0 STL Fr
36 2023 STM 0 Fr Fall 69.650503 0 STM Fr
3 2023 STB 1 So Fall 70.377306 0 STB So

Now that we’ve combined College and Level into our x variable, we no longer need a pattern_shape or facet_row/facet_col argument:

fig_survey_results_slc_combined_x = px.bar(
    df_survey_results_slc, x='College and Level', 
    color='Season', y='Score', barmode='group',
    title='Survey Scores by College, Season, and Level',
text_auto='.0f')

wadi(fig=fig_survey_results_slc_combined_x,
     file_path='output/results_slc_combined_x',
     display_type=display_type)
../_images/77565462f80a22c8533d9133e263356575bba0b69bc83b27bfeab7156ad6980e.png

I find this chart to be somewhat more intuitive than the one that incorporated a pattern_shape argument, though it’s still quite cluttered. However, users can simplify their view of the interactive version of this chart by zooming in on a specific section (e.g. by clicking and dragging over it with Plotly’s Zoom tool).

If you have four or more variables to display, you could combine them via the same method used to generate the ‘College and Level’ column above. However, you may conclude that limiting your chart to three (or ideally two) comparison variables is a better option.

Creating line graphs#

Line graphs are an excellent choice for displaying changes in data over time. The following code will provide a basic overview of building line charts within Plotly; for more detailed instructions, I recommend referencing the Plotly Express documentation for line charts (https://plotly.com/python/line-charts/) and the plotly.express.line reference (https://plotly.com/python-api-reference/generated/plotly.express.line).

First, we’ll create a simple line chart that shows how survey scores changed over time. This chart will show the same data as fig_survey_results_by_college_and_season, a bar chart that we created earlier; however, it will also make the divergence in spring scores between STL/STM and STC/STB more readily apparent.

Here’s another look at df_survey_results_by_college_long, the basis for this chart, for reference:

df_survey_results_by_college_long
Starting_Year College Season Score
0 2023 STB Fall 69.797119
3 2023 STC Fall 69.568665
... ... ... ... ...
8 2023 STL Spring 76.727809
11 2023 STM Spring 76.639004

12 rows × 4 columns

fig_survey_results_by_college_and_season_line = px.line(
    df_survey_results_by_college_long, x='Season', y='Score',
        color='College', title='Survey Scores by College and Season')
wadi(fig_survey_results_by_college_and_season_line,
     file_path='output/results_by_college_and_season_line',
     display_type=display_type)
../_images/167cbfe100931b780dcf2caa8f87583d13185c3ad315f4800da3d7c439d6e901.png

Representing multiple variables within line charts#

If you need to represent more than one comparison variable in your line chart, consider assigning one to the color argument (as shown in the above chart) and the other one to the line_dash argument (as shown in the following chart). However, this approach works better if your total variable count remains quite limited.

The following chart shows how survey scores changed over time for freshmen and juniors in STC and STM. (I purposefully reduced the number of possible combinations from 16 to 4 so that the chart would remain readable. However, if you’re producing an interactive chart, you could consider showing a larger number of color and dash combinations than you would within a static chart. This is because your viewers can click and double-click on legend entries in order to specify which items to keep, and because they can hover over specific points to verify the variables that they represent.)

This chart also adds in markers to draw readers’ attention to the actual points being graphed. (To designate which variable a given line represents, you can also vary the marker type for each line via px.line()'s symbol argument rather than varying the line style.)

fig_selected_survey_results_line = px.line(
    df_survey_results_slc.query("College in ['STM', 'STC'] and Level \
in ['Fr', 'Ju']"), x='Season', y='Score',
        color='College', line_dash='Level', markers=True,
    title='STM and STC Survey Scores by Season for Freshmen and Juniors')
wadi(fig_selected_survey_results_line,
     file_path='output/selected_survey_results_line',
     display_type=display_type)
../_images/f614238b1583440faf238cef80746af44edd82ceaa7f4e3b70dc0411561ed824.png

Adding labels to points#

Adding labels to line charts is relatively straightforward; however, making them readable can be a bit more challenging. For instance, if two lines overlap, their labels will also overlap by default.

Let’s try adding labels to a chart that shows changes in survey scores for two schools: STC and STM. First, we’ll create a new ‘Rounded_Score’ field within df_survey_results_by_college_long that we can use as text labels. (Using the original ‘Score’ field for our labels would display far more decimal places for each point than we need.)

df_survey_results_by_college_long['Rounded_Score'] = (
    df_survey_results_by_college_long['Score'].round(2))
df_survey_results_by_college_long
Starting_Year College Season Score Rounded_Score
0 2023 STB Fall 69.797119 69.80
3 2023 STC Fall 69.568665 69.57
... ... ... ... ... ...
8 2023 STL Spring 76.727809 76.73
11 2023 STM Spring 76.639004 76.64

12 rows × 5 columns

We can add labels to our chart using px.line()’s text argument (rather than text_auto, which is supported for bar charts but not line charts). By default, these labels will appear directly over each point, which I find makes the chart less readable. Fortunately, we can use the textposition argument of update_traces() to provide an offset between each point and its corresponding label.

However, as shown within the first line chart we produced, STM’s and STC’s fall and winter scores were almost identical. This means that, by default, the labels for these points will overlap. To avoid this issue, we’ll shift STM’s and STC’s labels above and below their corresponding points, respectively. This can be accomplished by calling update_traces() twice, once for each line; we can then specify the line whose textposition value we wish to update via the selector argument of update_traces(). (For more on this argument, visit https://plotly.com/python/reference/scatter/#scatter . Special thanks to python-trader, akroma, and adamschroeder in this thread (https://community.plotly.com/t/need-to-target-a-specific-trace-by-name-when-use-update-traces-s-selector/63151/2) for helping me figure out this solution.)

fig_stc_stm_survey_results_line = px.line(
    df_survey_results_by_college_long.query("College in ['STM', 'STC']"), 
    x='Season', y='Score',
        color='College', markers=True,
    text='Rounded_Score',
    title='STM and STC Survey Scores by Season').update_traces(
    textposition='top center', selector={'name':'STM'}).update_traces(
    textposition='bottom center', 
    selector={'name':'STC'}).update_layout(yaxis_range=[60, 80])

wadi(fig_stc_stm_survey_results_line,
     file_path='output/stc_stm_survey_results_line',
     display_type=display_type)
../_images/7eaa4a2d6d730ec096331234dd323d1dcbabc9c7fc9a2f5293e43fbbff7cd695.png

To figure out what item to pass to the ‘name’ argument (or another argument of your choice) of selector, try viewing the .data attribute of your figure. Note that ‘STC’ and ‘STM’ are listed below as the names of each trace.

fig_stc_stm_survey_results_line.data
(Scatter({
     'hovertemplate': 'College=STC<br>Season=%{x}<br>Score=%{y}<br>Rounded_Score=%{text}<extra></extra>',
     'legendgroup': 'STC',
     'line': {'color': '#636efa', 'dash': 'solid'},
     'marker': {'symbol': 'circle'},
     'mode': 'lines+markers+text',
     'name': 'STC',
     'orientation': 'v',
     'showlegend': True,
     'text': array([69.57, 64.03, 66.91]),
     'textposition': 'bottom center',
     'x': array(['Fall', 'Winter', 'Spring'], dtype=object),
     'xaxis': 'x',
     'y': array([69.56866485, 64.02975048, 66.91144414]),
     'yaxis': 'y'
 }),
 Scatter({
     'hovertemplate': 'College=STM<br>Season=%{x}<br>Score=%{y}<br>Rounded_Score=%{text}<extra></extra>',
     'legendgroup': 'STM',
     'line': {'color': '#EF553B', 'dash': 'solid'},
     'marker': {'symbol': 'circle'},
     'mode': 'lines+markers+text',
     'name': 'STM',
     'orientation': 'v',
     'showlegend': True,
     'text': array([69.74, 64.32, 76.64]),
     'textposition': 'top center',
     'x': array(['Fall', 'Winter', 'Spring'], dtype=object),
     'xaxis': 'x',
     'y': array([69.73568465, 64.3158411 , 76.63900415]),
     'yaxis': 'y'
 }))

Scatter plots#

(For a more detailed overview of scatter plots, see https://plotly.com/python/line-and-scatter/ .)

Scatter plots represent relationships between two variables by visualizing paired copies of those variables. For instance, if we wish to see how fall and spring NVCU bookstore spending are correlated, we could create a scatter plot in which each student is represented by a point; each point’s x value represents fall spending; and each point’s y value represents spring spending. That’s exactly what we’ll do in this section!

First, we’ll load in our sales data:

df_bookstore_sales = pd.read_sql("Select * from bookstore_sales", 
                                        con=e)
df_bookstore_sales.head()
student_id gender college level Fall Spring Fall_Spring_Change
0 2020-1 F STC Se 66.80 58.24 -8.56
1 2020-2 F STM Se 104.67 151.90 47.23
2 2020-3 F STC Se 46.17 16.56 -29.61
3 2020-4 F STC Se 58.68 73.77 15.09
4 2020-5 F STM Se 99.73 108.78 9.05

We’ll now create a scatter plot that compares Fall and Spring spending for each student. We’ll also include a linear trend line that helps make sense of the relationship between these two values. (You can modify the color of this line via the trendline_color_override argument as noted at https://plotly.com/python/linear-fits/).

fig_scatter = px.scatter(df_bookstore_sales, 
           x='Fall', y='Spring',
          trendline='ols', trendline_color_override='red',
                        title='Fall vs. Spring Bookstore Sales')
wadi(fig_scatter, 'output/sales_scatter', display_type=display_type)
../_images/85daf5ccb826bf3c127d7276a31a336576cb8e03e9c4cd89b639afda417c909e.png
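Setting trendline='ols' asks Plotly Express to fit an ordinary least squares line, a step that requires the statsmodels library to be installed. To make that fit a bit less of a black box, the following sketch computes the same kind of straight-line fit with NumPy. It uses only the five rows displayed by df_bookstore_sales.head() above, so its slope and intercept will not match those of the full-dataset trend line within the chart.

```python
import numpy as np

# Fall and spring spending for the five students shown in the
# head() output above
fall = np.array([66.80, 104.67, 46.17, 58.68, 99.73])
spring = np.array([58.24, 151.90, 16.56, 73.77, 108.78])

# A degree-1 polynomial fit is an ordinary least squares line.
slope, intercept = np.polyfit(fall, spring, deg=1)

# The Pearson correlation coefficient summarizes how closely
# the points follow a straight line.
correlation = np.corrcoef(fall, spring)[0, 1]

print('Slope:', round(slope, 2), '\nIntercept:', round(intercept, 2),
      '\nCorrelation:', round(correlation, 2))
```

A positive slope here corresponds to the upward-sloping red line in the chart: students who spent more in the fall tended to spend more in the spring as well.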

We can visualize additional categories using the color and symbol arguments of px.scatter(). The following plot uses different colors to represent levels and different symbols to represent different colleges.

To be honest, I don’t think that these added elements contribute much to this particular chart; with so many points present, it’s still hard to make out anything but the overall trend (that fall and spring spending are positively correlated). However, if your scatter plot has fewer points, you may find these arguments more useful.

fig_scatter_with_color = px.scatter(df_bookstore_sales, 
           x='Fall', y='Spring', color='level', symbol='college',
                                    opacity=0.5,
          trendline='ols', trendline_color_override='red',
                                    trendline_scope='overall',
                        title='Fall vs. Spring Bookstore Sales')

# I chose a higher height value here so that all legend entries would 
# appear within the static version of the chart.
wadi(fig_scatter_with_color, 'output/sales_scatter_with_color', 
     height=550, display_type=display_type)
../_images/06c3170afc8ed561141b3495ef394bcce19b9feea9bdf680ad78070b8144048b.png

Histograms#

Histograms allow you to view the number of entries in a dataset that fall into a particular range. For instance, the following histogram shows how many students’ bookstore spending fell within each range during the fall and spring semesters.

fig_sales_histogram = px.histogram(df_bookstore_sales, 
             x=['Fall', 'Spring'], barmode='group', nbins=30,
title='Distributions of fall and spring sales').update_layout(
    xaxis_title='Sales', yaxis_title='Students', 
    legend_title='Season')
wadi(fig_sales_histogram, 'output/sales_histogram', 
     display_type=display_type)
../_images/77dcc9cd48f22287b87f8d516e972de1dbaec26cf481745de303d691220ec0e9.png

This histogram indicates that students’ spending varied more during the spring than during the fall. It also looks like students spent more during the spring, on average, than they did during the fall.

We can confirm these hunches by calling describe() on our underlying dataset. (‘std’ here stands for ‘standard deviation’, a measure of how dispersed a given set of values is.)

df_bookstore_sales.describe().loc[['count', 'mean', 'std']]
               Fall        Spring  Fall_Spring_Change
count  16384.000000  16384.000000        16384.000000
mean      80.091662     83.644286            3.552625
std       25.023476     37.297291           27.534548
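
One detail worth knowing is that the ‘std’ row within describe() reports the sample standard deviation (i.e. it divides by n - 1 rather than n). The following sketch, which uses a tiny made-up table (df_demo) rather than our real dataset, compares that output to a manual calculation.

```python
import numpy as np
import pandas as pd

# A tiny, made-up table whose statistics are easy to check by hand
df_demo = pd.DataFrame({'Fall': [60.0, 80.0, 100.0],
                        'Spring': [40.0, 85.0, 130.0]})
summary = df_demo.describe().loc[['count', 'mean', 'std']]

# Manually calculating the sample standard deviation for 'Fall':
# the square root of the sum of squared deviations from the mean,
# divided by (n - 1)
fall_mean = df_demo['Fall'].mean()
squared_deviations = (df_demo['Fall'] - fall_mean) ** 2
manual_std = np.sqrt(squared_deviations.sum() / (len(df_demo) - 1))

print(summary)
print(manual_std)
```

In this toy table, both columns have similar means, but the Spring values are more spread out, so the Spring column has the larger standard deviation.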

To see whether these distributions varied by level, we can pass that variable as an argument to the facet_row parameter:

fig_sales_by_level = px.histogram(df_bookstore_sales.sort_values(
    'level', key=lambda col: col.map({'Fr':0,'So':1,'Ju':2, 'Se':3})), 
             x=['Fall', 'Spring'], barmode='group', nbins=30,
            facet_row='level',
            title='Fall/spring sales distributions \
by level').update_layout(
    xaxis_title='Sales', legend_title='Season')
wadi(fig_sales_by_level, 'output/sales_by_level_histogram',
     display_type=display_type)
../_images/cd63162b606c60dc014980bc08b426bed9c58789d89405d2c1d61bb27cb95877.png

Note that, in the output above, the spring bars appear to be shifted to the right of the fall bars for juniors and seniors, but not for freshmen and sophomores. This suggests that fall-to-spring spending growth was stronger for upperclassmen than it was for underclassmen.
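
To back up a visual impression like this one with numbers, we could reuse the custom sort key from the histogram code and then compare average fall and spring sales by level. Here is a minimal sketch of that approach; the df_demo table and its values are made up for illustration, so the averages it prints say nothing about NVCU's actual data.

```python
import pandas as pd

# Made-up sales data (not taken from the NVCU dataset)
df_demo = pd.DataFrame({
    'level': ['Se', 'Fr', 'Ju', 'So', 'Fr', 'Se', 'So', 'Ju'],
    'Fall': [80.0, 78.0, 82.0, 79.0, 82.0, 84.0, 81.0, 78.0],
    'Spring': [95.0, 79.0, 90.0, 80.0, 81.0, 99.0, 82.0, 88.0]})

# Sorting by academic order rather than alphabetical order: the key
# function maps each abbreviation to an integer, and pandas then
# sorts on those integers.
level_order = {'Fr': 0, 'So': 1, 'Ju': 2, 'Se': 3}
df_sorted = df_demo.sort_values(
    'level', key=lambda col: col.map(level_order))

# sort=False keeps groups in their order of first appearance, which
# preserves the Fr/So/Ju/Se order created above.
df_level_means = df_sorted.groupby(
    'level', sort=False)[['Fall', 'Spring']].mean()
df_level_means['Change'] = (
    df_level_means['Spring'] - df_level_means['Fall'])
print(df_level_means)
```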

If we wanted to show fall and spring spending distributions for each level/college combination, we could update our previous chart by assigning ‘college’ to the facet_col parameter:

fig_sales_by_college_and_level = px.histogram(
    df_bookstore_sales.sort_values(
    'level', key=lambda col: col.map({'Fr':0,'So':1,'Ju':2, 'Se':3})), 
             x=['Fall', 'Spring'], barmode='group', nbins=30,
            facet_row='level',
            facet_col='college',
title='Fall/Spring Sales Distributions by College and Level')
wadi(fig_sales_by_college_and_level, 
     'output/sales_by_college_and_level_histogram', 
     display_type=display_type)
../_images/045631b76706c6b43996536c218555edb4f94bc1937c0ef76a56c222179baaa1.png

However, this type of chart will cause many readers’ eyes to glaze over; the more meme-aware members of your audience might respond with “I Ain’t Reading All That” (https://knowyourmeme.com/memes/i-aint-reading-all-that). As the saying goes, “Just because you can doesn’t mean you should.”

What would be a simpler way to visualize these trends? In the Regressions section of PFN, we’ll create a grouped bar chart that compares average fall/spring sales growth by college and level. By replacing individual fall and spring distributions with a single growth metric, we can display all 16 level and college combinations within the same graphic.

Once you make it to the Regressions section, compare that chart (titled ‘Average Fall-Spring Sales Growth by College and Level’) to this one; I think you’ll agree that the former is much easier to interpret than the matrix of histograms shown above.
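
As a rough preview of that idea, here is a sketch of how a single growth metric could be calculated for each college/level pair. This is not the code from the Regressions section; the df_demo table and its values are hypothetical, though the Fall_Spring_Change column name matches the one in our bookstore sales table.

```python
import pandas as pd

# Hypothetical rows (two colleges and two levels only, for brevity)
df_demo = pd.DataFrame({
    'college': ['STB', 'STB', 'STC', 'STC', 'STB', 'STB', 'STC', 'STC'],
    'level': ['Fr', 'Fr', 'Fr', 'Fr', 'Se', 'Se', 'Se', 'Se'],
    'Fall_Spring_Change': [1.0, 3.0, -2.0, 0.0, 10.0, 14.0, 6.0, 8.0]})

# Collapsing each college/level group into its average growth value
df_growth = df_demo.pivot_table(
    index=['college', 'level'], values='Fall_Spring_Change',
    aggfunc='mean').reset_index()
print(df_growth)
```

A table like this one has a single row per college/level combination, so it could then be passed to px.bar() (with barmode='group', for instance) to produce one bar per combination.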

Treemaps#

You might have expected to see pie charts covered in this section, and Plotly certainly supports them (https://plotly.com/python/pie-charts/). However, I would like to humbly suggest that you use treemaps rather than pie charts. Why? As Joel Abrams notes (https://theconversation.com/heres-why-you-should-almost-never-use-a-pie-chart-for-your-data-214576), it can be hard to estimate area within pie charts alone; in addition, when many different categories are present, it can be quite hard to figure out the value corresponding to one specific category.

Abrams suggests that you consider using bar charts instead, which is perfectly fine; however, treemaps are a worthy alternative to consider as well. Because they show data in rectangular (rather than wedge-shaped) form, they’re much easier to interpret; in addition, they do a good job of displaying hierarchical data. (I’ll also add a bar chart in this section that shows the same data as one of our treemaps, thus allowing you to compare the two.)

Loading our NVCU current enrollment table into the script:

df_curr_enrollment = pd.read_sql("Select * from curr_enrollment", con=e)
df_curr_enrollment['Count'] = 1
df_curr_enrollment.head()
first_name last_name gender matriculation_year ... class_of level level_for_sorting Count
0 Ashley Jordan F 2020 ... 2024 Se 3 1
1 Lisa Dunn F 2020 ... 2024 Se 3 1
2 Alexandra Woods F 2020 ... 2024 Se 3 1
3 Kathleen Carter F 2020 ... 2024 Se 3 1
4 Donna Brewer F 2020 ... 2024 Se 3 1

5 rows × 12 columns

I’ll now create a ‘college/level’ field that will prove useful for one of our treemaps.

df_curr_enrollment['college/level'] = (
    df_curr_enrollment['college'] + ' (' 
    +  df_curr_enrollment['level'] + ')')

# Now that I've created this field, I'll replace the original 'level'
# values with their non-abbreviated equivalents. (The abbreviated versions
# will fit better within the combined college/level field, so I waited
# to make these changes until that field had been created.)
df_curr_enrollment['level'] = df_curr_enrollment['level'].map(
    {'Fr':'Freshman', 'So':'Sophomore', 'Ju':'Junior','Se':'Senior'})

df_curr_enrollment
first_name last_name gender matriculation_year ... level level_for_sorting Count college/level
0 Ashley Jordan F 2020 ... Senior 3 1 STC (Se)
1 Lisa Dunn F 2020 ... Senior 3 1 STM (Se)
... ... ... ... ... ... ... ... ... ...
16382 Andrew Copeland M 2023 ... Freshman 0 1 STB (Fr)
16383 Aaron Thomas M 2023 ... Freshman 0 1 STM (Fr)

16384 rows × 13 columns

The following treemap shows NVCU’s current enrollment by level and by college. The outer rectangles show the total enrollment for each level, while the inner rectangles show the enrollment distribution, by college, for all students within that level.

Note that the Freshman rectangle has the largest area and that the Senior rectangle has the smallest; this suggests that NVCU’s enrollment is growing (though it could also indicate an attrition problem, which, fortunately for NVCU, isn’t the case). Also note that, while STL’s rectangle within the Senior rectangle is small compared to those for the other colleges, its share of the Freshman rectangle is much larger. This suggests that STL has become more popular (relative to other colleges) over time, though it could also indicate that students tend to transfer out of it between their first and last years at NVCU.

fig_level_college_treemap = px.treemap(
    df_curr_enrollment, path=['level', 'college'], 
    values='Count', branchvalues='total',
title='Enrollment by Level and College')

# In the following wadi() call, setting aspect ratio to 1 makes our
# chart square and thus a bit more compact. (I'll use this same ratio for 
# the other treemaps in this script.)
wadi(fig_level_college_treemap, 'output/treemap_level_college', 
     height=500, aspect_ratio=1,
     display_type=display_type, display_width=500)
../_images/29b34d21d64b631f7d800eadb579caa458807a0adf3089f5310198b2c6fda2c5.png
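
Comparisons of rectangle sizes across levels (such as STL's share of the Freshman rectangle versus its share of the Senior rectangle) can also be checked numerically. The following sketch uses pd.crosstab() to calculate each college's share of enrollment within each level; the df_demo table is made up, so its shares are purely illustrative.

```python
import pandas as pd

# Made-up enrollment rows: 10 freshmen and 10 seniors across 2 colleges
df_demo = pd.DataFrame({
    'level': ['Freshman'] * 10 + ['Senior'] * 10,
    'college': (['STL'] * 4 + ['STM'] * 6      # Freshman rows
                + ['STL'] * 1 + ['STM'] * 9)})  # Senior rows

# normalize='index' divides each count by its row total, so each
# level's shares sum to 1.
df_college_share = pd.crosstab(
    df_demo['level'], df_demo['college'], normalize='index')
print(df_college_share)
```

In this toy example, STL accounts for a larger share of freshmen than of seniors, which is the same kind of pattern that the treemap above reveals visually.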

In this treemap, college enrollments are nested within level-wide enrollments. However, as shown below, we can use the ‘college/level’ field within df_curr_enrollment to create an ‘ungrouped’ equivalent of this chart. We’ll color the rectangles within this chart by college to make our output, which is sorted by size, somewhat more intuitive.

fig_level_college_treemap_ungrouped = px.treemap(
    df_curr_enrollment, path=['college/level'], 
    values='Count', color='college',
width=600, height=600, 
title='Enrollment by College and Level')
wadi(fig_level_college_treemap_ungrouped, 
     'output/treemap_level_college_ungrouped',
     height=500, aspect_ratio=1,
     display_type=display_type, display_width=500)
../_images/07d01221e673acdee2b06d708a12cb7a670c2d3684eac34bb37f8ec15c5cfeba.png

We can view the relative size of our enrollments for each level a bit more clearly within the following treemap, which groups students by level only.

fig_level_treemap = px.treemap(
    df_curr_enrollment, path=['level'], 
    values='Count', color='level',
title='Enrollment by Level')
wadi(fig_level_treemap, 'output/treemap_level',
     height=500, aspect_ratio=1,
     display_type=display_type, display_width=500)
../_images/d597340000ba87ab3f3218fffa76986f9c3c628ad617189797d0a5c0ef077693.png

Getting data labels inserted into treemaps created with Plotly Express is trickier than it is for bar charts, as there’s no text_auto or text argument within the px.treemap() function. (See https://plotly.com/python-api-reference/generated/plotly.express.treemap for reference.)

If your chart is HTML-based (i.e. interactive), users will still be able to view the actual values corresponding to a rectangle when they hover over it. However, if you need these labels to show up within a static chart, you could create a new field that shows both categories and their values, then pass that field to the path argument of px.treemap(). The following cells provide an example of this approach.

We’ll first create a pivot table that shows enrollment by level, then add a new field to it that contains both level names and enrollment totals.

df_enrollment_by_level = df_curr_enrollment.pivot_table(
    index=['level_for_sorting', 'level'], values='Count', 
    aggfunc='sum').reset_index().rename(
    columns={'Count':'Enrollment', 'level':'Level'})

# Creating a column that contains both enrollment totals by level
# and the name of that level: 
# (These levels will be converted to plural forms by appending 's' to
# them. The script will also add in a correction for the plural version
# of 'freshman.')
df_enrollment_by_level['Labeled Enrollment'] = (
df_enrollment_by_level['Enrollment'].astype('str') 
+ ' ' + df_enrollment_by_level['Level'] + 's').str.replace(
    'Freshmans','Freshmen')

df_enrollment_by_level
   level_for_sorting      Level  Enrollment  Labeled Enrollment
0                  0   Freshman        5443       5443 Freshmen
1                  1  Sophomore        3999     3999 Sophomores
2                  2     Junior        3653        3653 Juniors
3                  3     Senior        3289        3289 Seniors

Next, we’ll incorporate this ‘Labeled Enrollment’ field into a new treemap:

fig_labeled_level_treemap = px.treemap(
    df_enrollment_by_level, path=['Labeled Enrollment'], 
    values='Enrollment', color='Level',
title='Enrollment by Level')
wadi(fig_labeled_level_treemap, 'output/treemap_labeled_levels',
     height=500, aspect_ratio=1,
     display_type=display_type, display_width=500)
../_images/59a563ddcc6733cc7bc03c7b9ef3f3a57a289beb049c23526308376c1c197f83.png

Comparing these treemaps to bar charts#

As Joel Abrams pointed out, you can use bar charts to display totals as well. Let’s create a bar-chart equivalent of our ‘Enrollment by Level and College’ treemap so that you can more easily compare these two visualization types.

First, we’ll create a pivot table that can serve as the basis for this chart:

df_enroll_pivot = df_curr_enrollment.pivot_table(
    index=['level_for_sorting', 'level', 'matriculation_year', 
             'college'], values='Count', 
    aggfunc='sum').reset_index()
df_enroll_pivot.rename(columns = {'matriculation_year':
                                   'Matriculation Year',
                                   'college':'College',
                                   'Count':'Enrollment',
                                  'level':'Level'}, inplace=True)

# Converting the Matriculation Year to a string will ensure that these 
# values, when used as x axis labels within an upcoming line chart,
# will appear as distinct items rather than a range. We could also
# achieve this outcome after creating a chart by applying
# the following code:
# fig.update_layout(xaxis_type='category')

df_enroll_pivot['Matriculation Year'] = (
    df_enroll_pivot['Matriculation Year'].astype('str'))
df_enroll_pivot
level_for_sorting Level Matriculation Year College Enrollment
0 0 Freshman 2023 STB 1309
1 0 Freshman 2023 STC 920
... ... ... ... ... ...
14 3 Senior 2020 STL 488
15 3 Senior 2020 STM 1007

16 rows × 5 columns

Next, we’ll create our bar chart. Note that the output shows the same values as our earlier ‘Enrollment by Level and College’ treemap; these values are simply arranged differently.

I’ll admit that I do prefer this bar chart over its corresponding treemap, as the former does a better job of highlighting the differences in enrollment totals between each level.

fig_enroll_bar = px.bar(
    df_enroll_pivot, 
        x='Level', y='Enrollment', 
       color='College', text_auto='.0f',
title='Enrollment by Level and College')
wadi(fig_enroll_bar, 'output/enroll_bar', display_type=display_type)
../_images/9fd347f19a55723c99e6a2a8167336b2ef2d15bc7a778353704a9e2f2a24314c.png

If we replace Level as our x-axis value with Matriculation Year, we can also create a line graph that indicates how freshman enrollment at each college has grown over time. (Most of the seniors in our current enrollment dataset matriculated in 2020, and would thus be considered freshmen in that year; juniors matriculated in 2021; sophomores matriculated in 2022; and freshmen matriculated in 2023.)

(I say ‘indicates’ here because we’re working off a current enrollment dataset that does not take transfers or attrition into account; if NVCU had lots of attrition for a given year, or many transfers, this chart would thus underestimate or overestimate freshman enrollment for that year, respectively.)

fig_enroll_line = px.line(df_enroll_pivot.sort_values(
    'Matriculation Year'), 
        x='Matriculation Year', y='Enrollment',
       color='College',
       title='Enrollment by Matriculation Year and College',
                          markers=True)
wadi(fig_enroll_line, 'output/enroll_line', display_type=display_type)
../_images/89fcd3ec1503ce47959def6638897e95e66e44952b562a2e095325bca2240862.png

This chart shows some intriguing patterns. First, it’s evident that freshman enrollment has grown over time (barring the caveats about this dataset that I explained earlier). However, this growth is not uniform across colleges: while STL has had particularly strong growth in freshman enrollment, STC’s numbers are actually lower now than they were back in 2020.
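
If you wanted to quantify these differences in growth, one option would be to reshape the pivot table so that each year has its own column, then calculate the percentage change between the first and last years. Here is a sketch of that calculation; the df_demo table uses made-up totals for two colleges, so its results don't reflect NVCU's actual enrollment.

```python
import pandas as pd

# Made-up totals for two colleges in two matriculation years. (The
# years are stored as strings to match df_enroll_pivot.)
df_demo = pd.DataFrame({
    'Matriculation Year': ['2020', '2020', '2023', '2023'],
    'College': ['STC', 'STL', 'STC', 'STL'],
    'Enrollment': [100, 50, 90, 100]})

# Reshaping the table so that each college has one row and each year
# has its own column
df_wide = df_demo.pivot_table(
    index='College', columns='Matriculation Year',
    values='Enrollment', aggfunc='sum')

# Percentage change between the first and last years
df_wide['Pct Change'] = 100 * (df_wide['2023'] / df_wide['2020'] - 1)
print(df_wide)
```

A negative value in the ‘Pct Change’ column would correspond to a line that ends lower than it started, whereas a large positive value would correspond to a steeply rising line.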

That does it for the first part of the Graphing section. In the following Pivot and Graph Functions notebook, you’ll learn how to speed up the process of creating pivot tables and charts–a skill that will prove especially useful when we begin creating interactive online visualizations.