NVCU Database Generator#

By Kenneth Burchfiel

This script will create a SQLite database for the (fictional) Northern Virginia Catholic University that can be referenced within other sections of Python for Nonprofits. The data within this database will also be fictional.

from sqlalchemy import create_engine
import pandas as pd
import numpy as np
# Setting up random number generation capabilities:
rng = np.random.default_rng(2325) 
# Based on https://numpy.org/doc/stable/reference/random/generator.html 
# The faker library is a great tool for creating fictional
# database records. I set the locale to 'en_US' because NVCU
# is located in the United States.
from faker import Faker
fake = Faker('en_US')  
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns=7,
                              display_max_rows=5)

Connecting to our NVCU database via the SQLAlchemy library#

(This code will work even if no database currently exists at the path shown below.)

e = create_engine('sqlite:///nvcu_db.db')
# Based on: https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#pysqlite
student_count = 2**14 # The university will have around 16,400 students.
student_count
16384

Creating a current enrollment table#

Creating lists of names#

We’ll use the Faker library to create equal numbers of female and male first names, then concatenate these lists to create a single list of first names.

(Note: The Faker documentation for the ‘en_US’ (US English) locale (https://faker.readthedocs.io/en/master/locales/en_US.html) was a useful resource in writing this code.)

student_count
# It's convenient to also create a variable for the value equal
# to half of the student count. (Dividing the student count by
# 2 produces a float by default, so we'll convert this value to 
# an int so that it can be passed to range() within list comprehensions.)
half_student_count = int(student_count / 2)
half_student_count
8192
female_first_names = [
    fake.first_name_female() 
    for i in range(half_student_count)]
male_first_names = [
    fake.first_name_male() for i in range(half_student_count)]
first_names = female_first_names + male_first_names
last_names = [
    fake.last_name() for i in range(student_count)]
female_first_names[0:5], male_first_names[0:5]
(['Ashley', 'Lisa', 'Alexandra', 'Cynthia', 'Mary'],
 ['Charles', 'Ryan', 'Miguel', 'Brian', 'Curtis'])
# In order to make our genders match our names, we'll 
# make the first half of our gender list female and the second 
# half male (as the first and second halves of our first names
# list contain female and male names, respectively).
genders = (['F' for i in range(half_student_count)] 
+ ['M' for i in range(half_student_count)])

Initializing our current enrollment table#

df_curr_enrollment = pd.DataFrame(
    index=np.arange(0,student_count), 
data = {'first_name':first_names,
        'last_name':last_names,
        'gender':genders})
df_curr_enrollment.head()
first_name last_name gender
0 Ashley Jordan F
1 Lisa Dunn F
2 Alexandra Woods F
3 Cynthia Lozano F
4 Mary Kemp F

Creating matriculation years:#

In order to simulate increasing enrollment over time, we'll add weights to the rng.choice() call so that recent years appear more frequently.

rng.choice(
    [2020, 2021, 2022, 2023], 
    p=[0.20, 0.22, 0.25, 0.33], size=student_count)
array([2023, 2023, 2020, ..., 2023, 2022, 2023], shape=(16384,))
df_curr_enrollment['matriculation_year'] = rng.choice(
    [2020, 2021, 2022, 2023], p=[0.20, 0.22, 0.25, 0.33], 
    size=student_count)
# See https://numpy.org/doc/stable/reference/random/generated/numpy.
# random.Generator.choice.html
df_curr_enrollment.head()
first_name last_name gender matriculation_year
0 Ashley Jordan F 2020
1 Lisa Dunn F 2020
2 Alexandra Woods F 2020
3 Cynthia Lozano F 2022
4 Mary Kemp F 2023

Creating student IDs#

Student IDs will use the format matriculation_year-matriculation_number. matriculation_number represents the order in which students enrolled for a given year; these numbers are unique within each year, but not between years. This number can then be combined with students’ matriculation years to form a unique ID.
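As a quick illustration (using a toy Series rather than the NVCU data) of how groupby() plus rank(method='first') numbers each group's rows in order of appearance:

```python
import pandas as pd

# Toy example: number each year's occurrences in order of appearance.
# rank(method='first') breaks ties by position, so the three 2020 rows
# receive 1, 2, and 3, while the two 2021 rows receive 1 and 2.
years = pd.Series([2020, 2020, 2021, 2020, 2021])
within_year_order = years.groupby(years).rank(method='first').astype('int')
print(within_year_order.tolist())  # → [1, 2, 1, 3, 2]
```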

# Calculating matriculation numbers by grouping the DataFrame by
# matriculation year, then assigning each student within each year a unique
# number:
# (This can be achieved via df.groupby() and df.rank(). See
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.groupby.html
# and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.rank.html )
df_curr_enrollment['matriculation_number'] = df_curr_enrollment.groupby(
    'matriculation_year')['matriculation_year'].rank(
    method='first').astype('int')
df_curr_enrollment.head()
first_name last_name gender matriculation_year matriculation_number
0 Ashley Jordan F 2020 1
1 Lisa Dunn F 2020 2
2 Alexandra Woods F 2020 3
3 Cynthia Lozano F 2022 1
4 Mary Kemp F 2023 1
# Adding matriculation years, matriculation numbers, and hyphens together
# to create student IDs:
df_curr_enrollment['student_id'] = (
    df_curr_enrollment['matriculation_year'].astype('str') 
    + '-' 
    + df_curr_enrollment['matriculation_number'].astype('str'))
# Sorting the DataFrame by matriculation year and matriculation number:
# (The DataFrame could also be sorted by student_id, but because these 
# numbers would be interpreted as strings, ids like 2020-10 would appear 
# in front of ones like 2020-2.)
df_curr_enrollment.sort_values(
    ['matriculation_year', 'matriculation_number'], inplace=True)
df_curr_enrollment.reset_index(drop=True,inplace=True)

df_curr_enrollment.head()
first_name last_name gender matriculation_year matriculation_number student_id
0 Ashley Jordan F 2020 1 2020-1
1 Lisa Dunn F 2020 2 2020-2
2 Alexandra Woods F 2020 3 2020-3
3 Kathleen Carter F 2020 4 2020-4
4 Donna Brewer F 2020 5 2020-5
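To see why sorting by student_id directly would misorder rows, here's a minimal sketch of Python's lexicographic string sorting (toy values, not the actual table):

```python
# Strings are compared character by character, so '2020-10' sorts
# before '2020-2' ('1' < '2' at the position after the hyphen):
ids = ['2020-2', '2020-10', '2020-1']
print(sorted(ids))  # → ['2020-1', '2020-10', '2020-2']
```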

We’ll also use matriculation years as a basis for dates of birth. (These date of birth calculations aren’t the most realistic, but they are good enough for use within our fictional dataset.)

# Limiting birth_day integers to the range [1, 28] prevents a 
# non-existent date (e.g. February 31) from getting stored in our results.
df_curr_enrollment['birth_month'] = rng.integers(
    low=1, high=13, size=student_count).astype('str')
df_curr_enrollment['birth_day'] = rng.integers(
    low=1, high=29, size=student_count).astype('str')

# Adding leading zeros to month and day values:
for column in ['birth_month', 'birth_day']:
    df_curr_enrollment[column] = df_curr_enrollment[column].str.zfill(2)

df_curr_enrollment['birth_year'] = (df_curr_enrollment[
'matriculation_year'] - 18).astype('str')

# Combining these columns together to create date of birth values in
# YYYY-MM-DD format:
df_curr_enrollment['date_of_birth'] = (
    df_curr_enrollment['birth_year'] + '-'
    + df_curr_enrollment['birth_month'] + '-'
    + df_curr_enrollment['birth_day'])

# Now that we have our date_of_birth column in place, we no longer need
# the columns that stored individual parts of those birthdays:
df_curr_enrollment.drop(
    ['birth_month', 'birth_day', 
    'birth_year'], axis=1, inplace=True)

df_curr_enrollment
first_name last_name gender matriculation_year matriculation_number student_id date_of_birth
0 Ashley Jordan F 2020 1 2020-1 2002-12-16
1 Lisa Dunn F 2020 2 2020-2 2002-09-26
... ... ... ... ... ... ... ...
16382 Andrew Copeland M 2023 5442 2023-5442 2005-11-03
16383 Aaron Thomas M 2023 5443 2023-5443 2005-04-05

16384 rows × 7 columns
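As a quick sanity check (not part of the original script), pd.to_datetime() can confirm that date-of-birth strings built this way are valid calendar dates, since it raises an error on impossible dates such as February 31:

```python
import pandas as pd

# A couple of sample YYYY-MM-DD strings in the same format as
# the date_of_birth column:
dates = pd.Series(['2002-02-28', '2005-11-03'])
# errors='raise' (the default) would fail on a date like '2002-02-31'.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')
print(parsed.dt.year.tolist())  # → [2002, 2005]
```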

Assigning students to different colleges#

(I used Wikipedia’s ‘List of patron saints by occupation and activity’ page to determine the saint after which each college would be named.)

NVCU has four different colleges:

  1. St. Luke’s, a humanities college. (St. Luke is one of the patron saints of artists.) Abbreviation: STL

  2. St. Benedict’s, a STEM college. (St. Benedict is one of the patron saints of engineers.) Abbreviation: STB

  3. St. Matthew’s, a business college. (St. Matthew is the patron saint of accountants.) Abbreviation: STM

  4. St. Catherine’s, a health sciences college. (St. Catherine of Alexandria is one of the patron saints of nurses.) Abbreviation: STC

We can use rng.choice() to assign students to different colleges. However, to make the data more interesting, we'll have one college (STL) increase in popularity over time; another (STC) decrease in popularity; and the two remaining colleges remain roughly constant in popularity. We can simulate these changes by (1) creating filtered versions of the DataFrame for each year; (2) calling rng.choice() with a different probability set for each year in order to create the 'college' column; and (3) recreating df_curr_enrollment by concatenating these year-specific DataFrames back together.

df_list = []
for year in df_curr_enrollment['matriculation_year'].unique():
    print(f"Now adding in college enrollments for {year}.")
    df = df_curr_enrollment.query("matriculation_year == @year").copy()
    if year == 2020:
        probabilities = [0.15, 0.25, 0.3, 0.3]
    elif year == 2021:
        probabilities = [0.19, 0.26, 0.29, 0.26]
    elif year == 2022:
        probabilities = [0.25, 0.27, 0.27, 0.21]
    elif year == 2023:
        probabilities = [0.27, 0.25, 0.31, 0.17]
    else:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")
    df['college'] = rng.choice(
        ['STL', 'STB', 'STM', 'STC'], 
        p=probabilities, size=len(df))
    df_list.append(df)
df_curr_enrollment = pd.concat([df for df in df_list])
df_curr_enrollment.head()
Now adding in college enrollments for 2020.
Now adding in college enrollments for 2021.
Now adding in college enrollments for 2022.
Now adding in college enrollments for 2023.
first_name last_name gender ... student_id date_of_birth college
0 Ashley Jordan F ... 2020-1 2002-12-16 STC
1 Lisa Dunn F ... 2020-2 2002-09-26 STM
2 Alexandra Woods F ... 2020-3 2002-05-18 STC
3 Kathleen Carter F ... 2020-4 2002-11-15 STC
4 Donna Brewer F ... 2020-5 2002-06-21 STM

5 rows × 8 columns

The commented-out cell below shows an alternative approach to assigning colleges to each student. Because it iterates through each row in the DataFrame, it took 2.25 seconds to run on my laptop versus 0.016 seconds for the method shown above; in other words, the vectorized method in the previous cell was roughly 140 times faster.

# df_curr_enrollment['college'] = ''
# college_col = df_curr_enrollment.columns.get_loc('college')
# for i in range(len(df_curr_enrollment)):
#     year = df_curr_enrollment.iloc[i]['matriculation_year']
#     if year == 2020:
#         probabilities = [0.15, 0.25, 0.3, 0.3]
#     elif year == 2021:
#         probabilities = [0.19, 0.26, 0.29, 0.26]
#     elif year == 2022:
#         probabilities = [0.25, 0.27, 0.27, 0.21]
#     elif year == 2023:
#         probabilities = [0.27, 0.25, 0.31, 0.17]
#     df_curr_enrollment.iloc[i, college_col] = rng.choice(
#     ['STL', 'STB', 'STM', 'STC'], p=probabilities)
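As a design note, the if/elif ladder used above could also be written as a dictionary lookup, which keeps the year-to-probability mapping in one place. A minimal sketch (not the approach the script itself uses):

```python
# Mapping each matriculation year to its [STL, STB, STM, STC] weights:
year_probabilities = {
    2020: [0.15, 0.25, 0.3, 0.3],
    2021: [0.19, 0.26, 0.29, 0.26],
    2022: [0.25, 0.27, 0.27, 0.21],
    2023: [0.27, 0.25, 0.31, 0.17],
}

def get_probabilities(year):
    # A missing year raises a clear error, mirroring the
    # ValueError branch in the original loop:
    try:
        return year_probabilities[year]
    except KeyError:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")

print(get_probabilities(2022))  # → [0.25, 0.27, 0.27, 0.21]
```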

Saving this table to our NVCU database#

(This operation would also create our database file if it didn’t exist already.)

df_curr_enrollment.to_sql(
    'curr_enrollment', 
    con=e, if_exists='replace', index=False)
16384

To demonstrate that the above operation was successful, we can read in a copy of this table via pd.read_sql():

pd.read_sql('curr_enrollment', con=e).head()
first_name last_name gender ... class_of level level_for_sorting
0 Ashley Jordan F ... 2024 Se 3
1 Lisa Dunn F ... 2024 Se 3
2 Alexandra Woods F ... 2024 Se 3
3 Kathleen Carter F ... 2024 Se 3
4 Donna Brewer F ... 2024 Se 3

5 rows × 11 columns

This same table can also be saved as a standalone .csv file (thus making it easier to examine via a spreadsheet editor):

df_curr_enrollment.to_csv('curr_enrollment.csv', index = False)
df_curr_enrollment.to_csv(
    '../Data_Retrieval/curr_enrollment_tab_separated.csv', 
    index=False, sep='\t')

You may also save it as an .xlsx file if you'd prefer. (Note that the openpyxl library is required for the following line to run successfully.)

df_curr_enrollment.to_excel(
    '../Data_Retrieval/curr_enrollment.xlsx', index=False)

Creating a survey results table#

This table will store fall and spring student survey results for the most recent school year. We'll configure our results so that certain groups (e.g. freshmen, seniors, and students in the STL and STM colleges) report higher score increases over time than do others.

Creating fall survey results#

def limit_results(df, column, min, max):
    '''This function restricts the values within a particular DataFrame
    column to a range provided by the user.
    
    df: the DataFrame to modify.
    
    column: the column within df whose values will be restricted.
    
    min and max: the minimum and maximum values, respectively, to allow
    within this column.
    '''
    df[column] = np.where(
        df[column] < min, min, df[column])
    df[column] = np.where(
        df[column] > max, max, df[column])
# Initializing the table as a copy of selected columns from 
# df_curr_enrollment:
df_fall_survey = df_curr_enrollment.copy()[
['matriculation_year', 'matriculation_number',
 'student_id', 'college', 'level']]
df_fall_survey['starting_year'] = 2023 # Other years could be stored
# within this table as well.
df_fall_survey['season'] = 'Fall'
# We'll use rng.normal(), which produces random numbers that follow a 
# normal distribution, to initialize our fall results.
# See https://numpy.org/doc/stable/reference/random/generated/numpy.
# random.Generator.normal.html
df_fall_survey['score'] = rng.normal(
    loc=70, scale=10, size=student_count).astype('int')
# calling limit_results to restrict the scores to the range [0, 100]:
limit_results(df_fall_survey, 'score', 0, 100)
df_fall_survey.head()
matriculation_year matriculation_number student_id ... starting_year season score
0 2020 1 2020-1 ... 2023 Fall 88
1 2020 2 2020-2 ... 2023 Fall 37
2 2020 3 2020-3 ... 2023 Fall 54
3 2020 4 2020-4 ... 2023 Fall 56
4 2020 5 2020-5 ... 2023 Fall 77

5 rows × 8 columns

Confirming that the minimum and maximum ‘score’ values within df_fall_survey lie within the range [0, 100]:

df_fall_survey['score'].min(), df_fall_survey['score'].max()
(np.int64(30), np.int64(100))
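Incidentally, the same clamping that limit_results() performs could also be accomplished with pandas' built-in Series.clip() method. A sketch using a toy Series:

```python
import pandas as pd

scores = pd.Series([-5, 42, 117])
# clip() bounds values to [0, 100], matching the effect of the
# two np.where() calls within limit_results():
print(scores.clip(lower=0, upper=100).tolist())  # → [0, 42, 100]
```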

Creating spring results#

In order to create these results, we’ll first create a copy of our fall results table. Next, we’ll calculate our spring results by creating a score_change variable that defaults to 10; adjusting this variable up or down based on students’ college membership and level; and then adding the product of this variable and rng.random()* to a student’s fall score.

We’ll use .iloc[] to make these updates. An alternative would be df.at[], but because .iloc[] selects by position rather than by label, the approach shown in this cell works regardless of which index labels are in use.

*rng.random() produces a float in the range [0, 1). See the numpy documentation for more details. The use of rng.random() allows for more variation in individual fall-to-spring changes.

df_spring_survey = df_fall_survey.copy().rename(
    columns={'score':'fall_score'}).replace('Fall','Spring')
df_spring_survey['score'] = np.nan
score_col = df_spring_survey.columns.get_loc('score')
for i in range(len(df_spring_survey)):
    score_change = 10
    college = df_spring_survey.iloc[i]['college']
    level = df_spring_survey.iloc[i]['level']
    if college in ['STL', 'STM']:
        score_change += 20
    if level in ['So', 'Ju']:
        score_change -= 10
    # Using .iloc to make our updates:
    # This approach prevents 'setting with copy' warnings from being 
    # displayed. Note the use of get_loc to determine which column index 
    # value to incorporate into .iloc.
    df_spring_survey.iloc[i, score_col] = (
        df_spring_survey.iloc[i]['fall_score'] - 5 
        + rng.random() * score_change)
    
limit_results(df_spring_survey, 'score', 0, 100)
df_spring_survey['score'] = df_spring_survey['score'].astype('int')
df_spring_survey.head()
matriculation_year matriculation_number student_id ... season fall_score score
0 2020 1 2020-1 ... Spring 88 86
1 2020 2 2020-2 ... Spring 37 35
2 2020 3 2020-3 ... Spring 54 54
3 2020 4 2020-4 ... Spring 56 57
4 2020 5 2020-5 ... Spring 77 74

5 rows × 9 columns

A pivot table of results by college and level reveals the variation in mean scores created by the above for loop.

df_spring_survey.pivot_table(
    index=['college', 'level'], 
    values='score', aggfunc='mean').reset_index()
college level score
0 STB Fr 69.177235
1 STB Ju 64.950769
... ... ... ...
14 STM Se 79.309831
15 STM So 73.415690

16 rows × 3 columns

Combining fall and spring results into the same table#

We’ll use a ‘long’ format for this table. In other words, rather than show fall and spring results side by side for each student (an example of a ‘wide’-formatted table), we’ll list these results as separate rows. I consider this setup to be more realistic: in real life, this table would likely be created by stacking individual sets of survey results on top of one another.

The only identifier that we’ll preserve is the student ID, as this will be sufficient for linking this dataset with any other table (such as our current enrollment dataset) that also contains this key.

df_survey = pd.concat(
    [df_fall_survey, df_spring_survey]).reset_index()[
['student_id', 'starting_year', 'season', 'score']]
df_survey.head()
student_id starting_year season score
0 2020-1 2023 Fall 88
1 2020-2 2023 Fall 37
2 2020-3 2023 Fall 54
3 2020-4 2023 Fall 56
4 2020-5 2023 Fall 77
df_survey.to_sql('survey_results', if_exists='replace', 
                 index=False, con=e)
df_survey.to_csv('survey_results.csv', index=False)
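If a 'wide' view of these results were ever needed, the long table could be pivoted so that fall and spring scores sit side by side. A sketch using a toy long-format table (hypothetical values, not the actual survey data):

```python
import pandas as pd

df_long = pd.DataFrame({
    'student_id': ['2020-1', '2020-1', '2020-2', '2020-2'],
    'season': ['Fall', 'Spring', 'Fall', 'Spring'],
    'score': [88, 86, 37, 35]})
# pivot() turns each unique season value into its own column:
df_wide = df_long.pivot(
    index='student_id', columns='season', values='score')
print(df_wide.loc['2020-1', 'Fall'])  # → 88
```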

Creating a set of winter results that can be incorporated into a data cleaning/reformatting script#

In order to demonstrate how the Pandas library can be used to clean and reformat data, I’ll now create a winter survey results table that contains several issues, including:

  1. Column names that differ from those in the fall/spring results

  2. Different data formats

  3. A missing column

  4. Duplicate values

  5. Missing values for certain students

  6. Results spread over 16 separate files (one for each school/level pair)

These issues will be addressed within the data_cleaning_and_reformatting.ipynb script within PFN’s Data Prep section.

df_winter_survey = df_fall_survey.copy()
df_winter_survey['SEASON'] = 'W'
# Initializing winter survey results as slightly lower copies of fall 
# ones, then converting them to a different data format:
# (Converting the results to integers, then strings prevents the original
# decimal values from getting added to the final output.)
df_winter_survey['SURVEY_SCORE'] = (
    df_winter_survey['score'] 
    + (rng.random(size=student_count) * -10)).astype(
        'int').astype('str')+'.0%'
df_winter_survey['STARTINGYR'] = df_winter_survey['starting_year'] - 2000
df_winter_survey['MATRICYR'] = (
    df_winter_survey['matriculation_year'] - 2000).astype('str')
df_winter_survey.rename(columns={'matriculation_number':'MATRIC#'}, 
                        inplace=True)
df_winter_survey.drop(
    ['matriculation_year', 'student_id', 'starting_year',
     'season','score'], axis=1, inplace=True)
# Removing 15% of students from the results via df.sample:      
df_winter_survey = df_winter_survey.sample(frac=0.85).copy()
# See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.sample.html
# using df.sample() to add duplicate records for 5% of the students present
# within the sample:
df_winter_survey = pd.concat(
    [df_winter_survey, df_winter_survey.sample(frac=0.05)])
df_winter_survey.reset_index(drop=True,inplace=True)
df_winter_survey.head()
MATRIC# college level SEASON SURVEY_SCORE STARTINGYR MATRICYR
0 1751 STC So W 58.0% 23 22
1 2028 STL So W 59.0% 23 22
2 2594 STL Se W 49.0% 23 20
3 1733 STL Se W 56.0% 23 20
4 2088 STM Fr W 62.0% 23 23

Dividing these results into 16 individual datasets (one for each college/level pair), then saving those results into the Data Prep section of PFN:

(The ‘college’ and ‘level’ fields within df_winter_survey were only necessary for filtering data and creating college- and level-based filenames, so they don’t need to be included in the .csv copies of the data.)

for college in df_winter_survey['college'].unique():
    for level in df_winter_survey['level'].unique():
        df_results = df_winter_survey.query(
            "college == @college & level == @level").copy().drop(
            ['college', 'level'], axis=1)
        df_results.to_csv(
            f'../Data_Prep/winter_results/{college}_{level}_results.csv', 
            index=False)

Creating a dining transactions table#

I created this table in order to illustrate the ‘average of averages’ problem within Descriptive Stats: Part 2. It currently shows only four items for each transaction: (1) the amount of money spent; (2) the abbreviated level (Fr for freshman, So for sophomore, etc.) of the diner; (3) the starting school year in which the transaction was made; and (4) the weekday of the transaction.

The table is designed so that younger levels will have more entries than older levels, but older levels will spend higher amounts, on average, per transaction.

dining_transaction_count = 22453
# Initializing the table:
# Note the use of rng.choice() to make younger-level transactions
# more common than older-level ones and mid-week transactions 
# more common than weekend ones.
# This set of code uses rng.random() to initialize each transaction as a 
# number between $0 and $20; however, these numbers will get adjusted
# later in this cell.
df_dining_transactions = pd.DataFrame(
    index=np.arange(dining_transaction_count),
    data={'starting_year':2023,
            'weekday':rng.choice(
        ['Su', 'Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa'],
        p=[0.05, 0.16, 0.23, 0.24, 0.18, 0.11, 0.03], 
                size=dining_transaction_count),
            'level':rng.choice(
        ['Fr', 'So', 'Ju', 'Se'],
        p=[0.55, 0.21, 0.15, 0.09], size=dining_transaction_count),
           'base_charge':[
               rng.random() * 20 for i in range(
                   dining_transaction_count)]})

# Creating a 'multiplier' column that will cause average transaction
# amounts for older levels to exceed those of younger levels:
df_dining_transactions['multiplier'] = df_dining_transactions[
'level'].map(
    {'Fr':1, 'So':1.3, 'Ju': 1.7, 'Se':2.1})
df_dining_transactions['amount'] = np.round(
    df_dining_transactions['base_charge'] 
    * df_dining_transactions['multiplier'], 2)

# Removing columns that are no longer needed:
df_dining_transactions.drop(
    ['base_charge', 'multiplier'], axis=1, inplace=True)
df_dining_transactions
starting_year weekday level amount
0 2023 We Fr 13.36
1 2023 Tu Se 12.22
... ... ... ... ...
22451 2023 Sa Fr 13.25
22452 2023 Mo Fr 1.95

22453 rows × 4 columns

df_dining_transactions.to_sql('dining_transactions', if_exists='replace', 
                 index=False, con=e)
df_dining_transactions.to_csv('dining_transactions.csv', index=False)
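To preview the 'average of averages' problem this table was built to illustrate: when group sizes are unequal, the mean of per-group means generally differs from the overall mean. A toy sketch (hypothetical values, not the NVCU dining data):

```python
import pandas as pd

# Three small freshman transactions and one large senior transaction:
df = pd.DataFrame({'level': ['Fr', 'Fr', 'Fr', 'Se'],
                   'amount': [10.0, 10.0, 10.0, 30.0]})
# The overall mean weights every transaction equally:
overall_mean = df['amount'].mean()  # (10+10+10+30)/4 = 15.0
# The mean of per-level means weights each *level* equally instead:
mean_of_means = df.groupby('level')['amount'].mean().mean()  # (10+30)/2 = 20.0
print(overall_mean, mean_of_means)  # → 15.0 20.0
```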

Creating a table of fall and spring bookstore sales by student for a given year:#

(This table will play a key role in the Regressions section of Python for Nonprofits.)

We’ll initialize this table as a subset of df_curr_enrollment, as certain demographic items will play a role in students’ spring purchase totals.

df_sales = df_curr_enrollment[
['student_id', 'gender', 'college', 'level']].copy()

# Specifying an RNG seed: (I sometimes use the time of day, in HHMMSS
# format, as a basis for a seed value.)
rng = np.random.default_rng(seed=225403)

# Fall sales will be normally distributed for all students:
df_sales['Fall'] = rng.normal(
    loc=80, scale=25, size=len(df_sales))

# Spring sales will be higher than fall sales by default. The following
# code specifies these changes using a normal distribution, as I found 
# that doing so helped make my regression residuals more normally 
# distributed also.
spring_change = rng.normal(loc=11, scale=25, size=len(df_sales))
df_sales['Spring'] = df_sales['Fall'] + spring_change
# Modifying Spring totals based on demographic components:
spring_col = df_sales.columns.get_loc('Spring')
for i in range(len(df_sales)):
    # Unhealthy snacks were removed from the checkout aisle; 
    # this ended up reducing revenue among freshmen and sophomores
    # (who particularly liked these snacks).
    if df_sales.iloc[i]['level'] in ['Fr', 'So']:
        df_sales.iloc[i, spring_col] = (
            df_sales.iloc[i, spring_col] + rng.normal(
                loc=-21, scale=3))
    # An intensive marketing campaign was carried out at STM and STL;
    # if it ended up being successful, it would then be implemented
    # at the other colleges also.
    if df_sales.iloc[i]['college'] in ['STM', 'STL']:
        df_sales.iloc[i, spring_col] = (
            df_sales.iloc[i, spring_col] + rng.normal(
                loc=9, scale=3))

# I'll leave in negative Fall and Spring values, as they could be 
# explained by refunds.

df_sales['Fall'] = df_sales['Fall'].round(2)
df_sales['Spring'] = df_sales['Spring'].round(2)
df_sales['Fall_Spring_Change'] = df_sales['Spring'] - df_sales['Fall']

df_sales.to_sql('bookstore_sales', if_exists='replace', 
                 index=False, con=e)
df_sales.to_csv('bookstore_sales.csv', index=False)

df_sales
student_id gender college level Fall Spring Fall_Spring_Change
0 2020-1 F STC Se 66.80 58.24 -8.56
1 2020-2 F STM Se 104.67 151.90 47.23
... ... ... ... ... ... ... ...
16382 2023-5442 M STB Fr 100.52 38.29 -62.23
16383 2023-5443 M STM Fr 57.45 29.43 -28.02

16384 rows × 7 columns

That’s it for this script! We’ve now created all of the fictional data we’ll need for the various sections of Python for Nonprofits.