NVCU Database Generator#
By Kenneth Burchfiel
This script creates a SQLite database for the (fictional) Northern Virginia Catholic University that will be referenced within other sections of Python for Nonprofits. All of the data within this database is fictional as well.
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
# Setting up random number generation capabilities:
rng = np.random.default_rng(2325)
# Based on https://numpy.org/doc/stable/reference/random/generator.html
# The faker library is a great tool for creating fictional
# database records. I set the locale to 'en_US' because NVCU
# is located in the United States.
from faker import Faker
fake = Faker('en_US')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns=7,
display_max_rows=5)
Connecting to our NVCU database via the SQLAlchemy library#
(This code will work even if no database currently exists at the path shown below.)
e = create_engine('sqlite:///nvcu_db.db')
# Based on: https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#pysqlite
student_count = 2**14 # The university will have 16,384 students.
student_count
16384
Creating a current enrollment table#
Creating lists of names#
We’ll use the Faker library to create equal numbers of female and male first names, then concatenate these lists to create a single list of first names.
(Note: The Faker documentation for the ‘en_US’ (US English) locale (https://faker.readthedocs.io/en/master/locales/en_US.html) was a useful resource in writing this code.)
student_count
# It's convenient to also create a variable equal to half of the
# student count. (We'll use floor division (//), which returns an int
# rather than a float, so that this value can be passed directly to
# range() within list comprehensions.)
half_student_count = student_count // 2
half_student_count
8192
female_first_names = [
fake.first_name_female()
for i in range(half_student_count)]
male_first_names = [
fake.first_name_male() for i in range(half_student_count)]
first_names = female_first_names + male_first_names
last_names = [
fake.last_name() for i in range(student_count)]
female_first_names[0:5], male_first_names[0:5]
(['Ashley', 'Lisa', 'Alexandra', 'Cynthia', 'Mary'],
['Charles', 'Ryan', 'Miguel', 'Brian', 'Curtis'])
# In order to make our genders match our names, we'll
# make the first half of our gender list female and the second
# half male (as the first and second halves of our first names
# list contain female and male names, respectively).
genders = (['F' for i in range(half_student_count)]
+ ['M' for i in range(half_student_count)])
Initializing our current enrollment table#
df_curr_enrollment = pd.DataFrame(
index=np.arange(0,student_count),
data = {'first_name':first_names,
'last_name':last_names,
'gender':genders})
df_curr_enrollment.head()
first_name | last_name | gender | |
---|---|---|---|
0 | Ashley | Jordan | F |
1 | Lisa | Dunn | F |
2 | Alexandra | Woods | F |
3 | Cynthia | Lozano | F |
4 | Mary | Kemp | F |
Creating matriculation years:#
In order to simulate increasing enrollment over time, weights are passed to the rng.choice() call (via its p argument) so that recent years appear more frequently.
rng.choice(
[2020, 2021, 2022, 2023],
p=[0.20, 0.22, 0.25, 0.33], size=student_count)
array([2023, 2023, 2020, ..., 2023, 2022, 2023], shape=(16384,))
df_curr_enrollment['matriculation_year'] = rng.choice(
[2020, 2021, 2022, 2023], p=[0.20, 0.22, 0.25, 0.33],
size=student_count)
# See https://numpy.org/doc/stable/reference/random/generated/numpy.
# random.Generator.choice.html
df_curr_enrollment.head()
first_name | last_name | gender | matriculation_year | |
---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 |
1 | Lisa | Dunn | F | 2020 |
2 | Alexandra | Woods | F | 2020 |
3 | Cynthia | Lozano | F | 2022 |
4 | Mary | Kemp | F | 2023 |
Creating student IDs#
Student IDs will use the format matriculation_year-matriculation_number. matriculation_number represents the order in which students enrolled in a given year; these numbers are unique within each year, but not between years. Combining each number with the student's matriculation year therefore produces a unique ID.
# Calculating matriculation numbers by grouping the DataFrame by
# matriculation year, then assigning each student within each year a unique
# number:
# (This can be achieved via df.groupby() and df.rank(). See
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.groupby.html
# and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.rank.html )
df_curr_enrollment['matriculation_number'] = df_curr_enrollment.groupby(
'matriculation_year')['matriculation_year'].rank(
method='first').astype('int')
df_curr_enrollment.head()
first_name | last_name | gender | matriculation_year | matriculation_number | |
---|---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 | 1 |
1 | Lisa | Dunn | F | 2020 | 2 |
2 | Alexandra | Woods | F | 2020 | 3 |
3 | Cynthia | Lozano | F | 2022 | 1 |
4 | Mary | Kemp | F | 2023 | 1 |
# Adding matriculation years, matriculation numbers, and hyphens together
# to create student IDs:
df_curr_enrollment['student_id'] = (
df_curr_enrollment['matriculation_year'].astype('str')
+ '-'
+ df_curr_enrollment['matriculation_number'].astype('str'))
# Sorting the DataFrame by matriculation year and matriculation number:
# (The DataFrame could also be sorted by student_id, but because these
# numbers would be interpreted as strings, ids like 2020-10 would appear
# in front of ones like 2020-2.)
df_curr_enrollment.sort_values(
['matriculation_year', 'matriculation_number'], inplace=True)
df_curr_enrollment.reset_index(drop=True,inplace=True)
df_curr_enrollment.head()
first_name | last_name | gender | matriculation_year | matriculation_number | student_id | |
---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 | 1 | 2020-1 |
1 | Lisa | Dunn | F | 2020 | 2 | 2020-2 |
2 | Alexandra | Woods | F | 2020 | 3 | 2020-3 |
3 | Kathleen | Carter | F | 2020 | 4 | 2020-4 |
4 | Donna | Brewer | F | 2020 | 5 | 2020-5 |
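To see why sorting on the student_id strings directly would misorder rows, consider a quick hypothetical example:

```python
# String comparison works character by character, so '2020-10' sorts
# before '2020-2' ('1' < '2' at the sixth character):
ids = ['2020-2', '2020-10', '2020-1']
print(sorted(ids))  # ['2020-1', '2020-10', '2020-2']
```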
We’ll also use matriculation years as a basis for dates of birth. (These date of birth calculations aren’t the most realistic, but they are good enough for use within our fictional dataset.)
df_curr_enrollment['birth_month'] = rng.integers(
low=1, high=13, size=student_count).astype('str')
df_curr_enrollment['birth_day'] = rng.integers(
low=1, high=29, size=student_count).astype('str')
# Adding leading zeros to month and date values:
for column in ['birth_month', 'birth_day']:
    df_curr_enrollment[column] = df_curr_enrollment[column].str.zfill(2)
# In order to prevent a non-existent date (e.g. February 31) from getting
# stored in our results, we'll limit our birthday integers to the
# range [1, 28].
df_curr_enrollment['birth_year'] = (df_curr_enrollment[
'matriculation_year'] - 18).astype('str')
# Combining these columns together to create date of birth values in
# YYYY-MM-DD format:
df_curr_enrollment['date_of_birth'] = (
df_curr_enrollment['birth_year'] + '-'
+ df_curr_enrollment['birth_month'] + '-'
+ df_curr_enrollment['birth_day'])
# Now that we have our date_of_birth column in place, we no longer need
# the columns that stored individual parts of those birthdays:
df_curr_enrollment.drop(
['birth_month', 'birth_day',
'birth_year'], axis=1, inplace=True)
df_curr_enrollment
first_name | last_name | gender | matriculation_year | matriculation_number | student_id | date_of_birth | |
---|---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 | 1 | 2020-1 | 2002-12-16 |
1 | Lisa | Dunn | F | 2020 | 2 | 2020-2 | 2002-09-26 |
... | ... | ... | ... | ... | ... | ... | ... |
16382 | Andrew | Copeland | M | 2023 | 5442 | 2023-5442 | 2005-11-03 |
16383 | Aaron | Thomas | M | 2023 | 5443 | 2023-5443 | 2005-04-05 |
16384 rows × 7 columns
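Because the birth-day integers were capped at 28, every assembled YYYY-MM-DD string should be a real calendar date. The following sketch (which mirrors the month/day logic above on a small hypothetical sample) demonstrates that guarantee:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.Series(rng.integers(1, 13, size=5).astype(str)).str.zfill(2)
days = pd.Series(rng.integers(1, 29, size=5).astype(str)).str.zfill(2)
dates = '2005-' + months + '-' + days
# pd.to_datetime() would raise a ValueError if any date were invalid
# (e.g. February 31), so this call doubles as a validity check:
parsed = pd.to_datetime(dates)
print(parsed.notna().all())  # True
```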
Assigning students to different colleges#
(I used Wikipedia’s ‘List of patron saints by occupation and activity’ page to determine the saint after which each college would be named.)
NVCU has four different colleges:
St. Luke’s, a humanities college. (St. Luke is one of the patron saints of artists.) Abbreviation: STL
St. Benedict’s, a STEM college. (St. Benedict is one of the patron saints of engineers.) Abbreviation: STB
St. Matthew’s, a business college. (St. Matthew is the patron saint of accountants.) Abbreviation: STM
St. Catherine’s, a health sciences college. (St. Catherine of Alexandria is one of the patron saints of nurses.) Abbreviation: STC
We can use rng.choice() to assign students to different colleges. However, to make the data more interesting, we'll have one college (STL) increase in popularity over time; another (STC) decrease in popularity; and the two remaining colleges stay roughly constant. We can simulate these changes by (1) creating filtered versions of the DataFrame for each year; (2) calling rng.choice() with a different probability set for each year in order to create the 'college' column; and (3) recreating df_curr_enrollment by concatenating these year-specific DataFrames back together.
df_list = []
for year in df_curr_enrollment['matriculation_year'].unique():
    print(f"Now adding in college enrollments for {year}.")
    df = df_curr_enrollment.query("matriculation_year == @year").copy()
    if year == 2020:
        probabilities = [0.15, 0.25, 0.3, 0.3]
    elif year == 2021:
        probabilities = [0.19, 0.26, 0.29, 0.26]
    elif year == 2022:
        probabilities = [0.25, 0.27, 0.27, 0.21]
    elif year == 2023:
        probabilities = [0.27, 0.25, 0.31, 0.17]
    else:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")
    df['college'] = rng.choice(
        ['STL', 'STB', 'STM', 'STC'],
        p=probabilities, size=len(df))
    df_list.append(df)
df_curr_enrollment = pd.concat(df_list)
df_curr_enrollment.head()
Now adding in college enrollments for 2020.
Now adding in college enrollments for 2021.
Now adding in college enrollments for 2022.
Now adding in college enrollments for 2023.
first_name | last_name | gender | ... | student_id | date_of_birth | college | |
---|---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | ... | 2020-1 | 2002-12-16 | STC |
1 | Lisa | Dunn | F | ... | 2020-2 | 2002-09-26 | STM |
2 | Alexandra | Woods | F | ... | 2020-3 | 2002-05-18 | STC |
3 | Kathleen | Carter | F | ... | 2020-4 | 2002-11-15 | STC |
4 | Donna | Brewer | F | ... | 2020-5 | 2002-06-21 | STM |
5 rows × 8 columns
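The if/elif chain in the loop above could also be written as a dictionary lookup, which makes adding a new year a one-line change. A minimal sketch under the same probability sets (get_probabilities() is a hypothetical helper, not part of the script above):

```python
year_probabilities = {
    2020: [0.15, 0.25, 0.3, 0.3],
    2021: [0.19, 0.26, 0.29, 0.26],
    2022: [0.25, 0.27, 0.27, 0.21],
    2023: [0.27, 0.25, 0.31, 0.17]}

def get_probabilities(year):
    '''Looks up the college-choice probabilities for a given year,
    raising a descriptive error for years that haven't been added yet.'''
    try:
        return year_probabilities[year]
    except KeyError:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")

print(get_probabilities(2022))  # [0.25, 0.27, 0.27, 0.21]
```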
The commented-out cell below shows an alternative approach to assigning colleges to each student. Because it iterates through each row in the DataFrame, it took 2.25 seconds to run on my laptop versus 0.016 seconds for the method shown above; in other words, the method in the previous cell was 133 times faster.
# df_curr_enrollment['college'] = ''
# college_col = df_curr_enrollment.columns.get_loc('college')
# for i in range(len(df_curr_enrollment)):
#     year = df_curr_enrollment.iloc[i]['matriculation_year']
#     if year == 2020:
#         probabilities = [0.15, 0.25, 0.3, 0.3]
#     elif year == 2021:
#         probabilities = [0.19, 0.26, 0.29, 0.26]
#     elif year == 2022:
#         probabilities = [0.25, 0.27, 0.27, 0.21]
#     elif year == 2023:
#         probabilities = [0.27, 0.25, 0.31, 0.17]
#     df_curr_enrollment.iloc[i, college_col] = rng.choice(
#         ['STL', 'STB', 'STM', 'STC'], p=probabilities)
Saving this table to our NVCU database#
(This operation would also create our database file if it didn’t exist already.)
df_curr_enrollment.to_sql(
'curr_enrollment',
con=e, if_exists='replace', index=False)
16384
To demonstrate that the above operation was successful, we can read in a copy of this table via pd.read_sql():
pd.read_sql('curr_enrollment', con=e).head()
first_name | last_name | gender | ... | class_of | level | level_for_sorting | |
---|---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | ... | 2024 | Se | 3 |
1 | Lisa | Dunn | F | ... | 2024 | Se | 3 |
2 | Alexandra | Woods | F | ... | 2024 | Se | 3 |
3 | Kathleen | Carter | F | ... | 2024 | Se | 3 |
4 | Donna | Brewer | F | ... | 2024 | Se | 3 |
5 rows × 11 columns
This same table can also be saved as a standalone .csv file (thus making it easier to examine via a spreadsheet editor):
df_curr_enrollment.to_csv('curr_enrollment.csv', index = False)
df_curr_enrollment.to_csv(
'../Data_Retrieval/curr_enrollment_tab_separated.csv',
index=False, sep='\t')
You may also save it as an .xlsx file if you'd prefer. (Note that the openpyxl library must be installed for the following line to run successfully.)
df_curr_enrollment.to_excel(
'../Data_Retrieval/curr_enrollment.xlsx', index=False)
Creating a survey results table#
This table will store fall and spring student survey results for the most recent school year. We'll configure our results so that certain groups (e.g. freshmen, seniors, and students in the STL and STM colleges) report larger score increases over time than others do.
Creating fall survey results#
def limit_results(df, column, min_value, max_value):
    '''This function restricts the values within a particular DataFrame
    column to a range provided by the user.

    df: the DataFrame to modify.
    column: the column within df whose values will be restricted.
    min_value and max_value: the minimum and maximum values,
    respectively, to allow within this column.
    '''
    df[column] = np.where(
        df[column] < min_value, min_value, df[column])
    df[column] = np.where(
        df[column] > max_value, max_value, df[column])
# Initializing the table as a copy of selected columns from
# df_curr_enrollment:
df_fall_survey = df_curr_enrollment.copy()[
['matriculation_year', 'matriculation_number',
'student_id', 'college', 'level']]
df_fall_survey['starting_year'] = 2023 # Other years could be stored
# within this table as well.
df_fall_survey['season'] = 'Fall'
# We'll use rng.normal(), which produces random numbers that follow a
# normal distribution, to initialize our fall results.
# See https://numpy.org/doc/stable/reference/random/generated/numpy.
# random.Generator.normal.html
df_fall_survey['score'] = rng.normal(
loc=70, scale=10, size=student_count).astype('int')
# Calling limit_results() to restrict the scores to the range [0, 100]:
limit_results(df_fall_survey, 'score', 0, 100)
df_fall_survey.head()
matriculation_year | matriculation_number | student_id | ... | starting_year | season | score | |
---|---|---|---|---|---|---|---|
0 | 2020 | 1 | 2020-1 | ... | 2023 | Fall | 88 |
1 | 2020 | 2 | 2020-2 | ... | 2023 | Fall | 37 |
2 | 2020 | 3 | 2020-3 | ... | 2023 | Fall | 54 |
3 | 2020 | 4 | 2020-4 | ... | 2023 | Fall | 56 |
4 | 2020 | 5 | 2020-5 | ... | 2023 | Fall | 77 |
5 rows × 8 columns
Confirming that the minimum and maximum ‘score’ values within df_fall_survey lie within the range [0, 100]:
df_fall_survey['score'].min(), df_fall_survey['score'].max()
(np.int64(30), np.int64(100))
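pandas also offers a built-in alternative to limit_results(): Series.clip(), which performs the same restriction in a single call. A quick sketch:

```python
import pandas as pd

scores = pd.Series([-12, 45, 78, 130])
# clip() replaces values below `lower` with `lower` and values above
# `upper` with `upper`, just like our pair of np.where() calls:
print(scores.clip(lower=0, upper=100).tolist())  # [0, 45, 78, 100]
```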
Creating spring results#
In order to create these results, we'll first create a copy of our fall results table. Next, we'll calculate our spring results by creating a score_change variable that defaults to 10; adjusting this variable up or down based on students' college membership and level; and then subtracting 5 from each student's fall score and adding the product of score_change and rng.random()* to it.
We'll use .iloc[] to make these updates. An alternative would be df.at[], but .at requires index labels, whereas the positional .iloc approach shown in this cell works regardless of which labels are present.
*rng.random() produces a float in the range [0, 1); see the NumPy documentation for more details. Its use here allows for more variation in individual fall-to-spring changes.
df_spring_survey = df_fall_survey.copy().rename(
columns={'score':'fall_score'}).replace('Fall','Spring')
df_spring_survey['score'] = np.nan
score_col = df_spring_survey.columns.get_loc('score')
for i in range(len(df_spring_survey)):
    score_change = 10
    college = df_spring_survey.iloc[i]['college']
    level = df_spring_survey.iloc[i]['level']
    if college in ['STL', 'STM']:
        score_change += 20
    if level in ['So', 'Ju']:
        score_change -= 10
    # Using .iloc to make our updates:
    # This approach prevents 'setting with copy' warnings from being
    # displayed. Note the use of get_loc to determine which column index
    # value to incorporate into .iloc.
    df_spring_survey.iloc[i, score_col] = (df_spring_survey.iloc[
        i]['fall_score'] - 5 + rng.random() * score_change)
limit_results(df_spring_survey, 'score', 0, 100)
df_spring_survey['score'] = df_spring_survey['score'].astype('int')
df_spring_survey.head()
matriculation_year | matriculation_number | student_id | ... | season | fall_score | score | |
---|---|---|---|---|---|---|---|
0 | 2020 | 1 | 2020-1 | ... | Spring | 88 | 86 |
1 | 2020 | 2 | 2020-2 | ... | Spring | 37 | 35 |
2 | 2020 | 3 | 2020-3 | ... | Spring | 54 | 54 |
3 | 2020 | 4 | 2020-4 | ... | Spring | 56 | 57 |
4 | 2020 | 5 | 2020-5 | ... | Spring | 77 | 74 |
5 rows × 9 columns
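For larger tables, the per-row loop above could be replaced with vectorized boolean masks, which compute every student's score_change at once. A sketch under the same adjustment rules (the toy DataFrame below is hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'fall_score': [70, 60, 80],
                   'college': ['STL', 'STB', 'STM'],
                   'level': ['Fr', 'Ju', 'Se']})
# Start each change at 10, then adjust by college and level:
score_change = np.full(len(df), 10.0)
score_change += np.where(df['college'].isin(['STL', 'STM']), 20, 0)
score_change -= np.where(df['level'].isin(['So', 'Ju']), 10, 0)
df['score'] = (df['fall_score'] - 5
               + rng.random(len(df)) * score_change)
```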
A pivot table of results by college and level reveals the variation in mean scores created by the above for loop.
df_spring_survey.pivot_table(
index=['college', 'level'],
values='score', aggfunc='mean').reset_index()
college | level | score | |
---|---|---|---|
0 | STB | Fr | 69.177235 |
1 | STB | Ju | 64.950769 |
... | ... | ... | ... |
14 | STM | Se | 79.309831 |
15 | STM | So | 73.415690 |
16 rows × 3 columns
Combining fall and spring results into the same table#
We’ll use a ‘long’ format for this table. In other words, rather than show fall and spring results side by side for each student (an example of a ‘wide’-formatted table), we’ll list these results as separate rows. I consider this setup to be more realistic: in real life, this table would likely be created by stacking individual sets of survey results on top of one another.
The only identifier that we’ll preserve is the student ID, as this will be sufficient for linking this dataset with any other table (such as our current enrollment dataset) that also contains this key.
df_survey = pd.concat(
[df_fall_survey, df_spring_survey]).reset_index()[
['student_id', 'starting_year', 'season', 'score']]
df_survey.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
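When a side-by-side comparison is needed later, a long table like this one can be pivoted back into a wide format. A minimal sketch using df.pivot() on hypothetical rows:

```python
import pandas as pd

df_long = pd.DataFrame({
    'student_id': ['2020-1', '2020-2', '2020-1', '2020-2'],
    'season': ['Fall', 'Fall', 'Spring', 'Spring'],
    'score': [88, 37, 86, 35]})
# Each student becomes one row, with one column per season:
df_wide = df_long.pivot(
    index='student_id', columns='season', values='score')
print(df_wide.loc['2020-1', 'Fall'])  # 88
```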
df_survey.to_sql('survey_results', if_exists='replace',
index=False, con=e)
df_survey.to_csv('survey_results.csv', index=False)
Creating a set of winter results that can be incorporated into a data cleaning/reformatting script#
In order to demonstrate how the Pandas library can be used to clean and reformat data, I’ll now create a winter survey results table that contains several issues, including:
Column names that differ from those in the fall/spring results
Different data formats
A missing column
Duplicate values
Missing values for certain students
Results spread over 16 separate files (one for each college/level pair)
These issues will be addressed within the data_cleaning_and_reformatting.ipynb script within PFN’s Data Prep section.
df_winter_survey = df_fall_survey.copy()
df_winter_survey['SEASON'] = 'W'
# Initializing winter survey results as slightly lower copies of fall
# ones, then converting them to a different data type:
# (Converting the results to integers, then to strings, prevents the
# original decimal values from appearing in the final output.)
df_winter_survey['SURVEY_SCORE'] = (
    df_winter_survey['score']
    + (rng.random(size=student_count) * -10)).astype(
        'int').astype('str') + '.0%'
df_winter_survey['STARTINGYR'] = df_winter_survey['starting_year'] - 2000
df_winter_survey['MATRICYR'] = (
df_winter_survey['matriculation_year'] - 2000).astype('str')
df_winter_survey.rename(columns={'matriculation_number':'MATRIC#'},
inplace=True)
df_winter_survey.drop(
['matriculation_year', 'student_id', 'starting_year',
'season','score'], axis=1, inplace=True)
# Removing 15% of students from the results via df.sample:
df_winter_survey = df_winter_survey.sample(frac=0.85).copy()
# See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.sample.html
# using df.sample() to add duplicate records for 5% of the students present
# within the sample:
df_winter_survey = pd.concat(
[df_winter_survey, df_winter_survey.sample(frac=0.05)])
df_winter_survey.reset_index(drop=True,inplace=True)
df_winter_survey.head()
MATRIC# | college | level | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|---|---|
0 | 1751 | STC | So | W | 58.0% | 23 | 22 |
1 | 2028 | STL | So | W | 59.0% | 23 | 22 |
2 | 2594 | STL | Se | W | 49.0% | 23 | 20 |
3 | 1733 | STL | Se | W | 56.0% | 23 | 20 |
4 | 2088 | STM | Fr | W | 62.0% | 23 | 23 |
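As a preview of the kind of fix that data_cleaning_and_reformatting.ipynb will apply, the percent-formatted SURVEY_SCORE strings can be converted back to numbers by stripping the trailing '%'. A hypothetical sketch:

```python
import pandas as pd

scores = pd.Series(['58.0%', '59.0%', '49.0%'])
# Remove the trailing '%' and convert the remainder to a float:
cleaned = scores.str.rstrip('%').astype(float)
print(cleaned.tolist())  # [58.0, 59.0, 49.0]
```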
Dividing these results into 16 individual datasets (one for each college/level pair), then saving those results into the Data Prep section of PFN:
(The ‘college’ and ‘level’ fields within df_winter_survey were only necessary for filtering the data and creating college- and level-based filenames, so they don’t need to be included in the .csv copies of the data.)
for college in df_winter_survey['college'].unique():
    for level in df_winter_survey['level'].unique():
        df_results = df_winter_survey.query(
            "college == @college & level == @level").copy().drop(
            ['college', 'level'], axis=1)
        df_results.to_csv(
            f'../Data_Prep/winter_results/{college}_{level}_results.csv',
            index=False)
Creating a dining transactions table#
I created this table in order to illustrate the ‘average of averages’ problem within Descriptive Stats: Part 2. It currently shows only four items for each transaction: (1) the amount of money spent; (2) the abbreviated level (Fr for freshman, So for sophomore, etc.) of the diner; (3) the starting school year in which the transaction was made; and (4) the weekday of the transaction.
The table is designed so that younger levels will have more entries than older levels, but older levels will spend higher amounts, on average, per transaction.
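This design is what produces the 'average of averages' discrepancy: because levels differ in both transaction counts and typical amounts, averaging the per-level averages gives a different answer than averaging all transactions directly. A toy illustration with hypothetical amounts:

```python
import pandas as pd

df = pd.DataFrame({'level': ['Fr', 'Fr', 'Fr', 'Se'],
                   'amount': [5.0, 6.0, 7.0, 20.0]})
# Mean of all four transactions:
overall_mean = df['amount'].mean()  # (5+6+7+20)/4 = 9.5
# Mean of the two per-level means (Fr: 6.0, Se: 20.0):
avg_of_avgs = df.groupby('level')['amount'].mean().mean()  # 13.0
print(overall_mean, avg_of_avgs)
```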
dining_transaction_count = 22453
# Initializing the table:
# Note the use of rng.choice() to make younger-level transactions
# more common than older-level ones and mid-week transactions
# more common than weekend ones.
# This code uses rng.random() to initialize each transaction as a
# number between $0 and $20; however, these amounts will get adjusted
# later in this cell.
df_dining_transactions = pd.DataFrame(
index=np.arange(dining_transaction_count),
data={'starting_year':2023,
'weekday':rng.choice(
['Su', 'Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa'],
p=[0.05, 0.16, 0.23, 0.24, 0.18, 0.11, 0.03],
size=dining_transaction_count),
'level':rng.choice(
['Fr', 'So', 'Ju', 'Se'],
p=[0.55, 0.21, 0.15, 0.09], size=dining_transaction_count),
'base_charge':[
rng.random() * 20 for i in range(
dining_transaction_count)]})
# Creating a 'multiplier' column that will cause average transaction
# amounts for older levels to exceed those of younger levels:
df_dining_transactions['multiplier'] = df_dining_transactions[
'level'].map(
{'Fr':1, 'So':1.3, 'Ju': 1.7, 'Se':2.1})
df_dining_transactions['amount'] = np.round(
df_dining_transactions['base_charge']
* df_dining_transactions['multiplier'], 2)
# Removing columns that are no longer needed:
df_dining_transactions.drop(
['base_charge', 'multiplier'], axis=1, inplace=True)
df_dining_transactions
starting_year | weekday | level | amount | |
---|---|---|---|---|
0 | 2023 | We | Fr | 13.36 |
1 | 2023 | Tu | Se | 12.22 |
... | ... | ... | ... | ... |
22451 | 2023 | Sa | Fr | 13.25 |
22452 | 2023 | Mo | Fr | 1.95 |
22453 rows × 4 columns
df_dining_transactions.to_sql('dining_transactions', if_exists='replace',
index=False, con=e)
df_dining_transactions.to_csv('dining_transactions.csv', index=False)
Creating a table of fall and spring bookstore sales by student for a given year:#
(This table will play a key role in the Regressions section of Python for Nonprofits.)
We’ll initialize this table as a subset of df_curr_enrollment, as certain demographic items will play a role in students’ spring purchase totals.
df_sales = df_curr_enrollment[
['student_id', 'gender', 'college', 'level']].copy()
# Specifying an RNG seed: (I sometimes use the time of day, in HHMMSS
# format, as a basis for a seed value.)
rng = np.random.default_rng(seed=225403)
# Fall sales will be normally distributed for all students:
df_sales['Fall'] = rng.normal(
loc=80, scale=25, size=len(df_sales))
# Spring sales will be higher than fall sales by default. The following
# code specifies these changes using a normal distribution, as I found
# that doing so helped make my regression residuals more normally
# distributed as well.
spring_change = rng.normal(loc=11, scale=25, size=len(df_sales))
df_sales['Spring'] = df_sales['Fall'] + spring_change
# Modifying Spring totals based on demographic components:
spring_col = df_sales.columns.get_loc('Spring')
for i in range(len(df_sales)):
    # Unhealthy snacks were removed from the checkout aisle; this
    # ended up reducing revenue among freshmen and sophomores
    # (who particularly liked these snacks).
    if df_sales.iloc[i]['level'] in ['Fr', 'So']:
        df_sales.iloc[i, spring_col] = (
            df_sales.iloc[i, spring_col] + rng.normal(
                loc=-21, scale=3))
    # An intensive marketing campaign was carried out at STM and STL;
    # if it ended up being successful, it would then be implemented
    # at the other colleges as well.
    if df_sales.iloc[i]['college'] in ['STM', 'STL']:
        df_sales.iloc[i, spring_col] = (
            df_sales.iloc[i, spring_col] + rng.normal(
                loc=9, scale=3))
# I'll leave in negative Fall and Spring values, as they could be
# explained by refunds.
df_sales['Fall'] = df_sales['Fall'].round(2)
df_sales['Spring'] = df_sales['Spring'].round(2)
df_sales['Fall_Spring_Change'] = df_sales['Spring'] - df_sales['Fall']
df_sales.to_sql('bookstore_sales', if_exists='replace',
index=False, con=e)
df_sales.to_csv('bookstore_sales.csv', index=False)
df_sales
student_id | gender | college | level | Fall | Spring | Fall_Spring_Change | |
---|---|---|---|---|---|---|---|
0 | 2020-1 | F | STC | Se | 66.80 | 58.24 | -8.56 |
1 | 2020-2 | F | STM | Se | 104.67 | 151.90 | 47.23 |
... | ... | ... | ... | ... | ... | ... | ... |
16382 | 2023-5442 | M | STB | Fr | 100.52 | 38.29 | -62.23 |
16383 | 2023-5443 | M | STM | Fr | 57.45 | 29.43 | -28.02 |
16384 rows × 7 columns
That’s it for this script! We’ve now created all of the fictional data we’ll need for the various sections of Python for Nonprofits.