NVCU Database Generator#
By Kenneth Burchfiel
This script creates a SQLite database for the (fictional) Northern Virginia Catholic University that will be referenced within other sections of Python for Nonprofits. All of the data within this database is fictional as well.
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
# Setting up random number generation capabilities:
rng = np.random.default_rng(2325)
# Based on https://numpy.org/doc/stable/reference/random/generator.html
# The faker library is a great tool for creating fictional
# database records. I set the locale to 'en_US' because NVCU
# is located in the United States.
from faker import Faker
fake = Faker('en_US')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns=7,
display_max_rows=5)
Connecting to our NVCU database via the SQLAlchemy library#
(This code will work even if no database currently exists at the path shown below.)
e = create_engine('sqlite:///nvcu_db.db')
# Based on: https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#pysqlite
student_count = 2**14 # The university will have 16,384 students.
student_count
16384
Creating a current enrollment table#
Creating lists of names#
We’ll use the Faker library to create equal numbers of female and male first names, then concatenate these lists to create a single list of first names.
(Note: The Faker documentation for the ‘en_US’ (US English) locale (https://faker.readthedocs.io/en/master/locales/en_US.html) was a useful resource in writing this code.)
student_count
# It's convenient to also create a variable equal to half of the
# student count. (We'll use floor division (//), which returns an int
# rather than a float, so that this value can be passed directly to
# range() within list comprehensions.)
half_student_count = student_count // 2
half_student_count
8192
female_first_names = [
fake.first_name_female()
for i in range(half_student_count)]
male_first_names = [
fake.first_name_male() for i in range(half_student_count)]
first_names = female_first_names + male_first_names
last_names = [
fake.last_name() for i in range(student_count)]
female_first_names[0:5], male_first_names[0:5]
(['Ashley', 'Lisa', 'Alexandra', 'Cynthia', 'Mary'],
['Charles', 'Ryan', 'Miguel', 'Brian', 'Curtis'])
# In order to make our genders match our names, we'll
# make the first half of our gender list female and the second
# half male (as the first and second halves of our first names
# list contain female and male names, respectively).
genders = (['F' for i in range(half_student_count)]
+ ['M' for i in range(half_student_count)])
Initializing our current enrollment table#
df_curr_enrollment = pd.DataFrame(
index=np.arange(0,student_count),
data = {'first_name':first_names,
'last_name':last_names,
'gender':genders})
df_curr_enrollment.head()
first_name | last_name | gender | |
---|---|---|---|
0 | Ashley | Jordan | F |
1 | Lisa | Dunn | F |
2 | Alexandra | Woods | F |
3 | Cynthia | Lozano | F |
4 | Mary | Kemp | F |
Creating matriculation years:#
In order to simulate increasing enrollment over time, weights are passed to the rng.choice() call (via its p argument) so that recent years appear more frequently.
rng.choice(
[2020, 2021, 2022, 2023],
p=[0.20, 0.22, 0.25, 0.33], size=student_count)
array([2023, 2023, 2020, ..., 2023, 2022, 2023], shape=(16384,))
df_curr_enrollment['matriculation_year'] = rng.choice(
[2020, 2021, 2022, 2023], p=[0.20, 0.22, 0.25, 0.33],
size=student_count)
# See https://numpy.org/doc/stable/reference/random/generated/numpy.
# random.Generator.choice.html
df_curr_enrollment.head()
first_name | last_name | gender | matriculation_year | |
---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 |
1 | Lisa | Dunn | F | 2020 |
2 | Alexandra | Woods | F | 2020 |
3 | Cynthia | Lozano | F | 2022 |
4 | Mary | Kemp | F | 2023 |
Creating student IDs#
Student IDs will use the format matriculation_year-matriculation_number. matriculation_number represents the order in which students enrolled in a given year; these numbers are unique within each year, but not between years. Combining each number with the student's matriculation year therefore produces a unique ID.
# Calculating matriculation numbers by grouping the DataFrame by
# matriculation year, then assigning each student within each year a unique
# number:
# (This can be achieved via df.groupby() and df.rank(). See
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.groupby.html
# and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.rank.html )
df_curr_enrollment['matriculation_number'] = df_curr_enrollment.groupby(
'matriculation_year')['matriculation_year'].rank(
method='first').astype('int')
df_curr_enrollment.head()
first_name | last_name | gender | matriculation_year | matriculation_number | |
---|---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 | 1 |
1 | Lisa | Dunn | F | 2020 | 2 |
2 | Alexandra | Woods | F | 2020 | 3 |
3 | Cynthia | Lozano | F | 2022 | 1 |
4 | Mary | Kemp | F | 2023 | 1 |
# Adding matriculation years, matriculation numbers, and hyphens together
# to create student IDs:
df_curr_enrollment['student_id'] = (
df_curr_enrollment['matriculation_year'].astype('str')
+ '-'
+ df_curr_enrollment['matriculation_number'].astype('str'))
# Sorting the DataFrame by matriculation year and matriculation number:
# (The DataFrame could also be sorted by student_id, but because these
# numbers would be interpreted as strings, ids like 2020-10 would appear
# in front of ones like 2020-2.)
df_curr_enrollment.sort_values(
['matriculation_year', 'matriculation_number'], inplace=True)
df_curr_enrollment.reset_index(drop=True,inplace=True)
df_curr_enrollment.head()
first_name | last_name | gender | matriculation_year | matriculation_number | student_id | |
---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 | 1 | 2020-1 |
1 | Lisa | Dunn | F | 2020 | 2 | 2020-2 |
2 | Alexandra | Woods | F | 2020 | 3 | 2020-3 |
3 | Kathleen | Carter | F | 2020 | 4 | 2020-4 |
4 | Donna | Brewer | F | 2020 | 5 | 2020-5 |
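To see why sorting on the student_id strings directly would misorder rows, consider a quick hypothetical example:

```python
# String comparison works character by character, so '2020-10' sorts
# before '2020-2' ('1' < '2' at the sixth character):
ids = ['2020-2', '2020-10', '2020-1']
print(sorted(ids))  # ['2020-1', '2020-10', '2020-2']
```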
We’ll also use matriculation years as a basis for dates of birth. (These date of birth calculations aren’t the most realistic, but they are good enough for use within our fictional dataset.)
df_curr_enrollment['birth_month'] = rng.integers(
low=1, high=13, size=student_count).astype('str')
df_curr_enrollment['birth_day'] = rng.integers(
low=1, high=29, size=student_count).astype('str')
# Adding leading zeros to month and date values:
for column in ['birth_month', 'birth_day']:
    df_curr_enrollment[column] = df_curr_enrollment[column].str.zfill(2)
# In order to prevent a non-existent date (e.g. February 31) from getting
# stored in our results, we'll limit our birthday integers to the
# range [1, 28].
df_curr_enrollment['birth_year'] = (df_curr_enrollment[
'matriculation_year'] - 18).astype('str')
# Combining these columns together to create date of birth values in
# YYYY-MM-DD format:
df_curr_enrollment['date_of_birth'] = (
df_curr_enrollment['birth_year'] + '-'
+ df_curr_enrollment['birth_month'] + '-'
+ df_curr_enrollment['birth_day'])
# Now that we have our date_of_birth column in place, we no longer need
# the columns that stored individual parts of those birthdays:
df_curr_enrollment.drop(
['birth_month', 'birth_day',
'birth_year'], axis=1, inplace=True)
df_curr_enrollment
first_name | last_name | gender | matriculation_year | matriculation_number | student_id | date_of_birth | |
---|---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | 2020 | 1 | 2020-1 | 2002-12-16 |
1 | Lisa | Dunn | F | 2020 | 2 | 2020-2 | 2002-09-26 |
... | ... | ... | ... | ... | ... | ... | ... |
16382 | Andrew | Copeland | M | 2023 | 5442 | 2023-5442 | 2005-11-03 |
16383 | Aaron | Thomas | M | 2023 | 5443 | 2023-5443 | 2005-04-05 |
16384 rows × 7 columns
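Because the birth-day integers were capped at 28, every assembled YYYY-MM-DD string should be a real calendar date. The following sketch (which mirrors the month/day logic above on a small hypothetical sample) demonstrates that guarantee:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.Series(rng.integers(1, 13, size=5).astype(str)).str.zfill(2)
days = pd.Series(rng.integers(1, 29, size=5).astype(str)).str.zfill(2)
dates = '2005-' + months + '-' + days
# pd.to_datetime() would raise a ValueError if any date were invalid
# (e.g. February 31), so this call doubles as a validity check:
parsed = pd.to_datetime(dates)
print(parsed.notna().all())  # True
```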
Assigning students to different colleges#
(I used Wikipedia’s ‘List of patron saints by occupation and activity’ page to determine the saint after which each college would be named.)
NVCU has four different colleges:
St. Luke’s, a humanities college. (St. Luke is one of the patron saints of artists.) Abbreviation: STL
St. Benedict’s, a STEM college. (St. Benedict is one of the patron saints of engineers.) Abbreviation: STB
St. Matthew’s, a business college. (St. Matthew is the patron saint of accountants.) Abbreviation: STM
St. Catherine’s, a health sciences college. (St. Catherine of Alexandria is one of the patron saints of nurses.) Abbreviation: STC
We can use rng.choice() to assign students to different colleges. However, to make the data more interesting, we'll have one college (STL) increase in popularity over time; another (STC) decrease in popularity; and the two remaining colleges stay roughly constant. We can simulate these changes by (1) creating filtered versions of the DataFrame for each year; (2) calling rng.choice() with a different probability set for each year in order to create the 'college' column; and (3) recreating df_curr_enrollment by concatenating these year-specific DataFrames back together.
df_list = []
for year in df_curr_enrollment['matriculation_year'].unique():
    print(f"Now adding in college enrollments for {year}.")
    df = df_curr_enrollment.query("matriculation_year == @year").copy()
    if year == 2020:
        probabilities = [0.15, 0.25, 0.3, 0.3]
    elif year == 2021:
        probabilities = [0.19, 0.26, 0.29, 0.26]
    elif year == 2022:
        probabilities = [0.25, 0.27, 0.27, 0.21]
    elif year == 2023:
        probabilities = [0.27, 0.25, 0.31, 0.17]
    else:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")
    df['college'] = rng.choice(
        ['STL', 'STB', 'STM', 'STC'],
        p=probabilities, size=len(df))
    df_list.append(df)
df_curr_enrollment = pd.concat(df_list)
df_curr_enrollment.head()
Now adding in college enrollments for 2020.
Now adding in college enrollments for 2021.
Now adding in college enrollments for 2022.
Now adding in college enrollments for 2023.
first_name | last_name | gender | ... | student_id | date_of_birth | college | |
---|---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | ... | 2020-1 | 2002-12-16 | STC |
1 | Lisa | Dunn | F | ... | 2020-2 | 2002-09-26 | STM |
2 | Alexandra | Woods | F | ... | 2020-3 | 2002-05-18 | STC |
3 | Kathleen | Carter | F | ... | 2020-4 | 2002-11-15 | STC |
4 | Donna | Brewer | F | ... | 2020-5 | 2002-06-21 | STM |
5 rows × 8 columns
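The if/elif chain in the loop above could also be written as a dictionary lookup, which makes adding a new year a one-line change. A minimal sketch under the same probability sets (get_probabilities() is a hypothetical helper, not part of the script above):

```python
year_probabilities = {
    2020: [0.15, 0.25, 0.3, 0.3],
    2021: [0.19, 0.26, 0.29, 0.26],
    2022: [0.25, 0.27, 0.27, 0.21],
    2023: [0.27, 0.25, 0.31, 0.17]}

def get_probabilities(year):
    '''Looks up the college-choice probabilities for a given year,
    raising a descriptive error for years that haven't been added yet.'''
    try:
        return year_probabilities[year]
    except KeyError:
        raise ValueError(
            f"A probability list needs to be added in for {year}.")

print(get_probabilities(2022))  # [0.25, 0.27, 0.27, 0.21]
```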
The commented-out cell below shows an alternative approach to assigning colleges to each student. Because it iterates through each row in the DataFrame, it took 2.25 seconds to run on my laptop versus 0.016 seconds for the method shown above; in other words, the method in the previous cell was 133 times faster.
# df_curr_enrollment['college'] = ''
# college_col = df_curr_enrollment.columns.get_loc('college')
# for i in range(len(df_curr_enrollment)):
#     year = df_curr_enrollment.iloc[i]['matriculation_year']
#     if year == 2020:
#         probabilities = [0.15, 0.25, 0.3, 0.3]
#     elif year == 2021:
#         probabilities = [0.19, 0.26, 0.29, 0.26]
#     elif year == 2022:
#         probabilities = [0.25, 0.27, 0.27, 0.21]
#     elif year == 2023:
#         probabilities = [0.27, 0.25, 0.31, 0.17]
#     df_curr_enrollment.iloc[i, college_col] = rng.choice(
#         ['STL', 'STB', 'STM', 'STC'], p=probabilities)
Saving this table to our NVCU database#
(This operation would also create our database file if it didn’t exist already.)
df_curr_enrollment.to_sql(
'curr_enrollment',
con=e, if_exists='replace', index=False)
16384
To demonstrate that the above operation was successful, we can read in a copy of this table via pd.read_sql():
pd.read_sql('curr_enrollment', con=e).head()
first_name | last_name | gender | ... | class_of | level | level_for_sorting | |
---|---|---|---|---|---|---|---|
0 | Ashley | Jordan | F | ... | 2024 | Se | 3 |
1 | Lisa | Dunn | F | ... | 2024 | Se | 3 |
2 | Alexandra | Woods | F | ... | 2024 | Se | 3 |
3 | Kathleen | Carter | F | ... | 2024 | Se | 3 |
4 | Donna | Brewer | F | ... | 2024 | Se | 3 |
5 rows × 11 columns
This same table can also be saved as a standalone .csv file (thus making it easier to examine via a spreadsheet editor):
df_curr_enrollment.to_csv('curr_enrollment.csv', index = False)
df_curr_enrollment.to_csv(
'../Data_Retrieval/curr_enrollment_tab_separated.csv',
index=False, sep='\t')
You may also save it as an .xlsx file if you'd prefer. (Note that the openpyxl library must be installed for the following line to run successfully.)
df_curr_enrollment.to_excel(
'../Data_Retrieval/curr_enrollment.xlsx', index=False)
Creating a survey results table#
This table will store fall and spring student survey results for the most recent school year. We'll configure our results so that certain groups (e.g. freshmen, seniors, and students in the STL and STM colleges) report larger score increases over time than others do.
Creating fall survey results#
def limit_results(df, column, min_value, max_value):
    '''This function restricts the values within a particular DataFrame
    column to a range provided by the user.

    df: the DataFrame to modify.
    column: the column within df whose values will be restricted.
    min_value and max_value: the minimum and maximum values,
    respectively, to allow within this column.
    '''
    df[column] = np.where(
        df[column] < min_value, min_value, df[column])
    df[column] = np.where(
        df[column] > max_value, max_value, df[column])
# Initializing the table as a copy of selected columns from
# df_curr_enrollment:
df_fall_survey = df_curr_enrollment.copy()[
['matriculation_year', 'matriculation_number',
'student_id', 'college', 'level']]
df_fall_survey['starting_year'] = 2023 # Other years could be stored
# within this table as well.
df_fall_survey['season'] = 'Fall'
# We'll use rng.normal(), which produces random numbers that follow a
# normal distribution, to initialize our fall results.
# See https://numpy.org/doc/stable/reference/random/generated/numpy.
# random.Generator.normal.html
df_fall_survey['score'] = rng.normal(
loc=70, scale=10, size=student_count).astype('int')
# Calling limit_results() to restrict the scores to the range [0, 100]:
limit_results(df_fall_survey, 'score', 0, 100)
df_fall_survey.head()
matriculation_year | matriculation_number | student_id | ... | starting_year | season | score | |
---|---|---|---|---|---|---|---|
0 | 2020 | 1 | 2020-1 | ... | 2023 | Fall | 88 |
1 | 2020 | 2 | 2020-2 | ... | 2023 | Fall | 37 |
2 | 2020 | 3 | 2020-3 | ... | 2023 | Fall | 54 |
3 | 2020 | 4 | 2020-4 | ... | 2023 | Fall | 56 |
4 | 2020 | 5 | 2020-5 | ... | 2023 | Fall | 77 |
5 rows × 8 columns
Confirming that the minimum and maximum ‘score’ values within df_fall_survey lie within the range [0, 100]:
df_fall_survey['score'].min(), df_fall_survey['score'].max()
(np.int64(30), np.int64(100))
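pandas also offers a built-in alternative to limit_results(): Series.clip(), which performs the same restriction in a single call. A quick sketch:

```python
import pandas as pd

scores = pd.Series([-12, 45, 78, 130])
# clip() replaces values below `lower` with `lower` and values above
# `upper` with `upper`, just like our pair of np.where() calls:
print(scores.clip(lower=0, upper=100).tolist())  # [0, 45, 78, 100]
```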
Creating spring results#
In order to create these results, we'll first create a copy of our fall results table. Next, we'll calculate our spring results by creating a score_change variable that defaults to 10; adjusting this variable up or down based on students' college membership and level; and then subtracting 5 from each student's fall score and adding the product of score_change and rng.random()* to it.
We'll use .iloc[] to make these updates. An alternative would be df.at[], but .at requires index labels, whereas the positional .iloc approach shown in this cell works regardless of which labels are present.
*rng.random() produces a float in the range [0, 1); see the NumPy documentation for more details. Its use here allows for more variation in individual fall-to-spring changes.
df_spring_survey = df_fall_survey.copy().rename(
columns={'score':'fall_score'}).replace('Fall','Spring')
df_spring_survey['score'] = np.nan
score_col = df_spring_survey.columns.get_loc('score')
for i in range(len(df_spring_survey)):
    score_change = 10
    college = df_spring_survey.iloc[i]['college']
    level = df_spring_survey.iloc[i]['level']
    if college in ['STL', 'STM']:
        score_change += 20
    if level in ['So', 'Ju']:
        score_change -= 10
    # Using .iloc to make our updates:
    # This approach prevents 'setting with copy' warnings from being
    # displayed. Note the use of get_loc to determine which column index
    # value to incorporate into .iloc.
    df_spring_survey.iloc[i, score_col] = (df_spring_survey.iloc[
        i]['fall_score'] - 5 + rng.random() * score_change)
limit_results(df_spring_survey, 'score', 0, 100)
df_spring_survey['score'] = df_spring_survey['score'].astype('int')
df_spring_survey.head()
matriculation_year | matriculation_number | student_id | ... | season | fall_score | score | |
---|---|---|---|---|---|---|---|
0 | 2020 | 1 | 2020-1 | ... | Spring | 88 | 86 |
1 | 2020 | 2 | 2020-2 | ... | Spring | 37 | 35 |
2 | 2020 | 3 | 2020-3 | ... | Spring | 54 | 54 |
3 | 2020 | 4 | 2020-4 | ... | Spring | 56 | 57 |
4 | 2020 | 5 | 2020-5 | ... | Spring | 77 | 74 |
5 rows × 9 columns
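For larger tables, the per-row loop above could be replaced with vectorized boolean masks, which compute every student's score_change at once. A sketch under the same adjustment rules (the toy DataFrame below is hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'fall_score': [70, 60, 80],
                   'college': ['STL', 'STB', 'STM'],
                   'level': ['Fr', 'Ju', 'Se']})
# Start each change at 10, then adjust by college and level:
score_change = np.full(len(df), 10.0)
score_change += np.where(df['college'].isin(['STL', 'STM']), 20, 0)
score_change -= np.where(df['level'].isin(['So', 'Ju']), 10, 0)
df['score'] = (df['fall_score'] - 5
               + rng.random(len(df)) * score_change)
```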
A pivot table of results by college and level reveals the variation in mean scores created by the above for loop.
df_spring_survey.pivot_table(
index=['college', 'level'],
values='score', aggfunc='mean').reset_index()
college | level | score | |
---|---|---|---|
0 | STB | Fr | 69.177235 |
1 | STB | Ju | 64.950769 |
... | ... | ... | ... |
14 | STM | Se | 79.309831 |
15 | STM | So | 73.415690 |
16 rows × 3 columns
Combining fall and spring results into the same table#
We’ll use a ‘long’ format for this table. In other words, rather than show fall and spring results side by side for each student (an example of a ‘wide’-formatted table), we’ll list these results as separate rows. I consider this setup to be more realistic: in real life, this table would likely be created by stacking individual sets of survey results on top of one another.
The only identifier that we’ll preserve is the student ID, as this will be sufficient for linking this dataset with any other table (such as our current enrollment dataset) that also contains this key.
df_survey = pd.concat(
[df_fall_survey, df_spring_survey]).reset_index()[
['student_id', 'starting_year', 'season', 'score']]
df_survey.head()
student_id | starting_year | season | score | |
---|---|---|---|---|
0 | 2020-1 | 2023 | Fall | 88 |
1 | 2020-2 | 2023 | Fall | 37 |
2 | 2020-3 | 2023 | Fall | 54 |
3 | 2020-4 | 2023 | Fall | 56 |
4 | 2020-5 | 2023 | Fall | 77 |
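When a side-by-side comparison is needed later, a long table like this one can be pivoted back into a wide format. A minimal sketch using df.pivot() on hypothetical rows:

```python
import pandas as pd

df_long = pd.DataFrame({
    'student_id': ['2020-1', '2020-2', '2020-1', '2020-2'],
    'season': ['Fall', 'Fall', 'Spring', 'Spring'],
    'score': [88, 37, 86, 35]})
# Each student becomes one row, with one column per season:
df_wide = df_long.pivot(
    index='student_id', columns='season', values='score')
print(df_wide.loc['2020-1', 'Fall'])  # 88
```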
df_survey.to_sql('survey_results', if_exists='replace',
index=False, con=e)
df_survey.to_csv('survey_results.csv', index=False)
Creating a set of winter results that can be incorporated into a data cleaning/reformatting script#
In order to demonstrate how the Pandas library can be used to clean and reformat data, I’ll now create a winter survey results table that contains several issues, including:
Column names that differ from those in the fall/spring results
Different data formats
A missing column
Duplicate values
Missing values for certain students
Results spread over 16 separate files (one for each college/level pair)
These issues will be addressed within the data_cleaning_and_reformatting.ipynb script within PFN’s Data Prep section.
df_winter_survey = df_fall_survey.copy()
df_winter_survey['SEASON'] = 'W'
# Initializing winter survey results as slightly lower copies of fall
# ones, then converting them to a different data type:
# (Converting the results to integers, then to strings, prevents the
# original decimal values from appearing in the final output.)
df_winter_survey['SURVEY_SCORE'] = (
    df_winter_survey['score']
    + (rng.random(size=student_count) * -10)).astype(
        'int').astype('str') + '.0%'
df_winter_survey['STARTINGYR'] = df_winter_survey['starting_year'] - 2000
df_winter_survey['MATRICYR'] = (
df_winter_survey['matriculation_year'] - 2000).astype('str')
df_winter_survey.rename(columns={'matriculation_number':'MATRIC#'},
inplace=True)
df_winter_survey.drop(
['matriculation_year', 'student_id', 'starting_year',
'season','score'], axis=1, inplace=True)
# Removing 15% of students from the results via df.sample:
df_winter_survey = df_winter_survey.sample(frac=0.85).copy()
# See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.sample.html
# using df.sample() to add duplicate records for 5% of the students present
# within the sample:
df_winter_survey = pd.concat(
[df_winter_survey, df_winter_survey.sample(frac=0.05)])
df_winter_survey.reset_index(drop=True,inplace=True)
df_winter_survey.head()
MATRIC# | college | level | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR | |
---|---|---|---|---|---|---|---|
0 | 1751 | STC | So | W | 58.0% | 23 | 22 |
1 | 2028 | STL | So | W | 59.0% | 23 | 22 |
2 | 2594 | STL | Se | W | 49.0% | 23 | 20 |
3 | 1733 | STL | Se | W | 56.0% | 23 | 20 |
4 | 2088 | STM | Fr | W | 62.0% | 23 | 23 |
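As a preview of the kind of fix that data_cleaning_and_reformatting.ipynb will apply, the percent-formatted SURVEY_SCORE strings can be converted back to numbers by stripping the trailing '%'. A hypothetical sketch:

```python
import pandas as pd

scores = pd.Series(['58.0%', '59.0%', '49.0%'])
# Remove the trailing '%' and convert the remainder to a float:
cleaned = scores.str.rstrip('%').astype(float)
print(cleaned.tolist())  # [58.0, 59.0, 49.0]
```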
Dividing these results into 16 individual datasets (one for each college/level pair), then saving those results into the Data Prep section of PFN:
(The ‘college’ and ‘level’ fields within df_winter_survey were only necessary for filtering the data and creating college- and level-based filenames, so they don’t need to be included in the .csv copies of the data.)
for college in df_winter_survey['college'].unique():
    for level in df_winter_survey['level'].unique():
        df_results = df_winter_survey.query(
            "college == @college & level == @level").copy().drop(
            ['college', 'level'], axis=1)
        df_results.to_csv(
            f'../Data_Prep/winter_results/{college}_{level}_results.csv',
            index=False)
Creating a dining transactions table#
I created this table in order to illustrate the ‘average of averages’ problem within Descriptive Stats: Part 2. It currently shows only four items for each transaction: (1) the amount of money spent; (2) the abbreviated level (Fr for freshman, So for sophomore, etc.) of the diner; (3) the starting school year in which the transaction was made; and (4) the weekday of the transaction.
The table is designed so that younger levels will have more entries than older levels, but older levels will spend higher amounts, on average, per transaction.
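This design is what produces the 'average of averages' discrepancy: because levels differ in both transaction counts and typical amounts, averaging the per-level averages gives a different answer than averaging all transactions directly. A toy illustration with hypothetical amounts:

```python
import pandas as pd

df = pd.DataFrame({'level': ['Fr', 'Fr', 'Fr', 'Se'],
                   'amount': [5.0, 6.0, 7.0, 20.0]})
# Mean of all four transactions:
overall_mean = df['amount'].mean()  # (5+6+7+20)/4 = 9.5
# Mean of the two per-level means (Fr: 6.0, Se: 20.0):
avg_of_avgs = df.groupby('level')['amount'].mean().mean()  # 13.0
print(overall_mean, avg_of_avgs)
```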
dining_transaction_count = 22453
# Initializing the table:
# Note the use of rng.choice() to make younger-level transactions
# more common than older-level ones and mid-week transactions
# more common than weekend ones.
# This code uses rng.random() to initialize each transaction as a
# number between $0 and $20; however, these amounts will get adjusted
# later in this cell.
df_dining_transactions = pd.DataFrame(
index=np.arange(dining_transaction_count),
data={'starting_year':2023,
'weekday':rng.choice(
['Su', 'Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa'],
p=[0.05, 0.16, 0.23, 0.24, 0.18, 0.11, 0.03],
size=dining_transaction_count),
'level':rng.choice(
['Fr', 'So', 'Ju', 'Se'],
p=[0.55, 0.21, 0.15, 0.09], size=dining_transaction_count),
'base_charge':[
rng.random() * 20 for i in range(
dining_transaction_count)]})
# Creating a 'multiplier' column that will cause average transaction
# amounts for older levels to exceed those of younger levels:
df_dining_transactions['multiplier'] = df_dining_transactions[
'level'].map(
{'Fr':1, 'So':1.3, 'Ju': 1.7, 'Se':2.1})
df_dining_transactions['amount'] = np.round(
df_dining_transactions['base_charge']
* df_dining_transactions['multiplier'], 2)
# Removing columns that are no longer needed:
df_dining_transactions.drop(
['base_charge', 'multiplier'], axis=1, inplace=True)
df_dining_transactions
starting_year | weekday | level | amount | |
---|---|---|---|---|
0 | 2023 | We | Fr | 13.36 |
1 | 2023 | Tu | Se | 12.22 |
... | ... | ... | ... | ... |
22451 | 2023 | Sa | Fr | 13.25 |
22452 | 2023 | Mo | Fr | 1.95 |
22453 rows × 4 columns
df_dining_transactions.to_sql('dining_transactions', if_exists='replace',
index=False, con=e)
df_dining_transactions.to_csv('dining_transactions.csv', index=False)
Creating a table of fall and spring bookstore sales by student for a given year:#
(This table will play a key role in the Regressions section of Python for Nonprofits.)
We’ll initialize this table as a subset of df_curr_enrollment, as certain demographic items will play a role in students’ spring purchase totals.
df_sales = df_curr_enrollment[
['student_id', 'gender', 'college', 'level']].copy()
# Specifying an RNG seed: (I sometimes use the time of day, in HHMMSS
# format, as a basis for a seed value.)
rng = np.random.default_rng(seed=225403)
# Fall sales will be normally distributed for all students:
df_sales['Fall'] = rng.normal(
loc=80, scale=25, size=len(df_sales))
# Spring sales will be higher than fall sales by default. The following
# code specifies these changes using a normal distribution, as I found
# that doing so helped make my regression residuals more normally
# distributed as well.
spring_change = rng.normal(loc=11, scale=25, size=len(df_sales))
df_sales['Spring'] = df_sales['Fall'] + spring_change
# Modifying Spring totals based on demographic components:
spring_col = df_sales.columns.get_loc('Spring')
for i in range(len(df_sales)):
    # Unhealthy snacks were removed from the checkout aisle; this
    # ended up reducing revenue among freshmen and sophomores
    # (who particularly liked these snacks).
    if df_sales.iloc[i]['level'] in ['Fr', 'So']:
        df_sales.iloc[i, spring_col] = (
            df_sales.iloc[i, spring_col] + rng.normal(
                loc=-21, scale=3))
    # An intensive marketing campaign was carried out at STM and STL;
    # if it ended up being successful, it would then be implemented
    # at the other colleges as well.
    if df_sales.iloc[i]['college'] in ['STM', 'STL']:
        df_sales.iloc[i, spring_col] = (
            df_sales.iloc[i, spring_col] + rng.normal(
                loc=9, scale=3))
# I'll leave in negative Fall and Spring values, as they could be
# explained by refunds.
df_sales['Fall'] = df_sales['Fall'].round(2)
df_sales['Spring'] = df_sales['Spring'].round(2)
df_sales['Fall_Spring_Change'] = df_sales['Spring'] - df_sales['Fall']
df_sales.to_sql('bookstore_sales', if_exists='replace',
index=False, con=e)
df_sales.to_csv('bookstore_sales.csv', index=False)
df_sales
student_id | gender | college | level | Fall | Spring | Fall_Spring_Change | |
---|---|---|---|---|---|---|---|
0 | 2020-1 | F | STC | Se | 66.80 | 58.24 | -8.56 |
1 | 2020-2 | F | STM | Se | 104.67 | 151.90 | 47.23 |
... | ... | ... | ... | ... | ... | ... | ... |
16382 | 2023-5442 | M | STB | Fr | 100.52 | 38.29 | -62.23 |
16383 | 2023-5443 | M | STM | Fr | 57.45 | 29.43 | -28.02 |
16384 rows × 7 columns
That’s it for this script! We’ve now created all of the fictional data we’ll need for the various sections of Python for Nonprofits.