Census Data Imports#
The US Census Bureau is a fantastic source of free demographic data. As this notebook will demonstrate, Python allows us to easily access large amounts of this data at once. In this script, we’ll use the Census Bureau’s API to (1) calculate state- and county-level population growth data and (2) gather data on education levels and median incomes that will prove useful for future regression analyses.
(Note: I had initially planned to feature this script right after PFN’s Data Retrieval chapter. However, this notebook not only imports Census data, but also reformats it and generates some descriptive statistics. Therefore, I decided it would be best to wait to cover this subject until now.)
An introduction to the American Community Survey#
Many Americans probably associate the US Census Bureau with its decennial Census. However, the Census Bureau also conducts the American Community Survey (https://www.census.gov/data/developers/data-sets/acs-5year.html) each year, making it an ideal resource for recent demographic data.
This notebook will source data from the American Community Survey’s 5-year estimates, which average results over the preceding five years. (For example, the 2021 ACS5 dataset averages results from 2017 through 2021.) The 1-year ACS estimates (https://www.census.gov/data/developers/data-sets/acs-1year.html) cover a more recent timeframe; however, because the 5-year estimates are sourced from a larger pool of data, they may be more reliable (especially for smaller regions). In addition, 1-year estimates aren’t available for certain regions, such as zip codes and counties with smaller populations.
For the sake of brevity, I’ll often refer to the American Community Survey’s 5-year estimates as the ‘ACS5’ survey.
Another great source of Census data, which allows for a much wider variety of analyses, is American Community Survey microdata. These microdata tables show individual respondent rows rather than pre-calculated estimates, thus allowing for all sorts of custom filters, comparisons, and statistical tests. You can download 1- and 5-year ACS microdata from the Census website (https://www.census.gov/data/developers/data-sets/census-microdata-api.html); however, I would recommend using IPUMS (https://usa.ipums.org/usa/) for microdata retrieval, as it makes comparisons across different time periods much easier.
This script will not feature microdata imports, but if you plan to use Census data within your own work, you’ll definitely want to learn more about the benefits (and disadvantages) of microdata. The US Census guide at https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf is a good place to start. One notable disadvantage of microdata files is that, for confidentiality reasons, they do not include county- or zip-level information (though IPUMS has been able to map certain microdata records to certain US counties).
Finally, I’ll provide a brief introduction to the two main data-gathering tasks that we’ll tackle within this script:
Deciding where to move to start a family#
Let’s say that some NVCU seniors are interested in settling down and raising a family a few years after they graduate. Because they’d prefer to live in a growing region rather than a declining one, they want to know which areas have seen the highest growth rates in recent years. They’d like to see this data both for all residents within each county and those aged 25-29.
In order to answer these questions, we’ll use the Census API to retrieve US county population growth data from the ACS5 for a selected set of years. We’ll then use this data to calculate population growth rates across multiple periods.
Examining connections between education levels and median incomes#
The NVCU admissions department would like to convince high schoolers who are on the fence about higher education that going to college is (generally) worth it from a financial perspective. Therefore, they would like you to determine how the median income for a given county and state increases as the percentage of residents with at least a bachelor’s degree increases.* They figure that a strong relationship between the two could help convince teenagers to pursue a higher education.**
We can shed some light on these questions by performing a series of linear regressions, but first, we’ll need to obtain Census data to incorporate into those regressions.
*The monetary value of a bachelor’s degree can be better estimated by simply comparing the median incomes of state/county residents across different education levels. Therefore, this notebook will retrieve these medians as well. However, the continuous nature of the ‘% with bachelor’s’ metric will make it a great fit for our linear regression example.
**There are of course many non-monetary benefits to a college education, but given the size of student loans these days, it doesn’t hurt to look into the financial rewards of a bachelor’s degree.
Getting started with the Census API#
We’ll first import a few relevant libraries and set several configuration variables:
import time
program_start_time = time.time()
import pandas as pd
pd.set_option('display.max_columns', 1000) # This max column setting
# will prevent columns in the notebook from being hidden; however,
# if render_for_pdf (defined in the Appendix's helper_funcs.py file)
# is set to True, this setting will get overwritten.
import numpy as np
from iteration_utilities import duplicates
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook, render_for_pdf
display_type = config_notebook(display_max_columns=5,
                               display_max_rows=6)
acs5_year = 2021 # By updating this variable when future American
# Community Surveys get released, you should be able to retrieve the most
# recent data possible. (If changes to the survey's format are made,
# however, updates to the scripts may be necessary.)
# Note: I had originally set acs5_year to 2022, the latest year for which
# ACS5 data were available at the time. However, due to a recent change
# in Connecticut's county-equivalent boundaries (see
# https://www.federalregister.gov/documents/2022/06/06/2022-12063/
# change-to-county-equivalents-in-the-state-of-connecticut for more
# information), ACS5 population growth data between previous
# years and 2022 appeared to be unavailable for that state. Therefore,
# I chose to retrieve data for 2021 instead.
acs5_latest_year = 2023 # This variable will be used to determine
# the ACS5 year for which a dataset containing only one year of data
# will be retrieved. Since this dataset won't show historical trends
# in data, we don't need to worry about changes in county definitions;
# thus, we can use the most recent year for which data are available.
download_new_variable_list = False # If set to True, a new list of
# variables will be downloaded from the Census API website. If False,
# this list of variables will instead be read in from a local .csv copy
# (thus saving processing time).
Importing a Census API Key#
You can obtain a free Census API key at https://api.census.gov/data/key_signup.html. The following cell imports my own personal key, so you’ll need to replace this code with code that loads your own API key.
with open('census_api_key_path.txt') as file:
    key_path = file.read()

with open(key_path) as file:
    key = file.read()
In order to get better acquainted with the Census API, you may want to review its documentation. For instance, you’ll probably find the Census Data API User Guide (https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf) to be helpful in applying the Census API.
The list of ACS5 API call examples at https://api.census.gov/data/2021/acs/acs5/examples.html is another great resource. One of the sample URLs shown on this page for retrieving county-level data appears as follows:
https://api.census.gov/data/2021/acs/acs5?get=NAME,B01001_001E&for=county:*&key=YOUR_KEY_GOES_HERE
If you replace the ‘YOUR_KEY_GOES_HERE’ component of the URL with your actual key, then enter this link into your web browser, you’ll receive a very long list of counties, population values, and state and county codes. The top of the list for the 2021 ACS5 looks like this:
[["NAME","B01001_001E","state","county"],
["Autauga County, Alabama","58239","01","001"],
["Baldwin County, Alabama","227131","01","003"],
["Barbour County, Alabama","25259","01","005"],
["Bibb County, Alabama","22412","01","007"],
["Blount County, Alabama","58884","01","009"],
["Bullock County, Alabama","10386","01","011"],
‘B01001_001E’ refers to the total population estimates for a given county. We can find this out by going to the 2021 ACS5’s Detailed Tables page (https://api.census.gov/data/2021/acs/acs5/variables.html) and navigating to the row with a ‘Name’ value of ‘B01001_001E’. This link, which may take a little while to fully load, is available on the ACS5 API Documentation Page (https://www.census.gov/data/developers/data-sets/acs-5year.html).
We can use pd.read_json()
to easily read this same data into a DataFrame:
df_results = pd.read_json(
f'https://api.census.gov/data/{acs5_year}/\
acs/acs5?get=NAME,B01001_001E&for=county:*&key={key}')
# read_json documentation:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/
# pandas.read_json.html
df_results.head()
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | NAME | B01001_001E | state | county |
1 | Autauga County, Alabama | 58239 | 01 | 001 |
2 | Baldwin County, Alabama | 227131 | 01 | 003 |
3 | Barbour County, Alabama | 25259 | 01 | 005 |
4 | Bibb County, Alabama | 22412 | 01 | 007 |
At this point, the DataFrame’s columns are [0, 1, 2, 3], whereas the columns we want to use are stored within the first row. The following code sets these row values as our column values, then deletes this row:
df_results.columns = df_results.iloc[0]
df_results.drop(0, inplace=True)
df_results.head()
NAME | B01001_001E | state | county | |
---|---|---|---|---|
1 | Autauga County, Alabama | 58239 | 01 | 001 |
2 | Baldwin County, Alabama | 227131 | 01 | 003 |
3 | Barbour County, Alabama | 25259 | 01 | 005 |
4 | Bibb County, Alabama | 22412 | 01 | 007 |
5 | Blount County, Alabama | 58884 | 01 | 009 |
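One caveat worth keeping in mind: as the quoted values in the JSON output above suggest, the Census API returns counts as strings, so numeric columns need an explicit conversion before any arithmetic. Here’s a minimal sketch using a couple of made-up rows in place of an actual API response:

```python
import pandas as pd

# Made-up rows standing in for an API response; note that the population
# values arrive as strings, just like in the JSON output shown above.
df_results = pd.DataFrame(
    {'NAME': ['Autauga County, Alabama', 'Baldwin County, Alabama'],
     'B01001_001E': ['58239', '227131'],
     'state': ['01', '01'],
     'county': ['001', '003']})

# Converting the population column to a numeric type before doing math:
df_results['B01001_001E'] = pd.to_numeric(df_results['B01001_001E'])
print(df_results['B01001_001E'].sum())  # 285370
```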
Importing custom Census functions#
Now that we’ve seen some basic applications of the Census API, we’ll import four Census data import and analysis functions that will help simplify some more complex procedures. I originally defined these functions within this script; however, I chose to move them to a separate Python file (‘census_import_scripts.py’) so that they can be incorporated more easily into other data projects. I highly recommend that you read the documentation for these functions as well so that you can better understand what they are doing (and how to modify them for your own applications if needed).
from census_import_scripts import download_variable_list, \
create_variable_aliases, retrieve_census_data, create_comparison_fields
Retrieving variable and group information#
In order to determine which variable codes to enter into our script, we’ll first need to review a list of all American Community Survey variables and the overall groups into which they fit. The variable lists for each ACS5 are available on the Census website; for instance, the copy for 2021 is available at https://api.census.gov/data/2021/acs/acs5/variables.html. However, we can also use the download_variable_list()
function we just imported to save this list (and its corresponding list of variable groups) to a local .csv file, as shown below.
if download_new_variable_list == True: # This process can take a little
    # while, so if you already have a copy of the variable list you need,
    # consider setting download_new_variable_list to False.
    download_variable_list(acs5_year, 'acs5')
# Reading the group and variable datasets into our script:
df_variables = pd.read_csv(
f'Datasets/acs5_{acs5_year}_variables.csv')
df_groups = pd.read_csv(
f'Datasets/acs5_{acs5_year}_groups.csv')
# The following code will show a condensed set of columns if the script
# is being run for PDF output; that way, the reader can view all of the
# most important columns. This step isn't necessary when the script is
# being run for HTML display, as the user can simply scroll horizontally
# to see all of the columns within the DataFrame.
df_variables[['Name', 'Label', 'Concept', 'Group']].head() if (
render_for_pdf == True) else df_variables.head()
Name | Label | Concept | Group | |
---|---|---|---|---|
0 | B01001_001E | Estimate!!Total: | SEX BY AGE | B01001 |
1 | B01001_002E | Estimate!!Total:!!Male: | SEX BY AGE | B01001 |
2 | B01001_003E | Estimate!!Total:!!Male:!!Under 5 years | SEX BY AGE | B01001 |
3 | B01001_004E | Estimate!!Total:!!Male:!!5 to 9 years | SEX BY AGE | B01001 |
4 | B01001_005E | Estimate!!Total:!!Male:!!10 to 14 years | SEX BY AGE | B01001 |
For reference, here’s a look at df_groups
: (this DataFrame is a simplified version of df_variables
that shows each group code and corresponding concept only once.)
df_groups.head()
Concept | Group | |
---|---|---|
0 | SEX BY AGE | B01001 |
1 | SEX BY AGE (WHITE ALONE) | B01001A |
2 | SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE) | B01001B |
3 | SEX BY AGE (AMERICAN INDIAN AND ALASKA NATIVE ... | B01001C |
4 | SEX BY AGE (ASIAN ALONE) | B01001D |
In order to find variables of interest, I recommend first searching for relevant keywords within the group table (which is much smaller) to identify matching group IDs. Next, you can search for those group IDs inside the variables table to find the exact metrics to request from the Census API.
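Here’s a sketch of that two-step search, with a few made-up rows standing in for the full df_groups and df_variables tables and ‘EDUCATION’ as a hypothetical keyword:

```python
import pandas as pd

# Toy stand-ins for the group and variable tables loaded earlier:
df_groups = pd.DataFrame(
    {'Concept': ['SEX BY AGE',
                 'EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER',
                 'MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS'],
     'Group': ['B01001', 'B15003', 'B19013']})
df_variables = pd.DataFrame(
    {'Name': ['B01001_001E', 'B15003_001E', 'B15003_022E'],
     'Label': ['Estimate!!Total:', 'Estimate!!Total:',
               "Estimate!!Total:!!Bachelor's degree"],
     'Group': ['B01001', 'B15003', 'B15003']})

# Step 1: search the (much smaller) group table for a keyword:
matching_groups = df_groups.loc[
    df_groups['Concept'].str.contains('EDUCATION', na=False),
    'Group'].tolist()
print(matching_groups)  # ['B15003']

# Step 2: look up those group IDs within the variable table:
print(df_variables.query("Group in @matching_groups")['Name'].tolist())
# ['B15003_001E', 'B15003_022E']
```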
The following list stores three variables: (1) the total population; (2) all males aged 25 to 29 years; and (3) all females aged 25 to 29 years. (The B01001 table that stores these values didn’t have an entry for all people aged 25 to 29; we’ll get around this limitation by retrieving sex-specific population totals within this age group, then adding those totals together.)
grad_destinations_variable_list = [
'B01001_001E', 'B01001_011E',
'B01001_035E']
Creating aliases and specifying retrieval years#
The demographic columns in the Census API’s output are labeled with their variable names (e.g. ‘B01001_001E’). These names are concise, but you’ll need a copy of the original variable list to interpret them. Therefore, I chose to replace these column names with a combination of the ‘Label’, ‘Concept’, and ‘Name’ entries in the original variable list. These column names are very long, but they do make the output easier to interpret (while also preserving the original names for reference).
In addition, if the description corresponding to a variable name happens to change from one year to another, the use of aliases will help you identify that change. (This will help prevent you from treating two different data types that happened to use the same variable code in different years as equal.)
The following function assists with this replacement by creating a dictionary whose keys are the original field names (e.g. ‘B01001_001E’) and whose values are the replacement names (e.g. ‘Sex by Age_Estimate!!Total:_B01001_001E’).
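The real implementation lives in ‘census_import_scripts.py’, but the core idea can be sketched roughly as follows (with a one-row stand-in for df_variables; the actual function may differ in its details):

```python
import pandas as pd

# A one-row stand-in for the df_variables table loaded earlier:
df_variables = pd.DataFrame({'Name': ['B01001_001E'],
                             'Label': ['Estimate!!Total:'],
                             'Concept': ['SEX BY AGE']})
variable_list = ['B01001_001E']

# Building a {variable code: descriptive alias} dictionary by combining
# each variable's Concept, Label, and Name entries:
df_selected = df_variables.query("Name in @variable_list")
alias_dict = {row['Name']: f"{row['Concept']}_{row['Label']} ({row['Name']})"
              for _, row in df_selected.iterrows()}
print(alias_dict)
# {'B01001_001E': 'SEX BY AGE_Estimate!!Total: (B01001_001E)'}
```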
Creating our aliases:
grad_destinations_alias_dict = create_variable_aliases(
df_variables=df_variables,
variable_list=grad_destinations_variable_list)
grad_destinations_alias_dict
{'B01001_001E': 'SEX BY AGE_Estimate!!Total: (B01001_001E)',
'B01001_011E': 'SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)',
'B01001_035E': 'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'}
Next, we’ll define a list of years for which we would like to retrieve Census data. In order to make this code easier to use in future years, I’ll define these years as an offset of acs5_year
rather than hardcoding them.
years_to_retrieve = [acs5_year - 12, acs5_year -10,
acs5_year - 8,
acs5_year - 6,
acs5_year - 5,
acs5_year]
# American Community Survey 1-year estimates weren't available in 2020,
# so you'll want to remove that year from your list if it happens to be
# present. You can do so via the following code:
# if 2020 in years_to_retrieve:
# years_to_retrieve.remove(2020)
# However, because I decided to use 5-year rather than 1-year estimates,
# I commented out this line.
years_to_retrieve
[2009, 2011, 2013, 2015, 2016, 2021]
At this point, it’s a good idea to confirm that our three variable codes (‘B01001_001E’, ‘B01001_011E’, and ‘B01001_035E’) had the same meaning for all the years whose data we’ll be retrieving. We can do so by running the following code, which retrieves these variables and their corresponding descriptions for all of the years in years_to_retrieve.
(Due to the size of the variables.html page, this code can take a while to run, so I commented it out below.)
# var_meanings_by_year_df_list = []
# for year in years_to_retrieve:
# df_var_list = pd.read_html(
# f'https://api.census.gov/data/{year}/acs/acs5/variables.html')[
# 0][['Name', 'Label', 'Concept']].query(
# "Name in @grad_destinations_variable_list")
# df_var_list.insert(0, 'Year', year)
# var_meanings_by_year_df_list.append(df_var_list)
# df_var_meanings_by_year = pd.concat(
# [df for df in var_meanings_by_year_df_list])
# df_var_meanings_by_year.to_csv('var_meanings_by_year.csv', index=False)
# df_var_meanings_by_year
Here’s a series of outputs from the saved .csv copy of this table that confirms that these codes had the same meaning in each of the years we’re analyzing:
df_var_meanings_by_year = pd.read_csv(
'var_meanings_by_year.csv')
for name in grad_destinations_variable_list:
    print(df_var_meanings_by_year.query("Name == @name"))
Year Name Label Concept
0 2009 B01001_001E Estimate!!Total SEX BY AGE
3 2011 B01001_001E Estimate!!Total SEX BY AGE
6 2013 B01001_001E Estimate!!Total SEX BY AGE
9 2015 B01001_001E Estimate!!Total SEX BY AGE
12 2016 B01001_001E Estimate!!Total SEX BY AGE
15 2021 B01001_001E Estimate!!Total: SEX BY AGE
Year Name Label Concept
1 2009 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
4 2011 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
7 2013 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
10 2015 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
13 2016 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
16 2021 B01001_011E Estimate!!Total:!!Male:!!25 to 29 years SEX BY AGE
Year Name Label Concept
2 2009 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
5 2011 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
8 2013 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
11 2015 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
14 2016 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
17 2021 B01001_035E Estimate!!Total:!!Female:!!25 to 29 years SEX BY AGE
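This eyeball check can also be automated. Below is a sketch (using a trimmed-down, made-up stand-in for df_var_meanings_by_year) that treats labels as equivalent once colons are stripped, since the 2021 labels shown above differ from earlier years’ labels only in their punctuation:

```python
import pandas as pd

# A trimmed-down stand-in for the df_var_meanings_by_year table:
df_var_meanings_by_year = pd.DataFrame(
    {'Year': [2009, 2021, 2009, 2021],
     'Name': ['B01001_001E', 'B01001_001E', 'B01001_011E', 'B01001_011E'],
     'Label': ['Estimate!!Total', 'Estimate!!Total:',
               'Estimate!!Total!!Male!!25 to 29 years',
               'Estimate!!Total:!!Male:!!25 to 29 years']})

# Stripping colons before comparing, since 2021 labels differ from earlier
# ones only in punctuation:
normalized = df_var_meanings_by_year['Label'].str.replace(
    ':', '', regex=False)
# Each variable code should map to exactly one normalized label:
labels_per_name = normalized.groupby(
    df_var_meanings_by_year['Name']).nunique()
print((labels_per_name == 1).all())  # True
```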
Calling retrieve_census_data()
#
The retrieve_census_data()
function defined within ‘census_import_scripts.py’ simplifies the process of retrieving data from the Census API. It also enables the user to rename variable fields (e.g. ‘B01001_001E’) with aliases for those fields (e.g. ‘Sex by Age_Estimate!!Total: (B01001_001E)’) if desired. In addition, it allows more than 50 variables to be retrieved at the same time, thus simplifying the process of importing especially large datasets.
We’re now ready to retrieve our population totals for the years referenced in years_to_retrieve
. We’ll do so by calling retrieve_census_data()
for each of these years via a for loop, then adding their respective DataFrames together using pd.concat()
.
(Note: Only 5825 results showed up when I requested ACS1 county-level data (as opposed to over 25,000 results for ACS5 data), so many counties were not getting incorporated within the ACS1 results. Therefore, the ACS5 will usually be the better of these two surveys for evaluating county-level growth.)
census_data_by_year_df_list = []
for year in years_to_retrieve:
    df_data = retrieve_census_data(
        survey='acs5', year=year,
        region='county',
        variable_list=grad_destinations_variable_list,
        rename_data_fields=True,
        field_vars_dict=grad_destinations_alias_dict, key=key)
    census_data_by_year_df_list.append(df_data)
df_growth_data_by_year = pd.concat(df for df in census_data_by_year_df_list)
# Removing Puerto Rico from our list of results so as to focus only on counties
# and county equivalents within the 50 US states and DC:
df_growth_data_by_year.query("state != '72'", inplace=True)
df_growth_data_by_year
Year | state | ... | SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E) | SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E) | |
---|---|---|---|---|---|
1 | 2009 | 13 | ... | 634 | 574 |
2 | 2009 | 13 | ... | 531 | 352 |
3 | 2009 | 13 | ... | 3264 | 3900 |
... | ... | ... | ... | ... | ... |
3141 | 2021 | 56 | ... | 574 | 567 |
3142 | 2021 | 56 | ... | 175 | 200 |
3143 | 2021 | 56 | ... | 146 | 112 |
18856 rows × 7 columns
Analyzing this data#
The following cell adds together male and female population totals in order to calculate the total number of 25- to 29-year-olds within each county for each year. It also simplifies the original total population column name in order to improve its readability.
df_growth_data_by_year['Total_Pop_25_to_29'] = (df_growth_data_by_year[
'SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)'] +
df_growth_data_by_year[
'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'])
df_growth_data_by_year.rename(
columns={'SEX BY AGE_Estimate!!Total: (B01001_001E)':'Total_Pop'},
inplace=True)
df_growth_data_by_year.drop(
['SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)',
'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'],
axis=1, inplace=True)
df_growth_data_by_year
Year | state | ... | Total_Pop | Total_Pop_25_to_29 | |
---|---|---|---|---|---|
1 | 2009 | 13 | ... | 19695 | 1208 |
2 | 2009 | 13 | ... | 11641 | 883 |
3 | 2009 | 13 | ... | 95330 | 7164 |
... | ... | ... | ... | ... | ... |
3141 | 2021 | 56 | ... | 20514 | 1141 |
3142 | 2021 | 56 | ... | 7768 | 375 |
3143 | 2021 | 56 | ... | 6891 | 258 |
18856 rows × 6 columns
Next, we’ll apply pd.pivot()
to place all population totals for a given county within the same row, thus making future growth calculations easier.
df_growth_data_by_year_pivot = df_growth_data_by_year.copy().pivot(
columns='Year', index=['NAME', 'county', 'state']).reset_index()
# The values could be named explicitly, but since pivot() will infer them
# automatically, there's no need to do so. I thus removed the following
# code from the pivot() call:
# values=['Total_Pop',
# 'Total_Pop_25_to_29']
df_growth_data_by_year_pivot.head()
NAME | county | ... | Total_Pop_25_to_29 | ||
---|---|---|---|---|---|
Year | ... | 2016 | 2021 | ||
0 | Abbeville County, South Carolina | 001 | ... | 1226.0 | 1292.0 |
1 | Acadia Parish, Louisiana | 001 | ... | 4240.0 | 3705.0 |
2 | Accomack County, Virginia | 001 | ... | 1819.0 | 1907.0 |
3 | Ada County, Idaho | 001 | ... | 30646.0 | 33625.0 |
4 | Adair County, Iowa | 001 | ... | 362.0 | 411.0 |
5 rows × 15 columns
We’ll now call to_flat_index()
to collapse these two-level column headers into a single level of tuple-based column names:
df_growth_data_by_year_pivot.columns = (
df_growth_data_by_year_pivot.columns.to_flat_index())
df_growth_data_by_year_pivot.head()
(NAME, ) | (county, ) | ... | (Total_Pop_25_to_29, 2016) | (Total_Pop_25_to_29, 2021) | |
---|---|---|---|---|---|
0 | Abbeville County, South Carolina | 001 | ... | 1226.0 | 1292.0 |
1 | Acadia Parish, Louisiana | 001 | ... | 4240.0 | 3705.0 |
2 | Accomack County, Virginia | 001 | ... | 1819.0 | 1907.0 |
3 | Ada County, Idaho | 001 | ... | 30646.0 | 33625.0 |
4 | Adair County, Iowa | 001 | ... | 362.0 | 411.0 |
5 rows × 15 columns
Next, we’ll convert the tuple-based columns created by to_flat_index()
to string-based ones by executing a list comprehension. This list comprehension will convert tuples that contain both a string and an integer (e.g. (‘Total_Pop’, 2011)) to a single string with an underscore separating the two elements. Tuples with an empty second entry (e.g. (‘county’, ‘’)) will get replaced with just the first entry (e.g. ‘county’).
df_growth_data_by_year_pivot.columns = [
column[0] + '_' + str(column[1]) if isinstance(column[1], int)
else column[0] for column in df_growth_data_by_year_pivot.columns]
df_growth_data_by_year_pivot.head()
NAME | county | ... | Total_Pop_25_to_29_2016 | Total_Pop_25_to_29_2021 | |
---|---|---|---|---|---|
0 | Abbeville County, South Carolina | 001 | ... | 1226.0 | 1292.0 |
1 | Acadia Parish, Louisiana | 001 | ... | 4240.0 | 3705.0 |
2 | Accomack County, Virginia | 001 | ... | 1819.0 | 1907.0 |
3 | Ada County, Idaho | 001 | ... | 30646.0 | 33625.0 |
4 | Adair County, Iowa | 001 | ... | 362.0 | 411.0 |
5 rows × 15 columns
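Here’s a self-contained toy example of the pivot, flatten, and rename sequence above that you can run without any Census data (the county codes and population values below are made up):

```python
import pandas as pd

# A toy long-format table with one metric ('Total_Pop') and one year:
df = pd.DataFrame({'county': ['001', '003'],
                   'Total_Pop': [100, 200],
                   'Year': [2011, 2011]})
df_pivot = df.pivot(index='county', columns='Year').reset_index()
# Flattening the two-level column headers into tuples:
df_pivot.columns = df_pivot.columns.to_flat_index()
print(list(df_pivot.columns))

# The same tuple-to-string conversion used above:
df_pivot.columns = [
    column[0] + '_' + str(column[1]) if isinstance(column[1], int)
    else column[0] for column in df_pivot.columns]
print(list(df_pivot.columns))  # ['county', 'Total_Pop_2011']
```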
Adding in comparison fields#
Now that we have population data for several years in a relatively easy-to-parse format, we can call create_comparison_fields()
to calculate nominal and percentage changes in population between certain years, along with corresponding percentile and rank information. In the following cell, this function will be called twice to create growth metrics for both total county populations and residents in the 25-29 age range.
for field_var in ['Total_Pop', 'Total_Pop_25_to_29']:
    create_comparison_fields(
        df=df_growth_data_by_year_pivot,
        field_var=field_var,
        year_list=years_to_retrieve,
        field_year_separator='_')
df_growth_data_by_year_pivot.head(5)
NAME | county | ... | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile | |
---|---|---|---|---|---|
0 | Abbeville County, South Carolina | 001 | ... | 1357.0 | 56.829035 |
1 | Acadia Parish, Louisiana | 001 | ... | 2812.0 | 10.506208 |
2 | Accomack County, Virginia | 001 | ... | 1426.0 | 54.632283 |
3 | Ada County, Idaho | 001 | ... | 869.0 | 72.365489 |
4 | Adair County, Iowa | 001 | ... | 588.0 | 81.311684 |
5 rows × 75 columns
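For reference, here’s a rough sketch of the kind of metrics that create_comparison_fields() adds. (The real implementation lives in ‘census_import_scripts.py’ and may differ in its details; the county names and population values below are made up, and the column names follow the patterns shown in this notebook’s output.)

```python
import pandas as pd

# Toy pivoted data with one population column per year:
df = pd.DataFrame({'NAME': ['County A', 'County B', 'County C'],
                   'Total_Pop_2016': [1000, 2000, 4000],
                   'Total_Pop_2021': [1100, 1900, 4600]})

df['2016-2021 Total_Pop Change'] = (
    df['Total_Pop_2021'] - df['Total_Pop_2016'])
df['2016-2021 Total_Pop % Change'] = (
    100 * df['2016-2021 Total_Pop Change'] / df['Total_Pop_2016'])
# Rank 1 = the highest growth rate (matching this notebook's output, in
# which lower rank values correspond to faster-growing counties):
df['2016-2021 Total_Pop % Change Rank'] = df[
    '2016-2021 Total_Pop % Change'].rank(ascending=False)
# Percentiles closer to 100 indicate faster growth:
df['2016-2021 Total_Pop % Change Percentile'] = (
    df['2016-2021 Total_Pop % Change'].rank(pct=True) * 100)
print(df['2016-2021 Total_Pop % Change Rank'].tolist())  # [2.0, 3.0, 1.0]
```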
Ranking counties by population growth#
We’ll now sort this dataset to answer a few questions:
Which counties had the highest (and lowest) total population growth rates between 2016 and 2021? (We’ll evaluate this question for both counties with at least 100,000 residents in 2016 and those with at least one million residents).
Which counties experienced the highest (and lowest) 25-to-29-year-old population growth rates?
First, we’ll specify a few variables that will be used within our analyses:
range_for_sorting = 5
range_start_year = acs5_year - range_for_sorting
total_pop_var_for_sorting = 'Total_Pop'
total_pop_sort_column = f'{range_start_year}-{acs5_year} \
{total_pop_var_for_sorting} % Change Rank'
young_pop_var_for_sorting = 'Total_Pop_25_to_29'
young_pop_sort_column = f'{range_start_year}-{acs5_year} \
{young_pop_var_for_sorting} % Change Rank'
total_pop_col_root = f'{range_start_year}-{acs5_year} \
{total_pop_var_for_sorting}' # The 'root' from which other column entries
# will derive
young_pop_col_root = f'{range_start_year}-{acs5_year} \
{young_pop_var_for_sorting}'
The following function will help condense the output of our analyses, thus making them more readable.
def gen_display_cols(col_root):
    '''This function specifies which DataFrame columns to display within a
    given output. It does so by adding various suffixes to a col_root
    value (e.g. '2016-2021 Total_Pop').'''
    return ['NAME', f'{col_root} Change', f'{col_root} % Change',
            f'{col_root} % Change Rank', f'{col_root} % Change Percentile']
total_pop_display_cols = gen_display_cols(total_pop_col_root)
young_pop_display_cols = gen_display_cols(young_pop_col_root)
total_pop_display_cols
['NAME',
'2016-2021 Total_Pop Change',
'2016-2021 Total_Pop % Change',
'2016-2021 Total_Pop % Change Rank',
'2016-2021 Total_Pop % Change Percentile']
Sorting counties with at least 100K residents in 2016 by their 2016-2021 total population growth rates:#
total_pop_sort_column # The following two sets of output will be
# sorted by percentage change (rather than nominal change) in the total
# population between 2016 and 2021.
'2016-2021 Total_Pop % Change Rank'
df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 100000").sort_values(
total_pop_sort_column).dropna(subset=total_pop_sort_column)[
total_pop_display_cols].reset_index(drop=True)
NAME | 2016-2021 Total_Pop Change | 2016-2021 Total_Pop % Change | 2016-2021 Total_Pop % Change Rank | 2016-2021 Total_Pop % Change Percentile | |
---|---|---|---|---|---|
0 | Hays County, Texas | 48887.0 | 26.327779 | 10.0 | 99.713467 |
1 | Comal County, Texas | 32023.0 | 25.776358 | 12.0 | 99.649793 |
2 | Kaufman County, Texas | 28315.0 | 25.319682 | 13.0 | 99.617956 |
... | ... | ... | ... | ... | ... |
588 | Wayne County, North Carolina | -6755.0 | -5.428014 | 2724.0 | 13.307864 |
589 | Hinds County, Mississippi | -13824.0 | -5.651769 | 2746.0 | 12.607450 |
590 | Robeson County, North Carolina | -15740.0 | -11.695993 | 3056.0 | 2.737982 |
591 rows × 5 columns
The three counties with the highest 2016-2021 growth rates were all in Texas. Hays, Comal, and Kaufman Counties are located within the Austin, San Antonio, and DFW metropolitan areas, respectively (according to Wikipedia).
Sorting counties with at least one million residents in 2016 by their 2016-2021 total population growth rates:#
Combining concat()
with head()
and tail()
will allow us to display the counties with the highest and lowest growth rates within the same DataFrame.
df_1m_pop_pct_changes = df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 1000000").sort_values(
total_pop_sort_column).dropna(
subset=total_pop_sort_column).copy().reset_index()
pd.concat([df_1m_pop_pct_changes.head(), df_1m_pop_pct_changes.tail()])[
total_pop_display_cols]
NAME | 2016-2021 Total_Pop Change | 2016-2021 Total_Pop % Change | 2016-2021 Total_Pop % Change Rank | 2016-2021 Total_Pop % Change Percentile | |
---|---|---|---|---|---|
0 | Orange County, Florida | 153894.0 | 12.252170 | 94.0 | 97.039160 |
1 | Travis County, Texas | 119619.0 | 10.418176 | 140.0 | 95.574658 |
2 | Hillsborough County, Florida | 121300.0 | 9.168147 | 180.0 | 94.301178 |
... | ... | ... | ... | ... | ... |
40 | Cuyahoga County, Ohio | 4957.0 | 0.393816 | 1468.0 | 53.295129 |
41 | St. Louis County, Missouri | 1422.0 | 0.142120 | 1543.0 | 50.907354 |
42 | Los Angeles County, California | -37520.0 | -0.373068 | 1687.0 | 46.322827 |
10 rows × 5 columns
Repeating these steps in order to compare 25-to-29-year-old population growth rates#
df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 100000").sort_values(
young_pop_sort_column).dropna(subset=young_pop_sort_column)[
young_pop_display_cols].reset_index(drop=True)
NAME | 2016-2021 Total_Pop_25_to_29 Change | 2016-2021 Total_Pop_25_to_29 % Change | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile | |
---|---|---|---|---|---|
0 | Douglas County, Colorado | 4368.0 | 31.155492 | 148.0 | 95.319962 |
1 | Comal County, Texas | 2035.0 | 31.116208 | 150.0 | 95.256288 |
2 | Kaufman County, Texas | 2062.0 | 30.602553 | 152.0 | 95.192614 |
... | ... | ... | ... | ... | ... |
588 | St. Lawrence County, New York | -745.0 | -10.792409 | 2755.0 | 12.320917 |
589 | Caddo Parish, Louisiana | -2048.0 | -11.088852 | 2765.0 | 12.002547 |
590 | Orleans Parish, Louisiana | -6371.0 | -16.970327 | 2898.0 | 7.768227 |
591 rows × 5 columns
Comal and Kaufman counties ranked #2 and #3, respectively, among counties with 100K+ residents for both total population growth and 25-29 year-old population growth between 2016 and 2021.
df_1m_pop_pct_changes = df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 1000000").sort_values(
young_pop_sort_column).dropna(
subset=young_pop_sort_column).copy().reset_index()
pd.concat([df_1m_pop_pct_changes.head(), df_1m_pop_pct_changes.tail()])[
young_pop_display_cols]
 | NAME | 2016-2021 Total_Pop_25_to_29 Change | 2016-2021 Total_Pop_25_to_29 % Change | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile |
---|---|---|---|---|---|
0 | Pima County, Arizona | 9395.0 | 15.212850 | 490.0 | 84.431710 |
1 | Wayne County, Michigan | 16073.0 | 13.496289 | 590.0 | 81.248010 |
2 | King County, Washington | 22087.0 | 12.478743 | 643.0 | 79.560649 |
... | ... | ... | ... | ... | ... |
40 | Miami-Dade County, Florida | -11062.0 | -5.717180 | 2504.0 | 20.312003 |
41 | Montgomery County, Maryland | -4043.0 | -6.120379 | 2521.0 | 19.770774 |
42 | New York County, New York | -14977.0 | -7.548168 | 2608.0 | 17.000955 |
10 rows × 5 columns
Although it’s interesting to see which counties had the very highest and lowest growth rates, it’s hard to gain an understanding of nationwide population trends via these tables alone. We can get a more comprehensive view of such trends via choropleth maps, which will allow us to present all counties within an easy-to-interpret graphic. We’ll create examples of these maps within the Mapping section of Python for Nonprofits.
To allow our mapping code to access this data, we’ll first save it to a .csv file:
df_growth_data_by_year_pivot.to_csv(
    'Datasets/grad_destinations_acs_county_data.csv',
    index=False)
Retrieving state-level population growth data#
Although county-level population growth data provides a fascinating look at within-state trends, the broader view that state-level data offers will also prove useful. Therefore, we’ll now repurpose the functions, variables, and code that we used earlier to import county-level data in order to create a state-level equivalent of df_growth_data_by_year.
census_data_by_year_df_list = []
for year in years_to_retrieve:
    df_data = retrieve_census_data(
        survey='acs5', year=year,
        region='state',
        variable_list=grad_destinations_variable_list,
        rename_data_fields=True, field_vars_dict=grad_destinations_alias_dict,
        key=key)
    census_data_by_year_df_list.append(df_data)
# Combining each year's results into a single DataFrame, then
# removing Puerto Rico (state FIPS code 72):
df_state_growth_data_by_year = pd.concat(census_data_by_year_df_list)
df_state_growth_data_by_year.query("state != '72'", inplace=True)
df_state_growth_data_by_year
 | Year | state | ... | SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E) | SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E) |
---|---|---|---|---|---|
1 | 2009 | 01 | ... | 155174 | 157476 |
2 | 2009 | 02 | ... | 28543 | 24663 |
3 | 2009 | 04 | ... | 251627 | 228420 |
... | ... | ... | ... | ... | ... |
49 | 2021 | 54 | ... | 55187 | 52671 |
50 | 2021 | 55 | ... | 189562 | 180360 |
51 | 2021 | 56 | ... | 19511 | 17813 |
306 rows × 6 columns
df_state_growth_data_by_year['Total_Pop_25_to_29'] = (
    df_state_growth_data_by_year[
        'SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)'] +
    df_state_growth_data_by_year[
        'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'])
df_state_growth_data_by_year.rename(
    columns={'SEX BY AGE_Estimate!!Total: (B01001_001E)': 'Total_Pop'},
    inplace=True)
df_state_growth_data_by_year.drop(
    ['SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)',
     'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'],
    axis=1, inplace=True)
df_state_growth_data_by_year
 | Year | state | NAME | Total_Pop | Total_Pop_25_to_29 |
---|---|---|---|---|---|
1 | 2009 | 01 | Alabama | 4633360 | 312650 |
2 | 2009 | 02 | Alaska | 683142 | 53206 |
3 | 2009 | 04 | Arizona | 6324865 | 480047 |
... | ... | ... | ... | ... | ... |
49 | 2021 | 54 | West Virginia | 1801049 | 107858 |
50 | 2021 | 55 | Wisconsin | 5871661 | 369922 |
51 | 2021 | 56 | Wyoming | 576641 | 37324 |
306 rows × 5 columns
df_state_growth_data_by_year_pivot = (
    df_state_growth_data_by_year.copy().pivot(
        columns='Year', index=['NAME', 'state']).reset_index())
df_state_growth_data_by_year_pivot.columns = (
    df_state_growth_data_by_year_pivot.columns.to_flat_index())
df_state_growth_data_by_year_pivot.columns = [
    column[0] + '_' + str(column[1]) if isinstance(column[1], int)
    else column[0] for column
    in df_state_growth_data_by_year_pivot.columns]
df_state_growth_data_by_year_pivot.head()
 | NAME | state | ... | Total_Pop_25_to_29_2016 | Total_Pop_25_to_29_2021 |
---|---|---|---|---|---|
0 | Alabama | 01 | ... | 319177 | 331435 |
1 | Alaska | 02 | ... | 61185 | 59218 |
2 | Arizona | 04 | ... | 456680 | 496623 |
3 | Arkansas | 05 | ... | 194179 | 197068 |
4 | California | 06 | ... | 2918435 | 2992210 |
5 rows × 14 columns
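The pivot-then-flatten pattern shown above can be illustrated with a toy example. (The table below uses made-up population figures and a hypothetical df_long name; the f-string-based flattening step is a slight variation on the list comprehension used in the code above.)

```python
import pandas as pd

# A toy long-format table resembling df_state_growth_data_by_year
# (population figures are invented for illustration):
df_long = pd.DataFrame({
    'NAME': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'Year': [2016, 2021, 2016, 2021],
    'Total_Pop': [4860000, 5040000, 741000, 733000]})

# Pivoting yields one row per state and a MultiIndex column
# for each (field, year) pair:
df_pivot = df_long.pivot(index='NAME', columns='Year').reset_index()

# Flattening the MultiIndex produces names like 'Total_Pop_2016';
# the 'NAME' column (whose second tuple element is blank) keeps
# its original single-level name:
df_pivot.columns = [
    f'{field}_{year}' if year != '' else field
    for field, year in df_pivot.columns.to_flat_index()]

print(df_pivot.columns.tolist())
# ['NAME', 'Total_Pop_2016', 'Total_Pop_2021']
```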
for field_var in ['Total_Pop', 'Total_Pop_25_to_29']:
    create_comparison_fields(
        df=df_state_growth_data_by_year_pivot,
        field_var=field_var,
        year_list=years_to_retrieve,
        field_year_separator='_')
Saving this dataset to a .csv file so that it too can be used as a basis for state-level choropleth maps:
df_state_growth_data_by_year_pivot.to_csv(
    'Datasets/grad_destinations_acs_state_data.csv',
    index=False)
Creating a DataFrame that shows the 5 states with the highest and lowest total population growth rates during the latest 5 years in the dataset:
df_state_growth_data_by_year_pivot.sort_values(
    total_pop_sort_column, inplace=True)
df_state_growth_data_by_year_pivot.reset_index(drop=True, inplace=True)
pd.concat(
    [df_state_growth_data_by_year_pivot.head(5),
     df_state_growth_data_by_year_pivot.tail(5)])[
    total_pop_display_cols]
 | NAME | 2016-2021 Total_Pop Change | 2016-2021 Total_Pop % Change | 2016-2021 Total_Pop % Change Rank | 2016-2021 Total_Pop % Change Percentile |
---|---|---|---|---|---|
0 | Idaho | 176134 | 10.769540 | 1.0 | 100.000000 |
1 | Utah | 282943 | 9.596405 | 2.0 | 98.039216 |
2 | Nevada | 220066 | 7.751063 | 3.0 | 96.078431 |
... | ... | ... | ... | ... | ... |
48 | Mississippi | -22169 | -0.741639 | 49.0 | 5.882353 |
49 | Wyoming | -6388 | -1.095657 | 50.0 | 3.921569 |
50 | West Virginia | -45043 | -2.439911 | 51.0 | 1.960784 |
10 rows × 5 columns
Creating a similar DataFrame that displays 25-to-29-year-old population growth rates:
df_state_growth_data_by_year_pivot.sort_values(
    young_pop_sort_column, inplace=True)
df_state_growth_data_by_year_pivot.reset_index(drop=True, inplace=True)
pd.concat(
    [df_state_growth_data_by_year_pivot.head(5),
     df_state_growth_data_by_year_pivot.tail(5)])[
    young_pop_display_cols]
 | NAME | 2016-2021 Total_Pop_25_to_29 Change | 2016-2021 Total_Pop_25_to_29 % Change | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile |
---|---|---|---|---|---|
0 | Utah | 29737 | 13.733051 | 1.0 | 100.000000 |
1 | Washington | 58422 | 11.363760 | 2.0 | 98.039216 |
2 | Tennessee | 46477 | 10.661207 | 3.0 | 96.078431 |
... | ... | ... | ... | ... | ... |
48 | Alaska | -1967 | -3.214840 | 49.0 | 5.882353 |
49 | Louisiana | -15558 | -4.581151 | 50.0 | 3.921569 |
50 | Wyoming | -3359 | -8.256520 | 51.0 | 1.960784 |
10 rows × 5 columns
Retrieving data for our education/income regression analyses#
The following code creates tables of populations, educational attainment levels, median earnings, and regions for each US county and state. This data can then be incorporated into regression analyses later in Python for Nonprofits.
Importing US Census data on educational attainment and median earnings#
# Retrieving a new list of ACS5 variables for the most recent year with
# ACS5 data (as of February 2025):
if download_new_variable_list:
    download_variable_list(acs5_latest_year, 'acs5')
# Reading this dataset into our script:
df_variables_latest = pd.read_csv(
    f'Datasets/acs5_{acs5_latest_year}_variables.csv')
regression_variable_list = [
    'B01001_001E', 'B15003_001E', 'B15003_022E',
    'B15003_023E', 'B15003_024E', 'B15003_025E', 'B20004_001E',
    'B20004_002E', 'B20004_003E', 'B20004_004E', 'B20004_005E',
    'B20004_006E']
regression_alias_dict = create_variable_aliases(
    df_variables=df_variables_latest,
    variable_list=regression_variable_list)
df_regression_data_county = retrieve_census_data(
    survey='acs5', year=acs5_latest_year,
    region='county',
    variable_list=regression_variable_list,
    rename_data_fields=True, field_vars_dict=regression_alias_dict,
    key=key)
df_regression_data_state = retrieve_census_data(
    survey='acs5', year=acs5_latest_year,
    region='state',
    variable_list=regression_variable_list,
    rename_data_fields=True, field_vars_dict=regression_alias_dict,
    key=key)
# Retrieving national-level data (which will be useful for another
# data analysis that I'll include within the regressions section
# of Python for Nonprofits):
df_regression_data_national = retrieve_census_data(
    survey='acs5', year=acs5_latest_year,
    region='us',
    variable_list=regression_variable_list,
    rename_data_fields=True, field_vars_dict=regression_alias_dict,
    key=key)
df_regression_data_state
 | Year | state | ... | Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted Dollars) by Sex by Educational Attainment for the Population 25 Years and Over_Estimate!!Total:!!Bachelor's degree (B20004_005E) | Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted Dollars) by Sex by Educational Attainment for the Population 25 Years and Over_Estimate!!Total:!!Graduate or professional degree (B20004_006E) |
---|---|---|---|---|---|
1 | 2023 | 01 | ... | 59318 | 72667 |
2 | 2023 | 02 | ... | 68158 | 90606 |
3 | 2023 | 04 | ... | 65087 | 82033 |
... | ... | ... | ... | ... | ... |
50 | 2023 | 55 | ... | 64443 | 78648 |
51 | 2023 | 56 | ... | 58411 | 73535 |
52 | 2023 | 72 | ... | 28988 | 41197 |
52 rows × 15 columns
The following code calculates the percentage of residents within a given row who have at least a bachelor’s degree; it also dramatically condenses many column names. (I wrapped these steps in a function so that the same code could be used to modify our state-level, county-level, and national datasets.)
def rename_and_calc_dataset(df, year):
    '''This function renames certain median earnings columns within
    a DataFrame of census results and also estimates the percentage of
    respondents within each row who have at least a bachelor's degree.

    df: the DataFrame containing census results to process.
    year: the year in which the ACS was conducted. (This year shows
    up in many median earnings columns.)'''
    df.rename(columns={
        'Median Earnings in the \
Past 12 Months (in 2023 Inflation-Adjusted Dollars) by Sex by Educational \
Attainment for the Population 25 Years and Over_Estimate!!\
Total:!!Less than high school graduate (B20004_002E)':
        'Median_Earnings_Less_Than_HS',
        'Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted \
Dollars) by Sex by Educational Attainment for the Population 25 Years \
and Over_Estimate!!Total:!!High school graduate \
(includes equivalency) (B20004_003E)': 'Median_Earnings_HS',
        "Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted \
Dollars) by Sex by Educational Attainment for the Population 25 Years \
and Over_Estimate!!Total:!!Some college or associate's \
degree (B20004_004E)": "Median_Earnings_Some_College",
        "Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted \
Dollars) by Sex by Educational Attainment for the Population 25 \
Years and Over_Estimate!!Total:!!Bachelor's degree (B20004_005E)":
        "Median_Earnings_Bachelors_Degree",
        "Median Earnings in the Past 12 Months (in 2023 \
Inflation-Adjusted Dollars) by Sex by Educational Attainment \
for the Population 25 Years and Over_Estimate!!Total:!!Graduate \
or professional degree (B20004_006E)": "Median_Earnings_Postgraduate",
        "Median Earnings in the Past 12 Months (in 2023 \
Inflation-Adjusted Dollars) by Sex by Educational Attainment \
for the Population 25 Years and Over_Estimate!!Total: (B20004_001E)":
        "Median_Earnings_for_Total_25plus_Population",
        'Sex by Age_Estimate!!Total: (B01001_001E)': 'Total_Population',
    }, inplace=True)
    df["Pct_With_Bachelors_Degree"] = (100 * (
        df["Educational Attainment for the Population \
25 Years and Over_Estimate!!Total:!!Bachelor's degree (B15003_022E)"] +
        df["Educational Attainment for the Population 25 \
Years and Over_Estimate!!Total:!!Master's degree (B15003_023E)"] +
        df["Educational Attainment for the Population \
25 Years and Over_Estimate!!Total:!!Professional school degree \
(B15003_024E)"] +
        df["Educational Attainment for the Population 25 \
Years and Over_Estimate!!Total:!!Doctorate degree (B15003_025E)"])
        / df["Educational Attainment for the \
Population 25 Years and Over_Estimate!!Total: (B15003_001E)"])
rename_and_calc_dataset(df=df_regression_data_county,
                        year=acs5_latest_year)
rename_and_calc_dataset(df=df_regression_data_state,
                        year=acs5_latest_year)
rename_and_calc_dataset(df=df_regression_data_national,
                        year=acs5_latest_year)
df_regression_data_state
 | Year | state | ... | Median_Earnings_Postgraduate | Pct_With_Bachelors_Degree |
---|---|---|---|---|---|
1 | 2023 | 01 | ... | 72667 | 27.755139 |
2 | 2023 | 02 | ... | 90606 | 31.242276 |
3 | 2023 | 04 | ... | 82033 | 32.610332 |
... | ... | ... | ... | ... | ... |
50 | 2023 | 55 | ... | 78648 | 32.758344 |
51 | 2023 | 56 | ... | 73535 | 29.920304 |
52 | 2023 | 72 | ... | 41197 | 29.130778 |
52 rows × 16 columns
Mapping each county and state to a geographical region#
As part of our regression analysis, we’ll want to determine whether the relationship between the prevalence of bachelor’s degrees and median earnings is influenced by the region in which a particular county or state is located. I’ll first merge in this data for our county-level dataset; next, I’ll do the same for the state-level table.
In order to add region information to our county-level dataset, we’ll first need to determine the state corresponding to each county; we can then map these states to their respective regions. Our dataset already contains numerical values for each state in the form of FIPS codes; however, to make sense of these codes, we’ll also need to import a table that shows the actual state name for each code. Such a table is available on the Census website at https://www2.census.gov/geo/docs/reference/codes2020/national_state2020.txt .
I copied and pasted the contents of this table into state_fips_codes_from_census.txt, which can be found in the same folder as this script. The following code reads this table into a DataFrame. Note that (1) you can apply read_csv() to more than just .csv files and that (2) because the entries in this table are separated by the ‘|’ character, I needed to specify that character as my separator.
df_fips = pd.read_csv('state_fips_codes_from_census.txt',
                      sep='|')

# Renaming columns to help facilitate an upcoming merge:
df_fips.rename(columns={'STATE': 'State_Abbrev',
                        'STATEFP': 'state'}, inplace=True)

# Converting the 'state' column into a zero-filled string
# so that its values can get matched with the corresponding
# FIPS codes in df_regression_data_county:
df_fips['state'] = df_fips['state'].astype('str').str.zfill(2)
df_fips
 | State_Abbrev | state | STATENS | STATE_NAME |
---|---|---|---|---|
0 | AL | 01 | 1779775 | Alabama |
1 | AK | 02 | 1785533 | Alaska |
2 | AZ | 04 | 1779777 | Arizona |
... | ... | ... | ... | ... |
54 | PR | 72 | 1779808 | Puerto Rico |
55 | UM | 74 | 1878752 | U.S. Minor Outlying Islands |
56 | VI | 78 | 1802710 | United States Virgin Islands |
57 rows × 4 columns
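The zero-filling step above matters because the Census API returns FIPS codes as zero-padded strings (e.g. '01'), whereas pandas reads the codes in the FIPS table as plain integers (e.g. 1), and an integer key will never match a string key. A minimal sketch of the fix, using hypothetical three-state mini-tables:

```python
import pandas as pd

# FIPS codes as read from a text file (integers) vs. as returned
# by the Census API (zero-padded strings); small made-up sample:
df_fips_demo = pd.DataFrame({'state': [1, 2, 4],
                             'State_Abbrev': ['AL', 'AK', 'AZ']})
df_api_demo = pd.DataFrame({'state': ['01', '02', '04'],
                            'Total_Pop': [5000000, 730000, 7200000]})

# Converting the integer codes to zero-filled two-character strings
# makes the merge keys comparable:
df_fips_demo['state'] = df_fips_demo['state'].astype('str').str.zfill(2)
merged = df_fips_demo.merge(df_api_demo, on='state', how='left')
print(merged)
```

Without the zfill() call, the merge keys would have incompatible dtypes (int64 vs. object), and no rows would match.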
df_regression_data_county = df_regression_data_county.merge(
    df_fips[['State_Abbrev', 'state']],
    on='state', how='left')
df_regression_data_county
 | Year | state | ... | Pct_With_Bachelors_Degree | State_Abbrev |
---|---|---|---|---|---|
0 | 2023 | 01 | ... | 28.282680 | AL |
1 | 2023 | 01 | ... | 32.797637 | AL |
2 | 2023 | 01 | ... | 11.464715 | AL |
... | ... | ... | ... | ... | ... |
3219 | 2023 | 72 | ... | 22.481068 | PR |
3220 | 2023 | 72 | ... | 18.360524 | PR |
3221 | 2023 | 72 | ... | 27.116371 | PR |
3222 rows × 18 columns
Now that we know each county’s state abbreviation, we can merge region data into our table as well. We’ll do so by importing state_regions.csv, a file that stores the state-region relationships found in this handy PDF file: https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
df_state_regions = pd.read_csv('state_regions.csv')
df_state_regions
 | State_Abbrev | Region |
---|---|---|
0 | WA | West |
1 | OR | West |
2 | CA | West |
... | ... | ... |
48 | VT | Northeast |
49 | NH | Northeast |
50 | ME | Northeast |
51 rows × 2 columns
df_regression_data_county = df_regression_data_county.merge(
    df_state_regions, on='State_Abbrev', how='left')
df_regression_data_county
 | Year | state | ... | State_Abbrev | Region |
---|---|---|---|---|---|
0 | 2023 | 01 | ... | AL | South |
1 | 2023 | 01 | ... | AL | South |
2 | 2023 | 01 | ... | AL | South |
... | ... | ... | ... | ... | ... |
3219 | 2023 | 72 | ... | PR | NaN |
3220 | 2023 | 72 | ... | PR | NaN |
3221 | 2023 | 72 | ... | PR | NaN |
3222 rows × 19 columns
Merging region information into our state-level dataset#
Although the state-level Census dataset contains full state names, I’ll still use the same approach that I took for the county-level dataset, as that reduces the amount of additional code I need to write. As the following cell demonstrates, it’s possible to chain multiple merge() calls together; this allows for more concise code, though if anything goes wrong, you may need to run each merge operation individually for debugging purposes.
df_regression_data_state = df_regression_data_state.merge(
    df_fips[['State_Abbrev', 'state']],
    on='state', how='left').merge(
        df_state_regions, on='State_Abbrev', how='left')
df_regression_data_state
 | Year | state | ... | State_Abbrev | Region |
---|---|---|---|---|---|
0 | 2023 | 01 | ... | AL | South |
1 | 2023 | 02 | ... | AK | West |
2 | 2023 | 04 | ... | AZ | West |
... | ... | ... | ... | ... | ... |
49 | 2023 | 55 | ... | WI | Midwest |
50 | 2023 | 56 | ... | WY | West |
51 | 2023 | 72 | ... | PR | NaN |
52 rows × 18 columns
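When a merge like the ones above produces unexpected NaN values (as with Puerto Rico’s missing Region entries), one debugging option is pandas’ indicator parameter, which flags how each row matched. A small sketch using hypothetical mini-versions of the two tables:

```python
import pandas as pd

# Hypothetical mini-versions of the state and region tables:
df_states_demo = pd.DataFrame({'State_Abbrev': ['AL', 'AK', 'PR']})
df_regions_demo = pd.DataFrame({'State_Abbrev': ['AL', 'AK'],
                                'Region': ['South', 'West']})

# indicator=True appends a categorical '_merge' column that labels
# each row as 'both', 'left_only', or 'right_only':
check = df_states_demo.merge(df_regions_demo, on='State_Abbrev',
                             how='left', indicator=True)

# Rows flagged 'left_only' had no match in the regions table;
# PR appears here, mirroring the NaN Region values shown above:
print(check[check['_merge'] == 'left_only'])
```

Filtering on '_merge' this way can pinpoint which keys failed to match without having to break a chained merge apart.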
Saving these DataFrames to .csv files so that they can be incorporated into the regressions section of Python for Nonprofits:
df_regression_data_county.to_csv(
    'Datasets/education_and_earnings_county.csv', index=False)
df_regression_data_state.to_csv(
    'Datasets/education_and_earnings_state.csv', index=False)
df_regression_data_national.to_csv(
    'Datasets/education_and_earnings_national.csv', index=False)
Appendix#
1: The Census Python Library#
It’s worth noting that there is also a ‘census’ Python library (available via PyPI and conda) that helps simplify the process of requesting API data. You may choose to use it for your own Census research, but I ended up not needing it for the data retrieval tasks shown above. In addition, forgoing the library allowed me to demonstrate how to retrieve data directly from an API, which you may find helpful when working with APIs that don’t have a corresponding Python library.
Here’s an example of the Census library in use:
# Example of reading data from the Census library into a
# Pandas DataFrame:
from census import Census
c = Census(key)
pd.DataFrame(c.acs5.get(('NAME', 'B01001_001E'),
                        {'for': 'county:*'}))
 | NAME | B01001_001E | state | county |
---|---|---|---|---|
0 | Autauga County, Alabama | 58761.0 | 01 | 001 |
1 | Baldwin County, Alabama | 233420.0 | 01 | 003 |
2 | Barbour County, Alabama | 24877.0 | 01 | 005 |
... | ... | ... | ... | ... |
3219 | Villalba Municipio, Puerto Rico | 21984.0 | 72 | 149 |
3220 | Yabucoa Municipio, Puerto Rico | 30313.0 | 72 | 151 |
3221 | Yauco Municipio, Puerto Rico | 33988.0 | 72 | 153 |
3222 rows × 4 columns
2: The requests library#
We can also use Python’s requests library to retrieve data from the Census API and then parse the response as JSON:
# The following code borrows from the requests library documentation at
# https://docs.python-requests.org/en/latest/index.html
import requests

r = requests.get(f'https://api.census.gov/data/{acs5_year}/\
acs/acs5?get=NAME,B01001_001E&for=county:*&key={key}')

# Printing the first 300 characters of this output:
print("r.text:\n", r.text[0:300], '\n')

# Printing the first 5 entries of r.json():
print("r.json:\n", r.json()[0:5], '\n')
r.text:
[["NAME","B01001_001E","state","county"],
["Autauga County, Alabama","58239","01","001"],
["Baldwin County, Alabama","227131","01","003"],
["Barbour County, Alabama","25259","01","005"],
["Bibb County, Alabama","22412","01","007"],
["Blount County, Alabama","58884","01","009"],
["Bullock County, Ala
r.json:
[['NAME', 'B01001_001E', 'state', 'county'], ['Autauga County, Alabama', '58239', '01', '001'], ['Baldwin County, Alabama', '227131', '01', '003'], ['Barbour County, Alabama', '25259', '01', '005'], ['Bibb County, Alabama', '22412', '01', '007']]
Converting our response to JSON allows it to be easily read into a Pandas DataFrame, as shown below:
pd.DataFrame(r.json()).head()
# Note that pd.DataFrame(r.text) would produce the following error:
# "ValueError: DataFrame constructor not properly called!"
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | NAME | B01001_001E | state | county |
1 | Autauga County, Alabama | 58239 | 01 | 001 |
2 | Baldwin County, Alabama | 227131 | 01 | 003 |
3 | Barbour County, Alabama | 25259 | 01 | 005 |
4 | Bibb County, Alabama | 22412 | 01 | 007 |
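Note that the DataFrame above stores the API’s header row as its first data row. If you’d rather promote that row to the column names, one option (a small variation, not the approach the script itself relies on) is to pass it to the columns argument:

```python
import pandas as pd

# A miniature stand-in for the r.json() output shown above:
response_json = [['NAME', 'B01001_001E', 'state', 'county'],
                 ['Autauga County, Alabama', '58239', '01', '001'],
                 ['Baldwin County, Alabama', '227131', '01', '003']]

# Passing the first row to the columns argument keeps the header
# out of the data rows:
df = pd.DataFrame(response_json[1:], columns=response_json[0])
print(df)
```

This yields a DataFrame whose columns are named NAME, B01001_001E, state, and county, with only the county rows stored as data.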
I included this approach in the appendix because you may find the requests library useful for other online data retrieval tasks. However, our use of pd.read_json() to import Census data rendered an explicit call to the requests library unnecessary.
3: Importing 51+ variables#
I tested the retrieve_census_data() function’s ability to import data for 51 or more variables via the following code. Feel free to uncomment and run it yourself to try out this functionality.
# test_list = list(df_variables['Name'][0:151])
# test_alias_dict = create_variable_aliases(
# df_variables=df_variables,
# variable_list=test_list)
# test_acs5_data = retrieve_census_data(
# survey='acs5', year=acs5_year, region='county',
# variable_list=test_list,
# rename_data_fields=True,
# field_vars_dict=test_alias_dict, key=key)
# test_acs5_data
program_end_time = time.time()
run_time = round(program_end_time - program_start_time, 3)
print(f"Finished running script in {run_time} seconds.")
Finished running script in 11.382 seconds.
Conclusion#
We imported a great deal of data within this script! Although we performed a few basic analyses as well, the real value of the data we retrieved will become apparent within the Mapping and Regressions sections of Python for Nonprofits.
Next, I’ll introduce the ‘census_import_scripts.py’ file, whose functions helped simplify much of the import and analysis code found within this section. I recommend reading through its documentation as well, since doing so will help you adapt its functions to your own use cases.