Census Data Imports#
The US Census Bureau is a fantastic source of free demographic data. As this notebook will demonstrate, Python allows us to easily access large amounts of this data at once. In this script, we’ll use the Census Bureau’s API to (1) calculate state- and county-level population growth data and (2) gather data on education levels and median incomes that will prove useful for future regression analyses.
(Note: I had initially planned to feature this script right after PFN’s Data Retrieval chapter. However, this notebook not only imports Census data, but also reformats it and generates some descriptive statistics. Therefore, I decided it would be best to wait to cover this subject until now.)
An introduction to the American Community Survey#
Many Americans probably associate the US Census Bureau with its decennial Census. However, the Census Bureau also conducts the American Community Survey (https://www.census.gov/data/developers/data-sets/acs-5year.html) each year, making it an ideal resource for recent demographic data.
This notebook will source data from the American Community Survey’s 5-year estimates, which average results over the preceding five years. (For example, the 2021 ACS5 dataset averages results from 2017 through 2021.) The 1-year ACS estimates (https://www.census.gov/data/developers/data-sets/acs-1year.html) cover a more recent timeframe; however, because the 5-year estimates are sourced from a larger pool of data, they may be more reliable (especially for smaller regions). In addition, 1-year estimates aren’t available for certain regions, such as zip codes and counties with smaller populations.
For the sake of brevity, I’ll often refer to the American Community Survey’s 5-year estimates as the ‘ACS5’ survey.
Another great source of Census data, which allows for a much wider variety of analyses, is American Community Survey microdata. These microdata tables show individual respondent rows rather than pre-calculated estimates, thus allowing for all sorts of custom filters, comparisons, and statistical tests. You can download 1- and 5-year ACS microdata from the Census website (https://www.census.gov/data/developers/data-sets/census-microdata-api.html); however, I would recommend using IPUMS (https://usa.ipums.org/usa/) for microdata retrieval, as it makes comparisons across different time periods much easier.
This script will not feature microdata imports, but if you plan to use Census data within your own work, you’ll definitely want to learn more about the benefits (and disadvantages) of microdata. The US Census guide at https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021.pdf is a good place to start. One notable disadvantage of microdata files is that, for confidentiality reasons, they do not include county- or zip-level information (though IPUMS has been able to map certain microdata records to certain US counties).
Finally, I’ll provide a brief introduction to the two main data-gathering tasks that we’ll tackle within this script:
Deciding where to move to start a family#
Let’s say that some NVCU seniors are interested in settling down and raising a family a few years after they graduate. Because they’d prefer to live in a growing region rather than a declining one, they want to know which areas have seen the highest growth rates in recent years. They’d like to see this data both for all residents within each county and those aged 25-29.
In order to answer these questions, we’ll use the Census API to retrieve US county population growth data from the ACS5 for a selected set of years. We’ll then use this data to calculate population growth rates across multiple periods.
Examining connections between education levels and median incomes#
The NVCU admissions department would like to convince high schoolers who are on the fence about higher education that going to college is (generally) worth it from a financial perspective. Therefore, they would like you to determine how the median income for a given county and state increases as the percentage of residents with at least a bachelor’s degree increases.* They figure that a strong relationship between the two could help convince teenagers to pursue a higher education.**
We can shed some light on these questions by performing a series of linear regressions, but first, we’ll need to obtain Census data to incorporate into those regressions.
*The monetary value of a bachelor’s degree can be better estimated by simply comparing the median incomes of state/county residents across different education levels. Therefore, this notebook will retrieve these medians as well. However, the continuous nature of the ‘% with bachelor’s’ metric will make it a great fit for our linear regression example.
**There are of course many non-monetary benefits to a college education, but given the size of student loans these days, it doesn’t hurt to look into the financial rewards of a bachelor’s degree.
Getting started with the Census API#
We’ll first import a few relevant libraries and set several configuration variables:
import time
program_start_time = time.time()
import pandas as pd
pd.set_option('display.max_columns', 1000) # This max column setting
# will prevent columns in the notebook from being hidden; however,
# if render_for_pdf (defined in the Appendix's helper_funcs.py file)
# is set to True, this setting will get overwritten.
import numpy as np
from iteration_utilities import duplicates
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook, render_for_pdf
display_type = config_notebook(display_max_columns=5,
                               display_max_rows=6)
acs5_year = 2021 # By updating this variable when future American
# Community Surveys get released, you should be able to retrieve the most
# recent data possible. (If changes to the survey's format are made,
# however, updates to the scripts may be necessary.)
# Note: I had originally set acs5_year to 2022, the latest year for which
# ACS5 data were available at the time. However, due to a recent change
# in Connecticut's county-equivalent boundaries (see
# https://www.federalregister.gov/documents/2022/06/06/2022-12063/
# change-to-county-equivalents-in-the-state-of-connecticut for more
# information), ACS5 population growth data between previous
# years and 2022 appeared to be unavailable for that state. Therefore,
# I chose to retrieve data for 2021 instead.
acs5_latest_year = 2023 # This variable will be used to determine
# the ACS5 year for which a dataset containing only one year of data
# will be retrieved. Since this dataset won't show historical trends
# in data, we don't need to worry about changes in county definitions;
# thus, we can use the most recent year for which data are available.
download_new_variable_list = False # If set to True, a new list of
# variables will be downloaded from the Census API website. If False,
# this list of variables will instead be read in from a local .csv copy
# (thus saving processing time).
Importing a Census API Key#
You can obtain a free Census API key at https://api.census.gov/data/key_signup.html. The following cell imports my own personal key, so you’ll need to replace this code with code that loads your own API key.
with open('census_api_key_path.txt') as file:
    key_path = file.read()

with open(key_path) as file:
    key = file.read()
In order to get better acquainted with the Census API, you may want to review its documentation. For instance, you’ll probably find the Census Data API User Guide (https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf) to be helpful in applying the Census API.
The list of ACS5 API call examples at https://api.census.gov/data/2021/acs/acs5/examples.html is another great resource. One of the sample URLs shown on this page for retrieving county-level data appears as follows:
https://api.census.gov/data/2021/acs/acs5?get=NAME,B01001_001E&for=county:*&key=YOUR_KEY_GOES_HERE
If you replace the ‘YOUR_KEY_GOES_HERE’ component of the URL with your actual key, then enter this link into your web browser, you’ll receive a very long list of counties, population values, and state and county codes. The top of the list for the 2021 ACS5 looks like this:
[["NAME","B01001_001E","state","county"],
["Autauga County, Alabama","58239","01","001"],
["Baldwin County, Alabama","227131","01","003"],
["Barbour County, Alabama","25259","01","005"],
["Bibb County, Alabama","22412","01","007"],
["Blount County, Alabama","58884","01","009"],
["Bullock County, Alabama","10386","01","011"],
‘B01001_001E’ refers to the total population estimates for a given county. We can find this out by going to the 2021 ACS5’s Detailed Tables page (https://api.census.gov/data/2021/acs/acs5/variables.html) and navigating to the row with a ‘Name’ value of ‘B01001_001E’. This link, which may take a little while to fully load, is available on the ACS5 API Documentation Page (https://www.census.gov/data/developers/data-sets/acs-5year.html).
We can use pd.read_json()
to easily read this same data into a DataFrame:
df_results = pd.read_json(
f'https://api.census.gov/data/{acs5_year}/\
acs/acs5?get=NAME,B01001_001E&for=county:*&key={key}')
# read_json documentation:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/
# pandas.read_json.html
df_results.head()
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | NAME | B01001_001E | state | county |
1 | Autauga County, Alabama | 58239 | 01 | 001 |
2 | Baldwin County, Alabama | 227131 | 01 | 003 |
3 | Barbour County, Alabama | 25259 | 01 | 005 |
4 | Bibb County, Alabama | 22412 | 01 | 007 |
At this point, the DataFrame’s columns are [0, 1, 2, 3], whereas the columns we want to use are stored within the first row. The following code sets these row values as our column values, then deletes this row:
df_results.columns = df_results.iloc[0]
df_results.drop(0, inplace=True)
df_results.head()
NAME | B01001_001E | state | county | |
---|---|---|---|---|
1 | Autauga County, Alabama | 58239 | 01 | 001 |
2 | Baldwin County, Alabama | 227131 | 01 | 003 |
3 | Barbour County, Alabama | 25259 | 01 | 005 |
4 | Bibb County, Alabama | 22412 | 01 | 007 |
5 | Blount County, Alabama | 58884 | 01 | 009 |
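One caveat worth keeping in mind: as the quoted values in the JSON output above suggest, the Census API returns counts as strings, so numeric columns need an explicit conversion before any arithmetic. Here’s a minimal sketch using a couple of made-up rows in place of an actual API response:

```python
import pandas as pd

# Made-up rows standing in for an API response; note that the population
# values arrive as strings, just like in the JSON output shown above.
df_results = pd.DataFrame(
    {'NAME': ['Autauga County, Alabama', 'Baldwin County, Alabama'],
     'B01001_001E': ['58239', '227131'],
     'state': ['01', '01'],
     'county': ['001', '003']})

# Converting the population column to a numeric type before doing math:
df_results['B01001_001E'] = pd.to_numeric(df_results['B01001_001E'])
print(df_results['B01001_001E'].sum())  # 285370
```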
Importing custom Census functions#
Now that we’ve seen some basic applications of the Census API, we’ll import four Census data import and analysis functions that will help simplify some more complex procedures. I originally defined these functions within this script; however, I chose to move them to a separate Python file (‘census_import_scripts.py’) so that they can be incorporated more easily into other data projects. I highly recommend that you read the documentation for these functions as well so that you can better understand what they are doing (and how to modify them for your own applications if needed).
from census_import_scripts import download_variable_list, \
create_variable_aliases, retrieve_census_data, create_comparison_fields
Retrieving variable and group information#
In order to determine which variable codes to enter into our script, we’ll first need to review a list of all American Community Survey variables and the overall groups into which they fit. The variable lists for each ACS5 are available on the Census website; for instance, the copy for 2021 is available at https://api.census.gov/data/2021/acs/acs5/variables.html. However, we can also use the download_variable_list()
function we just imported to save this list (and its corresponding list of variable groups) to a local .csv file, as shown below.
if download_new_variable_list == True: # This process can take a little
    # while, so if you already have a copy of the variable list you need,
    # consider setting download_new_variable_list to False.
    download_variable_list(acs5_year, 'acs5')
# Reading the group and variable datasets into our script:
df_variables = pd.read_csv(
f'Datasets/acs5_{acs5_year}_variables.csv')
df_groups = pd.read_csv(
f'Datasets/acs5_{acs5_year}_groups.csv')
# The following code will show a condensed set of columns if the script
# is being run for PDF output; that way, the reader can view all of the
# most important columns. This step isn't necessary when the script is
# being run for HTML display, as the user can simply scroll horizontally
# to see all of the columns within the DataFrame.
df_variables[['Name', 'Label', 'Concept', 'Group']].head() if (
render_for_pdf == True) else df_variables.head()
Name | Label | Concept | Group | |
---|---|---|---|---|
0 | B01001_001E | Estimate!!Total: | SEX BY AGE | B01001 |
1 | B01001_002E | Estimate!!Total:!!Male: | SEX BY AGE | B01001 |
2 | B01001_003E | Estimate!!Total:!!Male:!!Under 5 years | SEX BY AGE | B01001 |
3 | B01001_004E | Estimate!!Total:!!Male:!!5 to 9 years | SEX BY AGE | B01001 |
4 | B01001_005E | Estimate!!Total:!!Male:!!10 to 14 years | SEX BY AGE | B01001 |
For reference, here’s a look at df_groups
: (this DataFrame is a simplified version of df_variables
that shows each group code and corresponding concept only once.)
df_groups.head()
Concept | Group | |
---|---|---|
0 | SEX BY AGE | B01001 |
1 | SEX BY AGE (WHITE ALONE) | B01001A |
2 | SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE) | B01001B |
3 | SEX BY AGE (AMERICAN INDIAN AND ALASKA NATIVE ... | B01001C |
4 | SEX BY AGE (ASIAN ALONE) | B01001D |
In order to find variables of interest, I recommend first searching for relevant keywords within the group table (which is much smaller) to identify matching group IDs. Next, you can search for those group IDs inside the variables table to find the exact metrics to request from the Census API.
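Here’s a sketch of that two-step search, with a few made-up rows standing in for the full df_groups and df_variables tables and ‘EDUCATION’ as a hypothetical keyword:

```python
import pandas as pd

# Toy stand-ins for the group and variable tables loaded earlier:
df_groups = pd.DataFrame(
    {'Concept': ['SEX BY AGE',
                 'EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER',
                 'MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS'],
     'Group': ['B01001', 'B15003', 'B19013']})
df_variables = pd.DataFrame(
    {'Name': ['B01001_001E', 'B15003_001E', 'B15003_022E'],
     'Label': ['Estimate!!Total:', 'Estimate!!Total:',
               "Estimate!!Total:!!Bachelor's degree"],
     'Group': ['B01001', 'B15003', 'B15003']})

# Step 1: search the (much smaller) group table for a keyword:
matching_groups = df_groups.loc[
    df_groups['Concept'].str.contains('EDUCATION', na=False),
    'Group'].tolist()
print(matching_groups)  # ['B15003']

# Step 2: look up those group IDs within the variable table:
print(df_variables.query("Group in @matching_groups")['Name'].tolist())
# ['B15003_001E', 'B15003_022E']
```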
The following list stores three variables: (1) the total population; (2) all males aged 25 to 29 years; and (3) all females aged 25 to 29 years. (The B01001 table that stores these values didn’t have an entry for all people aged 25 to 29; we’ll get around this limitation by retrieving sex-specific population totals within this age group, then adding those totals together.)
grad_destinations_variable_list = [
'B01001_001E', 'B01001_011E',
'B01001_035E']
Creating aliases and specifying retrieval years#
The demographic columns in the Census API’s output are labeled with their variable names (e.g. ‘B01001_001E’). These names are concise, but you’ll need a copy of the original variable list to interpret them. Therefore, I chose to replace these column names with a combination of the ‘Label’, ‘Concept’, and ‘Name’ entries in the original variable list. These column names are very long, but they do make the output easier to interpret (while also preserving the original names for reference).
In addition, if the description corresponding to a variable name happens to change from one year to another, the use of aliases will help you identify that change. (This will help prevent you from treating two different data types that happened to use the same variable code in different years as equal.)
The following function assists with this replacement by creating a dictionary whose keys are the original field names (e.g. ‘B01001_001E’) and whose values are the replacement names (e.g. ‘Sex by Age_Estimate!!Total:_B01001_001E’).
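The real implementation lives in ‘census_import_scripts.py’, but the core idea can be sketched roughly as follows (with a one-row stand-in for df_variables; the actual function may differ in its details):

```python
import pandas as pd

# A one-row stand-in for the df_variables table loaded earlier:
df_variables = pd.DataFrame({'Name': ['B01001_001E'],
                             'Label': ['Estimate!!Total:'],
                             'Concept': ['SEX BY AGE']})
variable_list = ['B01001_001E']

# Building a {variable code: descriptive alias} dictionary by combining
# each variable's Concept, Label, and Name entries:
df_selected = df_variables.query("Name in @variable_list")
alias_dict = {row['Name']: f"{row['Concept']}_{row['Label']} ({row['Name']})"
              for _, row in df_selected.iterrows()}
print(alias_dict)
# {'B01001_001E': 'SEX BY AGE_Estimate!!Total: (B01001_001E)'}
```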
Creating our aliases:
grad_destinations_alias_dict = create_variable_aliases(
df_variables=df_variables,
variable_list=grad_destinations_variable_list)
grad_destinations_alias_dict
{'B01001_001E': 'SEX BY AGE_Estimate!!Total: (B01001_001E)',
'B01001_011E': 'SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)',
'B01001_035E': 'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'}
Next, we’ll define a list of years for which we would like to retrieve Census data. In order to make this code easier to use in future years, I’ll define these years as an offset of acs5_year
rather than hardcoding them.
years_to_retrieve = [acs5_year - 12, acs5_year -10,
acs5_year - 8,
acs5_year - 6,
acs5_year - 5,
acs5_year]
# American Community Survey 1-year estimates weren't available in 2020,
# so you'll want to remove that year from your list if it happens to be
# present. You can do so via the following code:
# if 2020 in years_to_retrieve:
# years_to_retrieve.remove(2020)
# However, because I decided to use 5-year rather than 1-year estimates,
# I commented out this line.
years_to_retrieve
[2009, 2011, 2013, 2015, 2016, 2021]
At this point, it’s a good idea to confirm that our three variable codes (‘B01001_001E’, ‘B01001_011E’, and ‘B01001_035E’) had the same meaning for all the years whose data we’ll be retrieving. We can do so by running the following code, which retrieves these variables and their corresponding descriptions for all of the years in years_to_retrieve.
(Due to the size of the variables.html page, this code can take a while to run, so I commented it out below.)
# var_meanings_by_year_df_list = []
# for year in years_to_retrieve:
# df_var_list = pd.read_html(
# f'https://api.census.gov/data/{year}/acs/acs5/variables.html')[
# 0][['Name', 'Label', 'Concept']].query(
# "Name in @grad_destinations_variable_list")
# df_var_list.insert(0, 'Year', year)
# var_meanings_by_year_df_list.append(df_var_list)
# df_var_meanings_by_year = pd.concat(
# [df for df in var_meanings_by_year_df_list])
# df_var_meanings_by_year.to_csv('var_meanings_by_year.csv', index=False)
# df_var_meanings_by_year
Here’s a series of outputs from the saved .csv copy of this table that confirms that these codes had the same meaning in each of the years we’re analyzing:
df_var_meanings_by_year = pd.read_csv(
'var_meanings_by_year.csv')
for name in grad_destinations_variable_list:
    print(df_var_meanings_by_year.query("Name == @name"))
Year Name Label Concept
0 2009 B01001_001E Estimate!!Total SEX BY AGE
3 2011 B01001_001E Estimate!!Total SEX BY AGE
6 2013 B01001_001E Estimate!!Total SEX BY AGE
9 2015 B01001_001E Estimate!!Total SEX BY AGE
12 2016 B01001_001E Estimate!!Total SEX BY AGE
15 2021 B01001_001E Estimate!!Total: SEX BY AGE
Year Name Label Concept
1 2009 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
4 2011 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
7 2013 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
10 2015 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
13 2016 B01001_011E Estimate!!Total!!Male!!25 to 29 years SEX BY AGE
16 2021 B01001_011E Estimate!!Total:!!Male:!!25 to 29 years SEX BY AGE
Year Name Label Concept
2 2009 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
5 2011 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
8 2013 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
11 2015 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
14 2016 B01001_035E Estimate!!Total!!Female!!25 to 29 years SEX BY AGE
17 2021 B01001_035E Estimate!!Total:!!Female:!!25 to 29 years SEX BY AGE
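This eyeball check can also be automated. Below is a sketch (using a trimmed-down, made-up stand-in for df_var_meanings_by_year) that treats labels as equivalent once colons are stripped, since the 2021 labels shown above differ from earlier years’ labels only in their punctuation:

```python
import pandas as pd

# A trimmed-down stand-in for the df_var_meanings_by_year table:
df_var_meanings_by_year = pd.DataFrame(
    {'Year': [2009, 2021, 2009, 2021],
     'Name': ['B01001_001E', 'B01001_001E', 'B01001_011E', 'B01001_011E'],
     'Label': ['Estimate!!Total', 'Estimate!!Total:',
               'Estimate!!Total!!Male!!25 to 29 years',
               'Estimate!!Total:!!Male:!!25 to 29 years']})

# Stripping colons before comparing, since 2021 labels differ from earlier
# ones only in punctuation:
normalized = df_var_meanings_by_year['Label'].str.replace(
    ':', '', regex=False)
# Each variable code should map to exactly one normalized label:
labels_per_name = normalized.groupby(
    df_var_meanings_by_year['Name']).nunique()
print((labels_per_name == 1).all())  # True
```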
Calling retrieve_census_data()
#
The retrieve_census_data()
function defined within ‘census_import_scripts.py’ simplifies the process of retrieving data from the Census API. It also enables the user to rename variable fields (e.g. ‘B01001_001E’) with aliases for those fields (e.g. ‘Sex by Age_Estimate!!Total: (B01001_001E)’) if desired. In addition, it allows more than 50 variables to be retrieved at the same time, thus simplifying the process of importing especially large datasets.
We’re now ready to retrieve our population totals for the years referenced in years_to_retrieve
. We’ll do so by calling retrieve_census_data()
for each of these years via a for loop, then adding their respective DataFrames together using pd.concat()
.
(Note: Only 5825 results showed up when I requested ACS1 county-level data (as opposed to over 25,000 results for ACS5 data), so many counties were not getting incorporated within the ACS1 results. Therefore, the ACS5 will usually be the better of these two surveys for evaluating county-level growth.)
census_data_by_year_df_list = []
for year in years_to_retrieve:
    df_data = retrieve_census_data(
        survey='acs5', year=year,
        region='county',
        variable_list=grad_destinations_variable_list,
        rename_data_fields=True,
        field_vars_dict=grad_destinations_alias_dict, key=key)
    census_data_by_year_df_list.append(df_data)
df_growth_data_by_year = pd.concat(df for df in census_data_by_year_df_list)
# Removing Puerto Rico from our list of results so as to focus only on counties
# and county equivalents within the 50 US states and DC:
df_growth_data_by_year.query("state != '72'", inplace=True)
df_growth_data_by_year
Year | state | ... | SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E) | SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E) | |
---|---|---|---|---|---|
1 | 2009 | 13 | ... | 634 | 574 |
2 | 2009 | 13 | ... | 531 | 352 |
3 | 2009 | 13 | ... | 3264 | 3900 |
... | ... | ... | ... | ... | ... |
3141 | 2021 | 56 | ... | 574 | 567 |
3142 | 2021 | 56 | ... | 175 | 200 |
3143 | 2021 | 56 | ... | 146 | 112 |
18856 rows × 7 columns
Analyzing this data#
The following cell adds together male and female population totals in order to calculate the total number of 25- to 29-year-olds within each county for each year. It also simplifies the original total population column name in order to improve its readability.
df_growth_data_by_year['Total_Pop_25_to_29'] = (df_growth_data_by_year[
'SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)'] +
df_growth_data_by_year[
'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'])
df_growth_data_by_year.rename(
columns={'SEX BY AGE_Estimate!!Total: (B01001_001E)':'Total_Pop'},
inplace=True)
df_growth_data_by_year.drop(
['SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)',
'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'],
axis=1, inplace=True)
df_growth_data_by_year
Year | state | ... | Total_Pop | Total_Pop_25_to_29 | |
---|---|---|---|---|---|
1 | 2009 | 13 | ... | 19695 | 1208 |
2 | 2009 | 13 | ... | 11641 | 883 |
3 | 2009 | 13 | ... | 95330 | 7164 |
... | ... | ... | ... | ... | ... |
3141 | 2021 | 56 | ... | 20514 | 1141 |
3142 | 2021 | 56 | ... | 7768 | 375 |
3143 | 2021 | 56 | ... | 6891 | 258 |
18856 rows × 6 columns
Next, we’ll apply pd.pivot()
to place all population totals for a given county within the same row, thus making future growth calculations easier.
df_growth_data_by_year_pivot = df_growth_data_by_year.copy().pivot(
columns='Year', index=['NAME', 'county', 'state']).reset_index()
# The values could be named explicitly, but since pivot() will infer them
# automatically, there's no need to do so. I thus removed the following
# code from the pivot() call:
# values=['Total_Pop',
# 'Total_Pop_25_to_29']
df_growth_data_by_year_pivot.head()
NAME | county | ... | Total_Pop_25_to_29 | ||
---|---|---|---|---|---|
Year | ... | 2016 | 2021 | ||
0 | Abbeville County, South Carolina | 001 | ... | 1226.0 | 1292.0 |
1 | Acadia Parish, Louisiana | 001 | ... | 4240.0 | 3705.0 |
2 | Accomack County, Virginia | 001 | ... | 1819.0 | 1907.0 |
3 | Ada County, Idaho | 001 | ... | 30646.0 | 33625.0 |
4 | Adair County, Iowa | 001 | ... | 362.0 | 411.0 |
5 rows × 15 columns
We’ll now call to_flat_index()
to collapse these two-level column headers into a single level of tuple-based column names:
df_growth_data_by_year_pivot.columns = (
df_growth_data_by_year_pivot.columns.to_flat_index())
df_growth_data_by_year_pivot.head()
(NAME, ) | (county, ) | ... | (Total_Pop_25_to_29, 2016) | (Total_Pop_25_to_29, 2021) | |
---|---|---|---|---|---|
0 | Abbeville County, South Carolina | 001 | ... | 1226.0 | 1292.0 |
1 | Acadia Parish, Louisiana | 001 | ... | 4240.0 | 3705.0 |
2 | Accomack County, Virginia | 001 | ... | 1819.0 | 1907.0 |
3 | Ada County, Idaho | 001 | ... | 30646.0 | 33625.0 |
4 | Adair County, Iowa | 001 | ... | 362.0 | 411.0 |
5 rows × 15 columns
Next, we’ll convert the tuple-based columns created by to_flat_index()
to string-based ones by executing a list comprehension. This list comprehension will convert tuples that contain both a string and an integer (e.g. (‘Total_Pop’, 2011)) to a single string with an underscore separating the two elements. Tuples with an empty second entry (e.g. (‘county’, ‘’)) will get replaced with just the first entry (e.g. ‘county’).
df_growth_data_by_year_pivot.columns = [
column[0] + '_' + str(column[1]) if isinstance(column[1], int)
else column[0] for column in df_growth_data_by_year_pivot.columns]
df_growth_data_by_year_pivot.head()
NAME | county | ... | Total_Pop_25_to_29_2016 | Total_Pop_25_to_29_2021 | |
---|---|---|---|---|---|
0 | Abbeville County, South Carolina | 001 | ... | 1226.0 | 1292.0 |
1 | Acadia Parish, Louisiana | 001 | ... | 4240.0 | 3705.0 |
2 | Accomack County, Virginia | 001 | ... | 1819.0 | 1907.0 |
3 | Ada County, Idaho | 001 | ... | 30646.0 | 33625.0 |
4 | Adair County, Iowa | 001 | ... | 362.0 | 411.0 |
5 rows × 15 columns
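Here’s a self-contained toy example of the pivot, flatten, and rename sequence above that you can run without any Census data (the county codes and population values below are made up):

```python
import pandas as pd

# A toy long-format table with one metric ('Total_Pop') and one year:
df = pd.DataFrame({'county': ['001', '003'],
                   'Total_Pop': [100, 200],
                   'Year': [2011, 2011]})
df_pivot = df.pivot(index='county', columns='Year').reset_index()
# Flattening the two-level column headers into tuples:
df_pivot.columns = df_pivot.columns.to_flat_index()
print(list(df_pivot.columns))

# The same tuple-to-string conversion used above:
df_pivot.columns = [
    column[0] + '_' + str(column[1]) if isinstance(column[1], int)
    else column[0] for column in df_pivot.columns]
print(list(df_pivot.columns))  # ['county', 'Total_Pop_2011']
```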
Adding in comparison fields#
Now that we have population data for several years in a relatively easy-to-parse format, we can call create_comparison_fields()
to calculate nominal and percentage changes in population between certain years, along with corresponding percentile and rank information. In the following cell, this function will be called twice to create growth metrics for both total county populations and residents in the 25-29 age range.
for field_var in ['Total_Pop', 'Total_Pop_25_to_29']:
    create_comparison_fields(
        df=df_growth_data_by_year_pivot,
        field_var=field_var,
        year_list=years_to_retrieve,
        field_year_separator='_')
df_growth_data_by_year_pivot.head(5)
NAME | county | ... | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile | |
---|---|---|---|---|---|
0 | Abbeville County, South Carolina | 001 | ... | 1357.0 | 56.829035 |
1 | Acadia Parish, Louisiana | 001 | ... | 2812.0 | 10.506208 |
2 | Accomack County, Virginia | 001 | ... | 1426.0 | 54.632283 |
3 | Ada County, Idaho | 001 | ... | 869.0 | 72.365489 |
4 | Adair County, Iowa | 001 | ... | 588.0 | 81.311684 |
5 rows × 75 columns
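For reference, here’s a rough sketch of the kind of metrics that create_comparison_fields() adds. (The real implementation lives in ‘census_import_scripts.py’ and may differ in its details; the county names and population values below are made up, and the column names follow the patterns shown in this notebook’s output.)

```python
import pandas as pd

# Toy pivoted data with one population column per year:
df = pd.DataFrame({'NAME': ['County A', 'County B', 'County C'],
                   'Total_Pop_2016': [1000, 2000, 4000],
                   'Total_Pop_2021': [1100, 1900, 4600]})

df['2016-2021 Total_Pop Change'] = (
    df['Total_Pop_2021'] - df['Total_Pop_2016'])
df['2016-2021 Total_Pop % Change'] = (
    100 * df['2016-2021 Total_Pop Change'] / df['Total_Pop_2016'])
# Rank 1 = the highest growth rate (matching this notebook's output, in
# which lower rank values correspond to faster-growing counties):
df['2016-2021 Total_Pop % Change Rank'] = df[
    '2016-2021 Total_Pop % Change'].rank(ascending=False)
# Percentiles closer to 100 indicate faster growth:
df['2016-2021 Total_Pop % Change Percentile'] = (
    df['2016-2021 Total_Pop % Change'].rank(pct=True) * 100)
print(df['2016-2021 Total_Pop % Change Rank'].tolist())  # [2.0, 3.0, 1.0]
```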
Ranking counties by population growth#
We’ll now sort this dataset to answer a few questions:
Which counties had the highest (and lowest) total population growth rates between 2016 and 2021? (We’ll evaluate this question for both counties with at least 100,000 residents in 2016 and those with at least one million residents).
Which counties experienced the highest (and lowest) 25-to-29-year-old population growth rates?
First, we’ll specify a few variables that will be used within our analyses:
range_for_sorting = 5
range_start_year = acs5_year - range_for_sorting
total_pop_var_for_sorting = 'Total_Pop'
total_pop_sort_column = f'{range_start_year}-{acs5_year} \
{total_pop_var_for_sorting} % Change Rank'
young_pop_var_for_sorting = 'Total_Pop_25_to_29'
young_pop_sort_column = f'{range_start_year}-{acs5_year} \
{young_pop_var_for_sorting} % Change Rank'
total_pop_col_root = f'{range_start_year}-{acs5_year} \
{total_pop_var_for_sorting}' # The 'root' from which other column entries
# will derive
young_pop_col_root = f'{range_start_year}-{acs5_year} \
{young_pop_var_for_sorting}'
The following function will help condense the output of our analyses, thus making them more readable.
def gen_display_cols(col_root):
    '''This function specifies which DataFrame columns to display within a
    given output. It does so by adding various suffixes to a col_root
    value (e.g. '2016-2021 Total_Pop').'''
    return ['NAME', f'{col_root} Change', f'{col_root} % Change',
            f'{col_root} % Change Rank', f'{col_root} % Change Percentile']
total_pop_display_cols = gen_display_cols(total_pop_col_root)
young_pop_display_cols = gen_display_cols(young_pop_col_root)
total_pop_display_cols
['NAME',
'2016-2021 Total_Pop Change',
'2016-2021 Total_Pop % Change',
'2016-2021 Total_Pop % Change Rank',
'2016-2021 Total_Pop % Change Percentile']
Sorting counties with at least 100K residents in 2016 by their 2016-2021 total population growth rates:#
total_pop_sort_column # The following two sets of output will be
# sorted by percentage change (rather than nominal change) in the total
# population between 2016 and 2021.
'2016-2021 Total_Pop % Change Rank'
df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 100000").sort_values(
total_pop_sort_column).dropna(subset=total_pop_sort_column)[
total_pop_display_cols].reset_index(drop=True)
NAME | 2016-2021 Total_Pop Change | 2016-2021 Total_Pop % Change | 2016-2021 Total_Pop % Change Rank | 2016-2021 Total_Pop % Change Percentile | |
---|---|---|---|---|---|
0 | Hays County, Texas | 48887.0 | 26.327779 | 10.0 | 99.713467 |
1 | Comal County, Texas | 32023.0 | 25.776358 | 12.0 | 99.649793 |
2 | Kaufman County, Texas | 28315.0 | 25.319682 | 13.0 | 99.617956 |
... | ... | ... | ... | ... | ... |
588 | Wayne County, North Carolina | -6755.0 | -5.428014 | 2724.0 | 13.307864 |
589 | Hinds County, Mississippi | -13824.0 | -5.651769 | 2746.0 | 12.607450 |
590 | Robeson County, North Carolina | -15740.0 | -11.695993 | 3056.0 | 2.737982 |
591 rows × 5 columns
The three counties with the highest 2016-2021 growth rates were all in Texas. Hays, Comal, and Kaufman Counties are located within the Austin, San Antonio, and DFW metropolitan areas, respectively (according to Wikipedia).
Sorting counties with at least one million residents in 2016 by their 2016-2021 total population growth rates:#
Combining concat()
with head()
and tail()
will allow us to display the counties with the highest and lowest growth rates within the same DataFrame.
df_1m_pop_pct_changes = df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 1000000").sort_values(
total_pop_sort_column).dropna(
subset=total_pop_sort_column).copy().reset_index()
pd.concat([df_1m_pop_pct_changes.head(), df_1m_pop_pct_changes.tail()])[
total_pop_display_cols]
NAME | 2016-2021 Total_Pop Change | 2016-2021 Total_Pop % Change | 2016-2021 Total_Pop % Change Rank | 2016-2021 Total_Pop % Change Percentile | |
---|---|---|---|---|---|
0 | Orange County, Florida | 153894.0 | 12.252170 | 94.0 | 97.039160 |
1 | Travis County, Texas | 119619.0 | 10.418176 | 140.0 | 95.574658 |
2 | Hillsborough County, Florida | 121300.0 | 9.168147 | 180.0 | 94.301178 |
... | ... | ... | ... | ... | ... |
40 | Cuyahoga County, Ohio | 4957.0 | 0.393816 | 1468.0 | 53.295129 |
41 | St. Louis County, Missouri | 1422.0 | 0.142120 | 1543.0 | 50.907354 |
42 | Los Angeles County, California | -37520.0 | -0.373068 | 1687.0 | 46.322827 |
10 rows × 5 columns
Repeating these steps in order to compare 25-to-29-year-old population growth rates#
df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 100000").sort_values(
young_pop_sort_column).dropna(subset=young_pop_sort_column)[
young_pop_display_cols].reset_index(drop=True)
NAME | 2016-2021 Total_Pop_25_to_29 Change | 2016-2021 Total_Pop_25_to_29 % Change | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile | |
---|---|---|---|---|---|
0 | Douglas County, Colorado | 4368.0 | 31.155492 | 148.0 | 95.319962 |
1 | Comal County, Texas | 2035.0 | 31.116208 | 150.0 | 95.256288 |
2 | Kaufman County, Texas | 2062.0 | 30.602553 | 152.0 | 95.192614 |
... | ... | ... | ... | ... | ... |
588 | St. Lawrence County, New York | -745.0 | -10.792409 | 2755.0 | 12.320917 |
589 | Caddo Parish, Louisiana | -2048.0 | -11.088852 | 2765.0 | 12.002547 |
590 | Orleans Parish, Louisiana | -6371.0 | -16.970327 | 2898.0 | 7.768227 |
591 rows × 5 columns
Comal and Kaufman counties ranked #2 and #3, respectively, among counties with 100K+ residents for both total population growth and 25-29 year-old population growth between 2016 and 2021.
df_1m_pop_pct_changes = df_growth_data_by_year_pivot.query(
f"Total_Pop_{range_start_year} >= 1000000").sort_values(
young_pop_sort_column).dropna(
subset=young_pop_sort_column).copy().reset_index()
pd.concat([df_1m_pop_pct_changes.head(), df_1m_pop_pct_changes.tail()])[
young_pop_display_cols]
 | NAME | 2016-2021 Total_Pop_25_to_29 Change | 2016-2021 Total_Pop_25_to_29 % Change | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile |
---|---|---|---|---|---|
0 | Pima County, Arizona | 9395.0 | 15.212850 | 490.0 | 84.431710 |
1 | Wayne County, Michigan | 16073.0 | 13.496289 | 590.0 | 81.248010 |
2 | King County, Washington | 22087.0 | 12.478743 | 643.0 | 79.560649 |
... | ... | ... | ... | ... | ... |
40 | Miami-Dade County, Florida | -11062.0 | -5.717180 | 2504.0 | 20.312003 |
41 | Montgomery County, Maryland | -4043.0 | -6.120379 | 2521.0 | 19.770774 |
42 | New York County, New York | -14977.0 | -7.548168 | 2608.0 | 17.000955 |
10 rows × 5 columns
Although it’s interesting to see which counties had the very highest and lowest growth rates, it’s hard to gain an understanding of nationwide population trends via these tables alone. We can get a more comprehensive view of such trends via choropleth maps, which will allow us to present all counties within an easy-to-interpret graphic. We’ll create examples of these maps within the Mapping section of Python for Nonprofits.
To allow our mapping code to access this data, we’ll first save it to a .csv file:
df_growth_data_by_year_pivot.to_csv(
    'Datasets/grad_destinations_acs_county_data.csv',
    index=False)
Retrieving state-level population growth data#
Although county-level population growth data provides a fascinating look at within-state trends, the broader view that state-level data offers will also prove useful. Therefore, we’ll now repurpose the functions, variables, and code that we used earlier to import county-level data in order to create a state-level equivalent of df_growth_data_by_year.
census_data_by_year_df_list = []
for year in years_to_retrieve:
    df_data = retrieve_census_data(
        survey='acs5', year=year,
        region='state',
        variable_list=grad_destinations_variable_list,
        rename_data_fields=True, field_vars_dict=grad_destinations_alias_dict,
        key=key)
    census_data_by_year_df_list.append(df_data)
# Combining each year's results into a single DataFrame, then
# removing Puerto Rico (state FIPS code 72):
df_state_growth_data_by_year = pd.concat(census_data_by_year_df_list)
df_state_growth_data_by_year.query("state != '72'", inplace=True)
df_state_growth_data_by_year
 | Year | state | ... | SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E) | SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E) |
---|---|---|---|---|---|
1 | 2009 | 01 | ... | 155174 | 157476 |
2 | 2009 | 02 | ... | 28543 | 24663 |
3 | 2009 | 04 | ... | 251627 | 228420 |
... | ... | ... | ... | ... | ... |
49 | 2021 | 54 | ... | 55187 | 52671 |
50 | 2021 | 55 | ... | 189562 | 180360 |
51 | 2021 | 56 | ... | 19511 | 17813 |
306 rows × 6 columns
df_state_growth_data_by_year['Total_Pop_25_to_29'] = (
    df_state_growth_data_by_year[
        'SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)'] +
    df_state_growth_data_by_year[
        'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'])
df_state_growth_data_by_year.rename(
    columns={'SEX BY AGE_Estimate!!Total: (B01001_001E)': 'Total_Pop'},
    inplace=True)
df_state_growth_data_by_year.drop(
    ['SEX BY AGE_Estimate!!Total:!!Male:!!25 to 29 years (B01001_011E)',
     'SEX BY AGE_Estimate!!Total:!!Female:!!25 to 29 years (B01001_035E)'],
    axis=1, inplace=True)
df_state_growth_data_by_year
 | Year | state | NAME | Total_Pop | Total_Pop_25_to_29 |
---|---|---|---|---|---|
1 | 2009 | 01 | Alabama | 4633360 | 312650 |
2 | 2009 | 02 | Alaska | 683142 | 53206 |
3 | 2009 | 04 | Arizona | 6324865 | 480047 |
... | ... | ... | ... | ... | ... |
49 | 2021 | 54 | West Virginia | 1801049 | 107858 |
50 | 2021 | 55 | Wisconsin | 5871661 | 369922 |
51 | 2021 | 56 | Wyoming | 576641 | 37324 |
306 rows × 5 columns
df_state_growth_data_by_year_pivot = (
    df_state_growth_data_by_year.copy().pivot(
        columns='Year', index=['NAME', 'state']).reset_index())
df_state_growth_data_by_year_pivot.columns = (
    df_state_growth_data_by_year_pivot.columns.to_flat_index())
df_state_growth_data_by_year_pivot.columns = [
    column[0] + '_' + str(column[1]) if isinstance(column[1], int)
    else column[0] for column
    in df_state_growth_data_by_year_pivot.columns]
df_state_growth_data_by_year_pivot.head()
 | NAME | state | ... | Total_Pop_25_to_29_2016 | Total_Pop_25_to_29_2021 |
---|---|---|---|---|---|
0 | Alabama | 01 | ... | 319177 | 331435 |
1 | Alaska | 02 | ... | 61185 | 59218 |
2 | Arizona | 04 | ... | 456680 | 496623 |
3 | Arkansas | 05 | ... | 194179 | 197068 |
4 | California | 06 | ... | 2918435 | 2992210 |
5 rows × 14 columns
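The pivot-then-flatten pattern shown above can be illustrated with a toy example. (The table below uses made-up population figures and a hypothetical df_long name; the f-string-based flattening step is a slight variation on the list comprehension used in the code above.)

```python
import pandas as pd

# A toy long-format table resembling df_state_growth_data_by_year
# (population figures are invented for illustration):
df_long = pd.DataFrame({
    'NAME': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'Year': [2016, 2021, 2016, 2021],
    'Total_Pop': [4860000, 5040000, 741000, 733000]})

# Pivoting yields one row per state and a MultiIndex column
# for each (field, year) pair:
df_pivot = df_long.pivot(index='NAME', columns='Year').reset_index()

# Flattening the MultiIndex produces names like 'Total_Pop_2016';
# the 'NAME' column (whose second tuple element is blank) keeps
# its original single-level name:
df_pivot.columns = [
    f'{field}_{year}' if year != '' else field
    for field, year in df_pivot.columns.to_flat_index()]

print(df_pivot.columns.tolist())
# ['NAME', 'Total_Pop_2016', 'Total_Pop_2021']
```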
for field_var in ['Total_Pop', 'Total_Pop_25_to_29']:
    create_comparison_fields(
        df=df_state_growth_data_by_year_pivot,
        field_var=field_var,
        year_list=years_to_retrieve,
        field_year_separator='_')
Saving this dataset to a .csv file so that it too can be used as a basis for state-level choropleth maps:
df_state_growth_data_by_year_pivot.to_csv(
    'Datasets/grad_destinations_acs_state_data.csv',
    index=False)
Creating a DataFrame that shows the 5 states with the highest and lowest total population growth rates during the latest 5 years in the dataset:
df_state_growth_data_by_year_pivot.sort_values(
    total_pop_sort_column, inplace=True)
df_state_growth_data_by_year_pivot.reset_index(drop=True, inplace=True)
pd.concat(
    [df_state_growth_data_by_year_pivot.head(5),
     df_state_growth_data_by_year_pivot.tail(5)])[
    total_pop_display_cols]
 | NAME | 2016-2021 Total_Pop Change | 2016-2021 Total_Pop % Change | 2016-2021 Total_Pop % Change Rank | 2016-2021 Total_Pop % Change Percentile |
---|---|---|---|---|---|
0 | Idaho | 176134 | 10.769540 | 1.0 | 100.000000 |
1 | Utah | 282943 | 9.596405 | 2.0 | 98.039216 |
2 | Nevada | 220066 | 7.751063 | 3.0 | 96.078431 |
... | ... | ... | ... | ... | ... |
48 | Mississippi | -22169 | -0.741639 | 49.0 | 5.882353 |
49 | Wyoming | -6388 | -1.095657 | 50.0 | 3.921569 |
50 | West Virginia | -45043 | -2.439911 | 51.0 | 1.960784 |
10 rows × 5 columns
Creating a similar DataFrame that displays 25-to-29-year-old population growth rates:
df_state_growth_data_by_year_pivot.sort_values(
    young_pop_sort_column, inplace=True)
df_state_growth_data_by_year_pivot.reset_index(drop=True, inplace=True)
pd.concat(
    [df_state_growth_data_by_year_pivot.head(5),
     df_state_growth_data_by_year_pivot.tail(5)])[
    young_pop_display_cols]
 | NAME | 2016-2021 Total_Pop_25_to_29 Change | 2016-2021 Total_Pop_25_to_29 % Change | 2016-2021 Total_Pop_25_to_29 % Change Rank | 2016-2021 Total_Pop_25_to_29 % Change Percentile |
---|---|---|---|---|---|
0 | Utah | 29737 | 13.733051 | 1.0 | 100.000000 |
1 | Washington | 58422 | 11.363760 | 2.0 | 98.039216 |
2 | Tennessee | 46477 | 10.661207 | 3.0 | 96.078431 |
... | ... | ... | ... | ... | ... |
48 | Alaska | -1967 | -3.214840 | 49.0 | 5.882353 |
49 | Louisiana | -15558 | -4.581151 | 50.0 | 3.921569 |
50 | Wyoming | -3359 | -8.256520 | 51.0 | 1.960784 |
10 rows × 5 columns
Retrieving data for our education/income regression analyses#
The following code creates tables of populations, educational attainment levels, median earnings, and regions for each US county and state. This data can then be incorporated into regression analyses later in Python for Nonprofits.
Importing US Census data on educational attainment and median earnings#
# Retrieving a new list of ACS5 variables for the most recent year with
# ACS5 data (as of February 2025):
if download_new_variable_list:
    download_variable_list(acs5_latest_year, 'acs5')
# Reading this dataset into our script:
df_variables_latest = pd.read_csv(
    f'Datasets/acs5_{acs5_latest_year}_variables.csv')
regression_variable_list = [
    'B01001_001E', 'B15003_001E', 'B15003_022E',
    'B15003_023E', 'B15003_024E', 'B15003_025E', 'B20004_001E',
    'B20004_002E', 'B20004_003E', 'B20004_004E', 'B20004_005E',
    'B20004_006E']
regression_alias_dict = create_variable_aliases(
    df_variables=df_variables_latest,
    variable_list=regression_variable_list)
df_regression_data_county = retrieve_census_data(
    survey='acs5', year=acs5_latest_year,
    region='county',
    variable_list=regression_variable_list,
    rename_data_fields=True, field_vars_dict=regression_alias_dict,
    key=key)
df_regression_data_state = retrieve_census_data(
    survey='acs5', year=acs5_latest_year,
    region='state',
    variable_list=regression_variable_list,
    rename_data_fields=True, field_vars_dict=regression_alias_dict,
    key=key)
# Retrieving national-level data (which will be useful for another
# data analysis that I'll include within the regressions section
# of Python for Nonprofits):
df_regression_data_national = retrieve_census_data(
    survey='acs5', year=acs5_latest_year,
    region='us',
    variable_list=regression_variable_list,
    rename_data_fields=True, field_vars_dict=regression_alias_dict,
    key=key)
df_regression_data_state
 | Year | state | ... | Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted Dollars) by Sex by Educational Attainment for the Population 25 Years and Over_Estimate!!Total:!!Bachelor's degree (B20004_005E) | Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted Dollars) by Sex by Educational Attainment for the Population 25 Years and Over_Estimate!!Total:!!Graduate or professional degree (B20004_006E) |
---|---|---|---|---|---|
1 | 2023 | 01 | ... | 59318 | 72667 |
2 | 2023 | 02 | ... | 68158 | 90606 |
3 | 2023 | 04 | ... | 65087 | 82033 |
... | ... | ... | ... | ... | ... |
50 | 2023 | 55 | ... | 64443 | 78648 |
51 | 2023 | 56 | ... | 58411 | 73535 |
52 | 2023 | 72 | ... | 28988 | 41197 |
52 rows × 15 columns
The following code calculates the percentage of residents within a given row who have at least a bachelor’s degree; it also dramatically condenses many column names. (I wrapped these steps in a function so that the same code could be used to modify our state-level, county-level, and national datasets.)
def rename_and_calc_dataset(df, year):
    '''This function renames certain median earnings columns within
    a DataFrame of census results and also estimates the percentage of
    respondents within each row who have at least a bachelor's degree.

    df: the DataFrame containing census results to process.
    year: the year in which the ACS was conducted. (This year shows
    up in many median earnings columns.)'''
    df.rename(columns={
        'Median Earnings in the \
Past 12 Months (in 2023 Inflation-Adjusted Dollars) by Sex by Educational \
Attainment for the Population 25 Years and Over_Estimate!!\
Total:!!Less than high school graduate (B20004_002E)':
        'Median_Earnings_Less_Than_HS',
        'Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted \
Dollars) by Sex by Educational Attainment for the Population 25 Years \
and Over_Estimate!!Total:!!High school graduate \
(includes equivalency) (B20004_003E)': 'Median_Earnings_HS',
        "Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted \
Dollars) by Sex by Educational Attainment for the Population 25 Years \
and Over_Estimate!!Total:!!Some college or associate's \
degree (B20004_004E)": "Median_Earnings_Some_College",
        "Median Earnings in the Past 12 Months (in 2023 Inflation-Adjusted \
Dollars) by Sex by Educational Attainment for the Population 25 \
Years and Over_Estimate!!Total:!!Bachelor's degree (B20004_005E)":
        "Median_Earnings_Bachelors_Degree",
        "Median Earnings in the Past 12 Months (in 2023 \
Inflation-Adjusted Dollars) by Sex by Educational Attainment \
for the Population 25 Years and Over_Estimate!!Total:!!Graduate \
or professional degree (B20004_006E)": "Median_Earnings_Postgraduate",
        "Median Earnings in the Past 12 Months (in 2023 \
Inflation-Adjusted Dollars) by Sex by Educational Attainment \
for the Population 25 Years and Over_Estimate!!Total: (B20004_001E)":
        "Median_Earnings_for_Total_25plus_Population",
        'Sex by Age_Estimate!!Total: (B01001_001E)': 'Total_Population',
    }, inplace=True)
    df["Pct_With_Bachelors_Degree"] = (100 * (
        df["Educational Attainment for the Population \
25 Years and Over_Estimate!!Total:!!Bachelor's degree (B15003_022E)"] +
        df["Educational Attainment for the Population 25 \
Years and Over_Estimate!!Total:!!Master's degree (B15003_023E)"] +
        df["Educational Attainment for the Population \
25 Years and Over_Estimate!!Total:!!Professional school degree \
(B15003_024E)"] +
        df["Educational Attainment for the Population 25 \
Years and Over_Estimate!!Total:!!Doctorate degree (B15003_025E)"])
        / df["Educational Attainment for the \
Population 25 Years and Over_Estimate!!Total: (B15003_001E)"])
rename_and_calc_dataset(df=df_regression_data_county,
                        year=acs5_latest_year)
rename_and_calc_dataset(df=df_regression_data_state,
                        year=acs5_latest_year)
rename_and_calc_dataset(df=df_regression_data_national,
                        year=acs5_latest_year)
df_regression_data_state
 | Year | state | ... | Median_Earnings_Postgraduate | Pct_With_Bachelors_Degree |
---|---|---|---|---|---|
1 | 2023 | 01 | ... | 72667 | 27.755139 |
2 | 2023 | 02 | ... | 90606 | 31.242276 |
3 | 2023 | 04 | ... | 82033 | 32.610332 |
... | ... | ... | ... | ... | ... |
50 | 2023 | 55 | ... | 78648 | 32.758344 |
51 | 2023 | 56 | ... | 73535 | 29.920304 |
52 | 2023 | 72 | ... | 41197 | 29.130778 |
52 rows × 16 columns
Mapping each county and state to a geographical region#
As part of our regression analysis, we’ll want to determine whether the relationship between the prevalence of bachelor’s degrees and median earnings is influenced by the region in which a particular county or state is located. I’ll first merge in this data for our county-level dataset; next, I’ll do the same for the state-level table.
In order to add region information to our county-level dataset, we’ll first need to determine the state corresponding to each county; we can then map these states to their respective regions. Our dataset already contains numerical values for each state in the form of FIPS codes; however, to make sense of these codes, we’ll also need to import a table that shows the actual state name for each code. Such a table is available on the Census website at https://www2.census.gov/geo/docs/reference/codes2020/national_state2020.txt .
I copied and pasted the contents of this table into state_fips_codes_from_census.txt, which can be found in the same folder as this script. The following code reads this table into a DataFrame. Note that (1) you can apply read_csv() to more than just .csv files and that (2) because the entries in this table are separated by the ‘|’ character, I needed to specify that character as my separator.
df_fips = pd.read_csv('state_fips_codes_from_census.txt',
                      sep='|')

# Renaming columns to help facilitate an upcoming merge:
df_fips.rename(columns={'STATE': 'State_Abbrev',
                        'STATEFP': 'state'}, inplace=True)

# Converting the 'state' column into a zero-filled string
# so that its values can get matched with the corresponding
# FIPS codes in df_regression_data_county:
df_fips['state'] = df_fips['state'].astype('str').str.zfill(2)
df_fips
 | State_Abbrev | state | STATENS | STATE_NAME |
---|---|---|---|---|
0 | AL | 01 | 1779775 | Alabama |
1 | AK | 02 | 1785533 | Alaska |
2 | AZ | 04 | 1779777 | Arizona |
... | ... | ... | ... | ... |
54 | PR | 72 | 1779808 | Puerto Rico |
55 | UM | 74 | 1878752 | U.S. Minor Outlying Islands |
56 | VI | 78 | 1802710 | United States Virgin Islands |
57 rows × 4 columns
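The zero-filling step above matters because the Census API returns FIPS codes as zero-padded strings (e.g. '01'), whereas pandas reads the codes in the FIPS table as plain integers (e.g. 1), and an integer key will never match a string key. A minimal sketch of the fix, using hypothetical three-state mini-tables:

```python
import pandas as pd

# FIPS codes as read from a text file (integers) vs. as returned
# by the Census API (zero-padded strings); small made-up sample:
df_fips_demo = pd.DataFrame({'state': [1, 2, 4],
                             'State_Abbrev': ['AL', 'AK', 'AZ']})
df_api_demo = pd.DataFrame({'state': ['01', '02', '04'],
                            'Total_Pop': [5000000, 730000, 7200000]})

# Converting the integer codes to zero-filled two-character strings
# makes the merge keys comparable:
df_fips_demo['state'] = df_fips_demo['state'].astype('str').str.zfill(2)
merged = df_fips_demo.merge(df_api_demo, on='state', how='left')
print(merged)
```

Without the zfill() call, the merge keys would have incompatible dtypes (int64 vs. object), and no rows would match.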
df_regression_data_county = df_regression_data_county.merge(
    df_fips[['State_Abbrev', 'state']],
    on='state', how='left')
df_regression_data_county
 | Year | state | ... | Pct_With_Bachelors_Degree | State_Abbrev |
---|---|---|---|---|---|
0 | 2023 | 01 | ... | 28.282680 | AL |
1 | 2023 | 01 | ... | 32.797637 | AL |
2 | 2023 | 01 | ... | 11.464715 | AL |
... | ... | ... | ... | ... | ... |
3219 | 2023 | 72 | ... | 22.481068 | PR |
3220 | 2023 | 72 | ... | 18.360524 | PR |
3221 | 2023 | 72 | ... | 27.116371 | PR |
3222 rows × 18 columns
Now that we know each county’s state abbreviation, we can merge region data into our table as well. We’ll do so by importing state_regions.csv, a file that stores the state-region relationships found in this handy PDF file: https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
df_state_regions = pd.read_csv('state_regions.csv')
df_state_regions
 | State_Abbrev | Region |
---|---|---|
0 | WA | West |
1 | OR | West |
2 | CA | West |
... | ... | ... |
48 | VT | Northeast |
49 | NH | Northeast |
50 | ME | Northeast |
51 rows × 2 columns
df_regression_data_county = df_regression_data_county.merge(
    df_state_regions, on='State_Abbrev', how='left')
df_regression_data_county
 | Year | state | ... | State_Abbrev | Region |
---|---|---|---|---|---|
0 | 2023 | 01 | ... | AL | South |
1 | 2023 | 01 | ... | AL | South |
2 | 2023 | 01 | ... | AL | South |
... | ... | ... | ... | ... | ... |
3219 | 2023 | 72 | ... | PR | NaN |
3220 | 2023 | 72 | ... | PR | NaN |
3221 | 2023 | 72 | ... | PR | NaN |
3222 rows × 19 columns
Merging region information into our state-level dataset#
Although the state-level Census dataset contains full state names, I’ll still use the same approach that I took for the county-level dataset, as that reduces the amount of additional code I need to write. As the following cell demonstrates, it’s possible to chain multiple merge() calls together; this allows for more concise code, though if anything goes wrong, you may need to run each merge operation individually for debugging purposes.
df_regression_data_state = df_regression_data_state.merge(
    df_fips[['State_Abbrev', 'state']],
    on='state', how='left').merge(
        df_state_regions, on='State_Abbrev', how='left')
df_regression_data_state
 | Year | state | ... | State_Abbrev | Region |
---|---|---|---|---|---|
0 | 2023 | 01 | ... | AL | South |
1 | 2023 | 02 | ... | AK | West |
2 | 2023 | 04 | ... | AZ | West |
... | ... | ... | ... | ... | ... |
49 | 2023 | 55 | ... | WI | Midwest |
50 | 2023 | 56 | ... | WY | West |
51 | 2023 | 72 | ... | PR | NaN |
52 rows × 18 columns
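When a merge like the ones above produces unexpected NaN values (as with Puerto Rico’s missing Region entries), one debugging option is pandas’ indicator parameter, which flags how each row matched. A small sketch using hypothetical mini-versions of the two tables:

```python
import pandas as pd

# Hypothetical mini-versions of the state and region tables:
df_states_demo = pd.DataFrame({'State_Abbrev': ['AL', 'AK', 'PR']})
df_regions_demo = pd.DataFrame({'State_Abbrev': ['AL', 'AK'],
                                'Region': ['South', 'West']})

# indicator=True appends a categorical '_merge' column that labels
# each row as 'both', 'left_only', or 'right_only':
check = df_states_demo.merge(df_regions_demo, on='State_Abbrev',
                             how='left', indicator=True)

# Rows flagged 'left_only' had no match in the regions table;
# PR appears here, mirroring the NaN Region values shown above:
print(check[check['_merge'] == 'left_only'])
```

Filtering on '_merge' this way can pinpoint which keys failed to match without having to break a chained merge apart.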
Saving these DataFrames to .csv files so that they can be incorporated into the regressions section of Python for Nonprofits:
df_regression_data_county.to_csv(
    'Datasets/education_and_earnings_county.csv', index=False)
df_regression_data_state.to_csv(
    'Datasets/education_and_earnings_state.csv', index=False)
df_regression_data_national.to_csv(
    'Datasets/education_and_earnings_national.csv', index=False)
Appendix#
1: The Census Python Library#
It’s worth noting that there is also a ‘census’ Python library (available via PyPI and conda) that helps simplify the process of requesting API data. You may choose to use it for your own Census research, but I ended up not needing it for the data retrieval tasks shown above. In addition, forgoing the library allowed me to demonstrate how to retrieve data directly from an API, which you may find helpful when working with APIs that don’t have a corresponding Python library.
Here’s an example of the Census library in use:
# Example of reading data from the Census library into a
# Pandas DataFrame:
from census import Census
c = Census(key)
pd.DataFrame(c.acs5.get(('NAME', 'B01001_001E'),
                        {'for': 'county:*'}))
 | NAME | B01001_001E | state | county |
---|---|---|---|---|
0 | Autauga County, Alabama | 58761.0 | 01 | 001 |
1 | Baldwin County, Alabama | 233420.0 | 01 | 003 |
2 | Barbour County, Alabama | 24877.0 | 01 | 005 |
... | ... | ... | ... | ... |
3219 | Villalba Municipio, Puerto Rico | 21984.0 | 72 | 149 |
3220 | Yabucoa Municipio, Puerto Rico | 30313.0 | 72 | 151 |
3221 | Yauco Municipio, Puerto Rico | 33988.0 | 72 | 153 |
3222 rows × 4 columns
2: The requests library#
We can also use Python’s requests library to retrieve data from the Census API and then parse the response as JSON:
# The following code borrows from the requests library documentation at
# https://docs.python-requests.org/en/latest/index.html
import requests

r = requests.get(f'https://api.census.gov/data/{acs5_year}/\
acs/acs5?get=NAME,B01001_001E&for=county:*&key={key}')

# Printing the first 300 characters of this output:
print("r.text:\n", r.text[0:300], '\n')

# Printing the first 5 entries of r.json():
print("r.json:\n", r.json()[0:5], '\n')
r.text:
[["NAME","B01001_001E","state","county"],
["Autauga County, Alabama","58239","01","001"],
["Baldwin County, Alabama","227131","01","003"],
["Barbour County, Alabama","25259","01","005"],
["Bibb County, Alabama","22412","01","007"],
["Blount County, Alabama","58884","01","009"],
["Bullock County, Ala
r.json:
[['NAME', 'B01001_001E', 'state', 'county'], ['Autauga County, Alabama', '58239', '01', '001'], ['Baldwin County, Alabama', '227131', '01', '003'], ['Barbour County, Alabama', '25259', '01', '005'], ['Bibb County, Alabama', '22412', '01', '007']]
Converting our response to JSON allows it to be easily read into a Pandas DataFrame, as shown below:
pd.DataFrame(r.json()).head()
# Note that pd.DataFrame(r.text) would produce the following error:
# "ValueError: DataFrame constructor not properly called!"
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | NAME | B01001_001E | state | county |
1 | Autauga County, Alabama | 58239 | 01 | 001 |
2 | Baldwin County, Alabama | 227131 | 01 | 003 |
3 | Barbour County, Alabama | 25259 | 01 | 005 |
4 | Bibb County, Alabama | 22412 | 01 | 007 |
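Note that the DataFrame above stores the API’s header row as its first data row. If you’d rather promote that row to the column names, one option (a small variation, not the approach the script itself relies on) is to pass it to the columns argument:

```python
import pandas as pd

# A miniature stand-in for the r.json() output shown above:
response_json = [['NAME', 'B01001_001E', 'state', 'county'],
                 ['Autauga County, Alabama', '58239', '01', '001'],
                 ['Baldwin County, Alabama', '227131', '01', '003']]

# Passing the first row to the columns argument keeps the header
# out of the data rows:
df = pd.DataFrame(response_json[1:], columns=response_json[0])
print(df)
```

This yields a DataFrame whose columns are named NAME, B01001_001E, state, and county, with only the county rows stored as data.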
I included this approach in the appendix because you may find the requests library useful for other online data retrieval tasks. However, our use of pd.read_json() to import Census data rendered an explicit call to the requests library unnecessary.
3: Importing 51+ variables#
I tested the retrieve_census_data() function’s ability to import data for 51 or more variables via the following code. Feel free to uncomment and run it yourself to try out this functionality.
# test_list = list(df_variables['Name'][0:151])
# test_alias_dict = create_variable_aliases(
# df_variables=df_variables,
# variable_list=test_list)
# test_acs5_data = retrieve_census_data(
# survey='acs5', year=acs5_year, region='county',
# variable_list=test_list,
# rename_data_fields=True,
# field_vars_dict=test_alias_dict, key=key)
# test_acs5_data
program_end_time = time.time()
run_time = round(program_end_time - program_start_time, 3)
print(f"Finished running script in {run_time} seconds.")
Finished running script in 11.382 seconds.
Conclusion#
We imported a great deal of data within this script! Although we performed a few basic analyses as well, the real value of the data we retrieved will become apparent within the Mapping and Regressions sections of Python for Nonprofits.
Next, I’ll introduce the ‘census_import_scripts.py’ file, whose functions helped simplify much of the import and analysis code found within this section. I recommend reading through its documentation as well, since doing so will help you adapt its functions to your own use cases.