Data Retrieval#
By Kenneth Burchfiel
Released under the MIT License
In order to begin analyzing data, we’ll first need to load it into our program. This script will demonstrate how to import a variety of data formats into Python using Pandas, a fantastic library for loading, reformatting, and analyzing data. Before I dive into the code, I’d like to provide a brief overview of Pandas for those who aren’t yet familiar with this library.
A Pandas primer#
You can think of Pandas as a tool for performing spreadsheet-based tasks* within Python programs. One benefit of performing such tasks in Python (rather than Excel, Google Sheets, or another spreadsheet program) is that, once you have them scripted, you can quickly rerun them whenever the original data gets updated.** You could even have your computer run your analysis script on a daily or hourly basis, thus freeing up time you’d otherwise spend on busywork so that you can focus on more interesting tasks.
A central feature of Pandas is the DataFrame, which is essentially a within-Python spreadsheet that you can filter, modify, and extend as needed. You can import data into DataFrames (as this section will demonstrate); perform calculations on specific DataFrame columns; graph DataFrame data using libraries like Plotly; and then export DataFrames to a variety of formats. DataFrames and Jupyter Notebooks work very well together: by executing Pandas code within Jupyter Notebook cells, you can easily see how particular operations will modify your DataFrame.
If you need more information about a given Pandas function, make sure to consult the library’s documentation (https://pandas.pydata.org/docs/user_guide/index.html); this is a well-written resource that I’ve relied on heavily in my own work. (In general, I recommend consulting official documentation (including the official Python documentation at https://docs.python.org/) whenever possible, though some libraries will have better documentation than others.)
Pandas will play a central role in many sections of Python for Nonprofits. Although PFN will introduce you to many essential Pandas operations, you’ll likely come across many more in your own Python programming journey.
* Pandas is not the only Python library that allows you to perform spreadsheet operations. Polars (https://pola.rs/) is another promising tool that, as of June 2025, has some incredible performance advantages over Pandas. However, these advantages likely won’t become relevant unless you’re working with especially large (e.g. multi-gigabyte) source files. In addition, Pandas 3 should allow for faster data processing (https://thenewstack.io/python-pandas-ditches-numpy-for-speedier-pyarrow/), so it’s very possible that, by the time you read this, Pandas will have at least partly closed the performance gap between itself and Polars.
** There are certainly ways to automate Excel tasks as well (e.g. using Visual Basic). I don’t have any experience with Visual Basic, so I’m not the best person to compare these two tools; however, I have no doubt that learning it would take some time, and given Python’s versatility and power, I would recommend applying that time to learning Python instead. (You can get an estimate of the world’s interest in Python versus Visual Basic by checking out the TIOBE index.)
Back to data retrieval#
One of the key strengths of Pandas is its compatibility with a wide variety of data formats. This script will demonstrate how to use Pandas to import data from .csv files, .xlsx files, SQL tables, and HTML pages; however, many other data types are supported, either by Python itself or via additional libraries. Later sections of Python for Nonprofits will introduce additional data sources.
import pandas as pd
from sqlalchemy import create_engine
# The above code can be found at
# https://docs.sqlalchemy.org/en/20/core/engines.html .
# The following two lines temporarily update your sys.path value
# to include the Appendix folder; that way, our import statement can
# access that folder in order to import its helper_funcs.py file.
# (This code was derived from Cameron's StackOverflow answer at
# https://stackoverflow.com/questions/4383571/
# importing-files-from-different-folder ).
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook # This function improves the
# appearance of PDF-based copies of notebooks, as it limits the number
# of rows and columns that can be displayed. It also allows us to display
# charts and maps in PNG rather than HTML format, thus making these
# graphics easier to view in PDF format.
# For more details on this function, consult the documentation
# for config_notebook within helper_funcs.py.
display_type = config_notebook(display_max_columns=7)
Importing .csv data#
Pandas’ read_csv() function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) simplifies the process of reading .csv data into your Python script. The following cell provides an example of this function.
Note that the file path argument begins with ../. This instructs the code to move up one directory within the file system (to the root Python for Nonprofits folder), which in turn allows it to enter the Appendix folder and retrieve the curr_enrollment.csv file stored there.
df_curr_enrollment_csv = pd.read_csv(
'../Appendix/curr_enrollment.csv')
df_curr_enrollment_csv.head() # head() prints the first 5 rows of
# a DataFrame by default, though you can specify a larger or smaller number
# within the parentheses if needed. The last 5 rows can be viewed via
# .tail().
|   | first_name | last_name | gender | ... | class_of | level | level_for_sorting |
|---|---|---|---|---|---|---|---|
| 0 | Ashley | Jordan | F | ... | 2024 | Se | 3 |
| 1 | Lisa | Dunn | F | ... | 2024 | Se | 3 |
| 2 | Alexandra | Woods | F | ... | 2024 | Se | 3 |
| 3 | Kathleen | Carter | F | ... | 2024 | Se | 3 |
| 4 | Donna | Brewer | F | ... | 2024 | Se | 3 |
5 rows × 11 columns
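If you’d like to view a different number of rows, simply pass that number to head() or tail(). A quick sketch:

df_curr_enrollment_csv.head(10) # Displays the first 10 rows
df_curr_enrollment_csv.tail(3) # Displays the last 3 rows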
However, you’ll sometimes need to add additional arguments to read_csv() in order to correctly import your data. For example, if a separator other than a comma was used, you’ll want to specify that separator via the ‘sep’ argument. The following example shows what you’ll see if you try to use read_csv() to import tab-separated .csv data:
df_curr_enrollment_tab_csv = pd.read_csv(
'curr_enrollment_tab_separated.csv')
df_curr_enrollment_tab_csv.head()
|   | first_name\tlast_name\tgender\tmatriculation_year\tmatriculation_number\tstudent_id\tdate_of_birth\tcollege\tclass_of\tlevel\tlevel_for_sorting |
|---|---|
| 0 | Ashley\tJordan\tF\t2020\t1\t2020-1\t2002-12-16... |
| 1 | Lisa\tDunn\tF\t2020\t2\t2020-2\t2002-09-26\tST... |
| 2 | Alexandra\tWoods\tF\t2020\t3\t2020-3\t2002-05-... |
| 3 | Kathleen\tCarter\tF\t2020\t4\t2020-4\t2002-11-... |
| 4 | Donna\tBrewer\tF\t2020\t5\t2020-5\t2002-06-21\... |
The '\t' strings within the column and field values and the lack of column separators are dead giveaways that this file used tab separators rather than commas. To correctly import this information, you’ll need to add sep='\t' within your read_csv() call, as shown below. ('\t' represents tabs, just as '\n' represents newlines.)
df_curr_enrollment_tab_csv = pd.read_csv(
'curr_enrollment_tab_separated.csv', sep='\t')
df_curr_enrollment_tab_csv.head()
|   | first_name | last_name | gender | ... | class_of | level | level_for_sorting |
|---|---|---|---|---|---|---|---|
| 0 | Ashley | Jordan | F | ... | 2024 | Se | 3 |
| 1 | Lisa | Dunn | F | ... | 2024 | Se | 3 |
| 2 | Alexandra | Woods | F | ... | 2024 | Se | 3 |
| 3 | Kathleen | Carter | F | ... | 2024 | Se | 3 |
| 4 | Donna | Brewer | F | ... | 2024 | Se | 3 |
5 rows × 11 columns
In addition, if your .csv file uses an encoding other than UTF-8 (the default), you may need to specify an alternative codec (https://docs.python.org/3/library/codecs.html#standard-encodings) using the ‘encoding’ argument.
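For instance, if you received a .csv file saved in the Latin-1 encoding, you could import it along these lines. (This is just a sketch; ‘legacy_export.csv’ is a hypothetical filename rather than an actual PFN file.)

# 'legacy_export.csv' is a hypothetical Latin-1-encoded file; 'latin-1'
# is one of many codecs listed within the Python documentation.
df_legacy = pd.read_csv('legacy_export.csv', encoding='latin-1')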
read_csv() can also be used for .csv files hosted online; you’ll simply substitute a URL for your local file path in your function call. For instance, the following cell reads in a .csv file from GitHub that will be processed by a later section of Python for Nonprofits:
df_web_csv_example = pd.read_csv(
'https://raw.githubusercontent.com/kburchfiel/pfn/\
refs/heads/main/Data_Prep/winter_results/STB_So_results.csv')
df_web_csv_example
|   | MATRIC# | SEASON | SURVEY_SCORE | STARTINGYR | MATRICYR |
|---|---|---|---|---|---|
| 0 | 2885 | W | 60.0% | 23 | 22 |
| 1 | 1875 | W | 67.0% | 23 | 22 |
| ... | ... | ... | ... | ... | ... |
| 974 | 410 | W | 73.0% | 23 | 22 |
| 975 | 1581 | W | 61.0% | 23 | 22 |
976 rows × 5 columns
While this is a handy feature, if you’re dealing with a large and unchanging dataset, you may want to consider downloading it to your computer beforehand, as reading in the file from your local hard drive will likely be faster than retrieving it via your internet connection.
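For instance, you could perform that one-time download with Pandas itself via to_csv(), an export method we’ll revisit later in this section. (A sketch; the local filename here is arbitrary.)

# Save the remote data locally once...
df_web_csv_example.to_csv('STB_So_results_local.csv', index=False)
# ...then read the cached copy in future runs:
df_web_csv_example = pd.read_csv('STB_So_results_local.csv')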
Importing .xlsx data#
Importing .xlsx files is also easy to do within Python, although I’ve found that this process can take longer to execute than importing .csv files. (Therefore, I tend to prefer .csv files over .xlsx ones when importing data into Python.)
The following code imports an .xlsx version of the same current enrollment dataset that we imported earlier. You may need to install the openpyxl library in order for this code to run on your computer.
df_curr_enrollment_xlsx = pd.read_excel('curr_enrollment.xlsx')
df_curr_enrollment_xlsx.head()
|   | first_name | last_name | gender | ... | class_of | level | level_for_sorting |
|---|---|---|---|---|---|---|---|
| 0 | Ashley | Jordan | F | ... | 2024 | Se | 3 |
| 1 | Lisa | Dunn | F | ... | 2024 | Se | 3 |
| 2 | Alexandra | Woods | F | ... | 2024 | Se | 3 |
| 3 | Kathleen | Carter | F | ... | 2024 | Se | 3 |
| 4 | Donna | Brewer | F | ... | 2024 | Se | 3 |
5 rows × 11 columns
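Incidentally, if an .xlsx workbook contains multiple sheets, you can choose which one to import via read_excel()’s sheet_name argument, which accepts either a sheet’s name or its 0-based position. A quick sketch:

# sheet_name=0 (the default) imports the workbook's first sheet;
# a sheet's name (e.g. sheet_name='Sheet1') can be passed instead.
df_first_sheet = pd.read_excel('curr_enrollment.xlsx', sheet_name=0)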
The read_excel() cell above took 1.38 seconds to execute on my computer, whereas the first .csv import cell ran in just 21 milliseconds (or 0.021 seconds). In other words, the .csv import was around 65 times faster than the .xlsx import.
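If you’d like to reproduce this comparison on your own machine, one simple option (a minimal sketch) is Python’s built-in time module:

import time
start = time.time()
pd.read_excel('curr_enrollment.xlsx')
print(f'{time.time() - start:.3f} seconds') # Prints the elapsed import time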
Importing SQL data#
The SQLAlchemy and Pandas libraries make it easy to import SQL tables into your script. Many different SQL database types (such as PostgreSQL) are supported, but this example will focus on a SQLite table created within PFN’s Appendix.
In order to import data from a database, we’ll first need to connect to it via SQLAlchemy’s create_engine() function:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
# Based on:
# https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#pysqlite
Note that the first 3 forward slashes indicate that a relative path will be used. The actual relative path (‘../Appendix/nvcu_db.db’) follows those forward slashes.
For guidance on creating engines for other database types (such as PostgreSQL, MySQL, and others), visit https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls .
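For instance, a PostgreSQL engine might be created along these lines. (This is just a sketch; the username, password, host, port, and database name below are all placeholders.)

# Every credential below is a placeholder; you may also need to install
# a PostgreSQL driver such as psycopg2 for this code to run.
e_postgres = create_engine(
    'postgresql://username:password@localhost:5432/database_name')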
Once this SQLAlchemy engine has been created, we can use it to read in data from our database:
df_curr_enrollment_sql = pd.read_sql(
'select * from curr_enrollment', con=e)
df_curr_enrollment_sql.head()
|   | first_name | last_name | gender | ... | class_of | level | level_for_sorting |
|---|---|---|---|---|---|---|---|
| 0 | Ashley | Jordan | F | ... | 2024 | Se | 3 |
| 1 | Lisa | Dunn | F | ... | 2024 | Se | 3 |
| 2 | Alexandra | Woods | F | ... | 2024 | Se | 3 |
| 3 | Kathleen | Carter | F | ... | 2024 | Se | 3 |
| 4 | Donna | Brewer | F | ... | 2024 | Se | 3 |
5 rows × 11 columns
SQL is a whole language in itself, but you need not be a SQL expert to use Python to connect to database tables. In the above cell, 'select * from curr_enrollment' is a line of SQL code that requests that all fields (and, thus, all data) be retrieved from the curr_enrollment SQLite table.
SQL makes it possible to select specific columns, choose only particular groups of rows, and perform other more advanced operations. However, many of these operations can also be performed within Python. If you’re already a SQL whiz, feel free to pass more advanced code to read_sql(); if you’re a SQL novice, ‘select * from [table]’ will still get you pretty far!
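As an illustration, here’s a sketch of a slightly more targeted query that retrieves only the names of current seniors. (The first_name, last_name, and level columns, along with the ‘Se’ level code, appear within the curr_enrollment output shown above.)

# Retrieving only two columns, and only rows whose level is 'Se':
df_seniors = pd.read_sql(
    "select first_name, last_name from curr_enrollment where level = 'Se'",
    con=e)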
Also note that a SQLAlchemy engine can be used as the ‘con’ argument within both read_sql() and to_sql(). This is mentioned explicitly within Pandas’ to_sql() documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html) but can also be inferred from Pandas’ read_sql() page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html). That page states that ‘con’ needs to be a ‘SQLAlchemy connectable,’ and the source code for sql.py (within the pandas-dev/pandas repository) specifies that a ‘SQLAlchemy connectable’ can be either an engine or a connection. I mention this in part because using an engine as your argument for the ‘con’ parameter in read_sql() and to_sql() can save you a bit of code.
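If you’d rather pass an explicit connection, here’s a minimal sketch of that approach (using the same engine created above):

# e.connect() opens a connection; the with block closes it automatically.
with e.connect() as conn:
    df_from_conn = pd.read_sql('select * from curr_enrollment', con=conn)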
Importing HTML data#
There are a number of ways to access data directly from the internet via Python. One means of doing so is pd.read_html(), which lets you read HTML tables from websites directly into DataFrames.
The following code imports a list of sample Census API connection strings into a DataFrame. read_html() returns a list of DataFrames (one for each table it finds on the page); the [0] at the end of the call selects the first of these. (There’s only one HTML table on this page, but it’s still necessary to add [0].)
Note that the import isn’t perfect: for some reason, the data in the ‘Number’ field on the right didn’t get downloaded successfully. As noted in the read_html() documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html), “Expect to do some cleanup after you call this function.”
df_census_examples = pd.read_html(
'https://api.census.gov/data/2022/acs/acs5/examples.html')[0]
df_census_examples.head()
|   | Geography Hierarchy | Geography Level | Example URL | Number |
|---|---|---|---|---|
| 0 | us | 010 | https://api.census.gov/data/2022/acs/acs5?get=... | NaN |
| 1 | us | 010 | https://api.census.gov/data/2022/acs/acs5?get=... | NaN |
| 2 | region | 020 | https://api.census.gov/data/2022/acs/acs5?get=... | NaN |
| 3 | region | 020 | https://api.census.gov/data/2022/acs/acs5?get=... | NaN |
| 4 | division | 030 | https://api.census.gov/data/2022/acs/acs5?get=... | NaN |
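Since the ‘Number’ column consists entirely of NaN values here, one simple cleanup step (a sketch) would be to drop that column altogether:

# Removing the 'Number' column, which failed to download:
df_census_examples = df_census_examples.drop(columns=['Number'])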
What about exports?#
The above examples focused on importing data; however, Pandas also makes it easy to export data to a variety of formats.
Exporting data to a .csv file:#
df_curr_enrollment_csv.to_csv(
'curr_enrollment_export.csv', index=False)
index=False specifies that we do not want to include the DataFrame’s index within our .csv file. This is the best option in this case, since the DataFrame’s index doesn’t contain any meaningful data, and it will get recreated anyway when we import the .csv file back into Python. As the following two cells demonstrate, if we leave this argument out, then attempt to import the DataFrame, we’ll end up with an ugly Unnamed: 0 column:
df_curr_enrollment_csv.iloc[0:5].to_csv(
'curr_enrollment_export_with_index.csv') # Using .iloc to export
# only the first 5 rows will reduce the size of this file within
# our project folder.
pd.read_csv('curr_enrollment_export_with_index.csv')
|   | Unnamed: 0 | first_name | last_name | ... | class_of | level | level_for_sorting |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Ashley | Jordan | ... | 2024 | Se | 3 |
| 1 | 1 | Lisa | Dunn | ... | 2024 | Se | 3 |
| 2 | 2 | Alexandra | Woods | ... | 2024 | Se | 3 |
| 3 | 3 | Kathleen | Carter | ... | 2024 | Se | 3 |
| 4 | 4 | Donna | Brewer | ... | 2024 | Se | 3 |
5 rows × 12 columns
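Alternatively, if you’ve received a .csv file that already has an index column saved within it, you can instruct read_csv() to treat that column as the index via its index_col argument. A quick sketch:

# index_col=0 converts the file's first column back into the index:
pd.read_csv('curr_enrollment_export_with_index.csv', index_col=0)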
Exporting data to an .xlsx file:#
df_curr_enrollment_xlsx.to_excel(
'curr_enrollment_export.xlsx', index=False)
On my computer, this .xlsx export took 52 times as long as the .csv export (1.36 seconds vs. 26 milliseconds). Therefore, as noted earlier, you may want to use .csv files in place of .xlsx files when working in Python.
For examples of exporting data to SQL tables via to_sql(), reference the nvcu_db_gen.ipynb file within the Appendix.
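As a quick sketch of what such an export can look like (the table name below is arbitrary, and nvcu_db_gen.ipynb shows the approach actually used within PFN):

# if_exists='replace' overwrites any existing table with this name:
df_curr_enrollment_sql.to_sql(
    'curr_enrollment_copy', con=e, if_exists='replace', index=False)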
A caveat about data types#
You’ll often find that read_csv(), read_excel(), and read_sql() will create identical DataFrames as long as the source data is the same. However, in some cases, the data types returned by these functions will differ, requiring you to add some additional code to resolve this discrepancy.
For instance, if you retrieve a dataset from a SQL table, dates of birth might be formatted as datetime values. Meanwhile, those same values may be formatted as strings if they were imported from a .csv file. Therefore, if you’re switching your import source from a SQL table to a .csv file or vice versa, it’s not a bad idea to check the data types of your imported fields using df.dtypes (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).
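For example, the following sketch displays each column’s data type, then converts the date_of_birth field (visible within the tab-separated output earlier in this section) from strings to datetime values:

df_curr_enrollment_csv.dtypes # Lists each column's data type
# pd.to_datetime() converts string-based dates to datetime values:
df_curr_enrollment_csv['date_of_birth'] = pd.to_datetime(
    df_curr_enrollment_csv['date_of_birth'])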
Conclusion#
As this notebook demonstrates, Pandas tends to make data retrieval very simple. Cleaning and reformatting data, the subject of the following chapter, isn’t always so simple, but Pandas will play a very valuable role in that process as well.
For a more comprehensive listing of the many data import (and export) tasks that Pandas can perform, visit https://pandas.pydata.org/pandas-docs/stable/reference/io.html .