Getting Data from the Census API (with R)
The census has a lot of data. Probably the biggest issue that I have with using census data is that there is so darn much of it. Also you can get it in a heck of a lot of ways.
This particular story is about the five-year American Community Survey (ACS) data, because it is reported down to the “block group” level and covers an enormous number of variables. If you don’t know what a block group is, you can think of it as a collection of roughly 1000 people who all live near each other (the Census aims for somewhere between 600 and 3,000). A block group is a subdivision of a census tract.
First off, maybe you don’t want to use the API. Maybe your data are so precious that you don’t want to risk having them logged on some server in the Department of Commerce. In that case, you probably want to download a CSV from the American FactFinder Download Center. The Download Center is really nice! And it has a lot of information organized by ZIP Code Tabulation Area (ZCTA), which is the Census’s answer to the ZIP code. If you already have street addresses with ZIP codes in your precious data set, you don’t even need to geocode anything. You will, however, still need to deal with the fact that there are a heck of a lot of variables and they all have names like HC03_VC54. Definitely get any metadata that the Census offers you alongside your data.
And what if your data are not precious and you want to use the API? Well, R is my tool of choice, so I’m going to tell you about how this works with R, but you can easily adapt this to your tool of choice as well. Long story short: you construct a URL that passes the variables of interest, and the Census will send back the information that you asked for.
Since I’m talking about Census data, I’m thinking about questions of the form, “Tell me about this feature relating to the people in this location.” So I need to specify a location and a feature. The location can be a state, a county, a census tract, a block group, or one of several other less-well-known geographies. The feature can be something as straightforward as the total number of people who live in the location, or it can be something pretty complicated, like the number of people of a certain combination of race, ethnicity, and age who rely on a specific mode of transportation to commute to work. You’ll find the names of the variables in the first column of this table. For example, B00001_001E tells you the unweighted sample count of the population in the location.
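The variable names look cryptic, but they follow a pattern. Here is a minimal sketch of how one breaks apart: the piece before the underscore is the table, the three digits after it are the line within that table, and the trailing letter says what kind of value it is (E for an estimate). The helper name split_census_variable is made up for illustration and is not part of any package; B01003_001E is the ACS total population estimate.

```r
# Split an ACS variable name such as "B00001_001E" into its parts.
# split_census_variable() is a hypothetical helper, not a package function.
split_census_variable <- function(variable) {
  pattern <- "^([A-Z0-9]+)_([0-9]{3})([A-Z]+)$"
  pieces <- regmatches(variable, regexec(pattern, variable))[[1]]
  if (length(pieces) == 0) stop("Unrecognized variable name: ", variable)
  list(table = pieces[2], line = pieces[3], suffix = pieces[4])
}

parts <- split_census_variable("B00001_001E")

# Several variables can go into one request as a comma-separated list.
variables <- c("B00001_001E", "B01003_001E")
get_clause <- paste(variables, collapse = ",")
```

The comma-separated string is what ends up after get= in the request URL, so asking for a second variable costs nothing more than another name in the list.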
So now that we know how to specify a variable, we also need to know how to specify a location. A block group is identified by combining the two-digit state code, the three-digit county code, the six-digit census tract code, and the one-digit block group code. Where do we get these codes?
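To make the format concrete before answering that question, here is a minimal sketch of gluing the four codes into one 12-digit identifier. The state and county codes are the real ones for Oneida County, NY; the tract and block group values are placeholders, and make_block_group_id is a made-up helper name.

```r
# Build the 12-digit block group identifier from its four parts.
# make_block_group_id() is a hypothetical helper for illustration.
make_block_group_id <- function(state, county, tract, block_group) {
  # as.integer() accepts numbers or digit strings;
  # sprintf() puts back any leading zeros that were lost
  sprintf("%02d%03d%06d%01d",
          as.integer(state), as.integer(county),
          as.integer(tract), as.integer(block_group))
}

# "36" and "065" are New York and Oneida County;
# the tract and block group below are placeholder values
block_group_id <- make_block_group_id("36", "065", "020900", "1")
```

The padding matters because the codes are identifiers, not numbers: a county stored as 65 instead of "065" will quietly produce the wrong location.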
We can use the Census Geocoder API!
In my example, I am going to feed in a latitude/longitude pair; the census geocoder can also work with street addresses.
Here’s the sample code, which should be pretty self-explanatory.
library(stringr)
library(jsonlite)
# Coordinates are kept as strings so they paste into the URL unchanged
latitude <- "43.1010304"
longitude <- "-75.2919624"
# The geocoder expects x = longitude and y = latitude
geo_url <- str_c("https://geocoding.geo.census.gov/geocoder/geographies/coordinates?x=", longitude,
                 "&y=", latitude,
                 "&benchmark=Public_AR_Current&vintage=Current_Current&layer=10&format=json")
geo_info <- fromJSON(geo_url)
# The first entry under "geographies" holds the codes for this location
geography <- geo_info[["result"]][["geographies"]][[1]]
block_group <- geography[["BLKGRP"]]
state <- geography[["STATE"]]
county <- geography[["COUNTY"]]
tract <- geography[["TRACT"]]
I plugged the latitude and longitude into the matching URL parameters (x is the longitude and y is the latitude), sent the request, parsed the JSON, and then extracted the codes. If you read through the raw response, you can learn that this location is in Oneida County, NY.
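If you have a street address instead of coordinates, the request looks much the same. This is a rough sketch that only builds the URL, assuming the geocoder's onelineaddress endpoint; the address is made up, and the response is nested differently from the coordinates version (the matches sit under addressMatches), so the extraction code above would need adjusting.

```r
# Build a geocoder URL from a one-line street address.
# The address is a made-up example.
address <- "123 Main St, Utica, NY 13501"

# URLencode() ships with base R (utils); reserved = TRUE also escapes commas
address_url <- paste0(
  "https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress",
  "?address=", utils::URLencode(address, reserved = TRUE),
  "&benchmark=Public_AR_Current&vintage=Current_Current&format=json"
)

# Uncomment to send the request (needs the jsonlite package and a network connection)
# address_info <- jsonlite::fromJSON(address_url)
```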
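Next I can ask the Census about the people who live in this block group. I’ll stick with B00001_001E, the unweighted sample count from above; if you want the estimated total population instead, the variable is B01003_001E. For this I will need an API key (do note that you can geocode without an API key, in case you were wondering where to find some free geocoding). It is not hard to request an API key.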
Adding to the code from above:
census_api_key <- "your_key_goes_here"
variable <- "B00001_001E"
# Spaces in the query string are written as %20
query_url <- str_c("https://api.census.gov/data/2017/acs/acs5?get=", variable, ",NAME",
                   "&for=block%20group:", block_group,
                   "&in=state:", state, "%20county:", county, "%20tract:", tract,
                   "&key=", census_api_key)
my_data <- fromJSON(query_url)
And then the my_data variable will hold the result of the call as a character matrix. The first row holds the column names, and the important part of the payload is in my_data[2, 1].
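Indexing by position gets fragile once you ask for more than one variable, so it can help to turn the response into a data frame. Here is a minimal sketch that uses a hand-built matrix as a stand-in for my_data; it has the same shape as a real response (header row first, everything as text), but the values are invented.

```r
# A stand-in for the parsed response: a character matrix whose first row
# holds the column names. The values here are invented for illustration.
my_data <- rbind(
  c("B00001_001E", "NAME", "state", "county", "tract", "block group"),
  c("100", "Block Group 1, Census Tract 1, Oneida County, New York",
    "36", "065", "000100", "1")
)

# Promote the first row to column names and keep the rest as data.
result <- as.data.frame(my_data[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(result) <- my_data[1, ]

# The values arrive as text, so convert the count to a number.
# The geography codes stay as text to preserve their leading zeros.
result$B00001_001E <- as.numeric(result$B00001_001E)
```

With that in place, result$B00001_001E reads the value by name rather than by position.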
There are R packages that will take care of assembling the URLs and extracting the data for you, but so far I have not found any tool that makes it easier to figure out the names of the variables that report the information that I care about. Watch out that you do not get sucked into an afternoon of sharing with everyone within earshot the median ages of people from various income groups who live in your neighborhood and who bicycle to work.