  • Getting Data from the Census API (with R)

    The census has a lot of data. Probably the biggest issue that I have with using census data is that there is so darn much of it, and you can get it in a heck of a lot of ways.

    This particular story is going to be about the five-year American Community Survey data, because that is reported down to the “block group” level and has information about an enormous number of variables. If you don’t know what a block group is, you can think of it as a collection of roughly 1,000 people who all live near each other. A block group is a subdivision of a census tract.

    First off, maybe you don’t want to use the API. Maybe your data are so precious that you don’t want to risk having them logged on some server in the Department of Commerce. In that case, you probably want to download a CSV from the American FactFinder Download Center. The Download Center is really nice! And it has a lot of information organized by ZIP Code Tabulation Area (ZCTA), which is the Census’s answer to the ZIP code. If you already have street addresses with ZIP codes in your precious data set, you don’t even need to geocode anything. You will, however, still need to deal with the fact that there are a heck of a lot of variables and they all have names like HC03_VC54. Definitely get any metadata that the Census offers you alongside your data.

    And what if your data are not precious and you want to use the API? Well, R is my tool of choice, so I’m going to tell you about how this works with R, but you can easily adapt the approach to whatever tool you prefer. Long story short: you construct a URL that passes along the variables and location of interest, and the Census will send back the information that you asked for.

    Since I’m talking about Census data, I’m thinking about questions of the form, “Tell me about this feature relating to the people in this location.” So I need to specify a location and a feature. The location can be a state, a county, a census tract, a block group, or one of several other less-well-known political boundaries. The feature can be something as straightforward as the total number of people who live in the location or it can be something pretty complicated, like the number of people of a certain combination of race, ethnicity, and age who rely on a specific mode of transportation to commute to work. You’ll find the names of the variables in the first column of this table. For example, B00001_001E tells you the unweighted sample count of the population in the location.

    So now that we know how to specify a variable, we need to also know how to specify a location. A block group is formed by combining the two-digit state code, the three-digit county code, the six-digit census tract code, and the one-digit block group code. Where do we get these codes?
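
    (Before we answer that, here’s what the assembled identifier looks like; a minimal sketch in R, with made-up tract and block group codes.)

    # Hypothetical example codes, just to show the shape of the identifier;
    # real values for a given location come from the geocoder below.
    state <- "36"        # New York
    county <- "065"      # Oneida County
    tract <- "021200"    # made-up six-digit census tract code
    block_group <- "2"   # made-up one-digit block group code
    
    # Concatenated, these form the twelve-digit block group identifier.
    block_group_id <- paste0(state, county, tract, block_group)
    # "360650212002"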

    We can use the Census Geocoder API!

    In my example, I am going to feed in a latitude/longitude pair; the census geocoder can also work with street addresses.

    Here’s the sample code, which should be pretty self-explanatory.

    library(stringr)   # str_c() for assembling URLs
    library(jsonlite)  # fromJSON() for fetching and parsing the response
    
    latitude <- "43.1010304"
    longitude <- "-75.2919624"
    
    # Build the geocoder request; note that x is longitude and y is latitude.
    geo_url <- str_c("https://geocoding.geo.census.gov/geocoder/geographies/coordinates?x=",
                     longitude, "&y=", latitude,
                     "&benchmark=Public_AR_Current&vintage=Current_Current",
                     "&layer=10&format=json")
    
    geo_info <- fromJSON(geo_url)
    
    # Extract the FIPS codes from the first geography layer in the response.
    block_group <- geo_info[["result"]][["geographies"]][[1]][["BLKGRP"]]
    state <- geo_info[["result"]][["geographies"]][[1]][["STATE"]]
    county <- geo_info[["result"]][["geographies"]][[1]][["COUNTY"]]
    tract <- geo_info[["result"]][["geographies"]][[1]][["TRACT"]]
    

    I plugged the latitude and longitude in for the correct variables in the URL, sent the request, parsed the JSON, and then extracted the information. If you read through the raw response, you can learn that this location is in Oneida County, NY.

    Next I can ask the Census for the total number of people who live in this block group. For this I will need an API key (do note that you can geocode without an API key in case you were wondering where to find some free geocoding). It is not hard to request an API key.

    Adding to the code from above:

    census_api_key <- "your_key_goes_here"
    
    variable <- "B00001_001E"  # unweighted sample count of the population
    
    # Request the variable (plus NAME) for one block group, nested inside
    # its state, county, and tract; %20 is a URL-encoded space.
    query_url <- str_c("https://api.census.gov/data/2017/acs/acs5?get=", variable,
                       ",NAME&for=block%20group:", block_group,
                       "&in=state:", state, "%20county:", county,
                       "%20tract:", tract, "&key=", census_api_key)
    
    my_data <- fromJSON(query_url)
    

    And then the my_data variable will hold the result of the call; the important part of the payload is in my_data[2, 1].
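
    For what it’s worth, jsonlite parses the response into a character matrix: the first row is a header and the second row holds the values (the geography columns come back along with the variables you asked for). Roughly like this, with the actual values invented for illustration:

    #      [,1]          [,2]                 [,3]    [,4]     [,5]     [,6]
    # [1,] "B00001_001E" "NAME"               "state" "county" "tract"  "block group"
    # [2,] "42"          "Block Group 2, ..." "36"    "065"    "021200" "2"
    
    # Everything arrives as a string, so convert before doing arithmetic.
    unweighted_count <- as.numeric(my_data[2, 1])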

    There are R packages that will take care of assembling the URLs and extracting the data for you, but so far I have not found any tool that makes it easier to figure out the names of the variables that report the information that I care about. Watch out that you do not get sucked into an afternoon of sharing with everyone within earshot the median ages of people from various income groups who live in your neighborhood and who bicycle to work.


  • Unknitting

    I had forgotten to account for the fact that cotton doesn’t stretch as much as wool does (and for my absurd affection for size 0 needles), so I had to rip this all the way back to the beginning and rewind the yarn.

    If anyone asks, I am telling him that it is a sock. Because I don’t want to hear his snarky comments about the fact that I am making legwarmers.

    [Photo: knitting project]


  • Putting Children to Work

    One of my scripts alerted me to a new user on our forums whose email address belongs to a domain where the MX record is set up incorrectly. Still, since the user was able to complete the registration process, the domain apparently can receive email.

    I checked out the domain. It belongs to a consulting company that provides services to small businesses.

    This new user’s first post on our message board was the sort of messy real-life application of algebra that I saw a lot of when I was spending too much of my time hanging out in the College of Business Administration.

    Do we think that this individual works for a consulting company while attending B-school and is looking for help on homework? Or that this person can’t figure out how to use fairly standard techniques in algebra to solve a client’s problem?


  • Friday Omens

    1. Cat barf.

    2. Today in Facebook ads: How to ride the NYC subway. How to outsource work to the Philippines. An oral medication for lung cancer.

    3. A colleague from when I was at UTK came to talk to us about whether our curriculum products could be helpful for a project that she’s working on.


  • My Own Personal Time Machine

    I listen to NPR on my way to work. One day as I neared the office, I heard what I assumed was an editing glitch in the story: the reporter repeated the last phrase, as if the audio hadn’t been spliced together correctly. The overlap was just a few seconds.

    But then it happened again another day. Another case in which the story on the radio went back in time a few seconds.

    And then again another day.

    If this keeps up, I will probably end up venturing a few minutes into the past.

    As it always happens at roughly the same geographical location, my assumption is that the radio station broadcasts multiple signals and that they are out of sync. The spot near the top of the hill is probably where my car’s radio decides that one signal is better than the other.

    But it’s amusing to pretend that I have a time machine.


  • Brief Updates from Work

    1. Finally set the flags so that my pages will refuse to load on the live site and will run only on test servers. My queries take several seconds to run, even after I have optimized them quite aggressively, so they cannot run on the live server.

    2. Relatedly: Finally figured out the right way to display the data so that it is easy for people to understand. There are still some decisions to be made, but now that I have acted like a grown-up and separated out the key part of one page into its own React component, it should not be too hard to change things up.

    3. Fighting with React. Remember how back in the 1990s all of your telephonic devices wanted to be closest to the wall? Well, in my world, all of the JavaScript libraries want to have control of the DOM. We all know that React will eventually get its way, which makes it the modem in this imperfect analogy and d3 is the answering machine, perhaps?

    4. It’s too bad that we don’t have some sort of IoT device monitoring the upstairs kitchen at work, because I would like to make a real-time visualization of the snack situation. We are doing well on coffee, tea, yogurt, sparkling water, cheese, fruit, and chocolate, but the granola bar situation is troublesome. Also, the upstairs microwave broke, so I can’t get my hot beverages as hot as I would like them to be.


  • PostgreSQL, unnest, and array_agg respecting order

    tl;dr: WITH ORDINALITY and ORDER BY ordinality are your friends.

    Once upon a time I could write fairly simple SQL queries, pull my data into R, and then manipulate the resulting data in R. It worked really well for me. But now I live in a world where my only two tools are PostgreSQL and JavaScript, and I don’t know anyone who reaches for JavaScript when she needs to reshape a bunch of tabular data.

    The problem that I have right now is that each lesson is associated with an ordered list of problemIDs. But these id numbers are not meaningful to humans; the problems have other names that people know about. The lessons table has a column called problemIDs, which contains an array of the problems’ ids (in order). Meanwhile, the problems table has a column of ids and a column of the corresponding problemNames.

    Something sort of like this:

    lessons

     id  | problemIDs
    -----+--------------
     100 | {1, 2, 3, 4}
     101 | {5, 6, 7, 8}

    problems

     id | problemName
    ----+---------------
      1 | 'some name'
      2 | 'whatever'
      3 | 'more stuff'
      4 | 'yadda yadda'

    (OK, not exactly. I have fudged the structure of things to simplify this example. Don’t even get me started on the column full of JSON that I also need to deal with in the real-life version of things.)

    My goal is to get the array problemIDs as well as the corresponding array of problemNames—and to have both of these in the same order as the original array problemIDs.

    That is, I want rows that look something like this:

    100, {1, 2, 3, 4}, {'some name', 'whatever', 'more stuff', 'yadda yadda'}

    Philosophically, what I need to do is unnest(problemIDs), then JOIN with the problems table to get the corresponding name for each problem, and then array_agg() everything back together into arrays.

    The issue, of course, is that goodness knows what order the query planner is going to do things in, and there is no guarantee that my arrays are going to be in the same order as the original array problemIDs. I had been working my way around this by using subqueries and various functions, but it was all pretty awkward.
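
    For instance, a naive version of the query looks something like this (a sketch, using the simplified tables from above); nothing in it pins down the order in which array_agg sees the rows:

    -- Naive version: without an ORDER BY inside array_agg, the rebuilt
    -- arrays come back in whatever order the rows happen to arrive.
    SELECT lessons.id,
           array_agg(problems.id)            AS "problemIDs",
           array_agg(problems."problemName") AS "problemNames"
    FROM lessons,
         unnest("problemIDs") AS "problemID"
    INNER JOIN problems
          ON "problemID" = problems.id
    GROUP BY lessons.id;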

    But on Friday I learned about the good news of unnest(foo) WITH ORDINALITY. Pair that with array_agg(bar ORDER BY ordinality), and the problem is solved.

    -- WITH ORDINALITY numbers the unnested elements, and ORDER BY inside
    -- array_agg uses that numbering to preserve the original array order.
    SELECT lessons.id,
           array_agg("problemID" ORDER BY ordinality) AS "problemIDs",
           array_agg(problems."problemName" ORDER BY ordinality) AS "problemNames"
    FROM lessons,
         unnest("problemIDs") WITH ORDINALITY AS "problemID"
    INNER JOIN problems
          ON "problemID" = problems.id
    GROUP BY lessons.id;
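
    If you want to try the pattern without creating real tables, here is a self-contained sketch that inlines the sample data from above as VALUES (the names for problems 5 through 8 are placeholders I made up):

    -- Self-contained demo of the WITH ORDINALITY pattern, using the
    -- simplified sample data from above as inline VALUES.
    WITH lessons(id, "problemIDs") AS (
        VALUES (100, ARRAY[1, 2, 3, 4]),
               (101, ARRAY[5, 6, 7, 8])
    ),
    problems(id, "problemName") AS (
        VALUES (1, 'some name'),  (2, 'whatever'),
               (3, 'more stuff'), (4, 'yadda yadda'),
               -- placeholder names for the second lesson's problems
               (5, 'five'), (6, 'six'), (7, 'seven'), (8, 'eight')
    )
    SELECT lessons.id,
           array_agg("problemID" ORDER BY ordinality) AS "problemIDs",
           array_agg(problems."problemName" ORDER BY ordinality) AS "problemNames"
    FROM lessons,
         unnest("problemIDs") WITH ORDINALITY AS "problemID"
    INNER JOIN problems
          ON "problemID" = problems.id
    GROUP BY lessons.id
    ORDER BY lessons.id;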
    
