Random Numbers
Based on how long it has been since I have updated this blog, I am now confident that the sticky note that says “write about random numbers” has been on my laptop for several weeks. This is supported by the fact that the ink is somewhat smudged and that the other sticky notes are arranged in such a way that the random numbers note looks like it was there before them.
Part of the reason for blog silence goes under the general umbrella of “interview stuff” and part of it is “don’t snark about the landscapers until the project is done.”
-
I got my lesson on the irrigation system yesterday, which signals that the project is pretty much done; I also wrote a large check, which supports that theory. We are not sure what gets watered by zone 10, nor have we determined which zone waters my lemon tree. My beloved zone 11 has been remapped to zone 9. Unlike the iPod that I had many years ago, the controller for the sprinkler system does not have a “shuffle” mode. I need to make decisions about how much water each zone will get, how many times per day, and how many times per week. The lemon tree might be on its own.
-
Last year I worked on a project for which I wrote thousands of lines of code. Right before it was supposed to be merged to master, the stakeholders decided that they wanted to make major changes, but that they were not sure exactly what needed to be done, so the project was put aside. (Added wrinkle: I also wrote a bunch of API endpoints – on an entirely different system – as helpers for this project, and they have been in production for nearly a year – but based on the older version of the spec.) Yesterday I discovered that they have come up with a plan of what they want and that someone who is not me is going to be working on this. (They have no idea what a good call they made by asking someone else to work on it.) I let her know where to find my previous work. I also let her know that the test server that can be used for those API endpoints always has a highly sanitized database that contains none of the information that is needed to actually test them. Since it’s not anywhere in my branch, I also pasted into the Slack thread the chunk of SQL that I use to fill the database with random data to be used for testing.
-
A house on my block went up for sale a little over a month ago for $1.3 million. It is currently listed as “pending.” This is crazy-money because I live in North County inland, roughly 35 miles from downtown San Diego. It is routinely 20 degrees hotter here than it is in coastal areas. People ask if it is safe to live in this city; we have uniformly terrible schools; someone at work refers to this city with the suffix “-ghetto” because it is known for having a lot of poor people. Yes, it’s a much nicer house than mine, but is it really worth that much? Hashtag California real estate. No matter what economics class tried to teach me about efficient markets, I find it hard to believe that there is any rhyme or reason to the price of houses in San Diego County. For years politicians have been telling us that California is going to be a ghost town because it is too expensive and everyone is moving to Texas. I’ve never had any interest in moving to Texas. I’ve come to discover that some people in Texas support my point of view.
-
This one is clearly about interview stuff: We are interviewing people to do things that are similar to what I do, so I was put in charge of setting up the interview tasks. So far we have not had much in the way of interview tasks that test the candidate’s skills with SQL. I’ve developed two sets of tasks: One for the preliminary phone screen (hello candidates who are googling me because you have phone screens this week or next week) and one for later in the process. The one for the later round needs to be similar in spirit to the data that we use on a daily basis, but it can’t be real because our real data comes from small children, and it would be wrong for us to share it – even anonymized – with interview candidates. I needed to conjure data out of thin air.
-
When creating the generated data for the interview task, I came face-to-face with the oft-repeated sentiment that R is slow. I’ve always brushed that off because most things that I’ve done with R have finished running in a few seconds. Is it worth it for me to write statistical code in a language not designed for statistics in order to save the computer a few seconds of work? Up until now, no. This interview task needs to parallel our real data in size and complexity, so I needed to make a lot of fake data. To make sure the relationships were not perfectly deterministic, I needed random number generators to fuzz the connections between some variables. I started my data-generation code running on a Friday afternoon, and it was still running Monday morning, at which point I decided to rewrite it in Node.js, even though it is definitely not meant for statistical programming. During this process I learned that d3 (a library for making interactive data visualizations for the web) has a really nice collection of random number generators. I was able to rewrite the entire thing in Node before the R version finished running. The Node version did take a few minutes to run, but I can live with that.
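For the curious, here is a minimal sketch of the "fuzz the connections" idea. It is not the actual script: the variable names (ageMonths, score), the coefficients, and the noise level are all made up for illustration, and the seeded generator and Box-Muller normal sampler are hand-rolled so that the snippet runs in plain Node with no dependencies. In a real script, a library generator such as d3's randomNormal would fill the same role as the randomNormal helper below.

```javascript
// Small seeded PRNG (mulberry32) so the fake data is reproducible.
// Returns a function that yields uniform numbers in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Normal(mu, sigma) sampler built on a uniform source via Box-Muller.
function randomNormal(rng, mu, sigma) {
  return function () {
    const u = 1 - rng(); // shift to (0, 1] so Math.log never sees 0
    const v = rng();
    const z = Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
    return mu + sigma * z;
  };
}

// Hypothetical example: score depends on ageMonths, plus random noise,
// so the relationship is visible but not perfectly deterministic.
function makeRows(n, seed) {
  const rng = mulberry32(seed);
  const noise = randomNormal(rng, 0, 5);
  const rows = [];
  for (let i = 0; i < n; i++) {
    const ageMonths = 24 + Math.floor(rng() * 48); // integer in 24..71
    const score = 10 + 0.8 * ageMonths + noise();
    rows.push({ id: i + 1, ageMonths, score });
  }
  return rows;
}

const rows = makeRows(1000, 42);
console.log(rows.slice(0, 3));
```

Seeding the generator means the same seed always produces the same fake data, which is handy when the data needs to be regenerated on a different server.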
-
There is probably more that I could say about how I made my random data than “R is slow, and d3 came through for me,” but that would reveal the total Rube Goldberg machine of infrastructure that I am dealing with, as well as my short-term thinking: I wrote the data to the database on the server where the data generation script was running rather than to the database on the interview server.
-
On the other hand, the phone screen SQL task is pretty standard stuff without any nuance or complication to the data. You would probably write very similar queries when interviewing at other companies; if you have an understanding of the fundamentals, you should be fine.