Up to this point, most of the data projects that I’ve been working on have been more about describing the past than predicting the future. I’ve been identifying which students demonstrate certain behaviors. I’ve been checking which of the small changes that we made in the homework system have made a difference in student performance. I parsed a whole bunch of JSON and wrote some regexes to see what sorts of things users were typing into a particular free-response box.

But my next project is predicting the future! (Based on what happened in the past.)

This is actually kind of terrifying because someone eyeballed the data once and very quickly came up with a very simple one-parameter model. And this model works pretty well most of the time! Later, I fed the data into something that actually calculates the parameter (instead of just guessing), and the calculated value was pretty close to the guess. In almost all cases!
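For the curious, the "calculate instead of guess" step could look something like the sketch below. It is only an illustration: the proportional form `y = k * x`, the least-squares criterion, and the names `fit_one_parameter` and `worst_misses` are all assumptions made for the example, not the actual model or code.

```python
def fit_one_parameter(xs, ys):
    """Least-squares estimate of k in the ASSUMED model y = k * x."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("need at least one nonzero x value")
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Closed-form minimizer of sum((y - k*x)^2) over k.
    return sxy / sxx


def worst_misses(xs, ys, k, n=3):
    """Return the n points with the largest absolute residual, as
    (residual, x, y) tuples, biggest miss first."""
    residuals = [(abs(y - k * x), x, y) for x, y in zip(xs, ys)]
    return sorted(residuals, reverse=True)[:n]
```

Comparing the fitted `k` with the eyeballed value is a one-liner, and `worst_misses` is a quick way to pull out the cases where the model misses by the most.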

The trouble is that when it doesn’t work, it fails in spectacular ways.

Also, making the model work better does not involve using a flashy algorithm. Making the model work better will require lots and lots of fussy details to improve the quality of the data being fed into the model. There are a lot of indirect measurements that a better model would need to know about. And since this is all data about humans in a system created by humans, if it looks like the future is evolving in a particular way and the old model is not working, then someone will change some aspect of reality, and all the future data lives in the new reality. The database might not even know how we changed the world, and it is possible that nobody remembered to write down what we did and when. So it is hard to know which discontinuities are of our own making, and which are interesting and unexplained.
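If somebody did write down what we changed and when, sorting the discontinuities could start with something as small as the sketch below. Everything in it is hypothetical: the `change_log` of `(date, note)` pairs, the list of `jump_dates` (assumed to come from some earlier step that spots jumps in the data), and the seven-day window are assumptions for illustration only.

```python
from datetime import date, timedelta


def label_discontinuities(jump_dates, change_log, window_days=7):
    """Split jumps into those near a logged change and those left unexplained.

    jump_dates: dates where the data appears to jump (hypothetical input).
    change_log: (date, note) pairs for changes we made (hypothetical input).
    """
    window = timedelta(days=window_days)
    explained, unexplained = [], []
    for jump in jump_dates:
        # Collect every logged change within the window around this jump.
        nearby = [note for when, note in change_log if abs(jump - when) <= window]
        if nearby:
            explained.append((jump, nearby))
        else:
            unexplained.append(jump)
    return explained, unexplained
```

A jump that lands near a logged change is not proven to be our doing, of course. The point is only to shrink the pile of discontinuities that still need a human to look at them.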

So this project is big enough to be meaningful, important, and interesting. But it is also big enough to be scary. I got nothing done on it today. The morning was spent doing administrative stuff and helping out with an interview, and I took the afternoon off. Next week is a new week, and it will take all of my good sense to chip away at writing tiny little functions that will extract sparkling clean data rather than building a road map that resembles a crazy wall of newspaper clippings, photographs, and string (like you would see in a detective show).