Stages of a Data Project

Oh wow! This sounds like a fun project! Previously we had just been adding up the total number of these, dividing by the total number of those, and then sorted based on this number. We can put together some interesting heuristics to determine which rows should be included in the analysis, and we can fit a more sophisticated model. Maybe we can run the model iteratively and apply a scoring metric that decays over time to incorporate what we know about the past. Since we have a bajillion rows in the table, we can have a training set and a test set. There is so much to do!
HUMANS ARE TERRIBLE. Approximately all of the data in the table was typed into fields on the site by people. People who have different views about the right way to go about doing things. People who can not spell. At. All. And don’t get me started about the fact that we designed the database that stores all this information, and you need to check the value in a particular column to figure out which other columns are important for understanding that row. There is no consistency.
A fairly naive regex catches enough of the important parts of the rows that we care about that we can still fit a good model, even if the data is not perfect. In fact, we can use the iterative idioms of the language to fit a whole family of models by grouping the data based on the values in important columns. Without lowering ourselves to using the sorts of for loops that seem very out of fashion these days.
Time passes. Code is written.
The parameters generated by the models are great! They have remarkably tiny p values. The conclusions drawn from the training set match with what we see in the rows that we withheld as a test set! This metric is so good at identifying the feature that we care about that we don’t even need to consider what happens over time. When we run the model, this code produces a score for each subgroup that is very effective for classification.
No matter which way we group the data, the scores seem to be remarkably consistent. In fact, they are all approximately equal to the value that you would get if you add up the total number of these and divide by the total number of those.