Because I hate conventional advice, I have no issues with telling the entire internet about my flaws and the stupid mistakes that I make. As a mathematician, I have a lot of practice at making mistakes. Fortunately, I am also getting better at fixing them. I also have no problem letting the internet know that I am skeptical about a lot of modern “machine learning” techniques, and I suspect that a lot of problems can be solved using less exciting methods that have been around since the 19th century.

Recently I have been getting a lot of crazy results from my R code that look like this:

NA       <NA> NA
NA.1     <NA> NA
NA.2     <NA> NA
NA.3     <NA> NA
14       <NA> 12
NA.5     <NA> NA
NA.6     <NA> NA

The row names are weirdly numbered NAs, like NA.1, NA.2, and so on. The values are all NA. It is a chaos of NAs and things that just do not look right. Some sort of data frame nightmare version of that scene from Being John Malkovich where John Malkovich goes through the portal into John Malkovich’s mind. But with NAs instead of John Malkovich.

And then I figured out what I was doing wrong: if a variable contains NA entries and I subset with bracket notation on that variable, this is exactly what happens.

For example, I can make this sort of thing happen with:

V1 <- c("red", "orange", NA, NA, "blue", "blue")
V2 <- c(1, 2, NA, 2, NA, 7)
the_data <- data.frame(V1=V1, V2=V2)
the_data[the_data$V2 > 1, ]   # this is where the NA rows come from
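
The mechanism is that the comparison itself goes through the NAs: the_data$V2 > 1 evaluates to a logical vector with NA wherever V2 is NA, and indexing a data frame with an NA does not drop the row, it returns an all-NA row. Roughly:

the_data$V2 > 1
# [1] FALSE  TRUE    NA  TRUE    NA  TRUE
# Rows 2, 4 and 6 are kept as usual; each NA in the index contributes an
# all-NA row to the result, which is where the NA.1-style row names come from.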

This can be prevented with the_data[!is.na(the_data$V2) & the_data$V2 > 1, ], which is annoying (but effective).
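
Applied to the toy data above, the guard does what you would hope: the & with !is.na() turns every would-be NA in the index into FALSE, so only real rows survive.

keep <- !is.na(the_data$V2) & the_data$V2 > 1
keep
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
the_data[keep, ]
#       V1 V2
# 2 orange  2
# 4   <NA>  2
# 6   blue  7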

In my particular case, it can also be prevented by not using an outer join when I meant to use an inner join. My colleague claims that the only join one should ever use is the left join, but I am quite partial to the inner join. Except when my fingers type faster than my brain thinks and I end up with an outer join by mistake.
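
For a sketch of where those NAs sneak in, here is the difference between the join flavours in base R’s merge(), on made-up tables since the real query isn’t shown:

colours <- data.frame(id = c(1, 2, 3), colour = c("red", "orange", "blue"))
counts  <- data.frame(id = c(2, 3, 4), count  = c(10, 12, 14))

merge(colours, counts, by = "id")                # inner join: only ids 2 and 3
merge(colours, counts, by = "id", all.x = TRUE)  # left join: id 1 gets count = NA
merge(colours, counts, by = "id", all = TRUE)    # full outer join: NAs on both sides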

There is also a terrible way to work around this if the NAs are in the data on purpose. This most often happens to me with dates, when the feature I’m looking at hasn’t happened yet. Fortunately for me, nothing that I care about existed back in 1969, so our standard way of making things go away is to send them to 1969.

First off, my terrible solution doesn’t work in R if you have factor variables, so you’d need to start by doing something like the_data$V1 <- as.character(the_data$V1). I rarely have factor variables these days because I’m reading almost all of my data directly out of a database, so everything that might be factor-like comes in as characters already.
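
If you do have factors lurking, one way to sweep them all to character at once (assuming you are happy for every factor column to become plain character) is:

# The sentinel assignment below fails on factors, because -1 is not one
# of the factor's levels, so convert every factor column to character first.
the_data[] <- lapply(the_data, function(col) {
  if (is.factor(col)) as.character(col) else col
})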

But once you have done that, you can very carefully conjugate with a replacement function.

the_data[is.na(the_data)] <- -1    # send every NA to a sentinel value
# Do your stuff that breaks with NA
the_data[the_data == -1] <- NA     # then send the sentinel values back to NA
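
For the date columns mentioned above, the same conjugation looks something like this; the column name event_date and the exact 1969 sentinel are made up for illustration:

sentinel <- as.Date("1969-12-31")
the_data$event_date[is.na(the_data$event_date)] <- sentinel
# Do your stuff that breaks with NA
the_data$event_date[the_data$event_date == sentinel] <- NA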

If you choose your value well (something that cannot already occur in the data), this replacement is an invertible function, so after we conjugate, we are right back where we started. If you choose poorly, then your inner mathematician ends up distracted by thoughts of whether it is worth thinking about what it means to define functions on strings, what it would mean for such functions to be linear, and whether there are any interesting diagrams to be drawn. I am fortunate to work in an office with a lot of people who know a heck of a lot more category theory than I do, so if it ever were valuable to define a category whose objects are columns in our database, I would know exactly who to ask for help.

But since there is no reason to delve into category theory or to come up with an overly complicated system to describe the errors in my code, tomorrow I will make sure that I have removed all of the outer joins from my code, and I will be able to focus on our discussion of whether the right scoring metric is the difference between the two important columns or their ratio.