Using R to Apply a Function to Each Subset

tl;dr: ave is my new best friend.

So far I am bravely surviving First Summer. It is supposed to get up to 90 today, and it will be in the 80s for most of the rest of the week. I have several cans of lightly flavored fizzy water (and air conditioning).

Conventional wisdom says that every time you think that you want to use a loop in R, you are wrong. Your mind is thinking, “For each level of my categorical variable, find the subset of my data that corresponds to that level, calculate some quantity on the rows in this subset, store the result in a new column, and then put everything back together.” But there is someone out there who will judge you if you explicitly do this.
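For the record, here is roughly what that judged-upon loop looks like. This is only a minimal sketch on a made-up data frame: toy and its columns are hypothetical and are not the data used in the rest of this post.

```r
# Hypothetical toy data, invented purely for this illustration
toy <- data.frame(values = c(1, 2, 3, 10, 20),
                  labels = c("a", "a", "a", "b", "b"))

# Explicit loop: one pass per level of the categorical variable
toy$average <- NA_real_
for (lev in unique(toy$labels)) {
  in_group <- toy$labels == lev                        # rows in this subset
  toy$average[in_group] <- mean(toy$values[in_group])  # store the group mean
}

toy$average
# 2 2 2 15 15
```

It works, but it is a lot of bookkeeping for something that feels like it should be one line.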

I can now solve this problem in just one line!

But, wait, there’s more! Sometimes I will do an aggregate followed by a merge, and it will make me feel bad about myself. In my heart of hearts I know that instead of SELECTing so much stuff and then doing an aggregate followed by a merge, I should have written a better query and made more clever use of JOIN and GROUP BY.

Let’s imagine that you have the following data:

set.seed(10)
values <- runif(20, 1, 100)
labels <- c(rep("a", 10), rep("b", 7), rep("c", 3))
my_data <- data.frame(values=values, labels=labels)

Suppose you want to add a new column to this data frame holding the average of the values for each label. You still want 20 rows, but every a row should have the value 38.98794 in my_data$average, every b row should have 40.14164, and so on. Old me would have used an aggregate to find the average for each label and then a merge (on labels) to put everything back together.
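For comparison, the old aggregate-then-merge route might look something like the sketch below. The names group_means and merged are just ones picked for this illustration, and the data is rebuilt at the top so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained
set.seed(10)
my_data <- data.frame(values = runif(20, 1, 100),
                      labels = c(rep("a", 10), rep("b", 7), rep("c", 3)))

# Step 1: collapse to one row per label
group_means <- aggregate(values ~ labels, data = my_data, FUN = mean)
names(group_means)[names(group_means) == "values"] <- "average"

# Step 2: merge the per-label averages back onto the 20 original rows
merged <- merge(my_data, group_means, by = "labels")

nrow(merged)  # still 20 rows
```

One thing to watch for: by default merge sorts the result by the key and puts the key column first, so the rows and columns of merged are not guaranteed to come back in the order you started with.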

New me does it with the classic application of the ave function.

my_data$average <- ave(my_data$values, my_data$labels, FUN = mean)

NB: If you have NAs in your data, this can rain sadness upon you.

For example, if you kill one of your data points with my_data$values[2] <- NA and try this again, you will get NA for the average of the a values, because mean() returns NA whenever its input contains an NA. However, you can ignore the NAs with the following.

my_data$average <- ave(my_data$values, my_data$labels, FUN = function(x) mean(x, na.rm=TRUE))
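To see the difference side by side, here is a small sketch that computes both versions on the same data. The names with_na and without_na are made up for this example, and the data is rebuilt so the block runs on its own.

```r
# Rebuild the example data and knock out one value in the "a" group
set.seed(10)
my_data <- data.frame(values = runif(20, 1, 100),
                      labels = c(rep("a", 10), rep("b", 7), rep("c", 3)))
my_data$values[2] <- NA

# Plain mean: the single NA poisons the whole "a" group
with_na <- ave(my_data$values, my_data$labels, FUN = mean)

# na.rm = TRUE: the NA is dropped before averaging
without_na <- ave(my_data$values, my_data$labels,
                  FUN = function(x) mean(x, na.rm = TRUE))

all(is.na(with_na[my_data$labels == "a"]))  # TRUE
anyNA(without_na)                           # FALSE
```

Note that in the second version even row 2, whose value is NA, receives the group average, since the single number returned by mean is recycled across every row of the group.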

But nothing requires us to only calculate averages. We can calculate anything on our subsets! Without ending up with everything collapsed into one row per group (as happens with aggregate)!
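As a couple of illustrations of "anything", the sketch below adds each value's deviation from its group average and the size of its group. The column names centered and group_size are invented for this example.

```r
# Rebuild the example data so this block is self-contained
set.seed(10)
my_data <- data.frame(values = runif(20, 1, 100),
                      labels = c(rep("a", 10), rep("b", 7), rep("c", 3)))

# FUN returns one number per row: distance from the group's own mean
my_data$centered <- ave(my_data$values, my_data$labels,
                        FUN = function(x) x - mean(x))

# FUN returns a single number per group, recycled across the group's rows
my_data$group_size <- ave(my_data$values, my_data$labels, FUN = length)

nrow(my_data)  # still 20 rows, nothing collapsed
```

Either way the result has one entry per original row, which is exactly what makes ave convenient for adding columns.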

In my particular case, I needed to rank the values within each subset. You can rank the values within each subset from high-to-low with

my_data$rank <- ave(my_data$values, my_data$labels, FUN = function(x) rank(-x, ties.method = "first"))
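If the -x and the ties.method argument look mysterious, this tiny sketch on a made-up vector (x here is hypothetical, not part of the data above) shows what each one is doing.

```r
# Hypothetical vector with a tie, invented for this illustration
x <- c(5, 9, 5)

# Negating x makes the largest value rank first (high-to-low);
# ties.method = "first" breaks ties by order of appearance
high_to_low <- rank(-x, ties.method = "first")
high_to_low
# 2 1 3

# The default ties.method ("average") gives tied values a shared fractional rank
default_ties <- rank(-x)
default_ties
# 2.5 1.0 2.5
```

With "first", every row in a subset gets its own whole-number rank, which is usually what you want in a rank column.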

ave is now my go-to function for applying something to subsets of my data, especially when I do not feel like fighting with the *apply family of functions. I’m sure that everyone in the *apply family is a lovely function, but I have so far been unable to use them to quickly and efficiently write code that works. So for now I’m sticking with ave.

(Aside: How would you pronounce ave? I say av, like the beginning of average or avenue, like how I would say Mass Ave. But one of the cats suggests that it should rhyme with knave, and the other one insists on ahv-ay, like in Ave Maria.)