Using R to Apply a Function to Each Subset
tl;dr: ave is my new best friend.
So far I am bravely surviving First Summer. It is supposed to get up to 90 today, and it will be in the 80s for most of the rest of the week. I have several cans of lightly flavored fizzy water (and air conditioning).
Conventional wisdom says that every time you think that you want to use a loop in R, you are wrong. Your mind is thinking, “For each level of my categorical variable, find the subset of my data that corresponds to that level, calculate some quantity on the rows in this subset, store this variable in a new column, and then put everything back together.” But there is someone out there who will judge you if you explicitly do this.
I can now solve this problem in just one line!
But, wait, there’s more! Sometimes I will do an aggregate followed by a merge, and it will make me feel bad about myself. In my heart of hearts I know that instead of SELECTing so much stuff and then doing an aggregate followed by a merge, I should have written a better query and made more clever use of JOIN and GROUP BY.
Let’s imagine that you have the following data:
set.seed(10)
values <- runif(20, 1, 100)
labels <- c(rep("a", 10), rep("b", 7), rep("c", 3))
my_data <- data.frame(values=values, labels=labels)
Suppose you want to add a new column to this dataframe with the average of the values for each label. So you still want 20 rows, but you want all of the a rows to have the value 38.98794 in my_data$average, all of the b rows to have the value 40.14164 in my_data$average, and so on and so forth. Old me would have used an aggregate to find the average for each label and then used a merge (on labels) to put everything back together.
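For comparison, here is a rough sketch of what the old two-step approach might look like. The names group_means and merged are just placeholders for illustration, and the data is rebuilt at the top so the snippet runs on its own.

```r
# Rebuild the example data so this snippet is self-contained.
set.seed(10)
my_data <- data.frame(values = runif(20, 1, 100),
                      labels = c(rep("a", 10), rep("b", 7), rep("c", 3)))

# Step 1: collapse to one row per label, holding that label's mean.
group_means <- aggregate(values ~ labels, data = my_data, FUN = mean)
names(group_means)[names(group_means) == "values"] <- "average"

# Step 2: join the per-label means back onto the original rows.
# Note that merge() sorts by the key, so row order is not guaranteed
# to match the original data frame.
merged <- merge(my_data, group_means, by = "labels")
```

It works, but it takes two steps and an intermediate data frame, and you have to keep an eye on the row order afterwards.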
New me does it with the classic application of the ave function.
my_data$average <- ave(my_data$values, my_data$labels, FUN = mean)
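If you want to convince yourself that this one line really does what the explicit loop from the start of the post would do, a quick check along these lines should help. This is only a sketch: loop_average and ave_average are names I made up for the comparison.

```r
# Rebuild the example data so this snippet is self-contained.
set.seed(10)
my_data <- data.frame(values = runif(20, 1, 100),
                      labels = c(rep("a", 10), rep("b", 7), rep("c", 3)))

# The explicit loop: for each label, find its rows, compute the mean,
# and write that mean back into the matching positions.
loop_average <- numeric(nrow(my_data))
for (lvl in as.character(unique(my_data$labels))) {
  rows <- my_data$labels == lvl
  loop_average[rows] <- mean(my_data$values[rows])
}

# The one-liner.
ave_average <- ave(my_data$values, my_data$labels, FUN = mean)

# The two results should agree.
all.equal(loop_average, ave_average)
```

Unlike the merge approach, ave returns a vector of the same length and in the same order as its input, so it can be assigned straight into a new column.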
NB: If you have NAs in your data, this can rain sadness upon you. For example, if you kill one of your data points with my_data$values[2] <- NA and try this again, you will get NA for the average of the a values, following the usual operation of the mean() function. However, you can ignore the NAs with the following.
my_data$average <- ave(my_data$values, my_data$labels, FUN = function(x) mean(x, na.rm=TRUE))
But nothing requires us to only calculate averages. We can calculate anything on our subsets! Without ending up with everything collapsed into its group (like happens with aggregate)!
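To make "anything" a bit more concrete, here is a small toy sketch with made-up numbers (amounts and groups are hypothetical, not part of my_data). It computes each value's share of its group total and each value's distance from its group mean, keeping one result per original row.

```r
# Toy data, invented purely for illustration.
amounts <- c(1, 3, 2, 2)
groups  <- c("a", "a", "b", "b")

# Each value as a fraction of its group's total.
share <- ave(amounts, groups, FUN = function(x) x / sum(x))
# share is 0.25 0.75 0.50 0.50

# Each value minus its group's mean (mean is the default FUN for ave).
centered <- amounts - ave(amounts, groups)
# centered is -1 1 0 0
```

The only real constraint is that FUN should return either a single value or a vector as long as the subset it was given, so the results can be slotted back into place.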
In my particular case, I needed to rank the values within each subset. You can rank the values within each subset from high-to-low with
my_data$rank <- ave(my_data$values, my_data$labels, FUN = function(x) rank(-x, ties.method = "first"))
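A tiny example with made-up numbers may make the ranking behaviour easier to see, including what ties.method = "first" does with ties. The vectors scores and teams are hypothetical and exist only for this sketch.

```r
# Toy data, invented purely for illustration.
scores <- c(5, 9, 1, 4, 4)
teams  <- c("a", "a", "a", "b", "b")

# Negating the values makes the largest score get rank 1.
# With ties.method = "first", tied values are ranked in order of appearance.
within_rank <- ave(scores, teams,
                   FUN = function(x) rank(-x, ties.method = "first"))
# within_rank is 2 1 3 1 2
```

In team a, the 9 is ranked first, then the 5, then the 1. In team b, the two 4s are tied, so the one that appears first gets rank 1.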
ave is now my go-to function for applying something to subsets of my data, especially when I do not feel like fighting with the *apply family of functions. I’m sure that everyone in the *apply family is a lovely function, but I have so far been unable to use them to quickly and efficiently write code that works. So for now I’m sticking with ave.
(Aside: How would you pronounce ave? I say av, like the beginning of average or avenue, like how I would say Mass Ave. But one of the cats suggests that it should rhyme with knave, and the other one insists on ahv-ay, like in Ave Maria.)