The Data Temple

Yesterday the local chapter of R-Ladies had its first book group session, and we discussed the first eight chapters of R for Data Science. I think I did a remarkably good job of stating my objections to the data management principles in Chapter 8 in a succinct and effective way.

The text takes a pretty hardcore point of view about keeping all the files related to a project in the same directory (or in subdirectories thereof) and not using absolute path names. Specifically, it suggests keeping all the data inside the project directory as well.

Clearly the people who wrote and edited this book deal with different types of data than I’m used to working with.

Have they dealt with an IRB and its restrictions on data collected from human subjects? I worked on a project where all the data was required to be locked in a magical temple, defended by ninja monks. We certainly did not keep the data on our laptops. We used the absolute path to the data on the remarkably secure server, which one could only access with a one-time-password token. The analysis server had enough memory that no one needed to warn the IRB about swap files.
Have they worked with remarkably large datasets that are used for different projects? If you have a few dozen key-value pairs (like “NY”, “New York”), sure, keep a copy with each project that needs them. But I have a file that is remarkably large that I need to use on a regular basis. It doesn’t have anything particularly secret in it – in fact, I downloaded it from a government website. But it seems silly for me to have a separate copy of it included with each project that uses it. Also, the government updates the file every now and then, and it would be nice to only have to replace it in one place to get all of my scripts to use the newest version. I can keep the current version of the file in my data temple (no protective monks required), and I can systematically archive the older versions so that I can replicate previous analyses.
If your script can talk directly to the database where your data lives, that is likely going to be much better than copying files around at all.
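
The shared-file arrangement described above (one copy in a central place, older versions archived, every script picking up the newest one) could be sketched in base R roughly as follows. This is only an illustration under assumptions of my own: the `DATA_TEMPLE` environment variable, the `<name>_<YYYY-MM-DD>.csv` naming scheme, and the helper names `data_temple()` and `temple_file()` are all hypothetical, not anything from the book or from an existing package.

```r
# Hypothetical convention: the environment variable DATA_TEMPLE holds the
# absolute path of the shared data directory, so no script hard-codes it.
data_temple <- function() {
  root <- Sys.getenv("DATA_TEMPLE", unset = NA)
  if (is.na(root) || !nzchar(root)) {
    stop("Set the DATA_TEMPLE environment variable to the shared data directory")
  }
  normalizePath(root, mustWork = TRUE)
}

# Assumed naming scheme: each version is archived as <name>_<YYYY-MM-DD>.csv.
# With version = NULL the most recent file is returned; passing a date string
# pins an older version so that a previous analysis can be replicated.
temple_file <- function(name, version = NULL, root = data_temple()) {
  pattern <- paste0("^", name, "_[0-9]{4}-[0-9]{2}-[0-9]{2}\\.csv$")
  candidates <- sort(list.files(root, pattern = pattern))
  if (length(candidates) == 0) {
    stop("No archived versions of '", name, "' found in ", root)
  }
  if (is.null(version)) {
    # ISO dates sort chronologically, so the last candidate is the newest.
    chosen <- candidates[length(candidates)]
  } else {
    chosen <- paste0(name, "_", version, ".csv")
    if (!chosen %in% candidates) {
      stop("Version ", version, " of '", name, "' not found in ", root)
    }
  }
  file.path(root, chosen)
}

# Example use in a project script (the path and file name are placeholders):
# Sys.setenv(DATA_TEMPLE = "/path/to/shared/data")
# zips <- read.csv(temple_file("zipcodes"))
# old  <- read.csv(temple_file("zipcodes", version = "2020-01-01"))
```

The point of the design is that the absolute path lives in exactly one place outside the project, so the scripts themselves stay portable while the data stays put.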

A lot of the book is just an ad for RStudio. It’s fine for the authors to present it as one way to build a reproducible workflow, but it is somewhat annoying that they don’t acknowledge that other sets of tools could be just as effective.

Now that I’ve gotten myself all riled up, it might be time for me to get back to work on my manifesto.