Jagged Data
No good reason for a lack of post yesterday. Poor time management, perhaps? A shifting sleep schedule?
Lots of the things that I read about R extol the value of modularizing your code and
building functions. Not surprising. This is taught early on in most computer science
courses. Even my elementary school-aged self had an intuitive understanding of the
GOSUB-RETURN paradigm and the even greater flexibility of TO BOX :SIDE.
(Aside: Real ’80s nerds programmed in BASIC and LOGO. The fact that neither of these is mentioned in that overly popular book that has been made into the movie that I plan to see tomorrow makes me think that the author is just a fake-’80s-nerd.)
(Another aside: Despite having used functions in programming for at least four years beforehand, when functions were formally defined in my ninth grade math class, they were totally confusing. Perhaps because instead of some sort of “function machine” metaphor or a recognition that most of the students taking honors math were avid programmers, my math class started by defining relations as subsets of the Cartesian product of two arbitrary sets and then defining functions as a special type of relation. This way of thinking about things must be left over from the New Math, right?)
In any event, I really, really, really want to turn my repeated code into functions. Every time I copy-paste, I have a sense of foreboding dread, as I wait for a shoe to drop. This is double-extra-especially the case every time I copy-paste something from one file into another file because I am going to forget which is the original, and I am never going to remember all the places that need to be changed when something needs changing (and I know that this will happen).
Let me summarize my concerns in just a few words: “Call by value” and “rectangular data”.
Functions in R call by value. Which is fine. This is exactly what I would hope the default behavior would be because I have a liberal arts education that taught me that there are few sins worse than using global variables. As global variables live neither in the heap nor in the stack, they are unable to ascend to heaven. I get that. I also have unresolved childhood trauma from my mother’s fury every time she went to use her scissors, and they were not where she left them because one of the other people in the house had moved them and not put them back. (Here the scissors are acting as a symbol representing global data. I learned about the difference between symbols and allegory in ninth grade English class.)
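For the record, here is a minimal sketch of what call by value looks like in practice. The function name `move_scissors` and the `kitchen` list are made up for illustration; the point is only that a change made to an argument inside a function stays inside the function.

```r
# Hypothetical helper: tries to move the scissors by modifying its argument.
move_scissors <- function(drawer) {
  drawer$scissors <- "somewhere else"  # changes only the function's local copy
  invisible(NULL)
}

kitchen <- list(scissors = "top drawer")
move_scissors(kitchen)

kitchen$scissors  # still "top drawer": the caller's list was never touched
```

The scissors are exactly where they were left, which is the behavior my mother would have wanted.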
But also, it is a major pain to deal with data that is not relatively uniform and regular. The platonic data structure is the spreadsheet (but with a fancier name). Every time I create a function, the easiest thing to do is to have it return a scalar, an array, or a spreadsheet.
And since we are in the world of call by value, everything that my function does must be communicated back to the caller by the return value. And sometimes (and by “sometimes” I mean “often”) I want my function to do something more sophisticated than take a few things as inputs and then give me one thing as an output. Maybe I want my function to take all of the special things out of my array, put the special things in a new array, and then return some magical number. As a person of a certain age with a liberal arts education, I strongly believe that it is important for functions to return a scalar, ideally the number zero. Barring that, the function should return a pointer to something important.
Yes, I could package all of my objects into one horrific and complicated data structure.
I could have my function build all my things, and then I could make a heterogeneous
list of all of the things that I had created and have the function return this list.
The caller could then unpack the list. Even better, the caller could then use vectorized
functions to simultaneously act on all the elements of the substructures of the
various disparate multi-dimensional objects in my list without relying on iterators
or for loops, hallelujah!
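A rough sketch of that list-returning approach might look like the following. Everything here is invented for illustration: the name `split_special`, the idea that "special" means "greater than a threshold", and the choice of the count as the magical number.

```r
# Hypothetical function: pulls the "special" values out of a vector and
# returns several differently shaped results bundled in one named list.
split_special <- function(x, threshold = 10) {
  is_special <- x > threshold
  list(
    special  = x[is_special],    # the special things, in a new vector
    ordinary = x[!is_special],   # what is left behind
    magic    = sum(is_special)   # the magical number (here, just a count)
  )
}

result <- split_special(c(3, 12, 7, 25, 1))

# The caller unpacks the pieces by name.
result$special
result$magic

# One function applied across several pieces of the list, no for loop.
piece_means <- sapply(result[c("special", "ordinary")], mean)
```

Naming the list elements is what keeps the unpacking bearable: the caller asks for `result$special` rather than remembering that it was the first thing in the list.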
Almost all the documentation that pleads with me to create functions and tries to spread the good news of functions is suspiciously silent about what really is believed to be the right way to have my function work with a bunch of objects of wildly differing types. But my hunch is that packaging them into a list is what I am supposed to be doing. I’m pretty sure that when I put my function’s local variables into a list that they are not actually being copied, so I don’t need to worry about memory. (Also, I live in a future where we have seemingly infinite memory, so I don’t worry about memory the way I did 25 years ago.)
Yet, because I am old and set in my ways, instead of putting in the time to become fluent with ways of working with lists of heterogeneous data, I am reconstructing my past and taking advantage of a way to create pointers in R and starting to take some of my fragile copy-paste code and build in some functions that are able to call by reference.
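There is more than one way to get pointer-like behavior in R, and this is only a sketch of one of them: environments, which are built into base R and are modified in place rather than copied when passed to a function. Treat it as an assumption about the general technique, not a description of any particular package. The names `extract_special` and `store` are made up.

```r
# Hypothetical helper: moves the "special" values out of a shared
# environment in place and returns only a count.
extract_special <- function(store, threshold = 10) {
  is_special <- store$values > threshold
  store$special <- store$values[is_special]   # visible to the caller
  store$values  <- store$values[!is_special]  # also visible to the caller
  sum(is_special)                             # the scalar return value
}

store <- new.env()
store$values <- c(3, 12, 7, 25, 1)

n <- extract_special(store)

n              # 2
store$values   # 3 7 1
store$special  # 12 25
```

The function returns a plain number, and the side effects land in the environment that the caller handed over. This is the scissors being moved on purpose, so it is worth keeping such environments few and clearly named.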