Bad Ideas and Machine Learning

I’ve come up with another bad idea that I would like to share.

One of the things that stops me from using machine learning (and by that I mean “machine learning that is more sophisticated than just fitting a logistic regression model”) is that most of the time I don’t have any labeled training data. I know how many clusters I want to have, and I know what I want my clusters to represent in the real world, but I don’t have a training set that can tell the computer about this.

Aside: When I say that one of my research interests is the analysis of partially ordered categorical data, that is just a dressed-up way of saying that I work in education and there is a certain set of cultural expectations about grades, and I need to find ways to deal with grade data. A through F and “withdraw” is my canonical set of partially ordered categorical data. But each of our short answer items can take on one of five scores, and I don’t believe that our 0-1-3-5-7 grading scale is any more numerical than a Likert scale. Also, what can you learn from the problems that were not attempted?

In our online school, each grade is hand-crafted by request.

However, we also have a small number of in-person learning centers, and at the end of the year, each student receives a certificate. The certificates come in tiers. (The lowest tier might actually be called “participation.”) We need to take a large number of (student, course) pairs and assign them to tiers. And when I say “we” I mean “I.”

This is a harder problem than when I was running the calculus machine. The registrar’s office handled the difficult parts of record keeping for my large lecture calculus classes at the university: They did not allow students to appear in my class most of the way through the term. When students disappeared midway through the term and got low scores because of that, the registrar’s office knew whether this was a grade to be recorded or if there was some sort of weird situation going on behind the scenes. The registrar’s office provided me with a reasonably canonical list of the students in my calculus class. How many of your university’s academic policies are set by database administrators in the registrar’s office? (I’m pretty sure that UT’s rule about only being able to waitlist two courses was a database thing.)

(In my remake of The Graduate, Mrs. Robinson will teach Benjamin about databases.)

Here is my problem. Policy states that students can appear at any time during the academic year. Students can switch courses and sections at will. I have spent a significant number of hours negotiating with the enrollments table. My life is a tangle of SQL WHERE dropped_at IS NULL OR switch_id > 0 and such. I have a regex to deal with the notes field because sometimes there are very special cases that I need to know about.
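To give a flavor of that tangle, here is a minimal sketch, not my actual code. Only the enrollments table, the dropped_at and switch_id columns, and the existence of a notes field come from the real system. The other columns, the sample rows, and the "special case" pattern are made up for illustration.

```python
import re
import sqlite3

# Hypothetical schema: student_id, course_id, and the sample data are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE enrollments (
        student_id INTEGER,
        course_id  TEXT,
        dropped_at TEXT,
        switch_id  INTEGER,
        notes      TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ALG1", None, 0, ""),
        (2, "ALG1", "2015-02-01", 0, ""),                   # dropped, no switch
        (3, "ALG1", "2015-03-01", 7, "switched sections"),  # dropped via a switch
        (4, "GEOM", None, 0, "SPECIAL CASE: joined in March"),
    ],
)

# Keep rows that were never dropped, or that were "dropped" only as part of a switch.
rows = conn.execute(
    "SELECT student_id, course_id, notes FROM enrollments "
    "WHERE dropped_at IS NULL OR switch_id > 0 "
    "ORDER BY student_id"
).fetchall()

# Hypothetical pattern for notes that need a human to look at them.
SPECIAL = re.compile(r"special case", re.IGNORECASE)

active = [(student, course) for student, course, _ in rows]
flagged = [student for student, _, notes in rows if SPECIAL.search(notes or "")]

print(active)
print(flagged)
```

The filter only narrows down who counts as enrolled. The rows the regex flags still have to be read by a person.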

Last year this was a much smaller problem, and someone came up with a bespoke solution using an Excel spreadsheet and overrode any anomalies with institutional knowledge.

Do you know what this Excel spreadsheet is? IT IS A TRAINING SET! Instead of worrying about which students are legitimately in the course and which weeks of homework scores are fair to count in the grades and all of that, I can just feed the enrollments table and the homework_trials table and whatever other tables I can find into a machine learning algorithm and have it generate grades with the same finesse as the internet recognizing pictures of cats!

But this is a terrible idea, so I won’t do it. (Unless I have some free time after I solve the problem for real because I am sort of curious about what the algorithm would tell me about my data.)
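For the curious, here is a toy sketch of what treating last year's spreadsheet as a training set might look like. Everything in it is an assumption: the two features, the six rows, the tier names other than “participation,” and the 30-week year are invented, and a one-nearest-neighbor lookup stands in for whatever fancier algorithm one might actually try. The only real number is the top score of 7 on our grading scale.

```python
import math

# Hypothetical rows from last year's spreadsheet:
# (weeks_enrolled, mean_homework_score) -> tier that a human assigned by hand.
last_year = [
    ((30, 6.5), "high"),
    ((28, 6.0), "high"),
    ((30, 4.0), "middle"),
    ((15, 4.5), "middle"),
    ((10, 1.0), "participation"),
    ((4, 3.0), "participation"),
]

WEEKS_IN_YEAR = 30.0  # assumed length of the academic year
MAX_SCORE = 7.0       # top of the 0-1-3-5-7 scale


def scale(features):
    """Put both features on a 0-to-1 scale so neither dominates the distance."""
    weeks, score = features
    return (weeks / WEEKS_IN_YEAR, score / MAX_SCORE)


def predict_tier(features, training=last_year):
    """1-nearest-neighbor: copy the tier of the most similar student from last year."""
    x = scale(features)
    nearest = min(training, key=lambda row: math.dist(x, scale(row[0])))
    return nearest[1]


print(predict_tier((29, 6.2)))
print(predict_tier((5, 2.0)))
```

A model like this will happily reproduce last year's judgment calls without knowing why they were made, which is a large part of why the idea is bad.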