I’ve recently been playing around a lot with some data analysis in R and in Jupyter notebooks, in preparation for two exciting summer ventures that’ll be taking up a lot of my time:
<run>:\the\world Girls’ Machine Learning Day Camp
and
These are both work projects, as I hope is clear from the links.
One dataset that’s made the rounds over time is the “chemical diabetes” data from the 1979 paper by Miller and Reaven. In R it is the chemdiab dataset, which ships with the locfit package, for instance, and is documented at https://www.rdocumentation.org/packages/locfit/versions/1.5-9.1/topics/chemdiab. Check out that documentation: here we’ve got a well-used dataset included in R that doesn’t have any useful documentation on the R side. You’d think the R documentation would give units and variable meanings, for instance, but you’d be wrong.
I was particularly interested in this dataset because it’s used in one of the initial proof-of-concept papers for topological data analysis (TDA). As I worked to reproduce that paper’s results in R using Paul Pearson’s very nice “TDAmapper” library, I realized that this is a pretty funky task. First of all, “chemical diabetes” is a really outdated concept. I asked my board-certified-in-internal-medicine-physician-husband about chemical diabetes, and he laughed at me and asked if I’d ask about “the grippe” next. He did kindly explain that “chemical diabetes” dates from a time when either you were really clearly diabetic, or maybe a blood lab test would indicate some progress toward diabetes. In any case, today we know about type 1 and type 2 diabetes. Second, the Miller-Reaven classification itself is pretty weird. If you read the paper, Miller and Reaven took some doctor-classified data (normal, overt diabetes, and chemical diabetes) and then used a new computer-aided classification scheme to reclassify the observations. It’s really cool in that it’s the beginning of machine learning! But these days, it seems a little weird to me to develop a computer-executed algorithm whose job is to match the output of another computer-executed algorithm. Testing whether my algorithm can reproduce their algorithm is just... weak. What machine learning really wants is to reproduce what humans can do, in the places where humans still surpass machines.
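To make that circularity concrete, here is a toy sketch in Python (the kind of thing that fits in a Jupyter cell). Nothing in it is the chemdiab data or the Miller-Reaven procedure: the measurements are synthetic, and `rule_label`, its cutoff of 140, and `fit_stump` are all invented for illustration. The point is only that when the labels come from a rule, a learner can match them perfectly by rediscovering the rule.

```python
import random


def rule_label(x, cutoff=140.0):
    """Hypothetical stand-in for an algorithm-assigned class.

    The cutoff is an arbitrary illustrative number, not a clinical threshold.
    """
    return "flagged" if x >= cutoff else "normal"


def fit_stump(xs, labels):
    """Fit a one-threshold classifier by trying every observed value as the cut."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        preds = ["flagged" if x >= t else "normal" for x in xs]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Synthetic one-dimensional "measurements"; the range is made up.
rng = random.Random(0)
xs = [rng.uniform(70.0, 300.0) for _ in range(200)]

# Labels come from the rule, not from any human judgment.
labels = [rule_label(x) for x in xs]

threshold, agreement = fit_stump(xs, labels)
print(f"learned threshold: {threshold:.1f}, agreement with rule: {agreement:.0%}")
```

The stump agrees with the rule on every point, but all that shows is that it found the cutoff again. It says nothing about whether the cutoff matches what a physician looking at the same patient would decide.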
I made some pretty pictures and used TDA in R to more-or-less reproduce the results of the analysis here. I’ve moved on since then to Congressional vote data and science data. But I always ask my students to think about the data they’re using, what it shows and what it tells, and whether it can even answer their question. Since I was using this dataset primarily to experiment with R data visualization in TDA, it was fine. But as a scientific dataset for classification problems in machine learning, it is not my favorite. The lack of documentation, the outdated nature of the classification, and the fact that the classification itself is generated by an algorithm rather than by human observation make it a bit problematic.