Miller-Reaven diabetes data set

I’ve recently been playing around a lot with some data analysis in R and in Jupyter notebooks, in preparation for two exciting summer ventures that’ll be taking up a lot of my time:

<run>:\the\world Girls’ Machine Learning Day Camp

and

The MCFAM Summer Seminar.

These are both work projects, as I hope is clear from the links.

One of the datasets that’s made the rounds over time is the “chemical diabetes” dataset from the 1979 paper by Miller and Reaven. It’s the chemdiab dataset, with documentation at https://www.rdocumentation.org/packages/locfit/versions/1.5-9.1/topics/chemdiab; it’s included in the locfit package, for instance. Check out that documentation: here we’ve got a well-used dataset included in R that doesn’t have any useful documentation on the R side. You’d think the R documentation would list the units and variable meanings, for instance, but you’d be wrong.

I was particularly interested in this dataset because it’s used in one of the initial proof-of-concept papers for topological data analysis (TDA). As I worked to reproduce their work in R using the very nice work of Paul Pearson on the R “TDAmapper” library, I realized that this is a pretty funky task.

First of all, “chemical diabetes” is a concept that’s really outdated. I asked my board-certified-in-internal-medicine physician husband about chemical diabetes, and he laughed at me and asked if I’d ask about “the grippe” next. He did kindly explain that “chemical diabetes” came from back when either you were really clearly diabetic, or a blood lab test might indicate some progress toward diabetes. In any case, today we know about type 1 and type 2 diabetes.

Second, the Miller-Reaven classification itself is pretty weird. If you read the paper, Miller and Reaven took some doctor-classified data (normal, overt diabetes, and chemical diabetes) and then used a new computer-aided classification scheme to reclassify the observations. It’s really cool in that it’s the beginning of machine learning! But these days, it seems a little weird to develop a computer-executed algorithm that classifies cases to match another computer-executed algorithm. Testing whether my algorithm can reproduce their algorithm is just… weak. Machine learning really wants to reproduce what humans can do, in areas where humans still surpass machines.

I made some pretty pictures and used TDA in R to more-or-less reproduce the results of the original analysis. I’ve moved on since then to Congressional vote data and science data. But I always ask my students to think about the data they’re using, what it shows and what it tells, and whether it can even answer their question. Since I was using this dataset primarily to experiment with R data visualization in TDA, it was fine. But as a scientific dataset for classification problems in machine learning, it is not my favorite. The lack of documentation, the outdated nature of the classification, and the fact that the classification itself was generated by an algorithm rather than human observation make it a bit problematic.

Butt-first baby, a personal story

This is a bit outside the norm, but as I get closer to the 1-year anniversary of my kid’s birth (also known as the first birthday!) I figure maybe it’s a good time to write down the slightly abridged version of her birth story.

Sure, every mom has a birth story for every kid. And who really wants to hear it? Well, other moms or moms-to-be; maybe other folks with an academic or personal interest; and in this case, people dealing with breech birth. My kid’s story is a bit odd and is a little window onto medicine in America today. But if you don’t want to read birth stories, here is your moment to exit! Leave now! It’s not graphic, really. I did say slightly abridged, after all!

Fun with multivariate optimization

Amusing to see my last post — yep, I’ve been working on writing ~every day again, but so much of it is on (gasp!) paper or on my class textbook Math for Finance that it doesn’t show up here. Maybe I could find some whiz-bang app that would update here whenever I publish a new version of the math text.

Last April-May I was quite pregnant, so while teaching and grading were all getting done (and quickly, so that I could have everything wrapped up when maternity leave started at the end of the semester!) there was not quite so much writing up of class notes. I started including a multivariate version of Newton’s method in class last semester, but didn’t rewrite the course notes to include it. This year I’m working on that. (The course I teach at the University of Minnesota covers calculus in one and many variables, probability, a lot of linear algebra, and a touch of differential equations.)
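For anyone curious what the multivariate version looks like in practice, here’s a minimal sketch (my own illustration, not the version from the course notes): at each step you solve the linear system J(x)·δ = −f(x) and update x ← x + δ, which is exactly where the course’s linear algebra pays off.

```python
import numpy as np

def newton_multivariate(f, jacobian, x0, tol=1e-10, max_iter=50):
    """Solve f(x) = 0 for a vector-valued f using Newton's method.

    Each iteration solves the linear system J(x) @ delta = -f(x)
    and updates x <- x + delta.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        delta = np.linalg.solve(jacobian(x), -fx)
        x = x + delta
    return x

# Toy example: intersect the circle x^2 + y^2 = 4 with the line y = x.
f = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]])
J = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]],
                        [-1.0, 1.0]])
root = newton_multivariate(f, J, x0=[1.0, 2.0])
# Converges to (sqrt(2), sqrt(2)).
```

The nice thing pedagogically is that the one-variable formula x ← x − f(x)/f′(x) turns into “solve a linear system with the Jacobian,” so students see Newton’s method and Gaussian elimination working together.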

This year I’m also trying to put together some quick Python notebooks illustrating concepts from class. I can’t post them as notebooks here; will try to get them on Github. Here they are!

Right: that’s one reason I’m writing less here. I blog from home; home has a kid; kid is crying. Adios!

Writing every day, day 3

Today I spent my writing time on my course notes, Mathematical Preparation for Finance: a Wild Ride through Mathematics. I had to update chapter 6 on continuous random variables, although the theorems I added also apply to discrete random variables! Oh well. Markov’s inequality and Chebyshev’s inequality got a bit of space. I also need to update the sections on transformations and convolutions, and add more finance-specific examples that I’ve gathered this year. I’m always learning more and always want to add more, but I should probably stop at some point.
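Chebyshev’s inequality is also the kind of thing that’s fun to sanity-check numerically in a notebook. Here’s a quick sketch (my own toy example, not from the course notes): for any distribution with mean μ and standard deviation σ, P(|X − μ| ≥ kσ) ≤ 1/k², and you can watch an exponential sample respect the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential with rate 1: mean mu = 1 and variance 1, so sigma = 1.
mu, sigma = 1.0, 1.0
samples = rng.exponential(scale=1.0, size=100_000)

for k in (2, 3, 4):
    # Empirical frequency of landing at least k standard deviations from the mean.
    empirical = np.mean(np.abs(samples - mu) >= k * sigma)
    bound = 1.0 / k**2  # Chebyshev's bound
    print(f"k={k}: empirical {empirical:.4f} <= Chebyshev bound {bound:.4f}")
```

The empirical tail probabilities come out far below 1/k² here, which is a nice talking point: Chebyshev is distribution-free, so for any particular distribution it tends to be quite loose.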

Went to the Mia (the Minneapolis Institute of Art). Interesting paintings in the special exhibit, though not as thrilling as I’d hoped. I liked the new installation in the contemporary wing, though: they got someone to paint all over the walls and then installed art throughout and around the painted walls. I can’t find a link to it, but I like the vibrancy of the walls instead of just having white.

A fun math/science link, in line with my recent interest in order that emerges from randomness/randomness that emerges from rules: Dice become ordered when stirred, not shaken. Basically, gently stir dice and they’ll end up nicely stacked.

Kid crying again 🙂 It’s an 11 pm thing.