Topological data analysis at Minnesota Developers Conference

I’m at the Minnesota Developers Conference today, talking about topology and data. I hope it’ll be fun to talk with developers about math and data! In the meantime, I’m looking into cryptography, microservices, and containerization. I’m working on containerizing some of my own projects… but you might hear about that if you come to my talk.

Ok, links:

  • Here’s an older GitHub repository of information about TDA in R. From there you can launch a Binder to try out the code yourself: just press the “binder” button! Binder uses Docker to give you an R environment.
  • Here is a link for exploring super-level sets in Dow Jones data during the period 1991–2002, leading up to and through the dotcom bust. You can change the threshold (the correlation level at which you cut off the graph edges) and you can play with the year. (A minimal sketch of this thresholding idea appears just after this list.)
  • Here is a link to an R Shiny page for exploring persistent homology in Dow Jones data during the same period, 1991–2002. Warning: it’s slow to load because of the persistent homology calculation at the beginning, so give it a minute. Watch the red dots slide down and to the left as the crash happens. Why are there two groups of red dots at one point in 2001?
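
To give a flavor of what the super-level set explorer is doing under the hood, here is a minimal Python sketch of the thresholding idea: compute pairwise correlations of returns and keep only the edges above the slider’s threshold. This is not the app’s actual code; the tickers, the synthetic prices, and the threshold value are all made up for illustration.

import numpy as np
import pandas as pd
import networkx as nx

# Hypothetical stand-in for daily closing prices of a few Dow components;
# the real app uses Dow Jones data from 1991-2002.
rng = np.random.default_rng(0)
tickers = ["AA", "GE", "IBM", "KO", "XOM"]
prices = pd.DataFrame(
    np.cumprod(1 + 0.01 * rng.standard_normal((500, len(tickers))), axis=0),
    columns=tickers,
)

returns = prices.pct_change().dropna()
corr = returns.corr()

# Keep only edges whose correlation clears the threshold (the app's slider).
threshold = 0.5
G = nx.Graph()
G.add_nodes_from(tickers)
for i, a in enumerate(tickers):
    for b in tickers[i + 1:]:
        if corr.loc[a, b] >= threshold:
            G.add_edge(a, b, weight=corr.loc[a, b])

print(G.number_of_edges(), "edges survive at threshold", threshold)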

If you want to learn more about using topology with big data, here are some links:

  • I’ve enjoyed creating visualizations with Kepler-mapper, a Python package; some of the images of macroeconomic data in my talk were created using Kepler-mapper. Check out the examples there; they’re beautiful. (See the sketch after this list for the basic workflow.)
  • To make prettier mapper visualizations in R, I make heavy use of networkD3 and igraph. The igraph package also has Python bindings.
  • In Python, I computed Betti numbers for the analysis of the stock crash-recovery cycle using moguTDA (also sketched after this list).
  • In R, I use TDAmapper for mapper visualization and TDA for persistent homology.
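
For the curious, here is roughly what the Kepler-mapper workflow looks like. This is a minimal sketch on synthetic data, not the code behind my macroeconomic images; the lens and cover parameters are just illustrative choices.

import numpy as np
import kmapper as km
from sklearn import datasets
from sklearn.decomposition import PCA

# Synthetic point cloud: two noisy concentric circles.
data, _ = datasets.make_circles(n_samples=1000, noise=0.03, factor=0.3)

mapper = km.KeplerMapper(verbose=0)

# Lens: project the point cloud to its first principal component.
lens = mapper.fit_transform(data, projection=PCA(n_components=1))

# Cover the lens with overlapping intervals and cluster each preimage
# to build the mapper graph.
graph = mapper.map(lens, data, cover=km.Cover(n_cubes=10, perc_overlap=0.3))

# Write an interactive HTML visualization of the graph.
mapper.visualize(graph, path_html="circles_mapper.html")

And moguTDA computes Betti numbers directly from a simplicial complex. Another minimal sketch, assuming the SimplicialComplex interface from the package’s documentation:

from mogutda import SimplicialComplex

# The boundary of a triangle is a combinatorial circle:
# one connected component (b0 = 1) and one loop (b1 = 1).
circle = SimplicialComplex(simplices=[(1, 2), (2, 3), (1, 3)])
print(circle.betti_number(0))  # 1
print(circle.betti_number(1))  # 1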

If you’re looking for other resources, my free Preparation for Math Finance course textbook is accessible, and it’s updated often since I’m teaching the course right now. What else? Flash sale on coloring books 🙂

Installing Dionysus for Python

Today I finally managed to install Dionysus 2, a library for computing persistent homology. I set up Kepler-mapper easily a long time ago, but had given up on Dionysus previously. Dionysus 2 is a rewrite that streamlines many things, promising easier installation and more plotting functionality. I again had trouble installing the package, but this time I didn’t give up and eventually met with success.

First, I’ll say that I’m using the Anaconda distribution for Python. I have Python 2.7 as the base installation but have almost entirely switched over to Python 3 for my other work. I merrily went on my way trying to install Dionysus 2 in my Anaconda Python 3.4 environment for a while. First I tried pip, following the commands under the “Get, Build, Install” section of the documentation. I ran into some permission problems and some problems with Boost. Given that, I tried cloning the git repository. Still problems with Boost, even though I had a fresh install of that as well. After the cmake .. command, I got problems like this:

-- Could NOT find Boost
-- pybind11 v2.2.0

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Boost_INCLUDE_DIR (ADVANCED)

and then a string of things. I modified my CMakeCache.txt file, modified the CMakeLists.txt file, and tried all sorts of things. Failures. Ok, incremental progress: I think I got it finding the Boost libraries, and managed to get my error messages down to

-- The C compiler identification is Clang 7.3.0
-- The CXX compiler identification is Clang 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Could NOT find Boost
-- Found PythonInterp: /Users/kaisa/anaconda/envs/py34/bin/python (found version "3.4.5")
-- Found PythonLibs: /Users/kaisa/anaconda/envs/py34/lib/libpython3.4m.dylib
-- Performing Test HAS_CPP14_FLAG
-- Performing Test HAS_CPP14_FLAG - Success
-- pybind11 v2.2.0
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- LTO enabled

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Boost_INCLUDE_DIR (ADVANCED)

Through Googling, I found a message thread that suggested using conda to install instead. I hadn’t realized it was available as a conda package! I tried

conda install -c conda-forge dionysus

instead, and there I learned many things! It listed many packages to download or upgrade, and then:

The following packages will be DOWNGRADED:

    pillow:     3.3.0-py34_0  --> 3.2.0-py27_0  conda-forge
    pyqt:       4.11.4-py34_4 --> 4.11.4-py27_2 conda-forge
    python:     3.4.5-0       --> 2.7.14-0      conda-forge
    python.app: 1.2-py34_4    --> 1.2-py27_0    conda-forge

Proceed ([y]/n)? n

What? It wants to downgrade my Python version? Is this the whole problem?!

I switched environments to my Python 2.7 install and repeated

conda install -c conda-forge dionysus

and it installed with no problem.
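
If you don’t already have a Python 2.7 environment to switch to, conda can make one. Something like this should do it (“py27” is just my name for the environment):

conda create -n py27 python=2.7
source activate py27
conda install -c conda-forge dionysus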

I had to remind myself that in my Python 2.7 install I use “ipython notebook” to make new notebooks, rather than “jupyter notebook”, but after that I was able to interact with Dionysus 2 and run all the examples from the tutorial.
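
As a sanity check along the lines of the tutorial examples, here is a minimal sketch: build a Vietoris–Rips filtration on a random point cloud and print its persistence diagrams. The point cloud is made up; the dionysus calls follow the Dionysus 2 tutorial.

import numpy as np
import dionysus as d

# Random 2-D point cloud as a stand-in for real data.
points = np.random.random((100, 2))

# Vietoris-Rips filtration up to the 2-skeleton, maximum radius 0.5.
f = d.fill_rips(points, 2, 0.5)

# Compute persistent homology and extract the persistence diagrams.
m = d.homology_persistence(f)
dgms = d.init_diagrams(m, f)

for i, dgm in enumerate(dgms):
    for pt in dgm:
        print("dim {}: birth {:.3f}, death {:.3f}".format(i, pt.birth, pt.death))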

So, that’s my story. Maybe there’s a Boost problem, a path problem, whatever, but I think the real problem is that Dionysus 2 is built for Python 2.7, not Python 3.*, and so I’ll just have to make full use of the Anaconda distribution’s easy environment switching. This Python version issue is not noted anywhere I saw online.

Miller-Reaven diabetes data set

I’ve recently been playing around a lot with some data analysis in R and in Jupyter notebooks, in preparation for two exciting summer ventures that’ll be taking up a lot of my time:

<run>:\the\world Girls’ Machine Learning Day Camp

and

The MCFAM Summer Seminar.

These are both work projects, as I hope is clear from the links.

One of the datasets that’s made the rounds over time is the “chemical diabetes” dataset from the 1979 paper by Miller and Reaven. It is the chemdiab dataset, with documentation at https://www.rdocumentation.org/packages/locfit/versions/1.5-9.1/topics/chemdiab. It’s in the locfit package, for instance. Check out that documentation: here we’ve got a well-used dataset included in R that doesn’t have any useful documentation on the R side. You’d think the R documentation would have units and variable meanings, for instance, but you’d be wrong.

I was particularly interested in this dataset because it’s used in one of the initial proof-of-concept papers for topological data analysis (TDA). As I worked to reproduce their work in R using Paul Pearson’s very nice work on the R “TDAmapper” library, I realized that this is a pretty funky task. First of all, “chemical diabetes” is a concept that’s really outdated. I asked my board-certified-in-internal-medicine physician husband about chemical diabetes, and he laughed at me and asked if I’d ask about “the grippe” next. He did kindly explain that “chemical diabetes” was from back when either you were really clearly diabetic, or maybe a blood lab test would indicate some progress toward diabetes. In any case, today we talk about type 1 and type 2 diabetes instead. Second, the Miller-Reaven classification itself is pretty weird. If you read the paper, Miller and Reaven took some doctor-classified data (normal, overt diabetes, and chemical diabetes) and then used a new computer-aided classification scheme to reclassify the observations. It’s really cool in that it’s an early piece of machine learning! But these days, it seems a little weird to develop a computer-executed algorithm to classify cases so that they match the output of another computer-executed algorithm. Testing whether my algorithm can reproduce their algorithm is just… weak. Machine learning really wants to reproduce what humans can do, in situations where humans still surpass machines.

I made some pretty pictures and used TDA in R to more-or-less reproduce the results of the analysis here. I’ve moved on since then to Congressional vote data and science data. But I always ask my students to think about the data they’re using: what it shows, what it tells, and whether it can even answer their question. Since I was using this dataset primarily to experiment with R data visualization in TDA, it was fine. But as a scientific dataset for classification problems in machine learning, it is not my favorite. The lack of documentation, the outdated nature of the classification, and the fact that the classification itself was generated by an algorithm rather than by human observation make it a bit problematic.