Data Management

This post is for my senior project and research students (undergrads and grads at the University of Minnesota). Reproducibility of research is very important to me. If we’re writing a paper or doing a project together, every member of the project should be able to verify the results and present the project to others. As you go on in your career and build skills and trust with collaborators, you may end up in collaborations where you each contribute skills that the others don’t have. My collaborations with people in public health come to mind: I don’t have their years of experience in their field, and they have a different set of math and stats skills than I do. But the point of the current collaborations with undergrads and master’s students is that we all can take ownership of the work!

Data management is a huge part of reproducibility, and it will serve you well later. Maybe five years from now you’ll start a project and you’ll think, huh, I think I could use some ideas from that paper I did during my master’s program… and then you won’t be able to find the files, or won’t be able to open them, or they won’t make any sense (what does KHUIG mean as a column header again?!). Plan ahead and avoid that!

What files are important?
• Code that you are using for the project
• The input files for your final analysis
• A data dictionary for the input files. What do all those mysterious column names mean? It may be obvious to you now, but in five years the meaning of GVKEY may have slipped your memory.
• Drafts of the paper, in some cases
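
One way to keep a data dictionary honest is to store it as a small CSV next to the input file and check that every column is described. Here is a minimal sketch of that idea in Python. The file contents, the column names, and the function name undocumented_columns are all made up for illustration; your own project will look different.

```python
import csv
import io


def undocumented_columns(data_csv_text, dictionary_csv_text):
    """Return the data columns that the data dictionary does not describe."""
    # The first row of the data file is assumed to be the header.
    header = next(csv.reader(io.StringIO(data_csv_text)))
    # The dictionary is assumed to have a "column" field naming each variable.
    rows = csv.DictReader(io.StringIO(dictionary_csv_text))
    documented = {row["column"] for row in rows}
    return [name for name in header if name not in documented]


# Hypothetical input file: three columns, one with a mysterious name.
data_csv = "id,visit_date,KHUIG\n1,2019-01-15,3.2\n"

# Hypothetical data dictionary: one row per column, with type and meaning.
dictionary_csv = (
    "column,type,description\n"
    "id,integer,Participant identifier\n"
    "visit_date,date,Date of visit (YYYY-MM-DD)\n"
)

print(undocumented_columns(data_csv, dictionary_csv))  # prints ['KHUIG']
```

In a real project you would read both files from disk instead of from strings, and you would fix the dictionary whenever the list is not empty.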

What format should files be in?
• Code files can be in various formats, but one should be able to open them with a text editor
• Written work should be in .tex or .txt, or maybe a Word format.

Where do files go?
• Less good but OK: a Google Drive folder. Why is this less good? It doesn’t have dynamic updating — you need to update your work manually.
• More good but not best: a Dropbox folder. There is versioning and automatic updating.
• Best: GitHub, either at github.umn.edu or github.com. Use github.com only for data that is not owned by the University; in case of a data breach, I don’t want to be responsible for theft of intellectual property. Use github.umn.edu for data that belongs to the University; by using the U’s service for the U’s data, we’re trusting that the U will take appropriate security measures.

Why else is GitHub best? You’re using version control, you have a history of your work, you can share with others easily, and it’s a transferable skill that is valuable to many employers. If you have a public GitHub repository, you can show your skills to others easily by including it on your resume and LinkedIn page, or in your profile if you give a talk someplace.

“My collaborator sent me this code and I can’t run anything. I don’t think they are incompetent — they said it worked on their machine. What’s the problem?”
• The problem is probably that the code arrived without all the information about packages and versions!
• For instance, Python 3 differs in some significant ways from Python 2.7. Functions like map and range changed behavior: in Python 2.7 they return lists, while in Python 3 map returns a lazy iterator and range returns a lazy range object.
• Many of the packages I/we are using in research are pretty new, too: the TDAstats package for R and the kepler-mapper package for Python both have had changes just during the time I’ve been working on projects with them.
• For true reproducibility, it would be best to actually package up a virtual environment in Docker (there are some other ways, too). This is called containerization — watch for it in job ads! If you want a career in a coding-intensive field, like some varieties of data science or machine learning, let me know, because it is a good way to do things but requires a bit of learning.
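
Short of a full container, a cheap first step is to save a record of the Python version and package versions alongside your code. The sketch below shows one possible way to do that with the standard library (Python 3.8 or later). The function name environment_snapshot is made up for this post, and the package names passed to it are only examples; I am assuming kepler-mapper is installed under the distribution name kmapper, so check that against your own installation.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(package_names):
    """Collect the Python version, platform, and installed package versions."""
    snapshot = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in package_names:
        try:
            snapshot["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of skipping them.
            snapshot["packages"][name] = None
    return snapshot


# Example package names (assumptions; list whatever your project imports).
snapshot = environment_snapshot(["numpy", "kmapper"])
print(json.dumps(snapshot, indent=2))
```

Saving that output to a file in your repository gives a collaborator something concrete to compare against when the code won’t run on their machine. It does not recreate the environment the way a container does; it only documents it.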

That’s it for now. We can talk in person about how you’ll implement this for your project.

Header photo by rawpixel on Unsplash