TRIPODS Data Science Boot Camp:
Machine learning research: Making the most of your data

Summer 2021, June 7-11

Matthew Stone, Ying Hung, Peng Zhang, and Hao Wang

Overview: Machine learning is a powerful tool for making predictions, one you can use to make sense of scientific data sets and to develop more flexible and efficient solutions to engineering problems. In many cases, the limiting factor in using machine learning is the difficulty of obtaining data points to train models. In real applications, data points may take a lot of time to collect (because of the physical or computational processes needed to produce them), cost a lot of money (because they require human effort to collect or annotate), or simply involve rare events that seldom occur even in large data sets. Understanding the problems and research in dealing with limited data is therefore key to using machine learning techniques effectively.

Schedule and Participation:

This tutorial will explain some theoretical, computational and practical issues involved in these problems, through the lens of a number of Rutgers faculty who do research in the area. Where possible, sessions will include an interactive component, and you will get to experiment with hands-on resources (e.g., Python notebooks) that illustrate the problems and solutions discussed. The sessions will be held over Zoom. Registration is free and open but required:

1. Monday, June 7. Matthew Stone (Rutgers, CS)
– Overview. Machine learning and limited data
– Building regression models and the fundamental bias/variance tradeoff.
Watch the Recording Password: L#7URu0$

2. Tuesday, June 8. Ying Hung (Rutgers, Statistics)
– Experiment design, Part 1.
– Understanding what data you need to improve a model and how to get it.
Watch the Recording Password: n9Gx^QWi

3. Wednesday, June 9. Matthew Stone (Rutgers, CS)
– Validating models.
– Visualizing models and understanding the effects of chance in performance.
Watch the Recording Password: 9MhVm+ps

4. Thursday, June 10. Peng Zhang (Yale, joins Rutgers CS in fall)
– Experiment design, Part 2.
– Designing data sets to mitigate the effects of chance and improve noisy estimates.
Watch the Recording Password: $40rkYss

5. Friday, June 11. Hao Wang (Rutgers, CS)
– Handling imbalanced classes.
– Dealing with rare events with deep learning.
Watch the Recording Password: 6#xHi09w

Our goal is that these examples will give you some ideas, programming models and design patterns to support your own data-driven research this summer.

Because everyone is participating remotely, it makes sense to work in the cloud. We’ll be using Google Colab to explore our data interactively. Things will go more smoothly on Monday if you take a moment to familiarize yourself with roughly how Google Colab works:

Google Colab is nicely integrated with Google Drive, which is an easy way to share notebooks, data sets and other resources for the class. Materials will be available in advance of each lecture. Here is the link that you can use to obtain materials:

When the materials are available, you’ll want to copy this directory into your own Google Drive, and you’ll then be able to access all the files from within Colab.
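As a rough sketch of what accessing the copied files from a notebook might look like: the snippet below mounts your Drive inside a Colab session and lists the contents of the copied directory. The folder name `bootcamp-materials` and the helper functions are hypothetical placeholders, not names used by the course; substitute whatever your copy of the directory is called. It also assumes Colab's usual mount point, `/content/drive`, with your files under `MyDrive`.

```python
from pathlib import Path

# Hypothetical folder name: replace with the name of the directory you copied.
COURSE_DIR = "bootcamp-materials"


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab.

    Returns True if the mount was attempted, False if the google.colab
    package is unavailable (for example, when running on a local machine).
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)  # prompts you to authorize access
    return True


def course_path(mount_point="/content/drive", folder=COURSE_DIR):
    """Build the path where a folder copied into your Drive should appear."""
    return Path(mount_point) / "MyDrive" / folder


def list_materials(path):
    """Return the sorted file names in `path`, or an empty list if it is missing."""
    path = Path(path)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir())
```

In a Colab cell, you would call `mount_drive()` once, then `list_materials(course_path())` to check that the notebooks and data sets are visible before a session starts.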

For more tips on data-driven research, check out last summer’s boot camp, which focused on visualizations and text data: