Pacing of "Supervised methods - Regression" episode #45
Thanks for the feedback @smangham. I've mulled it over for a while and the TL;DR of my thoughts is: yep, I think you're right.
I went ahead and did a very quick version of this for the workshop we're delivering this week (currently deployed here; hopefully I'll improve it), and I wouldn't say anything was lost except for a nicer polynomial-shaped dataset (Anscombe II)! It's a bit better, but I'm not quite happy with it yet. I think the "goal" of the regression lesson isn't clear, and it's probably trying to do too much heavy lifting at once. Off the top of my head, it's trying to do several things at once.
It could probably do with a step back and a rewrite from a clearer "data science / hands-on" lens, and it will likely bleed into the classification episode too.
Hi @mike-ivs and @smangham. I tested @mike-ivs's version at a workshop taught a couple of weeks ago, and having taught the "long" version a couple of times, I really appreciated Mike's condensed version. I polished a couple of things in Mike's shortened regression episode (e.g., use the x_train and y_train convention from the start, discuss the importance of EDA, point out that the "overfitting" example is really a missing-predictor problem) and in other episodes. Currently deployed here. Once the shorter regression episode is merged in the incubator repo, I can add a pull request there. Or should I just do a pull request on Mike's repo for now?

One thing I'm struggling with is the dataset of choice. It's difficult to demonstrate true overfitting with this data given the very few predictors (and lack of noise). In the regression episode, the "overfitting" we witness is more an issue of predictor choice than of simply memorizing the data. It's also a bit odd to use polynomial regression to capture a bimodal data distribution; I wish we had a better example there. If we had a higher-dimensional dataset to start with, we could explore common problems in ML throughout the lesson more realistically. Alternatively, we could consider adding some artificial noise to the penguins dataset. Since the data is so clean/low-dimensional, we don't see the general performance trends across models that we typically see with most datasets (e.g., decision trees should start to overfit as you add depth, and SVMs should do better than decision trees in most cases). I actually ended up adding some noise as an exercise just so we could witness overfitting in decision trees.
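(For concreteness, here is a minimal sketch of what such a noise exercise could look like, assuming the seaborn copy of the Palmer penguins data. The predictor columns, noise scale, and depth values are illustrative assumptions, not the exercise actually used in the deployed lesson.)

```python
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the penguins data and set up a simple numeric regression task:
# predict body mass from bill and flipper measurements.
penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]].to_numpy()
y = penguins["body_mass_g"].to_numpy()

# Add Gaussian noise to the predictors so the trees have something
# spurious to memorise (the 0.5-sigma scale is an arbitrary choice).
rng = np.random.default_rng(42)
X_noisy = X + rng.normal(scale=0.5 * X.std(axis=0), size=X.shape)

x_train, x_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.25, random_state=0
)

# As depth grows, training R^2 climbs towards 1 while test R^2 stalls
# or drops -- the classic overfitting signature.
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    print(
        f"depth={depth:2d}  "
        f"train R^2={tree.score(x_train, y_train):.3f}  "
        f"test R^2={tree.score(x_test, y_test):.3f}"
    )
```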
Thanks for the prompting @qualiaMachine, and glad to hear it's coming in handy! I've raised it as PR #47 as it's unlikely I'll improve it further for a while. I agree the penguin dataset isn't a great choice for regression beyond a simple linear fit, although it does save the time/cognitive load that would otherwise be spent understanding an extra dataset. The overfitting in the regression case isn't so much a predictor-choice issue as an unrepresentative training sample which doesn't capture the bimodal nature of the features of interest: the model is overfit to a subset of the full data rather than to the noise in that subset (a sketch of this failure mode follows below). It helps emphasise the importance of your training data in a pretty blunt way! Exploratory Data Analysis (EDA) came up as one of the "missing things" from the lesson when we discussed our teaching in-house the other day, so I'm very happy to see some focus on it :) Likewise for the data standardising/pre-processing in your classification episode! The dataset choice is a tricky one as it's a balance between:
I like the idea of modifying the penguin dataset in-code to add complexity and emphasise ML gotchas, as it's a nice compromise for the last bullet point above. Adding noise to the penguin data to emphasise the noise-overfitting shortcomings of decision trees is a pretty nice example of this (I'm obliged to point out that the penguin data is still real data and has noise in it: https://allisonhorst.github.io/palmerpenguins/).
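(A minimal sketch of the subset-overfitting failure described above: train a polynomial fit on only one mode of the bimodal flipper-length/body-mass relationship and watch it generalise badly to the full data. The 200 mm cut-off and the polynomial degree are my own illustrative assumptions.)

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

penguins = sns.load_dataset("penguins").dropna()
x = penguins[["flipper_length_mm"]].to_numpy()
y = penguins["body_mass_g"].to_numpy()

# An unrepresentative training sample: keep only the lower mode of the
# bimodal flipper-length distribution (the 200 mm cut is illustrative).
train_mask = x[:, 0] < 200

# Scale before the polynomial expansion to keep the fit well-conditioned.
model = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=5), LinearRegression()
)
model.fit(x[train_mask], y[train_mask])

# The fit looks fine on the sampled mode but typically falls apart on the
# full data: overfit to the subset, not to noise within it.
print("train R^2:    ", model.score(x[train_mask], y[train_mask]))
print("full-data R^2:", model.score(x, y))
```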
We just ran the workshop for the University of Southampton Astronomy & Astrophysics group, and I've just got a few comments.
The regression section is a bit of a hit to the pacing of the workshop. Realistically, it covers material most people will already be very familiar with, yet it takes almost an hour and involves writing a lot of code. Most of that code is boilerplate: it's definitely useful for the exercises at the end, but its necessity isn't clear during the taught section. While emphasising that understanding the stats of your dataset, and using functions as building blocks, is important, I think it loses the audience a bit, especially a more statistically/computationally literate one. If the workshop requires a baseline of statistical literacy, that would be better served as a separate, explicit prerequisite workshop.
Plus, the idea of setting up a framework to show how `sklearn` lets you easily train and compare different models is good, but it doesn't quite do that, as it creates a lot of bespoke functions for each model.

I think the episode could start at the "Realistic scenario" section without losing much. You could then illustrate how the same basic structure can fit a few different model types, or alternatively focus more on errors (e.g. for a fit, which points are 1-2-3σ off, or what the 1σ range of fits is) to add depth. That would also mean the first two episodes use the same dataset, instead of switching.
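(As a sketch of that suggestion: because every sklearn estimator exposes the same fit/predict/score interface, a single loop can train and compare several model types with no bespoke per-model functions. The models and feature columns here are illustrative choices.)

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

penguins = sns.load_dataset("penguins").dropna()
x = penguins[["flipper_length_mm", "bill_length_mm"]].to_numpy()
y = penguins["body_mass_g"].to_numpy()
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# One structure for every model: the shared estimator interface means
# the fit/score loop is identical regardless of model type.
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=3, random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR()),  # SVR needs scaled inputs
}
for name, model in models.items():
    model.fit(x_train, y_train)
    print(f"{name}: test R^2 = {model.score(x_test, y_test):.3f}")
```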
As a side note, I don't think all the figures need to include both the fit (in green) and the fit's predictions at the X-values of the data (as red crosses). Everyone should be familiar enough with the concept of a fit that simply plotting the fit line in red would be sufficient.