Pacing of "Supervised methods - Regression" episode #45
Thanks for the feedback @smangham. I've mulled it over for a while and the TL;DR of my thoughts is: yep, I think you're right.
I went ahead and did a very quick version of this for the workshop we're delivering this week (currently deployed here; hopefully I'll improve it), and I wouldn't say anything was lost except for a nicer polynomial-shaped dataset (Anscombe II)! It's a bit better, but I'm not quite happy with it yet. I think the "goal" of the regression lesson isn't clear, and it's probably trying to do too much heavy lifting at once. Off the top of my head, it's trying to do several things at once.
It could probably do with a step back and a rewrite from a clearer "data science / hands-on" lens, and it will likely bleed into the classification episode too.
Hi @mike-ivs and @smangham. I tested @mike-ivs's version at a workshop taught a couple of weeks ago, and having taught the "long" version a couple of times, I really appreciated Mike's condensed version. I polished a couple of things in Mike's shortened regression episode (e.g., use the x_train and y_train convention from the start, discuss the importance of EDA, point out that the "overfitting" example is really a missing-predictor problem) and in other episodes. Currently deployed here. Once the shorter regression episode is merged in the incubator repo, I can add a pull request there. Or should I just do a pull request on Mike's repo for now?

One thing I'm struggling with is the dataset of choice. It's difficult to demonstrate true overfitting with this data given the very few predictors (and lack of noise). In the regression episode, the "overfitting" we witness is more an issue of predictor choice than of simply memorizing the data. It's also a bit odd to use polynomial regression to capture a bimodal data distribution; I wish we had a better example there. If we had a higher-dimensional dataset to start with, we could explore common problems in ML throughout the lesson more realistically. Alternatively, we could consider adding some artificial noise to the penguins dataset. Since the data is so clean/low-dimensional, we don't see the general performance trends across models that we typically see with most datasets (e.g., decision trees should start to overfit as you add depth, and SVMs should do better than decision trees in most cases). I actually ended up adding some noise as an exercise just so we could witness overfitting in decision trees.
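(For concreteness, here is a minimal sketch of what such a noise exercise could look like, assuming the seaborn copy of the Palmer penguins data. The predictor columns, noise scale, and depth values are illustrative assumptions, not the exercise actually used in the deployed lesson.)

```python
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the penguins data and set up a simple numeric regression task:
# predict body mass from bill and flipper measurements.
penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]].to_numpy()
y = penguins["body_mass_g"].to_numpy()

# Add Gaussian noise to the predictors so the trees have something
# spurious to memorise (the 0.5-sigma scale is an arbitrary choice).
rng = np.random.default_rng(42)
X_noisy = X + rng.normal(scale=0.5 * X.std(axis=0), size=X.shape)

x_train, x_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.25, random_state=0
)

# As depth grows, training R^2 climbs towards 1 while test R^2 stalls
# or drops -- the classic overfitting signature.
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    print(
        f"depth={depth:2d}  "
        f"train R^2={tree.score(x_train, y_train):.3f}  "
        f"test R^2={tree.score(x_test, y_test):.3f}"
    )
```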
Thanks for the prompting @qualiaMachine, and glad to hear it's coming in handy! I've raised it as PR #47 as it's unlikely I'll improve it further for a while. I agree the penguin dataset isn't a great choice for regression beyond a simple linear fit, although it does save the time/cognitive load that would otherwise be spent understanding an extra dataset. The overfitting in the regression case isn't so much a predictor-choice issue as an unrepresentative training sample which doesn't capture the bimodal nature of the features of interest: the model is overfit to a subset of the full data rather than to the noise in that subset (a sketch of this failure mode follows below). It helps emphasise the importance of your training data in a pretty blunt way! Exploratory Data Analysis (EDA) came up as one of the "missing things" from the lesson when we discussed our teaching in-house the other day, so I'm very happy to see some focus on it :) Likewise for the data standardising/pre-processing in your classification episode! The dataset choice is a tricky one as it's a balance between:
I like the idea of modifying the penguin dataset in-code to add complexity and emphasise ML gotchas, as it's a nice compromise for the last bullet point above. Adding noise to the penguin data to emphasise the noise-overfitting shortcomings of decision trees is a pretty nice example of this (I'm obliged to point out that the penguin data is still real data and has noise in it: https://allisonhorst.github.io/palmerpenguins/).
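(A minimal sketch of the subset-overfitting failure described above: train a polynomial fit on only one mode of the bimodal flipper-length/body-mass relationship and watch it generalise badly to the full data. The 200 mm cut-off and the polynomial degree are my own illustrative assumptions.)

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

penguins = sns.load_dataset("penguins").dropna()
x = penguins[["flipper_length_mm"]].to_numpy()
y = penguins["body_mass_g"].to_numpy()

# An unrepresentative training sample: keep only the lower mode of the
# bimodal flipper-length distribution (the 200 mm cut is illustrative).
train_mask = x[:, 0] < 200

# Scale before the polynomial expansion to keep the fit well-conditioned.
model = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=5), LinearRegression()
)
model.fit(x[train_mask], y[train_mask])

# The fit looks fine on the sampled mode but typically falls apart on the
# full data: overfit to the subset, not to noise within it.
print("train R^2:    ", model.score(x[train_mask], y[train_mask]))
print("full-data R^2:", model.score(x, y))
```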
We just ran the workshop for the University of Southampton Astronomy & Astrophysics group, and I've just got a few comments.
The regression section is a bit of a hit to the pacing of the workshop. Realistically, it covers material most people will already be very familiar with, yet it takes almost an hour and involves writing a lot of code. Most of that code is boilerplate: it's definitely useful for the exercises at the end, but its necessity isn't clear during the taught section. While emphasising that understanding the stats of your dataset, and using functions as building blocks, is important, I think it loses the audience a bit, especially a more statistically/computationally literate one. If the workshop requires a baseline of statistical literacy, that would be better served as a separate, explicit prerequisite workshop.
Plus, the idea of setting up a framework to show how `sklearn` lets you easily train and compare different models is good, but it doesn't quite do that, as it creates a lot of bespoke functions for each model.

I think the episode could start at the "Realistic scenario" section without losing much. You could then illustrate how the same basic structure can fit a few different model types, or alternatively focus more on errors (e.g. for a fit, which points are 1-2-3σ off, or what the 1σ range of fits is) to add depth. That would also mean the first two episodes use the same dataset, instead of switching.
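(As a sketch of that suggestion: because every sklearn estimator exposes the same fit/predict/score interface, a single loop can train and compare several model types with no bespoke per-model functions. The models and feature columns here are illustrative choices.)

```python
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

penguins = sns.load_dataset("penguins").dropna()
x = penguins[["flipper_length_mm", "bill_length_mm"]].to_numpy()
y = penguins["body_mass_g"].to_numpy()
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# One structure for every model: the shared estimator interface means
# the fit/score loop is identical regardless of model type.
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=3, random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR()),  # SVR needs scaled inputs
}
for name, model in models.items():
    model.fit(x_train, y_train)
    print(f"{name}: test R^2 = {model.score(x_test, y_test):.3f}")
```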
As a side note, I don't think all the figures need to include both the fit (in green) and the fit's predictions at the X-values of the data (as red crosses). Everyone should be familiar enough with the concept of a fit that simply plotting the fit line in red would be sufficient.