The data can be found on Kaggle (https://www.kaggle.com/jsphyg/weather-dataset-rattle-package). It is weather data from Australia, and I am using it to predict whether it will rain tomorrow. I used Humidity, Pressure, and WindSpeed as the predictors.
Some cleanup was needed before modeling, mainly to change how the data was represented. The chance of rain was given in percent, but even a small chance of rain is enough to decide whether it will rain or not, so I recoded the RainToday and RainTomorrow columns to 1 if there is rain and 0 if there is no rain. To do that I used the tidyverse package and, after converting, reclassified the columns as numeric so that I could run the models on them. A lot of the rows had N/A values, so I excluded any row with missing data because I couldn't evaluate on it. I also removed most of the columns that I thought were irrelevant, such as Date, Evaporation, Sunshine, Clouds, and Location. It could rain anywhere, and I'm not trying to predict rain for a certain location in Australia. It can rain with or without sun and with or without clouds, and most of the Evaporation and Risk MM data were N/A, so I removed those columns entirely.
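The cleanup steps above can be sketched in a few lines. The original work was done in R with tidyverse; this is only a plain-Python illustration of the same idea, and the column names, the Yes/No label encoding, the `NA` marker, and the sample rows are all assumptions rather than values taken from the actual file.

```python
# Columns kept for modeling. The exact header names are assumptions.
FEATURES = ["Humidity", "Pressure", "WindSpeed"]
LABELS = ["RainToday", "RainTomorrow"]
MISSING = {"NA", "N/A", ""}   # assumed markers for missing values
YES_NO = {"Yes": 1, "No": 0}  # assumed label encoding in the raw file

# Hypothetical raw rows standing in for lines of the CSV.
RAW_ROWS = [
    {"Date": "2010-01-01", "Location": "Sydney", "Humidity": "82",
     "Pressure": "1004.2", "WindSpeed": "28",
     "RainToday": "Yes", "RainTomorrow": "Yes"},
    {"Date": "2010-01-02", "Location": "Sydney", "Humidity": "NA",
     "Pressure": "1010.5", "WindSpeed": "13",
     "RainToday": "No", "RainTomorrow": "No"},
    {"Date": "2010-01-03", "Location": "Perth", "Humidity": "35",
     "Pressure": "1019.8", "WindSpeed": "9",
     "RainToday": "No", "RainTomorrow": "No"},
]


def clean(rows):
    """Keep only the modeling columns, drop incomplete rows, encode labels."""
    cleaned = []
    for row in rows:
        kept = {name: row.get(name, "NA") for name in FEATURES + LABELS}
        # Exclude any row that is missing one of the values we need.
        if any(value in MISSING for value in kept.values()):
            continue
        record = {name: float(kept[name]) for name in FEATURES}
        # Recode rain / no rain as 1 / 0 so the target is numeric.
        for name in LABELS:
            record[name] = YES_NO[kept[name]]
        cleaned.append(record)
    return cleaned


CLEANED = clean(RAW_ROWS)
```

Dropping incomplete rows is the simplest way to handle missing values, but it shrinks the dataset, so it is worth checking how many rows remain afterwards.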
I chose to run a Linear Regression because I am trying to find the model that best captures the correlation between rain and the Wind, Humidity, and Pressure predictors. Linear regression works well for finding a linear relationship between variables, and with these predictors the model reached a correlation of 48%. That is, the model found a 48% correlation between wind, humidity, and pressure and rain happening tomorrow. There was also an MSE of .15. Because the target is either 1 or 0, this is a small error in absolute terms, although it is best judged against the MSE of always predicting the average rain rate.
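As a rough illustration of where a correlation and an MSE like these come from, here is a minimal Python sketch. It is simplified to a single predictor (the write-up uses three), the humidity and rain values are made up, and the function names are hypothetical. It also computes the MSE of a constant baseline that always predicts the average rain rate, which gives the model's MSE something to be compared against.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    return mean_y - slope * mean_x, slope


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up example data: humidity readings and whether it rained (1) or not (0).
humidity = [30, 45, 80, 50, 90]
rain = [0, 0, 1, 0, 1]

intercept, slope = fit_line(humidity, rain)
predictions = [intercept + slope * h for h in humidity]

correlation = pearson(predictions, rain)
model_mse = mse(rain, predictions)

# Baseline: always predict the average rain rate.
rain_rate = sum(rain) / len(rain)
baseline_mse = mse(rain, [rain_rate] * len(rain))
```

For a 0/1 target with rain rate p, the baseline MSE works out to p(1 - p), so a model is only informative to the extent that its MSE falls below that value.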
I chose to run a KNN Regression because the data used to predict rain is entirely numerical. This makes it easy for KNN regression to predict a numerical target from similarity (distance) measurements and so estimate whether it will rain tomorrow. The model reached a correlation of 52%. That is, it found a 52% correlation between wind, humidity, and pressure and rain happening tomorrow. There was also an MSE of .12 on the same 0-or-1 target, slightly lower than the linear regression's .15.
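The idea behind KNN regression can be sketched as follows: find the k most similar past observations and average their targets. This is a plain-Python illustration, not the script used for the results above; the function names, the sample observations, and the choice of k are all hypothetical. The features are standardized first because pressure (around 1000) would otherwise dominate the distance over humidity and wind.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    scaled = [scale_row(row, means, stds) for row in rows]
    return scaled, means, stds


def scale_row(row, means, stds):
    """Apply previously computed means and standard deviations to one row."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training rows closest to the query."""
    by_distance = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up observations: (humidity, pressure, wind speed) and rain tomorrow (1/0).
observations = [
    [85, 1005, 30], [90, 1003, 35], [80, 1008, 20],
    [40, 1020, 10], [35, 1022, 15], [45, 1018, 5],
]
rain_tomorrow = [1, 1, 1, 0, 0, 0]

scaled, means, stds = standardize(observations)

humid_day = scale_row([88, 1004, 32], means, stds)
dry_day = scale_row([42, 1019, 9], means, stds)

humid_prediction = knn_predict(scaled, rain_tomorrow, humid_day, k=3)
dry_prediction = knn_predict(scaled, rain_tomorrow, dry_day, k=3)
```

Because the prediction is an average of 0s and 1s, it always lands between 0 and 1 and can be read as an estimated chance of rain.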
I chose to run a Decision Tree Regression because the tree breaks the data down into smaller subsets to evaluate which nodes or leaves give the best correlation. Because all the values are numerical, a regression tree is also easy to apply here. With this model, I reached a correlation of 54%. That is, the model found a 54% correlation between wind, humidity, and pressure and rain happening tomorrow. There was also an RMSE of .61, meaning predictions were off by about .61 on average on a target that is either 0 or 1. Note that RMSE is the square root of MSE, so this figure is not on the same scale as the MSE values reported for the other two models: an RMSE of .61 corresponds to an MSE of about .37.
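To show how a regression tree breaks the data into subsets, here is a minimal Python sketch of a single split: it tries every threshold on one predictor and keeps the one with the lowest residual sum of squares (RSS), predicting the average target on each side. A full tree repeats this step recursively on each subset and across all predictors. The data and function names are made up for illustration.

```python
def rss(values):
    """Residual sum of squares around the mean of the values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(feature, target):
    """Return (threshold, total RSS, left mean, right mean) for the best split."""
    pairs = sorted(zip(feature, target))
    best = None
    for i in range(1, len(pairs)):
        # No threshold can separate two identical feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total,
                    sum(left) / len(left), sum(right) / len(right))
    return best


# Made-up data: humidity readings and whether it rained the next day (1/0).
humidity = [30, 40, 50, 60, 70, 80, 90]
rain_tomorrow = [0, 0, 1, 0, 1, 1, 1]

threshold, total_rss, dry_side_mean, humid_side_mean = best_split(
    humidity, rain_tomorrow
)
```

On this toy data the split separates the lower-humidity days, where rain is rare, from the higher-humidity days, and each side's average serves as the predicted chance of rain for that leaf.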
Ranking the algorithms from best to worst:
- Decision Tree Regression
- KNN Regression
- Multiple Linear Regression
Linear Regression was ranked last because it had the lowest correlation. I think this is attributable to the fact that the predictors can vary widely. For example, wind on its own is a weaker sign of rain than humidity or pressure: wind can bring rain, but it does not guarantee it. Given this source of error, many observations fell away from the fitted line, which made the correlation lower than for the other models.

The next best model was KNN Regression. I think this is because the observations are generally bunched together, so averaging the target values of the nearest neighbors works well. Still, the correlation was only about 52%, and the reason it wasn't higher is that rain depends on a combination of factors. High humidity does not always guarantee rain, and neither does pressure. Yet KNN still performed better than the linear regression because it predicts from the average of nearby observations rather than from a single straight line.

The best model was the Decision Tree Regression. I think this is because there were several variables the tree could split on to partition the data. The RSS was minimized at each split: all the predictors are related to rain, and by splitting the data the tree could use a separate average for each combination of predictor values. As a result we got the best model, with a correlation of 54%.

The reason the correlation was low for all of them is that the weather cannot be predicted with certainty. Even in modern forecasts the chance of rain or snow is still uncertain and no weather channel always gets it right, but what weather channels can do is predict the chance of rain, which is essentially what the models I created are doing given the observations.
All the models were able to learn from the data, which is useful to know: if you want to predict the weather in a given situation, you can determine which model works best for the predictors you want to use. The Decision Tree Regression would be the best choice if you wanted the data to tell you the chance of rain or snow for each weather pattern.