The goal of RefineMod is to provide functions to refine and optimize linear regression models. A model can be reduced to only those predictors that are statistically significant in explaining the response variable. Linear regression models can also be compared on performance metrics such as RMSE, R² and MAE. Package website: RefineMod

Package Development

This package was developed as an assignment for STAT545B at UBC using the devtools and usethis packages.

The code and workflow used to create this package are as follows:

usethis::create_package(path = "RefineMod") #To initiate the package project locally

usethis::use_git() #To create a git repository of the package locally

The local repository is then linked to an empty GitHub repository created with the same name, RefineMod:

git remote add origin
git branch -M main
git push -u origin main
usethis::use_r("function name") #To create script files of the functions in this package

#All the functions were documented using the roxygen skeleton

devtools::document() #To record the documentation files and update NAMESPACE

usethis::use_readme_rmd() #README file initiation

usethis::use_mit_license() #LICENSE file

usethis::use_code_of_conduct() #CODE OF CONDUCT file

usethis::use_testthat() #To set up the testthat infrastructure for the package
usethis::use_test("function name") #Initialize function-specific test scripts

usethis::use_package("package name") #To include package dependencies in the DESCRIPTION file

usethis::use_vignette("vignette name") #Initialize vignette RMD file

usethis::use_news_md() #Adding Changelogs


devtools::test() #Run all testthat files
devtools::check() #Head to Toe evaluation of the package
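As an illustration of the documentation and testing steps above, here is a minimal sketch of what a roxygen-documented function and its test script might look like. The function `rmse_vec()` and both file names are hypothetical and are not part of RefineMod; the tests are guarded so the snippet still runs when testthat is not installed.

```r
# Hypothetical file R/rmse_vec.R, as created by usethis::use_r("rmse_vec").
# The #' lines are roxygen2 tags that devtools::document() turns into
# man/rmse_vec.Rd and an export entry in NAMESPACE.

#' Root mean squared error
#'
#' @param obs Numeric vector of observed values.
#' @param pred Numeric vector of predicted values, same length as `obs`.
#' @return A single numeric value.
#' @examples
#' rmse_vec(c(1, 2, 3), c(1, 2, 5))
#' @export
rmse_vec <- function(obs, pred) {
  # Fail early on inputs the function cannot handle
  stopifnot(is.numeric(obs), is.numeric(pred), length(obs) == length(pred))
  sqrt(mean((obs - pred)^2))
}

# Hypothetical file tests/testthat/test-rmse_vec.R, as created by
# usethis::use_test("rmse_vec"). devtools::test() runs every such file.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("rmse_vec returns zero for perfect predictions", {
    testthat::expect_equal(rmse_vec(c(1, 2, 3), c(1, 2, 3)), 0)
  })
  testthat::test_that("rmse_vec rejects vectors of unequal length", {
    testthat::expect_error(rmse_vec(c(1, 2, 3), c(1, 2)))
  })
}
```

In an actual package the two parts live in separate files, and the `requireNamespace()` guard is unnecessary because `usethis::use_testthat()` adds testthat to the DESCRIPTION file.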


The development version of RefineMod can be installed from GitHub with:



Refining a linear regression model with only its significant predictors

This example uses the cancer_sample data set from the datateachr package.

radius_mean is the response variable, and all other columns (except diagnosis) are the input predictors.


mod <- lm(radius_mean ~ ., cancer_sample[,-2]) #lm() call without any predictor selection
summary(mod)

#> Call:
#> lm(formula = radius_mean ~ ., data = cancer_sample[, -2])
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.34069 -0.02957  0.00027  0.02509  0.23880 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)              4.076e-01  1.049e-01   3.886 0.000115 ***
#> ID                      -4.588e-12  2.064e-11  -0.222 0.824201    
#> texture_mean             4.605e-04  1.975e-03   0.233 0.815756    
#> perimeter_mean           1.339e-01  2.359e-03  56.748  < 2e-16 ***
#> area_mean                9.010e-04  1.246e-04   7.230 1.67e-12 ***
#> smoothness_mean          1.835e+00  4.950e-01   3.706 0.000232 ***
#> compactness_mean        -3.810e+00  2.879e-01 -13.233  < 2e-16 ***
#> concavity_mean          -1.567e+00  2.510e-01  -6.244 8.64e-10 ***
#> concave_points_mean      1.836e-01  4.918e-01   0.373 0.708981    
#> symmetry_mean            2.726e-01  1.842e-01   1.480 0.139370    
#> fractal_dimension_mean   2.274e+00  1.381e+00   1.646 0.100249    
#> radius_se                1.228e-01  7.706e-02   1.594 0.111590    
#> texture_se               1.449e-02  9.148e-03   1.584 0.113744    
#> perimeter_se            -4.591e-02  1.002e-02  -4.580 5.78e-06 ***
#> area_se                  3.497e-04  3.492e-04   1.001 0.317079    
#> smoothness_se            1.970e+00  1.649e+00   1.195 0.232579    
#> compactness_se          -5.029e-01  5.388e-01  -0.933 0.351066    
#> concavity_se             1.283e+00  3.185e-01   4.029 6.42e-05 ***
#> concave_points_se        3.246e+00  1.352e+00   2.401 0.016669 *  
#> symmetry_se              4.777e-02  6.780e-01   0.070 0.943859    
#> fractal_dimension_se    -2.915e+00  2.900e+00  -1.005 0.315145    
#> radius_worst             1.598e-01  1.265e-02  12.633  < 2e-16 ***
#> texture_worst           -1.627e-03  1.725e-03  -0.943 0.346078    
#> perimeter_worst         -9.250e-03  1.420e-03  -6.515 1.68e-10 ***
#> area_worst              -5.744e-04  7.548e-05  -7.610 1.24e-13 ***
#> smoothness_worst        -1.108e+00  3.533e-01  -3.137 0.001799 ** 
#> compactness_worst        3.505e-01  9.401e-02   3.729 0.000213 ***
#> concavity_worst          1.275e-02  6.674e-02   0.191 0.848587    
#> concave_points_worst     1.892e-02  2.272e-01   0.083 0.933684    
#> symmetry_worst          -8.857e-02  1.228e-01  -0.721 0.471035    
#> fractal_dimension_worst -4.446e-01  5.919e-01  -0.751 0.452838    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Residual standard error: 0.05869 on 538 degrees of freedom
#> Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997 
#> F-statistic: 6.824e+04 on 30 and 538 DF,  p-value: < 2.2e-16

The model above uses every input predictor as an independent variable. Many of these predictors are not statistically significant (judging by their p-values) and are therefore candidates for removal from the model.

sig_mod <- lm_significant(cancer_sample[,-2], res = "radius_mean") #model with optimized predictors
#> Response Variable: radius_mean 
#> Input Predictors: ID texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst 
#> Fitting a linear model 
#> Optimization of Predictors
#> ....
#> Final Optimization...
#> Final Optimized Predictors: perimeter_mean compactness_mean radius_worst area_worst concavity_mean perimeter_worst compactness_worst

#> Call:
#> stats::lm(formula = form1, data = data)
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.43905 -0.03010 -0.00420  0.03113  0.27932 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)        4.401e-01  3.107e-02  14.166  < 2e-16 ***
#> perimeter_mean     1.487e-01  5.931e-04 250.709  < 2e-16 ***
#> compactness_mean  -3.926e+00  1.559e-01 -25.182  < 2e-16 ***
#> radius_worst       1.462e-01  7.390e-03  19.778  < 2e-16 ***
#> area_worst        -1.897e-04  3.205e-05  -5.918 5.68e-09 ***
#> concavity_mean    -6.334e-01  9.568e-02  -6.620 8.39e-11 ***
#> perimeter_worst   -1.671e-02  1.020e-03 -16.381  < 2e-16 ***
#> compactness_worst  2.310e-01  4.123e-02   5.602 3.32e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Residual standard error: 0.06754 on 561 degrees of freedom
#> Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
#> F-statistic: 2.208e+05 on 7 and 561 DF,  p-value: < 2.2e-16

perimeter_mean, compactness_mean, radius_worst, area_worst, concavity_mean, perimeter_worst and compactness_worst are the predictors that were found to be statistically significant while building a model for the response variable radius_mean.
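To give a feel for how a model can be narrowed down to its significant predictors, the sketch below implements a simple p-value-based backward elimination in base R. This is only an assumption about the general technique: `lm_significant()` may be implemented differently. The names `backward_eliminate()` and `alpha` are hypothetical, and the sketch assumes that all predictors are numeric with syntactic column names.

```r
# Sketch of p-value-based backward elimination (NOT necessarily the
# algorithm used by lm_significant()). Assumes numeric predictors, so that
# coefficient names match column names.
backward_eliminate <- function(data, res, alpha = 0.05) {
  predictors <- setdiff(names(data), res)
  repeat {
    fit <- stats::lm(stats::reformulate(predictors, response = res), data = data)
    # Coefficient table without the intercept row
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
    pvals <- coefs[, "Pr(>|t|)"]
    worst <- which.max(pvals)
    # Stop once every remaining predictor is significant at level alpha
    if (pvals[worst] <= alpha) break
    # Otherwise drop the least significant predictor and refit
    predictors <- setdiff(predictors, rownames(coefs)[worst])
    # If nothing is left, fall back to an intercept-only model
    if (length(predictors) == 0) {
      return(stats::lm(stats::reformulate("1", response = res), data = data))
    }
  }
  fit
}

# Usage on a built-in data set
fit <- backward_eliminate(mtcars, res = "mpg")
summary(fit)
```

Only one predictor is removed per iteration because the p-values of the remaining predictors change each time the model is refitted.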

Comparing model performance across one or more lm models

train <- mtcars[1:20,]
test <- mtcars[21:30,]

mod1 <- lm(mpg~wt, train)
mod2 <- lm(mpg~cyl, train)
mod3 <- lm(mpg~wt+cyl, train)
mod4 <- lm(mpg~carb, train)

comp_mods(mod1, mod2, mod3, mod4, newdata = test)
#>            RMSE     Rsquared      MAE
#> model1 7.558188 0.0276731943 6.090093
#> model2 8.893606 0.0006549141 7.565000
#> model3 8.315105 0.0044296323 6.741422
#> model4 9.741102 0.1946252582 7.860596
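For readers who want to see how such metrics can be obtained, the following sketch computes RMSE, R² and MAE on held-out data in base R. It assumes that R² is taken as the squared correlation between observed and predicted values; `comp_mods()` may compute its metrics differently. `holdout_metrics()` is a hypothetical helper and is not part of RefineMod.

```r
# Hypothetical helper: performance of one lm model on new data.
# Assumes the response is the first variable in the model formula.
holdout_metrics <- function(model, newdata) {
  response <- all.vars(stats::formula(model))[1]
  obs <- newdata[[response]]
  pred <- stats::predict(model, newdata = newdata)
  c(
    RMSE = sqrt(mean((obs - pred)^2)),   # root mean squared error
    Rsquared = stats::cor(obs, pred)^2,  # squared correlation (an assumption)
    MAE = mean(abs(obs - pred))          # mean absolute error
  )
}

train <- mtcars[1:20, ]
test <- mtcars[21:30, ]

mods <- list(
  model1 = lm(mpg ~ wt, train),
  model2 = lm(mpg ~ wt + cyl, train)
)

# One row per model, one column per metric
metrics <- t(sapply(mods, holdout_metrics, newdata = test))
print(metrics)
```

Evaluating on rows that were not used for fitting gives a fairer comparison than in-sample fit statistics, which always favour the model with more predictors.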

