## Finding better adjusted r-squared values by removing variables

Maple

A couple of weeks ago, I recorded a short video that discussed various applications for the Statistics:-Fit command. One of the more interesting examples examined how manually adjusting the number of parameters used for a regression model affected the resulting adjusted r-squared value.

I won’t go into detail about r-squared here, but to briefly summarize: In a linear regression model, r-squared measures the proportion of the variation in a model's dependent variable explained by the independent variables. Basically, r-squared gives a statistical measure of how well the regression line approximates the data. R-squared values usually range from 0 to 1 and the closer it gets to 1, the better it is said that the model performs as it accounts for a greater proportion of the variance (an r-squared value of 1 means a perfect fit of the data). When more variables are added, r-squared values typically increase. They can never decrease when adding a variable; and if the fit is not 100% perfect, then adding a variable that represents random data will increase the r-squared value with probability 1. The adjusted r-squared attempts to account for this phenomenon by adjusting the r-squared value based on the number of independent variables in the model.

The formula for the adjusted r-squared is:

Where:

n is the number of points in the data sample

k is the number of independent variables in the model excluding the constant

By taking the number of independent variables into consideration, the adjusted r-squared behaves different than r-squared; adding more variables doesn’t necessarily produce better fitting models. In many cases, more variables can often lead to lower adjusted r-squared values. In particular, if you add a variable representing random data, the expected change in the adjusted r-squared is 0.

As such, the adjusted r-squared has a slightly different interpretation than the r-squared. While r-squared is perceived to give an indication of the measure of fit for a chosen regression model, the adjusted r-squared is perceived more as a comparative tool that can be useful for picking variables and designing models that may require less predictors than other models. The science of “gaming” models is a broad topic, so I won’t go into any more detail here, but there’s lots of great information out there if you are looking to learn more (here’s a good place to start).

The following example adjusts a fitted model by adding or removing variables in order to find better adjusted r-squared values.

with(Statistics):

The Import command reads a datafile into a new DataFrame.

ExperimentalData := Import(FileTools:-JoinPath(["Excel", "ExperimentalData.xls"], base = datadir));

The dataset has seven variables: time and experimental readings for 6 various concentrations. Removing “time” from our variable set, the convert command converts the values in the DataFrame to a Matrix of values.

ExMat := convert( ExperimentalData, Matrix )[..,2..7];

We start by fitting a model that includes predicting variables for each of the columns of data. We mark “Concentration A” as our dependent variable.

Fit( C + C2*v + C3*w + C4*x + C5*y + C6*z, ExMat[..,2..6], ExMat[..,1], [v,w,x,y,z], summarize=embed ):

From the above, we can observe that both the r-squared and adjusted r-squared are reasonably high, however only one of the coefficient values has a significant p-value, C3.

Note: Maple shows all p-values less than 0.05 in bold.

Let's try to fit the data again, this time keeping the two coefficients with the lowest p-values and the intercept.

Fit( C + C3*v + C5*w, ExMat[..,[3,5]], ExMat[..,1], [v,w], summarize=embed ):

From the above, we can see that the r-squared value does go down, however the adjusted r-squared goes up! Let's fit the model one last time to see if removing C5 increases or decreases the adjusted r-squared.

Fit( C + C3*v, ExMat[..,3], ExMat[..,1], [v], summarize=embed ):

We can see that the final adjusted r-squared value is lower than the previous two, so we are probably better to keep the additional C5 coefficient value.

You can see this example as well as a couple of other examples of using the Fit command in the following video: https://www.youtube.com/watch?v=YWRgRvlJpjY