**Example**

Input file

    Height  Weight  FreeThrow_percent  FieldGoal_percent  Score
    6.8     225     0.672              0.435              9.2
    6.3     180     0.797              0.563              11.7
    6.4     190     0.761              0.567              15.8
    6.2     180     0.651              0.432              8.6
    6.9     205     0.9                0.643              23.2
    6.4     225     0.78               0.645              27.4
    6.3     185     0.771              0.485              9.3
    6.8     235     0.75               0.521              16
    6.9     235     0.818              0.334              4.7
    6.7     210     0.825              0.626              12.5
    6.9     245     0.632              0.644              20.1
    6.9     245     0.757              0.421              9.1
    6.3     185     0.709              0.367              8.1
    6.1     185     0.782              0.442              8.6

The dataset, named 'Linear_regression.txt', can be downloaded. Data reference: The Official NBA Basketball Encyclopedia, Villard Books.

Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple linear regression. With more than one explanatory variable, the process is called multiple linear regression.
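As a minimal sketch of the simple linear regression case, the closed-form least-squares estimates can be computed directly. The helper name `fit_simple_ols` is hypothetical, and the data are only the 14 rows listed under "Input file", so the numbers are not expected to match the tool's own output.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# FieldGoal_percent and Score columns of the 14 rows shown under "Input file".
field_goal = [0.435, 0.563, 0.567, 0.432, 0.643, 0.645, 0.485,
              0.521, 0.334, 0.626, 0.644, 0.421, 0.367, 0.442]
score = [9.2, 11.7, 15.8, 8.6, 23.2, 27.4, 9.3,
         16, 4.7, 12.5, 20.1, 9.1, 8.1, 8.6]

intercept, slope = fit_simple_ols(field_goal, score)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(field_goal, score)]

print(f"Score = {intercept:.3f} + {slope:.3f} * FieldGoal_percent")
```

With an intercept in the model, the residuals sum to zero up to rounding, which is a quick sanity check on any fit of this kind.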

This tool includes 6 main parts:

- Model Assumption Check. It includes a linear relation test and a constant variance test.

The tool provides two tests for a linear relationship: the Harvey-Collier test and the Rainbow test. The Harvey-Collier test performs a t-test on the recursive residuals. If the true relationship is not linear but convex or concave, the mean of the recursive residuals should differ significantly from 0. The Rainbow test is based on the idea that, even if the true relationship is non-linear, a good linear fit can be achieved on a subsample in the "middle" of the data. The null hypothesis is rejected whenever the overall fit is significantly worse than the fit for the subsample.

- Model Selection. Users can draw a scattermatrix plot and perform stepwise selection.
- Model Fitting. A linear regression model can be fitted. For a simple linear model, a linear plot can be shown.
- Diagnostic Check. Four diagnostic plots can be shown for users to analyze. Users can also use cross validation to evaluate the model.
- Model Comparison. It includes four model comparison tests. For a comparison of simple linear models, linear comparison plots can be shown.

Four tests are available to compare two models: the Cox Test, the J Test, the Anova Test, and the Likelihood Ratio Test. The latter two tests are for nested linear models.

The Cox Test is based on the idea that if the first model contains the correct set of regressors, then a fit of the regressors from the second model to the fitted values from the first model should have no further explanatory value. If it does, it can be concluded that model 1 does not contain the correct set of regressors. The J Test is based on the idea that if the first model contains the correct set of regressors, then including the fitted values of the second model in the set of regressors should provide no significant improvement. If it does, it can be concluded that model 1 does not contain the correct set of regressors.
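For the two nested-model tests, the test statistics can be sketched from residual sums of squares alone. The example below is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the RSS values are copied from the stepwise output shown later in this example, and the sample size of 50 is inferred from the residual degrees of freedom reported in the model summary. Turning the statistics into p-values would additionally need the F and chi-squared distributions, which are omitted here.

```python
import math

def nested_f_statistic(rss_small, df_small, rss_big, df_big):
    """F statistic comparing a smaller model nested inside a bigger one.

    df_* are residual degrees of freedom (observations minus coefficients).
    """
    return ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)

def likelihood_ratio_statistic(rss_small, rss_big, n_obs):
    """Likelihood ratio statistic for Gaussian linear models."""
    return n_obs * math.log(rss_small / rss_big)

N_OBS = 50  # assumption: inferred from 47 residual df with 3 coefficients

# Score ~ FieldGoal_percent (2 coefficients) nested in
# Score ~ Height + Weight + FieldGoal_percent (4 coefficients).
rss_small, rss_big = 1651.5, 1620.4
f_stat = nested_f_statistic(rss_small, N_OBS - 2, rss_big, N_OBS - 4)
lr_stat = likelihood_ratio_statistic(rss_small, rss_big, N_OBS)

print(f"F = {f_stat:.4f}, LR = {lr_stat:.4f}")
```

A small statistic means the extra predictors reduce the residual sum of squares only slightly relative to the remaining noise.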

- Prediction. Using the fitted model, users can input new data to make a prediction.

An example of a linear regression model is shown below. First, we check the assumptions of the regression model and select a proper model.

When checking the assumptions of the regression model, we get two test results: one from the linear relation test and the other from the heteroskedasticity test.

**Result:**

The assumption check result:

    Result of linear relation test

        Harvey-Collier test

    data:  form_assumption
    HC = 1.8014, df = 45, p-value = 0.07834

    Result of Heteroskedasticity test

        Harrison-McCabe test

    data:  form_assumption
    HMC = 0.66995, p-value = 0.959

    The form_assumption is: Score ~ Height+Weight+FreeThrow_percent

In this example, both p-values are larger than 0.05, so we fail to reject the null hypotheses at the 5% level. That is, the tests give no evidence that the linearity and constant-variance assumptions are violated.

The result of model selection:

    Start:  AIC=180.27
    Score ~ 1

                        Df Sum of Sq    RSS    AIC
    + FieldGoal_percent  1   116.244 1651.5 178.87
    <none>                           1767.8 180.27
    + Height             1    12.539 1755.2 181.92
    + Weight             1     0.438 1767.3 182.26

    Step:  AIC=178.87
    Score ~ FieldGoal_percent

             Df Sum of Sq    RSS    AIC
    <none>                1651.5 178.87
    + Weight  1    6.5584 1645.0 180.67
    + Height  1    0.5708 1651.0 180.85

    Start:  AIC=181.92
    Score ~ Height + Weight + FieldGoal_percent

                        Df Sum of Sq    RSS    AIC
    - Height             1    24.627 1645.0 180.67
    - Weight             1    30.615 1651.0 180.85
    <none>                           1620.4 181.92
    - FieldGoal_percent  1   117.965 1738.3 183.43

    Step:  AIC=180.67
    Score ~ Weight + FieldGoal_percent

                        Df Sum of Sq    RSS    AIC
    - Weight             1     6.558 1651.5 178.87
    <none>                           1645.0 180.67
    - FieldGoal_percent  1   122.365 1767.3 182.26

    Step:  AIC=178.87
    Score ~ FieldGoal_percent

                        Df Sum of Sq    RSS    AIC
    <none>                           1651.5 178.87
    - FieldGoal_percent  1    116.24 1767.8 180.27

For backward (forward) model selection, users can compare the AIC of the models obtained by deleting (adding) one predictor at a time. The model with the smaller AIC is preferred. For example, in the backward stepwise output, the AIC of the full model (Score ~ Height + Weight + FieldGoal_percent) is 181.92. If we delete Height, the AIC of the model without Height is 180.67; if we delete Weight first, the AIC is 180.85; and if we delete FieldGoal_percent first, the AIC is 183.43. Deleting Height therefore gives the smallest AIC, so if we decide to delete a predictor, Height is the first candidate. However, the AICs after deleting Height or Weight are almost the same, so neither of these two predictors is clearly preferred for deletion.
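The AIC column of the stepwise output appears consistent with the formula n * log(RSS / n) + 2 * k, where k counts the fitted coefficients including the intercept. The sketch below recomputes it from the printed RSS values. The function name `step_aic` is hypothetical, and n = 50 is an assumption inferred from the residual degrees of freedom reported in the model summary.

```python
import math

def step_aic(rss, n_obs, n_coef):
    """AIC (up to an additive constant) of a least-squares fit."""
    return n_obs * math.log(rss / n_obs) + 2 * n_coef

N_OBS = 50  # assumption: inferred from 47 residual df with 3 coefficients

# (RSS, number of coefficients incl. intercept), copied from the output above.
models = {
    "Score ~ 1": (1767.8, 1),
    "Score ~ FieldGoal_percent": (1651.5, 2),
    "Score ~ Weight + FieldGoal_percent": (1645.0, 3),
    "Score ~ Height + Weight + FieldGoal_percent": (1620.4, 4),
}

aic = {name: step_aic(rss, N_OBS, k) for name, (rss, k) in models.items()}
best = min(aic, key=aic.get)

for name, value in aic.items():
    print(f"{value:8.2f}  {name}")
print("smallest AIC:", best)
```

Each added predictor lowers the RSS term but costs 2 in the penalty term, which is why a predictor with a tiny "Sum of Sq" raises the AIC.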

We can also select variables using the scattermatrix plot. In the scattermatrix plot, there may exist a linear relationship between Weight and Height, so we could delete one of them. Taking the AIC above into account, Height is the likely candidate for deletion.

So, we fit a linear model to the dataset without Height.

**Result:**

The linear regression model result:

    Summary of Linear Regression Model

    Call:
    lm(formula = form_lm, data = df1)

    Residuals:
       Min     1Q Median     3Q    Max
    -9.454 -3.215 -1.908  2.030 14.945

    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
    (Intercept)         2.6745    15.5018   0.173   0.8638
    Height             -0.2452     1.9238  -0.127   0.8991
    FieldGoal_percent  14.5509     8.4454   1.723   0.0915 .

    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 5.927 on 47 degrees of freedom
    Multiple R-squared:  0.06608,  Adjusted R-squared:  0.02634
    F-statistic: 1.663 on 2 and 47 DF,  p-value: 0.2006

    The form_lm is: Score ~ Height+FieldGoal_percent
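Several numbers in this summary are linked by simple formulas, which can help when reading it. The sketch below recomputes the t value, the adjusted R-squared, and the F-statistic from other printed values. The function names are hypothetical, and the sample size is taken as 47 residual degrees of freedom plus 3 coefficients.

```python
def t_value(estimate, std_error):
    """t statistic for the hypothesis that a coefficient equals zero."""
    return estimate / std_error

def adjusted_r_squared(r2, n_obs, n_coef):
    """R-squared penalised for the number of coefficients (incl. intercept)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_coef)

def f_statistic(r2, n_obs, n_coef):
    """Overall F statistic of the model against the intercept-only model."""
    return (r2 / (n_coef - 1)) / ((1 - r2) / (n_obs - n_coef))

N_COEF = 3            # intercept, Height, FieldGoal_percent
N_OBS = 47 + N_COEF   # residual degrees of freedom plus coefficients
R2 = 0.06608          # Multiple R-squared from the summary

t_fg = t_value(14.5509, 8.4454)               # FieldGoal_percent row
adj_r2 = adjusted_r_squared(R2, N_OBS, N_COEF)
f_stat = f_statistic(R2, N_OBS, N_COEF)

print(f"t = {t_fg:.3f}, adjusted R2 = {adj_r2:.5f}, F = {f_stat:.3f}")
```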

The linear regression result gives the estimates of the intercept and of the coefficient of each predictor. It also gives the P-value of each coefficient. When the P-value

**Result:**

The result of non-nested model comparison:

    Cox test for non-nested model comparison

    Model 1: Score ~ Height + FreeThrow_percent
    Model 2: Score ~ Weight + FreeThrow_percent
                     Estimate Std. Error z value Pr(>|z|)
    fitted(M1) ~ M2 -0.093687   0.061822 -1.5154   0.1297
    fitted(M2) ~ M1  0.057947   0.209691  0.2763   0.7823
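To read this table, note that each z value is the estimate divided by its standard error. The sketch below assumes that the last column is a two-sided tail probability of the standard normal distribution, as the header Pr(>|z|) suggests; the function name `two_sided_normal_p` is hypothetical.

```python
import math

def two_sided_normal_p(z):
    """Two-sided p-value of a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# (estimate, standard error) copied from the two rows of the Cox test output.
rows = {
    "fitted(M1) ~ M2": (-0.093687, 0.061822),
    "fitted(M2) ~ M1": (0.057947, 0.209691),
}

results = {}
for label, (estimate, std_error) in rows.items():
    z = estimate / std_error
    results[label] = (z, two_sided_normal_p(z))
    print(f"{label}: z = {z:.4f}, p = {results[label][1]:.4f}")
```

Under the usual 0.05 cutoff, neither row is significant, so this comparison does not give evidence against either model.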

The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points. The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed. The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot. Finally, the plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for the Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
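The leverage and Cook's distance shown in the last plot can be sketched for a simple linear model as follows. This is an illustration only: the function name `influence_measures` is hypothetical, the model Score ~ FieldGoal_percent is chosen for simplicity, and the data are the 14 rows listed under "Input file", so the values will not match the tool's plots.

```python
from statistics import mean

def influence_measures(x, y):
    """Return (leverages, Cook's distances) for a simple linear regression."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    cooks = [e * e * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, lev)]
    return lev, cooks

field_goal = [0.435, 0.563, 0.567, 0.432, 0.643, 0.645, 0.485,
              0.521, 0.334, 0.626, 0.644, 0.421, 0.367, 0.442]
score = [9.2, 11.7, 15.8, 8.6, 23.2, 27.4, 9.3,
         16, 4.7, 12.5, 20.1, 9.1, 8.1, 8.6]

leverage, cooks = influence_measures(field_goal, score)
suspicious = [i + 1 for i, d in enumerate(cooks) if d > 1]  # rule of thumb

for i, (h, d) in enumerate(zip(leverage, cooks), start=1):
    print(f"obs {i:2d}: leverage = {h:.3f}, Cook's distance = {d:.3f}")
```

The leverages always sum to the number of coefficients, so an observation with leverage well above p / n is far from the bulk of the predictor values.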

**Result:**

The result of prediction with fitted model:

             fit        lwr      upr
    1  11.450894   6.388202 16.51359
    2  16.324531   9.507230 23.14183
    3  13.574305   8.546600 18.60201
    4  10.602826   1.340620 19.86503
    5  14.364199  10.144601 18.58380
    6  14.467042   9.037434 19.89665
    7  14.415620  10.536495 18.29475
    8  14.923237   6.142741 23.70373
    9   3.122351 -20.914445 27.15915
    10 14.390941   9.593128 19.18875

In the prediction result, the fit value is the value predicted by the model fitted in the Use Linear Model part. The lwr and upr values are the lower and upper bounds of the prediction interval.
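For a simple linear model, the fit, lwr, and upr columns can be sketched with the textbook prediction-interval formula. This is an illustration under stated assumptions rather than the tool's code: the function name `prediction_interval` and the new value 0.5 are hypothetical, the data are the 14 rows listed under "Input file", and the t critical value is passed in by hand because the Python standard library has no t distribution (2.179 is the 97.5% quantile for 12 degrees of freedom from a standard t table).

```python
import math
from statistics import mean

def prediction_interval(x, y, x_new, t_crit):
    """Return (fit, lwr, upr) for a new observation at x_new."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))  # residual standard error
    fit = intercept + slope * x_new
    # The leading 1 accounts for the noise of the new observation itself.
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    return fit, fit - half_width, fit + half_width

field_goal = [0.435, 0.563, 0.567, 0.432, 0.643, 0.645, 0.485,
              0.521, 0.334, 0.626, 0.644, 0.421, 0.367, 0.442]
score = [9.2, 11.7, 15.8, 8.6, 23.2, 27.4, 9.3,
         16, 4.7, 12.5, 20.1, 9.1, 8.1, 8.6]

T_CRIT_95_DF12 = 2.179  # assumption: value read from a t table, df = 14 - 2

fit, lwr, upr = prediction_interval(field_goal, score, 0.5, T_CRIT_95_DF12)
print(f"fit = {fit:.3f}, lwr = {lwr:.3f}, upr = {upr:.3f}")
```

The interval widens as the new value moves away from the mean of the observed predictor, which is one reason a row such as row 9 in the result above can have a much wider interval than the others.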