Implementation of R code for simple and multiple linear regression:

Recall that for the simple linear regression we have discussed in class so far, the model is \(y = \beta_0 + \beta_1x + \epsilon\) and the predicted value for observation \(i = 1,2,\ldots,n\) is \(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\). Below we implement the analysis in R, in case you want to use R as an alternative to JMP.

# Set up data
# Independent variable
x = c(26.6, 26.0, 27.4, 21.7, 14.9, 11.3, 15.0, 8.7, 8.2)
# Dependent variable
y = c(1.58, 1.45, 1.13, 0.96, 0.99, 1.05, 0.82, 0.68, 0.56)

# Step 1: Build linear model

linearmodel <- lm(y ~ x)  # build linear regression model on full data
summary_mod = summary(linearmodel); summary_mod
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23691 -0.12513 -0.02289  0.13280  0.25479 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 0.393960   0.171536   2.297  0.05526 . 
## x           0.035509   0.008927   3.978  0.00534 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1967 on 7 degrees of freedom
## Multiple R-squared:  0.6933, Adjusted R-squared:  0.6494 
## F-statistic: 15.82 on 1 and 7 DF,  p-value: 0.00534

\(\bullet\) The summary gives the least-squares line for prediction, \(\hat y = 0.39396 + 0.035509\cdot x\): a one-unit increase in x is associated with an increase in y of 0.035509. From the Std. Error column, the summary gives \(SE(\hat \beta_0) = 0.171536\) and \(SE(\hat \beta_1) = 0.008927\), with \(n - 2 = 7\) degrees of freedom. In the Multiple R-squared part, the summary gives \(R^2 = 0.6933\), indicating that approximately 69.33% of the observed variation in y is explained by the simple linear regression on x.
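These quantities can also be extracted programmatically. A minimal sketch, assuming the objects linearmodel and summary_mod defined above:

coef(linearmodel)             # beta0-hat and beta1-hat
summary_mod$coefficients      # estimates, standard errors, t values and p-values
summary_mod$sigma             # residual standard error (RMSE)
summary_mod$r.squared         # R-squared
summary_mod$df[2]             # residual degrees of freedom, n - 2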

# Step 2: Confidence interval for intercept and slope
conf = confint(linearmodel); round(conf, 4)
##               2.5 % 97.5 %
## (Intercept) -0.0117 0.7996
## x            0.0144 0.0566
# For intercept, the first value is estimate, the second value is SE
c(summary_mod$coefficients[1,1],summary_mod$coefficients[1,2])
## [1] 0.3939595 0.1715359
conf_intercept = c(summary_mod$coefficients[1,1] + qt(0.025, 7) * summary_mod$coefficients[1,2],
                   summary_mod$coefficients[1,1] + qt(0.975, 7) * summary_mod$coefficients[1,2])
round(conf_intercept, 4) # This will give the same result as the confint() function
## [1] -0.0117  0.7996
# For slope, the first value is estimate, the second value is SE
c(summary_mod$coefficients[2,1],summary_mod$coefficients[2,2])
## [1] 0.035509163 0.008927328
conf_slope = c(summary_mod$coefficients[2,1] + qt(0.025, 7) * summary_mod$coefficients[2,2],
               summary_mod$coefficients[2,1] + qt(0.975, 7) * summary_mod$coefficients[2,2])
round(conf_slope, 4) # This will give the same result as the confint() function
## [1] 0.0144 0.0566
# Step 3: Predicted (fitted) values for each observation
predval = predict(linearmodel); round(predval, 4)
##      1      2      3      4      5      6      7      8      9 
## 1.3385 1.3172 1.3669 1.1645 0.9230 0.7952 0.9266 0.7029 0.6851

\(\bullet\) From confint(), we are approximately 95% confident that the intercept is between -0.0117 and 0.7996, and approximately 95% confident that the slope is between 0.0144 and 0.0566. Alternatively, we can extract the values from the summary table and, with the aid of the qt() function for the t critical value, compute the confidence intervals by hand.

\(\bullet\) From predict(), we obtain the fitted (predicted) values for the 9 observations from the simple linear regression.

# Step 4: Inference on the mean response, and Prediction interval for future observation

# a) Let us assume we want to know the inference when x = 30
newdata = data.frame(x=30)
# Confidence interval for mean response
conf30 = predict(linearmodel, newdata, interval="confidence"); round(conf30,4)
##      fit    lwr    upr
## 1 1.4592 1.1578 1.7606
# Prediction interval for one future observation
pred30 = predict(linearmodel, newdata, interval="predict"); round(pred30, 4) 
##      fit   lwr    upr
## 1 1.4592 0.905 2.0135
newdata1 = data.frame(x=7)
conf7 = predict(linearmodel, newdata1, interval="confidence");round(conf7, 4)
##      fit    lwr    upr
## 1 0.6425 0.3676 0.9175
pred7 = predict(linearmodel, newdata1, interval="predict"); round(pred7, 4)
##      fit    lwr    upr
## 1 0.6425 0.1022 1.1828

\(\bullet\) By supplying new data to the predict() function, we obtain that when \(x=30\), the confidence interval for the mean response is (1.1578, 1.7606) and the prediction interval for a future observation is (0.905, 2.0135). Similarly, when \(x=7\), the confidence interval for the mean response is (0.3676, 0.9175) and the prediction interval for a future observation is (0.1022, 1.1828).

\(\bullet\) In both cases, the prediction interval for a future observation is wider than the confidence interval for the mean response. Do you remember the formulas used to derive these two intervals? See the notes below before attempting the homework. Good luck!

# Step 5: Plot Diagnostics

# Scatterplot
scatter.smooth(x, y, main="Regression model: y ~ x")  # scatterplot

# Boxplot (No outlier in this case)
par(mfrow=c(1, 2))  # divide graph area in 2 columns
boxplot(x, main="Boxplot for x", sub=paste("Outlier rows: ", boxplot.stats(x)$out))  # Boxplot for x
boxplot(y, main="Boxplot for y", sub=paste("Outlier rows: ", boxplot.stats(y)$out))  # Boxplot for y

# Q-Q plot
par(mfrow=c(1, 2))
qqnorm(x, xlim = c(-3,3))
qqline(x, lwd=2)

qqnorm(y, xlim = c(-3,3))
qqline(y, lwd=2)
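As a complementary check (a sketch that is not part of the output above), the assumptions discussed in the notes below can also be examined through the residuals:

# Residual diagnostics (sketch): residuals vs. fitted values and a normal Q-Q plot of the residuals
par(mfrow=c(1, 2))
plot(fitted(linearmodel), residuals(linearmodel),
     xlab="Fitted values", ylab="Residuals", main="Residuals vs Fitted")
abline(h=0, lty=2)
qqnorm(residuals(linearmodel)); qqline(residuals(linearmodel), lwd=2)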

Key Concepts from Lectures 22 and 23:

\(\bullet\) Correlation, Coefficient of Determination: See class note 21.

\(\bullet\) JMP outputs in Lecture note and ANOVA: See class note 21.

\(\bullet\) Simple Linear Model: The simple linear model (true model for least-squares regression) is given by:

\[\begin{equation} \boxed{y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad \text{or} \quad Y = \beta_0 + \beta_1 X + \epsilon} \end{equation}\]

, where \(y_i\) is the observed value of the response variable, \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the true regression line, and \(\epsilon_i\) is the error term associated with the ith measurement.

Normal Assumption for Linear Model:

\(\bullet\) 1) We need the observations to be independent.

\(\bullet\) 2) We need the errors \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\) to be random and independent. In particular, the magnitude of any error term should not influence the value of the other error terms. If the residuals display a monotone increasing or decreasing trend, or a U-shaped or inverted-U-shaped pattern, the assumptions of the linear model are violated.

\(\bullet\) 3) We need the error terms to be normally distributed, with mean 0 and common variance (homoscedasticity). Notationally, we need \(\epsilon_i \overset{{\scriptstyle\text{i.i.d.}}}{\sim} N(0,\sigma^2)\).

If these assumptions are fulfilled, the response variable \(y_i\) also follows a normal distribution. Notationally, \(y_i \overset{{\scriptstyle\text{ind.}}}{\sim} N(\beta_0 + \beta_1x_i,\sigma^2)\), for \(i=1,2,\ldots,n\) (the \(y_i\) are independent, though not identically distributed, since their means differ).
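To make these assumptions concrete, here is a small simulation sketch; the values of \(\beta_0\), \(\beta_1\), and \(\sigma\) below are arbitrary and chosen only for illustration:

# Simulate one data set from the true model y_i = beta0 + beta1*x_i + eps_i, eps_i ~ N(0, sigma^2)
set.seed(1)
beta0 <- 0.4; beta1 <- 0.035; sigma <- 0.2   # hypothetical true parameter values
x_sim <- runif(50, 5, 30)                    # arbitrary design points
y_sim <- beta0 + beta1 * x_sim + rnorm(50, mean = 0, sd = sigma)
fit_sim <- lm(y_sim ~ x_sim)
coef(fit_sim)                                # estimates should be close to the true beta0 and beta1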

Intercept and Slope for regression:

I will provide only the formulas for \(\hat \beta_0\), \(\hat \beta_1\), \(SE(\hat \beta_0)\), and \(SE(\hat \beta_1)\) here, skipping the derivations for simplicity. I will upload side notes with the proofs for further study after the final examination. If you are interested in regression, consider taking STATS 500 or other related courses.

\[\begin{equation} \boxed{\hat \beta_1 = \frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}} \end{equation}\]
\[\begin{equation} \boxed{\hat \beta_0 = \bar y - \hat \beta_1 \bar x} \end{equation}\]
\[\begin{equation} \boxed{SE(\hat \beta_1) = \sqrt{Var\Big(\frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}\Big)} = \frac{\sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}}} \end{equation}\]
\[\begin{equation} \boxed{SE(\hat \beta_0) = \sqrt{Var(\bar y - \hat \beta_1 \bar x)} = \sigma \cdot \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}}} \end{equation}\]
\[\begin{equation} \boxed{\hat \sigma^2 = s^2 = MSE = (RMSE)^2 = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}} \end{equation}\]
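These formulas can be checked numerically against the lm() output above. A minimal sketch using the x and y vectors from Step 1:

# Hand computation of the estimates and standard errors (should match the Coefficients table above)
n      <- length(x)
Sxx    <- sum((x - mean(x))^2)
b1_hat <- sum((y - mean(y)) * (x - mean(x))) / Sxx
b0_hat <- mean(y) - b1_hat * mean(x)
s      <- sqrt(sum((y - (b0_hat + b1_hat * x))^2) / (n - 2))   # RMSE = sqrt(MSE)
se_b1  <- s / sqrt(Sxx)
se_b0  <- s * sqrt(1/n + mean(x)^2 / Sxx)
c(b0_hat, b1_hat, se_b0, se_b1)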

Hypothesis Tests and Confidence Intervals for the Intercept:

\(\bullet\) For Intercept, our hypothesis is \(H_0: \beta_0 =0 \ \ vs \ \ H_1:\beta_0 \ne 0\).

\(\bullet\) We standardize \(\hat \beta_0\) following the pattern \(\frac{Observed - Mean}{SD} = \frac{\hat \beta_0 - \beta_0}{SE(\hat\beta_0)}\). This statistic follows the t-distribution with \(n-2\) degrees of freedom, because the degrees of freedom equal the sample size minus the number of coefficients estimated (\(\beta_0\) and \(\beta_1\)), which gives \(n-2\).

\(\bullet\) If the p-value is smaller than the significance level \(\alpha\), we reject the null hypothesis and we have sufficient evidence to conclude that the intercept is significantly different from 0.

\(\bullet\) For the \(100(1-\alpha) \%\) confidence interval of the intercept (you know how to derive the CUB and CLB already), we have: \(\hat \beta_0 \pm t_{n-2, \frac{\alpha}{2}} \cdot SE(\hat\beta_0)\). Then, we are approximately \(100(1-\alpha) \%\) confident that the intercept value falls (between/below/above) ……….

Hypothesis Tests and Confidence Intervals for the Slope:

\(\bullet\) For Slope, our hypothesis is \(H_0: \beta_1 =0 \ \ vs \ \ H_1:\beta_1 \ne 0\).

\(\bullet\) We standardize \(\hat \beta_1\) following the pattern \(\frac{Observed - Mean}{SD} = \frac{\hat \beta_1 - \beta_1}{SE(\hat\beta_1)}\). This statistic follows the t-distribution with \(n-2\) degrees of freedom, because the degrees of freedom equal the sample size minus the number of coefficients estimated (\(\beta_0\) and \(\beta_1\)), which gives \(n-2\).

\(\bullet\) If the p-value is smaller than the significance level \(\alpha\), we reject the null hypothesis and we have sufficient evidence to conclude that the slope (regression effect) is significantly different from 0. In this case, the predictor is significant in our linear model.

\(\bullet\) On the other hand, if the p-value is greater than the significance level \(\alpha\), we fail to reject the null hypothesis and we do not have sufficient evidence to conclude that the slope (regression effect) is significantly different from 0. In this case, the predictor can be dropped from our linear model.

\(\bullet\) For the \(100(1-\alpha) \%\) confidence interval of the slope (you know how to derive the CUB and CLB already), we have: \(\hat \beta_1 \pm t_{n-2, \frac{\alpha}{2}} \cdot SE(\hat\beta_1)\). Then, we are approximately \(100(1-\alpha) \%\) confident that the slope value falls (between/below/above) ……….
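As a quick sketch (the same recipe works for the intercept), the t statistic and two-sided p-value for the slope can be reproduced from the quantities extracted in Step 2:

# t statistic and p-value for H0: beta1 = 0, with n - 2 = 7 degrees of freedom
t_slope <- summary_mod$coefficients[2, 1] / summary_mod$coefficients[2, 2]
p_slope <- 2 * pt(-abs(t_slope), df = 7)
c(t_slope, p_slope)   # should match the t value (3.978) and Pr(>|t|) (0.00534) for x in the summary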

Inference for Mean Response:

Recap: \(Y|x_i \sim N(\beta_0 +\beta_1x_i, \sigma^2)\) and, at a new point \(x^*\), \(Y|x^* \sim N(\beta_0 +\beta_1x^*, \sigma^2)\).

\(\bullet\) For \(\hat Y = \hat \beta_0 + \hat \beta_1 x^* \rightarrow E[\hat Y] = \beta_0 + \beta_1 x^* \ \ , \ \ Var(\hat Y) = \sigma^2 \Big[\frac{1}{n} + \frac{(x^*-\bar x)^2}{\sum_{i=1}^n(x_i - \bar x)^2}\Big]\)

\(\bullet\) Now for the estimated standard deviation of \(\hat Y\), we have

\[\begin{equation} \boxed{s_{\hat Y} = s \cdot \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}} \ \ , \ \ s = RMSE = \sqrt{MSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}} = \sqrt{\frac{(1-r^2)\sum_{i=1}^n (y_i - \bar y)^2}{n-2}}} \end{equation}\]

\(\bullet\) Therefore, for hypothesis testing, we use the test statistic \(\frac{\hat Y - (\beta_0 + \beta_1 x^*)}{s_{\hat Y}} \sim t_{n-2}\); remember that we have \(n-2\) degrees of freedom because we estimate the intercept \(\beta_0\) and the slope \(\beta_1\). In the JMP output, the \(n-2\) degrees of freedom appear as the DF for Error in the Analysis of Variance table.

\(\bullet\) To construct \(100(1-\alpha)\%\) confidence interval for mean response, we use

\[\begin{equation} \boxed{(\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot s_{\hat Y} = (\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot \sqrt{MSE \cdot \Big[\frac{1}{n} + \frac{(x^* - \bar x)^2}{\text{Sum of Squares}}\Big]}} \end{equation}\]

\(\bullet\) The MSE in JMP can be found under Analysis of Variance \(\rightarrow\) Mean Square \(\rightarrow\) Error row, while \(\sum_{i=1}^n (x_i - \bar x)^2\) can be found as the Corrected SS in Summary Statistics.
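As a sketch, the 95% confidence interval for the mean response at \(x^* = 30\) from Step 4 can be reproduced by hand from these pieces:

# Confidence interval for the mean response at x* = 30, by hand
xstar  <- 30
n      <- length(x)
MSE    <- summary_mod$sigma^2                    # mean squared error
Sxx    <- sum((x - mean(x))^2)                   # corrected sum of squares for x
fit    <- coef(linearmodel)[1] + coef(linearmodel)[2] * xstar
se_fit <- sqrt(MSE * (1/n + (xstar - mean(x))^2 / Sxx))
fit + c(-1, 1) * qt(0.975, df = n - 2) * se_fit  # should match conf30 from Step 4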

Inference for Future Observations (Prediction Interval):

Since the error in prediction is \(y - \hat y = (\beta_0 + \beta_1 x^* + \epsilon) - (\hat \beta_0 + \hat \beta_1 x^*)\), the prediction interval for a future observation has more variability, which makes it wider than the confidence interval for the mean response.

\(\bullet\) Since the future observation \(Y_{new} = \beta_0 + \beta_1x^* + \epsilon\) is assumed to be independent of the observed values, we have \(cov(Y_{new}, \hat Y) = 0\). Therefore, the error of prediction has variance:

\[\begin{equation} \boxed{Var(Y_{new} - \hat Y) = Var(Y_{new}) - 2cov(Y_{new}, \hat Y) + Var(\hat Y) = Var(\beta_0 + \beta_1 x^* + \epsilon) + Var(\hat Y)} \end{equation}\] \[\begin{equation} \boxed{=Var(\epsilon) + \sigma^2 \cdot [\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i -\bar x)^2}] = \sigma^2 + \sigma^2 \cdot [\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i -\bar x)^2}] = \sigma^2 \cdot [1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i -\bar x)^2}]} \end{equation}\]

\(\bullet\) Therefore, by the same logic as for the mean response, the estimated standard deviation of the prediction error for \(Y_{new}\) is

\[\begin{equation} \boxed{s_{Y_{new}} = \sqrt{s^2_{\hat Y} + s^2} = s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}} = (RMSE) \cdot \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\text{Sum of Squares}}}} \end{equation}\]

\(\bullet\) Therefore, for hypothesis testing, we use the test statistic \(\frac{Y_{new} - (\hat \beta_0 + \hat \beta_1 x^*)}{s_{Y_{new}}} \sim t_{n-2}\); remember that we have \(n-2\) degrees of freedom because we estimate the intercept \(\beta_0\) and the slope \(\beta_1\). In the JMP output, the \(n-2\) degrees of freedom appear as the DF for Error in the Analysis of Variance table.

\(\bullet\) To construct \(100(1-\alpha)\%\) prediction interval for future observations, we use

\[\begin{equation} \boxed{(\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot s_{Y_{new}} = (\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot \sqrt{MSE \cdot \Big[1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\text{Sum of Squares}}\Big]}} \end{equation}\]
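Similarly (a sketch reusing the quantities defined in the mean-response sketch above), the 95% prediction interval at \(x^* = 30\) only adds the extra 1 under the square root:

# Prediction interval for one future observation at x* = 30, by hand
se_pred <- sqrt(MSE * (1 + 1/n + (xstar - mean(x))^2 / Sxx))
fit + c(-1, 1) * qt(0.975, df = n - 2) * se_pred # should match pred30 from Step 4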

Worked Example 1: USWage dataset in Faraway

library(faraway)
data(uswages)
attach(uswages)
exper[exper == -2] = NA; exper[exper == -1] = NA # Recode invalid (negative) experience values as NA

\(\bullet~~\) Similar to what was done in Assignment 1, I removed the 33 observations with negative values in years of experience before performing the regression analysis.

Fit a regression model with weekly wages as the response and years of education and experience as predictors.

# Linear regression and statistical inference w.r.t wage, education and experience
lm.fit1=lm(wage~ educ+exper); summary(lm.fit1) 
## 
## Call:
## lm(formula = wage ~ educ + exper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1014.7  -235.2   -52.1   150.1  7249.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -239.1146    50.7111  -4.715 2.58e-06 ***
## educ          51.8654     3.3423  15.518  < 2e-16 ***
## exper          9.3287     0.7602  12.271  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 426.8 on 1964 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.1348, Adjusted R-squared:  0.1339 
## F-statistic:   153 on 2 and 1964 DF,  p-value: < 2.2e-16

\(\bullet~~\) Referring to the coefficients, the regression model with weekly wages as the response and years of education and experience as predictors is as follows:

\[wage = -239.11 + 51.87\cdot(education) + 9.33 \cdot(experience)\]

, which implies that education and experience are positively associated with wage. In particular, each additional year of education is associated with an increase of approximately 51.87 dollars in weekly wage, and each additional year of experience with an increase of approximately 9.33 dollars, holding the other predictor constant.

\(\bullet~~\) The \(R^2 \approx 0.135\) (Multiple R-squared) implies that approximately 13.5% of the variability in weekly wages is explained by the regression model above.
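As an illustration (a hypothetical example, not part of the assignment questions), the fitted model can be used to predict the weekly wage at chosen predictor values, e.g. 12 years of education and 10 years of experience:

# Predicted weekly wage for a hypothetical worker with educ = 12 and exper = 10
predict(lm.fit1, newdata = data.frame(educ = 12, exper = 10))
# equals -239.11 + 51.87*12 + 9.33*10, roughly 477 dollars per week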

Which observation has the largest (positive) residual? Give the case number

\(\bullet~~\) The p-values of the predictors are extremely low, so there is strong indication of a linear relationship between the response and the predictors (common significance criteria are p < 0.01 or p < 0.05).

# Finding the object with the largest positive residual and its value
residuals(lm.fit1)[which.max(residuals(lm.fit1))]
##     1576 
## 7249.174

Compute the mean and median of the residuals. Explain what the difference between the mean and the median indicates.

\(\bullet~~\) As the result suggests, observation 1576 has the largest positive residual, with a value of 7249.174.

# Finding mean and median of the residuals
mean(residuals(lm.fit1)); median(residuals(lm.fit1))
## [1] -1.381535e-15
## [1] -52.14337

\(\bullet~~\) The mean of the residuals is \(-1.38 \times 10^{-15}\) and the median of the residuals is -52.14. The mean is essentially 0: for least-squares regression with an intercept it is exactly 0 in theory, and the tiny nonzero value is only floating-point rounding error.

\(\bullet~~\) Clearly, the mean of the residuals is greater than the median. This positive difference indicates that the distribution of the residuals, and hence of the response (weekly wage), is skewed to the right: a few people's weekly wages are far above their predicted values, while most people's weekly wages fall somewhat below their predicted values.

# Performing residual analysis: Plotting residual and finding non-linearity
par(mfrow=c(1,3))
plot(predict(lm.fit1), residuals(lm.fit1))
plot(predict(lm.fit1), rstudent(lm.fit1))
plot(hatvalues(lm.fit1))

\(\bullet~~\) The figures above show the residuals, the studentized residuals, and the leverages of the fitted observations.

\(\bullet~~\) The first two plots demonstrate non-constant variance in the error terms: the spread of the residuals increases with the fitted values. This funnel shape (heteroscedasticity) in the residuals indicates that a transformation of the current model is likely needed.

Compute the correlation of the residuals with the fitted values. Plot residuals against fitted values.

\(\bullet~~\) In a statistical context, we may consider an observation a possible outlier if its studentized residual exceeds 3 in absolute value.

# Finding the correlation of the residuals with the fitted values
zapsmall(cor(fitted(lm.fit1), residuals(lm.fit1)))
## [1] 6.35678e-17

\(\bullet~~\) To solve question 6, referring to the model \[wage = -239.11 + 51.87\cdot(education) + 9.33 \cdot(experience)\] , we find that two people with the same education and a one-year difference in experience will differ by approximately 9.33 dollars in predicted weekly wage.

\(\bullet~~\) Since the linear model above shows heteroscedasticity, we now apply a log transformation to the response.

Fit the same model but with log(weekly wages) as the response and interpret the regression coefficient for experience. Which model has a more natural interpretation?

# Log-linear regression and statistical inference w.r.t wage, education and experience
loglm.fit1=lm(log(wage)~ educ+exper); summary(loglm.fit1) 
## 
## Call:
## lm(formula = log(wage) ~ educ + exper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7527 -0.3383  0.1002  0.4297  3.5728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.675518   0.077085   60.65   <2e-16 ***
## educ        0.091940   0.005080   18.10   <2e-16 ***
## exper       0.016516   0.001156   14.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6488 on 1964 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.1747, Adjusted R-squared:  0.1739 
## F-statistic: 207.9 on 2 and 1964 DF,  p-value: < 2.2e-16
# Performing residual analysis for log-linear regression model
par(mfrow=c(1,3))
plot(predict(loglm.fit1), residuals(loglm.fit1))
plot(predict(loglm.fit1), rstudent(loglm.fit1))
plot(hatvalues(loglm.fit1))

\(\bullet~~\) We now have the log-linear model \[ln(wage) = 4.676 + 0.092\cdot(education) + 0.017 \cdot(experience)\] , which means that every one-year increase in experience is associated with an increase of approximately 0.017 in log(weekly wage), i.e., roughly a 1.7% increase in weekly wage, holding education fixed.

\(\bullet~~\) From the two corresponding pairs of residual plots, the transformed model shows no particular pattern, while the original model shows non-constant variance. In addition, the studentized residuals of the transformed model shrink to within about 4 in absolute value. The log model therefore satisfies the assumptions better and, since its coefficients can be read as approximate percentage changes in wage, has the more natural interpretation.
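A small sketch of this percentage-change reading, assuming the loglm.fit1 object from above:

# Back-transforming the experience coefficient: one extra year of experience multiplies
# the fitted wage by exp(0.016516), i.e. roughly a 1.7% increase, holding education fixed
exp(coef(loglm.fit1)["exper"]) - 1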

Worked Example 2: SAT dataset in Faraway

library(faraway)
attach(sat)
lm.fit1 = lm(total~ takers+ratio+salary, sat)
summary(lm.fit1)$coef; summary(lm.fit1)$r.squared
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 1057.898162 44.3286685  23.864876 1.496236e-27
## takers        -2.913350  0.2282436 -12.764215 1.007685e-16
## ratio         -4.639428  2.1215142  -2.186847 3.387652e-02
## salary         2.552470  1.0045183   2.540989 1.449126e-02
## [1] 0.8238643

Interpretation:

\(\bullet~~\) Referring to the coefficients, the regression model with total SAT score as the response and percentage of takers, pupil-teacher ratio, and teachers' annual salary as predictors is as follows:

\[total = 1057.90 - 2.91\cdot(\% takers) - 4.64 \cdot(ratio) + 2.55\cdot (salary)\]

, which means that the changes in total SAT score associated with a one-unit increase in takers, ratio, and salary (holding the other predictors constant) are -2.91, -4.64, and +2.55 points, respectively.

\(\bullet~~\) The \(R^2\) value for this linear model is 0.8239, so approximately 82.4% of the variability in total SAT score is explained by the regression model above.

\(\bullet~~\) Then we test hypothesis \(H_0: \beta_{salary}=0\) against \(H_a: \beta_{salary} \neq0\). We use the formula:

\[\frac{|\hat \beta_{salary}-0|}{se(\hat \beta_{salary})}= \frac{2.5525}{1.0045}=2.541>t_{46}^{0.025}\approx 2.013\]

Therefore, at significance level \(\alpha=0.05\), we reject \(H_0\) and conclude that salary is a significant factor for total SAT score and cannot be removed from the model.
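The critical value and the corresponding two-sided p-value can be checked in R (a quick sketch, using the values from the coefficient table above):

qt(0.975, df = 46)           # critical value t_46^0.025, approximately 2.013
2 * pt(-2.540989, df = 46)   # two-sided p-value, matching Pr(>|t|) for salary above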

\(\bullet~~\) Then we test hypothesis \(H_0: \beta_{takers}=\beta_{ratio}=\beta_{salary}=0\) against \(H_a: at ~least~one ~factor \neq 0\). We use the following R command:

h0_1 = lm(total~ 1) # Null hypothesis: all 3 factors are insignificant
h0a_1 = lm(total~ takers+ratio+salary)
anova(h0_1,h0a_1)
## Analysis of Variance Table
## 
## Model 1: total ~ 1
## Model 2: total ~ takers + ratio + salary
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     49 274308                                  
## 2     46  48315  3    225992 71.721 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(\bullet~~\) From the ANOVA, since \(Pr(>F) < 2.2 \times10^{-16} <0.05\), we reject \(H_0\) and claim that at least one of the three factors is significant for the total SAT score.

Confidence Interval:

\(\bullet~~\) For question 2 (generating confidence intervals), since we already have the standard errors from the summary table above, i.e., \(se(\hat \beta_{salary}) = 1.0045\), \(se(\hat \beta_{takers})=0.2282\), and \(se(\hat \beta_{ratio})=2.1215\), we can compute the 95% and 99% confidence intervals of the factors using the t-distribution with df = 46 (\(t_{46}^{0.025}=2.013\), \(t_{46}^{0.005}=2.687\)).

For example, the 95% confidence interval of salary is given by:

\[\hat \beta_{salary}-2.013\cdot se(\hat \beta_{salary})\leq \beta_{salary} \leq \hat \beta_{salary}+2.013\cdot se(\hat \beta_{salary}) \rightarrow \beta_{salary} \in [0.5304, 4.5746]\]
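This hand computation can be reproduced directly from the coefficient table (a minimal sketch, assuming lm.fit1 from above):

# 95% confidence interval for beta_salary, by hand
est <- summary(lm.fit1)$coef["salary", "Estimate"]
se  <- summary(lm.fit1)$coef["salary", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 46) * se   # should match the salary row of confint(lm.fit1) below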

\(\bullet~~\) In general, using R, the 95% and 99% confidence intervals are computed by:

list(confint(lm.fit1, level=0.95), confint(lm.fit1, level=0.99))
## [[1]]
##                   2.5 %       97.5 %
## (Intercept) 968.6691802 1147.1271438
## takers       -3.3727807   -2.4539197
## ratio        -8.9098147   -0.3690414
## salary        0.5304797    4.5744605
## 
## [[2]]
##                  0.5 %      99.5 %
## (Intercept) 938.786432 1177.009892
## takers       -3.526644   -2.300057
## ratio       -10.339965    1.061109
## salary       -0.146684    5.251624

\(\bullet~~\) Using these intervals alone, we find that the lower confidence bound for \(\beta_{salary}\) is positive at the 95% level but negative at the 99% level, which means that the p-value for salary in the regression must be between 0.01 and 0.05.

Confidence Region (Confidence Interval in Two Dimensions):

library(ellipse)
## Warning: package 'ellipse' was built under R version 3.4.3
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
# Plot the confidence region and add the estimates
plot(ellipse(lm.fit1, c("ratio","salary")), type="l", xlim=c(-10,1))
points(lm.fit1$coef['ratio'], lm.fit1$coef['salary'], pch=18)
points(0,0, pch=1) # Add the origin to the plot
abline(v=confint(lm.fit1)['ratio',], lty=2) # Add the vertical line representing confidence interval
abline(h=confint(lm.fit1)['salary',], lty=2)

Interpretation of Confidence Region:

\(\bullet~~\) Since the confidence region does not contain the origin (0,0), at significance level \(\alpha = 0.05\) (this confidence region is by default 95%), we reject the null hypothesis \(H_0:\beta_{ratio}=\beta_{salary}=0\) and conclude that at least one of ratio and salary is statistically significant for a candidate's total SAT score (\(H_a:at ~least~one ~factor \neq0\)).
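As a cross-check (a sketch, not part of the original analysis), the same joint hypothesis \(H_0: \beta_{ratio}=\beta_{salary}=0\) can be tested with a partial F-test, using the anova() comparison employed elsewhere in these notes:

reduced = lm(total~ takers, sat)              # model without ratio and salary
full    = lm(total~ takers+ratio+salary, sat)
anova(reduced, full)   # a small p-value is consistent with the ellipse excluding the origin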

lm.fit2 = lm(total~ takers+ratio+salary+expend)
summary(lm.fit2)$coef; summary(lm.fit2)$r.squared
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 1045.971536  52.869760  19.7839283 7.857530e-24
## takers        -2.904481   0.231260 -12.5593745 2.606559e-16
## ratio         -3.624232   3.215418  -1.1271418 2.656570e-01
## salary         1.637917   2.387248   0.6861110 4.961632e-01
## expend         4.462594  10.546528   0.4231339 6.742130e-01
## [1] 0.8245623

\(\bullet~~\) Now, the regression model with total SAT score as the response and percentage of takers, pupil-teacher ratio, teachers' annual salary, and average expenditure as predictors is as follows:

\[total = 1045.97 - 2.90\cdot(\% takers) - 3.62 \cdot(ratio) + 1.64\cdot (salary) + 4.46\cdot (expend)\]

\(\bullet~~\) Compared with the model in question 1, the coefficients of takers, ratio, and salary change slightly but keep the same signs. Notably, a one-unit increase in expend is associated with an increase of approximately 4.46 points in total SAT score.

\(\bullet~~\) In terms of significance, the p-values of ratio, salary, and expend all exceed 0.05, the usual cutoff for claiming that a factor is statistically significant, while takers remains highly significant.

\(\bullet~~\) The \(R^2\) remains essentially unchanged compared with the first model, i.e., \(R^2=0.8246\) in this model versus \(R^2 = 0.8239\) in the previous model (approximately 82.5% of the variability in total SAT score is now explained by the model).

R Code for ANOVA - Multiple hypothesis testing:

\(\bullet~~\) To test the null hypothesis \(H_0: \beta_{salary}=\beta_{expend}=\beta_{ratio}=0\) against \(H_a: at ~least~one ~factor \neq0\), we use the R command:

h0_2 = lm(total~ takers) # Reduced model under the null hypothesis: ratio, salary and expend add nothing beyond takers
h0a_2 = lm(total~ takers+ratio+salary+expend)
anova(h0_2,h0a_2)
## Analysis of Variance Table
## 
## Model 1: total ~ takers
## Model 2: total ~ takers + ratio + salary + expend
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1     48 58433                              
## 2     45 48124  3     10309 3.2133 0.03165 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(\bullet~~\) From the ANOVA, since \(Pr(>F) = 0.03165 <0.05\), at significance level 0.05 we reject \(H_0\) and claim that at least one of the three factors is significant for the total SAT score.

\(\bullet~~\) Using the first model and substituting the predictor values \(takers=35\) and \(ratio=17\), we simplify the model to

\[total = 1057.90 - 2.91\cdot(35) - 4.64 \cdot(17) + 2.55\cdot (salary) \rightarrow total = 877.17 + 2.55\cdot (salary)\]

result = lm(total~ takers + ratio + salary, sat)
grid = seq(25,52,1)
conf_line = predict(result, data.frame(salary=grid, takers = 35, ratio = 17), interval="confidence")
pred_line = predict(result, data.frame(salary=grid, takers = 35, ratio = 17), interval="prediction")
matplot(grid, cbind(conf_line, pred_line[,-1]), lty= c(1,2,2,3,3), col=1, 
        type ="l", xlab="salary", ylab="total") # Plot a matrix
rug(sat$salary) # Add a rug of the observed salary values along the x-axis

\(\bullet~~\) The bold line at the center represents the fitted value of total SAT score as the predictor salary varies, with takers and ratio held constant; since it is a linear model, the line is straight.

\(\bullet~~\) The inner and outer pairs of curves represent the 95% confidence interval for the mean response and the 95% prediction interval for an actual total SAT score observation, respectively. As expected, the prediction interval in this model is always wider than the confidence interval.

\(\bullet~~\) Discussion: The narrowest 95% confidence interval occurs at a salary of approximately 36, and the intervals widen toward the two tails. This suggests that splitting the data into two separate models (teachers' salary below 36,000 per year versus above 36,000 per year) might give more reasonable results.
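The salary value at which the interval is narrowest can be checked from the objects already computed (a quick sketch):

# Grid value at which the 95% confidence interval is narrowest
grid[which.min(conf_line[, "upr"] - conf_line[, "lwr"])]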

Last Comment:

Please let me know if you find any typos or grammatical mistakes so I can fix them. It is great writing practice, and I appreciate your help!