Recall that for the simple linear regression we have discussed in class so far, the model is \(y = \beta_0 + \beta_1x + \epsilon\) and the predicted value for observation \(i = 1,2,\ldots,n\) is \(\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\). We now implement this in R, in case you would like to use R as an alternative to JMP.
# Set up data
# Independent variable
x = c(26.6, 26.0, 27.4, 21.7, 14.9, 11.3, 15.0, 8.7, 8.2)
# Dependent variable
y = c(1.58, 1.45, 1.13, 0.96, 0.99, 1.05, 0.82, 0.68, 0.56)
# Step 1: Build linear model
linearmodel <- lm(y ~ x) # build linear regression model on full data
summary_mod = summary(linearmodel); summary_mod
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23691 -0.12513 -0.02289 0.13280 0.25479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.393960 0.171536 2.297 0.05526 .
## x 0.035509 0.008927 3.978 0.00534 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1967 on 7 degrees of freedom
## Multiple R-squared: 0.6933, Adjusted R-squared: 0.6494
## F-statistic: 15.82 on 1 and 7 DF, p-value: 0.00534
\(\bullet\) The summary gives the least-squares line for prediction, \(\hat y = 0.39396 + 0.035509\cdot x\): a one-unit increase in x is associated with an increase in y of 0.035509. Next, from the Std. Error column, the summary gives \(SE(\hat \beta_0) = 0.171536\), \(SE(\hat \beta_1) = 0.008927\), and degrees of freedom \(n - 2 = 7\). In the Multiple R-squared part, the summary gives \(R^2 = 0.6933\), indicating that approximately 69.33% of the observed variation in y is explained by the simple linear regression relationship between x and y.
# Step 2: Confidence interval for intercept and slope
conf = confint(linearmodel); round(conf, 4)
## 2.5 % 97.5 %
## (Intercept) -0.0117 0.7996
## x 0.0144 0.0566
# For intercept, the first value is estimate, the second value is SE
c(summary_mod$coefficients[1,1],summary_mod$coefficients[1,2])
## [1] 0.3939595 0.1715359
conf_intercept = c(summary_mod$coefficients[1,1] + qt(0.025, 7) * summary_mod$coefficients[1,2],
summary_mod$coefficients[1,1] + qt(0.975, 7) * summary_mod$coefficients[1,2])
round(conf_intercept, 4) # This will give the same result as the confint() function
## [1] -0.0117 0.7996
# For slope, the first value is estimate, the second value is SE
c(summary_mod$coefficients[2,1],summary_mod$coefficients[2,2])
## [1] 0.035509163 0.008927328
conf_slope = c(summary_mod$coefficients[2,1] + qt(0.025, 7) * summary_mod$coefficients[2,2],
summary_mod$coefficients[2,1] + qt(0.975, 7) * summary_mod$coefficients[2,2])
round(conf_slope, 4) # This will give the same result as the confint() function
## [1] 0.0144 0.0566
# Step 3: Predicted interval for each observation
predval = predict(linearmodel); round(predval, 4)
## 1 2 3 4 5 6 7 8 9
## 1.3385 1.3172 1.3669 1.1645 0.9230 0.7952 0.9266 0.7029 0.6851
\(\bullet\) From confint(), we are approximately 95% confident that the intercept is between -0.0117 and 0.7996, and approximately 95% confident that the slope is between 0.0144 and 0.0566. In particular, we can extract the estimates and standard errors from the summary table and, with the aid of the qt() function for the t critical value, compute the confidence intervals by hand.
\(\bullet\) From predict(), we obtain the fitted (predicted) values for the 9 observations after performing the simple linear regression.
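\(\bullet\) As a quick sanity check (a minimal sketch reusing the x and linearmodel objects defined above, not part of the original output), the fitted values can be reproduced directly from the estimated coefficients:
# Reproduce the fitted values from the estimated coefficients
fitted_by_hand = coef(linearmodel)[1] + coef(linearmodel)[2] * x
round(fitted_by_hand, 4) # should match round(predval, 4) above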
# Step 4: Inference on the mean response, and Prediction interval for future observation
# a) Let us assume we want to know the inference when x = 30
newdata = data.frame(x=30)
# Confidence interval for mean response
conf30 = predict(linearmodel, newdata, interval="confidence"); round(conf30,4)
## fit lwr upr
## 1 1.4592 1.1578 1.7606
# Prediction interval for one future observation
pred30 = predict(linearmodel, newdata, interval="predict"); round(pred30, 4)
## fit lwr upr
## 1 1.4592 0.905 2.0135
newdata1 = data.frame(x=7)
conf7 = predict(linearmodel, newdata1, interval="confidence");round(conf7, 4)
## fit lwr upr
## 1 0.6425 0.3676 0.9175
pred7 = predict(linearmodel, newdata1, interval="predict"); round(pred7, 4)
## fit lwr upr
## 1 0.6425 0.1022 1.1828
\(\bullet\) By creating a new data frame and calling predict(), we obtain that when \(x=30\), the confidence interval for the mean response is (1.1578, 1.7606) and the prediction interval for a future observation is (0.905, 2.0135). By the same method, when \(x=7\), the confidence interval for the mean response is (0.3676, 0.9175) and the prediction interval for a future observation is (0.1022, 1.1828).
\(\bullet\) We see that the prediction interval for a future observation is wider than the confidence interval for the mean response in both cases. Do you remember the formulas used to derive these two intervals? See the notes below before attempting the homework. Good luck!
# Step 5: Plot Diagnostics
# Scatterplot
scatter.smooth(x, y, main="Regression model: y ~ x") # scatterplot
# Boxplot (No outlier in this case)
par(mfrow=c(1, 2)) # divide graph area in 2 columns
boxplot(x, main="Boxplot for x", sub=paste("Outlier rows: ", boxplot.stats(x)$out)) # Boxplot for x
boxplot(y, main="Boxplot for y", sub=paste("Outlier rows: ", boxplot.stats(y)$out)) # Boxplot for y
# Q-Q plot
par(mfrow=c(1, 2))
qqnorm(x, xlim = c(-3,3))
qqline(x, lwd=2)
qqnorm(y, xlim = c(-3,3))
qqline(y, lwd=2)
\(\bullet\) Correlation, Coefficient of Determination: See class note 21.
\(\bullet\) JMP outputs in Lecture note and ANOVA: See class note 21.
\(\bullet\) Simple Linear Model: The simple linear model (true model for least-squares regression) is given by:
\[\begin{equation} \boxed{y_i = \beta_0 + \beta_1 x_i + \epsilon_i \ \ \text{or} \ \ Y = \beta_0 + \beta_1 X + \epsilon} \end{equation}\], where \(y_i\) is the observed value of the response variable, \(\beta_0\) is the true (population) intercept, \(\beta_1\) is the true slope of the model, and \(\epsilon_i\) is the error term for the ith measurement.
\(\bullet\) 1) We need the observations to be independent.
\(\bullet\) 2) We need the errors \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\) to be random and independent. In particular, the magnitude of any one error term should not influence the values of the other error terms. If the residuals appear to be monotonically increasing or decreasing, or show a U-shape or inverted-U shape, the assumptions of the linear model are violated.
\(\bullet\) 3) We need the error terms to be normally distributed, with mean 0 and the same variance (homoscedasticity). Notationally, \(\epsilon_i \overset{{\scriptstyle\text{i.i.d.}}}{\sim} N(0,\sigma^2)\).
If the normality assumption holds, then the response \(y_i\) also follows a normal distribution. Notationally, \(y_i \overset{{\scriptstyle\text{ind.}}}{\sim} N(\beta_0 + \beta_1x_i,\sigma^2)\) for \(i=1,2,\ldots,n\) (the \(y_i\) are independent but not identically distributed, since their means differ).
I will provide only the formulas for \(\hat \beta_0\), \(\hat \beta_1\), \(SE(\hat \beta_0)\), and \(SE(\hat \beta_1)\) here, skipping the proofs for simplicity. I will upload side notes with the derivations for further study after the final examination. If you are interested in regression, take STATS 500 or other related courses.
\[\begin{equation} \boxed{\hat \beta_1 = \frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2}} \end{equation}\] \[\begin{equation} \boxed{\hat \beta_0 = \bar y - \hat \beta_1 \bar x} \end{equation}\] \[\begin{equation} \boxed{SE(\hat \beta_1) = \sqrt{Var(\frac{\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2})} = \frac{\sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}}} \end{equation}\] \[\begin{equation} \boxed{SE(\hat \beta_0) = \sqrt{Var(\bar y - \hat \beta_1 \bar x)} = \sigma \cdot \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}}} \end{equation}\] \[\begin{equation} \boxed{\hat \sigma^2 = s^2 = MSE = (RMSE)^2 = \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}} \end{equation}\]\(\bullet\) For Intercept, our hypothesis is \(H_0: \beta_0 =0 \ \ vs \ \ H_1:\beta_0 \ne 0\).
\(\bullet\) We perform standardization for \(\hat \beta_0 \rightarrow \frac{Observed - Mean}{SD} = \frac{\hat \beta_0 - \beta_0}{SE(\hat\beta_0)}\). This quantity follows the t-distribution with \(n-2\) degrees of freedom, because the degrees of freedom are computed as (sample size - number of coefficients estimated, \(\beta_0\) and \(\beta_1\)), which is \(n-2\).
\(\bullet\) If the p-value is smaller than the significance level \(\alpha\), we reject the null hypothesis and have sufficient evidence to conclude that the intercept is significantly different from 0.
\(\bullet\) For the \(100(1-\alpha) \%\) confidence interval of the intercept (you already know how to derive the CLB and CUB), we have: \(\hat \beta_0 \pm t_{n-2, \frac{\alpha}{2}} \cdot SE(\hat\beta_0)\). Then, we are approximately \(100(1-\alpha) \%\) confident that the intercept falls (between/below/above) ……….
\(\bullet\) For Slope, our hypothesis is \(H_0: \beta_1 =0 \ \ vs \ \ H_1:\beta_1 \ne 0\).
\(\bullet\) We perform standardization for \(\hat \beta_1 \rightarrow \frac{Observed - Mean}{SD} = \frac{\hat \beta_1 - \beta_1}{SE(\hat\beta_1)}\). This quantity follows the t-distribution with \(n-2\) degrees of freedom, because the degrees of freedom are computed as (sample size - number of coefficients estimated, \(\beta_0\) and \(\beta_1\)), which is \(n-2\).
\(\bullet\) If the p-value is smaller than the significance level \(\alpha\), we reject the null hypothesis and have sufficient evidence to conclude that the slope (regression effect) is significantly different from 0. In this case, the predictor is significant in our linear model.
\(\bullet\) On the other hand, if the p-value is greater than the significance level \(\alpha\), we fail to reject the null hypothesis and do not have sufficient evidence to conclude that the slope (regression effect) is different from 0. In this case, the predictor can be dropped from our linear model.
\(\bullet\) For the \(100(1-\alpha) \%\) confidence interval of the slope (you already know how to derive the CLB and CUB), we have: \(\hat \beta_1 \pm t_{n-2, \frac{\alpha}{2}} \cdot SE(\hat\beta_1)\). Then, we are approximately \(100(1-\alpha) \%\) confident that the slope falls (between/below/above) ……….
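\(\bullet\) As a minimal sketch (reusing the x, y, and summary_mod objects from the example above), the slope estimate, its standard error, the t statistic, and the two-sided p-value can be computed directly from these formulas and compared with the R summary:
# Hand computation of beta1-hat, SE(beta1-hat), t statistic and p-value
n = length(x)
Sxx = sum((x - mean(x))^2)
beta1_hat = sum((y - mean(y)) * (x - mean(x))) / Sxx
beta0_hat = mean(y) - beta1_hat * mean(x)
s = sqrt(sum((y - (beta0_hat + beta1_hat * x))^2) / (n - 2)) # RMSE
se_beta1 = s / sqrt(Sxx)
t_stat = beta1_hat / se_beta1
p_value = 2 * pt(-abs(t_stat), df = n - 2)
round(c(beta1_hat, se_beta1, t_stat, p_value), 5) # compare with summary_mod$coefficients[2, ]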
Recap: \(Y|x_i \sim N(\beta_0 +\beta_1x_i, \sigma^2)\), \(Y|x^* \sim N(\beta_0 +\beta_1x^*, \sigma^2)\).
\(\bullet\) For \(\hat Y = \hat \beta_0 + \hat \beta_1 x^* \rightarrow E[\hat Y] = \beta_0 + \beta_1 x^* \ \ , \ \ Var(\hat Y) = \sigma^2 \Big[\frac{1}{n} + \frac{(x^*-\bar x)^2}{\sum_{i=1}^n(x_i - \bar x)^2}\Big]\)
\(\bullet\) Now for the estimated standard deviation of \(\hat Y\), we have
\[\begin{equation} \boxed{s_{\hat Y} = s \cdot \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}} \ \ , \ \ s = RMSE = \sqrt{MSE} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{n-2}} = \sqrt{\frac{(1-r^2)\sum_{i=1}^n (y_i - \bar y)^2}{n-2}}} \end{equation}\]\(\bullet\) Therefore, for hypothesis testing, we use the test statistic \(\frac{\hat Y - (\beta_0 + \beta_1 x^*)}{s_{\hat Y}} \sim t_{n-2}\); remember we have \(n-2\) degrees of freedom because we are estimating the intercept \(\beta_0\) and the slope \(\beta_1\). In the JMP output, the \(n-2\) degrees of freedom appear as the Error DF in the Analysis of Variance table.
\(\bullet\) To construct \(100(1-\alpha)\%\) confidence interval for mean response, we use
\[\begin{equation} \boxed{(\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot s_{\hat Y} = (\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot \sqrt{MSE \cdot \Big[\frac{1}{n} + \frac{(x^* - \bar x)^2}{Sum \ of \ Square}\Big]}} \end{equation}\]\(\bullet\) The MSE in JMP can be found in Analysis of Variance \(\rightarrow\) Mean Square \(\rightarrow\) Error row, while \(\sum_{i=1}^n (x_i - \bar x)^2\) can be found in the Corrected SS in Summary Statistics.
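\(\bullet\) As a minimal sketch (reusing x, linearmodel, and summary_mod from the example above), this formula reproduces the confidence interval for the mean response at \(x^*=30\) computed earlier with predict():
# Confidence interval for the mean response at x* = 30, by hand
n = length(x)
s = summary_mod$sigma # RMSE = sqrt(MSE)
Sxx = sum((x - mean(x))^2) # corrected sum of squares
xstar = 30
yhat = sum(coef(linearmodel) * c(1, xstar)) # estimated mean response
se_fit = s * sqrt(1/n + (xstar - mean(x))^2 / Sxx)
round(yhat + c(-1, 1) * qt(0.975, n - 2) * se_fit, 4) # should match conf30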
Since the error in prediction is \(y - \hat y = (\beta_0 + \beta_1 x^* + \epsilon) - (\hat \beta_0 + \hat \beta_1 x^*)\), the prediction interval for a future observation has more variability, which makes it wider than a confidence interval for the mean response.
\(\bullet\) Since the future observation \(Y_{new} = \beta_0 + \beta_1x^* + \epsilon\) is assumed to be independent of the observed values, we have \(cov(Y_{new}, \hat Y) = 0\). Therefore, the error of prediction has variance:
\[\begin{equation} \boxed{Var(Y_{new} - \hat Y) = Var(Y_{new}) - 2cov(Y_{new}, \hat Y) + Var(\hat Y) = Var(\beta_0 + \beta_1 x^* + \epsilon) + Var(\hat Y)} \end{equation}\] \[\begin{equation} \boxed{=Var(\epsilon) + \sigma^2 \cdot [\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i -\bar x)^2}] = \sigma^2 + \sigma^2 \cdot [\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i -\bar x)^2}] = \sigma^2 \cdot [1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i -\bar x)^2}]} \end{equation}\]\(\bullet\) Therefore, by the same logic as for the mean response, for the estimated standard deviation of \(Y_{new}\), we have
\[\begin{equation} \boxed{s_{Y_{new}} = \sqrt{s^2_{\hat Y} + s^2} = s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}} = (RMSE) \cdot \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{Sum \ of \ Square}}} \end{equation}\]\(\bullet\) Therefore, for hypothesis testing, we use the test statistic \(\frac{Y_{new} - (\hat \beta_0 + \hat \beta_1 x^*)}{s_{Y_{new}}} \sim t_{n-2}\); remember we have \(n-2\) degrees of freedom because we are estimating the intercept \(\beta_0\) and the slope \(\beta_1\). In the JMP output, the \(n-2\) degrees of freedom appear as the Error DF in the Analysis of Variance table.
\(\bullet\) To construct \(100(1-\alpha)\%\) prediction interval for future observations, we use
\[\begin{equation} \boxed{(\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot s_{Y_{new}} = (\hat \beta_0 + \hat \beta_1 x^*) \pm t_{\frac{\alpha}{2}, n-2} \cdot \sqrt{MSE \cdot \Big[1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{Sum \ of \ Square}\Big]}} \end{equation}\]
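\(\bullet\) Similarly (a minimal sketch reusing the same objects from the first example), the prediction interval for a future observation at \(x^*=30\) differs only by the extra 1 under the square root:
# Prediction interval for a future observation at x* = 30, by hand
se_new = summary_mod$sigma * sqrt(1 + 1/length(x) + (30 - mean(x))^2 / sum((x - mean(x))^2))
round(sum(coef(linearmodel) * c(1, 30)) + c(-1, 1) * qt(0.975, length(x) - 2) * se_new, 4) # should match pred30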
We now turn to a multiple regression example using the uswages data from the faraway package.
library(faraway)
data(uswages)
attach(uswages)
exper[exper == -2] = NA; exper[exper == -1] = NA # Set invalid (negative) experience values to NA
\(\bullet~~\) Similar to what was done in Assignment 1, I set the negative values of years of experience to NA, which removes 33 observations from the regression analysis.
# Linear regression and statistical inference w.r.t wage, education and experience
lm.fit1=lm(wage~ educ+exper); summary(lm.fit1)
##
## Call:
## lm(formula = wage ~ educ + exper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1014.7 -235.2 -52.1 150.1 7249.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -239.1146 50.7111 -4.715 2.58e-06 ***
## educ 51.8654 3.3423 15.518 < 2e-16 ***
## exper 9.3287 0.7602 12.271 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 426.8 on 1964 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.1348, Adjusted R-squared: 0.1339
## F-statistic: 153 on 2 and 1964 DF, p-value: < 2.2e-16
\(\bullet~~\) Referring to the coefficients, the regression model with weekly wage as the response and years of education and experience as predictors is:
\[wage = -239.11 + 51.87\cdot(education) + 9.33 \cdot(experience)\]
which implies that education and experience are positively associated with wage. In particular, each additional year of education or experience is associated with an increase of approximately 51.87 dollars or 9.33 dollars in weekly wage, respectively (see the example prediction after these bullets).
\(\bullet~~\) The \(R^2 \approx 0.135\) (Multiple R-squared) implies that approximately 13.5% of the variability of the wage is explained by the above regression model.
\(\bullet~~\) The p-values of the predictors are extremely small, so there is strong evidence of a linear relationship between the response and each predictor (common criteria are p < 0.01 or p < 0.05).
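\(\bullet~~\) As an illustration (a sketch with hypothetical predictor values, not part of the assignment), the fitted model can be used to predict the weekly wage of a worker with 12 years of education and 10 years of experience, together with a confidence interval for the mean wage:
# Hypothetical example: predicted weekly wage for educ = 12, exper = 10
predict(lm.fit1, newdata = data.frame(educ = 12, exper = 10), interval = "confidence")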
# Finding the object with the largest positive residual and its value
residuals(lm.fit1)[which.max(residuals(lm.fit1))]
## 1576
## 7249.174
\(\bullet~~\) As the result suggests, the 1576th observation has the largest positive residual and its value is 7249.174.
# Finding mean and median of the residuals
mean(residuals(lm.fit1)); median(residuals(lm.fit1))
## [1] -1.381535e-15
## [1] -52.14337
\(\bullet~~\) The mean of the residuals is \(-1.38 \times 10^{-15}\) and the median of the residuals is -52.14. The mean is essentially 0, as it must be for a least-squares fit with an intercept; it is not exactly zero only because of floating-point rounding in R.
\(\bullet~~\) Clearly, the mean of the residuals is greater than the median. This positive difference indicates that the distribution of the response (weekly wage) is skewed to the right: a few people's weekly wages are far above their predicted values, while most people's wages fall somewhat below their predicted values.
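\(\bullet~~\) A quick way to visualize this right skew (a minimal sketch, not required by the assignment) is a histogram of the residuals:
# Histogram of the residuals: a long right tail indicates right skew
hist(residuals(lm.fit1), breaks = 40, main = "Residuals of lm.fit1", xlab = "Residual")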
# Performing residual analysis: Plotting residual and finding non-linearity
par(mfrow=c(1,3))
plot(predict(lm.fit1), residuals(lm.fit1))
plot(predict(lm.fit1), rstudent(lm.fit1))
plot(hatvalues(lm.fit1))
\(\bullet~~\) The above figures show the residuals, the studentized residuals, and the leverages of the fitted observations.
\(\bullet~~\) The first two plots show non-constant variance in the error terms: the spread of the residuals increases with the fitted values. This funnel shape (heteroscedasticity) indicates that a transformation of the current model is likely needed.
\(\bullet~~\) In this context, we may consider an observation a possible outlier if its studentized residual exceeds 3 in absolute value.
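\(\bullet~~\) As a quick check (a sketch applying this rule of thumb), the observations whose studentized residuals exceed 3 in absolute value can be listed directly:
# Observations with |studentized residual| > 3 (possible outliers)
which(abs(rstudent(lm.fit1)) > 3)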
# Finding the correlation of the residuals with the fitted values
zapsmall(cor(fitted(lm.fit1), residuals(lm.fit1)))
## [1] 6.35678e-17
\(\bullet~~\) As expected, the correlation between the fitted values and the residuals is essentially zero, since least-squares residuals are orthogonal to the fitted values by construction.
\(\bullet~~\) To answer question 6, referring to the model \[wage = -239.11 + 51.87\cdot(education) + 9.33 \cdot(experience)\] we see that two people with the same education and a one-year difference in experience will differ by approximately 9.33 dollars in predicted weekly wage.
\(\bullet~~\) Since the above linear model shows heteroscedasticity, we now perform a log transformation.
# Log-linear regression and statistical inference w.r.t wage, education and experience
loglm.fit1=lm(log(wage)~ educ+exper); summary(loglm.fit1)
##
## Call:
## lm(formula = log(wage) ~ educ + exper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7527 -0.3383 0.1002 0.4297 3.5728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.675518 0.077085 60.65 <2e-16 ***
## educ 0.091940 0.005080 18.10 <2e-16 ***
## exper 0.016516 0.001156 14.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6488 on 1964 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.1747, Adjusted R-squared: 0.1739
## F-statistic: 207.9 on 2 and 1964 DF, p-value: < 2.2e-16
# Performing residual analysis for log-linear regression model
par(mfrow=c(1,3))
plot(predict(loglm.fit1), residuals(loglm.fit1))
plot(predict(loglm.fit1), rstudent(loglm.fit1))
plot(hatvalues(loglm.fit1))
\(\bullet~~\) We now have the log-linear model \[ln(wage) = 4.676 + 0.092\cdot(education) + 0.017 \cdot(experience)\] which means that each additional year of experience is associated with an increase of approximately 0.017 in log(weekly wage).
\(\bullet~~\) Comparing the two pairs of residual plots, the transformed model shows no particular pattern, while the original model shows non-constant variance. Additionally, the studentized residuals of the transformed model shrink to within about 4 in absolute value, indicating fewer extreme observations on the log scale.
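\(\bullet~~\) Because the response is on the log scale, the coefficients can also be read multiplicatively (a minimal sketch): exponentiating each slope gives the approximate multiplicative change in weekly wage per one-unit increase in the predictor.
# Multiplicative effect on wage per additional year of education / experience
round(exp(coef(loglm.fit1)[-1]), 4)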
Next, we analyze the sat data from the faraway package.
library(faraway)
attach(sat)
lm.fit1 = lm(total~ takers+ratio+salary, sat)
summary(lm.fit1)$coef; summary(lm.fit1)$r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1057.898162 44.3286685 23.864876 1.496236e-27
## takers -2.913350 0.2282436 -12.764215 1.007685e-16
## ratio -4.639428 2.1215142 -2.186847 3.387652e-02
## salary 2.552470 1.0045183 2.540989 1.449126e-02
## [1] 0.8238643
\(\bullet~~\) Referring to the coefficients, the regression model with total SAT score as the response and percentage of takers, pupil-teacher ratio, and teacher's annual salary as predictors is:
\[total = 1057.90 - 2.91\cdot(\% takers) - 4.64 \cdot(ratio) + 2.55\cdot (salary)\]
which means the changes in total SAT score associated with a one-unit increase in takers, ratio, and salary (holding the other predictors constant) are -2.91, -4.64, and +2.55, respectively.
\(\bullet~~\) The \(R^2\) value is 0.8239 in this linear model, so approximately 82.4% of the variability of the total SAT score is explained by the above regression model.
\(\bullet~~\) Next, we test the hypothesis \(H_0: \beta_{salary}=0\) against \(H_a: \beta_{salary} \neq0\) using the formula:
\[\frac{|\hat \beta_{salary}-0|}{se(\hat \beta_{salary})}= \frac{2.5525}{1.0045}=2.541>t_{46}^{0.025}\approx 2.013\]
Therefore, at significance level \(\alpha=0.05\), we reject \(H_0\) and conclude that salary is a significant factor for total SAT score and cannot be removed from the model.
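\(\bullet~~\) This hand computation can be checked directly in R (a minimal sketch reusing lm.fit1 fitted to the sat data):
# Compare the t statistic for salary with the 0.025 critical value on 46 df
t_salary = summary(lm.fit1)$coef["salary", "t value"]
c(t_statistic = t_salary, t_critical = qt(0.975, df = df.residual(lm.fit1)))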
\(\bullet~~\) Then we test the hypothesis \(H_0: \beta_{takers}=\beta_{ratio}=\beta_{salary}=0\) against \(H_a: at ~least~one ~factor \neq 0\). We use the following R commands:
h0_1 = lm(total~ 1) # Null hypothesis: all 3 factors are insignificant
h0a_1 = lm(total~ takers+ratio+salary)
anova(h0_1,h0a_1)
## Analysis of Variance Table
##
## Model 1: total ~ 1
## Model 2: total ~ takers + ratio + salary
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 274308
## 2 46 48315 3 225992 71.721 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(\bullet~~\) From the ANOVA, since \(Pr(>F) < 2.2 \times10^{-16} < 0.05\), we reject \(H_0\) and claim that at least one of the three factors is significant for the total SAT score.
\(\bullet~~\) For question 2 (generating confidence intervals), since we already have the standard errors from the summary table above, i.e., \(se(\hat \beta_{salary}) = 1.0045\), \(se(\hat \beta_{takers})=0.2282\), and \(se(\hat \beta_{ratio})=2.1215\), we can compute the 95% and 99% confidence intervals of the coefficients using the t-distribution with df = 46 (\(t_{46}^{0.025}=2.013\), \(t_{46}^{0.005}=2.687\)).
For example, the 95% confidence interval of salary is given by:
\[\hat \beta_{salary}-2.013\cdot se(\hat \beta_{salary})\leq \beta_{salary} \leq \hat \beta_{salary}+2.013\cdot se(\hat \beta_{salary}) \rightarrow \beta_{salary} \in [0.5304, 4.5746]\]
\(\bullet~~\) In general, using R command, the 95% and 99% confidence intervals are computed by:
list(confint(lm.fit1, level=0.95), confint(lm.fit1, level=0.99))
## [[1]]
## 2.5 % 97.5 %
## (Intercept) 968.6691802 1147.1271438
## takers -3.3727807 -2.4539197
## ratio -8.9098147 -0.3690414
## salary 0.5304797 4.5744605
##
## [[2]]
## 0.5 % 99.5 %
## (Intercept) 938.786432 1177.009892
## takers -3.526644 -2.300057
## ratio -10.339965 1.061109
## salary -0.146684 5.251624
\(\bullet~~\) Using only these intervals, we see that the lower confidence bound for \(\beta_{salary}\) is positive at the 95% level but becomes negative at the 99% level, which means the p-value for salary in the regression must lie between 0.01 and 0.05.
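\(\bullet~~\) This deduction can be verified by extracting the p-value directly (a minimal sketch):
# The p-value for salary should lie between 0.01 and 0.05
summary(lm.fit1)$coef["salary", "Pr(>|t|)"]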
library(ellipse)
## Warning: package 'ellipse' was built under R version 3.4.3
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
##
## pairs
# Plot the confidence region and add the estimates
plot(ellipse(lm.fit1, c("ratio","salary")), type="l", xlim=c(-10,1))
points(lm.fit1$coef['ratio'], lm.fit1$coef['salary'], pch=18)
points(0,0, pch=1) # Add the origin to the plot
abline(v=confint(lm.fit1)['ratio',], lty=2) # Add vertical lines marking the 95% CI for ratio
abline(h=confint(lm.fit1)['salary',], lty=2) # Add horizontal lines marking the 95% CI for salary
\(\bullet~~\) Since the confidence region does not contain the origin (0,0), at significance level \(\alpha = 0.05\) (this confidence region is 95% by default), we reject the null hypothesis \(H_0:\beta_{ratio}=\beta_{salary}=0\) and conclude that at least one of ratio and salary is statistically significant with respect to the total SAT score (\(H_a:at ~least~one ~factor \neq0\)).
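\(\bullet~~\) The same joint hypothesis about ratio and salary can also be tested numerically with a partial F-test (a minimal sketch; its conclusion should agree with the confidence region above):
# Partial F-test of H0: beta_ratio = beta_salary = 0
anova(lm(total ~ takers, sat), lm.fit1)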
lm.fit2 = lm(total~ takers+ratio+salary+expend)
summary(lm.fit2)$coef; summary(lm.fit2)$r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.971536 52.869760 19.7839283 7.857530e-24
## takers -2.904481 0.231260 -12.5593745 2.606559e-16
## ratio -3.624232 3.215418 -1.1271418 2.656570e-01
## salary 1.637917 2.387248 0.6861110 4.961632e-01
## expend 4.462594 10.546528 0.4231339 6.742130e-01
## [1] 0.8245623
\(\bullet~~\) Now, the regression model with total SAT score as the response and percentage of takers, pupil-teacher ratio, teacher's annual salary, and average expenditure as predictors is:
\[total = 1045.97 - 2.90\cdot(\% takers) - 3.62 \cdot(ratio) + 1.64\cdot (salary) + 4.46\cdot (expend)\]
\(\bullet~~\) Compared with the model in question 1, the coefficients of takers, ratio, and salary change slightly but keep the same signs. Notably, a one-unit increase in expend is associated with an increase of about 4.46 points in total SAT score.
\(\bullet~~\) In terms of significance, the p-values of ratio and salary increase greatly to levels exceeding 0.05 (the usual threshold for declaring a predictor statistically significant), while takers remains highly significant; a quick collinearity check is shown after these bullets.
\(\bullet~~\) The \(R^2\) remains nearly the same as in the first model, i.e., \(R^2=0.8246\) here versus \(R^2 = 0.8239\) previously (approximately 82.5% of the variability of total SAT score is now explained by the model).
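\(\bullet~~\) A likely explanation for the inflated p-values is collinearity among the school-resource variables; a quick check of their pairwise correlations (a minimal sketch) makes this easy to see:
# Pairwise correlations among the predictors in lm.fit2
round(cor(sat[, c("takers", "ratio", "salary", "expend")]), 2)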
\(\bullet~~\) To test the null hypothesis \(H_0: \beta_{salary}=\beta_{expend}=\beta_{ratio}=0\) against \(H_a: at ~least~one ~factor \neq0\), we use the R commands:
h0_2 = lm(total~ takers) # Reduced model under the null hypothesis: ratio, salary and expend are all dropped
h0a_2 = lm(total~ takers+ratio+salary+expend)
anova(h0_2,h0a_2)
## Analysis of Variance Table
##
## Model 1: total ~ takers
## Model 2: total ~ takers + ratio + salary + expend
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 58433
## 2 45 48124 3 10309 3.2133 0.03165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(\bullet~~\) From the ANOVA, since \(Pr(>F) = 0.03165 <0.05\), at significance level 0.05 we reject \(H_0\) and claim that at least one of the three factors is significant for the total SAT score.
\(\bullet~~\) Using the first model and substituting the predictor values \(takers=35\) and \(ratio=17\), we simplify the model to
\[total = 1057.90 - 2.91\cdot(35) - 4.64 \cdot(17) + 2.55\cdot (salary) \rightarrow total = 877.17 + 2.55\cdot (salary)\]
result = lm(total~ takers + ratio + salary, sat)
grid = seq(25,52,1)
conf_line = predict(result, data.frame(salary=grid, takers = 35, ratio = 17), interval="confidence")
pred_line = predict(result, data.frame(salary=grid, takers = 35, ratio = 17), interval="prediction")
matplot(grid, cbind(conf_line, pred_line[,-1]), lty= c(1,2,2,3,3), col=1,
type ="l", xlab="salary", ylab="total") # Plot a matrix
rug(sat$salary) # Add a rug of the observed salary values along the x-axis
\(\bullet~~\) The solid line in the center represents the fitted total SAT score as salary varies, with takers and ratio held constant; since the model is linear, the line is straight.
\(\bullet~~\) The inner and outer curves represent the 95% confidence interval for the mean response and the 95% prediction interval for an actual total SAT score observation, respectively. As expected, the prediction interval is always wider than the confidence interval.
\(\bullet~~\) Discussion: The narrowest 95% confidence interval occurs at a salary of approximately 36, and both intervals widen toward the two tails, where salary is far from its mean. This suggests the possibility that splitting the data into two models (teacher's salary below 36,000 per year and above 36,000 per year) might yield more reasonable results.
Please let me know if you find any remaining typos or grammatical mistakes. Writing these notes is great practice, and I appreciate your help!