4  Sampling Distribution of the Least Squares Estimators and Testing the Slope

“If you do not know how to ask the right question, you discover nothing.” - W. Edwards Deming

4.1 Properties of the constants \(k_i\) and \(c_i\)

In the previous section, we examined how the least squares estimators are linear combinations of the response variable \(y\). Let’s now look at the properties of the coefficients in Equation 3.1 and Equation 3.2. We will not present the proofs in this course, but they are not complicated.

The coefficients \(k_{i}\) have the following properties: \[ \begin{align} \sum k_{i} & =0 \end{align} \tag{4.1}\]

\[ \begin{align} \sum k_{i}x_i & =1 \end{align} \tag{4.2}\]

\[ \begin{align} \sum k_{i}^{2} & =\frac{1}{\sum \left(x_i-\bar{x}\right)^{2}} \end{align} \tag{4.3}\]

Likewise, the coefficients \(c_{i}\) have the following properties: \[ \begin{align} \sum c_{i} & =1 \end{align} \tag{4.4}\]

\[ \begin{align} \sum c_{i}x_i & =0 \end{align} \tag{4.5}\]

\[ \begin{align} \sum c_{i}^{2} & =\frac{1}{n}+\frac{\left(\bar{x}\right)^{2}}{\sum \left(x_i-\bar{x}\right)^{2}} \end{align} \tag{4.6}\]
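As a quick numerical check, the sketch below computes the \(k_i\) and \(c_i\) for an arbitrary set of \(x\) values, using \(k_i=\left(x_i-\bar{x}\right)/\sum\left(x_i-\bar{x}\right)^{2}\) from Equation 3.1 and assuming the usual form \(c_i=1/n-\bar{x}k_i\) for Equation 3.2, and verifies Equations 4.1 through 4.6 up to rounding.

# numerical check of Equations 4.1-4.6 for an arbitrary x vector
x    <- c(2, 4, 5, 7, 11)              # any values will do
n    <- length(x)
SSxx <- sum((x - mean(x))^2)
ki   <- (x - mean(x)) / SSxx           # k_i from Equation 3.1
ci   <- 1/n - mean(x) * ki             # assumed form of c_i (Equation 3.2)
sum(ki)                                # 0       (Equation 4.1)
sum(ki * x)                            # 1       (Equation 4.2)
c(sum(ki^2), 1 / SSxx)                 # equal   (Equation 4.3)
sum(ci)                                # 1       (Equation 4.4)
sum(ci * x)                            # 0       (Equation 4.5)
c(sum(ci^2), 1/n + mean(x)^2 / SSxx)   # equal   (Equation 4.6)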

Review: The Expected Value of a Linear Combination

Recall that the expected value of a linear combination of the random variable \(Y\) is \[ E(aY+b)=aE(Y)+b \] where \(a\) and \(b\) are constants.
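A minimal Monte Carlo sketch of this rule, with arbitrary constants and an arbitrary normal \(Y\):

# Monte Carlo check that E(aY + b) = aE(Y) + b
set.seed(1)
a <- 3; b <- -2
Y <- rnorm(1e5, mean = 5, sd = 2)   # E(Y) = 5
mean(a * Y + b)                     # approximately a * 5 + b = 13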

4.2 Expected Values of \(b_0\) and \(b_1\)

Before finding the expectations, recall \(E\left(y_i\right)=\beta_{0}+\beta_{1}x_i\).

4.2.1 Expected Value of \(b_1\)

The expected value of \(b_1\) is \[ \begin{align*} E\left[b_1\right] & =E\left[\underbrace{\sum k_{i}y_i}_{(3.1)}\right]\\ & =\sum k_{i}\left(\beta_{0}+\beta_{1}x_i\right)\\ & =\beta_{0}\underbrace{\sum k_{i}}_{(4.1)}+\beta_{1}\underbrace{\sum k_{i}x_i}_{(4.2)}\\ & =\beta_{1} \end{align*} \]

4.2.2 Expected Value of \(b_0\)

The expected value of \(b_0\) is \[ \begin{align*} E\left[b_0\right] & =E\left[\underbrace{\sum c_{i}y_i}_{(3.2)}\right]\\ & =\sum c_{i}\left(\beta_{0}+\beta_{1}x_i\right)\\ & =\beta_{0}\underbrace{\sum c_{i}}_{(4.4)}+\beta_{1}\underbrace{\sum c_{i}x_i}_{(4.5)}\\ & =\beta_{0} \end{align*} \]

Therefore, \(b_0\) is an unbiased estimator of \(\beta_0\) and \(b_1\) is an unbiased estimator of \(\beta_1\).

Review: Unbiased Estimator

Recall that an unbiased estimator for some parameter is an estimator that has an expected value equal to that parameter.
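To see unbiasedness empirically, here is a small simulation sketch. The true values \(\beta_0=10\), \(\beta_1=2\), and \(\sigma=3\) are arbitrary choices; the averages of the least squares estimates over many simulated samples land close to them.

# simulation sketch: b0 and b1 are unbiased
set.seed(42)
beta0 <- 10; beta1 <- 2; sigma <- 3           # arbitrary true values
x <- runif(25, 0, 10)                         # x values held fixed
sims <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(25, sd = sigma)
  coef(lm(y ~ x))
})
rowMeans(sims)                                # close to c(10, 2)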

4.3 Variances of \(b_0\) and \(b_1\)

To find the variances, we will use a result from mathematical statistics: Let \(Y_{1},\ldots,Y_{n}\) be uncorrelated random variables and let \(a_{1},\ldots,a_{n}\) be constants. Then \[ \begin{align} Var\left[\sum a_{i}Y_i\right] & =\sum a_{i}^{2}Var\left[Y_i\right] \end{align} \tag{4.7}\]

Recall that we assume the response variables \(y_i\)’s are independent.

Technically, we assume the \(y_i\)’s are uncorrelated. In general, uncorrelated does not imply independent. However, if the random variables are jointly normally distributed (recall our third assumption of the model), then uncorrelated does imply independent.

Also, note that \[ \begin{align*} Var\left[Y\right]& = Var\left[\beta_0 + \beta_1 x + \varepsilon\right]\\ & = Var\left[\varepsilon\right]\\ & = \sigma^2 \end{align*} \]
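Before applying these facts in the next two subsections, here is a quick Monte Carlo sketch of Equation 4.7 with three independent normal random variables and arbitrary constants \(a_i\):

# Monte Carlo check that Var(sum a_i Y_i) = sum a_i^2 Var(Y_i)
set.seed(7)
a      <- c(1, -2, 0.5)
sigma2 <- c(4, 1, 9)                                     # Var(Y_1), Var(Y_2), Var(Y_3)
sims   <- replicate(1e5, sum(a * rnorm(3, sd = sqrt(sigma2))))
var(sims)                                                # empirical variance
sum(a^2 * sigma2)                                        # theoretical value: 10.25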

4.3.1 Variance of \(b_1\)

The variance of \(b_1\) is \[ \begin{align} Var\left[b_1\right] & =Var\left[\underbrace{\sum k_{i}y_i}_{(3.1)}\right]\\ & =\underbrace{\sum k_{i}^{2}}_{(4.3)}Var\left[y_i\right]\\ & =\frac{\sigma^{2}}{\sum \left(x_i-\bar{x}\right)^{2}} \end{align} \tag{4.8}\]

4.3.2 Variance of \(b_0\)

The variance of \(b_0\) is \[ \begin{align} Var\left[b_0\right] & =Var\left[\underbrace{\sum c_{i}y_i}_{(3.2)}\right]\nonumber\\ & =\underbrace{\sum c_{i}^{2}}_{(4.6)}Var\left[y_i\right]\\ & =\sigma^{2}\left[\frac{1}{n}+\frac{\left(\bar{x}\right)^{2}}{\sum \left(x_i-\bar{x}\right)^{2}}\right] \end{align} \tag{4.9}\]
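Mirroring the earlier unbiasedness sketch, we can compare the empirical variances of \(b_0\) and \(b_1\) across simulated samples with Equation 4.8 and Equation 4.9 (the true values are again arbitrary):

# simulation sketch: empirical variances of b0, b1 vs Equations 4.8 and 4.9
set.seed(42)
beta0 <- 10; beta1 <- 2; sigma <- 3
x    <- runif(25, 0, 10)
SSxx <- sum((x - mean(x))^2)
sims <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(25, sd = sigma)
  coef(lm(y ~ x))
})
apply(sims, 1, var)                           # empirical Var(b0), Var(b1)
sigma^2 * (1/25 + mean(x)^2 / SSxx)           # Equation 4.9
sigma^2 / SSxx                                # Equation 4.8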

4.4 Best Linear Unbiased Estimators (BLUEs)

We see from Equation 3.1 and Equation 3.2 that \(b_0\) and \(b_1\) are linear estimators.

Any estimator for \(\beta_{1}\), which we will denote as \(\hat{\beta}_{1}\), that takes the form \[ \begin{align*} \hat{\beta}_{1} & =\sum a_{i}y_i \end{align*} \] where the \(a_{i}\) are constants, is called a linear estimator.

Of all linear estimators for \(\beta_0\) and \(\beta_1\) that are unbiased, the least squares estimators, \(b_0\) and \(b_1\), have the smallest variance.

This is summarized in the following well-known theorem:

Theorem 4.1 (Gauss-Markov Theorem) For the simple linear regression model, the least squares estimators \(b_0\) and \(b_1\) are unbiased and have minimum variance among all unbiased linear estimators.

An estimator that is linear, unbiased, and has the smallest variance of all unbiased linear estimators is called the best linear unbiased estimator (BLUE).

Proof of the Gauss-Markov Theorem:

For all linear estimators that are unbiased, we must have \[ \begin{align*} E\left[\hat{\beta}_{1}\right] & =E\left[\sum a_{i}y_i\right]\\ & =\sum a_{i}E\left[y_i\right]\\ & =\beta_{1} \end{align*} \] Since \(E\left[y_i\right]=\beta_{0}+\beta_{1}x_i\), we must have \[ \begin{align*} E\left[\hat{\beta}_{1}\right] & =\sum a_{i}\left(\beta_{0}+\beta_{1}x_i\right)\\ & =\beta_{0}\sum a_{i}+\beta_{1}\sum a_{i}x_i\\ & =\beta_{1} \end{align*} \] Therefore, \[ \begin{align*} \sum a_{i} & =0\\ \sum a_{i}x_i & =1 \end{align*} \] We now examine the variance of \(\hat{\beta}_{1}\): \[ \begin{align*} Var\left[\hat{\beta}_{1}\right] & =\sum a_{i}^{2}Var\left[y_i\right]\\ & =\sigma^{2}\sum a_{i}^{2} \end{align*} \] Let’s now define \(a_{i}=k_{i}+d_{i}\) where \(k_{i}\) is defined in Equation 3.1 and the \(d_{i}\) are arbitrary constants.

We will show that adding such constants (whether negative or positive) to the \(k_i\) cannot make the variance smaller. Thus, the variance of the linear estimator \(\hat{\beta}_1\) is smallest when \(a_i=k_i\).

The variance of \(\hat{\beta}_{1}\) can now be written as \[ \begin{align*} Var\left[\hat{\beta}_{1}\right] & =\sigma^{2}\sum a_{i}^{2}\\ & =\sigma^{2}\sum\left(k_{i}+d_{i}\right)^{2}\\ & =\sigma^{2}\sum\left(k_{i}^{2}+2k_{i}d_{i}+d_{i}^{2}\right)\\ & =Var\left[b_1\right]+2\sigma^{2}\sum k_{i}d_{i}+\sigma^{2}\sum d_{i}^{2} \end{align*} \] Examining the second term and using the expression of \(k_{i}\) in Equation 3.1, we see that \[ \begin{align*} \sum k_{i}d_{i} & =\sum k_{i}\left(a_{i}-k_{i}\right)\\ & =\sum a_{i}k_{i}-\underbrace{\sum k_{i}^{2}}_{(4.3)}\\ & =\sum a_{i}\frac{x_i-\bar{x}}{\sum\left(x_i-\bar{x}\right)^{2}}-\frac{1}{\sum\left(x_i-\bar{x}\right)^{2}}\\ & =\frac{\sum a_{i}x_i-\bar{x}\sum a_{i}}{\sum\left(x_i-\bar{x}\right)^{2}}-\frac{1}{\sum\left(x_i-\bar{x}\right)^{2}}\\ & =\frac{1-\bar{x}\left(0\right)}{\sum\left(x_i-\bar{x}\right)^{2}}-\frac{1}{\sum\left(x_i-\bar{x}\right)^{2}}\\ & =0 \end{align*} \]

We now have the variance of \(\hat{\beta}_{1}\) as \[ \begin{align*} Var\left[\hat{\beta}_{1}\right] & =Var\left[b_1\right]+\sigma^{2}\sum d_{i}^{2} \end{align*} \] This variance is minimized when \(\sum d_{i}^{2}=0\) which only happens when \(d_{i}=0\).

Thus, the unbiased linear estimator with the smallest variance is obtained when \(a_{i}=k_{i}\). That is, the least squares estimator \(b_1\) in Equation 3.1 has the smallest variance of all unbiased linear estimators of \(\beta_{1}\).

A similar argument can be used to show that \(b_0\) has the smallest variance of all unbiased linear estimators of \(\beta_{0}\).
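The heart of the proof can also be illustrated numerically. In the sketch below, the \(x\) values and \(\sigma\) are arbitrary, and the perturbation \(d_i\) is any vector satisfying \(\sum d_i=0\) and \(\sum d_i x_i=0\) (so the perturbed estimator stays unbiased); the perturbed variance always exceeds \(Var\left[b_1\right]\) by \(\sigma^{2}\sum d_i^{2}\).

# numerical sketch of the Gauss-Markov argument
set.seed(3)
x     <- c(1, 3, 4, 6, 8, 9)                 # arbitrary x values
sigma <- 2
SSxx  <- sum((x - mean(x))^2)
k     <- (x - mean(x)) / SSxx                # least squares weights (Equation 3.1)

# residuals from a regression on x automatically satisfy
# sum(d) = 0 and sum(d * x) = 0, so a = k + d stays unbiased
d <- resid(lm(rnorm(length(x)) ~ x))
a <- k + d

sigma^2 * sum(k^2)                           # Var(b1)
sigma^2 * sum(a^2)                           # variance of the perturbed estimator
sigma^2 * (sum(k^2) + sum(d^2))              # same thing, since sum(k * d) = 0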

4.5 Sampling Distribution for \(b_1\)

Now that we have established that the least squares estimator \(b_1\) is the BLUE for \(\beta_{1}\), let’s examine the sampling distribution of \(b_1\).

We previously discussed that the mean of the sampling distribution of \(b_1\) is \[ E[b_1]=\beta_1 \] with a variance of \[ \begin{align} Var\left[b_1\right] & =\frac{\sigma^{2}}{\sum \left(x_i-\bar{x}\right)^{2}} \end{align} \tag{4.10}\]

Note that in our model with our four assumptions, \(y\) is normally distributed. That is, \[ \begin{align} y\sim N\left(\beta_0+\beta_1 x, \sigma^2\right) \end{align} \tag{4.11}\]

To learn about the sampling distributions of the least squares estimators, we will use the following theorems from mathematical statistics:

Theorem 4.2 (Sum of Independent Normal Random Variables) If \[ Y_i\sim N\left(\mu_i,\sigma_i^2\right) \] are independent, then the linear combination \(\sum_i a_iY_i\) is also normally distributed where \(a_i\) are constants. In particular \[ \sum_i a_iY_i \sim N\left(\sum_i a_i\mu_i, \sum_i a_i^2\sigma_i^2\right) \]

Theorem 4.3 (Adding a Constant to a Normal Random Variable) If \[ Y\sim N\left(\mu,\sigma^2\right) \] then for any real constant \(c\), \[ Y+c\sim N\left(\mu+c,\sigma^2\right) \]

Since \(y\) is normally distributed by Equation 4.11 and \(b_1=\sum k_{i}y_i\) is a linear combination of the \(y_i\), we can apply Theorem 4.2, which implies that \(b_1\) is normally distributed. That is, \[ \begin{align} b_1 & \sim N\left(\beta_{1},\frac{\sigma^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}\right) \end{align} \tag{4.12}\]

4.5.1 Standardized Score

Since \(b_1\) is normally distributed, we can standardize it so that the resulting statistic will have a standard normal distribution.

Therefore, we have \[ \begin{align} z=\frac{b_1-\beta_{1}}{\sqrt{\frac{\sigma^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}}} & \sim N\left(0,1\right) \end{align} \tag{4.13}\]

4.5.2 Studentized Score

In practice, the standardized score \(z\) is not useful since we do not know the value of \(\sigma^{2}\). We can estimate \(\sigma^{2}\) with the statistic \[ s^2 = \frac{SSE}{n-2} \]

Using this estimate for \(\sigma^2\) leads us to a \(t\)-score: \[ \begin{align} t=\frac{b_1-\beta_{1}}{\sqrt{\frac{s^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}}} & \sim t\left(n-2\right) \end{align} \tag{4.14}\]

We call this \(t\) statistic the studentized score.

It is important to note the following theorem from math stats presented here without proof:

Theorem 4.4 (Distribution of the sample variance of the residuals) For the sample variance of the residuals \(s^{2}\), the quantity \[\begin{align*} \frac{\left(n-2\right)s^{2}}{\sigma^{2}} & =\frac{SSE}{\sigma^{2}} \end{align*}\] is distributed as a chi-square distribution with \(n-2\) degrees of freedom. That is, \[\begin{align*} \frac{SSE}{\sigma^{2}} & \sim\chi^{2}\left(n-2\right) \end{align*}\]

We will use another important theorem from math stats (again presented without proof):

Theorem 4.5 (Ratio of independent standard normal and chi-square statistics) If \(Z\sim N\left(0,1\right)\) and \(W\sim\chi^{2}\left(\nu\right)\), and \(Z\) and \(W\) are independent, then the statistic \[\begin{align*} \frac{Z}{\sqrt{\frac{W}{\nu}}} \end{align*}\] is distributed as a Student’s \(t\) distribution with \(\nu\) degrees of freedom.

Take \(Z\) to be the standardized score in Equation 4.13 and \(W=\left(n-2\right)s^{2}/\sigma^{2}\), which by Theorem 4.4 is chi-square with \(n-2\) degrees of freedom (and can be shown to be independent of \(b_1\), a fact we accept without proof). Dividing the standardized score by \[\begin{align*} \sqrt{\frac{\frac{\left(n-2\right)s^{2}}{\sigma^{2}}}{\left(n-2\right)}} & =\sqrt{\frac{s^{2}}{\sigma^{2}}} \end{align*}\] gives us \[\begin{align*} t & =\frac{\frac{b_1-\beta_{1}}{\sqrt{\frac{\sigma^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}}}}{\sqrt{\frac{s^{2}}{\sigma^{2}}}}\\ & =\frac{b_1-\beta_{1}}{\sqrt{\frac{\sigma^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}}\sqrt{\frac{s^{2}}{\sigma^{2}}}}\\ & =\frac{b_1-\beta_{1}}{\sqrt{\frac{s^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}}} \end{align*}\] which, by Theorem 4.5, has a Student’s \(t\) distribution with \(n-2\) degrees of freedom.
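The following simulation sketch (with arbitrary true values) checks both Theorem 4.4 and the resulting studentized score by comparing a few empirical quantiles of \(SSE/\sigma^{2}\) and \(t\) with the corresponding \(\chi^{2}\left(n-2\right)\) and \(t\left(n-2\right)\) quantiles.

# simulation sketch: SSE/sigma^2 vs chi-square(n-2), studentized score vs t(n-2)
set.seed(99)
n <- 15; beta0 <- 1; beta1 <- 0.5; sigma <- 2   # arbitrary true values
x    <- seq(1, 10, length.out = n)
SSxx <- sum((x - mean(x))^2)
sims <- replicate(10000, {
  y   <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  sse <- sum(resid(fit)^2)
  c(chisq = sse / sigma^2,
    t     = (coef(fit)[[2]] - beta1) / sqrt((sse / (n - 2)) / SSxx))
})
probs <- c(0.05, 0.50, 0.95)
rbind(quantile(sims["chisq", ], probs), qchisq(probs, df = n - 2))   # close agreement
rbind(quantile(sims["t", ], probs),     qt(probs, df = n - 2))       # close agreement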

4.6 Assessing the Utility of the Model: Making Inferences About the Slope

Suppose that the independent variable \(x\) is completely unrelated to the dependent variable \(y\).

What could be said about the values of \(\beta_0\) and \(\beta_1\) in the hypothesized probabilistic model \[\begin{align*} y = \beta_0 +\beta_1 x + \varepsilon \end{align*}\] if \(x\) contributes no information for the prediction of \(y\)?

The implication is that the mean of \(y\) does not change as \(x\) changes. In other words, the regression line would just be a horizontal line.

If \(E(y)\) does not change as \(x\) increases, then using \(x\) to predict \(y\) in the linear model is not useful.

Regardless of the value of \(x\), you always predict the same value of \(y\). In the straight-line model, this means that the true slope, \(\beta_1\), is equal to 0.

Therefore, to test the null hypothesis that \(x\) contributes no information for the prediction of \(y\) against the alternative hypothesis that these variables are linearly related with a slope differing from 0, we test \[\begin{align*} H_0:\beta_1 = 0\\ H_a:\beta_1\ne 0 \end{align*}\]

If the data support the alternative hypothesis, we conclude that \(x\) does contribute information for the prediction of \(y\) using the straight-line model (although the true relationship between \(E(y)\) and \(x\) could be more complex than a straight line). Thus, to some extent, this is a test of the utility of the hypothesized model.

The appropriate test statistic is the studentized score given above, evaluated at the hypothesized value \(\beta_1=0\): \[ \begin{align} t &= \frac{b_1-\beta_{1}}{\sqrt{\frac{s^{2}}{\sum\left(x_i-\bar{x}\right)^{2}}}}\\ &=\frac{b_1}{\sqrt{\frac{s^{2}}{SS_{xx}}}} \end{align} \tag{4.15}\] where \(SS_{xx}=\sum\left(x_i-\bar{x}\right)^{2}\).

Another way to make inferences about the slope \(\beta_1\) is to estimate it using a confidence interval \[ \begin{align} b_1 \pm \left(t_{\alpha/2}\right)s_{b_1} \end{align} \tag{4.16}\] where \(t_{\alpha/2}\) is based on \(n-2\) degrees of freedom and \[\begin{align*} s_{b_1} = \frac{s}{\sqrt{SS_{xx}}} \end{align*}\]

We can obtain the p-value for the hypothesis test by using the summary function with an lm object. Let's revisit the previous example with the mtcars data.

Example 4.1 (Example 3.1 - revisited)  

library(tidyverse)

fit <- lm(mpg ~ wt, data = mtcars)

We find the least squares estimates as

summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

From the output, we see the p-value for the slope is \(1.29\times 10^{-10}\), so we have sufficient evidence to conclude that the true population slope is different from zero.
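We can also reproduce the \(t\) value and p-value for wt by hand using Equation 4.15; this sketch simply re-computes what summary already reports.

# reproduce the t statistic and p-value for the slope by hand (Equation 4.15)
b1     <- coef(fit)[["wt"]]
SSxx   <- sum((mtcars$wt - mean(mtcars$wt))^2)
s2     <- sum(resid(fit)^2) / (nrow(mtcars) - 2)     # s^2 = SSE/(n - 2), same as sigma(fit)^2
t_stat <- b1 / sqrt(s2 / SSxx)
t_stat                                               # -9.559
2 * pt(-abs(t_stat), df = nrow(mtcars) - 2)          # 1.294e-10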

To find the confidence interval, we can use the confint function with the lm object.

confint(fit, level = 0.95)
                2.5 %    97.5 %
(Intercept) 33.450500 41.119753
wt          -6.486308 -4.202635

We are 95% confident that the true population slope is in the interval \((-6.486, -4.203)\).
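As a check, here is a short sketch computing the same interval directly from Equation 4.16:

# 95% confidence interval for the slope by hand (Equation 4.16)
b1   <- coef(fit)[["wt"]]
SSxx <- sum((mtcars$wt - mean(mtcars$wt))^2)
s_b1 <- sigma(fit) / sqrt(SSxx)                      # estimated standard error of b1
b1 + c(-1, 1) * qt(0.975, df = nrow(mtcars) - 2) * s_b1
# -6.486308 -4.202635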