14  Multicollinearity and Principal Component Regression

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John Tukey

14.1 Multicollinearity

Often, two or more of the independent variables used in the model for \(E(y)\) will contribute redundant information. That is, the independent variables will be correlated with each other.

For example, suppose we want to construct a model to predict the gasoline mileage rating, \(y\), of a truck as a function of its load, \(x_1\), and the horsepower, \(x_2\), of its engine.

In general, you would expect heavier loads to require greater horsepower and to result in lower mileage ratings.

Thus, although both \(x_1\) and \(x_2\) contribute information for the prediction of mileage rating, some of the information is overlapping, because \(x_1\) and \(x_2\) are correlated.

When the independent variables are correlated, we say that multicollinearity exists.

Multicollinearity does not affect the ability to obtain a good fit, to make inferences on the mean response, or to predict the response. However, inferences for the coefficients (the \(\beta\)s) and for the model variance (\(\sigma^2\)) are affected by large correlations among the predictor variables.

Another effect of multicollinearity concerns the interpretation of the estimated coefficients. In multiple regression, we interpret a coefficient as the average change in \(y\) when the corresponding \(x\) is increased by one unit while all other predictor variables are held constant.

If \(x\) is highly correlated with one or more of the other predictor variables, then it may not be feasible to think of varying \(x\) when the others are constant.

14.1.1 Variance Inflation Factors

We can see evidence of multicollinearity by examining the scatterplot matrix, since this gives us a plot of each pair of predictor variables. If there are pairs that appear to be highly correlated (ggpairs in R will also give the correlation values), then multicollinearity is present.

We could also examine \(R_{a}^{2}\) for models with and without certain pairs of variables. If \(R_{a}^{2}\) decreases when a particular \(x\) variable is added but it appears to have a strong linear relationship with \(y\) in the scatterplot matrix, then this is evidence of multicollinearity.

A more convenient way to examine multicollinearity is through the use of the variance inflation factors (VIF).

Each predictor variable will have a VIF. Suppose we are interested in the VIF for \(x_{1}\). We start by regressing \(x_{1}\) on all the other predictor variables. Thus, we fit the model \[ \begin{align*} x_{i1} & =\alpha_{0}+\alpha_{2}x_{i2}+\alpha_{3}x_{i3}+\cdots+\alpha_{p-1}x_{i,p-1}+\epsilon \end{align*} \] where the \(\alpha\)’s are the coefficients and \(\epsilon\) is the random error term.

Now find the coefficient of multiple determination for this model which we will denote as \(R_{1}^{2}\). The VIF for \(x_{1}\) is then \[ \begin{align*} VIF_{1} & =\frac{1}{1-R_{1}^{2}}. \end{align*} \] We can do this for any \(i\)th predictor variable so that the VIF for that variable is \[ \begin{align} VIF_{i} & =\frac{1}{1-R_{i}^{2}} \end{align} \tag{14.1}\] where \(R_{i}^{2}\) is the coefficient of multiple determination for the regression fit of \(x_{i}\) on all the other predictor variables.

A rule of thumb is that a VIF greater than 10 is evidence that multicollinearity is high when that variable is added to the model. Some use a cutoff of 5 instead of 10.
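To make Equation 14.1 concrete, here is a minimal sketch that computes one VIF by hand and checks it against vif() from the car package. It uses the built-in mtcars data (not one of the datasets analyzed below) purely for illustration.

```r
library(car)

# VIF for disp by hand: regress disp on the other predictors,
# then invert 1 - R^2 (Equation 14.1)
fit_disp = lm(disp ~ hp + wt, data = mtcars)
R2_disp  = summary(fit_disp)$r.squared
VIF_disp = 1 / (1 - R2_disp)

# car::vif() on the full model reports the same value for disp
full_fit = lm(mpg ~ disp + hp + wt, data = mtcars)
vif(full_fit)
```

The same computation applies to each predictor in turn: regress it on the remaining predictors and invert \(1-R_i^2\).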

Example 14.1 (UN98 data) One approach to seeing which variables are correlated with each other is to remove a variable with a large VIF and see which variables had the largest change in their VIF.

We will illustrate this process with the UN98 dataset from the car package. We will not use the region and GDPperCapita variables for this example.

library(tidyverse)
library(tidymodels)
library(car)
library(GGally)

#explore correlation and scatterplots between pairs of variables
UN98 |> 
  select(-region, -GDPperCapita) |> 
  ggpairs()

#prepare data
dat_recipe = recipe(infantMortality ~ ., 
                    data = UN98) |> 
  step_rm(region, GDPperCapita)

#setup model
lm_model = linear_reg() |>
  set_engine("lm")

#setup the workflow
lm_workflow = workflow() |>
  add_recipe(dat_recipe) |> 
  add_model(lm_model)

#fit the model
lm_fit = lm_workflow |>
  fit(data = UN98)

lm_fit |> tidy()
# A tibble: 11 × 5
   term                   estimate std.error statistic p.value
   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)            111.        48.9       2.27  0.0309 
 2 tfr                     12.0        3.87      3.11  0.00423
 3 contraception           -0.0709     0.137    -0.517 0.609  
 4 educationMale            5.98       3.11      1.92  0.0647 
 5 educationFemale         -6.41       2.76     -2.32  0.0278 
 6 lifeMale                -0.405      1.11     -0.365 0.718  
 7 lifeFemale              -0.757      1.32     -0.575 0.570  
 8 economicActivityMale    -0.383      0.264    -1.45  0.158  
 9 economicActivityFemale   0.172      0.128     1.34  0.190  
10 illiteracyMale          -0.544      0.392    -1.39  0.176  
11 illiteracyFemale         0.321      0.313     1.03  0.314  
lm_fit |> glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.944         0.924  9.40      47.5 6.75e-15    10  -136.  296.  316.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
#get the VIFs
lm_fit |> extract_fit_engine() |> vif()
                   tfr          contraception          educationMale 
             14.319844               3.331552              23.833590 
       educationFemale               lifeMale             lifeFemale 
             27.561149              42.279211              74.707620 
  economicActivityMale economicActivityFemale         illiteracyMale 
              2.059083               2.049210              17.476821 
      illiteracyFemale 
             24.355614 

We see that a number of variables have high VIFs. The highest belongs to lifeFemale. If we remove this variable, what happens to the VIFs of the other variables?

#take out the lifeFemale variable
lm_workflow2= lm_workflow |> 
  update_recipe(
    dat_recipe |> step_rm(lifeFemale)
  )

#fit the model
lm_fit2 = lm_workflow2 |>
  fit(data = UN98)

#get the VIFs
lm_fit2 |> extract_fit_engine() |> vif()
                   tfr          contraception          educationMale 
              7.650182               3.151511              23.768388 
       educationFemale               lifeMale   economicActivityMale 
             27.547206               6.053211               2.040815 
economicActivityFemale         illiteracyMale       illiteracyFemale 
              1.989025              15.271120              20.160209 

We see that removing lifeFemale leads to a substantial decrease in the VIFs of a couple of the variables. The VIFs of tfr and lifeMale both decrease by more than five points when lifeFemale is removed. From the scatterplot matrix above, we see that tfr and lifeMale have the highest correlation with lifeFemale. If we are deciding whether to keep lifeFemale in the model, we can base the decision on practical grounds: are any of the variables that are highly correlated with lifeFemale easier to obtain? If so, we should keep that variable and remove the other.

Let’s now look at another variable with a high VIF: educationFemale. We again start from the model with all of the variables.

#take out the educationFemale variable
lm_workflow3= lm_workflow |> 
  update_recipe(
    dat_recipe |> step_rm(educationFemale)
  )

#fit the model
lm_fit3 = lm_workflow3 |>
  fit(data = UN98)

#get the VIFs
lm_fit3 |> extract_fit_engine() |> vif()
                   tfr          contraception          educationMale 
             14.173794               3.303755               3.425892 
              lifeMale             lifeFemale   economicActivityMale 
             42.269363              74.669828               1.898826 
economicActivityFemale         illiteracyMale       illiteracyFemale 
              2.000282              15.192202              18.021121 

We see that the VIF for educationMale dropped substantially when educationFemale was removed. Again, the decision of which variable to remove from the model is a practical one.

We can continue this process of identifying which variables are highly correlated with other variables.

14.1.2 Effects on Inferences

Recall from Equation 13.3 that we can obtain the variances of the least squares estimators from the diagonal of \({\bf s}^{2}\left[{\bf b}\right]\).

It can be shown that this variance can be expressed in terms of VIF. So the variance of \(b_{j}\) can be expressed as \[ \begin{align} s^{2}\left[b_{j}\right] & =MSE\frac{VIF_{j}}{\left(n-1\right)\widehat{Var}\left[x_{j}\right]} \end{align} \tag{14.2}\] where \(\widehat{Var}\left[x_{j}\right]\) is the sample variance of \(x_j\).

So we see that if \(VIF_{j}\) is large (meaning there is multicollinearity when \(x_{j}\) is included in the model), then the standard error will be inflated.

Recall that the test statistic for testing \(\beta_{j}=0\) in Equation 13.5 is \[ \begin{align*} t^{*} & =\frac{b_{j}}{s\left[b_{j}\right]}. \end{align*} \]

So an inflated standard error \(s\left[b_{j}\right]\) will lead to a smaller \(t^{*}\) in absolute value and thus a larger p-value. This can cause us to conclude there is not enough evidence for the alternative hypothesis when in fact \(\beta_{j}\ne0\).
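As a quick numerical check of Equation 14.2, we can recover the standard error of a coefficient from the MSE, the VIF, and the sample variance of the predictor. This is a sketch using the built-in mtcars data, chosen only for illustration.

```r
library(car)

full_fit = lm(mpg ~ disp + hp + wt, data = mtcars)

MSE = summary(full_fit)$sigma^2   # estimate of sigma^2
n   = nrow(mtcars)

# Equation 14.2 applied to the disp coefficient
se_disp = sqrt(MSE * vif(full_fit)["disp"] / ((n - 1) * var(mtcars$disp)))

# matches the standard error reported by summary()
summary(full_fit)$coefficients["disp", "Std. Error"]
```

Holding the MSE and the variance of \(x_j\) fixed, a tenfold increase in \(VIF_j\) inflates the standard error by a factor of \(\sqrt{10}\approx3.2\).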

14.1.3 Effects on CI and PI for the Response

As stated above, multicollinearity does not affect the confidence interval for the mean response or the prediction interval.

We will illustrate this with the bodyfat data below.

Example 14.2 (bodyfat data)  

library(tidyverse)
library(tidymodels)
library(car)
library(GGally)

dat = read_table("BodyFat.txt")


#examine the scatterplot matrix
ggpairs(dat)

We see from this scatterplot matrix that the predictor variables have some high correlations between them. Thus, we already anticipate a problem with multicollinearity.

#prepare data
dat_recipe = recipe(bfat ~ tri + thigh + midarm, 
                    data = dat) 

#setup model
lm_model = linear_reg() |>
  set_engine("lm")

#setup the workflow
lm_workflow = workflow() |>
  add_recipe(dat_recipe) |> 
  add_model(lm_model)

#fit the model
lm_fit = lm_workflow |>
  fit(data = dat)

fit_full = lm_fit |>  extract_fit_engine()

fit_full |> vif()
     tri    thigh   midarm 
708.8429 564.3434 104.6060 

From the VIFs, we see that there is clear multicollinearity among all three variables.

The highest VIF is tri. From the scatterplot matrix above, we see tri is most correlated with thigh. Let’s remove tri and see what happens.

lm_workflow_no_tri= lm_workflow |> 
  update_recipe(
    dat_recipe |> step_rm(tri)
  )


#fit the model
lm_fit_no_tri = lm_workflow_no_tri |>
  fit(data = dat)

fit_no_tri= lm_fit_no_tri |>  extract_fit_engine()

fit_no_tri |> vif()
  thigh  midarm 
1.00722 1.00722 

We see that once tri is removed, the remaining variables have small VIFs.

We are not suggesting that tri should definitely be removed. It may still be the best predictor of bfat. There are other ways to determine if only having tri is preferred over a model with just the other two variables. We will discuss those methods later. For now, we do see that multicollinearity will be an issue for these three variables.

To see the effect on the estimated coefficients, let’s look at the standard errors for the model with all three variables and the model without tri.

fit_full |> tidy()
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   117.       99.8       1.17   0.258
2 tri             4.33      3.02      1.44   0.170
3 thigh          -2.86      2.58     -1.11   0.285
4 midarm         -2.19      1.60     -1.37   0.190
fit_no_tri |> tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept) -26.0        7.00     -3.72  0.00172    
2 thigh         0.851      0.112     7.57  0.000000772
3 midarm        0.0960     0.161     0.595 0.560      

Note how much larger the standard errors are for the full model (with multicollinearity) than the model without tri (no multicollinearity).

Larger standard errors lead to smaller \(t\) statistics and thus larger p-values. So multicollinearity can make variables appear insignificant when they are in fact significant.

Let’s now look at the RMSE (sigma in the glance() function output) which is used in the confidence interval of the mean response and prediction interval formulas.

fit_full |> glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.801         0.764  2.48      21.5 0.00000734     3  -44.3  98.6  104.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
fit_no_tri |> glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.776         0.749  2.56      29.4 0.00000303     2  -45.5  99.1  103.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Note that the RMSE is not much different between the two models; it is not affected by multicollinearity. So if all we want from our model is estimation of the mean response and prediction, then multicollinearity is not an issue. If, however, we also want to determine which variables are important for estimation and prediction, then multicollinearity is an issue.

14.2 Standardizing Predictor Variables

In linear regression, predictor variables are often transformed to ensure that they are on a comparable scale. This is especially important when predictors have vastly different units or magnitudes, as it can influence the stability and interpretability of the model.

A common technique for transforming predictors is standardization, where each predictor variable is rescaled to have a mean of 0 and a standard deviation of 1. This section will explore the concept of standardization, how it differs from normalization, and the implications for model performance, especially in the context of multicollinearity.

14.2.1 Standardization vs. Normalization

While the terms “standardization” and “normalization” are sometimes used interchangeably, they describe distinct mathematical operations. Standardization transforms a variable \(X\) to have a mean of 0 and a standard deviation of 1, using the following formula:

\[ z_i = \frac{x_i - \bar{x}}{s_x} \]

where \(z_i\) is the standardized value, \(x_i\) is the original value, \(\bar{x}\) is the mean of the variable, and \(s_x\) is its standard deviation. This process ensures that the transformed variable has mean 0 and standard deviation 1; note that it does not change the shape of the distribution, so the result is not necessarily normally distributed.

Normalization, on the other hand, typically refers to scaling the data to a specific range, such as [0, 1]. This is done using the formula:

\[ x' = \frac{x_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} \]

Normalization is most useful when the range of the variables needs to be constrained for certain algorithms, such as in neural networks. In contrast, standardization is generally preferred in linear models, where the focus is on centering and rescaling predictors to ensure interpretability of coefficients.
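A small sketch contrasting the two transformations on a toy vector:

```r
x = c(2, 4, 6, 8, 10)

# standardization: mean 0, standard deviation 1
z = (x - mean(x)) / sd(x)

# normalization: rescaled to the range [0, 1]
x_norm = (x - min(x)) / (max(x) - min(x))

c(mean = mean(z), sd = sd(z))   # 0 and 1
range(x_norm)                   # 0 and 1
```

In R, scale(x) performs the same standardization as the first formula.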

14.2.2 Why tidymodels Uses the Term “Normalize”

In the tidymodels framework in R, the term “normalize” is used to describe what is technically a standardization process. For example, when creating a preprocessing recipe with the recipe() function, the step called step_normalize() computes the mean and standard deviation of each predictor and scales it accordingly. Although the terminology might be confusing, this usage reflects a common convention in some statistical software where both standardization and normalization are loosely referred to as normalization.

14.2.3 Standardization and Multicollinearity

By scaling all predictors to a common variance, the impact of large magnitude differences between variables is reduced, leading to more stable coefficient estimates. Standardization alone does not reduce correlations between variables. However, it provides numerical stability when using methods other than least squares, making it easier to detect and address multicollinearity issues. When multicollinearity is severe, other techniques, such as principal component regression, may be necessary.

14.3 Principal Component Regression

One effective method for addressing multicollinearity is Principal Component Regression (PCR), which combines Principal Component Analysis (PCA) and linear regression. The key idea behind PCR is to transform the original predictor variables into a smaller set of uncorrelated variables, called principal components (PCs), which capture the most variance in the data.

14.3.1 Introduction to Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that identifies the directions (called principal components) along which the variation in the data is maximized. These directions are orthogonal to each other and ranked by the amount of variance they capture. The first principal component explains the largest amount of variance in the data, followed by the second principal component, and so on. Each principal component is a linear combination of the original predictor variables.

In practical terms, PCA helps to reduce the dimensionality of the data, making it more manageable without losing too much information. This is particularly useful in cases where the number of predictors is large, and some of them are highly correlated.

Principal components (PCs) are linear transformations of the original predictor variables that aim to capture the maximum variance in the data. Here is a mathematical summary of how these components are computed:

  1. Standardizing the Data. Given a dataset with \(p\) predictors \(x_1, x_2, \dots, x_p\) and \(n\) observations, we first standardize each predictor variable to have mean 0 and standard deviation 1:

\[ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \]

where:

  • \(z_{ij}\) is the standardized value of the \(j\)-th predictor for the \(i\)-th observation,
  • \(\bar{x}_j\) is the mean of the \(j\)-th predictor, and
  • \(s_j\) is the standard deviation of the \(j\)-th predictor.

This step ensures that all variables are on the same scale, which is necessary for PCA.

  2. Computing the Covariance Matrix. Once the data is standardized, we compute the covariance matrix \(\mathbf{S}\) of the standardized predictors:

\[ \mathbf{S} = \frac{1}{n-1} \mathbf{Z}^T \mathbf{Z} \]

where \(\mathbf{Z}\) is the matrix of standardized data with \(n\) rows (observations) and \(p\) columns (predictors).

  3. Finding Eigenvalues and Eigenvectors. We solve the following eigenvalue problem for the covariance matrix \(\mathbf{S}\):

\[ \mathbf{S} \mathbf{v}_j = \lambda_j \mathbf{v}_j \]

where:

  • \(\lambda_j\) is the \(j\)-th eigenvalue of the covariance matrix, and
  • \(\mathbf{v}_j\) is the corresponding eigenvector (also called the loading vector) associated with \(\lambda_j\).

The eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_p\) represent the amount of variance explained by each principal component. The eigenvectors define the direction of the new coordinate axes (principal components).

  4. Constructing the Principal Components. The \(j\)-th principal component \(PC_j\) is a linear combination of the original standardized variables:

\[ PC_j = v_{j1} z_1 + v_{j2} z_2 + \cdots + v_{jp} z_p \]

where \(v_{jk}\) is the \(k\)-th element of the eigenvector \(\mathbf{v}_j\).

Each principal component is uncorrelated with the others and explains a decreasing amount of variance. Specifically:

  • PC1 (the first principal component) explains the largest possible variance.
  • PC2 explains the largest remaining variance subject to being orthogonal to PC1.
  • This process continues for all \(p\) components.
  5. Explained Variance. The proportion of the total variance explained by the \(j\)-th principal component is:

\[ \text{Explained Variance of } PC_j = \frac{\lambda_j}{\sum_{k=1}^p \lambda_k} \]

The cumulative variance explained by the first \(m\) components is:

\[ \text{Cumulative Explained Variance} = \frac{\sum_{j=1}^m \lambda_j}{\sum_{k=1}^p \lambda_k} \]
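The five steps above can be sketched with base R. This uses the built-in mtcars data for illustration; prcomp() bundles the same computation.

```r
Z   = scale(mtcars[, c("disp", "hp", "wt")])  # step 1: standardize
S   = cov(Z)                                  # step 2: covariance matrix of Z
eig = eigen(S)                                # step 3: eigenvalues and eigenvectors
PCs = Z %*% eig$vectors                       # step 4: principal components
eig$values / sum(eig$values)                  # step 5: proportion of variance explained

# prcomp() reproduces the same eigenvalues (variances of the PCs)
prcomp(mtcars[, c("disp", "hp", "wt")], scale. = TRUE)$sdev^2
```

Note that the covariance matrix of the PCs is diagonal with the eigenvalues on the diagonal, which is exactly the "uncorrelated components" property used below.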

In the context of regression, instead of regressing the response variable on the original set of predictors, we use the principal components as the new predictors. By selecting a subset of the principal components, we can retain most of the variability in the predictors while reducing multicollinearity.

Example 14.3 (bodyfat data again) The following steps will illustrate how to apply PCA to the predictors and then use the principal components in a regression model.

recipe = recipe(bfat ~ ., data = dat) |>
  step_normalize(all_predictors()) |>
  step_pca(all_predictors(), num_comp = 3)

#determine the parameters for any of the steps in the recipe
#in this case, we need the mean and std dev of each variable
#along with the values for principal components
prepped = recipe |> prep()

#apply the steps (with the prepared parameters found in prep)
#to the data
pca_data = prepped |> bake(dat)

# Print the principal components
pca_data
# A tibble: 20 × 4
    bfat      PC1     PC2       PC3
   <dbl>    <dbl>   <dbl>     <dbl>
 1  11.9 -1.63     1.10    0.0463  
 2  22.8 -0.193    0.264   0.0375  
 3  18.7  1.73     2.19   -0.0245  
 4  20.1  1.33     0.547  -0.00257 
 5  12.9 -1.62     1.62   -0.0363  
 6  21.7 -0.00515 -1.20    0.00331 
 7  27.1  1.72    -0.683  -0.0242  
 8  25.4  0.755    0.628   0.0327  
 9  21.3 -1.02    -0.947   0.0301  
10  19.3  0.0379  -0.891  -0.0448  
11  25.4  1.68     0.0702 -0.0153  
12  27.2  1.43    -0.349   0.000371
13  11.7 -1.92    -0.677  -0.0247  
14  17.8 -1.52     0.883  -0.0221  
15  12.8 -3.10    -0.734  -0.0177  
16  23.9  1.21     0.296   0.0176  
17  22.6  0.645   -0.843  -0.0184  
18  25.4  1.28    -1.42    0.0179  
19  14.8 -0.767    0.148   0.0302  
20  21.1 -0.0464  -0.0141  0.0148  

The output above shows the principal components derived from the predictors in the BodyFat dataset. Each principal component is a linear combination of the original predictors (tri, thigh, midarm), and together they capture the maximum variance in the data.

To visualize how PCA works and to understand the contribution of each principal component, let’s examine the explained variance.

# Extract the PCA results
pca_results = prepped |> tidy(number = 2, type="variance")
pca_results
# A tibble: 12 × 4
   terms                            value component id       
   <chr>                            <dbl>     <int> <chr>    
 1 variance                      2.07             1 pca_xWz8S
 2 variance                      0.933            2 pca_xWz8S
 3 variance                      0.000727         3 pca_xWz8S
 4 cumulative variance           2.07             1 pca_xWz8S
 5 cumulative variance           3.00             2 pca_xWz8S
 6 cumulative variance           3                3 pca_xWz8S
 7 percent variance             68.9              1 pca_xWz8S
 8 percent variance             31.1              2 pca_xWz8S
 9 percent variance              0.0242           3 pca_xWz8S
10 cumulative percent variance  68.9              1 pca_xWz8S
11 cumulative percent variance 100.               2 pca_xWz8S
12 cumulative percent variance 100                3 pca_xWz8S
# Visualize the explained variance
pca_results |>
  filter(terms == "percent variance") |>
  ggplot(aes(x = component, y = value)) +
  geom_bar(stat = "identity") +
  labs(title = "Explained Variance by Principal Component", 
       x = "Principal Components", 
       y = "Proportion of Variance Explained")

In this plot, we observe the proportion of variance explained by each principal component. The first principal component typically explains the most variance, followed by the second, and so on. Depending on the cumulative proportion of variance explained, we can select an appropriate number of components for our regression model.

Let’s now look at the scatterplot matrix of the PCA data. Note how the correlations between the PCs are zero.

ggpairs(pca_data)

Principal Component Regression simply uses these PCs as the predictor variables. We see from above that the third PC does not explain much variability. Thus, we can use just the first two PCs and save a degree of freedom, since we will be using one fewer coefficient in the model.

dat_recipe = recipe(bfat ~ ., data = dat) |>
  step_normalize(all_predictors()) |>
  step_pca(all_predictors(), num_comp = 2)

lm_model = linear_reg() |>
  set_engine("lm")

lm_workflow = workflow() |>
  add_recipe(dat_recipe) |> 
  add_model(lm_model)

lm_fit = lm_workflow |>
  fit(data = dat)

lm_fit |> tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    20.2      0.566     35.7  1.98e-17
2 PC1             2.94     0.404      7.27 1.30e- 6
3 PC2            -1.65     0.601     -2.75 1.38e- 2

Note how small the standard errors are for the coefficients now that we no longer have multicollinearity. Let’s verify that there is no multicollinearity by examining the VIFs.

fit = lm_fit |>  extract_fit_engine()

fit |> vif()
PC1 PC2 
  1   1 

Both VIFs are exactly one, indicating no multicollinearity. This is expected: the PCs are orthogonal by construction, so each \(R_{i}^{2}\) in Equation 14.1 is zero.

PCR gives us the ability to still fit the regression model even in the presence of extreme multicollinearity. Another benefit of PCR is that you can also reduce dimensionality by only using the PCs that explain most of the variability. Note that you still need all the original predictor variables since the PCs are linear combinations of these variables. Thus, this is not a method for removing predictor variables. So if our goal is to determine if a variable (that may be difficult or expensive to obtain) can be dropped, then PCR is not the tool we want to use. In addition, in PCR we lose interpretability of the coefficients.