Linear Regression

“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” (John Tukey)

\[ \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\text{Var}} \newcommand{\Cov}{\text{Cov}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathcal{N}} \newcommand{\MLE}{\text{MLE}} \renewcommand{\bar}{\overline} \newcommand{\supp}[1]{\text{supp}(#1)} \newcommand\independent{\perp\!\!\!\perp} \renewcommand{\d}[0]{\mathrm{d}} \newcommand{\pp}[2][]{\frac{\partial#1}{\partial#2}} \newcommand{\dd}[2][]{\frac{\mathrm d#1}{\mathrm d#2}} \]

Figure: Visualizing probability density in linear regression.

On the history of regression:

See Stigler (1981) on the invention of least squares, and Stigler (1997) discussing the history of linear “regression”.

Stigler, Stephen M. 1981. “Gauss and the Invention of Least Squares.” The Annals of Statistics 9 (3): 465–74. http://www.jstor.org/stable/2240811.
———. 1997. “Regression Towards the Mean, Historically Considered.” Statistical Methods in Medical Research 6 (2): 103–14. https://doi.org/10.1177/096228029700600202.

Linear Regression

Our typical problem is to estimate the linear relationship between \(Y\) and \(p\) covariates/predictors \((x_1, ..., x_p)\).

In modeling, we can distinguish between the systematic and random parts of a model.

Consider the model:

\[Y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, ..., n,\]

with the assumptions:

  1. (Size of Data) \(p < n\),
  2. (Mean-Zero Error) \(\E(\varepsilon_i) = 0\),
  3. (Homoscedasticity) \(\Var(\varepsilon_i) = \sigma^2\),
  4. (Uncorrelated Errors) \(\Cov(\varepsilon_i, \varepsilon_j) = 0\) for \(i \neq j\).

We can also write this model in matrix form: \[\mathbf Y = \mathbf X \pmb \beta + \pmb \varepsilon, \quad \text{ where } \] \[\mathbf Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad \mathbf X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix},\] \[\pmb \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \pmb \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n\end{bmatrix},\] so that \(\mathbf X\) is \(n \times (p+1)\), \(\pmb \beta\) is \((p+1)\)-dimensional, and \(\pmb \varepsilon\) is \(n\)-dimensional.
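As a concrete illustration, here is a minimal NumPy sketch that simulates data from this model; the sample size, number of predictors, coefficient values, and error scale below are arbitrary choices for the example, not anything specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                                  # sample size and number of predictors (arbitrary)
beta_true = np.array([2.0, 1.0, -0.5, 3.0])    # (beta_0, beta_1, ..., beta_p), chosen for illustration

predictors = rng.normal(size=(n, p))           # the x_{ij} values
X = np.column_stack([np.ones(n), predictors])  # n x (p+1) design matrix with an intercept column
eps = rng.normal(scale=1.0, size=n)            # mean-zero, homoscedastic, uncorrelated errors
y = X @ beta_true + eps                        # Y = X beta + epsilon
```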

We assume that this is the true model. Our task is to come up with \(\hat {\pmb \beta}\) such that \(\mathbf X \hat{\pmb \beta}\) is as close as possible to \(\mathbf y\).

We’ll use \(\mathbf y\) to refer to observed values and \(\mathbf Y\) to denote the corresponding random vector that generates these observations. That is to say, \(\E(\mathbf Y) = \mathbf X \pmb \beta\).

Euclidean distance is usually defined as \[d(\mathbf y, \mathbf X \pmb \beta) = \sqrt{(\mathbf y - \mathbf X \pmb \beta)^\top (\mathbf y - \mathbf X \pmb \beta)}.\]

Since it is easier to minimize a function that does not involve a square root, and squaring is a monotone transformation on \(\mathbb R^+\), the vector \(\hat{\pmb \beta}\) that minimizes \(d(\mathbf y, \mathbf X \pmb \beta)^2\) is the same one that minimizes \(d(\mathbf y, \mathbf X \pmb \beta)\), so we work with the squared distance.

\[ \begin{aligned} \text{Let } S(\pmb \beta) & =d(\mathbf y, \mathbf X \pmb \beta)^2 = (\mathbf y - \mathbf X \pmb \beta)^\top (\mathbf y - \mathbf X \pmb \beta) \\ & = \mathbf y^\top \mathbf y - 2 \mathbf y^\top \mathbf X \pmb \beta + \pmb \beta^\top \mathbf X^\top \mathbf X \pmb \beta. \end{aligned} \]

Why do we call this \(S\)?

  • It is the squared Euclidean distance;
  • Also the sum of squared errors/residuals (SSE).
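Continuing the simulated example above, a short sketch confirming that the two readings of \(S\), squared Euclidean distance and sum of squared residuals, give the same number (the function and variable names here are just illustrative):

```python
def S(beta, X, y):
    """Objective S(beta): squared Euclidean distance between y and X beta,
    equivalently the sum of squared residuals."""
    resid = y - X @ beta
    return resid @ resid

beta_guess = np.zeros(X.shape[1])                    # any candidate coefficient vector
d_squared = np.linalg.norm(y - X @ beta_guess) ** 2  # squared Euclidean distance
sse = np.sum((y - X @ beta_guess) ** 2)              # sum of squared residuals
assert np.isclose(S(beta_guess, X, y), d_squared)
assert np.isclose(S(beta_guess, X, y), sse)
```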

\(S(\pmb \beta)\) is our objective function. By the Gauss-Markov Theorem, the \(\hat{\pmb \beta}\) that minimizes \(S(\pmb \beta)\) is the best linear unbiased estimator (BLUE) of the regression coefficients under assumptions 2-4 above (mean-zero, homoscedastic, uncorrelated errors); normality of the errors is not required.

To find the OLS estimate, we minimize the objective function by computing its gradient, setting the gradient equal to zero, and solving for the coefficient vector \(\pmb \beta\) that satisfies the resulting equation.

\[ \begin{aligned} \pp[S(\pmb \beta)]{\pmb \beta} & = -2 \mathbf X^\top \mathbf y + 2 \mathbf X^\top\mathbf X \pmb \beta \stackrel {set} = \pmb 0 \\ & \implies \mathbf X^\top (\mathbf y - \mathbf X\pmb \beta) = \pmb 0. \end{aligned} \] From this we get the least squares normal equations: \[\mathbf X^\top \mathbf X \hat {\pmb \beta} = \mathbf X^\top \mathbf y,\] \[ \hat {\pmb \beta} = (\mathbf X^{\top} \mathbf X)^{-1} \mathbf X^\top \mathbf y,\] provided that \((\mathbf X^\top \mathbf X)^{-1}\) exists, which it does whenever the columns of \(\mathbf X\) are linearly independent (i.e., \(\mathbf X\) has full column rank).
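A minimal NumPy sketch of the estimator, reusing the simulated X and y from the earlier snippet; solving the normal equations directly mirrors the algebra above, while np.linalg.lstsq is the more numerically stable route in practice:

```python
# Solve the normal equations X^T X beta_hat = X^T y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred alternative (SVD-based least-squares solver).
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_hat_lstsq)
print(beta_hat)  # should land near beta_true for this simulated data
```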

Tip
How exactly was the above worked out?

First, expand \(S(\pmb \beta)\): \[ \begin{aligned} S(\pmb \beta) & = (\mathbf y - \mathbf X \pmb \beta)^{\top}(\mathbf y - \mathbf X \pmb \beta)\\ & = \mathbf y^\top \mathbf y - \mathbf y^\top \mathbf X \pmb \beta - (\mathbf X \pmb \beta)^{\top} \mathbf y + (\mathbf X \pmb \beta)^\top \mathbf X \pmb \beta \end{aligned} \]

Since \((\mathbf X \pmb \beta)^\top \mathbf y = \mathbf y^\top \mathbf X \pmb \beta\) (as it’s a scalar and equal to its transpose), we can simplify to \[ S(\pmb \beta) = \mathbf y^\top \mathbf y - 2 \mathbf y^\top \mathbf X \pmb \beta + \pmb \beta^\top \mathbf X^\top \mathbf X \pmb \beta. \]

  • The derivative of \(\mathbf y^\top \mathbf y\) with respect to \(\pmb \beta\) is \(\pmb 0\), since it does not depend on \(\pmb \beta\);
  • The derivative of \(-2\mathbf y^\top \mathbf X \pmb \beta\) with respect to \(\pmb \beta\) is \(-2 \mathbf X^\top \mathbf y\), since the derivative of \(\mathbf a^\top \pmb \beta\) with respect to \(\pmb \beta\) is \(\mathbf a\), here with \(\mathbf a = -2 \mathbf X^\top \mathbf y\). (Page 10, 2.4.1 of Petersen and Pedersen (2012))
  • The derivative of \(\pmb \beta^\top \mathbf X^\top \mathbf X \pmb \beta\) with respect to \(\pmb \beta\) is \(2 \mathbf X^\top \mathbf X \pmb \beta\), by the derivative rule for quadratic forms. (See equation 81 of Petersen and Pedersen (2012), and observe that \(\mathbf X^\top \mathbf X\) is symmetric.)

Hence \[\nabla S(\pmb \beta) = -2 \mathbf X^\top \mathbf y + 2 \mathbf X^\top \mathbf X \pmb \beta. \]
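As a sanity check on the calculus, this sketch (reusing S, X, y, and rng from the earlier snippets) compares the analytic gradient to a central finite-difference approximation:

```python
def grad_S(beta, X, y):
    """Analytic gradient: -2 X^T y + 2 X^T X beta."""
    return -2 * X.T @ y + 2 * X.T @ X @ beta

def grad_S_numeric(beta, X, y, h=1e-6):
    """Central finite-difference approximation of the gradient of S."""
    g = np.zeros_like(beta)
    for j in range(beta.size):
        e = np.zeros_like(beta)
        e[j] = h
        g[j] = (S(beta + e, X, y) - S(beta - e, X, y)) / (2 * h)
    return g

beta_test = rng.normal(size=X.shape[1])
assert np.allclose(grad_S(beta_test, X, y), grad_S_numeric(beta_test, X, y), rtol=1e-4)
```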

Setting \(\nabla S(\pmb \beta) = 0\) gives us the normal equations: \[\mathbf X^\top \mathbf X \hat {\pmb \beta} = \mathbf X^\top \mathbf y.\]

Finally, we solve for \(\hat {\pmb \beta}\) by multiplying both sides on the left by \((\mathbf X^\top \mathbf X)^{-1}\): \[\hat {\pmb \beta} = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y.\]
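One last check on the simulated example: the normal equations say that, at \(\hat{\pmb \beta}\), the residual vector is orthogonal to every column of \(\mathbf X\), which is easy to verify numerically (reusing X, y, and beta_hat from the sketches above):

```python
residuals = y - X @ beta_hat
print(X.T @ residuals)                        # each entry is ~0, up to floating-point error
assert np.allclose(X.T @ residuals, 0.0, atol=1e-8)
```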
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.” https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf.