# Introduction to Linear Regression Models

Suppose $Y$ is a response variable, $X_p$ is a $p$-dimensional covariate vector, and $\epsilon$ is a random error. We say $Y$ depends linearly on $X_p$ if

$$Y=X_p^T\beta+\epsilon, \qquad (*)$$

where $A^T$ denotes the transpose of a matrix $A$.

Given model (*), there are several tasks to address:

1) Does model (*) correctly describe the dependence of $Y$ on $X_p$? That is, is the dependence really linear, and how do we verify that?

2) If model (*) is correct, how do we estimate the parameter $\beta$, and how do we evaluate the quality of our estimate?

3) Compared to other models, how do we evaluate the goodness of fit of model (*)?

Now let’s discuss these questions in turn.

Suppose we have a sample $\{(Y_i,X_i), i=1,\cdots,n\}$ from model (*), and let $Y=(Y_1,\cdots,Y_n)^T$, $X=(X_1,\cdots,X_n)^T$, $\epsilon=(\epsilon_1,\cdots, \epsilon_n)^T$. We can draw a scatter plot of each pair $(Y, X_{:,i})$ in a plane, $i=1,\cdots,p$, where $X_{:,i}$ denotes the $i$-th column of $X$. This is a simple but useful way to see whether $Y$ varies linearly with each covariate. If, for $i=1,\cdots,p$, all the scatter plots roughly follow a straight line, we can trust that model (*) may be right (though some nonlinear dependence of $Y$ on $X_{:,i}$ may still exist), and then verify the model by testing the hypothesis $\beta=0$: if $\beta\ne{0}$, model (*) may be right; if $\beta=0$, then model (*) is wrong.

The parameter of model (*) can be estimated in many ways. The most famous is least squares. The idea is to approximate $Y$ by a linear combination $\hat{Y}$ of the columns of $X$, that is, to find $\hat{\beta}$ such that $\hat{Y}=X\hat{\beta}$ is the best fit of $Y$ in the least-squares sense. The least-squares solution for $\beta$ is given by:

$$\hat{\beta}=\mathop{\arg\min}_{\beta}\,(Y-X\beta)^T(Y-X\beta),$$

i.e., $\hat{\beta}$ minimizes the squared length of the residual vector $Y-X\beta$.

Suppose $X^TX$ is invertible; then minimizing the expression above yields

$$\hat{\beta}=(X^TX)^{-1}X^TY.$$

We have $E(\hat{\beta})=\beta$ and $Var(\hat{\beta})=\sigma^2(X^TX)^{-1}$, where $\sigma^2$ is the constant variance of the error term $\epsilon$. Usually $\sigma^2$ is unknown; one can estimate it by

$\hat{\sigma}^2=\frac{1}{n-(p+1)}(Y-\hat{Y})^T(Y-\hat{Y}).$
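The two formulas above can be sketched numerically as follows. This is a minimal illustration on synthetic data; the true $\beta$, sample size, and noise level are assumptions made for the example.

```python
import numpy as np

# Illustrative setup: the true beta, n, p, and the noise scale are assumptions.
rng = np.random.default_rng(0)
n, p = 200, 3
beta_true = np.array([1.5, -2.0, 0.5])

X = rng.normal(size=(n, p))            # n observations of p covariates
eps = rng.normal(scale=0.3, size=n)    # errors with sigma = 0.3
Y = X @ beta_true + eps

# beta_hat = (X^T X)^{-1} X^T Y; solve() avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# sigma_hat^2 = (Y - Y_hat)^T (Y - Y_hat) / (n - (p + 1)),
# using the degrees of freedom given in the text
Y_hat = X @ beta_hat
sigma2_hat = (Y - Y_hat) @ (Y - Y_hat) / (n - (p + 1))

print(beta_hat)     # should be close to beta_true
print(sigma2_hat)   # should be close to 0.3**2 = 0.09
```

Using `np.linalg.solve` rather than `np.linalg.inv` is the standard numerically stable way to evaluate $(X^TX)^{-1}X^TY$.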

The quality of the parameter estimate can be evaluated through $E(\hat{\beta})$ and $Var(\hat{\beta})$. If $E(\hat{\beta})=\beta$, then $\hat{\beta}$ is said to be unbiased. If

$\sqrt{n}(\hat{\beta}-\beta) \stackrel{D}\longrightarrow{N(0,\Sigma)}$

as $n\to\infty$, then $\hat{\beta}$ is said to be $\sqrt{n}$-consistent; that is, $\hat{\beta}-\beta$ shrinks at the rate $n^{-1/2}$ as $n\to\infty$.

A $t$-test for an individual coefficient in the multivariate linear regression uses the statistic

$t=\frac{\hat{\beta}_j}{SE(\hat{\beta}_j)},$

where the standard error $SE(\hat{\beta}_j)$ is the square root of the $j$-th diagonal element of $Var(\hat{\beta})$, with $\sigma^2$ replaced by $\hat{\sigma}^2$. To test $\beta_j=0$ at significance level $\alpha$, we reject the hypothesis if $|t|\ge{t_{1-\alpha/2;\,n-(p+1)}}$.
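A sketch of this test on synthetic data follows. The simulation setup is an assumption for illustration, and since $n$ is large, the $t$ critical value is approximated by the standard normal quantile (via the standard library's `statistics.NormalDist`) rather than an exact $t$ quantile.

```python
import numpy as np
from statistics import NormalDist

# Illustrative data: the second true coefficient is exactly zero.
rng = np.random.default_rng(1)
n, p = 500, 2
beta_true = np.array([2.0, 0.0])

X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - (p + 1))

# SE(beta_hat_j): sqrt of the j-th diagonal of sigma_hat^2 (X^T X)^{-1}
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stat = beta_hat / se

alpha = 0.05
# Normal approximation to t_{1-alpha/2; n-(p+1)}, adequate for large n
crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = np.abs(t_stat) >= crit
print(t_stat)   # |t| for beta_1 should be very large; for beta_2, moderate
print(reject)
```

With $n=500$ the first coefficient ($\beta_1=2$) should be rejected decisively, while the zero coefficient typically is not.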

The total variation in the response is

$SST=\sum\limits_{i=1}^n(Y_i-\bar{Y})^2.$

The variation explained by the fitted values $\hat{Y}$ of the linear regression is

$SSE=\sum\limits_{i=1}^n(\hat{Y}_i-\bar{Y})^2.$

The residual sum of squares is

$RSS=\sum\limits_{i=1}^n(Y_i-\hat{Y}_i)^2.$

Using $\hat{Y}=X\hat{\beta}$, it is easy to show that

$SST=SSE+RSS.$

So the goodness of fit of model (*) can be evaluated by the coefficient of determination

$r^2=\frac{\sum\limits_{i=1}^n(\hat{Y}_i-\bar{Y})^2}{\sum\limits_{i=1}^n(Y_i-\bar{Y})^2}=\frac{\text{explained variation}}{\text{total variation}}.$

In practice, when $r^2\ge{0.8}$, model (*) is usually considered a good fit.
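The decomposition $SST=SSE+RSS$ and the $r^2$ formula can be checked numerically. The data below are an illustrative assumption; note that an intercept column of ones is included in the design matrix (as in model (**) at the end of this section), since the decomposition relies on the residuals summing to zero.

```python
import numpy as np

# Illustrative simple linear model with intercept 1.0, slope 2.0, noise sd 2.0
rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
Y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])   # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat

SST = np.sum((Y - Y.mean()) ** 2)      # total variation
SSE = np.sum((Y_hat - Y.mean()) ** 2)  # explained variation
RSS = np.sum((Y - Y_hat) ** 2)         # residual sum of squares

r2 = SSE / SST
print(np.isclose(SST, SSE + RSS))      # the decomposition holds
print(r2)                              # fraction of variation explained
```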

Note: model (*) contains no intercept. If there is an intercept, as in

$Y=\alpha+X_p^T\beta+\epsilon, (**)$

one can prepend a $1$ to $X_p$ to get $\tilde{X}_p=(1,X_p^T)^T$ and rewrite model (**) as

$Y=\tilde{X}_p^T\tilde{\beta} +\epsilon$

with $\tilde{\beta}=(\alpha,\beta^T)^T$.
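This augmentation trick can be sketched as follows; the values of $\alpha$ and $\beta$ below are illustrative assumptions.

```python
import numpy as np

# Fit model (**) by augmenting the design matrix with a column of ones,
# so the intercept alpha becomes the first entry of tilde beta.
rng = np.random.default_rng(3)
n, p = 300, 2
alpha, beta = 0.7, np.array([1.0, -1.0])   # illustrative true parameters

Xp = rng.normal(size=(n, p))
Y = alpha + Xp @ beta + rng.normal(scale=0.2, size=n)

X_tilde = np.column_stack([np.ones(n), Xp])            # tilde X_p = (1, X_p^T)^T rowwise
beta_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)

alpha_hat, beta_hat = beta_tilde[0], beta_tilde[1:]    # tilde beta = (alpha, beta^T)^T
print(alpha_hat, beta_hat)   # should be close to 0.7 and [1.0, -1.0]
```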