## Multivariate Linear Regression

### Multiple Features

Linear regression with multiple variables is also known as "multivariate linear regression".

Suppose we have a hypothesis with n features:

- $x^{(i)}_j$ : value of feature $j$ in the $i^{th}$ training example
- $x^{(i)}$ : the input (features) of the $i^{th}$ training example
- m : the number of training examples
- n : the number of features

In order to represent our multivariable hypothesis concisely, we can use the definition of matrix multiplication:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

Note that for convenience, we assume $x^{(i)}_0 = 1$ for $i \in \{1, 2, \dots, m\}$. This allows us to do matrix operations with $\theta$ and $x$.
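As a quick sketch of the vectorized form (using NumPy, with made-up numbers), stacking the training examples as rows of a matrix $X$ lets us compute $\theta^T x$ for every example at once:

```python
import numpy as np

# Hypothetical design matrix: m = 3 training examples, n = 2 features,
# with the x_0 = 1 bias column prepended.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, -1.0])  # [theta_0, theta_1, theta_2]

# h_theta(x) = theta^T x, computed for all m examples at once: X @ theta
h = X @ theta
print(h)  # [-0.5 -0.5 -0.5]
```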

### Gradient Descent

The gradient descent equation itself has the same general form; we just have to repeat it for our $n$ features:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j \quad \text{for } j := 0, 1, \dots, n$$

In other words:

- Repeat until convergence:
  - $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_0$
  - $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_1$
  - …

Comparison: gradient descent with one variable is just the special case of gradient descent with multiple variables where $n = 1$.
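A minimal NumPy sketch of this update rule (the toy data and hyperparameters here are made up for illustration):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent. X is (m, n+1) with a leading column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Simultaneous update of every theta_j:
        # theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
    return theta

# Toy data generated from y = 1 + 2 * x_1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))  # close to [1. 2.]
```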

#### Feature Scaling

To speed up gradient descent, two techniques help: `feature scaling` and `mean normalization`.

`Feature scaling` involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

`Mean normalization` involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. Gradient descent will descend quickly on small ranges and slowly on large ranges, so ideally every feature should be on a similar scale.

To implement both of these techniques, we can ajust our input values as shown in this formula:

$x_i := \frac{x_i - \mu_i}{s_i}$

- $\mu_i$ is the average of all the values for feature $i$
- $s_i$ is the range of values (max − min), or alternatively the standard deviation
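Both techniques combine into a few lines of NumPy (a sketch; the matrix values are made up):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column: subtract its mean, divide by its range.
    (The standard deviation could be used for s_i instead of the range.)"""
    mu = X.mean(axis=0)                # mu_i: average of feature i
    s = X.max(axis=0) - X.min(axis=0)  # s_i: range (max - min)
    return (X - mu) / s

# Two features on very different scales
X = np.array([[100.0, 2.0],
              [200.0, 4.0],
              [300.0, 6.0]])
print(mean_normalize(X))  # each column now has mean 0 and range 1
```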

#### Learning Rate (`⍺`)

- If `⍺` is too small: slow convergence.
- If `⍺` is too large: $J(\theta)$ may not decrease on every iteration and may not converge (slow convergence is also possible).

**Tip**: To choose `⍺`, try …, 0.001, …, 0.01, …, 0.1, …, 1, …

#### Features and Polynomial Regression

In order to improve our features and the form of our hypothesis function, we can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.

Here we are going to introduce `Polynomial Regression` by giving some examples.

Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, or square root function (or any other form).

For example, if our hypothesis function is $h_\theta(x) = \theta_0 + \theta_1 x_1$, then we can create additional features based on $x_1$ to get the quadratic function

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$

or the cubic function

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$

In the cubic version, we have created new features $x_2$ and $x_3$, where $x_2 = x_1^2$ and $x_3 = x_1^3$.

To make it a square root function, we could do:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$$

One important thing to keep in mind: if we choose our features this way, then feature scaling becomes very important.

e.g. if $x_1$ has range $1$–$1000$, then the range of $x_1^2$ becomes $1$–$1000000$ and that of $x_1^3$ becomes $1$–$1000000000$.
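A NumPy sketch of building polynomial features and then scaling them (the sample values are made up):

```python
import numpy as np

# A single feature x_1 with range 1 - 1000 ...
x1 = np.linspace(1.0, 1000.0, num=5)

# ... turned into polynomial features x_1, x_1^2, x_1^3 whose ranges explode.
X_poly = np.column_stack([x1, x1 ** 2, x1 ** 3])
print(X_poly.max(axis=0))  # [1.e+03 1.e+06 1.e+09]

# Feature scaling brings every column back to a comparable range of 1.
mu = X_poly.mean(axis=0)
s = X_poly.max(axis=0) - X_poly.min(axis=0)
X_scaled = (X_poly - mu) / s
print(X_scaled.max(axis=0) - X_scaled.min(axis=0))  # [1. 1. 1.]
```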

### Normal Equation

Gradient descent gives one way of minimizing $J$. There is a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm.

In the "Normal Equation" method, we will minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimum $\theta$ without iteration. The normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y$$

There is no need to do feature scaling with the normal equation.

#### Normal Equation Noninvertibility

If $X^T X$ is non-invertible, the common causes might be:

- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. $m \le n$). In this case, we could delete some features or use `regularization`.

In Octave or MATLAB, we want to use the `pinv` function (pseudo-inverse) rather than `inv`. The `pinv` function will give us a value of $\theta$ even if $X^T X$ is not invertible.
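The same idea in NumPy (a sketch with made-up data), where `np.linalg.pinv` plays the role of Octave's `pinv`:

```python
import numpy as np

# Design matrix with a redundant feature: the third column is exactly
# 2x the second, so X^T X is singular (non-invertible).
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equation using the pseudo-inverse instead of a plain inverse:
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(X @ theta)  # predictions still reproduce y: [1. 3. 5. 7.]
```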

### Comparison of gradient descent and the normal equation

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^T X$ |
| Works well when $n$ is large (e.g. $n > 10{,}000$) | Slow if $n$ is very large |