## Multivariate Linear Regression

### Multiple Features

Linear regression with multiple variables is also known as "multivariate linear regression".

Suppose we have a hypothesis with $n$ features:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

and we use the following notation:

- $x^{(i)}_j$: value of feature $j$ in the $i^{th}$ training example
- $x^{(i)}$: the input (features) of the $i^{th}$ training example
- $m$: the number of training examples
- $n$: the number of features

To represent our multivariable hypothesis concisely, we can use the definition of matrix multiplication:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

Note that for convenience, we assume $x^{(i)}_0 = 1$ for $i \in \{1, 2, \dots, m\}$. This allows us to do matrix operations with $\theta$ and $x$.

### Gradient Descent

The gradient descent equation has the same general form as before; we just repeat it for our $n$ features.

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j \qquad \text{for } j := 0, 1, \dots, n$$

In other words, repeat until convergence:

$$\begin{aligned} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_0 \\ \theta_1 &:= \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_1 \\ \theta_2 &:= \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_2 \\ &\;\vdots \end{aligned}$$

Comparing gradient descent with one variable to gradient descent with multiple variables: the update rule is identical; the one-variable algorithm is simply the special case $n = 1$.

#### Feature Scaling

To speed up gradient descent, two techniques help: feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value of an input variable from its values, resulting in a new average of just zero. The reason this helps is that $\theta$ will descend quickly on small ranges and slowly on large ranges, oscillating inefficiently when the variables are very uneven.

To implement both of these techniques, we can adjust our input values as shown in this formula:

$$x_i := \frac{x_i - \mu_i}{s_i}$$

- $\mu_i$ is the average of all the values for feature $i$
- $s_i$ is the range of values (max − min), or alternatively the standard deviation

#### Learning Rate (α)

- If α is too small: slow convergence.
- If α is too large: $J(\theta)$ may not decrease on every iteration and may not converge (slow convergence is also possible).

Tip: to choose α, try …, 0.001, …, 0.01, …, 0.1, …, 1, ….

#### Features and Polynomial Regression

To improve our features and the form of our hypothesis function, we can combine multiple features into one.
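The gradient descent update with feature scaling and mean normalization described above can be sketched in NumPy. The training data, the value of α, and the iteration count below are all hypothetical, illustrative choices:

```python
import numpy as np

# Hypothetical training set: m = 4 examples, n = 2 features
# (e.g. house size in square feet, number of bedrooms -> price).
X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 3.0],
                  [1416.0, 2.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])
m, n = X_raw.shape

# Feature scaling + mean normalization: x_j := (x_j - mu_j) / s_j
mu = X_raw.mean(axis=0)
s = X_raw.max(axis=0) - X_raw.min(axis=0)  # range; std dev also works
X_scaled = (X_raw - mu) / s

# Prepend x_0 = 1 so theta_0 is handled by the same matrix product.
X = np.hstack([np.ones((m, 1)), X_scaled])

# Batch gradient descent:
# theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij
alpha = 0.3
theta = np.zeros(n + 1)
for _ in range(5000):
    gradient = (X @ theta - y) @ X / m
    theta = theta - alpha * gradient
```

Because the features are scaled to comparable ranges, a single learning rate works for all of them and the loop converges quickly.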
For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.

This brings us to polynomial regression. Our hypothesis function need not be linear (a straight line) if a straight line does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, or square root function (or any other form).

For example, if our hypothesis function is $h_{\theta}(x) = \theta_0 + \theta_1 x_1$, then we can create additional features based on $x_1$ to get the quadratic function

$$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$

or the cubic function

$$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$

In the cubic version, we have created new features $x_2$ and $x_3$ where $x_2 = x_1^2$ and $x_3 = x_1^3$. To make it a square root function, we could do:

$$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$$

One important thing to keep in mind: if we choose our features this way, feature scaling becomes very important. E.g. if $x_1$ has range 1–1000, then the range of $x_1^2$ becomes 1–1,000,000 and that of $x_1^3$ becomes 1–1,000,000,000.

### Normal Equation

Gradient descent gives one way of minimizing $J$. There is a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In the "normal equation" method, we minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimal $\theta$ without iteration. The normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y$$

There is no need to do feature scaling with the normal equation.

#### Normal Equation Noninvertibility

If $X^T X$ is non-invertible, the common causes are:

- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. $m \le n$)
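A quick NumPy sketch of the redundant-feature case, using a hypothetical design matrix whose third column is exactly twice the second (as with the same quantity recorded in two units), so $X^T X$ is singular:

```python
import numpy as np

# Hypothetical design matrix: first column is x_0 = 1, and the third
# feature column is exactly 2x the second, so the columns are linearly
# dependent and X^T X is non-invertible.
X = np.array([[1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 5.0, 10.0],
              [1.0, 7.0, 14.0]])
y = np.array([3.0, 4.0, 6.0, 8.0])

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)  # 2, not 3: X^T X is singular

# inv(XtX) fails here, but the pseudoinverse still yields a
# least-squares solution for theta.
theta = np.linalg.pinv(XtX) @ X.T @ y
```

The pseudoinverse returns the minimum-norm least-squares $\theta$, which is why `pinv` is the recommended route below.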
In these cases we can delete some features or use regularization.

In Octave or Matlab, we want to use the `pinv` function (pseudoinverse) rather than `inv`: `pinv` will give us a value of $\theta$ even if $X^T X$ is not invertible.

### Comparison of Gradient Descent and the Normal Equation

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate the inverse of $X^T X$ |
| Works well when $n$ is large (about $n > 10{,}000$) | Slow if $n$ is very large |
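To illustrate the normal-equation side of this comparison, here is a minimal NumPy sketch (with hypothetical data; `np.linalg.pinv` is NumPy's analogue of Octave/Matlab's `pinv`):

```python
import numpy as np

# Hypothetical data, m = 4 examples, n = 2 features;
# the first column of X is x_0 = 1.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 3.0],
              [1.0, 1416.0, 2.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# No feature scaling, no learning rate, no iteration --
# one O(n^3) solve instead.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```

Note that the raw, unscaled feature values are used directly, matching the remark above that feature scaling is unnecessary for the normal equation.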