Coursera_Machine Learning_Andrew Ng_Note week2

Multivariate Linear Regression

Multiple Features

Linear regression with multiple variables is also known as “multivariate linear regression”.

Suppose we have a hypothesis with n features:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

and we also have the notation:

  • $x^{(i)}_j$ : value of feature $j$ in the $i^{th}$ training example
  • $x^{(i)}$ : the input (features) of the $i^{th}$ training example
  • m : the number of training examples
  • n : the number of features

In order to represent our multivariable hypothesis concisely, we can use the definition of matrix multiplication:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

Note that for convenience, we assume $x^{(i)}_0 = 1$ for $i \in 1, 2, \dots, m$. This allows us to do matrix operations with theta and x.
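
As a minimal sketch (not part of the course materials), the vectorized hypothesis can be computed with NumPy; the design matrix X below, with a leading column of ones for $x_0$, and the theta values are illustrative assumptions:

```python
import numpy as np

# Design matrix: m = 3 training examples, n = 2 features,
# with x_0 = 1 prepended to each example.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 4.0]])
theta = np.array([10.0, 0.5, 20.0])  # theta_0, theta_1, theta_2

# h_theta(x) = theta^T x, computed for all examples at once.
h = X @ theta
print(h)  # one prediction per training example
```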

Gradient Descent

The gradient descent equation has the same general form; we just have to repeat it for our n features:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j \qquad \text{for } j := 0, 1, \dots, n$$

In other words, repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_0$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_1$$

$$\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_2$$

$$\vdots$$

Comparison: gradient descent with one variable vs. gradient descent with multiple variables:

(figure: side-by-side comparison of the single-variable and multi-variable gradient descent update rules)
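
A minimal NumPy sketch of the multi-variable update rule above, assuming a design matrix X that already includes the $x_0 = 1$ column; the function name, toy data, and default parameters are illustrative, not course code:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Simultaneously update all theta_j for a fixed number of iterations."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        error = X @ theta - y             # h_theta(x^(i)) - y^(i) for all i
        gradient = (X.T @ error) / m      # one partial derivative per theta_j
        theta = theta - alpha * gradient  # simultaneous update of all parameters
    return theta

# Toy data: y = 1 + 2*x_1 (first column of X is x_0 = 1).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(gradient_descent(X, y, alpha=0.1, num_iters=2000))  # close to [1, 2]
```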

Feature Scaling

We can speed up gradient descent by keeping our input values in roughly the same range. Two techniques that help with this are feature scaling and mean normalization.

Feature scaling involves dividing the input values by the range (i.e., the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

Mean normalization involves subtracting the average value of an input variable from the values of that input variable, resulting in a new average value of just zero for that variable.

The parameter $\theta$ will descend quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the feature ranges are very uneven.

To implement both of these techniques, we can adjust our input values as shown in this formula:

$x_i := \frac{x_i - \mu_i}{S_i}$

  • $\mu_i$ is the average of all the values for feature (i)
  • $S_i$ is the range of values (max - min), or alternatively the standard deviation
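
A minimal sketch of both techniques with NumPy, using the standard deviation for $S_i$ (the range would work equally well); the sample matrix is made up for illustration:

```python
import numpy as np

# Each row is a training example, each column a feature (without the x_0 column).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])

mu = X.mean(axis=0)         # mu_i: average of each feature
sigma = X.std(axis=0)       # S_i: here the standard deviation
X_norm = (X - mu) / sigma   # x_i := (x_i - mu_i) / S_i

print(X_norm.mean(axis=0))  # ~0 for every feature
print(X_norm.std(axis=0))   # ~1 for every feature
```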

Learning rate ($\alpha$):

  • if $\alpha$ is too small: slow convergence
  • if $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and may not converge (slow convergence is also possible)

Tip: To choose $\alpha$, try …, 0.001, …, 0.01, …, 0.1, …, 1, ….
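
A small illustrative sketch (not from the course) of trying several candidate values of $\alpha$ and checking whether $J(\theta)$ decreases on every iteration; the cost helper and toy data are assumptions:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    return ((X @ theta - y) ** 2).sum() / (2 * m)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

for alpha in [0.001, 0.01, 0.1, 1.0]:
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(100):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        costs.append(cost(X, y, theta))
    decreasing = all(b <= a for a, b in zip(costs, costs[1:]))
    print(f"alpha={alpha}: final J={costs[-1]:.3g}, always decreasing: {decreasing}")
```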

Features and Polynomial Regression

In order to improve our features and the form of our hypothesis function, we can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.

Here we introduce polynomial regression with some examples.

Our hypothesis function need not be linear (a straight line) if a straight line does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, or square root function (or any other form).

For example, if our hypothesis function is

$$h_\theta(x) = \theta_0 + \theta_1 x_1$$

then we can create additional features based on $x_1$ to get the quadratic function

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$

or the cubic function

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$

In the cubic version, we have created new features $x_2$ and $x_3$, where $x_2 = x_1^2$ and $x_3 = x_1^3$.

To make it a square root function, we could do:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$$

One important thing to keep in mind is that if we choose our features this way, then feature scaling becomes very important.

e.g., if $x_1$ has range 1 - 1000, then the range of $x_1^2$ becomes 1 - 1000000 and that of $x_1^3$ becomes 1 - 1000000000.
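
A minimal sketch of building polynomial features from a single input $x_1$ and then scaling them, since their ranges quickly diverge; the data values are illustrative assumptions:

```python
import numpy as np

x1 = np.array([1.0, 10.0, 100.0, 1000.0])  # original feature, range 1 - 1000

# New features x_2 = x_1^2 and x_3 = x_1^3, as in the cubic hypothesis.
X_poly = np.column_stack([x1, x1 ** 2, x1 ** 3])
print(X_poly.max(axis=0))  # ranges grow to 1e3, 1e6, 1e9 -> scaling matters

# Mean-normalize and scale each polynomial feature.
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)
print(X_scaled.round(2))
```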

Normal Equation

Gradient descent gives one way of minimizing J. There is a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm.

In the “Normal Equation” method, we minimize J by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimal theta without iteration. The normal equation formula is given below:

$$\theta = (X^T X)^{-1} X^T y$$

There is no need to do feature scaling with the normal equation.
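
A minimal NumPy sketch of the normal equation above, using a pseudo-inverse in place of a plain inverse for robustness; the toy data is an assumption for illustration:

```python
import numpy as np

# Toy data: y = 1 + 2*x_1, with x_0 = 1 as the first column of X.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

# theta = (X^T X)^{-1} X^T y, computed directly with no iteration.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # approximately [1, 2], no feature scaling or alpha needed
```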

Normal Equation Noninvertibility

If $X^T X$ is non-invertible, the common causes might be:

  • Redundant features, where two features are very closely related (i.e., they are linearly dependent)
  • Too many features (e.g., $m \le n$). In this case, delete some features or use regularization.

In Octave or Matlab, we want to use the pinv function (pseudo-inverse) rather than inv.

The pinv function will give us a value of $\theta$ even if $X^T X$ is not invertible.
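
A small sketch (in NumPy rather than Octave/Matlab, purely as an illustration) of why the pseudo-inverse helps: with a redundant, linearly dependent feature, $X^T X$ is singular, plain inversion fails, but pinv still returns a usable theta:

```python
import numpy as np

# x_2 is exactly 2 * x_1, so the features are linearly dependent
# and X^T X is singular (non-invertible).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([3.0, 5.0, 7.0])

try:
    theta = np.linalg.inv(X.T @ X) @ X.T @ y   # may raise LinAlgError
except np.linalg.LinAlgError:
    print("X^T X is singular; falling back to the pseudo-inverse")

theta = np.linalg.pinv(X.T @ X) @ X.T @ y      # always defined
print(theta)
```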

Comparison of gradient descent and the normal equation

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^T X$ |
| Works well even when n is large (about n > 10,000) | Slow if n is very large |