(Deep Learning) Week 1 + 2
Rise of deep learning driven by scale
  • More data → neural networks keep improving almost without limit as the amount of training data grows
  • Previous methods of AI/machine learning plateaued at a certain point
  • The larger the neural network, the better its performance (accuracy) scales with data
  • Caveat: these neural networks work on labeled data (each example provides both the features x and the result y)
  • Three things have driven this rise in scale and, just as importantly, in speed:
      • Data (already covered): society and the digital age produce more data every day
      • Computation: GPUs and other processing advancements have made things move faster
      • Algorithmic improvements: for example, switching from sigmoid to rectified linear unit (ReLU) activations has sped up training significantly

Binary Classification

The output label only has two choices: 0 or 1. Is the image a cat or not a cat?

Logistic Regression

Given a feature set $x$, we want to get $\hat{y}$, which is $P(y=1 \mid x)$: the probability that the picture is a cat, given the data $x$.

So how do we define the function?

$x$ is an $n_x$-dimensional vector. If we have parameters $w \in \mathbb{R}^{n_x}$ and $b \in \mathbb{R}$, we might imagine doing a linear regression, which would give us an equation $f(x) = w^T x + b$.

However, this doesn’t work for our purposes, since we need $\hat{y}$ to fall between 0 and 1 (it’s a probability).

To get a function that remains between 0 and 1, we use the sigmoid function: $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$, where $z = w^T x + b$ is our linear regression from before. For large positive values of $z$, $e^{-z}$ goes to 0 and the sigmoid goes to 1, while for very negative values of $z$, $e^{-z}$ grows without bound and the sigmoid goes to 0.
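The sigmoid and the resulting prediction can be sketched in a few lines of plain Python. This is only an illustration: the function names `sigmoid` and `predict` and the example values of `w`, `b`, and `x` are made up here and do not come from the course material.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow, so use the
    # algebraically equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(w: list[float], b: float, x: list[float]) -> float:
    """Return y_hat = sigmoid(w^T x + b) for one feature vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical parameters and input, chosen only for illustration.
w = [0.5, -0.25]
b = 0.1
x = [2.0, 4.0]
y_hat = predict(w, b, x)  # here z = 1.0 - 1.0 + 0.1 = 0.1
print(y_hat)              # a value strictly between 0 and 1
```

Whatever values `w`, `b`, and `x` take, the output stays between 0 and 1, which is what lets us read it as a probability.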

Logistic Regression Cost Function

We need a cost function for our neural network.

Let’s reiterate our notation for $\hat{y}$:
$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$. This $(i)$ denotes the index of the training sample, since we are assuming we have an array of labeled data $\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})\}$ that we’ll be training our neural network on.

We have a loss function (the precursor to our cost function), but what do we define it as?

Maybe a squared error? $L(y,\hat{y}) = \frac{1}{2}(\hat{y} - y)^2$

But that would make our later optimization problem non-convex (i.e. it can have many local minima).

Instead, we define the following function:
$L(y,\hat{y}) = -\left(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right)$

In both of these functions, we’re trying to minimize the value of the function. This latter function has notable properties: when $y=1$, the loss reduces to $-\log(\hat{y})$, so minimizing it pushes $\hat{y}$ to be as large as possible (since $\hat{y}$ comes from a sigmoid, as close to 1 as possible). When $y=0$, the loss reduces to $-\log(1-\hat{y})$, so minimizing it pushes $\hat{y}$ to be as small as possible (as close to 0 as possible).

Using this loss function, which we can define for each training sample $i$ as $L(y^{(i)},\hat{y}^{(i)})$, we can define our cost function as the average of the losses over all $m$ training samples: $J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(y^{(i)},\hat{y}^{(i)})$.
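To make the loss and the cost concrete, here is a minimal sketch in plain Python. The names `loss` and `cost`, the clipping constant `eps`, and the sample labels and predictions are all assumptions made for illustration, not part of the course notes.

```python
import math


def loss(y: int, y_hat: float, eps: float = 1e-12) -> float:
    """Loss for one sample: -(y log(y_hat) + (1 - y) log(1 - y_hat))."""
    # Clip y_hat away from 0 and 1 so log() never receives 0.
    # (eps is an illustrative safeguard, not something from the notes.)
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def cost(ys: list[int], y_hats: list[float]) -> float:
    """Average of the per-sample losses over all m samples."""
    m = len(ys)
    return sum(loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / m


# Made-up labels and predictions, for illustration only.
ys = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]
mostly_wrong = [0.1, 0.9, 0.2]
print(cost(ys, mostly_right))  # the smaller of the two costs
print(cost(ys, mostly_wrong))  # the larger of the two costs
```

Predictions that agree with the labels give a low cost, and confident wrong predictions are penalized heavily because $-\log$ grows quickly as its argument approaches 0.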

We can now use this in gradient descent.

Gradient Descent

How do you know what value of $w$ to start gradient descent with?

So gradient descent is a method for choosing the best values for your parameters $w$. Your cost function $J(w,b)$ has a shape, and based on the loss function we defined it ought to have a global minimum somewhere. For an initial value of $w$ and $b$, there will be a certain cost, which is a point on the graph of $J$.

Gradient descent is pretty simple thereafter: $w := w - \alpha\frac{\partial J(w,b)}{\partial w}$, and likewise $b := b - \alpha\frac{\partial J(w,b)}{\partial b}$. Here $\alpha$ is the learning rate, which determines how big of a step you take at each iteration of gradient descent.
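The update rule can be sketched as a small training loop in plain Python. Several things here are assumptions rather than content from the notes: the function name `gradient_descent`, the defaults `alpha=0.1` and `steps=1000`, the toy dataset, and starting from $w = 0$, $b = 0$ (a common choice for logistic regression, since the cost has a single global minimum). The gradient expressions used, $\frac{\partial J}{\partial w} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})\,x^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$, are the standard result for the loss defined above and are not derived in these notes.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gradient_descent(xs, ys, alpha=0.1, steps=1000):
    """Fit w and b by repeating w := w - alpha * dJ/dw, b := b - alpha * dJ/db."""
    m = len(xs)      # number of training samples
    n = len(xs[0])   # number of features
    w = [0.0] * n    # assumed starting point: all zeros
    b = 0.0
    for _ in range(steps):
        dw = [0.0] * n
        db = 0.0
        for x, y in zip(xs, ys):
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = y_hat - y  # per-sample contribution to the gradient
            for j in range(n):
                dw[j] += err * x[j] / m
            db += err / m
        # One gradient descent step with learning rate alpha.
        w = [wj - alpha * dwj for wj, dwj in zip(w, dw)]
        b -= alpha * db
    return w, b


# Tiny made-up 1-D dataset: the label is 1 when the feature is large.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0, 0, 1, 1]
w, b = gradient_descent(xs, ys)
print(w, b)
```

A larger `alpha` takes bigger steps and may converge faster, but if it is too large the updates can overshoot the minimum.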