(Deep Learning) Week 1 + 2
Rise of deep learning driven by scale
  • More data → neural networks keep improving, almost without limit, as they get more data to train on
  • Previous methods of AI/machine learning plateau in performance past a certain point
  • The larger the neural network, the better its performance (accuracy) scales with data
  • Caveat — these neural networks work on labeled data (each example provides both the x and y values, feature and result)
  • Other improvements, beyond data, have enabled this rise in scale and, just as importantly, speed:
  • Data (already covered) — society and the digital age produce more data every day
  • Computation — GPU and other processing advancements have made things move faster
  • Algorithmic improvements — for example, switching from sigmoid activations to rectified linear units (ReLU) has sped up training significantly (see the sketch below)
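A minimal NumPy sketch of the two activation functions (the function names are my own); the relevant point is that the sigmoid’s slope flattens out for large $|z|$, so gradients vanish and learning slows, while ReLU’s slope stays 1 for positive inputs:

```python
import numpy as np

def sigmoid(z):
    # Squashes into (0, 1); saturates (flat slope) for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(0, z); slope is exactly 1 for z > 0, so gradients don't vanish
    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # nearly flat at the extremes
print(relu(z))     # [0. 0. 0. 1. 5.]
```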

Binary Classification

The output label only has two choices: 0 or 1. Is the image a cat or not a cat?

Logistic Regression

Given a feature vector $x$, we want to get $\hat{y} = P(y=1 \mid x)$: the probability that the picture is a cat, given the data $x$.

So how do we define the function?

$x$ is an $n_x$-dimensional vector. If we have parameters $w \in \mathbb{R}^{n_x}$ and $b \in \mathbb{R}$, we might imagine doing a linear regression, which would give us an equation $f(x) = w^T x + b$.

However, this doesn’t work for our purposes, since we need $\hat{y}$ to fall between 0 and 1 (it’s a probability).

To get a function that remains between 0 and 1, we use the sigmoid function: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = w^T x + b$ is our linear expression from before. You can see how for large values of $z$, the sigmoid function goes to 1, while for very negative values of $z$, the function goes to 0.
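As a quick NumPy sketch of this forward computation (function and variable names are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    # z = w^T x + b, squashed into (0, 1) by the sigmoid
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example with n_x = 3 features and arbitrary parameters
w = np.array([0.2, -0.5, 1.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])
print(predict_proba(w, b, x))  # a probability between 0 and 1
```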

Logistic Regression Cost Function

We need a cost function for our neural network.

Let’s reiterate our notation for $\hat{y}$:
$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$. The superscript $(i)$ denotes the index of the training sample, since we assume we have a set of labeled data $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ that we’ll be training our neural network on.

We need a loss function (the per-example precursor to our cost function), but how do we define it?

Maybe a squared error? $L(y,\hat{y}) = \frac{1}{2}(\hat{y} - y)^2$

But that would make our optimization problem later non-convex (i.e. it has many local minima, so gradient descent may not find the global one).

Instead, we define the following function:
$$L(y,\hat{y}) = -\big(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\big)$$

In both of these functions, we’re trying to minimize the value of the loss. This latter function has notable properties: when $y=1$, the loss reduces to $-\log(\hat{y})$, so minimizing it pushes $\hat{y}$ as large as possible (and since $\hat{y}$ comes from a sigmoid, that means as close to 1 as possible). When $y=0$, the loss reduces to $-\log(1-\hat{y})$, so we want $\hat{y}$ as small as possible (as close to 0 as possible).
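A minimal NumPy sketch of this loss (the clipping by a small epsilon to avoid $\log(0)$ is my addition for numerical safety, not part of the definition):

```python
import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so the logs stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy_loss(1, 0.99))  # small loss: confident and correct
print(cross_entropy_loss(1, 0.01))  # large loss: confident and wrong
```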

Using this loss function, which we can compute for each training sample $i$ as $L(y^{(i)}, \hat{y}^{(i)})$, we can define our cost function, which is the average of the losses over all $m$ training samples:
$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(y^{(i)}, \hat{y}^{(i)})$$
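Sketching the vectorized cost over the whole training set; I’m assuming the column-per-example convention here, so $X$ has shape $(n_x, m)$ and $Y$ has shape $(1, m)$ (those shapes and names are my assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    # X: (n_x, m), one training example per column; Y: (1, m) labels in {0, 1}
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)                          # (1, m) predictions
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))  # per-example loss
    return np.sum(losses) / m                                    # J(w, b)
```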

We can now use this in gradient descent.

Gradient Descent

How do you know what value of $w$ to start out with for gradient descent?

So gradient descent is a method for choosing the best values for your parameters $w$ and $b$. Your cost function $J(w,b)$ has a shape, and because of the loss function we chose, it is convex, with a single global minimum. For an initial value of $w$ and $b$, there will be a certain cost, which is a point on the graph of $J$. Since $J$ is convex, the initialization doesn’t matter much; for logistic regression, $w$ and $b$ are usually just initialized to 0.

Gradient descent is pretty simple thereafter: repeat $w := w - \alpha\frac{\partial J(w,b)}{\partial w}$ (and likewise $b := b - \alpha\frac{\partial J(w,b)}{\partial b}$) until convergence. $\alpha$ is the learning rate, which determines how big of a step you take at each iteration of gradient descent.
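A minimal sketch of the whole training loop for logistic regression, assuming the standard vectorized gradients for this cost, $dw = \frac{1}{m}X(\hat{Y}-Y)^T$ and $db = \frac{1}{m}\sum(\hat{Y}-Y)$; those formulas and all names below are my additions, not derived in these notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.01, num_iterations=1000):
    # X: (n_x, m) training examples as columns; Y: (1, m) labels in {0, 1}
    n_x, m = X.shape
    w = np.zeros((n_x, 1))   # J is convex, so zero initialization is fine
    b = 0.0
    for _ in range(num_iterations):
        Y_hat = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)
        dZ = Y_hat - Y                        # error term, shape (1, m)
        dw = np.dot(X, dZ.T) / m              # dJ/dw, shape (n_x, 1)
        db = np.sum(dZ) / m                   # dJ/db, a scalar
        w -= alpha * dw                       # step opposite the gradient
        b -= alpha * db
    return w, b
```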