(Deep Learning) Week 1 + 2
Rise of deep learning driven by scale
• More data → neural networks improve nearly endlessly with more data to train them on
• Previous methods of AI/machine learning plateaued at a certain point
• The larger the neural network, the more its performance (accuracy) can keep scaling with data
• Caveat — these neural networks work on labeled data (where the (x,y) values are both provided for feature/result)
• Other improvements have also enabled this rise, in scale and, just as importantly, in speed:
• Data (already covered) — society and the digital age produce more data every day
• Computation — GPU and other processing advancements have made things move faster
• Algorithmic improvements — for example, changing from sigmoid to rectified linear functions (ReLU) has improved speeds significantly

## Binary Classification

The output label only has two choices: 0 or 1. Is the image a cat or not a cat?

## Logistic Regression

Given a feature vector $x$, we want to get $\hat{y} = P(y=1 \mid x)$: the probability that the picture is a cat, given the data $x$.

So how do we define the function?

$x$ is an $n_x$-dimensional vector. If we have parameters $w \in \mathbb{R}^{n_x}$ and $b\in \mathbb{R}$, we might imagine doing a linear regression, which would give us an equation $f(x) =w^T x + b$.

However, this doesn’t work for our purposes: $w^Tx+b$ can be any real number, while we need $\hat{y}$ to fall between 0 and 1 (it’s a probability).

To get a function that stays between 0 and 1, we use the sigmoid function: $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$, where $z=w^Tx+b$ is the linear function from before. For large positive $z$, $e^{-z}$ goes to 0 and the sigmoid goes to 1; for very negative $z$, $e^{-z}$ grows without bound and the sigmoid goes to 0; at $z=0$ it is exactly 0.5.
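The sigmoid and the resulting prediction can be sketched in a few lines of plain Python. This is only an illustration: the function names `sigmoid` and `predict` are made up for these notes, and the two-branch form is an assumption added to avoid overflow in `math.exp` for very negative $z$.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z}), split in two branches to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^{z} / (1 + e^{z}).
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, b, x):
    """y_hat = sigmoid(w^T x + b) for one feature vector x (lists of floats)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)
```

For instance, `sigmoid(0.0)` is 0.5, and `predict` always returns a value between 0 and 1 regardless of how large `w`, `b`, or `x` are.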

## Logistic Regression Cost Function

We need a cost function for our neural network.

Let’s reiterate our notation for $\hat{y}$:
$\hat{y}^{(i)} = \sigma(w^Tx^{(i)}+b)$. This $(i)$ denotes the index of the training sample, since we are assuming we have an array of labeled data $\{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})\}$ that we’ll be training our neural network on.

We have a loss function (the precursor to our cost function), but what do we define it as?

Maybe the squared error? $L(y,\hat{y})=\frac{1}{2}(\hat{y}-y)^2$

But with a sigmoid inside $\hat{y}$, that will make our optimization problem later non-convex (i.e. it can have many local minima), so gradient descent may not find the global minimum.

Instead, we define the following function:
$L(y,\hat{y})=-(y\log{(\hat{y})} + (1-y)\log{(1-\hat{y})})$

With either function, we’re trying to minimize the loss. The latter has notable properties: when $y=1$, the loss reduces to $-\log{(\hat{y})}$, so minimizing it pushes $\hat{y}$ to be as large as possible (since $\hat{y}$ comes out of a sigmoid, as close to 1 as possible). When $y=0$, the loss reduces to $-\log{(1-\hat{y})}$, so we want to make $\hat{y}$ as small as possible (as close to 0 as possible).
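A small sketch makes this behavior concrete. The function name `loss` and the clipping constant `eps` are assumptions added here (the clipping only guards against `log(0)`); the formula itself is the one defined above.

```python
import math

def loss(y, y_hat, eps=1e-12):
    """Logistic loss -(y*log(y_hat) + (1-y)*log(1-y_hat)) for one sample."""
    # Clip y_hat away from exactly 0 or 1 so math.log never sees 0 (assumption).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# y = 1: a confident correct prediction costs little, a confident wrong one costs a lot.
good = loss(1, 0.9)
bad = loss(1, 0.1)
```

Here `good` is smaller than `bad`, and by symmetry `loss(0, 0.1)` equals `loss(1, 0.9)`.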

This loss function is defined for each training sample $i$ as $L(y^{(i)},\hat{y}^{(i)})$. From it we can define our cost function, which is the average of the losses over all $m$ training samples:
$J(w,b)=\frac{1}{m}\sum_{i=1}^m{L(y^{(i)},\hat{y}^{(i)})}$
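As a rough sketch, the cost is just a loop over the training set. The helper names (`sigmoid`, `loss`, `cost`), the clipping constant, and the tiny dataset below are all made up for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def loss(y, y_hat, eps=1e-12):
    # Clip to avoid log(0); eps is an assumption, not part of the formula.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def cost(w, b, X, Y):
    """J(w, b): average loss over the m samples in X (list of vectors) and Y (labels)."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        total += loss(y, y_hat)
    return total / m

# Hypothetical toy data: one feature, label 1 when the feature is positive.
X = [[-2.0], [-1.0], [1.0], [2.0]]
Y = [0, 0, 1, 1]
J_zero = cost([0.0], 0.0, X, Y)   # every prediction is 0.5
J_better = cost([2.0], 0.0, X, Y) # predictions lean the right way
```

With all parameters at zero every prediction is 0.5, so `J_zero` equals $\log 2$; parameters that separate the two classes give a lower cost.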

We can now use this in gradient descent.

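How do you choose the initial values of $w$ and $b$ to start gradient descent? For logistic regression the choice matters little: $J$ is convex, so with a suitable learning rate gradient descent heads toward the same minimum from any starting point, and initializing everything to zero is common.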

Gradient descent is a method for finding the values of the parameters $w$ and $b$ that minimize the cost. Your cost function $J(w,b)$ has a shape, and based on the loss function we defined it ought to have a global minimum somewhere. For an initial value of $w$ and $b$, there will be a certain cost, which is a point on the graph of $J$.

Gradient descent is pretty simple thereafter: repeat $w := w - \alpha\frac{\partial J(w,b)}{\partial w}$ and $b := b - \alpha\frac{\partial J(w,b)}{\partial b}$ until the cost stops decreasing. $\alpha$ is the learning rate, which determines how big a step you take at each iteration of gradient descent.
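The update rule can be sketched end to end in plain Python. Two things here go beyond the notes above and should be read as assumptions: the gradient expressions $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})x_j^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$, which are the standard result for this loss, and the specific choices of zero initialization, $\alpha = 0.1$, 1000 iterations, and the toy dataset. All function names are hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_and_gradients(w, b, X, Y, eps=1e-12):
    """Return J(w, b) and its partial derivatives with respect to w and b."""
    m, n = len(X), len(w)
    J, dw, db = 0.0, [0.0] * n, 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        clipped = min(max(y_hat, eps), 1.0 - eps)  # guard against log(0)
        J += -(y * math.log(clipped) + (1 - y) * math.log(1.0 - clipped))
        err = y_hat - y                 # shared factor in both gradients
        for j in range(n):
            dw[j] += err * x[j]
        db += err
    return J / m, [d / m for d in dw], db / m

def gradient_descent(X, Y, alpha=0.1, iterations=1000):
    w, b = [0.0] * len(X[0]), 0.0       # zero initialization (assumption)
    history = []                        # cost before each update
    for _ in range(iterations):
        J, dw, db = cost_and_gradients(w, b, X, Y)
        history.append(J)
        w = [wj - alpha * dwj for wj, dwj in zip(w, dw)]  # w := w - alpha * dJ/dw
        b = b - alpha * db                                # b := b - alpha * dJ/db
    return w, b, history

# Hypothetical toy data: one feature, label 1 when the feature is positive.
X = [[-2.0], [-1.0], [1.0], [2.0]]
Y = [0, 0, 1, 1]
w, b, history = gradient_descent(X, Y)
```

On this toy data the recorded cost starts at $\log 2$ and falls as the iterations proceed, and the learned `w[0]` ends up positive, so positive inputs get predictions above 0.5.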