More data → neural network performance keeps improving almost endlessly as you give them more data to train on
Previous methods of AI/machine learning plateaued at a certain point
The larger the neural network, the more its performance (accuracy) scales with data
Caveat — these neural networks work on labeled data (where both the (x, y) values are provided: feature and result)
Other improvements have also enabled this rise in scale and, just as important, speed:
Data (already covered) — society and the digital age produce more data every day
Computation — GPU and other processing advancements have made things move faster
Algorithmic improvements — for example, switching from sigmoid to rectified linear units (ReLU) as the activation function has sped up training significantly
Binary Classification
The output label only has two choices: 0 or 1. Is the image a cat or not a cat?
Logistic Regression
Given a feature set x, we want to get ŷ = P(y = 1 | x): the probability that the picture is a cat, given the input x.
So how do we define the function?
x is an n_x-dimensional vector. If we have parameters w ∈ R^{n_x} and b ∈ R, we might imagine doing a linear regression, which would give us f(x) = w^T x + b.
However, this doesn't work for our purposes, since we need ŷ to fall between 0 and 1 (it's a probability).
To get a function that remains between 0 and 1, we use the sigmoid function: ŷ = σ(z) = 1 / (1 + e^{−z}), where z = w^T x + b, our linear function from before. For large positive values of z, the sigmoid goes to 1, while for very negative values of z, it goes to 0.
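As a quick sketch in Python (using NumPy; the weights and feature values below are made-up toy numbers), the forward computation for a single example might look like:

```python
import numpy as np

def sigmoid(z):
    # squashes any real z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# toy parameters and input (illustrative values only)
w = np.array([0.2, -0.5, 0.1])   # weights, shape (n_x,)
b = 0.3                          # bias
x = np.array([1.0, 2.0, 0.5])    # one feature vector

z = np.dot(w, x) + b             # linear part: w^T x + b
y_hat = sigmoid(z)               # estimated P(y = 1 | x)
print(y_hat)
```

Whatever z comes out of the linear part, `y_hat` always lands strictly between 0 and 1, which is what lets us read it as a probability.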
Logistic Regression Cost Function
We need a cost function for our neural network.
Let’s reiterate our notation for ŷ:
ŷ^{(i)} = σ(w^T x^{(i)} + b). The superscript (i) denotes the index of the training sample, since we are assuming we have an array of labeled data {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})} that we’ll be training our neural network on.
We have a loss function (the precursor to our cost function), but what do we define it as?
Maybe a squared error? L(ŷ, y) = (1/2)(ŷ − y)²
But that will make our optimization problem later non-convex (i.e. many local optima), so gradient descent may not find the global minimum.
Instead, logistic regression uses the cross-entropy loss: L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ)).
In both of these functions, we’re trying to minimize the value of the function. This latter function has notable properties — when y = 1, the loss reduces to −log ŷ, so minimizing it pushes ŷ as large as possible (since ŷ comes from a sigmoid, as close to 1 as possible). When y = 0, the loss reduces to −log(1 − ŷ), pushing ŷ as small as possible (as close to 0 as possible).
Using this loss function, which we can define for each training sample i — L(ŷ^{(i)}, y^{(i)}) — we can define our cost function as the average of the losses over the whole training set: J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^{(i)}, y^{(i)}).
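A minimal sketch of this cost computation over a toy labeled set (the data values and the `cost` helper name are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    # X: shape (n_x, m), one column per training sample
    # y: shape (m,), labels in {0, 1}
    m = X.shape[1]
    y_hat = sigmoid(np.dot(w, X) + b)   # predictions for all m samples at once
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.mean(losses)              # J(w, b) = average loss

# toy data: 2 features, 4 samples
X = np.array([[0.5, 1.5, -1.0, 2.0],
              [1.0, -0.5, 0.5, 1.0]])
y = np.array([1, 0, 0, 1])
w = np.zeros(2)
b = 0.0
print(cost(w, b, X, y))   # with w = 0, b = 0 every y_hat is 0.5, so J = ln 2 ≈ 0.693
```

Starting from all-zero parameters, every prediction is 0.5 regardless of the label, which gives the cost its "uninformed" baseline of ln 2; training should push J below that.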
Gradient Descent
How do you know what values of w and b to start with for gradient descent?
So gradient descent is a method for choosing good values for your parameters w and b. Your cost function J(w, b) has a shape, and because the loss function we defined is convex, it has a single global minimum. For an initial value of w and b, there will be a certain cost, which is a point on the graph of J.
Gradient descent is pretty simple thereafter: repeatedly update w := w − α · ∂J(w, b)/∂w (and similarly b := b − α · ∂J(w, b)/∂b). α is the learning rate, which determines how big of a step you take at each iteration of gradient descent.
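Putting the pieces together, here is one sketch of the full training loop in Python. It uses the standard logistic-regression gradients dw = (1/m) X (ŷ − y) and db = mean(ŷ − y); the learning rate, iteration count, and toy data are all arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iters=1000):
    # X: (n_x, m) features, y: (m,) labels in {0, 1}
    n_x, m = X.shape
    w, b = np.zeros(n_x), 0.0
    for _ in range(iters):
        y_hat = sigmoid(np.dot(w, X) + b)   # forward pass: predictions
        dz = y_hat - y                      # derivative of the loss w.r.t. z
        dw = np.dot(X, dz) / m              # dJ/dw, averaged over the m samples
        db = np.mean(dz)                    # dJ/db
        w -= alpha * dw                     # the update step: w := w - alpha * dJ/dw
        b -= alpha * db
    return w, b

# tiny separable toy set: the label is 1 exactly when the first feature is positive
X = np.array([[1.0, 2.0, -1.0, -2.0],
              [0.5, -0.5, 0.5, -0.5]])
y = np.array([1, 1, 0, 0])
w, b = train(X, y)
preds = (sigmoid(np.dot(w, X) + b) > 0.5).astype(int)
print(preds)   # recovers the training labels on this easy set
```

On this deliberately easy dataset the loop converges quickly; in practice you would monitor J(w, b) per iteration and tune α, since too large a step overshoots the minimum and too small a step makes training crawl.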