In discussion, we started with the assumption that the constraint for SVMs should be:
$$y_i(x_i \cdot w + \alpha) \geq c$$
But what does this mean? And what is c anyway?
It’s hard to come up with an intuitive visualization of c. Instead, it might be easier to think about this alternate constraint (which, as we’ll see later, turns out to be equivalent):
$$\frac{y_i(x_i \cdot w + \alpha)}{\|w\|} \geq M$$
Remember that $\frac{y_i(x_i \cdot w + \alpha)}{\|w\|}$ is the formula for the distance from any point $x_i$ to the decision boundary (hyperplane) defined by w and α. So this constraint is saying that we want all of our training points to be at least distance M away from the boundary. In other words, M plays the role of the margin!
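As a quick sanity check, here is a small NumPy sketch of that distance computation. The data points, labels, and the particular w and α are all made up for illustration; the point is just that the smallest signed distance over the training set is the M the constraint refers to.

```python
import numpy as np

# Toy, hand-picked data and hyperplane -- purely illustrative, not from the text.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])  # points x_i
y = np.array([1, 1, -1, -1])                                       # labels y_i
w = np.array([1.0, 1.0])                                           # candidate hyperplane
alpha = -0.5

# Signed distance of each training point to the hyperplane w . x + alpha = 0;
# multiplying by y_i makes the distance positive when the point is classified correctly.
distances = y * (X @ w + alpha) / np.linalg.norm(w)

# The constraint asks every distance to be at least M, so the largest feasible M
# is the smallest of these distances -- that's the margin.
M = distances.min()
print(distances, M)
```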
Using this definition, the SVM optimization problem is fairly easy to write:
$$\max_{w, \alpha, M} M \quad \text{s.t. } \frac{y_i(x_i \cdot w + \alpha)}{\|w\|} \geq M$$
Note that the only difference between the constraint above and the one used in the discussion worksheet is that this version has an extra $\|w\|$ in the denominator of the left-hand side. So basically, $M = \frac{c}{\|w\|}$, or equivalently $c = \|w\| \cdot M$. In other words, c is just the margin scaled by $\|w\|$.
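As a quick numerical check (same made-up data and hyperplane as in the sketch above), the tightest c from the worksheet constraint really is $\|w\|$ times the tightest M from the distance constraint:

```python
import numpy as np

# Same made-up data and hyperplane as in the sketch above.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w, alpha = np.array([1.0, 1.0]), -0.5

c = (y * (X @ w + alpha)).min()   # tightest c in the worksheet constraint
M = c / np.linalg.norm(w)         # tightest M in the distance constraint

# c is just the margin scaled by ||w||.
assert np.isclose(c, np.linalg.norm(w) * M)
```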
Simpler formulation:

$$\max_{w, \alpha, M} M \quad \text{s.t. } \frac{y_i(x_i \cdot w + \alpha)}{\|w\|} \geq M$$

Equivalent formulation in terms of c (what was used on the discussion worksheet):

$$\max_{w, \alpha, c} \frac{c}{\|w\|} \quad \text{s.t. } y_i(x_i \cdot w + \alpha) \geq c$$
Visually, the problem looks like this:
(As a side note, the diagram above might help you see why there must always be a point that lies on the margin on both the positive and negative sides of the boundary. There are two opposing goals here: we’re trying to maximize M, but our training points still need to stay at least distance M away from the boundary. So naturally, we should just keep increasing M until we’re forced to stop — or geometrically, keep expanding the dotted lines until they touch the training data!)
Why don’t we just stop at this simpler formulation? The issue is that we’ve introduced an additional variable M (or c), which makes the problem a little more complicated to optimize.
Luckily, there’s a trick we can use: we can take advantage of the fact that the $\|w\|$ in the constraint $\frac{y_i(x_i \cdot w + \alpha)}{\|w\|} \geq M$ doesn’t really matter. For example, we can arbitrarily change $\|w\|$ to be $10 \cdot \|w\|$, and adjust w and α accordingly (since we have full control over those variables), so that nothing changes about our constraint:
$$\frac{y_i(x_i \cdot (10w) + (10\alpha))}{10 \cdot \|w\|} \geq M$$
(The 10’s cancel out and we’re left with the same expression as before.)
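Here's a tiny sketch of that scale invariance, using the same made-up data as before: scaling w and α by 10 (or any positive factor) leaves every point's distance to the boundary, and hence the constraint, untouched.

```python
import numpy as np

# Same made-up data as before.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w, alpha = np.array([1.0, 1.0]), -0.5

def distances(w, alpha):
    # Signed distance of every training point to the hyperplane.
    return y * (X @ w + alpha) / np.linalg.norm(w)

# Scaling (w, alpha) by 10 leaves every distance -- and hence the constraint -- unchanged.
assert np.allclose(distances(w, alpha), distances(10 * w, 10 * alpha))
```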
Since $\|w\|$ can be whatever we want, why not define it as $\frac{1}{M}$? This seems arbitrary, but it’s nice because it allows us to cancel out the M’s entirely:
$$\frac{y_i(x_i \cdot w + \alpha)}{1/M} \geq M$$
$$y_i(x_i \cdot w + \alpha) \geq 1$$
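To make this step concrete, here's a sketch (same made-up data) that rescales a given (w, α) so that $\|w\| = \frac{1}{M}$. The hyperplane itself doesn't move, but afterwards every point satisfies $y_i(x_i \cdot w + \alpha) \geq 1$, with the closest point meeting it exactly.

```python
import numpy as np

# Same made-up data as before.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w, alpha = np.array([1.0, 1.0]), -0.5

M = (y * (X @ w + alpha)).min() / np.linalg.norm(w)   # current margin

# Rescale (w, alpha) so that ||w|| = 1/M; scaling both by the same factor
# doesn't move the hyperplane at all.
s = 1.0 / (M * np.linalg.norm(w))
w2, alpha2 = s * w, s * alpha

print(np.linalg.norm(w2), 1.0 / M)   # equal: ||w|| is now 1/M
print(y * (X @ w2 + alpha2))         # every value >= 1; the closest point hits exactly 1
```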
And since $\|w\| = \frac{1}{M}$, or equivalently $M = \frac{1}{\|w\|}$, we can replace the M in the objective as well, which leaves us with:
$$\max_{w, \alpha} \frac{1}{\|w\|} \quad \text{s.t. } y_i(x_i \cdot w + \alpha) \geq 1$$
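To close the loop, here's a minimal sketch of actually solving this final formulation on the same made-up toy data, using a generic solver (SciPy's SLSQP, my choice for illustration, not a dedicated SVM library). Maximizing $\frac{1}{\|w\|}$ is the same as minimizing $\|w\|$, so the code minimizes the standard $\frac{1}{2}\|w\|^2$ instead, subject to $y_i(x_i \cdot w + \alpha) \geq 1$.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data (same as the earlier sketches).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])

# Maximizing 1/||w|| is equivalent to minimizing ||w|| (or the usual 1/2 ||w||^2),
# subject to y_i (x_i . w + alpha) >= 1. Pack the variables as z = [w_1, w_2, alpha].
def objective(z):
    w = z[:2]
    return 0.5 * w @ w

# SciPy's "ineq" constraints require fun(z) >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
    for i in range(len(y))
]

result = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
                  method="SLSQP", constraints=constraints)

w_opt, alpha_opt = result.x[:2], result.x[2]
margin = 1.0 / np.linalg.norm(w_opt)   # M = 1 / ||w||
print(w_opt, alpha_opt, margin)
```

The recovered margin 1/‖w‖ should match the distance from the closest training points to the recovered boundary, which is exactly the picture from the side note above: the optimal boundary has points touching the margin on both sides.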