To my knowledge, all existing GAN variants minimise an f-divergence between the real data distribution $P_r(x)$ and the generated data distribution $P_g(x)$
The usual GAN objective turns out to be very similar to the Jensen-Shannon (JS) divergence, though the f-GAN paper explains how to use any f-divergence you like
An f-divergence is a function of the density ratio $\frac{P_g(x)}{P_r(x)}$
But what if the supports of the two distributions don’t overlap significantly? The density ratio will be infinite or zero everywhere they don’t overlap! 😲
As long as the supports are disjoint, the f-divergence will be constant: the density ratio takes the same values (zero or infinity) no matter how far apart the distributions are
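A minimal numeric sketch of this saturation (Python, with toy discrete distributions invented for illustration): two point masses with disjoint supports give the same JS divergence no matter how far apart they sit.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0 / b) is taken to be 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Pr is a point mass at bin 0; Pg is a point mass at bin k
pr = np.zeros(10)
pr[0] = 1.0
for k in [1, 2, 9]:
    pg = np.zeros(10)
    pg[k] = 1.0
    print(k, js_divergence(pr, pg))  # always log(2) ≈ 0.6931, regardless of k
```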
Simple example:
The real data consists of (0,z) points where z∼U(0,1)
Samples are uniformly distributed along a vertical line at x = 0 from y = 0 to y = 1
The model has one parameter θ such that it produces samples (θ,z)
Either the distributions match perfectly or do not overlap at all
The above graph shows the JS divergence for different values of θ
The graph is mostly flat
This means that the gradient of the divergence w.r.t. θ is zero almost everywhere
If the discriminator learns to output a highly accurate approximation of the JS divergence, the generator gets approximately zero gradient
This is an instance of the “vanishing gradient” problem found in GANs
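For this example the JS divergence even has a closed form (a quick derivation: for θ ≠ 0 the supports are disjoint, so each distribution contributes a KL term of $\log 2$ against the 50/50 mixture):

$$\mathrm{JS}(P_r, P_g) = \begin{cases} 0 & \text{if } \theta = 0 \\ \log 2 & \text{if } \theta \neq 0 \end{cases}$$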
The problem of non-overlapping supports has been identified before
Instance Noise isn’t a very satisfying solution (it just adds noise to the inputs and says that the supports now overlap)
Earth-mover distance
a.k.a. EM distance or Wasserstein-1 distance
An alternative to f-divergences which is not a function of the density ratio
If you think of the probability distributions as mounds of dirt, the EM distance describes how much effort it takes to transform one mound into the other using an optimal transport plan
Accounts for both mass and distance
If the supports of the distributions don’t overlap, the EM distance will describe how far apart they are
For the simple example described earlier:
Note that we now have gradients that always point towards the optimal θ!
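A quick sanity check of this claim: any transport plan has to move every point horizontally by at least |θ|, and the plan that pairs each (0, z) with (θ, z) achieves exactly that, so

$$W(P_r, P_g) = |\theta|, \qquad \frac{\partial W}{\partial \theta} = \operatorname{sign}(\theta) \quad \text{for } \theta \neq 0$$

a constant-magnitude gradient pointing towards θ = 0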
EM distance is defined as $W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$
Notation: think of the infimum as a minimum, and read $\Pi(P_r, P_g)$ as the set of all joint distributions $\gamma$ whose marginals are $P_r$ and $P_g$
Considers all possible “configurations” of pairing up points from the two distributions
Calculates the mean distance of pairs in each configuration
Returns the smallest mean distance across all of the configurations
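This pairing view translates directly into code for empirical samples. A minimal sketch, assuming equal-size sample sets with uniform weights so that the optimal plan is a one-to-one matching (found here with SciPy’s Hungarian-algorithm solver; `empirical_em_distance` is a made-up helper name):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_em_distance(xs, ys):
    """EM distance between two equal-size point clouds."""
    # Pairwise Euclidean distances between all cross-distribution pairs
    cost = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1)
    # The optimal one-to-one pairing is the "configuration" with the
    # smallest total (equivalently, mean) distance
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# The simple example from earlier: real samples (0, z), generated samples (θ, z)
rng = np.random.default_rng(0)
n = 200
real = np.stack([np.zeros(n), rng.uniform(0, 1, n)], axis=1)
for theta in [0.0, 0.5, 2.0]:
    fake = np.stack([np.full(n, theta), rng.uniform(0, 1, n)], axis=1)
    print(theta, empirical_em_distance(real, fake))  # ≈ |θ| in each case
```

(The assignment solve is O(n³), so this only illustrates the definition; it is not something you would use during training.)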