Wasserstein GAN
  • a.k.a. WGAN
  • Authors: M. Arjovsky (maths guy), S. Chintala and L. Bottou (Facebook AI guys)

A problem with existing GANs

  • To my knowledge all existing GAN variants minimise an f-divergence between the real data distribution P_r(x) and the generated data distribution P_g(x)
  • The usual GAN objective turns out to be very similar to the Jensen-Shannon (JS) divergence, though the f-GAN paper explains how to use any f-divergence you like
  • An f-divergence is a function of the density ratio P_r(x) / P_g(x)
  • But what if the supports of the two distributions don’t overlap significantly? The density ratio will be infinite or zero everywhere they don’t overlap! 😲 
  • As long as the supports are disjoint, the f-divergence stays constant, since the density ratio is stuck at zero or infinity everywhere
  • Simple example:
  • The real data consists of (0, z) points where z ~ U(0, 1)
  • Samples are uniformly distributed along a vertical line at x = 0 from y = 0 to y = 1
  • The model has one parameter θ such that it produces samples (θ, z)
  • Either the distributions match perfectly or do not overlap at all
  • The above graph shows the JS divergence for different values of θ
  • The graph is mostly flat
  • This means that the gradient of the divergence w.r.t. θ is zero almost everywhere
  • If the discriminator learns to output a highly accurate approximation of the JS divergence, the generator gets approximately zero gradient
  • This is an instance of the “vanishing gradient” problem found in GANs
  • The problem of non-overlapping supports has been identified before
  • For example, Ferenc Huszár has a blog post about it (http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/), which is based on the Instance Noise paper
  • Instance Noise isn’t a very satisfying solution (it just adds noise to the inputs and says that the supports now overlap)
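The flatness in the running example is easy to check numerically. This is a minimal sketch (assuming NumPy) that discretises the x-marginal of each distribution onto a grid — the z-marginals are identical for both distributions, so they cancel out of the divergence — and evaluates the JS divergence for a few values of θ:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Grid over the x coordinate with spacing 0.01.
grid = np.linspace(-1.0, 1.0, 201)
p = np.isclose(grid, 0.0).astype(float)  # real data: all mass at x = 0

for theta in [0.0, 0.01, 0.5, 1.0]:
    q = np.isclose(grid, theta).astype(float)  # generated: all mass at x = theta
    # 0 at theta == 0, then jumps straight to log(2) ≈ 0.693 for any theta != 0
    print(theta, js_divergence(p, q))
```

The divergence is log 2 for every θ ≠ 0, no matter how close θ is to 0 — exactly the flat graph described above.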

Earth-mover distance

  • a.k.a. EM distance or Wasserstein-1 distance
  • An alternative to f-divergence which is not a function of the density ratio
  • If you think of the probability distributions as mounds of dirt, the EM distance describes how much effort it takes to transform one mound of dirt so it is the same as the other using an optimal transport plan
  • Accounts for both mass and distance
  • If the supports of the distributions don’t overlap, the EM distance will describe how far apart they are
  • For the simple example described earlier:
  • Note that we now have gradients that always point towards the optimal θ!
  • EM distance is defined as W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]
  • Notation: think of the infimum as a minimum
  • Considers all possible “configurations” of pairing up points from the two distributions
  • Calculates the mean distance of pairs in each configuration
  • Returns the smallest mean distance across all of the configurations
  • Intractable, can’t compute directly 😞 
  • Fortunately there is an alternative definition
  • K \cdot W(P_r, P_g) = \sup_{\|f\|_L \le K} \left( \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \right)
  • Result of the Kantorovich-Rubinstein duality
  • Notation: think of the supremum as a maximum
  • The supremum is over all K-Lipschitz functions (more on this later)
  • Intuitively, we are finding the function with the greatest margin between its mean value over real examples and its mean value over generated examples
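For the 1-D case the EM distance is actually cheap to compute exactly from samples (SciPy's scipy.stats.wasserstein_distance does this via the CDFs). A quick sketch on the x-marginals of the running example, where all real mass sits at x = 0 and all generated mass at x = θ (the matching z-marginals contribute nothing):

```python
from scipy.stats import wasserstein_distance

for theta in [0.0, 0.01, 0.5, 1.0]:
    # 1-D EM distance between a point mass at 0 and a point mass at theta.
    w = wasserstein_distance([0.0], [theta])
    print(theta, w)  # w == |theta|: linear in theta, so useful gradients everywhere
```

Unlike the JS divergence, which jumps straight to log 2, the EM distance grows linearly with |θ|.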

What the Lipschitz?

  • So, in order to use the EM distance we need to approximate the f(x) that attains the supremum
  • f(x) is K-Lipschitz, which means the magnitude of its gradient is upper bounded by K everywhere
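To make the constraint concrete, here is a small sketch (assuming NumPy) of the dual objective on the running example: f(x) = -|x| is 1-Lipschitz, and plugging it into E_{P_r}[f] - E_{P_g}[f] recovers W = |θ| exactly. The finite-difference check at the end just illustrates the K-Lipschitz condition with K = 1.

```python
import numpy as np

# A 1-Lipschitz candidate for f in the dual form of the EM distance.
def f(x):
    return -np.abs(x)

theta = 0.5
# E_Pr[f] - E_Pg[f] for point masses at x = 0 (real) and x = theta (generated).
dual_value = f(0.0) - f(theta)
print(dual_value)  # 0.5 == |theta|, i.e. this f attains the supremum

# Illustrate the Lipschitz constraint: finite-difference slopes never exceed 1.
xs = np.linspace(-2.0, 2.0, 1001)
slopes = np.abs(np.diff(f(xs)) / np.diff(xs))
assert slopes.max() <= 1.0 + 1e-9
```

No 1-Lipschitz f can do better here, since f(0) - f(θ) ≤ |θ| follows directly from the Lipschitz bound.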