Part 1 : Introducing Normalizing Flows
Written by Swaraj and Kriti

Introducing Normalizing Flows

Normalizing flows, popularized by Rezende and Mohamed (2015), are techniques used in machine learning to transform simple probability distributions into complicated ones. One popular use case is generative modelling, an unsupervised learning method where the goal is to model a probability distribution given samples drawn from that distribution.

Motivation : Why bother about normalizing flows ?

  • They have been used in many TTS (text-to-speech) models, memorably in the Parallel WaveNet model (2017), where a clever application of normalizing flows resulted in 1000 times faster generation of audio samples compared to the original WaveNet model. The Parallel WaveNet model was also deployed on Google Assistant for real-time generation of audio.

  • Normally a back-propagation pass requires the activation value of each neuron to be stored in memory. This restricts how deep and wide a model can be trained on a single GPU (since GPUs have limited memory) and forces one to use small batch sizes during training. In flow-based networks, one does not need to store the activations at all, as they can be reconstructed on the fly during back-propagation. This property was leveraged in the RevNets paper (2017), which uses invertible residual blocks. Reducing the memory cost of storing activations significantly improves the ability to efficiently train wider and deeper networks.

  • Flowtron (2020), an autoregressive flow-based TTS model, performs a kind of representation learning: it uses normalizing flows to learn an invertible mapping from a data space to a latent space, which can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Flowtron matches state-of-the-art TTS models in terms of speech quality and is able to transfer speech characteristics from a source speaker to a target speaker, making the target speaker sound more expressive.

  • If you’ve ever thought about reversible networks, normalizing flows are precisely that. Reversibility of flows also means that one can trivially encode images into the latent space for editing. They also have cool mathematical applications, for example their use in Neural ODE solvers (2019), which use continuous normalizing flows.
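The memory-saving point in the RevNets bullet above can be made concrete with a minimal sketch of an additive reversible block. The residual functions `F` and `G`, the weights `W_f` and `W_g`, and the helper names are hypothetical stand-ins chosen for illustration, not the architecture from the paper. The point is only that the inputs can be recomputed from the outputs, so they would not need to be kept in memory for the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical residual functions; any functions would work here,
# they do not need to be invertible themselves.
W_f = rng.standard_normal((dim, dim))
W_g = rng.standard_normal((dim, dim))

def F(h):
    return np.tanh(h @ W_f)

def G(h):
    return np.tanh(h @ W_g)

def rev_forward(x1, x2):
    """Additive reversible block: each half is updated using the other half."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    """Recover the inputs from the outputs by undoing the two additions in reverse order."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1 = rng.standard_normal((3, dim))
x2 = rng.standard_normal((3, dim))

y1, y2 = rev_forward(x1, x2)
x1_rec, x2_rec = rev_inverse(y1, y2)

# The reconstruction matches the original inputs up to floating point error.
print(np.allclose(x1_rec, x1), np.allclose(x2_rec, x2))
```

Because the inverse only needs the block outputs, a training loop built this way could discard the block inputs after the forward pass and recompute them when gradients are needed, trading extra computation for memory.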

Brief Introduction

Definition : A Normalizing Flow is a transformation of a simple probability distribution into a more complex distribution by a sequence of invertible and differentiable mappings.

Note: The above definition is a simplification; for a more precise one, consult [5]. The more general formalism allows piecewise continuous functions to be used in the construction of the flow, which the above definition excludes.

The name has two parts. Normalizing refers to the fact that the transformed distribution needs to be normalized by the change of variables formula (discussed below). Flow refers to the series of invertible transformations which are composed with each other to create more complex invertible transformations.

When applied as density estimators, some NFs provide a general way of constructing flexible probability distributions over continuous random variables, starting from a simple probability distribution. By constraining the transformations to be invertible, flow-based models provide a tractable method to calculate the exact likelihood for a wide variety of generative modeling problems.

Efficient inference and efficient synthesis: Autoregressive models, such as the PixelCNN, are also reversible. However, synthesis from such models is difficult to parallelize and typically inefficient on parallel hardware. Flow-based generative models like Glow (and RealNVP) are efficient to parallelize for both training and synthesis.

Exact latent-variable inference:
Within the class of exact likelihood models, normalizing flows provide two key advantages: model flexibility and generation speed. Flows have been explored both to increase the flexibility of the variational posterior in the context of variational autoencoders (VAEs), and directly as a generative model. With VAEs, one is able to infer only approximately the value of the latent variables that correspond to a datapoint. GANs have no encoder at all to infer the latents. In flow-based generative models, this inference can be done exactly, without approximation. Not only does this lead to accurate inference, it also enables optimization of the exact log-likelihood of the data, instead of a lower bound on it.

Mathematical Framework: 

Let $z_0$ be a continuous random variable belonging to a simple probability distribution $p_\theta(z_0)$. Let it be a Gaussian with parameters $(\mu, \sigma) = (0, 1)$.

  • $z_0 \sim p_\theta(z_0) = N(z_0; 0, 1)$

A normalizing flow transforms this simple distribution into a desired output probability distribution over a random variable $x$ through a sequence of invertible transformations $f_i$:

  • $z_k = f_\theta(z_0) = f_k \circ \dots \circ f_2 \circ f_1(z_0)$       s.t. each $f_i$ is invertible (bijective)

The composition of all the individual flows is represented by $f_\theta$. Since each $f_i$ is bijective, so is $f_\theta$. The new density $p_\theta(z_k)$ is called a push forward of the initial density $p_\theta(z_0)$ by the function $f_\theta$.
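As a rough illustration of why a composition of bijections is itself a bijection, here is a minimal sketch with two toy invertible maps. The class names `Affine`, `Cube` and `Compose` are hypothetical and chosen only for this example; real flow layers are learned and far more expressive. The forward pass applies $f_1$ first and $f_k$ last, and the inverse applies the individual inverses in the opposite order.

```python
import numpy as np

class Affine:
    """z -> scale * z + shift, invertible whenever scale != 0."""
    def __init__(self, scale, shift):
        assert scale != 0, "scale must be non-zero for the map to be invertible"
        self.scale = scale
        self.shift = shift

    def forward(self, z):
        return self.scale * z + self.shift

    def inverse(self, x):
        return (x - self.shift) / self.scale

class Cube:
    """z -> z**3, a bijection on the real line with inverse cbrt."""
    def forward(self, z):
        return z ** 3

    def inverse(self, x):
        return np.cbrt(x)

class Compose:
    """f_theta = f_k o ... o f_2 o f_1 for a list [f_1, ..., f_k]."""
    def __init__(self, flows):
        self.flows = list(flows)

    def forward(self, z):
        for f in self.flows:            # apply f_1 first, f_k last
            z = f.forward(z)
        return z

    def inverse(self, x):
        for f in reversed(self.flows):  # undo f_k first, f_1 last
            x = f.inverse(x)
        return x

flow = Compose([Affine(2.0, 1.0), Cube(), Affine(0.5, -3.0)])

z0 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
zk = flow.forward(z0)
z0_rec = flow.inverse(zk)

print(zk)
print(np.allclose(z0_rec, z0))
```

Each layer only has to know how to invert itself; `Compose` gets its inverse for free, which is what lets flows stack simple transformations into complex ones.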

An example of a transformation obtained by a normalizing flow is shown below, which transforms a base Gaussian distribution into a target multi-modal distribution with the help of a bijective function.

For $p_\theta(z_0)$ to be a valid probability density, it must integrate to one: $\int p_\theta(z_0)\, dz_0 = 1$. However, this no longer holds if we simply evaluate the base density at the pre-image of a bijective function without any correction (for intuition consider $f_1 : z \rightarrow z^3$).
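The $z \rightarrow z^3$ intuition can be checked numerically. The sketch below is only an illustration with hypothetical helper names (`base_pdf`, `trapezoid`, `std_normal_cdf`), assuming the standard normal base density from above. It integrates the naive density $p_\theta(x^{1/3})$ over $x$, which analytically equals $3\,E[z^2] = 3$ rather than 1. It then checks, on the interval $[1, 8]$, that multiplying by $|dz/dx| = 1 / (3|x|^{2/3})$ restores the correct probability mass; this factor is what the change of variables formula in the next section provides.

```python
import math
import numpy as np

def base_pdf(z):
    """Standard normal density p(z_0) = N(z_0; 0, 1)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def trapezoid(y, x):
    """Plain trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# x = f_1(z) = z**3, so the pre-image of x is z = cbrt(x).
# Naive guess: reuse the base density at the pre-image, with no correction.
x = np.linspace(-1000.0, 1000.0, 400_001)   # |x| <= 1000 covers |z| <= 10
naive_mass = trapezoid(base_pdf(np.cbrt(x)), x)

# Corrected density: multiply by |dz/dx| = 1 / (3 * |x|**(2/3)).
# Checked on [1, 8] only, to stay away from the integrable singularity at x = 0.
x_int = np.linspace(1.0, 8.0, 10_001)
corrected = base_pdf(np.cbrt(x_int)) / (3.0 * np.abs(x_int) ** (2.0 / 3.0))
corrected_mass = trapezoid(corrected, x_int)

# Since f_1 is increasing, P(1 <= x <= 8) must equal P(1 <= z <= 2).
expected_mass = std_normal_cdf(2.0) - std_normal_cdf(1.0)

print("naive total mass:", naive_mass)
print("corrected mass on [1, 8]:", corrected_mass)
print("expected mass on [1, 8]:", expected_mass)
```

The naive total mass lands near 3 instead of 1, while the corrected mass on $[1, 8]$ agrees with the probability that $z_0$ falls in $[1, 2]$.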

Change of Variables Formula