Impressions from ICML+ACL'18: Structure Back in Play, Translation Wants More Context

A few weeks ago I attended the International Conference on Machine Learning (ICML 2018) in Stockholm and, right after, the Annual Meeting of the Association for Computational Linguistics (ACL 2018) on the opposite side of the world: Melbourne. Interestingly, the combination of temporal proximity and geographical distance between these two conferences is becoming a tradition — last year it was ICML in Australia and ACL in Canada.

This year, we presented one paper at ICML, another at ACL, and were invited to deliver a talk at the Second Workshop on Neural Machine Translation and Generation. This blog post shares some of my thoughts about both conferences.

ICML’18: Structured, Deep, Generative

Like NIPS, ICML is growing really fast, with 10 parallel tracks (it used to be 4 when I first attended ICML 10 years ago). Not surprisingly, a large fraction of the presented papers had to do with neural networks — their architecture, their training, their learned representations. A few notes follow, with a focus on structured prediction, deep generative models, and representation learning.

Structured Prediction

In the Structured Prediction track, Vlad Niculae presented our SparseMAP paper (joint work with Mathieu Blondel and Claire Cardie) — a new technique for structured inference that outputs a sparse set of structures, as opposed to a single one (as in MAP inference) or a dense distribution over all structures (as in marginal inference). For example, the model may return only a handful of plausible parse trees for a sentence, assigning zero probability to all the other, implausible, ones. SparseMAP is differentiable (we can plug it in as a hidden layer in a neural network and run the usual backpropagation) and efficient (thanks to an active set algorithm that evaluates SparseMAP by solving a sequence of MAP problems). We can regard it as the structured variant of sparsemax. PyTorch code is provided along with the paper.
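
To give a flavor of the sparsity mechanism, here is a minimal NumPy sketch of (unstructured) sparsemax — the Euclidean projection onto the probability simplex — written here purely for illustration; the actual SparseMAP code released with the paper is in PyTorch and handles arbitrary structures.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Unlike softmax, the result can assign exactly zero probability
    to low-scoring entries (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]             # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum     # entries that stay in the support
    k_z = k[support][-1]                    # support size
    tau = (cumsum[k_z - 1] - 1) / k_z       # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.2, -0.5])))   # -> [0.9, 0.1, 0.0]
```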

Before Vlad, Nataly Brukhim gave a nice talk on “Modeling Cardinality in Structured Prediction”, which incorporates cardinality constraints into multi-label classification, handled through Dykstra’s projection algorithm. Interestingly, I suspect this model can be regarded as a special case of our SparseMAP framework — the constraint corresponds to a “budget” factor, solvable in linear time and with a closed-form gradient (likely more efficient than their proposed approach of unrolling Dykstra’s). Another related paper was presented the day after by Arthur Mensch: a way of using sparsemax in the update equations of dynamic programming algorithms, arriving at differentiable variants in between sum-product and max-product.
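
For readers unfamiliar with Dykstra’s algorithm, here is a generic NumPy sketch (mine, not the authors’ implementation): it projects a point onto the intersection of convex sets by cycling through the individual projections while carrying correction terms. The box and “budget” half-space below are just illustrative choices of sets.

```python
import numpy as np

def dykstra(x0, projections, num_iters=100):
    """Project x0 onto the intersection of convex sets, each given by
    its own (exact) projection operator, using Dykstra's algorithm."""
    x = x0.astype(float).copy()
    corrections = [np.zeros_like(x) for _ in projections]
    for _ in range(num_iters):
        for i, proj in enumerate(projections):
            y = proj(x + corrections[i])
            corrections[i] = x + corrections[i] - y
            x = y
    return x

# Illustrative sets: the unit box [0, 1]^n and a budget half-space sum(x) <= B.
project_box = lambda v: np.clip(v, 0.0, 1.0)
project_budget = lambda v, B=2.0: v - max(0.0, (v.sum() - B) / v.size)

x = dykstra(np.array([1.5, 0.9, 0.8, -0.3]), [project_box, project_budget])
print(x, x.sum())   # a point inside the box with sum(x) <= 2
```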

Deep Generative Models

Deep generative models, most prominently variational auto-encoders (VAEs) and generative adversarial networks (GANs), are attracting a great deal of hype these days. Max Welling gave a beautiful invited talk (“Intelligence per Kilowatthour”) motivating deep generative models through the lens of information theory, energy minimization, and the minimum description length principle, all rooted in the equation F (free energy) = E (energy) - H (entropy).
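
Spelled out (my gloss, not a transcription of the talk): for a distribution q over configurations x with energy E(x), the variational free energy is

$$ F(q) \;=\; \mathbb{E}_{q}\big[E(x)\big] \;-\; H(q), $$

which is minimized by the Boltzmann distribution q(x) ∝ exp(−E(x)); the negative ELBO discussed below is a free energy of exactly this form, with E(z) = −log p(x, z).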

Several conference papers presented new insights into VAEs and GANs. VAEs are usually trained by maximizing the evidence lower bound (ELBO), a tractable variational approximation to the likelihood. It has been observed, however, that this procedure often yields poor latent representations. As an alternative, Fixing a Broken ELBO proposes variational bounds on the mutual information between the input and the latent variables, which are more general than the ELBO. The result is a full rate-distortion curve that trades off compression and reconstruction accuracy, from which we can seek Pareto-optimal solutions. Two other papers propose techniques to reduce the amortization gap of VAEs (this gap quantifies the suboptimality of the variational parameters found by inference networks, i.e., the encoder part of the VAE): Iterative Amortized Inference proposes an iterative strategy akin to learning-to-learn, augmenting inference networks with an extra loop that updates the variational parameters with stochastic gradients, while Semi-Amortized Variational Auto-Encoders proposes a hybrid approach between stochastic and amortized variational inference, leveraging differentiable optimization.
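
For reference, here is the standard ELBO with the rate-distortion reading of its two terms (schematically; the paper defines the rate more carefully, relative to the aggregate posterior rather than the prior):

$$ \log p_\theta(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]}_{-\,\text{distortion}} \;-\; \underbrace{\mathrm{KL}\big(q_\phi(z\mid x)\,\big\|\,p(z)\big)}_{\text{rate}} \;=\; \mathrm{ELBO}(x). $$

Weighting the two terms differently traces out the full rate-distortion curve instead of committing to a single point on it.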

As for GANs, the central idea is, rather than approximating maximum likelihood training (equivalently, KL divergence minimization), to replace it with a different objective: training the generator to produce samples that fool a supervised discriminator. This boils down to minimizing the Jensen-Shannon divergence (in the original GAN paper) or the earth mover’s distance (in Wasserstein GANs).
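
Concretely, the original minimax objective is

$$ \min_G \max_D \;\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z\sim p(z)}\big[\log\big(1 - D(G(z))\big)\big], $$

and with an optimal discriminator the generator’s loss reduces (up to a constant) to the Jensen-Shannon divergence between the data and model distributions; Wasserstein GANs swap in a 1-Lipschitz critic, which turns the same game into earth mover’s distance minimization.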

An advantage of GANs over other generative models is that they tend to generate sharp outputs (useful when multi-modal distributions are desired); however, they also suffer from mode collapse (meaning that they tend to generate only from a few modes). Their training dynamics, which boil down to finding a Nash equilibrium of a minimax problem, are not yet fully understood. Lars Mescheder gave a nice presentation shedding some light on this, addressing the question Which Training Methods for GANs do actually Converge? by analyzing the spectrum of the Jacobian of the gradient field of the GAN objective.

Another challenge (very relevant to us NLP researchers) is to make GANs generate discrete data like text (most research on GANs has focused on continuous outputs like images). This is much more challenging, as the GAN training set-up, based on alternating gradient updates, requires the output of the generator to be differentiable, hence continuous. Previously proposed solutions include policy gradient methods like REINFORCE and the Gumbel-softmax reparametrization trick. Adversarially Regularized Autoencoders proposes an alternative solution, combining a discrete auto-encoder with GAN-regularized continuous latent representations, with interesting results in textual style transfer tasks. It feels, however, like there is still a lot to do in this space.
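
As a reminder of what the Gumbel-softmax trick buys us here, a minimal NumPy sketch (illustrative only; in practice this lives inside an autodiff framework): it replaces a hard categorical sample with a temperature-controlled soft one, so that gradients can flow from the discriminator back into the generator’s logits.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=0.5, rng=np.random.default_rng()):
    """Soft, reparametrized sample from Categorical(softmax(logits)).

    As temperature -> 0 the output approaches a one-hot vector; for
    temperature > 0 it stays a smooth function of the logits."""
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = (logits + gumbel) / temperature
    y = y - y.max()                                # numerical stability
    return np.exp(y) / np.exp(y).sum()             # softmax

print(gumbel_softmax_sample(np.array([1.0, 1.0, 3.0])))
```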

There was also this cool workshop on deep generative models that I unfortunately could not attend, as I was flying to ACL.

Sequence to Sequence

Sequence-to-sequence learning has been dominated by autoregressive models (e.g., in a recurrent decoder, each emitted output symbol is fed back as input to the LSTM at the next time step, creating a dependency on previous output symbols). Non-autoregressive sequence-to-sequence models are an active topic of research — if they worked, decoders could be parallelized and made much faster. Existing work accomplishes this, but typically with a drop in accuracy or the need for iterative refinement.
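
To make the sequential bottleneck concrete, here is a schematic greedy decoding loop (the model.init_state / model.step interface is hypothetical, not any particular library’s API): each step consumes the previously emitted token, which is exactly what prevents decoding the whole sequence in parallel.

```python
def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=100):
    """Greedy autoregressive decoding: O(target length) sequential steps."""
    state = model.init_state(src_tokens)        # hypothetical: encode the source
    output = [bos_id]
    for _ in range(max_len):
        # hypothetical: one decoder step, conditioned on the previous token
        logits, state = model.step(output[-1], state)
        next_token = int(logits.argmax())
        output.append(next_token)
        if next_token == eos_id:                # stop once end-of-sequence is emitted
            break
    return output[1:]
```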

A Google AI paper, Fast Decoding in Sequence Models Using Discrete Latent Variables, proposes a less autoregressive approach, where a small number of latent variables are generated autoregressively (on top of a Transformer network), and then the target words are decoded in parallel, in a non-autoregressive manner, conditioned on the latent variables and the source words (there seems to be more recent work that combines this approach with distillation, with improved results). Apparently it is crucial that the latent variables are discrete, although I don’t fully understand why.

Another machine translation paper, Analyzing Uncertainty in Neural Machine Translation, from FAIR, improves our understanding of sequence-to-sequence models by assessing how uncertainty caused by noisy training data is captured by the model distribution and how it affects search, and proposes tools to assess model calibration.

Representation Learning and Test of Time

Another interesting paper was Representation Tradeoffs for Hyperbolic Embeddings (and two other related papers: this and this), from which I learned about the recent framework of embeddings in hyperbolic space. The common theme in this (mathematically beautiful) line of research is that embeddings on hyperbolic manifolds (as opposed to Euclidean space) are suitable for representing hierarchical relations among objects; for example, trees can be embedded in the Poincaré disk, the 2-dimensional hyperbolic space, with arbitrarily low distortion. As many objects of interest (e.g., social networks, WordNet-like knowledge bases, entailment relations) have a hierarchical structure, this geometry may be an appealing alternative to the usual flat Euclidean space. In this paper, the authors manage to represent the WordNet taxonomy with state-of-the-art precision using very compact two-dimensional representations.
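
For concreteness, the distance between two points u and v in the Poincaré ball (the open unit ball equipped with the hyperbolic metric) is

$$ d(u,v) \;=\; \operatorname{arcosh}\!\left(1 \;+\; 2\,\frac{\lVert u-v\rVert^2}{\big(1-\lVert u\rVert^2\big)\big(1-\lVert v\rVert^2\big)}\right), $$

so distances blow up near the boundary of the ball; that exponentially growing “room” is what lets tree-like hierarchies embed with low distortion.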

Ronan Collobert and Jason Weston received the Test-of-Time Award for their influential ICML 2008 paper “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multi-Task Learning.” I remember this paper very well (ICML 2008 was the first machine learning conference I attended), and the “cold” reception it got from most of the NLP audience, at a time when neural networks were anything but popular. I admit I was skeptical too — how could a single model, without any feature engineering, trained for weeks on all of Wikipedia, achieve state-of-the-art scores on several NLP tasks?

There were some problems with the original paper’s evaluation on the semantic role labeling task (fixed in a later JMLR paper), and a great deal of extrapolation from making progress on this shallow task to being close to solving “all of semantics” (this may be what annoyed the NLP community the most).

While the latter problem persists in the community to this day, time has definitely proved that this was a really valuable contribution, and the award is very well deserved. Ten years later, we’re all using continuous word representations trained on large datasets and training end-to-end models “from scratch.” Except maybe in some industry niches, feature engineering is long gone, progressively replaced by representation learning and other forms of engineering: architecture search, hyperparameter tuning, transfer learning. Perhaps another take-home message (and I may lose some friends here) is that the NLP community should be more open to contributions from other fields, even when they seem ignorant about language — a closed community is condemned to overfit its own techniques.

ACL’18: Fertility, Context-Aware Translation, Linguistic Structural Bias