In the last two years, attentional sequence-to-sequence neural models have become the state of the art in machine translation, far surpassing the accuracy of phrasal translation systems in many scenarios. However, these Neural Machine Translation (NMT) systems are not without their difficulties: training a model on a large-scale data set can often take weeks, and they are typically much slower at decode time than a well-optimized phrasal system. In addition, robust training of these models often relies on particular 'recipes' that are not well explained or justified in the literature. In this talk, I will describe a number of tricks and techniques to substantially speed up training and decoding of large-scale NMT systems. These techniques, which range from algorithmic to engineering-focused, reduced the time required to train a large-scale NMT system from two weeks to two days, and improved the decoding speed to match that of a well-optimized phrasal MT system. In addition, I will attempt to give empirical and intuitive justification for many of the choices made regarding architecture, optimization, and hyperparameters. Although this talk will primarily focus on NMT, the techniques described here should generalize to a number of other models based on sequence-to-sequence and recurrent neural networks, such as caption generation and conversational agents.
See more on this video at https://www.microsoft.com/en-us/research/video/practical-guide-neural-machine-translation/