Erasure: Measures the additional reading burden on the user due to instability. It is the number of words that are erased and replaced for every word in the final translation.

Lag: Measures the average time that has passed between when a user utters a word and when the word’s translation displayed on the screen becomes stable. Requiring stability avoids rewarding systems that can only manage to be fast due to frequent corrections.

BLEU score: Measures the quality of the final translation. Quality differences in intermediate translations are captured by a combination of all metrics.
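To make the erasure definition concrete, here is a minimal sketch of how it could be computed from the sequence of translations shown on screen. It assumes that the words erased at each update are those of the previous output that fall outside its longest common prefix with the new output, and that tokens are whitespace-separated words. The function names are hypothetical and not taken from the papers.

```python
from typing import List


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def normalized_erasure(outputs: List[str]) -> float:
    """Words erased per word of the final translation.

    `outputs` holds the successive translations displayed to the user;
    the last entry is the final translation. Assumption: an update erases
    every word of the previous output after the common prefix.
    """
    if not outputs:
        raise ValueError("need at least one displayed translation")
    tokenized = [o.split() for o in outputs]
    final_len = len(tokenized[-1])
    if final_len == 0:
        raise ValueError("final translation is empty")
    erased = 0
    for prev, curr in zip(tokenized, tokenized[1:]):
        erased += len(prev) - common_prefix_len(prev, curr)
    return erased / final_len


if __name__ == "__main__":
    updates = ["the cat", "the dog sat", "the dog sat down"]
    # One word ("cat") is erased across the updates; the final output has 4 words.
    print(normalized_erasure(updates))
```

A system that only ever appends to its output scores zero under this measure, which matches the intuition that nothing on screen had to be re-read.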

Evaluating Live Translation: The Metrics

It is important to recognize the inherent trade-offs between these different aspects of quality. Transcribe enables live translation by stacking machine translation on top of real-time automatic speech recognition. For each update to the recognized transcript, a fresh translation is generated in real time; several updates can occur each second. This approach placed Transcribe at one extreme of the three-dimensional quality framework: it exhibited minimal lag and the best quality, but also had high erasure. Understanding this allowed us to work towards finding a better balance.

The Performance Measure of Quality

The combination of masking and biasing produces a re-translation system with high quality and low latency, while virtually eliminating erasure. The table below shows how the metrics react to the heuristics we introduced and how they compare to the other systems discussed above. The graph demonstrates that even with a very small erasure budget, re-translation surpasses zero-flicker streaming translation systems (MILk and Wait-k) trained specifically for live translation.

Zero-Flicker Streaming

The solution outlined above returns a decent translation very quickly, while allowing it to be revised as more of the source sentence is spoken. The simple structure of re-translation enables the application of our best speech and translation models with minimal effort. However, reducing erasure is just one part of the story — we are also looking forward to improving the overall speech translation experience through new technology that can reduce lag when the translation is spoken, or that can enable better transcriptions when multiple people are speaking.

The Bottom Line

The new version of the Google Translate app significantly reduces translation revisions and improves the user experience. The research enabling this is presented in two papers. The <a href="https://arxiv.org/abs/2004.03643">first</a> formulates an evaluation framework tailored to live translation and develops methods to reduce instability. The <a href="https://arxiv.org/abs/2004.03643">second</a> demonstrates that these methods do very well compared to alternatives, while still retaining the simplicity of the original approach. The resulting model is much more stable and provides a noticeably improved reading experience within Google Translate.

A New Update

The transcription feature in the Google Translate app may be used to create a live, translated transcription for events like meetings and speeches, or for a story at the dinner table. In such settings, it is useful for the translated text to be displayed promptly to help keep the reader engaged. In early versions of this feature, however, the translated text suffered from multiple real-time revisions. These revisions stem from the non-monotonic relationship between the source and the translated text, in which words at the end of the source sentence can influence words at the beginning of the translation.

Real Time Conversion Of Languages We Don't Understand

The end of an on-going translation tends to flicker because it is more likely to have dependencies on source words that have yet to arrive. We reduce this by truncating some number of words from the translation until the end of the source sentence has been observed. This masking process thus trades latency for stability, without affecting quality. This is very similar to delay-based strategies used in streaming methods such as <a href="https://arxiv.org/abs/1810.08398">Wait-k</a>, but applied only during inference and not during training.
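The masking step described above can be sketched in a few lines. This is an illustrative sketch only: the function name `mask_translation`, the parameter `k`, and the `source_complete` flag are assumptions made for the example, and tokens are taken to be whitespace-separated words rather than whatever units the production system uses.

```python
def mask_translation(translation: str, k: int, source_complete: bool) -> str:
    """Hide the last k words of an in-progress translation.

    The trailing words are the ones most likely to change as more source
    text arrives, so they are withheld until the source sentence is
    complete. Once it is complete, the full translation is shown.
    """
    words = translation.split()
    if source_complete or k <= 0:
        return " ".join(words)
    visible = max(0, len(words) - k)  # never slice with a negative length
    return " ".join(words[:visible])


if __name__ == "__main__":
    partial = "the dog sat down"
    print(mask_translation(partial, k=2, source_complete=False))  # the dog
    print(mask_translation(partial, k=2, source_complete=True))   # the dog sat down
```

A larger `k` hides more of the unstable tail and so reduces flicker, at the cost of showing each word later. The final translation is untouched, which is why quality is unaffected.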

The End Game

In our paper, “<a href="https://arxiv.org/abs/2004.03643">Re-translation versus Streaming for Simultaneous Translation</a>”, we show that our original “re-translation” approach to live translation can be fine-tuned to reduce erasure and achieve a more favourable erasure/lag/BLEU trade-off. Without training any specialized models, we applied a pair of inference-time heuristics to the original machine translation models — masking and biasing.

Real-Time Streaming 

One straightforward solution to reduce erasure is to decrease the frequency with which translations are updated. Along this line, “streaming translation” models (for example, <a href="https://arxiv.org/abs/1810.08398">STACL</a> and <a href="https://arxiv.org/abs/1906.05218">MILk</a>) intelligently learn to recognize when sufficient source information has been received to extend the translation safely, so the translation never needs to be changed. In doing so, streaming translation models are able to achieve zero erasure.
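As a simplified illustration of how a streaming system avoids revisions, the sketch below generates the read/write schedule of a fixed wait-k policy: read k source words first, then alternate between writing one target word and reading one more source word. This is the fixed-schedule variant only; learned policies such as MILk decide when to read or write adaptively and are not modelled here. The function name and the `"READ"`/`"WRITE"` labels are hypothetical.

```python
from typing import List


def wait_k_schedule(source_len: int, target_len: int, k: int) -> List[str]:
    """Return the sequence of READ/WRITE actions for a wait-k policy.

    Target word t (0-indexed) is written only after min(k + t, source_len)
    source words have been read. Written words are never changed, so
    erasure is zero by construction.
    """
    actions: List[str] = []
    read = 0
    for written in range(target_len):
        needed = min(k + written, source_len)
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


if __name__ == "__main__":
    print(wait_k_schedule(source_len=4, target_len=4, k=2))
```

Because every `WRITE` is final, the system must commit to a word with only a limited view of the source. This is the cost that re-translation avoids by allowing revisions.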

Stabilizing Re-translation

Deep dive into Google's machine learning models and how they rely more on logic than data