'Savvy' shortcuts produce near-instant speech-to-speech translation across 36 languages
Babel Fish-like ML model emerges after training on 4.5 million hours of multilingual spoken audio
Meta has developed a machine learning model its researchers claim offers near-instant speech-to-speech translation across around 36 languages.
Reminiscent of the Babel Fish from The Hitchhiker’s Guide to the Galaxy, the foundation model SEAMLESSM4T was trained on 4.5 million hours of recorded human speech and takes a "savvy" approach that avoids onerous data annotation by exploiting snippets of internet audio.
Presenting the paper in the journal Nature today, the team from the Facebook parent company said that a relatively open model, on which other applications could be built, could support on-demand translation, "streamlining multilingual exchange across various contexts."
In an accompanying article, Tanel Alumäe, professor of speech processing at Estonia's Tallinn University of Technology, said the model was pre-trained on a massive data set containing 4.5 million hours' worth of multilingual spoken audio to help establish patterns in the data, "making it easier to fine-tune the model for specific tasks without the need for large amounts of bespoke training data."
The research team also used a new automation technique to avoid annotating vast amounts of training data.
"One of the SEAMLESS team's savviest strategies involved 'mining' the internet for training pairs that align across languages — such as audio snippets in one language that match subtitles in another. Starting with some data that they knew to be reliable, the authors trained the model to recognize when two pieces of content (such as a video clip and a corresponding subtitle) actually match in meaning," Alumäe explained.
The technique helped Meta's Seamless Communication Team collect around 443,000 hours of audio with matching text and align about 30,000 hours of speech pairs, which they then used to further train the model. Alumäe praised Meta's level of openness with the model, which, like the Llama family of large language models, can be used as a foundation for other applications. "This level of openness is a huge advantage for researchers who lack the massive computational resources needed to build these models from scratch," he said.
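The mining strategy Alumäe describes can be sketched roughly as follows: embed audio clips and candidate subtitle texts into a shared vector space, then keep only the pairs whose embeddings are sufficiently similar in meaning. This is an illustrative toy, not Meta's actual pipeline; the embeddings, the `mine_pairs` function, and the similarity threshold are all invented for the example.

```python
# Toy sketch of similarity-based "mining" of cross-lingual training pairs:
# audio clips and subtitle texts are assumed to already be embedded into a
# shared vector space; clips and texts whose embeddings are close enough
# are kept as candidate training pairs. All names here are hypothetical.
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(audio_embs, text_embs, threshold=0.9):
    """Pair each audio clip with its best-matching text, keeping the pair
    only if the shared-space similarity clears the threshold."""
    pairs = []
    for i, a in enumerate(audio_embs):
        best_j = max(range(len(text_embs)),
                     key=lambda j: cosine(a, text_embs[j]))
        if cosine(a, text_embs[best_j]) >= threshold:
            pairs.append((i, best_j))
    return pairs

# Toy embeddings: clip 0 aligns closely with text 1; clip 1 matches nothing.
audio = [[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]]
texts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.1]]
print(mine_pairs(audio, texts))  # → [(0, 1)]
```

In the real system the embeddings would come from a multilingual speech-and-text encoder, so that a clip in one language and a subtitle in another land near each other only when they share meaning, which is what lets the reliable seed data bootstrap the rest.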
However, others have criticized LLaMA-3 for its "distinctly non-open use restrictions."
Meta's new model can also translate up to 100 languages from speech to text, we're told. Alumäe pointed out that while impressive, this figure was well short of the 7,000 languages spoken around the world.
"The tool also struggles in many situations that humans handle with relative ease — for example, conversations in noisy places or between people with strong accents. However, the authors' methods for harnessing real-world data will forge a promising path towards speech technology that rivals the stuff of science fiction," he said.
In a second accompanying article, Allison Koenecke, of Cornell University's Department of Information Science, pointed out that while the breakthrough could represent a more efficient and cost-effective method of transcribing and translating than humans can currently provide, "it is imperative to understand the ways in which these technologies fail — disproportionately so for some demographics."
"Future work must ensure that speech-technology researchers ameliorate performance disparities, and that users are well informed about the potential benefits and harms associated with these models," she said. In the paper, Meta describes how it measured language "toxicity" and gender bias.
The researchers also said natural speech "encompasses a suite of prosodic — rhythm, stress, intonation or tone — and emotional components that deserve further research."
They added: "To create S2ST systems that feel organic and natural, more research should be directed at output generation that preserves expressivity. Moreover, the consummate realization of the Babel Fish requires deeper investments into research on low-latency speech translation. Developing systems that enable streaming (that is, incrementally translating an input sentence as it is being presented) may increase the adoption of these systems across institutional contexts. We hope that SEAMLESSM4T opens up new possibilities for both these research areas." ®