Meta unveils speech-to-speech translator

Meta’s SeamlessM4T: Advancing Speech Translation with Multimodality


Meta, the parent company of Facebook, Instagram, and WhatsApp, recently announced its latest advance in machine translation. The new program, SeamlessM4T, focuses on speech translation and is aimed at bridging the gap between different languages. Unlike existing models that specialize in either speech-to-speech or speech-to-text translation, SeamlessM4T is an example of multimodality: a single program that handles both speech and text data.

In the past, Meta had primarily focused on large language models for text translation across 200 languages. However, the lead author of the SeamlessM4T project, Loïc Barrault, along with researchers from Meta and the University of California, Berkeley, recognized the limitations of text-based models in achieving comprehensive speech-to-speech translation. In their formal paper titled “SeamlessM4T – Massively Multilingual & Multimodal Machine Translation,” they discuss the challenges in addressing the multilingual and multimodal aspects of translation.

One obstacle to speech translation is the lack of readily available speech data for training neural networks, which hampers the development of accurate models. However, the authors argue that speech, compared to text, conveys richer information and expressive components, making it better at conveying intent and fostering stronger social connections between users.

The main goal of SeamlessM4T is to train a single program that can effectively process both speech and text data. The program, named “Massively Multilingual & Multimodal Machine Translation” (M4T), combines different components to achieve seamless translation. Instead of employing a cascaded approach where separate functions handle different parts of the translation process, the authors propose an end-to-end program that integrates various existing parts. This approach ensures a more cohesive and efficient translation process.

The four core components of SeamlessM4T are as follows:

  1. SeamlessM4T-NLLB: A massively multilingual text-to-text translation model.
  2. w2v-BERT 2.0: A speech representation learning model that leverages unlabeled speech audio data.
  3. T2U: A text-to-unit sequence-to-sequence model.
  4. Multilingual HiFi-GAN: A unit vocoder for speech synthesis.

These components are combined, like Lego pieces, into a single program called UnitY. UnitY operates on a two-pass modeling framework, generating text first and then predicting discrete acoustic units.
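The two-pass flow described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the SeamlessM4T code or API: the class name, the method names, and the toy functions are invented for illustration, and the second pass here consumes plain text for simplicity, which may differ from how the real model passes information between stages. The mapping of the four components onto four callables is likewise only one reading of the list above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases; none of these names come from the SeamlessM4T code.
SpeechEncoder = Callable[[List[float]], List[float]]  # audio samples -> features
TextDecoder = Callable[[List[float], str], str]       # features, target language -> text
TextToUnit = Callable[[str, str], List[int]]          # text, target language -> unit IDs
Vocoder = Callable[[List[int], str], List[float]]     # unit IDs, language -> waveform


@dataclass
class TwoPassTranslator:
    """Toy composition of four parts into one two-pass pipeline."""
    encode_speech: SpeechEncoder   # role played by the speech representation model
    decode_text: TextDecoder       # role played by the text translation model
    text_to_units: TextToUnit      # role played by T2U
    vocode: Vocoder                # role played by the unit vocoder

    def speech_to_text(self, audio: List[float], tgt_lang: str) -> str:
        # First pass only: translated text from input speech.
        return self.decode_text(self.encode_speech(audio), tgt_lang)

    def speech_to_speech(self, audio: List[float], tgt_lang: str) -> Tuple[str, List[float]]:
        text = self.speech_to_text(audio, tgt_lang)   # pass 1: generate text
        units = self.text_to_units(text, tgt_lang)    # pass 2: predict discrete units
        return text, self.vocode(units, tgt_lang)     # synthesize a waveform from units


# Toy stand-ins so the sketch runs; they do no real translation.
def toy_encoder(audio: List[float]) -> List[float]:
    return [abs(x) for x in audio]

def toy_decoder(features: List[float], tgt_lang: str) -> str:
    return f"<{tgt_lang}> {len(features)} frames"

def toy_t2u(text: str, tgt_lang: str) -> List[int]:
    return [ord(c) % 100 for c in text]

def toy_vocoder(units: List[int], tgt_lang: str) -> List[float]:
    return [u / 100.0 for u in units]


translator = TwoPassTranslator(toy_encoder, toy_decoder, toy_t2u, toy_vocoder)
text, waveform = translator.speech_to_speech([0.1, -0.2, 0.3], "fra")
print(text, len(waveform))
```

Because the text is produced as an intermediate result, the same object can serve speech-to-text requests by stopping after the first pass, which is the practical appeal of a single program over separate systems.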

[Figure: diagram of the program’s architecture]

The program’s authors conducted extensive tests to evaluate its performance in speech recognition, speech translation, and speech-to-text tasks. The results showed that SeamlessM4T outperformed other similar programs, including both end-to-end and explicitly speech-focused models. The evaluation demonstrated significant improvements in translation accuracy and speech recognition capabilities.

The companion GitHub site not only provides access to the program code but also introduces two additional technologies. SONAR, a new multimodal data embedding technology, helps integrate different types of data in the program. BLASER 2.0, an updated metric, enables automatic evaluation of multimodal tasks.

Meta’s SeamlessM4T program represents a significant advancement in machine translation, specifically in the realm of speech translation. By harnessing the power of multimodality and training on both speech and text data simultaneously, SeamlessM4T demonstrates Meta’s dedication to breaking down language barriers and improving communication between people from different linguistic backgrounds.

Beyond its practical applications, the program also represents a leap forward in artificial intelligence research. Meta’s ongoing efforts to develop and refine machine translation models undoubtedly pave the way for future advancements in the field, encouraging the creation of more inclusive and connected global communities.