Cerebras and Abu Dhabi create powerful Arabic-language AI model.

Breaking Language Barriers: Cerebras Systems’ Jais-Chat Revolutionizes AI for Arabic

In artificial intelligence (AI), language models have taken center stage. Yet existing programs like ChatGPT focus primarily on English, leaving hundreds of other widely spoken languages out in the cold. That is starting to change with a collaboration between AI startup Cerebras Systems and Abu Dhabi's Inception, a subsidiary of the UAE investment firm G42. Together they are introducing Jais-Chat, an open-source large language model designed specifically for Arabic and its roughly 400 million speakers worldwide.

Jais-Chat works much like ChatGPT, but with Arabic capabilities. It can produce Arabic-language writing when prompted in English or generate responses in Arabic when given an Arabic-language prompt. Trained on an extensive corpus of Arabic texts, Jais-Chat focuses solely on Arabic and English instead of attempting to handle hundreds of languages with mediocre results. This focused approach has paid off: in tests assessing knowledge, reasoning, and bias, the model outscored leading open-source language models such as Meta's Llama 2, as well as specialized Arabic models.

Andrew Feldman, co-founder and CEO of Cerebras, emphasizes the significance of this development in democratizing AI. Arabic is the primary language of some 25 nations and is spoken by roughly 400 million people, so giving Arabic speakers a voice in AI is a genuinely transformative endeavor. Closing the language gap in AI has been difficult because English content dominates the internet. Previous attempts, such as Meta's "No Language Left Behind" initiative, took a generalist approach but struggled to improve performance in many languages, including some with substantial translation resources.

Jais-Chat’s success can be attributed to the care put into the program and its specialized Arabic dataset. Researchers at Inception and Cerebras compiled 55 billion Arabic tokens from reputable sources such as Abu El-Khair, a corpus of Arabic news articles, and the Arabic-language edition of Wikipedia. They then augmented the dataset by translating 3 billion tokens from English Wikipedia and 15 billion tokens from the Books3 corpus, bringing the Arabic corpus to roughly 72 billion tokens, which were up-sampled during training to an effective 116 billion Arabic tokens. In a further twist, they combined the Arabic text with English text and billions of tokens of computer code from GitHub, for a training mix of roughly 29% Arabic, 59% English, and 12% code.
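To make those proportions concrete, here is a minimal Python sketch of splitting a training-token budget according to the 29/59/12 mix. The total budget and the corpus labels are illustrative assumptions, not the actual Jais recipe.

```python
# Hypothetical sketch: splitting a training-token budget by language mix.
# The 29%/59%/12% proportions come from the article; the total budget below
# is an assumption for illustration, not the actual Jais figure.

TOTAL_TOKENS = 400e9  # assumed overall training budget (illustrative)

MIX = {
    "arabic": 0.29,   # native Arabic text plus translated English material
    "english": 0.59,  # English web and book corpora
    "code": 0.12,     # code snippets drawn from GitHub
}

def token_budget(total: float, proportions: dict) -> dict:
    """Split a total token budget according to the desired language mix."""
    assert abs(sum(proportions.values()) - 1.0) < 1e-9, "mix must sum to 1"
    return {name: int(total * share) for name, share in proportions.items()}

if __name__ == "__main__":
    for name, tokens in token_budget(TOTAL_TOKENS, MIX).items():
        print(f"{name:>7}: {tokens / 1e9:6.1f}B tokens")
```

With the assumed 400-billion-token budget, the Arabic share works out to about 116 billion tokens, matching the figure above.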

To represent Arabic vocabulary more effectively, the researchers built their own tokenizer, the program that splits text into sub-word units. This was necessary because the standard tokenizers used in programs like GPT-3 are trained primarily on English corpora and fragment Arabic words inefficiently. They also adopted ALiBi (Attention with Linear Biases), a positional encoding method developed by researchers at the University of Washington, Meta, and the Allen Institute for AI, which helps the model handle long context inputs.
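As a rough illustration of how a bilingual tokenizer can be trained, here is a sketch using the Hugging Face tokenizers library. The file names and vocabulary size are placeholders, not the actual Jais configuration.

```python
# Minimal sketch: training a byte-level BPE tokenizer on mixed Arabic/English
# text with the Hugging Face `tokenizers` library. File paths and vocabulary
# size are placeholders, not the actual Jais setup.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=84_000,  # illustrative size, enlarged to leave room for Arabic sub-words
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Train on a balanced sample of Arabic and English text (hypothetical files).
tokenizer.train(["arabic_sample.txt", "english_sample.txt"], trainer=trainer)
tokenizer.save("bilingual_tokenizer.json")

# A tokenizer trained this way splits Arabic words into far fewer pieces than
# an English-only tokenizer, so each Arabic sentence consumes fewer tokens.
print(tokenizer.encode("مرحبا بالعالم").tokens)
```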

Jais, the resulting model, and its companion chat app, Jais-Chat, bring these pieces together. Based on the GPT-3 architecture, Jais has 13 billion parameters and is named after Jebel Jais, the highest peak in the United Arab Emirates. The Jais code is released under the Apache 2.0 license and is available for download on Hugging Face. A waitlist offers access to a demo of Jais, and the authors plan to release the dataset publicly in the near future.
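For readers who want to try the model, here is a minimal sketch of loading a Jais-Chat checkpoint from Hugging Face with the transformers library. The repository id and the trust_remote_code flag are assumptions based on how custom-architecture models are typically published; consult the model card for the exact, supported usage.

```python
# Hypothetical sketch of loading Jais-Chat from Hugging Face with transformers.
# The repository id below is an assumption; check the model card for the
# official id, prompt format, and hardware requirements (a 13B-parameter
# model needs a large GPU or CPU offloading).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inception-mbzuai/jais-13b-chat"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Prompt in English; the model can reply in Arabic (and vice versa).
prompt = "Translate to Arabic: The highest peak in the UAE is Jebel Jais."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```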

Jais and Jais-Chat were trained and fine-tuned on Cerebras' Condor Galaxy 1, which the company bills as the world's largest AI supercomputer. Built from Cerebras' special-purpose CS-2 AI computers paired with AMD EPYC x86 server processors, the machine offers enormous processing power. Using Condor Galaxy 1 significantly reduced training time compared with clusters of GPUs from standard AI vendors such as Nvidia. The Jais models not only outperformed existing open-source Arabic models but also achieved performance comparable to leading English models despite training on significantly less data.

Cerebras' contribution to the open-source community doesn't end here. With models like BTLM-3B-8K gaining popularity on Hugging Face, Cerebras is paving the way for further advances in AI. By building supercomputers and fostering open-source collaboration, the company is pushing the democratization of AI forward, both expanding what the technology can do and making it accessible to a global audience. With Jais-Chat at the forefront, the language barrier in AI is finally starting to crumble, opening the door to meaningful AI interaction for Arabic speakers.