GenAI models are changing the world, but mainly for people who speak English. That’s because most of them have been trained predominantly on English data. As a result, GenAI content often doesn’t reflect the linguistic diversity of its users, which can create barriers for non-English speakers and limit potential value.
In a world where nearly 7,000 languages are spoken, it’s critical to train GenAI models in multiple languages. Doing so can offer your business the ability to serve both global and local audiences. Overcoming these linguistic challenges in GenAI engineering starts with multilingual datasets.
English dominates GenAI models
There are many reasons why most GenAI models are trained on predominantly English data. For one thing, English is one of the most widely used languages on the Internet, which results in a vast amount of data available in English. This publicly available data is crucial for training GenAI models because it meets the need for training data to be affordable, accessible, and expansive.
A significant amount of research and development in the field of AI and machine learning is conducted in English-speaking countries or in English-language academic and research institutions. This naturally leads to a focus on English in the development of technology.
There’s also the profit motive to consider. English-speaking markets are among the largest and most lucrative for technology products, so companies often prioritize English for their initial models to cater to these markets.
But developing GenAI models exclusively in English has serious drawbacks. English-speaking markets are certainly lucrative, but if you want to expand globally, you’ll need to be more responsive to markets around the world where people speak diverse languages.
Companies increasingly recognize the importance of catering to a global audience, too. For example, off-the-shelf GenAI models like ChatGPT are becoming multilingual to address a growing global user base.
The expansion of GenAI capabilities into multiple languages opens new avenues for research and application. Multilingual and multicultural models can potentially extend their sophisticated content analysis and generation capabilities to a multitude of underserved markets, even those with limited data access.
Monolingual GenAI models can diminish your business impact
Companies like OpenAI are strongly motivated to develop multilingual GenAI models, which demonstrates the value of doing so for any business developing its own GenAI models. To put it simply, failing to do so can result in countless missed revenue opportunities.
For one thing, language bias and underrepresentation can hinder monolingual GenAI models’ effectiveness in diverse contexts. This can lead to a lack of resources and support for underrepresented languages, contributing to a digital divide where speakers of those languages have less access to advanced technology and less ability to do business with you.
GenAI models trained mainly on English may inherently carry biases and perspectives dominant in English-speaking cultures. This can limit a model’s understanding of, and sensitivity to, the cultural nuances and contexts of other regions and languages (not to mention increase its likelihood of committing translation errors).
Monolingual GenAI models also limit your ability to produce content that feels authentically relevant and personal across countries. According to Nimdzi, users in different countries are more likely to adopt a product or service if it’s properly personalized in their native language.
None of the world’s four fastest-growing languages is English. To plan for future growth in a multilingual world, you need to cater to multiple languages and customs. That means curating multilingual data to train GenAI models and applications.
Training GenAI models on English translations doesn’t cut it
Why do GenAI models need to be developed in multiple languages? Why not train them in English and translate their output into other languages?
Training large language models (LLMs) on multiple languages, instead of relying solely on translation from English, is important for several reasons. Among those reasons is the importance of respecting each language’s unique cultural nuance and context.
Multilingual GenAI models can better understand and generate text that is culturally relevant and sensitive, capturing subtleties that may be lost in translation.
For example, the use of pronouns and forms of address in English is relatively straightforward. “You” is used universally, regardless of the level of formality or the audience’s age, status, or relationship to the speaker. But Spanish distinguishes between formal and informal address. “Tú” is used for informal, familiar situations, while “Usted” is reserved for formal contexts or when addressing someone older or in a position of authority.
A multilingual GenAI model trained on both English and Spanish can appreciate the importance of formality in Spanish and is less likely to make errors resulting from improper translations. It recognizes when to use “Tú” and “Usted,” a subtlety that doesn’t have a direct parallel in English.
This understanding can make products developed on GenAI, such as chatbots, more useful and relevant to cultures where languages other than English are spoken.
Multilingual GenAI models make you more customer-friendly
Multilingual GenAI models deliver benefits ranging from improving your customer service to improving the global accessibility of your products and services.
Make your business more accessible
One of the primary advantages of multilingual LLMs is their ability to make your content accessible in multiple languages. This increased accessibility can lead to higher conversion rates, customer satisfaction, and market penetration.
Pursuing inclusive AI can also broaden your reach, enabling you to connect with a global audience. As GenAI adoption grows in non-English speaking regions, the ability to interact in local languages becomes a significant competitive edge.
Improve customer service
Multilingual GenAI models enable you to provide customer support in multiple languages, which helps ensure a more personalized and effective customer service experience. This capability is particularly crucial for online platforms that serve a global user base.
Moderate content more effectively
Businesses that host user-generated content can use multilingual GenAI models for efficient content moderation across languages. This technology is crucial for identifying and addressing harmful content, disinformation, and policy violations.
Take three steps to embrace multilingual GenAI
Adding multilingual capabilities to a GenAI model requires access to robust, multilingual datasets. To bridge the gap between single-language and multilingual datasets, consider adopting a frontier AI data foundry platform.
Augment multilingual datasets
Developers need high-quality multilingual datasets to train GenAI models. Efforts should focus on expanding the size and diversity of these datasets to encompass a wider range of languages, domains, and writing styles.
This can be achieved through collaborations between research institutions, language experts, and organizations that possess extensive text data in various languages.
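Before expanding a dataset, it can help to measure how its languages are currently distributed. The snippet below is a minimal sketch of such an audit, not a prescribed method: it assumes each record already carries a language tag in a hypothetical `lang` field, and the `min_share` threshold is an arbitrary illustration rather than a recommended value.

```python
from collections import Counter


def language_shares(records):
    """Return each language's share of the records as a fraction of the total."""
    counts = Counter(record["lang"] for record in records)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def underrepresented(shares, min_share=0.10):
    """List languages whose share falls below the chosen threshold."""
    return sorted(lang for lang, share in shares.items() if share < min_share)


# Toy corpus (illustrative only): 8 English, 1 Spanish, 1 Swahili record.
records = (
    [{"lang": "en", "text": "hello"}] * 8
    + [{"lang": "es", "text": "hola"}]
    + [{"lang": "sw", "text": "habari"}]
)

shares = language_shares(records)
gaps = underrepresented(shares, min_share=0.15)
print(shares)
print(gaps)
```

In this toy corpus, Spanish and Swahili fall below the threshold, which marks them as candidates for additional data collection. A real audit would also break the counts down by domain and writing style.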
Explore new training approaches
Developing newer and more efficient training methods can significantly reduce the computational cost of training multilingual GenAI models. Consider exploring techniques such as transfer learning, domain adaptation, and multi-task learning to enable the training of effective multilingual GenAI models with less computational overhead.
Address data bias
Multilingual GenAI models should be trained on datasets that are carefully curated to minimize bias. Adopt data cleaning techniques, develop bias detection algorithms, and follow ethical guidelines to reduce bias in the resulting models.
It’s also critical to ensure that you adopt the proper mechanisms for human review and accountability, which a frontier AI data foundry platform can provide.
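One way to keep a dominant language from crowding out the others during training is to smooth the sampling distribution across languages. The sketch below illustrates exponent-based smoothing, a technique used in some multilingual training setups. It is offered as one possible curation step under stated assumptions, not as the approach this article prescribes; the function name, the `alpha` value, and the example counts are all hypothetical.

```python
def smoothed_sampling_weights(counts, alpha=0.5):
    """Turn raw per-language counts into smoothed sampling probabilities.

    Each language's share is raised to the power alpha and renormalized.
    alpha=1.0 keeps the original shares; smaller values move the
    distribution closer to uniform, boosting low-resource languages.
    """
    total = sum(counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Illustrative document counts per language (made-up numbers).
counts = {"en": 8100, "es": 1600, "sw": 300}

weights = smoothed_sampling_weights(counts, alpha=0.5)
for lang, weight in weights.items():
    print(f"{lang}: raw share {counts[lang] / 10000:.2f}, sampling weight {weight:.2f}")
```

With these made-up counts, Swahili's sampling weight rises from 3% to roughly 12%, while English drops from 81% to roughly 61%. Smoothing changes how often each language is sampled, not how much unique data exists, so it complements dataset expansion and human review rather than replacing them.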
Enter a more inclusive digital world
Every innovation improving multilingual GenAI is a step toward a future where language is less of a barrier between people and the technology they need—or between your business and its customers. To become more responsive to the needs of all people all over the world, you need multilingual GenAI.