The linguistic diversity gap in AI was a topic of discussion at the ‘AI for Global Good’ session at the Sustainable Development Impact Meetings 2024. | Image: World Economic Forum
The ‘missed opportunity’ with AI’s linguistic diversity gap
Pooja Chhabria
Digital Editor, World Economic Forum
Chris Hamill-Stewart
Writer, Forum Agenda
- The linguistic diversity gap in AI threatens to exclude billions from the digital economy, with most current systems trained on only 100 of the world’s 7,000+ languages.
- Emerging initiatives are showcasing the potential of linguistically diverse AI to drive innovation and inclusion.
- The future of AI must be diverse, leaders emphasised at the Sustainable Development Impact Meetings in New York this week.
By 2050, more than a third of the world’s youth will be living in Africa. But will they have equal access to participate in the digital economy?
Currently, of the top 34 languages used on the internet globally, not one is African. Advances in areas such as natural language processing (NLP), large language models (LLMs), and artificial intelligence (AI) research also continue without adequate representation of African languages.
“It’s both a challenge and also one of the greatest opportunities,” says Crystal Rugege, Managing Director of the Centre for the Fourth Industrial Revolution Rwanda, while referring to the rich linguistic diversity in the continent and the current inability of AI systems to serve this diversity. “We may not have applications that can interact in 1,400 dialects, but certainly, we should be able to serve the majority of our populations. This market can also become the world’s digital workforce, and we should create an enabling environment.”
This linguistic divide in AI isn’t just an African issue – it’s a global challenge with far-reaching implications.
Linguistic divide in AI – a deepening problem?
There are more than 7,000 languages in the world, yet most AI chatbots are trained on around 100 of them. English is considered the first language of AI, which is unsurprising given that there is simply more English-language data available online to scrape and train models on.
There are signs this trend of linguistic concentration around English in AI – despite less than 20% of the global population speaking the language – is deepening: some generative AI models trained to respond to prompts in other languages now “think” in English. In contrast to this ‘high-resource’ language, many ‘low-resource’ or underserved languages are lagging behind due to a lack of the quality data sets, tools and techniques that underpin these AI systems.
But the challenge of linguistic diversity in AI is not just a technical problem – it’s an opportunity to reshape the digital landscape, drive economic growth, and ensure that the benefits of AI are truly global.
Left unchecked, this divide would mean that groups and nations already struggling to take advantage of current AI systems – and facing additional challenges of inadequate access to internet services, limited computing power and limited access to sectoral training – ‘will probably fall further behind’, as Cathy Li, Head of AI, Data and Metaverse at the World Economic Forum, points out.
Early efforts to tackle the issue
There are emerging use cases globally, from India and North America to African countries, that demonstrate the value of investing in AI’s ability to work in diverse languages.
In Rwanda, for example, linguistically diverse AI is enabling community health workers to provide services across language divides. Crystal Rugege says the country has about 70,000 such frontline workers, most of whom do not work in English, and whose role often includes discerning whether people need more critical care.
“We built a translation model that’s both voice- and text-based, so they can interact with it and be able to discern if someone has a headache; if someone has a cough.” Using OpenAI’s GPT-4o, she explains, they’ve reached 71% accuracy in trials of interactions with patients. That means more people are treated for their illnesses because linguistic diversity was a feature of this AI application, not an afterthought.
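The article does not detail the system’s internals, but a minimal sketch of the text side of such a triage translator, built on OpenAI’s GPT-4o API, might look like the following. The prompt wording, function name and language handling are illustrative assumptions, not details of the Rwandan deployment:

```python
# A hypothetical sketch of the text side of a triage translator built on
# OpenAI's GPT-4o; prompt and function names are illustrative assumptions,
# not details of the Rwandan system described in the article.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def triage_translate(patient_text: str) -> str:
    """Translate a patient's Kinyarwanda description into English and
    flag whether the symptoms may need more critical care."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You assist Rwandan community health workers. Translate "
                    "the patient's Kinyarwanda message into English, then "
                    "say whether it suggests routine or urgent care."
                ),
            },
            {"role": "user", "content": patient_text},
        ],
    )
    return response.choices[0].message.content

# The voice side would add a speech-to-text step before this call.
```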
But an equally important consideration, she says, is having the proper guardrails to make sure people’s rights are protected and the technology is being used responsibly. “Data is the oxygen of AI… [ensuring] people have agency to make decisions over how their data is used is a fundamental principle that must be embedded, but beyond that, also making sure that the policies and laws are put in place to stimulate innovation.”
Open-source AI and partnerships offer solutions
Speaking at the World Economic Forum’s Sustainable Development Impact Meetings (SDIM) in New York, Yann LeCun, Meta’s Vice President and Chief AI Scientist, pointed to digital healthcare measures emerging in Senegal as another example.
“It’s difficult to get an appointment with a doctor in Senegal, particularly if you are in rural areas,” he says. But AI-powered platforms like Kera Health allow people to “now talk to an AI assistant for this. But it has to speak Wolof, in addition to French, and three other official languages of Senegal.”
There are two main drivers to achieve more progress, LeCun says. First is open-source AI – “What we need is a very simple open infrastructure – think of it as a ‘Wikipedia for AI’ – so you give people the ability to build the systems that are useful for local populations.”
Second is partnerships that can drive the change. “For example, there is a partnership between Meta and the government of India so that future versions of the open-source LLM from Meta (called LLaMA) can speak at least all 22 official languages of India and, perhaps, all the hundreds of local languages and dialects.”
He also sees the opportunity translating into the physical space, where devices like eyeglasses will provide simultaneous translation between speakers of two different languages. “The future of hardware will be things like smart glasses … which enables interaction between people in their own languages,” he says.
“We’re starting to have systems that can translate non-written languages … so directly for speech-to-speech, we can do text-to-text, text-to-speech, speech-to-text, and speech-to-speech, including for languages that are not written — of which there are many.”
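Meta’s openly released SeamlessM4T models are one concrete example of this direct speech-to-speech capability. A minimal sketch using the Hugging Face transformers library might look like the following; the checkpoint, file names and language codes here are assumptions for illustration:

```python
# A sketch of direct speech-to-speech translation with an open SeamlessM4T
# checkpoint via Hugging Face transformers; file names and language codes
# are illustrative assumptions.
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Load the source utterance and resample to the 16 kHz the model expects.
audio, rate = torchaudio.load("source_utterance.wav")
audio = torchaudio.functional.resample(audio, orig_freq=rate, new_freq=16_000)

# Generate target-language speech directly from the source speech.
inputs = processor(audios=audio, return_tensors="pt")
speech = model.generate(**inputs, tgt_lang="fra")[0].cpu().squeeze()

torchaudio.save("translated_utterance.wav", speech.unsqueeze(0), 16_000)
```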
Pascale Fung, a researcher focusing on improving natural language processing for low-resourced or underrepresented languages, says we should aim to build systems that will facilitate communication between low-resource and high-resource language communities. “For large language models, it means collecting additional data in a low-resource language to fine-tune the models so that they perform at the same level as English models.”
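In practice, the approach Fung describes often takes the form of a parameter-efficient fine-tune of an open multilingual model on a newly collected corpus. The sketch below uses Hugging Face transformers with a LoRA adapter; the base model, corpus file and hyperparameters are all illustrative assumptions rather than details from the article:

```python
# A hypothetical sketch of the fine-tuning Fung describes: adapting an open
# multilingual LLM with a small, newly collected low-resource-language corpus.
# Base model, corpus file and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-3.1-8B"  # any open multilingual base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# A LoRA adapter keeps the adaptation cheap, which matters where compute is scarce.
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16,
                                         lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))

# "low_resource_corpus.txt" stands in for whatever text the community collects.
dataset = load_dataset("text", data_files={"train": "low_resource_corpus.txt"})
dataset = dataset["train"].map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="low-resource-adapter",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```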
Towards a ‘diverse future’
Efforts are underway to facilitate the smooth exchange of data, including linguistic data. The European Commission’s Alliance for Language Technologies (ALT-EDIC), for example, will contribute to addressing the shortage of European language data for the training of AI solutions and support the development of European large language models.
Other nations are acting too: the United Arab Emirates (UAE) has ‘produced and exported’ new large language models (LLMs) like NANDA, which specifically caters to Hindi-speaking users, while making a concerted global push for its open-source LLM ‘Falcon’. “One thing that we are doing as well is working across different geographies trying to see how we can customise Falcon to cater to the needs of these governments that do not have the ability to build their own large model,” says Omar Sultan Al Olama, UAE’s Minister of State for Artificial Intelligence, Digital Economy and Remote Work Applications.
The World Economic Forum’s AI Governance Alliance gathers diverse stakeholders and is crucial in building a more equitable and responsible AI ecosystem globally. The Inclusive AI workstream, in particular, prioritises inclusive AI development that respects and considers the needs of all people. It’s also developing a framework for public-private cooperation alongside highlighting and promoting AI applications that support people and planet goals.
The future needs to be diverse, emphasises Meta’s LeCun. “For the same reason that we need access to a wide diversity of sources of information, from the press to social media, we also need a high diversity of AI systems to cater to all our diverse interests, cultural norms, value systems and languages.”