New AI model brings 11 South African languages online

MzansiLM offers foundation for developers to build local language applications

The UCT researchers behind MzansiLM. From left: Simbarashe Mawere, Anri Lombard, Dr Jan Buys and Dr Francois Meyer. (Supplied)

A new artificial intelligence (AI) model developed in South Africa could help millions of people use digital tools in their home languages, tackling a gap that has long left many behind, local academics say.

A team of researchers at the University of Cape Town (UCT) has developed a language model trained across all 11 of the country’s official written languages, a first of its kind in South Africa.

The research will be presented at the Language Resources and Evaluation Conference 2026 in Mallorca, Spain, in May, highlighting the country’s growing role in global AI development.

At the centre of the project are two tools:

  • MzansiText, a multilingual dataset; and
  • MzansiLM, a language model trained from scratch.

The work was led by Anri Lombard and Dr Jan Buys, together with Dr Francois Meyer and a broader team of collaborators.

The study points to a major issue in the rise of AI: language inequality. While AI tools are becoming part of everyday life, they often work best in English and a few other widely used languages.

In South Africa, that excludes many people. The research shows that only about 8.7% of South Africans speak English at home, underlining the need for tools that work across local languages.

Researchers say the problem is largely due to limited data.

“In language modelling, languages are considered low resource, primarily because there are much fewer and smaller textual datasets available in these languages for training language models,” said Buys.

He added that while MzansiText is still small compared with global datasets, it is “larger than previous datasets for South African languages”.

Nine of South Africa’s 11 official written languages fall into this “low-resource” category. While languages like isiZulu and isiXhosa have received some attention, others, including isiNdebele and Sepedi, have largely been overlooked.

MzansiLM aims to change that. UCT said it is believed to be the first publicly available decoder-only model designed to support all 11 official written languages in one system.

“There has been real progress in language modelling for African languages,” said Meyer, “but most existing models only cover a subset of languages.”

He said the team’s goal was to build a single model focused specifically on South Africa that includes all official languages, especially those often left out.

The model itself is relatively small, with 125-million parameters, but the study shows it can still deliver strong results.

According to the research, it performed well in several tasks and, in some cases, matched or outperformed models more than 10 times its size. On isiXhosa text generation, for example, it achieved a BLEU score of 20.65, competing with much larger systems.

For Lombard, the project began during his master’s research into how language models perform in low-resource settings.

“I came into this work through my master’s research, which looks at how different language-model architectures perform for low-resource languages,” he said.

He noted that most available models only supported a few South African languages, adding that “MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on.”

The study also found that the model works best when adapted for specific tasks, rather than for general use.

“Our findings show that the model can work well when fine-tuned for specific tasks,” said Buys, but added that it “is not yet able to work well for general-purpose user interaction or instruction following, due to the limited training data”.

This, he said, helps explain why even large AI systems still struggle outside of English.

The researchers stressed that MzansiLM is not a chatbot but a foundation that developers can build on. “In practice, that means developers could build tools for specific use cases, for example, summarising information or annotating raw data, in South African languages,” Meyer said.

He added that adapting such a model for focused tasks could be more effective and affordable than using large commercial systems.

The team said the project is only a starting point.

“A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community,” said Lombard, adding that continuing that openness is essential.

Meyer agreed, saying the research community has a key role to play. “That kind of openness is often what leads to progress,” he said, especially compared with systems where data and methods are not shared.

TimesLIVE
