South African artificial intelligence (AI) research and product lab Lelapa AI has developed a large language model (LLM) that initially supports five African languages.
The InkubaLM natural language processing (NLP) model is designed to support and enhance the low-resource African languages Swahili, Yoruba, isiXhosa, Hausa, and isiZulu, which together have about 364 million speakers.
Lelapa AI describes InkubaLM (Dung Beetle Language Model) as a small but powerful model that does not require a lot of resources and is designed to support African communities. It is accompanied by two datasets, Inkuba-Mono and Inkuba-Instruct.
Inkuba-Mono is a monolingual dataset used to pre-train the InkubaLM models. It was gathered from open source repositories in the five African languages, supplemented with English and French data.
According to the startup, its researchers gathered publicly available datasets in the five African languages from GitHub, Hugging Face, and Zenodo repositories, and trained the InkubaLM models on 1.9 billion tokens of data after pre-processing.
Lelapa AI notes that Inkuba-Instruct currently offers tools for natural language processing, transcription, and translation.
The instruction dataset focuses on six tasks: machine translation, sentiment analysis, named entity recognition, part-of-speech tagging, question answering, and news topic classification. According to Lelapa AI, each task covers the five African languages: Hausa, Swahili, isiZulu, Yoruba, and isiXhosa.
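To make the structure of such an instruction dataset concrete, the sketch below shows one way a record pairing a task and a language could be rendered as a training prompt. It is an illustration only: the task and language lists come from the description above, while the `InstructionExample` fields, the prompt template, and the placeholder text are hypothetical and are not taken from the actual Inkuba-Instruct schema.

```python
from dataclasses import dataclass

# Task and language lists follow the article's description.
# Field names and the prompt template below are hypothetical.
TASKS = [
    "machine translation",
    "sentiment analysis",
    "named entity recognition",
    "part-of-speech tagging",
    "question answering",
    "news topic classification",
]
LANGUAGES = ["Hausa", "Swahili", "isiZulu", "Yoruba", "isiXhosa"]


@dataclass
class InstructionExample:
    """One instruction-tuning record (hypothetical layout)."""
    task: str
    language: str
    instruction: str
    input_text: str
    target: str


def format_prompt(example: InstructionExample) -> str:
    """Render a record as a single prompt/response training string."""
    return (
        f"### Instruction ({example.task}, {example.language}):\n"
        f"{example.instruction}\n"
        f"### Input:\n{example.input_text}\n"
        f"### Response:\n{example.target}"
    )


def task_language_pairs() -> list[tuple[str, str]]:
    """Enumerate every task/language combination described in the article."""
    return [(task, lang) for task in TASKS for lang in LANGUAGES]


if __name__ == "__main__":
    # Placeholder text stands in for real data, which is not reproduced here.
    demo = InstructionExample(
        task="sentiment analysis",
        language="Swahili",
        instruction="Classify the sentiment of the text.",
        input_text="<Swahili sentence>",
        target="positive",
    )
    print(format_prompt(demo))
    print(len(task_language_pairs()), "task/language combinations")
```

Six tasks across five languages give 30 task/language combinations, which is why a single shared record layout is a convenient assumption for a sketch like this.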
Lelapa AI credits Microsoft’s AI For Good Lab, stating that researchers there succeeded in securing compute credits for training the InkubaLM model.
“As AI professionals, we are dedicated to leveraging AI’s potential to create a future that is inclusive,” says Pelonomi Moiloa, Lelapa AI’s CEO and co-founder. “No one should have to assimilate to a culture outside of their own in order to access cutting-edge technology.”
“The challenge lies in the resources required for large models, which are often out of reach for the majority of the world, even though AI holds the promise of global prosperity.” Although open source models have made an effort to close this gap, much more can be done to make models locally relevant, affordable, and accessible.
Calls to remove some of the obstacles to the development of these AI tools for African audiences have grown since the emergence of natural language processing tools such as ChatGPT.
These calls are driven by the ethical dilemmas raised by AI’s rapid development. They include notable cases of algorithmic bias, invasions of privacy and, more recently, language, amid concerns that the 1.2 billion people living in Africa may be largely left behind by the rapid growth of AI.
With up to 3,000 languages spoken, the African continent is a multilingual melting pot. South Africa, for instance, has 12 official languages, and just 10% of South Africans speak English at home, even though it is the language of the internet.
Arabic, French Creole, Shona, Swahili, and Swati are just a few of the languages spoken throughout the rest of the continent.
According to Lelapa AI, the Inkuba release aims to improve language model capabilities for African languages through two main initiatives: “First, InkubaLM is introduced as a new model that can be further trained and developed to improve functionality in a variety of tasks for the languages in question; second, the Inkuba datasets are available to enhance the performance of existing models.” Since conventional large language models perform poorly with these languages, Inkuba gives NLP practitioners useful options to achieve robust functionality for the five targeted languages.
“The Inkuba-Mono dataset can be utilised to train language models to perform tasks that require monolingual datasets. The Inkuba-Instruct dataset can be utilised to instruct and fine-tune any language model for the five African languages of interest,” the company says.
Lelapa AI was founded in 2022 to explore how AI can be used for solutions and applications through an African lens. The company develops speech recognition tools for African languages, and its founding members include Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman, Pravesh Ranchod, and George Konidaris.