Shankar Maruwada of the EkStep Foundation, the nonprofit that helped build the chatbot, said the bot works by chaining two kinds of language models: users submit queries in their native language, and these are passed to machine-translation software at an Indian research facility, which translates them into English before forwarding them to an LLM that composes a response. The answer is then translated back into the user's native language.
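The translate-then-answer pipeline described above can be sketched in a few lines. The function names and the toy lookup tables below are illustrative stand-ins, not EkStep's actual services; a real system would call hosted translation and LLM APIs at each step.

```python
def translate(text: str, source: str, target: str) -> str:
    """Stand-in for the machine-translation service. A real system would
    call a translation model; here a toy lookup table makes the sketch run."""
    table = {
        ("hi", "en"): {"फसल बीमा क्या है?": "What is crop insurance?"},
        ("en", "hi"): {"It protects farmers against crop loss.":
                       "यह किसानों को फसल हानि से बचाता है।"},
    }
    return table[(source, target)].get(text, text)

def ask_llm(prompt_en: str) -> str:
    """Stand-in for the English-language LLM call."""
    return "It protects farmers against crop loss."

def answer_in_native_language(query: str, lang: str) -> str:
    query_en = translate(query, lang, "en")   # 1. native language -> English
    answer_en = ask_llm(query_en)             # 2. LLM answers in English
    return translate(answer_en, "en", lang)   # 3. English -> native language

print(answer_in_native_language("फसल बीमा क्या है?", "hi"))
```

The design is simple but, as the article notes, clumsy: every query makes two translation hops, and any nuance lost in translation never reaches the model.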
This procedure may work, but translating queries into an LLM's "preferred" language is a clumsy workaround. Language is a vehicle for culture and worldview. A 2022 paper by Rebecca Johnson, a researcher at the University of Sydney, found that GPT-3 gave answers on topics such as gun control and refugee policy that aligned with American values as expressed in the World Values Survey.
Many researchers are therefore trying to make LLMs fluent in less widely used languages. One technical approach is to adapt the tokenizer to the language. An Indian startup called Sarvam AI has built OpenHathi, an LLM with a tokenizer optimized for Devanagari, the script used to write Hindi, which can significantly cut the cost of answering questions.
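Why a script-aware tokenizer cuts costs can be shown with a toy comparison. LLM usage is typically priced per token; when a tokenizer's vocabulary lacks Devanagari, it falls back to byte-level pieces, so each Hindi character becomes several tokens. The vocabularies below are toy examples, not Sarvam AI's actual tokenizer.

```python
text = "नमस्ते"  # "namaste": 6 Devanagari code points, 18 bytes in UTF-8

# Byte-level fallback: every UTF-8 byte becomes its own token.
byte_tokens = list(text.encode("utf-8"))
print(len(byte_tokens))  # 18 tokens for a single common word

# Script-aware vocabulary: common words and syllables are single tokens.
vocab = {"नमस्ते": 0}
word_tokens = [vocab[text]]
print(len(word_tokens))  # 1 token for the same word
```

An 18-to-1 gap is the extreme case, but the direction holds: fewer tokens per query means lower cost and more room in the model's context window.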
Another approach is to improve the datasets on which LLMs are trained. In November, a team of researchers at Mohamed bin Zayed University in Abu Dhabi released the latest version of its Arabic-language model, called "Jais". It has one-sixth as many parameters as GPT-3, yet delivers comparable performance in Arabic.
Timothy Baldwin of Mohamed bin Zayed University noted that, although his team digitized a large amount of Arabic text, the model's training data still included some English: some concepts are the same in every language and can be learned in any of them.
A third approach is to fine-tune models after they are trained. Both Jais and OpenHathi were refined with sets of human-written question-and-answer pairs, just as Western chatbots are, to keep them from spreading misinformation.
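Such question-and-answer pairs are typically stored as simple prompt/response records. A minimal sketch of that format follows; the field names and the example pair are illustrative, not taken from the actual Jais or OpenHathi datasets.

```python
import json

# One human-written question-answer pair, here in Arabic (illustrative).
qa_pairs = [
    {"prompt": "ما هي عاصمة الإمارات؟",         # "What is the capital of the UAE?"
     "response": "عاصمة الإمارات هي أبوظبي."},   # "The capital of the UAE is Abu Dhabi."
]

# Serialize as JSON Lines (one record per line), a common on-disk format
# for instruction-tuning datasets.
with open("qa_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

During fine-tuning, the model is trained to produce each `response` when shown its `prompt`, which is also why, as the next paragraph notes, the quality of the tuning hinges on who writes and rates these records.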
Ernie Bot, the LLM of Baidu, a major Chinese technology company, has been tuned to limit content that could offend the government. Models can also learn from human feedback, with users rating the LLM's responses. But that is hard to do in many languages of underdeveloped regions, because it requires hiring people qualified to critique machine-generated answers.
(According to The Economist)