While the global excitement around AI models like ChatGPT continues to grow, there are significant challenges in implementing these models in non-English languages, particularly in the Indian context.
One fundamental process in natural language processing is tokenisation: breaking text into smaller units, or tokens, that an AI model can process. However, the number of tokens needed to represent the same content varies widely across languages. Because popular tokenisers are trained predominantly on English text, languages like Hindi, Kannada, and Telugu, with their complex scripts, are split into significantly more tokens than an equivalent English passage.
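One rough way to see why Indic scripts are penalised: byte-level tokenisers of the kind used by GPT models start from UTF-8 bytes, and Devanagari characters each occupy three bytes, so text in an under-represented script tends to fragment into many more tokens. The sketch below uses only Python's standard library as a proxy for this effect; the sample sentences are illustrative, not taken from any benchmark.

```python
# Proxy illustration: byte-level BPE tokenisers begin from UTF-8 bytes,
# so a script's byte footprint bounds its worst-case token count when
# the tokeniser has learned few merges for that script.
samples = {
    "English": "Hello, how are you?",
    "Hindi": "नमस्ते, आप कैसे हैं?",
}

for lang, text in samples.items():
    chars = len(text)                      # number of Unicode code points
    utf8_bytes = len(text.encode("utf-8"))  # bytes a byte-level BPE starts from
    print(f"{lang}: {chars} characters -> {utf8_bytes} UTF-8 bytes")
```

The English sentence uses one byte per character, while each Devanagari character takes three, so the Hindi sentence starts from far more raw units before any merging happens.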
The cost implications of tokenisation disparities are substantial. Each token comes with a price tag, and generating content in languages like Hindi or Kannada can be much more expensive compared to English. For example, generating an article using the ‘Ada’ model in English may cost around $1.2, while the same article in Hindi would cost approximately $8, and in Kannada, an astonishing $14.5.
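Taking the article's own figures, a quick calculation shows how steep the multiplier is: Hindi works out to roughly 6.7 times the English cost, and Kannada to about 12 times.

```python
# Cost figures quoted above (USD per article, 'Ada' model).
costs = {"English": 1.2, "Hindi": 8.0, "Kannada": 14.5}

baseline = costs["English"]
for lang, cost in costs.items():
    ratio = cost / baseline
    print(f"{lang}: ${cost:.2f} (~{ratio:.1f}x the English cost)")
```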
These inflated costs present a serious obstacle to developing AI models for non-English languages. Training a model like GPT-3 in Hindi alone could potentially cost around $32 million, a staggering figure compared with the cost of training it in English.
To address these challenges and bridge the cost gap, collaborations between government initiatives, nonprofits, and big tech companies are emerging. Nonprofit organizations like Karya are dedicated to accelerating social mobility in India through AI training and upskilling. Their ‘Labely’ tool simplifies transcription and annotations, enabling rural talent to contribute to the linguistic landscape in India.
Big tech companies like OpenAI and Microsoft are also playing a crucial role. OpenAI launched its ChatGPT Android app in India to gather user feedback and refine AI responses for improved contextual relevance and cultural sensitivity. This gives ChatGPT a path to becoming a widely used app in India, potentially shifting usage away from services like Google Search.
Microsoft’s Project ELLORA focuses on preserving and empowering endangered languages in India. By developing an open-source framework and tools like Interactive Neural Machine Translation (INMT), Microsoft aims to provide language communities with the resources to develop their own language technologies.
The collaboration between OpenAI’s data collection efforts and Microsoft’s Project ELLORA has the potential to significantly contribute to linguistic diversity and accessibility in India. Incorporating an extensive corpus of Indic languages into AI models like ChatGPT can enrich the dataset and promote a more inclusive AI ecosystem.
As the landscape of AI and language models evolves, it is crucial to address the hidden costs of language diversity and ensure that the benefits of technology are accessible to all, regardless of the language they speak.
1. What is tokenisation in natural language processing?
Tokenisation is a process in natural language processing that involves breaking down language into smaller units or tokens to enhance comprehension by AI models.
2. Why do different languages require varying numbers of tokens?
Different languages require varying numbers of tokens because of their scripts and structures. Languages whose scripts are under-represented in a tokeniser's training data get split into more, smaller tokens for the same amount of content.
3. How do tokenisation disparities impact the cost of AI models?
Tokenisation disparities directly influence the cost of training and using AI models. Languages that require more tokens for the same input result in higher costs.
4. How are government initiatives and nonprofits helping address the challenges of AI accessibility in Indian languages?
Initiatives like the Government of India’s Bhashini and nonprofits like Karya are dedicated to accelerating social mobility in India through AI training and upskilling. They provide resources and tools to bridge the gap in AI accessibility.
5. What is Microsoft’s Project ELLORA?
Microsoft’s Project ELLORA focuses on preserving and empowering endangered languages in India. It aims to provide language communities with tools and resources to develop their own language technologies.
– [Microsoft Research](https://www.microsoft.com/en-us/research/)