Indian Language Dialect Data Annotation Challenge - 2026

Official Guidelines

Organized by NEXTGENVECTORA DATA INNOVATIONS (OPC) Private Limited

1. Purpose of the Competition

This competition aims to create high-quality annotated datasets for Indian languages and their dialects to support research and development in NLP, Generative AI, and Responsible AI systems.

2. Eligibility

The competition is open to:

Engineering students
Graduates and postgraduates
Researchers and academicians
Software developers and test engineers
Language enthusiasts

3. Important Dates

Registration Opens: 15th February 2026
Registration Closes: 15th March 2026
Competition Period: 15th March 2026 to 15th May 2026

4. Preferred Annotation Languages

Kannada
Marathi
Telugu
Tamil
Bengali

5. Annotation Tasks

Vocabulary Annotation (Lexical-Level Annotation)
Sentence Annotation (Syntactic/Semantic-Level Annotation)
Paragraph Annotation (Discourse-Level Annotation)
Question Answering (NLP Capability)
Text Summarization (NLP Capability)
Text Paraphrase Generation (NLP Capability)
Sentiment Classification (NLP Capability)
Named Entity Recognition – NER (NLP Capability)

6. Annotation Guidelines

Annotations must be linguistically accurate and contextually meaningful
Dialect usage should reflect authentic regional language
Correct script usage is mandatory
Machine-generated or copied content is not allowed
Respectful and neutral language must be maintained
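
As an illustration of the script requirement above, the following is a minimal sketch of how a contributor might check a submission for characters outside the expected script before submitting. It is not an official validation tool. The function name `off_script_chars` and the choice to ignore digits, punctuation, and whitespace are assumptions made for this example; the ranges are the standard Unicode blocks for each script (Marathi is written in Devanagari).

```python
import unicodedata

# Unicode block ranges for the scripts of the five preferred languages.
SCRIPT_RANGES = {
    "Kannada": (0x0C80, 0x0CFF),
    "Marathi": (0x0900, 0x097F),  # Devanagari block
    "Telugu": (0x0C00, 0x0C7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Bengali": (0x0980, 0x09FF),
}


def off_script_chars(text, language):
    """Return the letters and combining marks in `text` that fall outside
    the Unicode block for `language`. Hypothetical helper for illustration;
    digits, punctuation, and whitespace are deliberately ignored."""
    low, high = SCRIPT_RANGES[language]
    offenders = []
    for ch in text:
        is_letter_or_mark = unicodedata.category(ch)[0] in ("L", "M")
        if is_letter_or_mark and not (low <= ord(ch) <= high):
            offenders.append(ch)
    return offenders


if __name__ == "__main__":
    # A Kannada word followed by Latin text: only the Latin letters are flagged.
    print(off_script_chars("ನಮಸ್ಕಾರ hello", "Kannada"))
```

An empty result means every letter belongs to the expected script. A check like this cannot judge dialect authenticity or meaning; it only catches mixed or wrong scripts.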

7. Prohibited Content

Hate speech and abusive or explicit content
Political, religious, or ideological material
Personal or sensitive information
Copyrighted or plagiarized text

8. Data Usage & Ownership

All submitted annotations will become the intellectual property of NEXTGENVECTORA DATA INNOVATIONS (OPC) Private Limited and will be used strictly for research, educational, and AI model development purposes.

9. Evaluation Criteria

Accuracy and correctness
Dialect authenticity
Consistency of annotations
Quality and volume of contributions
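
These guidelines do not state how consistency will be measured. Purely as an illustration, one widely used measure of agreement between two annotators who label the same items is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below assumes categorical labels (for example, sentiment classes); the function name `cohen_kappa` and the sample labels are hypothetical and do not describe the organizers' scoring method.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.
    Illustrative only; not the competition's official metric."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(annotator_1, annotator_2))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance.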

10. Benefits of Participation

Hands-on experience in NLP data annotation
Exposure to real-world AI dataset creation
Understanding of dialect-based language modeling
Industry-relevant project experience

11. Certification & Recognition

A 2-month internship letter recognizing each participant's contribution to the data annotation project
Participation Certificate for all eligible contributors who complete the required annotations
Merit Certificates for participants demonstrating high accuracy and consistency
First Prize - ₹25,000 awarded to the top overall contributor
Second Prize - ₹15,000 awarded to the second-highest performing contributor
Third Prize - ₹10,000 awarded to the third-highest performing contributor
Opportunities for future collaboration and internships with NEXTGENVECTORA DATA INNOVATIONS (OPC) Private Limited

12. Ethical Commitment

This initiative promotes linguistic diversity, responsible AI development, and ethical data collection practices.

13. Disclaimer

NEXTGENVECTORA DATA INNOVATIONS (OPC) Private Limited reserves the right to modify these guidelines at any time. Any violation of these guidelines may result in suspension or termination of the participant's access.

14. Sponsorships & Collaborations

NEXTGENVECTORA DATA INNOVATIONS (OPC) Private Limited is happy to collaborate with, and welcomes sponsorship from, colleges, universities, research institutions, and industries interested in advancing Indian language technologies, dialect preservation, and responsible AI development.

Highlights for Sponsors:

Visibility and branding across the official competition platform and communication channels
Recognition in certificates, reports, and official acknowledgments
Opportunities for academic–industry collaboration
Early access to high-quality annotated datasets (as per data usage policies)
Engagement with skilled student contributors and language experts
Opportunity to support a national initiative promoting linguistic diversity and ethical AI
