AI-Assisted Annotation: scaling human-led prompt generation across 13 languages using LLM-as-a-judge
Multilingual instruction tuning and safety alignment require high-quality, diverse training data. But collecting that data, especially for low-resource languages and sensitive topics like privacy or misuse, remains slow, expensive, and inconsistent.
In our VLDB 2025 paper, we introduced AI-assisted annotation, a scalable human-in-the-loop framework that combines prompt writers, quality assurance based on large language models (LLMs), and synthetic personally identifiable information (PII) tools to accelerate multilingual prompt generation. By using LLMs not just as generators but as judges, we reduced annotation time by more than 40% without sacrificing coverage, quality, or instruction compliance.
We deployed AI-assisted annotation to create more than 40,000 high-quality prompts across 13 languages, covering tasks like personal data redaction, safety alignment, and misuse detection, all critical areas for LLM safety evaluation.
The challenge: scaling prompt writing for low-resource locales
Instruction tuning and safety alignment depend on training data that reflects realistic, nuanced user inputs. For high-stakes applications like identifying harmful intent or protecting personal data, researchers often rely on human-written prompts to capture edge cases and ambiguity. But this manual work doesn’t scale, especially when:
Prompts must be written in underrepresented languages
They must include synthetic but realistic PII
They must follow complex, high-precision instructions
They must avoid repetition while maintaining natural variation
Manual workflows struggle to meet all these constraints at scale, making it difficult to build multilingual safety datasets that are both broad and reliable.
The framework: AI-augmented prompt generation with LLM-as-a-judge
AI-assisted annotation introduces a modular, AI-supported pipeline to improve throughput, quality, and diversity in human-authored prompts. The system includes:
Synthetic PII suggestion: we recommend structured, locale-specific PII, such as Aadhaar numbers or health IDs, using a tool that draws from curated templates and fake data generators (a minimal sketch follows this list).
LLM-as-a-judge for instruction following: instead of using LLMs to write prompts, we use them to evaluate and score human-written ones. Judges assess instruction compliance (e.g., word count, structure, content type) and provide instant feedback to writers (sketched below).
Similarity checking: we apply vector-based checks to flag lexically or semantically repetitive prompts, enforcing diversity across submissions (also sketched below).
Maintaining data diversity: because LLM training requires broad coverage and diversity of input data, authors receive structured guidance on domain, PII-type combinations, and text length.
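The paper describes the PII tool only at a high level. As a rough illustration, here is a minimal Python sketch of template-driven synthetic PII generation, assuming a hand-rolled table where "#" marks a random digit; the locale codes, PII types, and format strings are our own illustrative assumptions, not the production template library.

```python
import random

# Hypothetical curated templates: each locale maps PII types to a
# format string in which "#" is replaced by a random digit. A real
# deployment would draw on vetted, locale-specific template libraries
# and fake data generators rather than this toy table.
PII_TEMPLATES = {
    "hi-IN": {"aadhaar": "#### #### ####"},
    "pt-BR": {"cpf": "###.###.###-##"},
    "sv-SE": {"personnummer": "######-####"},
}

def suggest_pii(locale: str, pii_type: str) -> str:
    """Fill a locale-specific template with random digits."""
    template = PII_TEMPLATES[locale][pii_type]
    return "".join(
        str(random.randint(0, 9)) if ch == "#" else ch for ch in template
    )

# Example: a writer drafting a Hindi redaction prompt requests a
# synthetic Aadhaar number to embed in the text.
print(suggest_pii("hi-IN", "aadhaar"))  # e.g. "4821 0937 5564"
```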
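And here is a sketch of the similarity gate, assuming a multilingual sentence encoder from the sentence-transformers library; the model name and the 0.85 threshold are our assumptions, since the paper specifies only that the checks are vector-based.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption; any multilingual sentence encoder
# would fit here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def is_too_similar(candidate: str, accepted: list[str],
                   threshold: float = 0.85) -> bool:
    """Flag a new prompt if its cosine similarity to any previously
    accepted prompt exceeds the threshold."""
    if not accepted:
        return False
    cand_emb = model.encode(candidate, convert_to_tensor=True)
    pool_emb = model.encode(accepted, convert_to_tensor=True)
    scores = util.cos_sim(cand_emb, pool_emb)  # shape: (1, len(accepted))
    return bool(scores.max() >= threshold)

accepted = ["Summarize this invoice and redact the account number."]
print(is_too_similar(
    "Summarize the invoice below, hiding the account number.", accepted))
```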
Unlike traditional pipelines, AI-assisted annotation delivers immediate feedback loops, reducing rework and improving consistency across annotators and languages.
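As a rough illustration of such a feedback loop, the sketch below asks an LLM judge to score a submission against its instructions and return structured feedback. It assumes an OpenAI-compatible client; the model choice and rubric fields are illustrative, loosely following the compliance criteria named above (word count, structure, content type).

```python
import json
from openai import OpenAI  # any OpenAI-compatible client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are a strict annotation judge. Given task instructions and a
human-written prompt, return JSON with integer fields "word_count_ok",
"structure_ok", "content_type_ok" (each 0 or 1) and a short "feedback" string."""

def judge_prompt(instructions: str, prompt: str) -> dict:
    """Score a human-written prompt against its instructions and return
    structured feedback the writer sees immediately."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the paper does not name a judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Instructions:\n{instructions}\n\nPrompt:\n{prompt}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

verdict = judge_prompt(
    "Write a 40-60 word Hindi request that embeds one synthetic Aadhaar number.",
    "कृपया इस दस्तावेज़ का सारांश दें ...",  # placeholder Hindi submission
)
print(verdict["feedback"])
```

Returning JSON keeps the verdict machine-readable, so an annotation UI can surface pass/fail flags and feedback to the writer before submission rather than after a review cycle.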
Deployment at scale: 13 languages, 40,000+ prompts
We deployed AI-assisted annotation with Centific’s professional annotation teams to generate multilingual prompts across 13 locales: Arabic (UAE), Finnish (Finland), Hindi (India), Norwegian (Norway), Dutch (Belgium), Dutch (Netherlands), Polish (Poland), Portuguese (Brazil), Portuguese (Portugal), Swedish (Sweden), Vietnamese (Vietnam), Chinese (China), and Chinese (Singapore).
Each prompt was written by a human, guided by tools, and judged by an LLM for accuracy and coverage. Prompts were designed to vary along several dimensions:
Domain (e.g., IT, CPG, Healthcare, Finance)
Requests containing region-specific PII (e.g., local ID numbers, bank account formats)
Malicious or ambiguous intent (e.g., phishing, scams)
Natural language diversity (e.g., tone, phrasing, cultural context)
Task category (summarization, rewrite, translation, chain-of-thought)
Tasks included redaction, classification, and misuse detection, aligned with the needs of LLM safety benchmarks.
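One way to keep coverage balanced across these dimensions is to hand each writer a structured assignment spec. Here is a minimal sketch, assuming a simple dataclass; the class and its field names are our own illustration, not the paper’s schema.

```python
from dataclasses import dataclass

@dataclass
class PromptSpec:
    """One writing assignment; fields mirror the coverage dimensions above."""
    locale: str                  # e.g. "pt-BR"
    domain: str                  # e.g. "Finance"
    pii_types: list[str]         # e.g. ["cpf", "bank_account"]
    intent: str                  # "benign", "ambiguous", or "malicious"
    task: str                    # e.g. "summarization", "rewrite", "translation"
    word_range: tuple[int, int] = (40, 80)

spec = PromptSpec(
    locale="pt-BR",
    domain="Finance",
    pii_types=["cpf"],
    intent="malicious",  # e.g. a phishing-style request the model should refuse
    task="rewrite",
)
```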
Results: 40% faster, with no drop in quality

We ran a comparative study of AI-assisted annotation against a traditional human-only workflow. Key results:
40% reduction in average handling time per prompt, driven by fewer revision cycles and better real-time feedback.
Significant gains in instruction compliance, especially word count, PII formatting, and language specificity.
Higher lexical and structural diversity, enforced through AI-based repetition checks.
These gains did not come at the cost of quality: judged correctness, completeness, and naturalness of prompts were statistically equivalent across workflows.
Why it matters
AI-assisted annotation addresses a known bottleneck in LLM alignment: scalable, high-quality prompt generation in underrepresented languages. The framework:
Improves multilingual coverage for red-teaming, safety tuning, and PII detection
Reduces annotation cost without losing linguistic or structural diversity
Provides a reusable pipeline that supports multilingual model evaluation and instruction tuning
By decoupling prompt generation from manual review and using LLMs as real-time QA agents, we improve both speed and safety in data creation.
What’s next
We are expanding AI-assisted annotation to support longer tasks, multi-turn dialogue, and complex chain-of-thought reasoning prompts. Future work includes evaluating LLM-as-a-judge alignment with human quality scores and extending the framework to unseen languages and scripts.
Dr. Abhishek Mukherji is an accomplished AI thought leader with over 18 years of experience in driving business innovation through AI and data technologies. He has developed impactful AI applications for Fortune 100 clients across sectors including high-tech, finance, utilities, and more, showcasing expertise in deploying machine learning (ML), natural language processing, and other AI technologies. In his prior roles, he shaped GenAI and responsible AI product strategy for Accenture, using large language models to transform business processes. He has also worked to advance ML technologies across wireless use cases at Cisco and contributed to Android and Tizen frameworks at Samsung’s Silicon Valley Lab. Dr. Mukherji, who holds a Ph.D. in Computer Science from Worcester Polytechnic Institute, is an award-winning professional, an inventor with more than 40 patents and publications, and an IEEE Senior Member active in the research community.
Harshit Rajgarhia is an AI Lead with more than seven years of experience designing and delivering bespoke machine learning and deep learning solutions for Fortune 500 clients. His expertise spans a broad range of domains, including computer vision, natural language processing, time-series forecasting, and reinforcement learning. He has spearheaded the development and deployment of scalable AI for use cases such as autonomous imaging, trade promotion optimization, and intelligent data annotation platforms. Harshit began his AI career at Quantiphi, where he engineered production-ready recommender systems and vision-based automation tools for global enterprises. He is particularly passionate about building robust, end-to-end AI pipelines, incorporating MLOps best practices, and leveraging large language models to solve real-world problems. His work is driven by a commitment to reducing turnaround time and improving the quality and efficiency of the annotation lifecycle through data-centric AI solutions. Harshit’s current research interests include GenAI, multimodal learning, and the application of large language models for social impact. With a master's in data science and a deep foundation in applied AI, he actively contributes to both industrial and academic advancements. His work has been submitted to leading conferences such as ACL and VLDB, and he remains dedicated to ethical, reproducible AI development.