This blog is written by Ajay Palle from Inqui-lab Foundation. Team Inqui-lab has been mentored and guided by Akankshya Vyas, CTO of Pinky Promise.
Evaluating Student Innovation at Scale: Building a Multi-LLM Assessment Pipeline
How we used human evaluation as ground truth to benchmark three AI models for assessing 30 student innovation projects, and what we learned about AI-human alignment in subjective evaluation tasks.
The Problem: Innovation Evaluation Doesn’t Scale
Every year, thousands of students across India submit innovation projects to competitions like the School Innovation Marathon. Each submission needs evaluation across multiple dimensions: novelty, usefulness, feasibility, scalability, and sustainability. Traditional human evaluation is thorough but doesn’t scale—imagine manually reading and scoring 100,000 submissions with consistent criteria.
This is a classic “subjective evaluation at scale” problem that many organizations face: grant reviews, essay scoring, content moderation, or any domain where human judgment is needed but human bandwidth is limited.
Our Approach: Human Truth + Multi-Model Benchmarking
Rather than replacing human evaluation, we decided to calibrate AI models against human judgment. Our hypothesis: if we can establish strong AI-human alignment on a representative sample, we can scale human-quality evaluation to larger datasets.
Here’s what we built:
The Dataset: 30 Student Projects as Ground Truth
We started with a curated sample of 30 student innovation projects, carefully selected from our historical database of over 26,700 submissions to SIM 2024-25. Our selection criteria ensured representation across problem domains: technology, social issues, environmental challenges, healthcare, and education.
Each project contained:
- Problem statement: What issue is the student trying to solve?
- Solution description: Their proposed approach
- CID: Unique identifier for tracking
These ranged from simple ideas (“solar-powered phone charger”) to complex social innovations (“AI-powered crop disease detection system for small farmers”).
Human Evaluation: Establishing the Gold Standard
Before testing any AI models, we conducted thorough human evaluation using a structured rubric that we had been refining. This rubric was originally designed for human evaluators, complete with half-day training sessions to address nuances and potential misunderstandings that arise when evaluating student innovations.
The Challenge of Adaptation
Our original human-centered evaluation framework relied heavily on contextual understanding that came from training sessions, implicit knowledge about student capabilities across different grades, and flexible interpretation that human evaluators could apply. Converting this to work with LLMs required significant transformation.
METRICS = ["novelty", "usefulness", "feasibility", "scalability", "sustainability"]

# Example evaluation criteria (excerpt; the dict name below is illustrative)
RUBRIC = {
    "novelty": {
        "1-2": "Extremely common solution, no creativity",
        "3-4": "Minor adaptations of existing solutions",
        "5-6": "Some original thinking, moderate novelty",
        "7-8": "Original approach with creative elements",
        "9-10": "Highly innovative, breakthrough thinking",
    },
    # ... similar scales for the other metrics
}
Each project was evaluated by two to three reviewers, with final scores representing the human consensus. This gave us our ground truth: the scores and rationale that represent high-quality human judgment.
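The post does not spell out exactly how the individual reviewers' scores were combined into a consensus, so the aggregation below is only an assumption for illustration (median per metric), and the function name is ours:

```python
from statistics import median

# Same five dimensions as the rubric above
METRICS = ["novelty", "usefulness", "feasibility", "scalability", "sustainability"]

def consensus_scores(reviewer_scores: list[dict]) -> dict:
    """Combine 2-3 reviewers' metric scores into one consensus score per metric.

    Assumes each reviewer submits a dict like {"novelty": 7, "usefulness": 6, ...}.
    Median is used purely for illustration; a real panel may instead discuss
    and agree on a single score.
    """
    return {m: median(r[m] for r in reviewer_scores) for m in METRICS}
```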
From Human Training to LLM Prompts
To make this framework work for LLMs, we systematically transformed our approach:
- Explicit Context Addition: We embedded the contextual knowledge from our training sessions directly into system prompts
- Detailed Scoring Scales: Each parameter’s 1-10 scale was expanded with specific, actionable descriptions rather than general guidelines
- Edge Case Documentation: Common misunderstandings from human evaluator training were converted into explicit instructions
- Cultural Context Integration: We added specific Indian educational context that human evaluators understood implicitly
The result was our comprehensive evaluation_prompts.py framework—maintaining the rigor of human evaluation while being executable by LLMs at scale.
The Models: Testing Three Different Approaches
We evaluated three state-of-the-art models using identical prompts and zero-shot evaluation:
- GPT-4o-mini: OpenAI’s cost-optimized model
- GPT-5: OpenAI’s latest flagship model
- Gemini 2.5 Flash: Google’s high-speed model
Each model received the same structured prompt with detailed evaluation criteria, examples of good vs. poor reasoning, and specific instructions about Indian educational context.
The Pipeline: From Raw Ideas to Comparative Analysis
Our evaluation pipeline consists of several key components:
1. Evaluation Prompt Engineering (evaluation_prompts.py)
We spent significant time crafting prompts that would produce consistent, well-reasoned evaluations:
def build_evaluation_messages(problem: str, solution: str) -> list:
    """Build messages for LLM evaluation with structured criteria"""
    system_prompt = """You are a senior innovation evaluator with 15+ years
    of experience. Evaluate this student submission across 5 dimensions:

    CRITICAL INSTRUCTIONS:
    - These are school students (grades 6-12), not professional innovators
    - Consider Indian educational and resource context
    - Provide specific evidence for each score
    - Be fair but maintain rigorous standards

    Respond in JSON format with scores 1-10 and detailed reasoning.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Problem: {problem}\n\nSolution: {solution}"},
    ]
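For context, here is a minimal sketch of how these messages can be sent to a model with the official OpenAI Python SDK. This is not necessarily how our _call_openai helper is implemented, and the sample problem and solution text is invented for the example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = build_evaluation_messages(
    problem="Small farmers cannot spot crop diseases before they spread",
    solution="A phone app that photographs leaves and flags likely diseases",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},  # nudge the model toward strict JSON
)
print(response.choices[0].message.content)  # the JSON evaluation text
```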
2. Simple Evaluator (simple_evaluator.py)
A clean interface that abstracts the complexity of multiple LLM providers:
def evaluate_innovation(cid: str, problem: str, solution: str,
                        model: str) -> dict:
    """
    Evaluate a single innovation using the specified model.
    Returns structured JSON with scores and usage metrics.
    """
    messages = build_evaluation_messages(problem, solution)

    # Route to the appropriate provider (api_key is resolved per provider
    # elsewhere in the module, e.g. from environment variables)
    if "gpt" in model:
        response, usage = _call_openai(api_key, model, messages)
    elif "gemini" in model:
        response, usage = _call_gemini(api_key, model, messages)
    else:
        raise ValueError(f"Unsupported model: {model}")

    return {
        "metadata": {"cid": cid, "model": model, **usage},
        "input": {"problem": problem, "solution": solution},
        "evaluation": parse_evaluation_response(response),
        "feedback": generate_student_feedback(response),
    }
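parse_evaluation_response is referenced above but not shown in the post. A minimal sketch, assuming the model returns a JSON object keyed by metric with a score and a reasoning string, might look like this:

```python
import json

def parse_evaluation_response(response_text: str) -> dict:
    """Parse the model's JSON reply into {metric: {"score": ..., "reasoning": ...}}.

    Sketch only: assumes the reply is a JSON object keyed by metric name,
    each entry carrying a numeric "score" and a "reasoning" string.
    """
    raw = json.loads(response_text)
    return {
        metric: {"score": int(entry["score"]), "reasoning": entry.get("reasoning", "")}
        for metric, entry in raw.items()
    }
```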
3. Batch Processing (batch_process_sample30.py)
For systematic evaluation across all models:
import time

import pandas as pd

def batch_process_sample30():
    """Process all 30 samples with all 3 models"""
    df = pd.read_csv("data/samples/sample30.csv")
    models = ["gpt-4o-mini", "gpt-5", "gemini-2.5-flash"]
    output_dir = "results/sample30"  # output location (illustrative)

    for idx, row in df.iterrows():
        cid = str(row['CID'])
        problem = str(row['Problem'])
        solution = str(row['Solution'])

        for model in models:
            try:
                result = evaluate_innovation(cid, problem, solution, model)
                save_result(result, output_dir)
                time.sleep(1)  # Simple rate limiting between API calls
            except Exception as e:
                log_error(cid, model, e)
This gives us 90 evaluations total (30 projects × 3 models).
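save_result and log_error are referenced in the loop above but not shown. A plausible minimal version (file naming and paths are our assumptions) writes one JSON file per CID and model and logs failures so the batch keeps running:

```python
import json
import logging
from pathlib import Path

def save_result(result: dict, output_dir: str) -> None:
    """Write one evaluation to <output_dir>/<cid>_<model>.json (naming scheme assumed)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    meta = result["metadata"]
    path = out / f"{meta['cid']}_{meta['model']}.json"
    path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")

def log_error(cid: str, model: str, error: Exception) -> None:
    """Record a failed evaluation without stopping the batch run."""
    logging.error("Evaluation failed for CID %s with %s: %s", cid, model, error)
```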
Results: What We Learned About AI-Human Alignment
Overall Performance Metrics
After processing all 90 evaluations, here’s what we found:
| Model | Perfect Match | Partial Match | No Match | Total Cases |
|---|---|---|---|---|
| GPT-5 | 67 (44.7%) | 25 (16.7%) | 58 (38.7%) | 150 |
| GPT-4o-mini | 68 (45.3%) | 23 (15.3%) | 59 (39.3%) | 150 |
| Gemini | 53 (35.3%) | 19 (12.7%) | 78 (52.0%) | 150 |
Note: Total cases = 30 projects × 5 metrics = 150 evaluation points per model
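The match categories in the table come from comparing each model's score against the human score for the same project and metric. The thresholds in the sketch below are assumptions for illustration (exact equality as a perfect match, a one-point gap as a partial match), not necessarily the exact definitions used in the study; it shows how the 150 cases per model could be tallied:

```python
from collections import Counter

def match_category(human_score: int, model_score: int) -> str:
    """Classify agreement for one (project, metric) pair.

    Threshold assumption: an exact score match is "Perfect Match", a one-point
    gap is "Partial Match", anything wider is "No Match".
    """
    gap = abs(human_score - model_score)
    if gap == 0:
        return "Perfect Match"
    if gap == 1:
        return "Partial Match"
    return "No Match"

def alignment_summary(human: dict, model: dict) -> Counter:
    """Tally categories for one model over 30 projects x 5 metrics = 150 cases.

    Both arguments are nested dicts of the form {cid: {metric: score}}.
    """
    return Counter(
        match_category(human[cid][metric], model[cid][metric])
        for cid in human
        for metric in human[cid]
    )
```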
Surprising Finding: GPT-4o-mini Matches GPT-5
Our analysis revealed that GPT-4o-mini performed essentially identically to GPT-5 in human alignment, with near-identical perfect-match rates (45.3% vs. 44.7%). However, this surface-level equivalence masks important nuances:
- Structured Evaluation Advantage: For our highly structured, rubric-based evaluation task, model size showed diminishing returns
- Cost-Effectiveness: GPT-4o-mini offers significant cost savings (roughly 3x cheaper in this run; see the cost analysis below) with virtually identical performance
- Task-Specific Performance: This equivalence may not generalize to other evaluation domains requiring more complex reasoning
Metric-Specific Performance
Some metrics proved easier for AI models to evaluate than others:
| Metric | Perfect Match Rate (all three models combined) |
|---|---|
| Sustainability | 61.1% |
| Novelty | 46.7% |
| Usefulness | 38.9% |
| Feasibility | 31.1% |
| Scalability | 31.1% |
Sustainability evaluations showed the highest AI-human alignment, possibly because they involve more objective criteria (environmental impact, resource requirements). Scalability and Feasibility showed the lowest alignment, suggesting these require more subjective judgment that’s harder for models to replicate.
Cost-Benefit Analysis
Running all 90 evaluations (30 per model) cost approximately:
- GPT-4o-mini: ~$2.50
- GPT-5: ~$8.00
- Gemini 2.5 Flash: ~$1.20
Human evaluation of 30 projects took approximately 15 hours of expert time. Even accounting for the roughly 55% of cases without a perfect match that may still need human review, the hybrid approach represents significant time and cost savings.
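If you want to track spend the same way, the usage metadata saved with each evaluation can be aggregated into a rough cost estimate. The field names below are assumptions (providers report token usage slightly differently), and prices are passed in as parameters rather than hard-coded because they change frequently:

```python
def estimate_cost(results: list[dict], price_per_1m_input: float,
                  price_per_1m_output: float) -> float:
    """Roughly estimate spend for one model from saved usage metadata.

    Assumes each result's metadata carries "prompt_tokens" and
    "completion_tokens" (field names may differ per provider); prices are
    given per million tokens.
    """
    prompt = sum(r["metadata"].get("prompt_tokens", 0) for r in results)
    completion = sum(r["metadata"].get("completion_tokens", 0) for r in results)
    return (prompt * price_per_1m_input + completion * price_per_1m_output) / 1_000_000
```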
Building Your Own Evaluation Pipeline
If you’re facing a similar subjective evaluation challenge, here’s a framework you can adapt:
1. Start with Human Ground Truth
# Create a representative sample (20-50 items)
# Get high-quality human evaluation
# Document your rubric clearly
# Measure inter-rater reliability
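For the inter-rater reliability step, a quick starting point is exact agreement plus a weighted Cohen's kappa (quadratic weighting suits ordinal 1-10 scores); this sketch uses scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

def inter_rater_reliability(scores_a: list[int], scores_b: list[int]) -> dict:
    """Compare two reviewers' scores over the same items, one metric at a time."""
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)
    # Quadratic weighting penalises large disagreements more heavily on a 1-10 scale
    kappa = cohen_kappa_score(scores_a, scores_b, weights="quadratic")
    return {"exact_agreement": exact, "weighted_kappa": kappa}
```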
2. Engineer Your Prompts Systematically
# Define clear roles and context
# Provide specific examples
# Require structured output (JSON) as shown in the example below
# Include reasoning requirements
# Test prompt variations on your sample
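As a concrete illustration of the structured-output requirement, this is the kind of shape our prompts ask the model to return (the field values here are invented; adapt the schema to your own rubric):

```python
# Illustrative target shape for the JSON the evaluator must return
EXPECTED_RESPONSE_SHAPE = {
    "novelty": {"score": 7, "reasoning": "Adapts an existing idea in a creative way ..."},
    "usefulness": {"score": 6, "reasoning": "..."},
    "feasibility": {"score": 5, "reasoning": "..."},
    "scalability": {"score": 6, "reasoning": "..."},
    "sustainability": {"score": 8, "reasoning": "..."},
}
```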
3. Benchmark Multiple Models
# Test 2-3 different models
# Use identical prompts across models
# Measure alignment with human truth
# Consider cost vs. performance tradeoffs
Key Components Available
The complete pipeline is available for adaptation to your own evaluation challenges:
- Evaluation prompts (evaluation_prompts.py): Structured prompt templates
- Multi-model evaluator (simple_evaluator.py): API abstraction layer
- Batch processing (batch_process_sample30.py): Systematic evaluation workflow
- Analysis pipeline (process_batch_results.py): Visual comparison generation
The Broader Implications
This work sits at the intersection of several important trends:
- AI Evaluation: As models become more capable, understanding where they align with human judgment becomes crucial for deployment decisions.
- Human-AI Collaboration: Rather than replacement, the most effective approach often involves thoughtful division of labor between humans and AI.
- Scale vs. Quality: Perfect evaluation quality may not be necessary for all use cases—understanding the “good enough” threshold enables practical applications.
- Evaluation Methodology: Systematic benchmarking against human experts provides a principled way to validate AI performance in subjective domains.
The key insight: instead of asking “can AI replace human evaluation?”, ask “where can AI effectively augment human evaluation?” The answer, as we’ve shown, depends on careful measurement and thoughtful system design.
Future Directions
Several extensions could improve this approach:
- Production Monitoring: Moving from batch evaluation to continuous assessment of new submissions
- Failure Mode Classification: More systematic categorization of misalignment types
- Cost Optimization: Exploring smaller models for specific evaluation subtasks
- Confidence Scoring: Models that can indicate uncertainty for human handoff
- Multi-Modal Evaluation: Incorporating diagrams, prototypes, or video demonstrations
- Multi-Cultural Adaptation: Extending our framework to other educational contexts beyond India
Building effective AI evaluation systems requires careful attention to prompt engineering, systematic benchmarking against human judgment, and honest assessment of limitations. While perfect AI-human alignment remains elusive, understanding where models agree and disagree with human evaluators enables practical hybrid systems that scale human judgment effectively.