This blog is written by Ajay Palle from Inqui-lab Foundation. Team Inqui-lab has been mentored and guided by Akankshya Vyas, CTO of Pinky Promise.
Evaluating Student Innovation at Scale: Building a Multi-LLM Assessment Pipeline
How we used human evaluation as ground truth to benchmark three AI models for assessing 30 student innovation projects, and what we learned about AI-human alignment in subjective evaluation tasks.
The Problem: Innovation Evaluation Doesn’t Scale
Every year, thousands of students across India submit innovation projects to competitions like the School Innovation Marathon. Each submission needs evaluation across multiple dimensions: novelty, usefulness, feasibility, scalability, and sustainability. Traditional human evaluation is thorough but doesn’t scale—imagine manually reading and scoring 100,000 submissions with consistent criteria.
This is a classic “subjective evaluation at scale” problem that many organizations face: grant reviews, essay scoring, content moderation, or any domain where human judgment is needed but human bandwidth is limited.
Our Approach: Human Truth + Multi-Model Benchmarking
Rather than replacing human evaluation, we decided to calibrate AI models against human judgment. Our hypothesis: if we can establish strong AI-human alignment on a representative sample, we can scale human-quality evaluation to larger datasets.
Here’s what we built:
The Dataset: 30 Student Projects as Ground Truth
We started with a curated sample of 30 student innovation projects, carefully selected from our historical database of over 26,700 submissions to SIM 2024-25. Our selection criteria ensured representation across problem domains: technology, social issues, environmental challenges, healthcare, and education.
Each project contained:
- Problem statement: What issue is the student trying to solve?
- Solution description: Their proposed approach
- CID: Unique identifier for tracking
These ranged from simple ideas (“solar-powered phone charger”) to complex social innovations (“AI-powered crop disease detection system for small farmers”).
Human Evaluation: Establishing the Gold Standard
Before testing any AI models, we conducted thorough human evaluation using a structured rubric that we had been refining. This rubric was originally designed for human evaluators, complete with half-day training sessions to address nuances and potential misunderstandings that arise when evaluating student innovations.
The Challenge of Adaptation
Our original human-centered evaluation framework relied heavily on contextual understanding that came from training sessions, implicit knowledge about student capabilities across different grades, and flexible interpretation that human evaluators could apply. Converting this to work with LLMs required significant transformation.
METRICS = ["novelty", "usefulness", "feasibility", "scalability", "sustainability"]

# Example evaluation criteria (excerpt; the dict name below is illustrative)
RUBRIC = {
    "novelty": {
        "1-2": "Extremely common solution, no creativity",
        "3-4": "Minor adaptations of existing solutions",
        "5-6": "Some original thinking, moderate novelty",
        "7-8": "Original approach with creative elements",
        "9-10": "Highly innovative, breakthrough thinking",
    },
    # ... similar scales for the other metrics
}
Each project was evaluated by two to three reviewers, with final scores representing the human consensus. This gave us our ground truth: the scores and rationale that represent high-quality human judgment.
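The post does not spell out exactly how the individual reviewers' scores were combined into a consensus, so the aggregation below is only an assumption for illustration (median per metric), and the function name is ours:

```python
from statistics import median

# Same five dimensions as the rubric above
METRICS = ["novelty", "usefulness", "feasibility", "scalability", "sustainability"]

def consensus_scores(reviewer_scores: list[dict]) -> dict:
    """Combine 2-3 reviewers' metric scores into one consensus score per metric.

    Assumes each reviewer submits a dict like {"novelty": 7, "usefulness": 6, ...}.
    Median is used purely for illustration; a real panel may instead discuss
    and agree on a single score.
    """
    return {m: median(r[m] for r in reviewer_scores) for m in METRICS}
```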
From Human Training to LLM Prompts
To make this framework work for LLMs, we systematically transformed our approach:
- Explicit Context Addition: We embedded the contextual knowledge from our training sessions directly into system prompts
- Detailed Scoring Scales: Each parameter’s 1-10 scale was expanded with specific, actionable descriptions rather than general guidelines
- Edge Case Documentation: Common misunderstandings from human evaluator training were converted into explicit instructions
- Cultural Context Integration: We added specific Indian educational context that human evaluators understood implicitly
The result was our comprehensive evaluation_prompts.py framework—maintaining the rigor of human evaluation while being executable by LLMs at scale.
The Models: Testing Three Different Approaches
We evaluated three state-of-the-art models using identical prompts and zero-shot evaluation:
- GPT-4o-mini: OpenAI’s cost-optimized model
- GPT-5: OpenAI’s latest flagship model
- Gemini 2.5 Flash: Google’s high-speed model
Each model received the same structured prompt with detailed evaluation criteria, examples of good vs. poor reasoning, and specific instructions about Indian educational context.
The Pipeline: From Raw Ideas to Comparative Analysis
Our evaluation pipeline consists of several key components:
1. Evaluation Prompt Engineering (evaluation_prompts.py)
We spent significant time crafting prompts that would produce consistent, well-reasoned evaluations:
def build_evaluation_messages(problem: str, solution: str) -> list:
    """Build messages for LLM evaluation with structured criteria"""
    system_prompt = """You are a senior innovation evaluator with 15+ years
    of experience. Evaluate this student submission across 5 dimensions:

    CRITICAL INSTRUCTIONS:
    - These are school students (grades 6-12), not professional innovators
    - Consider Indian educational and resource context
    - Provide specific evidence for each score
    - Be fair but maintain rigorous standards

    Respond in JSON format with scores 1-10 and detailed reasoning.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Problem: {problem}\n\nSolution: {solution}"},
    ]
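For context, here is a minimal sketch of how these messages can be sent to a model with the official OpenAI Python SDK. This is not necessarily how our _call_openai helper is implemented, and the sample problem and solution text is invented for the example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = build_evaluation_messages(
    problem="Small farmers cannot spot crop diseases before they spread",
    solution="A phone app that photographs leaves and flags likely diseases",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},  # nudge the model toward strict JSON
)
print(response.choices[0].message.content)  # the JSON evaluation text
```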
2. Simple Evaluator (simple_evaluator.py)
A clean interface that abstracts the complexity of multiple LLM providers:
def evaluate_innovation(cid: str, problem: str, solution: str,
                        model: str) -> dict:
    """
    Evaluate a single innovation using the specified model.
    Returns structured JSON with scores and usage metrics.
    """
    messages = build_evaluation_messages(problem, solution)

    # Route to the appropriate provider (api_key is resolved per provider
    # elsewhere in the module, e.g. from environment variables)
    if "gpt" in model:
        response, usage = _call_openai(api_key, model, messages)
    elif "gemini" in model:
        response, usage = _call_gemini(api_key, model, messages)
    else:
        raise ValueError(f"Unsupported model: {model}")

    return {
        "metadata": {"cid": cid, "model": model, **usage},
        "input": {"problem": problem, "solution": solution},
        "evaluation": parse_evaluation_response(response),
        "feedback": generate_student_feedback(response),
    }
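parse_evaluation_response is referenced above but not shown in the post. A minimal sketch, assuming the model returns a JSON object keyed by metric with a score and a reasoning string, might look like this:

```python
import json

def parse_evaluation_response(response_text: str) -> dict:
    """Parse the model's JSON reply into {metric: {"score": ..., "reasoning": ...}}.

    Sketch only: assumes the reply is a JSON object keyed by metric name,
    each entry carrying a numeric "score" and a "reasoning" string.
    """
    raw = json.loads(response_text)
    return {
        metric: {"score": int(entry["score"]), "reasoning": entry.get("reasoning", "")}
        for metric, entry in raw.items()
    }
```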
3. Batch Processing (batch_process_sample30.py)
For systematic evaluation across all models:
import time

import pandas as pd

def batch_process_sample30():
    """Process all 30 samples with all 3 models"""
    df = pd.read_csv("data/samples/sample30.csv")
    models = ["gpt-4o-mini", "gpt-5", "gemini-2.5-flash"]
    output_dir = "results/sample30"  # output location (illustrative)

    for idx, row in df.iterrows():
        cid = str(row['CID'])
        problem = str(row['Problem'])
        solution = str(row['Solution'])

        for model in models:
            try:
                result = evaluate_innovation(cid, problem, solution, model)
                save_result(result, output_dir)
                time.sleep(1)  # Simple rate limiting between API calls
            except Exception as e:
                log_error(cid, model, e)
This gives us 90 evaluations total (30 projects × 3 models).
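save_result and log_error are referenced in the loop above but not shown. A plausible minimal version (file naming and paths are our assumptions) writes one JSON file per CID and model and logs failures so the batch keeps running:

```python
import json
import logging
from pathlib import Path

def save_result(result: dict, output_dir: str) -> None:
    """Write one evaluation to <output_dir>/<cid>_<model>.json (naming scheme assumed)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    meta = result["metadata"]
    path = out / f"{meta['cid']}_{meta['model']}.json"
    path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")

def log_error(cid: str, model: str, error: Exception) -> None:
    """Record a failed evaluation without stopping the batch run."""
    logging.error("Evaluation failed for CID %s with %s: %s", cid, model, error)
```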
Results: What We Learned About AI-Human Alignment
Overall Performance Metrics
After processing all 90 evaluations, here’s what we found:
| Model | Perfect Match | Partial Match | No Match | Total Cases |
|---|---|---|---|---|
| GPT-5 | 67 (44.7%) | 25 (16.7%) | 58 (38.7%) | 150 |
| GPT-4o-mini | 68 (45.3%) | 23 (15.3%) | 59 (39.3%) | 150 |
| Gemini | 53 (35.3%) | 19 (12.7%) | 78 (52.0%) | 150 |
Note: Total cases = 30 projects × 5 metrics = 150 evaluation points per model
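The match categories in the table come from comparing each model's score against the human score for the same project and metric. The thresholds in the sketch below are assumptions for illustration (exact equality as a perfect match, a one-point gap as a partial match), not necessarily the exact definitions used in the study; it shows how the 150 cases per model could be tallied:

```python
from collections import Counter

def match_category(human_score: int, model_score: int) -> str:
    """Classify agreement for one (project, metric) pair.

    Threshold assumption: an exact score match is "Perfect Match", a one-point
    gap is "Partial Match", anything wider is "No Match".
    """
    gap = abs(human_score - model_score)
    if gap == 0:
        return "Perfect Match"
    if gap == 1:
        return "Partial Match"
    return "No Match"

def alignment_summary(human: dict, model: dict) -> Counter:
    """Tally categories for one model over 30 projects x 5 metrics = 150 cases.

    Both arguments are nested dicts of the form {cid: {metric: score}}.
    """
    return Counter(
        match_category(human[cid][metric], model[cid][metric])
        for cid in human
        for metric in human[cid]
    )
```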
Surprising Finding: GPT-4o-mini Matches GPT-5
Our analysis revealed that GPT-4o-mini performed essentially identically to GPT-5 in human alignment, with near-identical perfect-match rates (45.3% vs. 44.7%). However, this surface-level equivalence masks important nuances:
- Structured Evaluation Advantage: For our highly structured, rubric-based evaluation task, model size showed diminishing returns
- Cost-Effectiveness: GPT-4o-mini offers significant cost savings (roughly 3x cheaper in this run; see the cost analysis below) with virtually identical performance
- Task-Specific Performance: This equivalence may not generalize to other evaluation domains requiring more complex reasoning
Metric-Specific Performance
Some metrics proved easier for AI models to evaluate than others:
| Metric | Perfect Match Rate (all three models combined) |
|---|---|
| Sustainability | 61.1% |
| Novelty | 46.7% |
| Usefulness | 38.9% |
| Feasibility | 31.1% |
| Scalability | 31.1% |
Sustainability evaluations showed the highest AI-human alignment, possibly because they involve more objective criteria (environmental impact, resource requirements). Scalability and Feasibility showed the lowest alignment, suggesting these require more subjective judgment that’s harder for models to replicate.
Cost-Benefit Analysis
Running all 90 evaluations (30 per model) cost approximately:
- GPT-4o-mini: ~$2.50
- GPT-5: ~$8.00
- Gemini 2.5 Flash: ~$1.20
Human evaluation of 30 projects took approximately 15 hours of expert time. Even accounting for the roughly 55% of cases without a perfect match that may still need human review, the hybrid approach represents significant time and cost savings.
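If you want to track spend the same way, the usage metadata saved with each evaluation can be aggregated into a rough cost estimate. The field names below are assumptions (providers report token usage slightly differently), and prices are passed in as parameters rather than hard-coded because they change frequently:

```python
def estimate_cost(results: list[dict], price_per_1m_input: float,
                  price_per_1m_output: float) -> float:
    """Roughly estimate spend for one model from saved usage metadata.

    Assumes each result's metadata carries "prompt_tokens" and
    "completion_tokens" (field names may differ per provider); prices are
    given per million tokens.
    """
    prompt = sum(r["metadata"].get("prompt_tokens", 0) for r in results)
    completion = sum(r["metadata"].get("completion_tokens", 0) for r in results)
    return (prompt * price_per_1m_input + completion * price_per_1m_output) / 1_000_000
```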
Building Your Own Evaluation Pipeline
If you’re facing a similar subjective evaluation challenge, here’s a framework you can adapt:
1. Start with Human Ground Truth
# Create a representative sample (20-50 items)
# Get high-quality human evaluation
# Document your rubric clearly
# Measure inter-rater reliability
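For the inter-rater reliability step, a quick starting point is exact agreement plus a weighted Cohen's kappa (quadratic weighting suits ordinal 1-10 scores); this sketch uses scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

def inter_rater_reliability(scores_a: list[int], scores_b: list[int]) -> dict:
    """Compare two reviewers' scores over the same items, one metric at a time."""
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)
    # Quadratic weighting penalises large disagreements more heavily on a 1-10 scale
    kappa = cohen_kappa_score(scores_a, scores_b, weights="quadratic")
    return {"exact_agreement": exact, "weighted_kappa": kappa}
```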
2. Engineer Your Prompts Systematically
# Define clear roles and context
# Provide specific examples
# Require structured output (JSON) as shown in the example below
# Include reasoning requirements
# Test prompt variations on your sample
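As a concrete illustration of the structured-output requirement, this is the kind of shape our prompts ask the model to return (the field values here are invented; adapt the schema to your own rubric):

```python
# Illustrative target shape for the JSON the evaluator must return
EXPECTED_RESPONSE_SHAPE = {
    "novelty": {"score": 7, "reasoning": "Adapts an existing idea in a creative way ..."},
    "usefulness": {"score": 6, "reasoning": "..."},
    "feasibility": {"score": 5, "reasoning": "..."},
    "scalability": {"score": 6, "reasoning": "..."},
    "sustainability": {"score": 8, "reasoning": "..."},
}
```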
3. Benchmark Multiple Models
# Test 2-3 different models
# Use identical prompts across models
# Measure alignment with human truth
# Consider cost vs. performance tradeoffs
Key Components Available
The complete pipeline is available for adaptation to your own evaluation challenges:
- Evaluation prompts (evaluation_prompts.py): Structured prompt templates
- Multi-model evaluator (simple_evaluator.py): API abstraction layer
- Batch processing (batch_process_sample30.py): Systematic evaluation workflow
- Analysis pipeline (process_batch_results.py): Visual comparison generation
The Broader Implications
This work sits at the intersection of several important trends:
- AI Evaluation: As models become more capable, understanding where they align with human judgment becomes crucial for deployment decisions.
- Human-AI Collaboration: Rather than replacement, the most effective approach often involves thoughtful division of labor between humans and AI.
- Scale vs. Quality: Perfect evaluation quality may not be necessary for all use cases—understanding the “good enough” threshold enables practical applications.
- Evaluation Methodology: Systematic benchmarking against human experts provides a principled way to validate AI performance in subjective domains.
The key insight: instead of asking “can AI replace human evaluation?”, ask “where can AI effectively augment human evaluation?” The answer, as we’ve shown, depends on careful measurement and thoughtful system design.
Future Directions
Several extensions could improve this approach:
- Production Monitoring: Moving from batch evaluation to continuous assessment of new submissions
- Failure Mode Classification: More systematic categorization of misalignment types
- Cost Optimization: Exploring smaller models for specific evaluation subtasks
- Confidence Scoring: Models that can indicate uncertainty for human handoff
- Multi-Modal Evaluation: Incorporating diagrams, prototypes, or video demonstrations
- Multi-Cultural Adaptation: Extending our framework to other educational contexts beyond India
Building effective AI evaluation systems requires careful attention to prompt engineering, systematic benchmarking against human judgment, and honest assessment of limitations. While perfect AI-human alignment remains elusive, understanding where models agree and disagree with human evaluators enables practical hybrid systems that scale human judgment effectively.