Avanti Fellows – AI Cohort 2.0 Reflections

Jul 2026

This blog is written by Poojitha from Avanti Fellows.

At the end of my last blog, we had a pilot that didn’t go according to plan, a spillover effect we didn’t design for and two things in the roadmap: build an evaluation layer and scale. This is where Cohort 2 began. 

Pilot Results 

We managed to pilot the AI summaries for mentors across 2 schools, 8 mentors and 72 students. Adoption was 76% ie. Teachers used the AI summaries while mentoring 57 out of 72 students. Accuracy was 80.7%. Personalisation and actionability both hovered around 67-68%. But only 56.1% of summaries were hallucination-free. These figures are as reported by the mentors through a simple feedback form.

When we dug deeper with some of the teachers, we realised the accuracy issue stemmed from data interpretation. Results were being retrieved correctly from our database but interpreted incorrectly by the model: like reading a score change from 27 to 19 as an improvement, or describing performance in Mole Concepts as going “down from -7 to 4.” While the data points were right, the model’s reading of them wasn’t.

Before fixing anything, we went back to teachers

Before making changes to the system prompt or brainstorming ways to improve accuracy, we took a step back and ran a two-day Human-Centred Design research study with all 13 teacher-mentors who were part of the pilot.

In my previous blog, I mentioned that we had been constantly experimenting with different problem statements. As a result, most of the work on the AI summaries had followed a top-down approach → we were interacting mainly with the central mentorship and curriculum teams to roll out and iterate on solutions quickly. But we hadn’t really asked the teachers themselves: is this solving a pressing problem they actually have? What does their everyday life look like? And how did they perceive the AI summaries on the ground?

What we learned reframed the problem. We already knew teachers were busy. What we didn’t fully understand was what that busyness actually looked like from the inside. Teachers don’t just have bandwidth issues – their cognitive load is almost always at capacity. They juggle four parallel tracks at once: completing syllabus, 1:1 mentorship, monitoring self-study and prepping for tomorrow. What gets dropped from that list is exactly what mentorship depends on: 1:1 time, doubt-solving, trying different approaches with struggling students.

The research also surfaced something we hadn’t named before. Teachers carry three layers of student knowledge — formal sources (test reports, attendance), observational (who went quiet, who engaged), and conversational (what came up in informal moments). Only the first layer lives in any system. The other two exist entirely in a teacher’s head. Which means preparation depth across teachers is wildly inconsistent, and when a teacher hands over a student to someone else, continuity is almost entirely memory-dependent.

This gave us a clearer brief for Cohort 2: the tool can’t just be accurate. It needs to be a genuine extension of what teachers already know — not a replacement for their judgment.

On Day 2 of the study, all 13 teachers annotated the same prototype. Six things came through clearly: data trust is fragile (one wrong number and the whole summary is questioned), “start with wins” works, the guidance section was too long, teachers want counts not just percentages, AI should surface and teachers should decide and the core value is time saved.


What we actually fixed

Fix 1: Redesigning the guidance

The old guidance read like a report. It told teachers what to think. The redesign made each bullet conversational — grounded in a specific data point, acknowledging what the mentor already noted and ending with a question that opens the session.

Fix 2: Solving the accuracy problem

The 44% hallucinations issue from Cohort 1 had one root cause: we were asking the LLM to do arithmetic in prose. LLMs are unreliable at numerical reasoning and operations when they’re also generating narrative → they misread percentage changes, get trajectory directions wrong, fabricate specifics.

The fix was architectural. 

  1. Tools: We moved all computation, variance and delta analysis of the numerical data out of the model entirely. The LLM now has dedicated tools (scripts) which handle the numbers deterministically: one for subject and chapter analysis, one for test comparisons, one for cutoff calculations. The LLM uses those tools to generate pre-computed numerical analysis files, and reasons on these. 
  2. Harness: Tools and working with files and scripts requires a harness. Till now we only had a single LLM call being made. We used the open source Pi agent and gave our LLM a harness to work in. This also allowed us to add full pipeline transcript visibility so we can see exactly what the model read, wrote and reasoned from.

The result: accuracy went from 80% to 100% on our evaluation benchmark.

Fix 3: Building the evaluation pipeline

Building an evaluation layer was something we’d been putting off since Cohort 1. It was a significant unknown and it was easy to keep deprioritising it in favour of shipping. Cohort 2 forced us to prioritise this as we were making prompt changes with no reliable way to know if they’d helped or hurt, which wasn’t sustainable

In Cohort 2, we built a proper evaluation system.

We created a golden dataset of 6 benchmark students, chosen to cover the range of profiles a mentor actually encounters — a consistent improver (19→28→37), a consistent decliner (36→37→17), a volatile student (38→17→31), a plateau case, someone strong in one subject and weak in everything else, and someone uniformly strong with rich mentor notes. Each benchmark student has a human-approved reference report. Every time we change the prompt or the pipeline, we run an evaluation against all 6 and score them on 6 criteria: completeness, accuracy, actionability, personalisation, tone and mentor note usage.

We’re currently tuning the eval further such that given the same set of input reports for the six students, they get scored the same way with a judge model and a scoring prompt and it matches the scoring given to those exact reports by human experts. 

We’ve been able to keep the actual report generation cheap (0.95 INR per student) because we have an eval system in place which gives a good score to DeepSeek v4 pro, which is quite cheap of a model. For our Judge, we’ve come down to GLM 5.2 which costs about 0.05 usd per eval run (benchmark of 6 students).  

Fix 4: Post-session continuity

This one came directly from the research. We’re building a mentorship workspace inside the LMS where teachers can log post-session notes, see the full history of a student’s sessions and track whether the guidance from the previous summary actually made a difference.


Where we stand

The pilot in Cohort 2 covered 2 schools, 8 mentors, 57 students. Summaries are being used. Teachers have told us the core value is time saved and the earliest feedback is that it’s delivering that. Rollout in July expands to 4 schools across Punjab and Uttarakhand with the Navodaya Vidyalaya centres potentially coming on in August.

The metric we care most about heading into rollout is whether teachers have genuinely bought in. Did they feel that the feedback they gave during the research actually made it into the product? Are they coming to this tool before every session as a routine, not because they were told to? And would they recommend it to another teacher without being asked (NPS)?


What this cohort taught us

Cohort 1 helped us identify the right problem to solve. In Cohort 2 we understood the difference between improving a product and fully understanding the problem it’s embedded in.

The two-day research study not only gave us a list of features to build but also changed what we were optimising for. We went from solving for “how do we make the summary better?” to brainstorming “what does Bhumika Ma’am actually need to walk into a session feeling prepared?” The first question leads to prompt tuning. The second led to an architecture rethink, a guidance redesign and a continuity layer. If we had skipped the research and gone straight to fixing, we would have fixed the wrong things faster.

The other big shift this cohort was finally building the evaluation layer, something we’d known we needed since Cohort 1. It felt like a significant undertaking and it was easy to keep pushing it down the list. But building the eval pipeline from scratch turned out to be one of the high-leverage things we did. We went from gut-checking outputs to having a repeatable, low-cost feedback loop.

You may also like

From Maker Day Challenge to Production: How a Small Feature Solved a Big Problem for NGOs

Pointing Claude at the right work, not just the code

My Circus, My Monkey: How AI Pairing Changed the Way I Code