At Project Tech4dev, we try to look at technology with a wider lens—mainly, how it can actually help the social sector. Tech is just an enabler, sure, but when it’s used well, it can help NGOs do more with the resources they already have. And like everyone else, we’ve been paying close attention to the loudest thing in the room right now: AI.
Over the years, we’ve tested different AI tools and baked them into our products: text-to-speech (TTS), speech-to-text (STT), retrieval-augmented generation (RAG) with OpenAI’s File Search, and Vision. We first built these features for a few NGOs with very specific needs. But watching different teams take the same features and use them in their own ways has been genuinely inspiring. It’s one of the fun parts of building tech: seeing people do things you never imagined when you wrote the code.
NGOs have moved from trying out AI to running pilots to actually launching real solutions. With that kind of growth, the next step is making sure the AI keeps performing well. When you have lots of moving parts, quality can slip without anyone noticing. That’s where evaluation comes in—it gives you a clear, measurable sense of how your AI is doing.
Our earliest work in this space was a government project called Kunji, where we built a system to check the quality of bot responses. We also worked with Agency Fund, and during the Kenya sprint, Edmund ran a model evaluation workshop on why it matters and what it looks like in practice. Between that session and Kunji, it became clear that evaluation could help many NGOs already using AI in their tools.
So we decided to bring evaluation into our platform, Kaapi, where we’re building responsible AI infrastructure for the social sector.
The idea has been in the works for a while, but we started development in October and now have the first version ready. Since Kaapi is API-first, we’re also building a thin UI on top, so NGOs can try it out without writing code to call the endpoints.
Why AI Evaluation Is Hard
Traditional software testing is simple: check that 2 + 2 equals 4, done. AI doesn’t work that way because it’s non-deterministic. Ask “What’s the capital of France?” and you might get “Paris,” “Paris, France,” “The capital is Paris,” or “Paris, the City of Light”. Asking the same question yields similar but different answers each time. This “flakiness” means you can’t trust a single test run.
This is where semantic similarity comes in—measuring meaning, not exact character matches.
The other ingredient is a golden dataset: a standardized set of known questions paired with ground-truth answers, the responses you want your AI to give, that serves as the benchmark for evaluating an AI system.

Once uploaded, this dataset can be used to run any AI configuration against every question, compare the model’s responses with the expected answers, and generate a report card that reflects the system’s performance.
Creating Your First Dataset
Getting started is straightforward. Upload a CSV file, assign a name, and set a duplication factor: the number of times each question should be executed. For example, setting the factor to 5 results in each question being asked five separate times, allowing you to evaluate response consistency and detect potential flakiness.
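To make that concrete, here’s a minimal sketch of what a golden dataset might contain, written as a small Python script. The column names are illustrative assumptions, not Kaapi’s actual schema:

```python
import csv

# Hypothetical golden-dataset rows: a question and the answer we expect.
# The column names are illustrative, not Kaapi's actual schema.
rows = [
    {"question": "What's the capital of France?", "expected_answer": "Paris"},
    {"question": "Who wrote Hamlet?", "expected_answer": "William Shakespeare"},
]

# Write the CSV that would then be uploaded to the platform.
with open("golden_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_answer"])
    writer.writeheader()
    writer.writerows(rows)
```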
Once uploaded, the dataset becomes a permanent benchmark. Different AI configurations can be tested against the same set of questions, performance can be tracked over time, prompt variations can be compared, and updates can be validated to ensure previously working behaviors remain stable.
The Two-Phase Magic
When an evaluation is initiated, Kaapi follows a two-phase process rather than sending the configuration and each question to the model one at a time. This approach improves efficiency by leveraging OpenAI’s Batch API.
Phase 1: Generating Responses
Questions are bundled into a single batch job and sent to OpenAI’s servers. Instead of sending each question individually, which is slower and more resource-intensive, we pack all the questions into one large container and send it off in a single go.
The batch configuration specifies the dataset to be used and the assistant settings to be evaluated. The batch job then processes every question in the dataset, factoring in any duplication settings, and records the resulting outputs.
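Under the hood, this is roughly what packaging a batch for OpenAI’s Batch API looks like. This is a sketch, not Kaapi’s actual code; the model name, file names, and example questions are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

questions = ["What's the capital of France?", "Who wrote Hamlet?"]
DUPLICATION_FACTOR = 5  # each question is asked five times

# Each line of the JSONL file is one self-contained request.
with open("batch_input.jsonl", "w") as f:
    for i, question in enumerate(questions):
        for run in range(DUPLICATION_FACTOR):
            request = {
                "custom_id": f"q{i}-run{run}",  # lets us match outputs back to inputs
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user", "content": question}],
                },
            }
            f.write(json.dumps(request) + "\n")

# Upload the file and kick off the batch job in one go.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

The custom_id is what lets the scoring phase match each generated answer back to its question and expected answer.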
Phase 2: Similarity Scoring with Embeddings
Next, these responses still need to be scored. How do we measure whether “Paris, France” is as good as “Paris”? This is where embeddings enter the picture: numerical representations of text in which words or sentences with similar meanings end up close together.
The system takes each pair of texts—the AI’s generated answer and the expected answer—and converts them into embeddings using models like text-embedding-3-large. Then it calculates the cosine similarity between them.
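In code, that step is small. A minimal sketch using the OpenAI Python SDK and NumPy, with the example texts from earlier in this post:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 means closer in meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

generated = "Paris, France"
expected = "Paris"

# One API call embeds both texts at once; results come back in input order.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=[generated, expected],
)
vec_generated = np.array(response.data[0].embedding)
vec_expected = np.array(response.data[1].embedding)

print(f"similarity: {cosine_similarity(vec_generated, vec_expected):.3f}")
```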
Why Use Batches?
Two reasons: cost and speed. OpenAI’s Batch API costs 50% less than real-time calls and handles thousands of requests efficiently. That also means far less overhead on our end, since we don’t have to process and track each request individually; we just check the status of the batch periodically.
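That periodic check is just a polling loop. A sketch, assuming the OpenAI Python SDK; the poll interval is an arbitrary choice:

```python
import time
from openai import OpenAI

client = OpenAI()

def wait_for_batch(batch_id: str, poll_seconds: int = 60):
    """Poll a batch job until it reaches a terminal state."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in ("completed", "failed", "expired", "cancelled"):
            print(f"batch {batch_id} finished with status: {batch.status}")
            return batch
        time.sleep(poll_seconds)  # nothing to do in between; just wait
```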
Reading the Results – What Good Looks Like
So you’ve run your evaluation, and the status shows “completed”. Now what? Time to decode your AI’s report card.
Two Ways to Score: Embeddings and LLM-as-a-Judge
When evaluation completes, we get back two different types of scores—each telling something different about your AI’s performance.
1. Cosine Similarity (Embedding-based)
This is the score we discussed earlier. Using OpenAI’s embedding API, we convert both the generated answer and the expected answer into vectors, then calculate their cosine similarity. This gives you a numerical measure of semantic closeness: how similar in meaning the two texts are, regardless of exact wording.
2. LLM-as-a-Judge (via Langfuse)
But embeddings only tell part of the story. Two answers can be semantically similar yet differ in quality—one might be more helpful, more accurate, or better structured. That’s where LLM-as-a-judge comes in.
We’ve integrated with Langfuse, an observability platform for LLM applications, to set up automated evaluation using LLM-as-a-judge. Here’s how it works: once the dataset run completes, Langfuse uses a configured evaluator (typically a powerful LLM like GPT-4) to assess each response against the expected answer. The judge evaluates factors like correctness, completeness, relevance, and quality—things that pure similarity scores can miss.
This gives you a complementary perspective:
- Cosine similarity tells you “how close is the meaning?”
- LLM-as-a-judge tells you “how good is the answer?”
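To illustrate the judge idea, here’s a generic sketch. It is not Langfuse’s actual evaluator configuration; the prompt, model choice, and 0-to-1 scale are all assumptions:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer against a reference.
Question: {question}
Reference answer: {expected}
Assistant's answer: {generated}

Score the assistant's answer from 0 to 1 for correctness, completeness,
and relevance. Reply with only the number."""

def judge(question: str, expected: str, generated: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # a strong model acting as the judge
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, expected=expected, generated=generated
            ),
        }],
        temperature=0,  # keep the judge itself as deterministic as possible
    )
    return float(response.choices[0].message.content.strip())

print(judge("What's the capital of France?", "Paris", "The capital is Paris."))
```

In our setup this evaluator runs inside Langfuse rather than in our own code; the sketch just shows what the judge is being asked to do.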
Catching the Flakiness
At the start of the evaluation, we set a duplication factor, so each question runs multiple times. That repetition helps identify inconsistencies. Check for large differences in similarity scores across runs:
- Question A: [0.91, 0.89, 0.90] → Consistent
- Question B: [0.85, 0.42, 0.88] → Flaky
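A simple way to flag this automatically is to look at the spread of scores for each question. A sketch, with an arbitrary threshold:

```python
import statistics

# Per-question similarity scores across duplicated runs (from the example above).
scores_by_question = {
    "Question A": [0.91, 0.89, 0.90],
    "Question B": [0.85, 0.42, 0.88],
}

FLAKINESS_THRESHOLD = 0.1  # arbitrary cutoff for "too much spread"

for question, scores in scores_by_question.items():
    spread = statistics.stdev(scores)
    label = "flaky" if spread > FLAKINESS_THRESHOLD else "consistent"
    print(f"{question}: stdev={spread:.2f} -> {label}")
```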
Flakiness often suggests:
- Ambiguous questions that confuse the model
- High temperature settings (more randomness)
- Insufficiently clear instructions
Using Results to Improve
Here’s where evaluation gets useful:
Regression Testing: Run your evaluation before and after making changes, as sketched after this list. Did the score go up or down? No more “I think it’s better” conversations.
A/B Testing: Test two different prompts, models, or temperatures. The numbers tell you which one works better.
Finding Weak Spots: Look at your lowest-scoring questions, figure out why the AI struggles, then add more context or examples.
Continuous Monitoring: Run the same evaluation weekly. Is quality improving, staying flat, or getting worse?
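As an example of that regression check, comparing per-question scores between two runs can be this small (all scores below are made up):

```python
import statistics

# Hypothetical per-question scores from two runs of the same golden dataset.
before = [0.91, 0.84, 0.77, 0.95, 0.62]
after = [0.93, 0.88, 0.81, 0.94, 0.71]

print(f"mean before: {statistics.mean(before):.3f}")
print(f"mean after:  {statistics.mean(after):.3f}")

# A per-question diff catches regressions that averages can hide.
for i, (b, a) in enumerate(zip(before, after)):
    if a < b:
        print(f"question {i} regressed: {b:.2f} -> {a:.2f}")
```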
Getting Your Results
Evaluation runs can be accessed at any time to check their status, scores, configuration, and links to detailed results. The scorecard can also be downloaded as a CSV for further analysis, or shared with other teams for color-coding and annotation. All evaluations for a given dataset can be viewed together, making it easy to compare runs and track performance over time.

Bringing It All Together
Here’s the full flow: upload a golden dataset → generate responses with a single batch job (Phase 1) → score each response with cosine similarity and an LLM judge (Phase 2) → read the report card, fix the weak spots, and run it again.
What used to be “seems good to me” becomes measurable. You’re not guessing if your AI is getting better—you’re seeing it in the numbers.
The Confidence Factor
The toughest part of building AI isn’t the code—it’s trusting it. An AI assistant can’t be deployed to handle real questions without the assurance that it won’t hallucinate or give incorrect answers.
With traditional software, testing builds that confidence. For AI, it’s the evaluation that does the job. It’s the difference between hoping the assistant works and knowing it does.
The real question isn’t “Can I measure my AI’s quality?” It’s “Can I afford not to?”