Your AI Evaluator Is Probably Blind

by Alien Brain Trust AI Learning

Five modules. Five identical complaints. “Content cuts off mid-sentence.” “Incomplete delivery.” “Truncated ending.”

We spent two days adding glossaries, expanding sections, rewriting intros. Scores barely moved.

Then we found the bug.

The Symptom

We built a QA bot that sends each course module to Claude Haiku for evaluation. Haiku scores each module 0-10 on clarity, completeness, and actionability.
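
For context, the evaluation call itself is nothing exotic. Here is a minimal sketch of the shape of that request, assuming curl and jq are available, an ANTHROPIC_API_KEY environment variable, and that $CONTENT already holds the module text (how that variable gets filled turns out to be the whole story). The prompt wording is illustrative, not our exact script:

# Build the request body with jq so the module text is safely JSON-escaped.
REQUEST=$(jq -n --arg content "$CONTENT" '{
  model: "claude-haiku-4-5-20251001",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: ("Score this course module 0-10 on clarity, completeness, and actionability:\n\n" + $content)
  }]
}')

# Send it to the Messages API and capture the review.
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d "$REQUEST"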

Every single module scored 4-6/10. Every review said some version of “the content is incomplete” or “the lesson cuts off.”

Module 1: “The lesson cuts off mid-sentence at the end.”
Module 2: “Source 5 cuts off mid-word.”
Module 3: “The lesson ends mid-word.”
Module 4: “Text cuts off mid-word at end of case study.”
Module 5: “The checklist cuts off at Checkpoint 4 of 9.”

We assumed the content was bad. We started fixing it.

The Wrong Fix

We added glossaries of key definitions to every module. We expanded brief sections. We rewrote Module 3’s attack descriptions from one-liners to full paragraphs.

Scores went from 5/10 to 6/10. Better, but the “truncated” complaints persisted.

That’s when we stopped fixing content and started debugging the test.

The Root Cause

One line in our test script:

CONTENT=$(head -c 3000 "$MODULE_PATH/video-script.md")

Our video scripts are 13-21KB. We were sending the AI evaluator the first 3,000 bytes—roughly the introduction and first section header.

The AI wasn’t wrong. From its perspective, the content literally did cut off mid-sentence. Because we cut it off.

The Fix

CONTENT=$(head -c 20000 "$MODULE_PATH/video-script.md")

One number changed. Scores went from 5/10 to 7/10 instantly. No content changes needed.
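
A hard-coded number can quietly fall behind again as the scripts grow. One small guard (a sketch using the same variables as the line above) at least makes the truncation loud instead of silent:

FILE="$MODULE_PATH/video-script.md"
WINDOW=20000
FILE_SIZE=$(wc -c < "$FILE")
# Warn loudly if the evaluator is about to see only part of the file.
if [ "$FILE_SIZE" -gt "$WINDOW" ]; then
  echo "WARNING: $FILE is $FILE_SIZE bytes; only the first $WINDOW will be evaluated" >&2
fi
CONTENT=$(head -c "$WINDOW" "$FILE")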

The Score Progression

| Eval Window | Avg Score | AI Complaints |
| --- | --- | --- |
| 3KB | 5/10 | “Truncated,” “cuts off,” “incomplete” |
| 12KB | 6/10 | Some truncation flags on largest files |
| 20KB | 7/10 | Genuine feedback on content quality |

The jump from 3KB to 20KB was the single largest quality improvement in the entire project. Bigger than all content fixes combined.

Why This Happens More Than You Think

If you’re using AI to evaluate content, grade documents, review code, or assess quality, ask yourself: how much of the input does the AI actually see?

Common blind spots:

Token limits you forgot about. Most APIs have input token limits. If you’re truncating before sending, the AI is grading a fragment.

Preprocessing that strips content. HTML-to-text conversion, markdown parsing, file reading utilities—any of these can silently truncate.

Context window assumptions. “Claude can handle 200K tokens” doesn’t mean your wrapper code sends 200K tokens. Check what actually gets sent.

Batch processing shortcuts. Truncating with head -c N or a [:N] slice to keep API costs down is rational. But if N is too small, you’re paying for useless evaluations.
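
If cost is the reason for a small N, it helps to know what the full file would actually cost before cutting it. A back-of-the-envelope sketch, using the rough heuristic that English prose runs around four bytes per token (an approximation, not an exact count):

FILE="$MODULE_PATH/video-script.md"
FILE_BYTES=$(wc -c < "$FILE")
# ~4 bytes per token is a rough rule of thumb for English text.
EST_TOKENS=$(( FILE_BYTES / 4 ))
echo "$FILE: $FILE_BYTES bytes, roughly $EST_TOKENS tokens"

At that rate, even a 20KB script is on the order of 5,000 tokens, a small fraction of a 200K-token context. The savings from truncating it were never worth the broken evaluations.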

How to Catch It

1. Log the actual input. Before sending content to your evaluator, log the byte count. Compare it to the source file size. If they don’t match, you have a truncation problem.

FILE_SIZE=$(wc -c < "$FILE")
# ${#CONTENT} is a character count; it matches the byte count only for plain-ASCII text.
SENT_SIZE=${#CONTENT}
echo "File: $FILE_SIZE bytes, Sent: $SENT_SIZE bytes"

2. Check for consistent complaints. If every evaluation has the same criticism (“incomplete,” “truncated,” “cuts off”), the problem is probably systemic, not content-specific.
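
One way to spot the pattern automatically, assuming each review is saved to its own file under a hypothetical reviews/ directory:

# Count how many saved reviews complain about truncation.
MATCHES=$(grep -ril -e 'truncat' -e 'cuts off' -e 'incomplete' reviews/ | wc -l)
TOTAL=$(ls reviews/ | wc -l)
echo "$MATCHES of $TOTAL reviews flag truncation"

If that ratio is close to 100%, suspect the pipeline before the content.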

3. Test with known-good content. Send a complete, polished document through your pipeline. If the AI still says “incomplete,” your pipeline is the problem.

4. Compare raw vs. processed. Open the source file. Count the sections. Then check what the AI received. If sections are missing, trace the pipeline.
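
A quick way to do that comparison, assuming the modules are Markdown files whose sections start with # headings and that $CONTENT holds what was actually sent:

# Compare heading counts in the source file vs. the content handed to the evaluator.
SRC_SECTIONS=$(grep -c '^#' "$FILE")
SENT_SECTIONS=$(printf '%s\n' "$CONTENT" | grep -c '^#')
echo "Headings in file: $SRC_SECTIONS, headings sent: $SENT_SECTIONS"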

The Broader Lesson

We spent two days improving content that was already complete. The AI evaluator told us “this is incomplete” and we believed it—because we built the evaluator and trusted its output.

The evaluator was right. It was just describing what it could see, not what existed.

When your AI gives you feedback, debug the pipeline before you debug the content. The model is usually doing exactly what you asked. The question is whether you asked the right thing.


Scores from Claude Haiku (claude-haiku-4-5-20251001) evaluating Secure Prompt Vault course modules. File sizes and byte counts from actual production test runs.