We Built a Bot That Fixes Its Own Course

by Alien Brain Trust AI Learning


Our course had a problem. Every module scored 5/10 on our AI evaluator. The content was “truncated,” “incomplete,” “missing key definitions.”

Except it wasn’t.

The Real Bug Was the Test, Not the Content

We built a simulated student bot that enrolls in our Secure Prompt Vault course, reads every module, and has Claude Haiku evaluate each one on clarity, completeness, and actionability.

Every module failed the same way: “content cuts off mid-sentence.”

Root cause? One line of bash:

CONTENT=$(head -c 3000 "$MODULE_PATH/video-script.md")

Our eval window was 3KB. Our files are 13-21KB. The AI evaluator never saw most of the content. It read the intro and assumed the rest was missing.

Fix: head -c 20000. Scores jumped from 5/10 to 7/10 across the board.
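For reference, here's a minimal sketch of what the evaluation step looks like once the full file is read. The prompt wording and model alias are illustrative, not the production script; the only point is that Haiku finally sees the whole module.

```python
# A minimal sketch of the evaluation step after the fix (prompt wording and
# model alias are illustrative). The key change: read the whole module file,
# not the first 3 KB.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_module(module_dir: str) -> str:
    content = Path(module_dir, "video-script.md").read_text()  # full file, no head -c 3000
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder for whichever Haiku version is in use
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Evaluate this course module for clarity, completeness, and "
                "actionability. Score it 1-10 and list specific problems.\n\n" + content
            ),
        }],
    )
    return response.content[0].text
```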

The lesson: when your AI says “the content is incomplete,” maybe the content is fine and your AI can’t see it.

From Manual Fixes to Autonomous Agent

The old workflow was painful:

  1. Cron runs daily test at 10:00 UTC
  2. Bot sends Telegram report
  3. I read the report
  4. I open Claude Code
  5. I fix files manually
  6. I commit, subtree push to student repo

That’s a 30-minute loop. For every fix cycle.

The new workflow:

  1. Cron runs daily test
  2. Bot sends report
  3. I send /fix from Telegram
  4. Bot reads the report, identifies failures, fixes files, creates a PR
  5. I send /approve 2 from Telegram
  6. Done

Same infrastructure. Same $7/month t4g.micro instance. We just gave the bot tools.
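The command surface is tiny: two Telegram commands. Here's a hedged sketch of how that side could be wired, assuming python-telegram-bot for the handlers; load_latest_report, run_fix_agent, and merge_pr are stand-ins for pieces of the bot the post describes but doesn't show.

```python
# Sketch of the Telegram command surface (library choice and helper names are
# assumptions; the post only describes the /fix and /approve behavior).
import os

from telegram import Update
from telegram.ext import ApplicationBuilder, CommandHandler, ContextTypes

async def fix(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    report = load_latest_report()        # hypothetical: parse FAILs/WARNs/scores from the last run
    result = run_fix_agent(report)       # the tool_use loop sketched later in this post
    await update.message.reply_text(result)

async def approve(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    pr_number = int(context.args[0])     # e.g. "/approve 2"
    merge_pr(pr_number)                  # hypothetical: GitHub API merge, gated on this command
    await update.message.reply_text(f"PR #{pr_number} merged.")

app = ApplicationBuilder().token(os.environ["TELEGRAM_TOKEN"]).build()
app.add_handler(CommandHandler("fix", fix))
app.add_handler(CommandHandler("approve", approve))
app.run_polling()
```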

8 Tools, 7 Safety Layers

The agent runs Claude Haiku with 8 sandboxed tools:

| Tool | What It Does |
| --- | --- |
| read_file | Read any file (path-validated) |
| write_file | Create or overwrite files |
| edit_file | Surgical find-and-replace |
| list_files | Glob pattern matching |
| search_files | Grep across the repo |
| run_shell | Allowlisted commands only |
| git_ops | Branch, commit, push, status |
| create_pr | GitHub API PR creation |
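These get handed to Claude through the Anthropic SDK's tool_use interface. A sketch of what two of the definitions might look like; only the tool names come from the table above, the schemas are our assumption.

```python
# Illustrative tool definitions in the Anthropic messages API format.
# Names match the table above; the exact schemas are assumed.
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file inside the student repo. Paths are validated first.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_shell",
        "description": "Run an allowlisted, read-only shell command (git, ls, grep, ...).",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]
```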

Every tool is sandboxed. Here’s why that matters:

  1. Filesystem isolation — All paths must resolve within the student repo directory. No ../../ traversal.
  2. Command allowlist — Shell access is limited to: git, ls, find, grep, cat, head, tail, wc, diff, tree. Nothing destructive.
  3. Git remote lock — Validates the origin URL contains secure-prompt-vault before any push.
  4. Account scope — The bot’s GitHub PAT only has collaborator access to the student repo. It physically cannot touch our private repo.
  5. Branch naming — All branches must start with bot/. Enforced in code.
  6. System prompt guardrails — Claude is told: only modify student repo, never destructive commands.
  7. Human approval gate — The bot cannot merge its own PRs. I must explicitly /approve from Telegram.

Defense in depth. The bot can fix content. It cannot break infrastructure.
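To make the first few layers concrete, here is a sketch of what the validators could look like, assuming a Python implementation with a hypothetical REPO_ROOT; the real code isn't shown in this post.

```python
# Sketch of layers 1, 2, 3, and 5 (paths, command allowlist, remote lock, branch prefix).
# REPO_ROOT and the error handling are assumptions.
import subprocess
from pathlib import Path

REPO_ROOT = Path("/home/bot/secure-prompt-vault").resolve()  # student repo only
ALLOWED_COMMANDS = {"git", "ls", "find", "grep", "cat", "head", "tail", "wc", "diff", "tree"}

def validate_path(user_path: str) -> Path:
    """Layer 1: every path must resolve inside the student repo, no ../../ traversal."""
    resolved = (REPO_ROOT / user_path).resolve()
    if not resolved.is_relative_to(REPO_ROOT):
        raise PermissionError(f"Path escapes sandbox: {user_path}")
    return resolved

def validate_command(command: str) -> None:
    """Layer 2: only read-only, allowlisted binaries may run."""
    binary = command.strip().split()[0]
    if binary not in ALLOWED_COMMANDS:
        raise PermissionError(f"Command not allowlisted: {binary}")

def validate_push(branch: str) -> None:
    """Layers 3 and 5: correct remote and a bot/ branch prefix, enforced in code."""
    origin = subprocess.run(
        ["git", "-C", str(REPO_ROOT), "remote", "get-url", "origin"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if "secure-prompt-vault" not in origin:
        raise PermissionError(f"Unexpected remote: {origin}")
    if not branch.startswith("bot/"):
        raise PermissionError(f"Branch must start with bot/: {branch}")
```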

The First PR

The bot’s first autonomous PR fixed three things:

  • Created a missing FAQ.md that onboarding emails referenced
  • Expanded incomplete sections in two module files
  • Added definitions for jargon that confused the AI evaluator

It created a branch (bot/fix-critical-truncation-qa), committed, pushed, and opened the PR. All from one /fix command.
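The create_pr tool is a thin wrapper around GitHub's pull-request endpoint. A minimal sketch, assuming a plain REST call with the scoped PAT; the repo slug and environment variable name are placeholders.

```python
# Sketch of the create_pr tool via the GitHub REST API (slug and token name are placeholders).
import os

import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]   # PAT with collaborator access to the student repo only
REPO = "example-org/secure-prompt-vault"    # illustrative slug

def create_pr(branch: str, title: str, body: str) -> str:
    """Open a PR from a bot/ branch against main."""
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/pulls",
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": title, "head": branch, "base": "main", "body": body},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]
```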

What Actually Moved the Scores

| Change | Score Impact |
| --- | --- |
| Eval window 3KB → 20KB | 5/10 → 7/10 (biggest single fix) |
| Key-definition glossaries | Eliminated “undefined jargon” complaints |
| Module 3 Getting Started section | 6/10 → 7/10 |
| Expanded attack descriptions | Removed “incomplete” flags |

The eval window fix was 80% of the improvement. Everything else was polish.

The Architecture

  1. Telegram: /fix
  2. Bot parses latest test report (FAILs, WARNs, scores)
  3. Anthropic SDK tool_use loop (Claude Haiku, max 40 turns)
  4. Agent reads files → edits → creates branch → commits → pushes → opens PR
  5. Telegram: "PR #2 created: [link]"
  6. You: /approve 2
  7. Bot merges via GitHub API
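The middle of that flow is a standard Anthropic tool_use loop. A compressed sketch of how the 40-turn loop might be wired; dispatch_tool is a hypothetical dispatcher into the sandboxed handlers described earlier, and TOOLS refers to the tool definitions sketched above.

```python
# Sketch of the agent loop (model alias, prompt text, and dispatch_tool are assumptions).
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = (
    "You fix course content in the student repo only. Never run destructive "
    "commands. Branches must start with bot/."
)  # condensed version of the guardrails described above

def run_fix_agent(report: str, max_turns: int = 40) -> str:
    """Drive Claude Haiku through read -> edit -> branch -> commit -> push -> PR."""
    messages = [{"role": "user", "content": f"Fix the failures in this test report:\n\n{report}"}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # placeholder for whichever Haiku version is in use
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,                      # the tool definitions sketched earlier
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text   # agent reports the PR link when it finishes
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": dispatch_tool(block.name, block.input),  # hypothetical dispatcher into the sandboxed tools
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Agent hit the turn limit without finishing")
```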

Total monthly cost: roughly $9-12. The instance is $7; Haiku API calls for fix sessions add $2-5.

What We Learned

1. Test your tests first. Our evaluator was broken for weeks. The content was fine. The eval window was the bug.

2. Safety layers compound. Any single layer could be bypassed. Seven layers together mean the bot can’t accidentally (or intentionally) break anything outside its sandbox.

3. Human-in-the-loop is a feature, not a limitation. The /approve gate costs me 5 seconds per PR. It prevents every category of “autonomous agent destroys production” horror story.

4. Haiku is enough. We don’t need Opus or Sonnet for file fixes. Haiku reads reports, identifies problems, and writes targeted patches. At a fraction of the cost.

5. Single-file deployment wins. The entire bot—Telegram commands, agent loop, safety validators, PR management—is one Python file. Deploy via S3 copy + systemd restart. No containers, no orchestration.

What’s Next

The bot currently scores 7/10 across all modules. The remaining P2 tickets (practice exercises, audit spreadsheet templates, worked examples) need deeper content creation, not surgical fixes. That’s a different kind of work—and a good test of whether the agent can handle generative tasks, not just repairs.

We’ll find out tomorrow when the cron runs.


All data from real test runs on our EC2 instance. Scores are from Claude Haiku evaluations of course content. Infrastructure costs based on AWS t4g.micro on-demand pricing in us-east-1.