We Built a Bot That Fixes Its Own Course
Meta Description: We built an autonomous AI agent that tests course content, fixes failures, and creates PRs from Telegram. Scores went from 5/10 to 7/10 on $7/month infrastructure.
Our course had a problem. Every module scored 5/10 on our AI evaluator. The content was “truncated,” “incomplete,” “missing key definitions.”
Except it wasn’t.
The Real Bug Was the Test, Not the Content
We built a simulated student bot that enrolls in our Secure Prompt Vault course, reads every module, and has Claude Haiku evaluate each one on clarity, completeness, and actionability.
Every module failed the same way: “content cuts off mid-sentence.”
Root cause? One line of bash:
CONTENT=$(head -c 3000 "$MODULE_PATH/video-script.md")
Our eval window was 3KB. Our files are 13-21KB. The AI evaluator never saw most of the content. It read the intro and assumed the rest was missing.
Fix: `head -c 20000`. Scores jumped from 5/10 to 7/10 across the board.
The lesson: when your AI says “the content is incomplete,” maybe the content is fine and your AI can’t see it.
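For context, here's a minimal sketch of what that evaluation step boils down to. The real harness is a bash script; this Python version is only illustrative, and the model string, rubric wording, and function name are assumptions rather than our actual code.

```python
import anthropic
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_module(module_dir: str, window: int = 20_000) -> str:
    """Score one module on clarity, completeness, and actionability."""
    # The original bug was the equivalent of window=3_000: with 13-21KB files,
    # Haiku only ever saw the intro and flagged everything as truncated.
    raw = Path(module_dir, "video-script.md").read_bytes()[:window]
    content = raw.decode("utf-8", errors="ignore")
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # exact model string is an assumption
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Rate this module 1-10 on clarity, completeness, and "
                       "actionability, and note anything missing:\n\n" + content,
        }],
    )
    return response.content[0].text
```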
From Manual Fixes to Autonomous Agent
The old workflow was painful:
- Cron runs daily test at 10:00 UTC
- Bot sends Telegram report
- I read the report
- I open Claude Code
- I fix files manually
- I commit, subtree push to student repo
That’s a 30-minute loop. For every fix cycle.
The new workflow:
- Cron runs daily test
- Bot sends report
- I send `/fix` from Telegram
- Bot reads the report, identifies failures, fixes files, creates a PR
- I send `/approve 2` from Telegram
- Done
Same infrastructure. Same $7/month t4g.micro instance. We just gave the bot tools.
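For a sense of how little glue this takes, here's a rough sketch of the `/fix` command wiring, assuming python-telegram-bot; `latest_report` and `run_fix_agent` are hypothetical stand-ins for the bot's report parser and agent loop, not its real function names.

```python
from pathlib import Path

from telegram import Update
from telegram.ext import Application, CommandHandler, ContextTypes

def latest_report() -> str:
    """Hypothetical stand-in: load today's QA report (FAILs, WARNs, scores)."""
    return Path("reports/latest.md").read_text()

def run_fix_agent(report: str) -> str:
    """Hypothetical stand-in: the Claude Haiku tool_use loop (sketched later); returns a PR URL."""
    raise NotImplementedError

async def fix(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    await update.message.reply_text("Reading the latest test report and starting a fix session...")
    pr_url = run_fix_agent(latest_report())
    await update.message.reply_text(f"PR created: {pr_url}")

app = Application.builder().token("TELEGRAM_BOT_TOKEN").build()  # real token comes from the environment
app.add_handler(CommandHandler("fix", fix))
app.run_polling()
```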
8 Tools, 7 Safety Layers
The agent runs Claude Haiku with 8 sandboxed tools:
| Tool | What It Does |
|---|---|
| `read_file` | Read any file (path-validated) |
| `write_file` | Create or overwrite files |
| `edit_file` | Surgical find-and-replace |
| `list_files` | Glob pattern matching |
| `search_files` | Grep across the repo |
| `run_shell` | Allowlisted commands only |
| `git_ops` | Branch, commit, push, status |
| `create_pr` | GitHub API PR creation |
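Each of these is declared to the Anthropic messages API as a JSON schema the model can call. A sketch of two of the declarations; the descriptions and parameter names are illustrative, not our exact schemas:

```python
# Tool schemas passed to client.messages.create(tools=TOOLS, ...)
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file inside the student repo. The path is validated first.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Path relative to the repo root"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_shell",
        "description": "Run an allowlisted, read-only shell command (git, ls, grep, ...).",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    # ...plus write_file, edit_file, list_files, search_files, git_ops, create_pr
]
```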
Every tool is sandboxed. Here’s why that matters:
- Filesystem isolation — All paths must resolve within the student repo directory. No `../../` traversal.
- Command allowlist — Shell access is limited to `git`, `ls`, `find`, `grep`, `cat`, `head`, `tail`, `wc`, `diff`, and `tree`. Nothing destructive.
- Git remote lock — Validates that the origin URL contains `secure-prompt-vault` before any push.
- Account scope — The bot's GitHub PAT only has collaborator access to the student repo. It physically cannot touch our private repo.
- Branch naming — All branches must start with `bot/`. Enforced in code.
- System prompt guardrails — Claude is told: only modify the student repo, never run destructive commands.
- Human approval gate — The bot cannot merge its own PRs. I must explicitly `/approve` from Telegram.
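Here's a minimal sketch of how layers 1, 2, and 5 can be enforced; the constant values and function names are ours for illustration, not the bot's actual code.

```python
import shlex
from pathlib import Path

REPO_ROOT = Path("/opt/bot/secure-prompt-vault-student").resolve()  # illustrative location
ALLOWED = {"git", "ls", "find", "grep", "cat", "head", "tail", "wc", "diff", "tree"}

def validate_path(user_path: str) -> Path:
    """Layer 1: every path must resolve inside the student repo (no ../../ traversal)."""
    resolved = (REPO_ROOT / user_path).resolve()
    if not resolved.is_relative_to(REPO_ROOT):  # Python 3.9+
        raise PermissionError(f"path escapes the sandbox: {user_path}")
    return resolved

def validate_command(command: str) -> list[str]:
    """Layer 2: only allowlisted, non-destructive binaries may run."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"command not allowlisted: {command}")
    return argv

def validate_branch(name: str) -> str:
    """Layer 5: every bot branch must start with bot/."""
    if not name.startswith("bot/"):
        raise ValueError(f"branch must start with bot/: {name}")
    return name
```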
Defense in depth. The bot can fix content. It cannot break infrastructure.
The First PR
The bot’s first autonomous PR fixed three things:
- Created a missing `FAQ.md` that onboarding emails referenced
- Expanded incomplete sections in two module files
- Added definitions for jargon that confused the AI evaluator
It created a branch (`bot/fix-critical-truncation-qa`), committed, pushed, and opened the PR. All from one `/fix` command.
What Actually Moved the Scores
| Change | Score Impact |
|---|---|
| Eval window 3KB → 20KB | 5/10 → 7/10 (biggest single fix) |
| Key definitions glossaries | Eliminated “undefined jargon” complaints |
| Module 3 Getting Started section | 6/10 → 7/10 |
| Expanded attack descriptions | Removed “incomplete” flags |
The eval window fix was 80% of the improvement. Everything else was polish.
The Architecture
Telegram: /fix
↓
Bot parses latest test report (FAILs, WARNs, scores)
↓
Anthropic SDK tool_use loop (Claude Haiku, max 40 turns)
↓
Agent reads files → edits → creates branch → commits → pushes → opens PR
↓
Telegram: "PR #2 created: [link]"
↓
You: /approve 2
↓
Bot merges via GitHub API
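The middle step of that diagram is a standard Anthropic tool_use loop. A condensed sketch, reusing the `TOOLS` schemas and validators sketched earlier; `dispatch`, the system prompt wording, and the model string are assumptions, not the bot's real internals:

```python
import anthropic

client = anthropic.Anthropic()
MAX_TURNS = 40
SYSTEM_PROMPT = "Only modify files in the student repo. Never run destructive commands."

def dispatch(name: str, args: dict) -> str:
    """Hypothetical router: runs the named tool after the safety validators pass."""
    raise NotImplementedError

def run_fix_agent(report: str) -> str:
    """Drive Claude Haiku through up to 40 tool-use turns; returns its final message."""
    messages = [{"role": "user", "content": f"Fix the failures in this QA report:\n\n{report}"}]
    final_text = ""
    for _ in range(MAX_TURNS):
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # exact model string is an assumption
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,  # the 8 tool schemas shown earlier
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No more tool calls: the final message should describe the PR it opened.
            final_text = "".join(b.text for b in response.content if b.type == "text")
            break
        # Execute every requested tool call and feed the results back to the model.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": dispatch(b.name, b.input)}
            for b in response.content
            if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return final_text
```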
Total monthly cost: ~$10-14. The instance is $7. Haiku API calls for fix sessions add $2-5.
What We Learned
1. Test your tests first. Our evaluator was broken for weeks. The content was fine. The eval window was the bug.
2. Safety layers compound. Any single layer could be bypassed. Seven layers together mean the bot can’t accidentally (or intentionally) break anything outside its sandbox.
3. Human-in-the-loop is a feature, not a limitation. The `/approve` gate costs me 5 seconds per PR. It prevents every category of “autonomous agent destroys production” horror story.
4. Haiku is enough. We don’t need Opus or Sonnet for file fixes. Haiku reads reports, identifies problems, and writes targeted patches. At a fraction of the cost.
5. Single-file deployment wins. The entire bot—Telegram commands, agent loop, safety validators, PR management—is one Python file. Deploy via S3 copy + systemd restart. No containers, no orchestration.
What’s Next
The bot currently scores 7/10 across all modules. The remaining P2 tickets (practice exercises, audit spreadsheet templates, worked examples) need deeper content creation, not surgical fixes. That’s a different kind of work—and a good test of whether the agent can handle generative tasks, not just repairs.
We’ll find out tomorrow when the cron runs.
All data from real test runs on our EC2 instance. Scores are from Claude Haiku evaluations of course content. Infrastructure costs based on AWS t4g.micro on-demand pricing in us-east-1.