Foreword
So I’ve been running breadAI for a few months now. It pulls data from 30+ sources across my homelab, compresses 11 million characters of logs down to about 2,800, and sends me a daily Discord report. It’s great. Except for the part where it sometimes just… makes stuff up.
Not maliciously. LLMs do this thing where they compress a massive dataset and then confidently tell you “the backup server had 3 failures this week” when it actually had zero. Or they’ll see a brief latency spike at 3 AM and flag it as a “significant network degradation event” when it was just a single dropped ping. The AI reads the data, and sometimes it reads things into the data that aren’t there.
This bugged me. What’s the point of an automated audit if I have to manually verify the AI’s findings? That defeats the whole purpose. So I did what any reasonable person would do. I built a second AI to check the first one’s work.

How CLIde works
After the primary Claude audit produces its findings, breadAI extracts the specific claims: “this DB scan failed,” “this device had unusual traffic,” “backup for this VM is overdue,” whatever. It packages each finding with hints on how to verify it: specific SSH commands to run, API endpoints to hit, log files to check.
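A minimal sketch of what one of those packaged findings might look like. The field names and the sample hints (paths, endpoints) are illustrative assumptions, not breadAI's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """One claim extracted from the primary audit, plus hints for verifying it.

    Hypothetical structure; breadAI's real finding format may differ.
    """
    claim: str                   # the assertion the primary audit made
    source: str                  # which data feed the claim came from
    verify_hints: list = field(default_factory=list)  # commands/endpoints CLIde can try

findings = [
    Finding(
        claim="Proxmox node 2 backup failed",
        source="syslog",
        verify_hints=[
            "ssh pve2 'cat /var/log/vzdump/last.log'",        # hypothetical log path
            "GET https://pve2:8006/api2/json/nodes/pve2/tasks",  # Proxmox task API
        ],
    ),
]
```

The hints matter: they turn "go check this" into a concrete starting point, so the verifier isn't rediscovering the topology on every run.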
Then it hands the whole thing to CLIde.
CLIde is a second Claude instance running in its own LXC container with actual shell access to my network. It can SSH into devices, curl APIs, query databases, whatever it needs to independently verify each claim against the real, live state of the infrastructure.
Each finding comes back marked: CONFIRMED, CONTRADICTED, or UNVERIFIABLE.
If everything checks out, CLIde gives a short all-clear with a bit of personality. If something’s contradicted, you get the full details: what the first AI said, what CLIde actually found, and why they disagree.
Why this is interesting (to me, at least)
The key architectural thing that makes this work is that the two AIs are operating at completely different abstraction levels. The primary Claude is analyzing compressed summaries, already several layers removed from the raw data. CLIde goes back to the source. It’s not re-analyzing the same summary; it’s checking the actual device, the actual log, the actual API response.

That’s why it catches things a single-pass system would miss. The first AI might say “Proxmox node 2 backup failed” because the compressed syslog mentioned an error. CLIde SSHs into the backup server and checks the actual job status, and it turns out the error was a transient timeout and the backup retried successfully.

The stuff that went wrong building it
Because of course stuff went wrong.
SSH timeouts. My first approach had CLIde SSHing into everything, which sounds cool until you realize some of my devices take 8+ seconds to respond and Claude Code has timeout limits. Switched a bunch of the verification checks to HTTP API calls instead. Way faster, way more reliable.
Unicode nightmares. Some of my syslog data has weird characters in it. When CLIde piped output through bash commands, non-ASCII characters would silently break things. Had to add a scrubbing step to strip all that out before processing.
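A scrubbing step like that can be one line if you lean on Python's codec error handling, which drops anything outside ASCII instead of raising (a sketch of the idea, not breadAI's exact implementation):

```python
def scrub(text: str) -> str:
    """Strip non-ASCII characters before piping output through shell commands.

    encode(errors="ignore") silently drops any character that can't be
    represented in ASCII, so weird syslog bytes can't break downstream tools.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")
```

The trade-off is lossy: box-drawing characters, smart quotes, and emoji vanish. For log verification that's fine; for anything user-facing you'd want transliteration instead.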
Cost. The first version was costing about $0.15 per verification run. Not huge, but it adds up daily. Switched to Claude Sonnet instead of the bigger models and stripped out unnecessary context with a simple environment flag. Got it down to $0.03-0.04 per run. That’s like a dollar a month. Fine.
Does it actually catch things?
Yeah, pretty regularly. Maybe once or twice a week CLIde contradicts something the primary audit flagged. Usually it’s stuff like:
- A “failure” that was actually a successful retry
- A “degradation” that was a single outlier data point
- A stale alert about something that already resolved itself
Without CLIde, those would’ve landed in my Discord report and I would’ve either wasted time investigating nothing or (more likely) learned to ignore the reports entirely. Alert fatigue is real, and false positives are how it starts.
The bigger picture
I think this pattern of AI verifying AI is going to become standard for anything where you need to actually trust the output. Prompt engineering helps, but it’s not enough. Having a second agent with access to ground truth and a mandate to disagree is a much stronger guarantee than trying to make a single model more careful.
It’s also just satisfying to watch. Two AIs arguing about whether my backup server is actually broken is way more entertaining than it has any right to be.