VoxWatch: I made my security cameras yell at people

Foreword

I have security cameras. A bunch of them. They record footage, they send me alerts, and they dutifully capture every person who walks onto my property in beautiful 4K. You know what they don’t do? Anything about it.

Every security camera on the market is a passive witness. Someone walks up to your door at 3 AM, tries the gate latch, peers into windows, and the camera just… watches. Records. Maybe sends you a push notification that you won’t see until morning. By then they’re long gone and you’re scrubbing through footage trying to figure out what happened.

I wanted a camera that talks back.

So I built VoxWatch. It listens for person detections from Frigate NVR, analyzes what it sees with AI, and speaks through the camera’s built-in speaker in real-time. Not a generic beep. Not a pre-recorded “you are being monitored.” A specific, AI-generated description of the person standing there, delivered in the voice of a police dispatcher calling it in over the radio.

Classic me.

How it actually works

The whole thing is a three-stage escalation pipeline, and the timing is what makes it work psychologically.

Stage 1 (instant, 0-2 seconds): The moment Frigate detects a person, a pre-cached warning plays immediately. “Attention. You are on camera and being recorded.” This is generated at startup and sitting in memory, so there’s zero latency. The intruder hears something within seconds of stepping into frame.

Stage 2 (5-8 seconds later): While Stage 1 was playing, the AI was already working in the background. It grabbed three snapshots from the camera, sent them to Google Gemini Flash, and got back a description. Now the camera says something like: “Subject wearing dark hoodie and jeans, approximately six feet tall, approaching from the south side of the property.” That’s when it gets real for whoever’s standing there. The camera just described them. Specifically. Someone is watching.

Stage 3 (15-25 seconds later): If the person is still there (VoxWatch checks), it does a behavioral analysis. “Subject is testing the gate latch and looking toward the rear entrance.” At this point most people are gone. The ones who aren’t get an escalating response.

The key insight is that Stage 1 and Stage 2 run concurrently. The AI analysis takes 5-8 seconds, but the intruder doesn’t experience a gap because Stage 1 fills that time. By the time the generic warning finishes, the personalized description is ready. From the intruder’s perspective, it sounds like someone is actively watching them and reacting in real-time. Which is exactly the point.
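The overlap is easy to sketch with asyncio. This is a minimal illustration of the pattern, not VoxWatch's actual code; the function names and timings are stand-ins.

```python
import asyncio

# Minimal sketch of the Stage 1 / Stage 2 overlap. Function names and
# sleep durations are illustrative stand-ins, not VoxWatch's real code.

async def play_cached_warning():
    await asyncio.sleep(0.3)   # stands in for the pre-cached audio playing
    return "warning played"

async def describe_person():
    await asyncio.sleep(0.2)   # stands in for the 5-8 second vision-model call
    return "subject in dark hoodie, approaching from the south"

async def on_person_detected():
    # Start the AI call first, then play the warning over it. By the time
    # the generic warning finishes, the description is usually ready.
    stage2 = asyncio.create_task(describe_person())
    await play_cached_warning()
    return await stage2

print(asyncio.run(on_person_detected()))
```

The point of the sketch: the total elapsed time is roughly the longer of the two stages, not their sum, which is exactly what hides the AI latency from the person in frame.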

The police dispatch mode

This is the flagship feature and honestly the thing I’m most proud of.

Instead of a single voice reading a deterrent message, VoxWatch can simulate an entire police radio dispatch. A female dispatcher voice comes on with the characteristic beep: “All units… 10-97 at 123 Oak Street. White male, dark clothing, approximately five-ten, testing the side gate. Requesting unit for welfare check.” Then there’s a pause with radio static, and a male officer responds: “Copy that… Unit seven, en route, ETA four minutes.”

The whole thing sounds like someone called the cops and you’re hearing the dispatch happen in real time. It uses authentic 10-codes, bandpass filtering to simulate radio frequency response (300-3000 Hz, tighter than normal speech), random radio static overlays, squelch sounds between transmissions, and different TTS voices for the dispatcher and officer.

The radio effects are done entirely in ffmpeg. Bandpass filter, compand for automatic gain control simulation (so everything sounds equally loud, like a real radio), and noise injection at configurable levels. Three intensity presets: low (subtle), medium (convincing), and high (sounds like it’s coming through a Motorola on someone’s belt).
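A filter graph along these lines gets you most of the way there. The exact compand points, noise amplitudes, and preset values below are illustrative guesses, not VoxWatch's actual parameters:

```python
def radio_ffmpeg_cmd(src, dst, intensity="medium"):
    """Sketch of a radio-effect pass: bandpass + compand + pink-noise bed.
    All parameter values here are illustrative, not VoxWatch's exact ones."""
    noise = {"low": 0.01, "medium": 0.03, "high": 0.08}[intensity]
    graph = (
        "[0:a]highpass=f=300,lowpass=f=3000,"             # ~300-3000 Hz band
        "compand=attacks=0.05:decays=0.3:"
        "points=-80/-80|-30/-12|0/-6[voice];"             # AGC-style squash
        f"anoisesrc=color=pink:amplitude={noise}[hiss];"  # static bed
        "[voice][hiss]amix=inputs=2:duration=first[out]"
    )
    return ["ffmpeg", "-y", "-i", src, "-filter_complex", graph,
            "-map", "[out]", dst]

print(" ".join(radio_ffmpeg_cmd("dispatch.wav", "dispatch_radio.wav", "high")))
```

The `highpass`/`lowpass` pair does the bandpass, `compand` flattens the dynamics so quiet and loud speech come out at the same level, and `amix` with `duration=first` layers the noise under the voice without extending the clip.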

The dispatcher and officer use separate Kokoro TTS voices with independent speed settings. The officer’s response is randomly selected from a pool with random ETAs and random pause durations, so the same dispatch never plays twice. That matters because a scripted-sounding response loses its psychological impact fast.

16 response modes (yes, sixteen)

Police dispatch is the serious one. But I got carried away.

There are 6 professional modes: police dispatch, live operator (“I can see you moving to the left side of the building”), private security (corporate and firm), recorded evidence (cold and robotic), homeowner (personal and direct), and automated surveillance (neutral AI).

Two situational modes: guard dog (“and I should mention, the dogs haven’t been fed today”) and neighborhood watch (community pressure angle).

And then five novelty modes that exist because I thought they were funny: mafioso, Tony Montana, pirate captain, British butler, and disappointed parent. The disappointed parent one is genuinely unsettling. “I can see you from the window. I’m not angry, I’m just… really disappointed.”

Each mode has its own AI prompt templates for all three stages, so the descriptions come out in character. The pirate captain doesn’t say “subject wearing dark clothing.” It says “a scurvy-looking bilge rat in dark garb.”

You can also define completely custom modes via YAML. And there are per-camera overrides, so your front door can use police dispatch while your backyard uses homeowner.
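A custom mode plus per-camera overrides might look roughly like this in YAML. The key names here are guesses based on the features described, not the actual schema:

```yaml
# Hypothetical config fragment -- field names are illustrative,
# not VoxWatch's real schema.
modes:
  night_watchman:
    stage1_message: "Oi. I can see you. The whole yard is on camera."
    stage2_prompt: "Describe the person in a gruff night-watchman voice."
    stage3_prompt: "Describe what the person is doing, gruffly."
    tts_voice: "kokoro/am_adam"

cameras:
  front_door:
    mode: police_dispatch     # per-camera override
  backyard:
    mode: night_watchman
```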

The AI vision stack

VoxWatch supports 6 AI vision providers with automatic fallback:

Google Gemini Flash is the default. Excellent quality, 2-5 second latency, supports native video analysis (not just snapshots), and costs about a tenth of a cent per detection. It’s the only provider that can do Stage 3 video clip analysis natively.

OpenAI, Anthropic Claude, and xAI Grok are all supported as alternatives. They use snapshots since they don’t do video. Ollama with LLaVA runs locally on my RTX 3060 for zero-cost operation. And there’s a custom provider slot for any OpenAI-compatible API.

The fallback chain works like this: primary provider fails, try the fallback provider, and if that fails too, use a safe generic description. The pipeline never goes silent. I spent a lot of time on this because the worst possible outcome for a security system is silence when it should be speaking.
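The chain itself is simple; the discipline is in the terminal case. A sketch of the never-go-silent pattern, with hypothetical provider callables rather than VoxWatch's real interfaces:

```python
# Illustrative sketch of a fallback chain that cannot go silent.
# Provider objects and names are hypothetical.

GENERIC = "Person detected on the property. You are being recorded."

def describe(snapshots, providers):
    """Try each vision provider in order; end with a safe generic line."""
    for provider in providers:
        try:
            return provider(snapshots)
        except Exception:
            continue            # a real system would log the failure here
    return GENERIC              # terminal case: the pipeline still speaks

def flaky(snapshots):
    raise TimeoutError("provider unreachable")

print(describe([], [flaky, flaky]))
```

The important property is that the last rung is a constant, not another network call, so there is no input for which the function fails to return something speakable.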

One thing I’m particularly happy about: nightvision awareness. When the camera is in IR mode (black and white), the AI prompt automatically switches to avoid describing colors. No more “subject wearing a blue shirt” when the footage is grayscale. Instead it focuses on silhouette, build, height, and clothing type.

Seven TTS providers (because why not)

The TTS situation is similar. Seven providers with automatic fallback:

Kokoro (local ONNX model, near-human quality, free), Piper (bundled in the Docker image, solid quality, free), ElevenLabs (highest quality, cloud, paid), Cartesia Sonic (fastest cloud option at 0.5-1 second), Amazon Polly, OpenAI TTS, and espeak-ng as the always-available fallback.

espeak sounds robotic, but it’s always there. It’s the bottom of the fallback chain. If every other provider fails, you still get audio.

The fallback chain is: try primary, then try each provider in the configured fallback list, then espeak. Every cloud SDK is lazy-imported, so if you don’t use ElevenLabs, the ElevenLabs SDK never loads. This keeps the Docker image from bloating with dependencies most people won’t use.

The natural cadence system

Early on, the TTS output sounded flat. A single sentence read end-to-end by a robot. So I built a natural cadence speech system.

The AI returns descriptions as a JSON array of short phrases instead of one long sentence. Each phrase gets its own TTS generation with slight speed variation (plus or minus 8% via ffmpeg atempo). Between phrases, there are punctuation-aware silence gaps: longer pauses after periods, shorter after commas, longest after ellipses. The whole thing gets concatenated with ffmpeg’s concat demuxer, then run through a loudness normalization post-pass (compressor plus loudnorm at -16 LUFS).
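The pacing logic reduces to a small planning step before any audio is generated. A sketch, where the gap durations are illustrative and the ±8% jitter feeds ffmpeg's `atempo` filter as described above:

```python
import random

# Sketch of punctuation-aware pacing. Gap durations are illustrative;
# the +/-8% tempo jitter matches the atempo variation described above.

GAPS = {".": 0.55, ",": 0.25, "...": 0.85}   # seconds of silence after phrase

def plan_cadence(phrases, rng=None):
    """Return (phrase, atempo_factor, gap_seconds) triples for per-phrase TTS."""
    rng = rng or random.Random()
    plan = []
    for phrase in phrases:
        tempo = round(1.0 + rng.uniform(-0.08, 0.08), 3)  # feeds ffmpeg atempo
        if phrase.endswith("..."):
            gap = GAPS["..."]
        elif phrase.endswith(","):
            gap = GAPS[","]
        else:
            gap = GAPS["."]
        plan.append((phrase, tempo, gap))
    return plan

for step in plan_cadence(["Subject in dark hoodie,",
                          "moving toward the gate...",
                          "stand by."]):
    print(step)
```

Each triple then drives one TTS generation plus one silence segment, and the concat demuxer stitches the lot back together.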

The result sounds like a real person speaking in measured phrases rather than a TTS engine reading a paragraph. The difference is dramatic, especially for the dispatch mode where cadence matters a lot.

The dashboard

VoxWatch has a full React web dashboard for setup and management. A setup wizard walks you through connecting to Frigate, choosing your AI and TTS providers, picking a response mode, and testing audio push to your cameras. There’s a form-based config editor, a Monaco code editor for raw YAML editing (with real-time validation and diff view), camera management, and an audio test suite.

The dashboard is optional though. VoxWatch runs entirely from config.yaml. You can stop the dashboard container and the deterrent keeps working. I wanted it to be a nice-to-have for setup and tuning, not a dependency.

Some dashboard highlights:

The config editor has a Monaco editor (same engine as VS Code) with real-time YAML validation. Red squiggly underlines appear on the exact error line as you type. There’s a diff view that shows your changes versus the saved config. And Ctrl+S works, because of course it does.

There’s a browser-based audio preview system. Click a button and hear exactly what any persona sounds like with any voice, without pushing audio to a camera. The dispatch preview uses the exact same compose_dispatch_audio function as the live pipeline. No separate preview logic, no “it sounds different in production” surprises.

The camera setup wizard does ONVIF device identification (raw SOAP, no zeep dependency), fuzzy model matching against a compatibility database, and speaker detection. If your camera has no speaker, the wizard tells you before you waste time configuring it. If it has an RCA out jack for an external speaker, you get a checkbox confirmation flow.

Some numbers

  • ~42,500 lines of code across Python backend (22.4k), dashboard backend (8.3k), and React frontend (20.1k)
  • 6 AI vision providers with automatic fallback
  • 7 TTS providers with automatic fallback
  • 16 response modes (6 professional, 2 situational, 5 novelty, custom, plus per-camera overrides)
  • Config hot-reload within 10 seconds, no container restart needed
  • Tested on 6 camera models across Reolink and Dahua product lines
  • Per-camera audio codec override for mixed-vendor deployments (PCMU for Reolink, PCMA for Dahua)

What I found along the way

Building this surfaced a bunch of things that were fun to debug and not fun to experience.

The go2rtc audio push pattern. There’s no “send audio to camera” SDK. The way it works: generate a WAV file, start a temporary HTTP server on a random port, tell go2rtc “fetch this URL and push it to this camera’s backchannel.” go2rtc handles the RTSP negotiation. The HTTP server shuts down after the push. It’s janky and elegant at the same time. No persistent service, no port conflicts, no firewall rules.
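The ephemeral-server half of that pattern fits in a few lines of stdlib Python. This is a sketch of the idea, not VoxWatch's implementation, and the go2rtc request at the end is deliberately left as a comment because the exact endpoint and parameters depend on your go2rtc version:

```python
import functools
import http.server
import threading

# Sketch of the ephemeral-server push pattern. The go2rtc call is a
# placeholder -- consult your go2rtc version's API for the real request.

def serve_once(directory):
    """Serve `directory` over HTTP on an OS-assigned random port.
    Returns (port, shutdown_fn); call shutdown_fn after the push."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("0.0.0.0", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1], server.shutdown

port, shutdown = serve_once("/tmp")
# ...tell go2rtc to fetch http://<this-host>:<port>/warning.wav and push
# it to the camera's backchannel, then tear the server down:
shutdown()
```

Binding to port 0 lets the OS pick a free port, which is what makes "no port conflicts" fall out for free.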

The CodeMirror save bug. The original config editor had CodeMirror 6 and the save handler called JSON.parse() on YAML content. Every save silently failed. The content was always YAML, never JSON. I replaced the whole editor with Monaco and fixed the save pipeline to parse YAML properly.

The MQTT race condition. On first deploy, the MQTT on_connect callback fired before the topic subscription attribute was set. VoxWatch connected to the broker and then crashed trying to subscribe to None. Moved the attribute assignment before connect(). Classic async timing issue.

Piper TTS path resolution. The model was downloaded to /usr/share/piper-voices/ but Piper expected the full .onnx file path, not just the model name. Worked fine in development, broke in Docker because the working directory was different.

The concat demuxer Windows gotcha. ffmpeg’s concat demuxer requires forward slashes in file paths even on Windows. Backslashes silently produce empty output. One .replace("\\", "/") fixed it.
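The fix is a one-liner wherever the concat list file gets written. A minimal sketch (the function name is mine, not from the codebase):

```python
# Sketch of the fix: ffmpeg concat list entries must use forward slashes,
# even on Windows, or the demuxer silently produces empty output.

def concat_entry(path: str) -> str:
    """One line of an ffmpeg concat list file, with slashes normalized."""
    return "file '" + path.replace("\\", "/") + "'"

print(concat_entry("C:\\voxwatch\\segments\\phrase1.wav"))
```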

The AI-assisted development thing

Same deal as my other projects. This is AI-assisted development. Claude Code writes the code; I make the decisions.

The architectural calls are mine: three-stage pipeline with concurrent execution, police dispatch as a first-class persona with its own AI prompt shape (structured JSON instead of free-text), fallback chains that never go silent, nightvision-aware prompting. Twenty-five years of infrastructure and security experience is what tells me that firewall blocks from the internet are noise but a person testing a gate latch at 2 AM is not.

Claude Code wrote a 42,000-line codebase across Python, TypeScript, and React. I told it what to build and caught the stuff it got wrong. The radio dispatch audio pipeline was particularly collaborative, lots of back and forth on the ffmpeg filter chains, the compand parameters for realistic AGC simulation, the exact bandpass frequencies that make speech sound like it’s coming through a Motorola.

I also ran multiple code review passes with specialized agents: a 56-file frontend audit that caught a real closure-capture bug in the editor’s discard path, accessibility failures on keyboard-inaccessible toggles, and divergent cost maps between two status cards. Then a QA baseline system to prevent regressions, because every round of fixes was breaking something else.

Lessons learned

Latency hiding is a design pattern, not a hack. Running Stage 1 and Stage 2 concurrently isn’t cutting corners. It’s the only way to make an AI-powered real-time system feel responsive. The intruder doesn’t know (or care) that the AI needs 5-8 seconds. They just hear continuous, escalating response.

Fallback chains should be non-negotiable for anything user-facing. Every TTS provider, every AI provider, every audio generation step has a fallback. The pipeline has four fallback layers for the natural cadence system alone. A security system that goes silent because ElevenLabs had a blip is worse than no system at all.

Per-segment radio effects sound better than whole-file processing. Applying the radio effect to each dispatch segment individually before concatenation gives each segment its own noise floor variation. One pass over the whole file sounds flat and synthetic. Per-segment sounds like a real radio where conditions change slightly between transmissions.

Config hot-reload is worth the complexity. The alternative is restarting a Docker container every time you change a setting. That means lost MQTT connections, missed detections during restart, and users who stop tweaking because the feedback loop is too slow. The 10-second polling with section-level diff was worth every line.
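The section-level diff part is the piece worth sketching: hash each top-level config section, compare against the previous poll, and only reapply what changed. Function and key names here are illustrative, not VoxWatch's actual implementation:

```python
import hashlib

# Sketch of section-level config diffing for hot-reload. In the real
# pattern this runs inside a ~10-second polling loop; names are illustrative.

def section_hashes(config: dict) -> dict:
    """Hash each top-level section so changed ones can be identified."""
    return {k: hashlib.sha256(repr(v).encode()).hexdigest()
            for k, v in config.items()}

def diff_sections(old: dict, new: dict) -> set:
    old_h, new_h = section_hashes(old), section_hashes(new)
    return {k for k in old_h.keys() | new_h.keys()
            if old_h.get(k) != new_h.get(k)}

changed = diff_sections(
    {"tts": {"provider": "kokoro"}, "mqtt": {"host": "broker"}},
    {"tts": {"provider": "piper"},  "mqtt": {"host": "broker"}},
)
print(changed)
```

Because only the changed sections get reapplied, an edit to the TTS block doesn't tear down the MQTT connection, which is the whole point of not restarting the container.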

The code will be at https://github.com/badbread as soon as it goes public!
