How to Build a Local-First Video Intelligence Pipeline
Capture, analyze, and store video intelligence in the browser — resilient recording, in-browser keyframe extraction, and a split IndexedDB store built for field and safety work under time pressure.
Key Takeaways
- Local-first beats record-then-upload for field and safety work: no upload latency, full offline resilience, and raw footage stays on the device.
- Resilient browser recording needs a MIME fallback chain, short timeslice chunks, and a hard duration cap so a capture never fails silently on an unsupported device.
- Send a handful of downscaled keyframes to a vision model instead of raw video — it is far cheaper, faster, and good enough for OCR, detection, and risk scoring.
- Extract up to five frames at half-second intervals, cap them at 1024px, encode JPEG at quality 0.7, and enforce a five-second timeout so extraction never blocks the capture loop.
- Separate storage in IndexedDB: a lightweight feedItems table of JSON metadata for instant rendering, and a videos table of heavy blobs lazy-loaded via Intersection Observer.
- Render the structured JSON the model returns — detections, risk ratings, coordinates — onto a Leaflet map, and defer the heavy upload until connectivity returns.
Most video apps follow the same lazy pattern: record a clip, upload the whole thing to a server, wait, then show the user a result. That pattern collapses exactly where it matters most — in the field, on a bad network, under time pressure, with sensitive footage you would rather not stream anywhere. A local-first video intelligence pipeline inverts the order of operations: it captures, analyzes, and stores on the user's own device, and treats the network as an optional enhancement rather than a dependency.
This is the architecture behind Ombrixa, our local-first video intelligence progressive web app built for civic, field, and safety work. Below is the full pipeline — recording, keyframe extraction, vision analysis, split local storage, map rendering, and deferred sync — with the specific parameters that make it robust in the browser.
Why is local-first better than record-then-upload?
For time-pressure and safety use, local-first wins on three axes that a server-first design simply cannot match:
- No network latency. The analysis runs on a few small frames the moment recording stops, so the analyst sees results in seconds rather than waiting on a multi-megabyte video upload over a weak connection.
- Offline resilience. Field work happens in basements, dead zones, and on saturated networks. A local-first app keeps working when the network does not, because everything the user needs is already on the device.
- Privacy. Raw footage stays on the device by default. Only a minimal set of downscaled frames ever crosses the wire, and only when analysis runs — a meaningful property when you are capturing sensitive scenes.
How do you record video reliably in the browser?
Browser video capture is a minefield of codec support. No single MIME type works everywhere, so a naive MediaRecorder call that hard-codes one format will fail on a meaningful slice of devices: the constructor throws a NotSupportedError for a MIME type the browser cannot record, and an unhandled error looks like a silent failure to the user. The fix is a MIME fallback chain: try formats in order of preference and use the first the browser actually supports.
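// Preferred formats first; the browser uses the first one it can record.
const mimeOptions = [
  "video/mp4",
  "video/webm;codecs=vp9",
  "video/webm",
];
// An empty string means none of the listed formats is supported.
const mimeType =
  mimeOptions.find((type) => MediaRecorder.isTypeSupported(type)) ?? "";
// `stream` is the MediaStream being captured, for example from getUserMedia.
// With no match, omit the option and let the browser choose its default format.
const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);

Two more parameters make recording robust rather than fragile: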
- Chunk every 250 ms. Start the recorder with a 250 millisecond timeslice so data is flushed continuously into chunks instead of held in one buffer until you stop. Continuous flushing means a crash or an early stop still leaves you with usable footage.
- Hard 30-second cap. Enforce a maximum recording duration of 30 seconds. Field captures should be short and purposeful, and a hard cap protects both memory and the user from an accidental runaway recording.
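The two parameters above can be sketched as a small wrapper around an already constructed recorder. This is a minimal sketch, not Ombrixa's actual code: `recordWithCap` and its option names are hypothetical, and the injectable `schedule` and `cancel` timers are an assumption added only so the cap can be exercised without waiting 30 seconds.

```javascript
// Defaults taken from the parameters described above.
const TIMESLICE_MS = 250;
const MAX_DURATION_MS = 30000;

// Hypothetical helper: starts `recorder` (a MediaRecorder), collects its chunks,
// and stops it automatically once the duration cap is reached.
function recordWithCap(recorder, options = {}) {
  const {
    timesliceMs = TIMESLICE_MS,
    maxDurationMs = MAX_DURATION_MS,
    schedule = setTimeout, // injectable for tests (assumption)
    cancel = clearTimeout, // injectable for tests (assumption)
  } = options;

  const chunks = [];
  let capTimer = null;

  // Safe to call more than once: stop() is only issued while the recorder is active.
  const stop = () => {
    if (recorder.state !== "inactive") recorder.stop();
  };

  const done = new Promise((resolve) => {
    // With a timeslice, dataavailable fires roughly every `timesliceMs`.
    recorder.ondataavailable = (event) => {
      if (event.data && event.data.size > 0) chunks.push(event.data);
    };
    recorder.onstop = () => {
      cancel(capTimer);
      resolve(chunks);
    };
  });

  recorder.start(timesliceMs);
  capTimer = schedule(stop, maxDurationMs);

  return { chunks, done, stop };
}
```

In the app, `await session.done` would resolve with the collected chunks, which can then be combined with `new Blob(chunks, { type: recorder.mimeType })`. Because `chunks` fills as recording proceeds, an early stop still leaves whatever was flushed so far.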
How do you extract keyframes at high speed in the browser?
Here is the central insight of the whole pipeline: do not send the video to the model — send a few frames. Raw video is enormous and mostly redundant. A handful of well-chosen frames carries nearly everything a vision model needs to do OCR, object detection, and risk scoring, at a tiny fraction of the size, latency, and cost. So after recording, extract keyframes locally using a hidden canvas:
- Up to 5 frames at 0.5-second intervals. Seek the video element to evenly spaced timestamps and draw each into an offscreen canvas element. Five frames across a short clip is plenty of temporal coverage for a single scene.
- Downscale to a 1024px max edge. Vision models do not need full-resolution capture. Constraining the longest edge to 1024 pixels slashes payload size while preserving readable text and recognizable objects.
- Encode JPEG at quality 0.7. Export each canvas as JPEG at quality 0.7, a setting that keeps the bytes small while leaving text and objects legible enough for a vision model.
- Enforce a 5-second timeout budget. Frame seeking can stall on some codecs and devices. Wrap extraction in a five-second timeout so a stubborn clip degrades gracefully — return whatever frames you have rather than hanging the capture loop forever.
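The four rules above can be sketched as two pure helpers plus a browser-only extraction loop. This is a sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, and `extractKeyframes` assumes a video element whose metadata has already loaded.

```javascript
// Timestamps (seconds) to sample: up to `maxFrames`, `intervalSec` apart,
// never at or past the end of the clip.
function keyframeTimestamps(durationSec, maxFrames = 5, intervalSec = 0.5) {
  const times = [];
  for (let i = 0; i < maxFrames; i += 1) {
    const t = i * intervalSec;
    if (!(t < durationSec)) break; // also stops when the duration is NaN
    times.push(t);
  }
  return times;
}

// Fit the longest edge within `maxEdge` while keeping the aspect ratio; never upscale.
function scaledSize(width, height, maxEdge = 1024) {
  const scale = Math.min(1, maxEdge / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Resolves true once the seek completes, or false if it takes longer than `timeoutMs`.
function seekTo(video, timeSec, timeoutMs) {
  return new Promise((resolve) => {
    const onSeeked = () => {
      clearTimeout(timer);
      resolve(true);
    };
    const timer = setTimeout(() => {
      video.removeEventListener("seeked", onSeeked);
      resolve(false);
    }, timeoutMs);
    video.addEventListener("seeked", onSeeked, { once: true });
    video.currentTime = timeSec;
  });
}

// Browser-only: returns JPEG data URLs, keeping whatever was extracted
// if the overall time budget runs out.
async function extractKeyframes(video, timeoutMs = 5000) {
  const deadline = Date.now() + timeoutMs;
  const { width, height } = scaledSize(video.videoWidth, video.videoHeight);
  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  const context = canvas.getContext("2d");
  const frames = [];

  for (const t of keyframeTimestamps(video.duration)) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) break;
    const seeked = await seekTo(video, t, remaining);
    if (!seeked) break; // stalled seek: degrade gracefully
    context.drawImage(video, 0, 0, width, height);
    frames.push(canvas.toDataURL("image/jpeg", 0.7));
  }
  return frames;
}
```

Sharing one deadline across all seeks, rather than giving each seek its own timeout, is one way to keep the total cost bounded at five seconds regardless of how many frames are requested.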
Those frames go to a Gemini vision endpoint, which returns structured JSON: OCR detections, risk ratings, coordinates, and whatever else the schema asks for. Sending frames instead of blobs is what keeps analysis fast and cheap and keeps the original video on the device.
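Model output should be validated before it reaches storage or the UI. The sketch below assumes a hypothetical backend route, `/api/analyze`, that forwards frames to the vision model; the request body, the field names (`detections`, `riskRating`, `coordinates`), and the allowed rating values are all illustrative assumptions, not the Gemini API's own format.

```javascript
// Hypothetical request body for a backend proxy: base64 payloads without the data-URL prefix.
function buildAnalysisRequest(frameDataUrls) {
  return {
    frames: frameDataUrls.map((url) => ({
      mimeType: "image/jpeg",
      data: url.slice(url.indexOf(",") + 1),
    })),
  };
}

// Assumed rating vocabulary; adjust to whatever the real schema defines.
const KNOWN_RATINGS = ["low", "medium", "high"];

// Coerce whatever the model returned into a predictable shape.
function parseAnalysis(raw) {
  const detections = Array.isArray(raw?.detections)
    ? raw.detections.filter((d) => typeof d?.text === "string")
    : [];
  const riskRating = KNOWN_RATINGS.includes(raw?.riskRating)
    ? raw.riskRating
    : "unknown";
  const { lat, lng } = raw?.coordinates ?? {};
  const coordinates =
    Number.isFinite(lat) && Number.isFinite(lng) ? { lat, lng } : null;
  return { detections, riskRating, coordinates };
}

// Browser glue: POST the frames and return the validated result.
async function analyzeFrames(frameDataUrls, endpoint = "/api/analyze") {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildAnalysisRequest(frameDataUrls)),
  });
  if (!response.ok) throw new Error(`Analysis failed: ${response.status}`);
  return parseAnalysis(await response.json());
}
```

Normalizing to a fixed shape here means the feed and the map never have to guard against missing or malformed fields.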
How should you store video and metadata locally?
The temptation is to dump everything into one store. Resist it. The winning pattern is a clean separation of light metadata from heavy blobs in IndexedDB, which we manage with Dexie:
- A feedItems table holds lightweight JSON metadata per capture: the OCR detections, risk rating, coordinates, timestamp, and a reference to its video. This table is tiny, so the feed renders instantly even with hundreds of entries.
- A videos table holds the heavy binary blobs, keyed so each feedItems record can find its clip. You never load these eagerly.
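A minimal sketch of the two-table split, assuming the Dexie class is passed in (in the app it would come from `import Dexie from "dexie"`). The database name, the `createdAt` index, the `videoId` field, and the helper names are assumptions for illustration.

```javascript
// `Dexie` is the class exported by the dexie package; it is a parameter here
// only so this sketch carries no import of its own.
function openCaptureDb(Dexie) {
  const db = new Dexie("captures"); // database name is an assumption
  db.version(1).stores({
    feedItems: "id, createdAt", // primary key plus an index for ordering the feed
    videos: "id", // primary key only; the blob itself is not indexed
  });
  return db;
}

// Split one capture into a light metadata record and a heavy blob record.
function splitCapture(capture) {
  const { id, blob, ...metadata } = capture;
  return {
    feedItem: { id, videoId: id, ...metadata },
    video: { id, blob },
  };
}

// Write both records in one transaction so a feed item never points at a missing clip.
async function saveCapture(db, capture) {
  const { feedItem, video } = splitCapture(capture);
  await db.transaction("rw", db.feedItems, db.videos, async () => {
    await db.videos.put(video);
    await db.feedItems.put(feedItem);
  });
}
```

Rendering the feed then only needs a query against `feedItems`, which never touches a blob.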
Then lazy-load the blobs with an Intersection Observer: only when a feed card actually scrolls into the viewport do you pull its video blob out of the videos table and hydrate the player. This keeps memory flat and scrolling smooth no matter how much footage has accumulated — the feed is always reading from the tiny metadata table, and the multi-megabyte blobs load one at a time, on demand.
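The lazy-load step might look like the following. It is a sketch, not the app's actual code: `observeOnce` and `lazyLoadVideo` are hypothetical names, the 200px `rootMargin` is an arbitrary prefetch margin, and `db.videos` is assumed to be a Dexie table keyed by the video id.

```javascript
// Calls `onVisible` the first time `element` enters the viewport, then stops observing it.
// `ObserverCtor` defaults to the browser's IntersectionObserver and is injectable for tests.
function observeOnce(element, onVisible, ObserverCtor = IntersectionObserver) {
  const observer = new ObserverCtor(
    (entries) => {
      for (const entry of entries) {
        if (!entry.isIntersecting) continue;
        observer.unobserve(entry.target);
        onVisible(entry.target);
      }
    },
    { rootMargin: "200px" } // start loading slightly before the card is on screen
  );
  observer.observe(element);
  return observer;
}

// Browser-only glue: hydrate the player when its card scrolls into view.
function lazyLoadVideo(card, videoElement, db, videoId) {
  observeOnce(card, async () => {
    const record = await db.videos.get(videoId);
    if (!record) return;
    // Call URL.revokeObjectURL on this URL when the card is removed, to release the blob.
    videoElement.src = URL.createObjectURL(record.blob);
  });
}
```

Unobserving after the first intersection keeps each blob read to a single lookup per card.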
How do you turn the model output into a usable UI?
Structured JSON is only useful if it becomes something an analyst can act on. Because the vision endpoint returns coordinates alongside its detections and risk ratings, the natural home for the output is a map. Render each capture onto a Leaflet map as a marker at its coordinates, styled by risk rating, with the OCR detections and analysis available on tap. The feed and the map read from the same lightweight feedItems table, so the spatial view and the list view stay in sync for free, and the heavy video only loads when an analyst opens a specific capture. The result is situational awareness — what was seen, where, and how risky — without ever waiting on a video to download.
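As a rough illustration of the map step, the sketch below turns a feed item into a Leaflet circle marker. The color palette, the rating values, and the feed-item field names (`coordinates`, `riskRating`, `detections`) are assumptions; `L` is the Leaflet global and `map` an existing `L.map` instance, both passed in so the sketch carries no import.

```javascript
// Assumed palette; unknown ratings fall back to grey.
const RISK_COLORS = new Map([
  ["low", "#2e7d32"],
  ["medium", "#f9a825"],
  ["high", "#c62828"],
]);

function riskColor(rating) {
  return RISK_COLORS.get(rating) ?? "#616161";
}

// Pure step: derive everything the marker needs from the metadata record.
function toMarkerSpec(feedItem) {
  if (!feedItem.coordinates) return null; // nothing to place on the map
  const { lat, lng } = feedItem.coordinates;
  return {
    latLng: [lat, lng],
    color: riskColor(feedItem.riskRating),
    lines: [
      `Risk: ${feedItem.riskRating}`,
      ...(feedItem.detections ?? []).map((d) => d.text),
    ],
  };
}

// Browser-only glue: draw the marker and attach a popup shown on tap.
function addCaptureMarker(L, map, feedItem) {
  const spec = toMarkerSpec(feedItem);
  if (!spec) return null;
  const popup = document.createElement("div");
  // textContent keeps OCR text from being interpreted as HTML.
  popup.textContent = spec.lines.join(" | ");
  return L.circleMarker(spec.latLng, { color: spec.color, radius: 8 })
    .bindPopup(popup)
    .addTo(map);
}
```

Because `toMarkerSpec` reads only the metadata record, the map can be populated without touching the videos table.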
How do you handle sync when connectivity returns?
In a local-first design, upload is a background chore, not a gate. The analyst already has everything they need stored and rendered on the device, so the heavy video sync can be deferred until connectivity returns. Queue captures locally, sync them when the network is available — ideally in the background — and let the user keep capturing and reviewing the entire time. The network enhances the system; it never blocks it. That single inversion is what separates a tool that works in the field from one that only works at a desk.
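One way to sketch the deferred sync is a small queue that drains only while the device is online. `createSyncQueue`, `upload`, and `isOnline` are hypothetical names, and the queue lives in memory here for brevity; a real app would presumably persist it in IndexedDB alongside the captures.

```javascript
// `upload(captureId)` returns a promise and rejects on failure;
// `isOnline()` returns the current connectivity as a boolean.
function createSyncQueue({ upload, isOnline }) {
  const pending = [];
  let flushing = false;

  // Upload queued captures in order; stop quietly when offline or on the first failure.
  async function flush() {
    if (flushing) return; // a drain is already in progress
    flushing = true;
    try {
      while (pending.length > 0 && isOnline()) {
        try {
          await upload(pending[0]);
        } catch {
          break; // leave the item queued for the next attempt
        }
        pending.shift();
      }
    } finally {
      flushing = false;
    }
  }

  // Capturing never waits on the network: enqueue, then try to drain in the background.
  function enqueue(captureId) {
    pending.push(captureId);
    return flush();
  }

  return { enqueue, flush, pending: () => [...pending] };
}
```

In the browser, this could be wired up with `isOnline: () => navigator.onLine` and `window.addEventListener("online", () => queue.flush())`, so queued captures start uploading as soon as connectivity returns.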
What is the privacy and edge trade-off?
Local-first is not the same as fully on-device. In this pipeline, recording, extraction, and storage all happen on the device, but the analysis step still sends a few downscaled frames to a vision model — so those frames leave the device. The honest framing is that you are choosing exactly what crosses the boundary: raw footage stays local, only minimal keyframes are transmitted, and you decide whether even that analysis runs in the cloud or on-device based on sensitivity, accuracy needs, and cost. That is the same calculus we lay out in choosing between on-device and cloud AI. For workflows that need an agent to act on these detections rather than just display them, see building an AI agent for your business.
From architecture to a shipped field tool
A local-first video intelligence pipeline is a set of deliberate constraints working together: short capped recordings with a codec fallback, a few downscaled keyframes instead of raw video, a split IndexedDB store that keeps the UI instant, a map that makes the output actionable, and deferred sync that frees you from the network. Each choice trades a little theoretical flexibility for the thing that actually matters in the field — speed, resilience, and privacy under pressure.
This is the kind of edge-first system Game Changer Labs designs and ships across civic and safety technology — Ombrixa being one example of taking a hard, real-world capture problem from idea to a production progressive web app. If you are building intelligence tools that have to work where the network does not, we can help you ship them.
Frequently Asked Questions
What is a local-first video intelligence pipeline?
It is a system that captures video, extracts frames, runs analysis, and stores the results primarily on the user's own device rather than uploading raw footage to a server first. The device records the clip, pulls a few keyframes in the browser, sends only those frames to a vision model, and persists both the structured result and the original video locally. Uploads, if any, are deferred until connectivity is available. This makes the experience fast, offline-resilient, and private.
Why is local-first better than record-then-upload for field work?
Three reasons. There is no upload latency, so an analyst sees results in seconds instead of waiting on a large video transfer. It works offline, which matters in the field, in basements, and on congested networks where uploads stall. And the raw footage stays on the device by default, which is a meaningful privacy and safety property when you are capturing sensitive scenes under time pressure.
How do you record video reliably across browsers?
Use the MediaRecorder API with a MIME fallback chain, because no single codec is supported everywhere. Try video/mp4 first, fall back to video/webm with the vp9 codec, then plain video/webm, picking the first the browser supports. Record in short chunks using a timeslice — around 250 milliseconds — so data is flushed continuously, and enforce a hard duration cap, around 30 seconds, so a capture cannot run away.
Why send keyframes to a vision model instead of the whole video?
Because raw video is enormous and most of it is redundant. A few well-chosen frames carry nearly all the information a vision model needs for OCR, object detection, and risk scoring, at a tiny fraction of the size, latency, and cost. Extracting up to five downscaled JPEG frames and sending those keeps analysis fast and cheap, and it keeps the original video on the device.
How should you store video and metadata in the browser?
Split the storage into two tables in IndexedDB, which is practical to manage with a wrapper like Dexie. Keep a feedItems table of lightweight JSON metadata — detections, risk ratings, coordinates — so the feed renders instantly. Keep the heavy binary video blobs in a separate videos table, and lazy-load each blob only when its card scrolls into view using an Intersection Observer. Separating light metadata from heavy blobs is what keeps the UI responsive.
How do you handle uploads when the device is offline?
Defer them. Because everything the user needs is already stored and rendered locally, the upload is a background concern, not a blocker. Queue captures locally and sync them when connectivity returns, ideally in the background, so the analyst is never waiting on the network to do their job. Local-first means the network is an enhancement, not a dependency.
What is the privacy trade-off of edge video analysis?
Doing extraction and storage on-device keeps raw footage private and avoids streaming sensitive video to a server, but the analysis step still sends a few frames to a vision model, so those frames leave the device. The trade-off is choosing what crosses the boundary: raw video stays local, only minimal downscaled keyframes are sent, and you decide whether even that runs in the cloud or on-device based on sensitivity, cost, and accuracy needs.