I built SherpaEdit AI to tackle the manual drudgery of documentary and film editing, with the ultimate mission of preserving the human element of storytelling while eliminating the tedious work. The traditional editing process starts with “archaeology”: watching hundreds of hours of vérité, interviews, and b-roll to find the best moments and build a story arc. Then, an assistant editor or story producer usually assembles an audio-only story (radio cut) and adds supporting visuals (rough cut). My goal was to automate these first two steps, archaeology and rough cut assembly, and deliver the output directly to a Final Cut Pro or Premiere timeline.
One problem with using a frontier LLM for archaeology is that sending terabytes of video to a cloud-based language model is too expensive. So, I developed a local processing solution. analyze_clips.py runs every clip through local models: WhisperX for transcripts with word-level timestamps, pyannote for speaker diarization, and Moondream to describe the visuals every few seconds. The result is a single structured JSON manifest of all the video clips, which an LLM can then read to pick a-roll lines and matching b-roll shots without needing access to the actual video files.
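To make the manifest idea concrete, here is a minimal sketch of what one entry might look like. The field names (path, words, visuals, and so on) are my own assumptions for illustration, not the actual schema analyze_clips.py writes, and the sketch only models the data shape; it does not call WhisperX, pyannote, or Moondream.

```python
import json
from dataclasses import dataclass, field, asdict


# Hypothetical schema: every field name below is an assumption for illustration.
@dataclass
class Word:
    text: str
    start: float   # seconds from clip start (word-level timestamp)
    end: float
    speaker: str   # diarization label, e.g. "SPEAKER_00"


@dataclass
class VisualNote:
    time: float        # seconds from clip start where the frame was sampled
    description: str   # caption produced by the vision model


@dataclass
class ClipEntry:
    path: str
    duration: float
    words: list = field(default_factory=list)     # list of Word
    visuals: list = field(default_factory=list)   # list of VisualNote


def build_manifest(clips):
    """Serialize all clip entries into one JSON document an LLM can read."""
    return json.dumps({"clips": [asdict(c) for c in clips]}, indent=2)
```

Because the manifest is plain text, its size scales with the amount of speech and the caption sampling interval rather than with the size of the video files, which is what makes it cheap to hand to a cloud model.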
Next, I found that using a single general LLM prompt to plot a story arc yielded shallow and disappointing results. So, I created a multi-agent system instead, which works with any frontier model (Gemini, Claude, etc.). This system uses a story agent to propose a story arc, an a-roll agent to pick precise narration and dialog, a b-roll agent to select visuals while avoiding repetition, and a quality check agent to review pacing and catch repeated clips.
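The agent hand-off can be sketched roughly as follows. This is an illustration under assumptions, not the actual SherpaEdit AI code: I assume the agents run one after another, that each sees the manifest plus everything earlier agents produced, and that the model is wrapped in a plain callable taking a system prompt and a user message. The prompts and the names run_pipeline and AGENT_PROMPTS are hypothetical.

```python
from typing import Callable, Dict

# Any frontier model can sit behind this: (system_prompt, user_message) -> reply.
LLM = Callable[[str, str], str]

# Hypothetical role prompts, one per agent, in the order they run.
AGENT_PROMPTS = {
    "story": "Propose a story arc from the clip manifest.",
    "a_roll": "Pick precise narration and dialog lines for the arc.",
    "b_roll": "Select visuals for each line; avoid reusing a clip.",
    "qc": "Review pacing and flag repeated clips.",
}


def run_pipeline(manifest_json: str, llm: LLM) -> Dict[str, str]:
    """Run each agent in turn, appending its output to the shared context."""
    outputs: Dict[str, str] = {}
    context = manifest_json
    for name, prompt in AGENT_PROMPTS.items():
        outputs[name] = llm(prompt, context)
        # Later agents see the manifest plus all earlier agents' outputs.
        context = context + "\n\n[" + name + "]\n" + outputs[name]
    return outputs
```

Keeping the model behind a single callable is what lets the same pipeline run against Gemini, Claude, or another provider: only the wrapper changes, not the agents.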
