Claude Code Creates Launch Videos

I’ve been pushing coding agents like Claude Code beyond writing code. They already understand the code and the purpose of the app they are building, have read the docs, and have access to the product plans. That makes it straightforward to ask them to draft a script for a launch demo.

Beyond that, they have access to tools like Chrome DevTools to navigate the web app, take screenshots, associate talking points with those screenshots, record the audio, sync narration timestamps to image transitions, and collate everything into the final video.

My three-step pipeline runs entirely on Apple Silicon:

mlx-audio → Playwright → FFmpeg
(narration)  (capture)    (video)

The LLM first explores the app by sending MCP commands to the browser. Once it has figured out which pages the narrative needs, it writes a fairly simple Playwright script to capture them. Because that script is deterministic, it produces the same screenshots every time, and it can simply be re-run whenever the app is updated.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // Capture at 1280x720 so the screenshots match the video frame size used by FFmpeg.
  await page.setViewportSize({ width: 1280, height: 720 });

  // Scene 1: the app's landing page.
  await page.goto('http://localhost:3000');
  await page.screenshot({ path: 'scene1.png' });

  // Scene 2: open the settings panel and wait for it to render before capturing.
  await page.click('#settings-btn');
  await page.waitForSelector('.settings-panel');
  await page.screenshot({ path: 'scene2.png' });

  // ... more scenes

  await browser.close();
})();

The LLM must also create a single source-of-truth file for visuals, narration, and timing: images.txt, which drives both TTS generation and video generation. The format is similar to the input.txt list that FFmpeg's concat demuxer reads, with the narration strings added as text lines between the file and duration entries. The LLM sets the initial durations from an assumed speech rate, and the TTS generator updates them once it has produced the final audio. The file looks something like this:

# images.txt
file 'scene1.png'
text "Track your strength training with session-based progression."
duration 10

file 'scene2.png'
text "Quick Actions let you copy previous weights or skip a day."
duration 8
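
To make the format concrete, here is a minimal sketch of how a file like this could be read and how a first-pass duration could be guessed from the narration length. The names Scene, parse_images_txt, and estimate_duration are hypothetical, and the 150 words-per-minute rate is an assumption rather than a value from my pipeline.

```python
from dataclasses import dataclass

ASSUMED_WORDS_PER_MINUTE = 150  # assumption: a typical speaking rate, tune to taste


@dataclass
class Scene:
    file: str
    text: str = ""
    duration: float = 0.0


def _unquote(value: str) -> str:
    """Remove one pair of matching single or double quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


def parse_images_txt(content: str) -> list[Scene]:
    """Parse file/text/duration lines into scenes, skipping blanks and # comments."""
    scenes: list[Scene] = []
    for raw in content.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key == "file":
            scenes.append(Scene(file=_unquote(value)))
        elif key == "text" and scenes:
            scenes[-1].text = _unquote(value)
        elif key == "duration" and scenes:
            scenes[-1].duration = float(value)
    return scenes


def estimate_duration(text: str, wpm: float = ASSUMED_WORDS_PER_MINUTE,
                      wait: float = 2.0) -> float:
    """Rough seconds on screen: speaking time at `wpm` plus a transition wait."""
    words = len(text.split())
    return round(words / wpm * 60 + wait, 1)
```

The estimate only has to be close enough for a first draft, since the durations are overwritten once the real audio exists.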

The magic of Apple Silicon is that you can easily run TTS locally using mlx-audio. In my examples I use the Kokoro-82M model, which is ~160 MB and produces pretty smooth speech for its size. The narration script also lets us fiddle with the narration speed and the transition wait time; a speed of 0.8 and a 2s wait worked well for me.

uv run python generate_narration.py -i images.txt -o narration.wav --speed 0.8 --wait 2.0
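
The timing update that follows TTS can be sketched roughly as below. This is not the actual generate_narration.py; it assumes the narration is available as one PCM WAV clip per scene, and the helper names wav_seconds and retime are hypothetical. Each duration becomes the measured clip length plus the transition wait.

```python
import wave


def wav_seconds(path: str) -> float:
    """Length of a PCM WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def retime(images_txt: str, clip_seconds: list[float], wait: float = 2.0) -> str:
    """Return images.txt content with each duration set to clip length + wait."""
    out = []
    used = 0
    for line in images_txt.splitlines():
        if line.strip().startswith("duration "):
            if used >= len(clip_seconds):
                raise ValueError("more duration lines than narration clips")
            out.append(f"duration {clip_seconds[used] + wait:.2f}")
            used += 1
        else:
            out.append(line)
    if used != len(clip_seconds):
        raise ValueError("more narration clips than duration lines")
    return "\n".join(out) + "\n"
```

The speed setting needs no special handling here, because audio generated at 0.8 speed is already longer when measured.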

Finally, FFmpeg combines the images listed in input.txt with the narration audio to produce the video. I generate input.txt from images.txt by stripping the text lines that hold the narration.

ffmpeg \
    -f concat -safe 0 -i input.txt \
    -i narration.wav \
    -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2" \
    -c:v libx264 -pix_fmt yuv420p -r 30 \
    -c:a aac \
    -map 0:v -map 1:a \
    demo-final.mp4
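
Deriving input.txt from images.txt only means dropping the text lines, so a minimal sketch could look like this. The function name to_concat_list is hypothetical. The repeat_last option is an optional extra, off by default: the FFmpeg wiki suggests listing the last image a second time without a duration so that its duration is honored, and whether you need it depends on your FFmpeg version.

```python
def to_concat_list(images_txt: str, repeat_last: bool = False) -> str:
    """Drop narration (text ...) lines, keeping file, duration, comments, and blanks."""
    kept = []
    last_file = None
    for line in images_txt.splitlines():
        stripped = line.strip()
        if stripped.startswith("text "):
            continue  # narration is only needed by the TTS step
        if stripped.startswith("file "):
            last_file = stripped
        kept.append(line)
    if repeat_last and last_file:
        kept.append(last_file)  # optional workaround for the final duration
    return "\n".join(kept) + "\n"
```

To use it, read images.txt, pass its content to to_concat_list, and write the result to input.txt before running the FFmpeg command.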

Because everything is scripted, the video can be regenerated quickly with small variations, which I love. The only part that takes painstaking prompting is the LLM research needed to get the pitch and narrative right.