Text pipeline behavior (chunking, summaries, assets)

This page documents the current text pipeline implementation. The summarize and asset steps use stub logic; there are no external LLM calls yet.

Pipeline stages (current)

  1. Chunk transcript text into overlapping segments.

  2. Summarize each chunk and roll up an episode summary (stub summarizer).

  3. Generate draft candidate assets from the episode summary (stub generator).

The podcast summarize command currently supports only --dry-run; it creates episode.yaml and state.json at the workspace root. The podcast draft-candidates command reads the workspace produced by the summarize step, and podcast draft --dry-run runs the summarize and draft-candidates flow end to end.

Chunking behavior

Chunking is implemented in podcast_pipeline.transcript_chunker and writes a .txt file plus a metadata JSON file per chunk.

  • Tokenization: whitespace-delimited tokens matched by \S+.

  • Chunk IDs: start at 1 and increment per chunk.

  • Default config: ChunkerConfig(max_tokens=1200, overlap_tokens=200, boundary_lookback_tokens=200, min_tokens=None).

  • Effective minimum size: when min_tokens is None, effective_min_tokens is 60 percent of max_tokens.

  • Boundary selection:

    • Prefer paragraph boundaries (\n\n), then sentence boundaries (., !, or ? followed by whitespace), then line boundaries (\n).

    • Search backward up to boundary_lookback_tokens from the desired end.

    • If no boundary is found, cut at max_tokens.

  • Overlap: the next chunk starts at end_token - overlap_tokens, unless that would move the start backwards relative to the current chunk.
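The rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in podcast_pipeline.transcript_chunker: the helper names (pick_end, chunk_tokens, _boundary_kind) are hypothetical, the integer rounding of the 60 percent default is an assumption, and so is the choice to treat effective_min_tokens as a lower bound on where a boundary cut may fall.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

TOKEN_RE = re.compile(r"\S+")  # whitespace-delimited tokens


@dataclass(frozen=True)
class ChunkerConfig:
    max_tokens: int = 1200
    overlap_tokens: int = 200
    boundary_lookback_tokens: int = 200
    min_tokens: Optional[int] = None

    @property
    def effective_min_tokens(self) -> int:
        # Assumption: 60 percent of max_tokens, rounded down.
        if self.min_tokens is not None:
            return self.min_tokens
        return self.max_tokens * 60 // 100


def _boundary_kind(text: str, spans: List[Tuple[int, int]], end: int) -> Optional[str]:
    """Classify the gap between token end-1 and token end."""
    prev_end = spans[end - 1][1]
    gap = text[prev_end:spans[end][0]]
    if "\n\n" in gap:
        return "paragraph"
    if text[prev_end - 1] in ".!?":
        return "sentence"
    if "\n" in gap:
        return "line"
    return None


def pick_end(text: str, spans: List[Tuple[int, int]], start: int, cfg: ChunkerConfig) -> int:
    """Return the exclusive end token index for the chunk starting at start."""
    desired = min(start + cfg.max_tokens, len(spans))
    if desired == len(spans):
        return desired  # last chunk takes the remaining tokens
    lowest = max(
        start + cfg.effective_min_tokens,          # assumed lower bound
        desired - cfg.boundary_lookback_tokens,    # lookback window
        start + 1,
    )
    # Try each boundary kind in priority order, searching backward.
    for kind in ("paragraph", "sentence", "line"):
        for end in range(desired, lowest - 1, -1):
            if _boundary_kind(text, spans, end) == kind:
                return end
    return desired  # no boundary found: cut at max_tokens


def chunk_tokens(text: str, cfg: ChunkerConfig = ChunkerConfig()) -> List[Tuple[int, int, int]]:
    """Return (chunk_id, start_token, end_token) triples; chunk IDs start at 1."""
    spans = [m.span() for m in TOKEN_RE.finditer(text)]
    chunks: List[Tuple[int, int, int]] = []
    start = 0
    while start < len(spans):
        end = pick_end(text, spans, start, cfg)
        chunks.append((len(chunks) + 1, start, end))
        if end >= len(spans):
            break
        next_start = end - cfg.overlap_tokens
        if next_start <= start:  # assumed reading of "would go backwards"
            next_start = end
        start = next_start
    return chunks
```

With a small config such as ChunkerConfig(max_tokens=4, overlap_tokens=1, boundary_lookback_tokens=2, min_tokens=1), the first chunk of "a b c. d e f g h i j" ends after "c." because the sentence boundary falls inside the lookback window.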

Chunk outputs live under transcript/chunks/:

  • Text: chunk_0001.txt

  • Metadata: chunk_0001.json containing chunk_id and text_relpath (time fields are not populated yet)
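As a rough illustration of this layout, the following sketch writes one chunk's text and metadata files. The function name write_chunk is hypothetical, the four-digit zero padding is inferred from the chunk_0001 example, and storing text_relpath relative to the workspace root is an assumption.

```python
import json
from pathlib import Path


def write_chunk(workspace: Path, chunk_id: int, text: str) -> Path:
    """Write chunk_NNNN.txt and chunk_NNNN.json; return the metadata path."""
    chunks_dir = workspace / "transcript" / "chunks"
    chunks_dir.mkdir(parents=True, exist_ok=True)

    stem = f"chunk_{chunk_id:04d}"
    text_path = chunks_dir / f"{stem}.txt"
    text_path.write_text(text, encoding="utf-8")

    # Time fields are not populated yet, so only these two keys are written.
    meta = {
        "chunk_id": chunk_id,
        # Assumption: path is stored relative to the workspace root.
        "text_relpath": text_path.relative_to(workspace).as_posix(),
    }
    meta_path = chunks_dir / f"{stem}.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta_path
```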

Summaries (stub)

Summaries are generated by podcast_pipeline.summarization_stub and written under summaries/:

  • Chunk summaries: summaries/chunks/chunk_0001.summary.json

  • Episode summary: summaries/episode/episode_summary.json

  • Rendered outputs: episode_summary.md and episode_summary.html

Stub behaviors:

  • Chunk summary_markdown includes a heading (## Chunk N) plus the first non-empty transcript lines.

  • bullets are derived from early non-empty lines (max 5 by default).

  • entities are the first capitalized tokens in the chunk (case-sensitive).

  • Episode key_points and topics are unique rollups of chunk bullets and entities.
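A minimal sketch of these stub behaviors, for orientation only. It is not the code in podcast_pipeline.summarization_stub: the function names, the max_entities limit, the punctuation stripping, and the reuse of the bullet lines in summary_markdown are all assumptions.

```python
import re
from typing import Dict, Iterable, List


def summarize_chunk(chunk_id: int, text: str, max_bullets: int = 5,
                    max_entities: int = 5) -> Dict[str, object]:
    """Build a stub chunk summary from the chunk's leading content."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullets = lines[:max_bullets]  # early non-empty lines

    entities: List[str] = []
    for token in re.findall(r"\S+", text):
        word = token.strip(".,;:!?\"'()")  # assumption: trim punctuation
        if word[:1].isupper() and word not in entities:
            entities.append(word)
        if len(entities) == max_entities:  # hypothetical cap
            break

    summary_markdown = "\n".join([f"## Chunk {chunk_id}", "", *bullets])
    return {
        "chunk_id": chunk_id,
        "summary_markdown": summary_markdown,
        "bullets": bullets,
        "entities": entities,
    }


def _unique(items: Iterable[str]) -> List[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def rollup(chunk_summaries: List[Dict[str, object]]) -> Dict[str, List[str]]:
    """Roll chunk bullets and entities up into episode-level fields."""
    return {
        "key_points": _unique(b for s in chunk_summaries for b in s["bullets"]),
        "topics": _unique(e for s in chunk_summaries for e in s["entities"]),
    }
```

Because the entity check is case-sensitive and purely lexical, sentence-initial words such as "They" are picked up alongside real names, which is expected for a stub.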

Asset candidates (stub)

Asset candidates are generated by podcast_pipeline.asset_candidates_stub.

Inputs:

  • summaries/episode/episode_summary.json

  • Optional chapters, resolved in this order:

    1. --chapters CLI argument

    2. episode.yaml inputs.chapters

    3. transcript/chapters.txt
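The resolution order can be pictured as a simple fallback chain. In this sketch, resolve_chapters is a hypothetical helper, episode stands for the already-parsed contents of episode.yaml, and resolving a relative inputs.chapters value against the workspace root is an assumption.

```python
from pathlib import Path
from typing import Any, Mapping, Optional


def resolve_chapters(workspace: Path, cli_chapters: Optional[str],
                     episode: Mapping[str, Any]) -> Optional[Path]:
    """Return the chapters file to use, or None when no chapters are available."""
    # 1. An explicit --chapters argument wins.
    if cli_chapters is not None:
        return Path(cli_chapters)

    # 2. Next, inputs.chapters from episode.yaml.
    configured = (episode.get("inputs") or {}).get("chapters")
    if configured:
        return workspace / configured  # assumption: relative to the workspace

    # 3. Finally, the conventional location inside the workspace.
    default = workspace / "transcript" / "chapters.txt"
    return default if default.exists() else None
```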

Outputs:

  • One candidate set per asset kind under copy/candidates/<asset_id>/.

  • For each candidate: candidate_<uuid>.json plus .md and .html renderings.

Current asset ids (from AssetKind):

  • description

  • shownotes

  • summary_short

  • title_detail

  • title_seo

  • subtitle_auphonic

  • slug

  • cms_tags

  • audio_tags

  • itunes_keywords

  • mastodon

  • linkedin

  • youtube_description

The stub generator uses deterministic UUIDs based on asset id and variant, so reruns with the same config are stable.
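One common way to get this kind of stability is a name-based UUID. The sketch below assumes uuid.uuid5 with a made-up namespace and an "asset_id:variant" name format; the stub generator may derive its IDs differently, and candidate_id and candidate_relpath are hypothetical helpers.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CANDIDATE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "podcast-pipeline/asset-candidates")


def candidate_id(asset_id: str, variant: int) -> uuid.UUID:
    """Derive the same UUID every time for the same (asset_id, variant) pair."""
    return uuid.uuid5(CANDIDATE_NAMESPACE, f"{asset_id}:{variant}")


def candidate_relpath(asset_id: str, variant: int) -> str:
    """Workspace-relative path of the candidate JSON file."""
    return f"copy/candidates/{asset_id}/candidate_{candidate_id(asset_id, variant)}.json"
```

Because the ID depends only on its inputs, a rerun overwrites the same candidate files instead of accumulating new ones.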