Text pipeline behavior (chunking, summaries, assets)
This page documents the current text pipeline implementation. The summarize and asset steps use stub logic; there are no external LLM calls yet.
Pipeline stages (current)
Chunk transcript text into overlapping segments.
Summarize each chunk and roll up an episode summary (stub summarizer).
Generate draft candidate assets from the episode summary (stub generator).
The podcast summarize command only supports --dry-run. It creates episode.yaml and state.json at the workspace root. The podcast draft-candidates command reads the workspace produced by the summarize step, while podcast draft --dry-run runs the summarize + draft-candidates flow end-to-end.
Chunking behavior
Chunking is implemented in podcast_pipeline.transcript_chunker and writes a .txt file plus a metadata JSON file per chunk.
Tokenization: whitespace-delimited tokens matched by \S+.
Chunk IDs: start at 1 and increment per chunk.
Default config: ChunkerConfig(max_tokens=1200, overlap_tokens=200, boundary_lookback_tokens=200, min_tokens=None).
Effective minimum size: when min_tokens is None, effective_min_tokens is 60 percent of max_tokens.
Boundary selection:
  Prefer paragraph boundaries (\n\n), then sentence boundaries (., !, or ? followed by whitespace), then line boundaries (\n).
  Search backward up to boundary_lookback_tokens from the desired end.
  If no boundary is found, cut at max_tokens.
Overlap: next chunk starts at end_token - overlap_tokens (unless that would go backwards).
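The rules above can be sketched as a small function that returns token index ranges. This is a minimal illustration, not the code in podcast_pipeline.transcript_chunker: the dataclass body, the helper names (_boundary_rank, chunk_token_ranges), the integer rounding of the 60 percent rule, and the way effective_min_tokens limits the backward search are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

TOKEN_RE = re.compile(r"\S+")  # whitespace-delimited tokens, as described above


@dataclass(frozen=True)
class ChunkerConfig:
    # Default values are taken from this page; the class body is a stand-in.
    max_tokens: int = 1200
    overlap_tokens: int = 200
    boundary_lookback_tokens: int = 200
    min_tokens: Optional[int] = None

    @property
    def effective_min_tokens(self) -> int:
        if self.min_tokens is not None:
            return self.min_tokens
        # 60 percent of max_tokens; rounding down is an assumption.
        return self.max_tokens * 60 // 100


def _boundary_rank(text: str, spans: List[Tuple[int, int]], end: int) -> Optional[int]:
    """Rank a cut placed before token `end`: 0 paragraph, 1 sentence, 2 line."""
    prev_end = spans[end - 1][1]
    gap = text[prev_end:spans[end][0]]  # whitespace between the two tokens
    if "\n\n" in gap:
        return 0
    if text[prev_end - 1] in ".!?":
        return 1
    if "\n" in gap:
        return 2
    return None


def chunk_token_ranges(
    text: str, config: ChunkerConfig = ChunkerConfig()
) -> List[Tuple[int, int]]:
    """Return half-open (start_token, end_token) ranges for each chunk."""
    spans = [m.span() for m in TOKEN_RE.finditer(text)]
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < len(spans):
        end = min(start + config.max_tokens, len(spans))
        if end < len(spans):
            # Assumption: never cut so early that the chunk drops below the minimum.
            lowest = max(
                end - config.boundary_lookback_tokens,
                start + config.effective_min_tokens,
                start + 1,
            )
            best: Optional[Tuple[int, int]] = None  # (rank, candidate_end)
            for candidate in range(end, lowest - 1, -1):
                rank = _boundary_rank(text, spans, candidate)
                if rank is not None and (best is None or rank < best[0]):
                    best = (rank, candidate)
            if best is not None:
                end = best[1]  # otherwise keep the hard cut at max_tokens
        ranges.append((start, end))
        if end >= len(spans):
            break
        next_start = end - config.overlap_tokens
        # Guard against an overlap that would not move the window forward.
        start = next_start if next_start > start else end
    return ranges
```

Chunk IDs would then follow from enumerating the returned ranges starting at 1. In this sketch a paragraph boundary anywhere in the lookback window wins over a sentence boundary that sits closer to the desired end, which is one reading of the preference order above.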
Chunk outputs live under transcript/chunks/:
Text: chunk_0001.txt
Metadata: chunk_0001.json containing chunk_id and text_relpath (time fields are not populated yet)
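As a rough illustration of that layout, the sketch below writes one chunk's text and metadata files. The function name write_chunk is hypothetical, and two details are assumptions: that IDs are zero-padded to four digits (inferred from chunk_0001) and that text_relpath is relative to the workspace root.

```python
import json
from pathlib import Path


def write_chunk(workspace: Path, chunk_id: int, text: str) -> Path:
    """Write chunk_NNNN.txt and chunk_NNNN.json; return the metadata path."""
    chunks_dir = workspace / "transcript" / "chunks"
    chunks_dir.mkdir(parents=True, exist_ok=True)

    stem = f"chunk_{chunk_id:04d}"  # assumed four-digit zero padding
    text_path = chunks_dir / f"{stem}.txt"
    text_path.write_text(text, encoding="utf-8")

    metadata = {
        "chunk_id": chunk_id,
        # Assumption: the path is stored relative to the workspace root.
        "text_relpath": text_path.relative_to(workspace).as_posix(),
    }
    meta_path = chunks_dir / f"{stem}.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path
```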
Summaries (stub)
Summaries are generated by podcast_pipeline.summarization_stub and written under summaries/:
Chunk summaries: summaries/chunks/chunk_0001.summary.json
Episode summary: summaries/episode/episode_summary.json
Rendered outputs: episode_summary.md and episode_summary.html
Stub behaviors:
Chunk summary_markdown includes a heading (## Chunk N) plus the first non-empty transcript lines.
bullets are derived from early non-empty lines (max 5 by default).
entities are the first capitalized tokens in the chunk (case-sensitive).
Episode key_points and topics are unique rollups of chunk bullets and entities.
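A minimal sketch of these stub behaviors follows. It is not the code in podcast_pipeline.summarization_stub: the function names, the max_entities limit, the punctuation stripping, and the choice to reuse the bullet lines in summary_markdown are assumptions made to keep the example concrete.

```python
import re
from typing import Dict, List


def stub_chunk_summary(
    chunk_id: int, text: str, max_bullets: int = 5, max_entities: int = 5
) -> Dict[str, object]:
    """Build a chunk summary from the first non-empty lines and capitalized tokens."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    bullets = lines[:max_bullets]

    entities: List[str] = []
    for token in re.findall(r"\S+", text):
        if len(entities) >= max_entities:  # max_entities is a hypothetical limit
            break
        word = token.strip(".,;:!?\"'()")  # assumed punctuation handling
        if word[:1].isupper() and word not in entities:
            entities.append(word)

    summary_markdown = "\n".join([f"## Chunk {chunk_id}", "", *bullets])
    return {
        "summary_markdown": summary_markdown,
        "bullets": bullets,
        "entities": entities,
    }


def roll_up_episode(chunk_summaries: List[Dict[str, object]]) -> Dict[str, List[str]]:
    """Merge chunk bullets and entities, keeping first-seen order without duplicates."""
    key_points = dict.fromkeys(b for s in chunk_summaries for b in s["bullets"])
    topics = dict.fromkeys(e for s in chunk_summaries for e in s["entities"])
    return {"key_points": list(key_points), "topics": list(topics)}
```

Using dict.fromkeys for the rollup keeps the first occurrence of each item in order, which is one simple way to get a "unique rollup"; the real stub may order or deduplicate differently.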
Asset candidates (stub)
Asset candidates are generated by podcast_pipeline.asset_candidates_stub.
Inputs:
summaries/episode/episode_summary.json
Optional chapters, resolved in this order:
  --chapters CLI argument
  episode.yaml inputs.chapters
  transcript/chapters.txt
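The chapter resolution order can be sketched as a simple fallback chain. The function name resolve_chapters_path is hypothetical, the episode argument stands for an already-parsed episode.yaml, and returning None when transcript/chapters.txt is missing is an assumption about how "optional" is handled.

```python
from pathlib import Path
from typing import Any, Mapping, Optional


def resolve_chapters_path(
    workspace: Path,
    cli_chapters: Optional[str] = None,
    episode: Optional[Mapping[str, Any]] = None,
) -> Optional[Path]:
    """Pick the chapters source: CLI argument, then episode.yaml, then the default file."""
    # 1. An explicit --chapters value wins.
    if cli_chapters:
        return Path(cli_chapters)

    # 2. inputs.chapters from the parsed episode.yaml, if present.
    inputs = (episode or {}).get("inputs") or {}
    configured = inputs.get("chapters")
    if configured:
        return Path(configured)

    # 3. Fall back to transcript/chapters.txt when it exists (assumed check).
    fallback = workspace / "transcript" / "chapters.txt"
    return fallback if fallback.is_file() else None
```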
Outputs:
One candidate set per asset kind under copy/candidates/<asset_id>/.
For each candidate: candidate_<uuid>.json plus .md and .html renderings.
Current asset ids (from AssetKind):
description
shownotes
summary_short
title_detail
title_seo
subtitle_auphonic
slug
cms_tags
audio_tags
itunes_keywords
mastodon
linkedin
youtube_description
The stub generator uses deterministic UUIDs based on asset id and variant, so reruns with the same config are stable.
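One common way to get stable IDs like this is a name-based UUID. The sketch below uses uuid.uuid5 from the Python standard library; whether the stub generator actually uses uuid5, the namespace string, the integer variant, and the "asset_id:variant" name format are all assumptions for illustration.

```python
import uuid
from typing import List

# Hypothetical namespace; the real generator may use a different one or another scheme.
CANDIDATE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "podcast-pipeline/asset-candidates")


def candidate_id(asset_id: str, variant: int) -> uuid.UUID:
    """Derive the same UUID every time for a given asset id and variant."""
    return uuid.uuid5(CANDIDATE_NAMESPACE, f"{asset_id}:{variant}")


def candidate_paths(asset_id: str, variant: int) -> List[str]:
    """List the JSON, Markdown, and HTML paths for one candidate."""
    base = f"copy/candidates/{asset_id}/candidate_{candidate_id(asset_id, variant)}"
    return [f"{base}.json", f"{base}.md", f"{base}.html"]
```

Because the ID depends only on its inputs, rerunning the generator overwrites the same files instead of accumulating new ones.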