# Text pipeline behavior (chunking, summaries, assets)

This page documents the current text pipeline implementation. The summarize and asset steps use stub logic; there are no external LLM calls yet.

## Pipeline stages (current)

1. Chunk transcript text into overlapping segments.
2. Summarize each chunk and roll up an episode summary (stub summarizer).
3. Generate draft candidate assets from the episode summary (stub generator).

The `podcast summarize` command only supports `--dry-run`. It creates `episode.yaml` and `state.json` at the workspace root. The `podcast draft-candidates` command reads the workspace produced by the summarize step, while `podcast draft --dry-run` runs the summarize + draft-candidates flow end-to-end.

## Chunking behavior

Chunking is implemented in `podcast_pipeline.transcript_chunker` and writes a `.txt` file plus a metadata JSON file per chunk.

- Tokenization: whitespace-delimited tokens matched by `\S+`.
- Chunk IDs: start at 1 and increment per chunk.
- Default config: `ChunkerConfig(max_tokens=1200, overlap_tokens=200, boundary_lookback_tokens=200, min_tokens=None)`.
- Effective minimum size: when `min_tokens` is `None`, `effective_min_tokens` is 60 percent of `max_tokens` (720 tokens with the default `max_tokens=1200`).
- Boundary selection:
  - Prefer paragraph boundaries (`\n\n`), then sentence boundaries (`.`/`!`/`?` followed by whitespace), then line boundaries (`\n`).
  - Search backward up to `boundary_lookback_tokens` from the desired end.
  - If no boundary is found, cut at `max_tokens`.
- Overlap: the next chunk starts at `end_token - overlap_tokens`, unless that would move the start backward relative to the current chunk.
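
The rules above can be illustrated with a minimal sketch. This is not the code in `podcast_pipeline.transcript_chunker`: the function names (`chunk_spans`, `_boundary_rank`) are hypothetical, `min_tokens` handling is left out, and the fallback when the overlap would not advance the start (begin the next chunk at the previous end) is an assumption.

```python
import re

TOKEN_RE = re.compile(r"\S+")  # whitespace-delimited tokens, as described above


def _boundary_rank(text, spans, end):
    """Rank the boundary just before token `end`: 0 paragraph, 1 sentence, 2 line."""
    prev_end = spans[end - 1][1]
    gap = text[prev_end:spans[end][0]]  # whitespace between the two tokens
    if "\n\n" in gap:
        return 0
    if text[prev_end - 1] in ".!?":
        return 1
    if "\n" in gap:
        return 2
    return None


def chunk_spans(text, max_tokens=1200, overlap_tokens=200, lookback=200):
    """Return (start_token, end_token) pairs; end_token is exclusive."""
    spans = [m.span() for m in TOKEN_RE.finditer(text)]
    total = len(spans)
    chunks = []
    start = 0
    while start < total:
        desired = min(start + max_tokens, total)
        end = desired  # default: hard cut at max_tokens
        if desired < total:
            lowest = max(start + 1, desired - lookback)
            best = None
            # Walk backward from the desired end; keep the best-ranked boundary,
            # and among equal ranks the one closest to the desired end.
            for cand in range(desired, lowest - 1, -1):
                rank = _boundary_rank(text, spans, cand)
                if rank is not None and (best is None or rank < best[0]):
                    best = (rank, cand)
            if best is not None:
                end = best[1]
        chunks.append((start, end))
        if end >= total:
            break
        next_start = end - overlap_tokens
        # Assumption: if the overlap would not advance, continue from `end`.
        start = next_start if next_start > start else end
    return chunks
```

For example, `chunk_spans("a b c. d e f g h", max_tokens=5, overlap_tokens=1, lookback=3)` cuts the first chunk after `c.` (a sentence boundary inside the lookback window) and starts the second chunk one token earlier.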

Chunk outputs live under `transcript/chunks/`:

- Text: `chunk_0001.txt`
- Metadata: `chunk_0001.json` containing `chunk_id` and `text_relpath` (time fields are not populated yet)
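
As a rough illustration of this layout, the sketch below writes one chunk's text and metadata files. The helper name `write_chunk` is hypothetical, and two details are assumptions rather than documented behavior: the four-digit zero padding is inferred from the `chunk_0001` example, and `text_relpath` is taken to be relative to the workspace root.

```python
import json
from pathlib import Path


def write_chunk(workspace: Path, chunk_id: int, text: str) -> Path:
    """Write chunk text and metadata under transcript/chunks/; return the metadata path."""
    chunks_dir = workspace / "transcript" / "chunks"
    chunks_dir.mkdir(parents=True, exist_ok=True)

    stem = f"chunk_{chunk_id:04d}"  # assumed padding, matching chunk_0001
    text_path = chunks_dir / f"{stem}.txt"
    text_path.write_text(text, encoding="utf-8")

    metadata = {
        "chunk_id": chunk_id,
        # Assumption: the path is stored relative to the workspace root.
        "text_relpath": text_path.relative_to(workspace).as_posix(),
    }
    meta_path = chunks_dir / f"{stem}.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path
```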

## Summaries (stub)

Summaries are generated by `podcast_pipeline.summarization_stub` and written under `summaries/`:

- Chunk summaries: `summaries/chunks/chunk_0001.summary.json`
- Episode summary: `summaries/episode/episode_summary.json`
- Rendered outputs: `episode_summary.md` and `episode_summary.html`

Stub behaviors:

- Chunk `summary_markdown` consists of a heading (`## Chunk N`) followed by the first few non-empty lines of the chunk text.
- `bullets` are derived from early non-empty lines (max 5 by default).
- `entities` are the first capitalized tokens in the chunk (case-sensitive).
- Episode `key_points` and `topics` are unique rollups of chunk bullets and entities.
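
A minimal sketch of these stub behaviors follows. It is not the code in `podcast_pipeline.summarization_stub`: the function names, the `max_entities` cap, the punctuation stripping, and the order-preserving de-duplication are all assumptions made for illustration.

```python
import re

PUNCTUATION = ".,;:!?\"'()"


def stub_chunk_summary(chunk_id, text, max_bullets=5, max_entities=10):
    """Build a stub chunk summary from the chunk text alone (no LLM call)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    bullets = lines[:max_bullets]  # early non-empty lines

    entities = []
    for token in re.findall(r"\S+", text):
        word = token.strip(PUNCTUATION)
        # Case-sensitive check: only tokens starting with an uppercase letter.
        if word[:1].isupper() and word not in entities:
            entities.append(word)
            if len(entities) == max_entities:  # hypothetical cap
                break

    summary_markdown = "\n\n".join([f"## Chunk {chunk_id}", *bullets])
    return {
        "summary_markdown": summary_markdown,
        "bullets": bullets,
        "entities": entities,
    }


def rollup_episode(chunk_summaries):
    """Roll chunk bullets and entities up into unique key_points and topics."""
    key_points = list(dict.fromkeys(
        bullet for summary in chunk_summaries for bullet in summary["bullets"]
    ))
    topics = list(dict.fromkeys(
        entity for summary in chunk_summaries for entity in summary["entities"]
    ))
    return {"key_points": key_points, "topics": topics}
```

`dict.fromkeys` is used here because it removes duplicates while keeping first-seen order, which keeps the rollup deterministic across runs.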

## Asset candidates (stub)

Asset candidates are generated by `podcast_pipeline.asset_candidates_stub`.

Inputs:

- `summaries/episode/episode_summary.json`
- Optional chapters, resolved in this order:
  1. `--chapters` CLI argument
  2. `episode.yaml` `inputs.chapters`
  3. `transcript/chapters.txt`
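
The chapters lookup order can be sketched as below. The function `resolve_chapters` is hypothetical; it assumes `episode_inputs` is the already-parsed `inputs` mapping from `episode.yaml`, that configured paths are relative to the workspace root, and that the default file is used only if it exists.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_chapters(
    workspace: Path,
    cli_chapters: Optional[Path],
    episode_inputs: Mapping[str, str],
) -> Optional[Path]:
    """Return the chapters file to use, or None when no chapters are available."""
    # 1. An explicit --chapters argument wins.
    if cli_chapters is not None:
        return cli_chapters
    # 2. Then the inputs.chapters entry from episode.yaml.
    configured = episode_inputs.get("chapters")
    if configured:
        return workspace / configured
    # 3. Finally the conventional location inside the workspace.
    default = workspace / "transcript" / "chapters.txt"
    return default if default.exists() else None
```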

Outputs:

- One candidate set per asset kind under `copy/candidates/<asset_id>/`.
- For each candidate: `candidate_<uuid>.json` plus `.md` and `.html` renderings.

Current asset IDs (from `AssetKind`):

- `description`
- `shownotes`
- `summary_short`
- `title_detail`
- `title_seo`
- `subtitle_auphonic`
- `slug`
- `cms_tags`
- `audio_tags`
- `itunes_keywords`
- `mastodon`
- `linkedin`
- `youtube_description`

The stub generator derives deterministic UUIDs from the asset ID and variant, so reruns with the same config produce the same candidate IDs and file names.
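
One common way to get UUIDs that are stable across runs is a name-based UUID (`uuid.uuid5`). The sketch below assumes that approach; this page does not state which scheme the stub uses, and the namespace value, the `asset_id:variant` name format, and the function names are hypothetical.

```python
import uuid
from pathlib import PurePosixPath

# Hypothetical namespace; any fixed UUID works as long as it never changes.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/candidates")


def candidate_uuid(asset_id: str, variant: int) -> uuid.UUID:
    """Derive the same UUID every time for a given asset ID and variant."""
    return uuid.uuid5(SKETCH_NAMESPACE, f"{asset_id}:{variant}")


def candidate_paths(asset_id: str, variant: int) -> list:
    """Return the JSON, Markdown, and HTML paths for one candidate."""
    base_dir = PurePosixPath("copy/candidates") / asset_id
    stem = f"candidate_{candidate_uuid(asset_id, variant)}"
    return [base_dir / f"{stem}{ext}" for ext in (".json", ".md", ".html")]
```

Because the UUID depends only on its inputs, calling `candidate_paths("slug", 1)` twice yields identical paths, so a rerun overwrites earlier files instead of accumulating new ones.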