llm-structured-data-pipeline

/home/avalon/.hermes/skills/software-development/llm-structured-data-pipeline/SKILL.md · raw

LLM Structured Data Pipeline

Patterns for building multi-phase apps where each phase calls an LLM to extract/generate structured JSON, stores results in a database, and feeds into the next phase.

For universal mixed-media ingestion systems that combine routing, canonical evidence packages, notebook/wiki publishing, typed knowledge graphs, and multi-resolution 2D/3D visualization, see references/universal-multimodal-ingestion-and-knowledge-graph.md.

When to Use

App has sequential LLM analysis phases (e.g., generate → analyze → segment → scene-define → shot-breakdown)
Each phase produces structured JSON that gets stored in a relational DB
Later phases depend on earlier phase outputs (timestamps, IDs, entity names)
Long-running async operations where users may trigger re-runs mid-flight

Pattern 1: Robust JSON Extraction from LLM

LLMs frequently return slightly invalid JSON — markdown fences, truncated output, trailing commas. Always post-process.

function safeParseJSON(text) {
  // Step 1: Direct parse
  try { return JSON.parse(text) } catch {}

  // Step 2: Strip markdown fences
  let cleaned = text.trim()
  if (cleaned.startsWith('```')) {
    cleaned = cleaned.replace(/^```(?:json)?\n?/, '').replace(/\n?```$/, '')
    try { return JSON.parse(cleaned) } catch {}
  }

  // Step 3: Extract JSON from prose preamble (model wrote "Looking at the scene..." before JSON)
  const arrayStart = cleaned.indexOf('[')
  const objStart = cleaned.indexOf('{')
  const jsonStart = (arrayStart >= 0 && (objStart < 0 || arrayStart < objStart)) ? arrayStart : objStart
  if (jsonStart > 0) {
    cleaned = cleaned.substring(jsonStart)
    try { return JSON.parse(cleaned) } catch {}
  }

  // Step 4: Repair truncated JSON
  const lastBrace = cleaned.lastIndexOf('}')
  if (lastBrace > 0) {
    let fixed = cleaned.substring(0, lastBrace + 1)
    if (!fixed.trim().endsWith(']')) fixed += ']'
    try { return JSON.parse(fixed) } catch {}
  }

  throw new Error(`Invalid JSON from LLM: ${text.substring(0, 100)}...`)
}

Critical: Fallback LLM providers (OpenRouter) often return prose before JSON — "Looking at the scene, here are the shots:" followed by JSON. Step 3 handles this by finding the first [ or {. Also add retry logic: if JSON parse fails, call the LLM again (up to 2 attempts).

Key Pitfalls

max_tokens too low — OpenRouter and Anthropic both truncate silently. Structured JSON for scene breakdowns with detailed prompts can easily exceed 4096 tokens. Use 8192+ for any response containing multiple objects with long text fields.
OpenRouter inherits max_tokens — If your fallback function hardcodes max_tokens, it overrides what the caller requested. Always pass through: max_tokens: params.max_tokens || 8192
Variable names in JSON repair — Don't reuse the variable name text if the outer scope already has one. Use parsed, jsonText, cleaned etc. to avoid patch tool matching issues.

Pattern 2: Computed Fields from Cross-Phase Data

LLMs often omit fields they were asked for, especially numeric ones like timestamps. Always compute critical fields from authoritative data rather than trusting the LLM.

// LLM returns scene with start_segment_index/end_segment_index
// but may omit start_time_ms/end_time_ms
let startMs = scene.start_time_ms
let endMs = scene.end_time_ms

// Compute from segment indices (authoritative source) if LLM didn't include them
if ((startMs === undefined || startMs === null) && scene.start_segment_index !== undefined) {
  const startSeg = segments[scene.start_segment_index]
  const endSeg = segments[Math.min(scene.end_segment_index, segments.length - 1)]
  if (startSeg) startMs = startSeg.start_time_ms
  if (endSeg) endMs = endSeg.end_time_ms
}

Rule: Audio timestamps are the source of truth. Scene timestamps derive from segment timestamps. Shot timestamps derive from scene timestamps. Never trust LLM-generated timestamps when you have computed ones available.

Pattern 3: Race Condition Guards for Async Chains

When phase N is running async (LLM calls take 5-30s each), the user may trigger phase N-1 again, which deletes and recreates the data that phase N depends on. This causes foreign key constraint errors.

for (const scene of scenes) {
  const shots = await breakdownShots(scene, ...)  // Takes 10-30 seconds

  // Scene may have been deleted + recreated while we were waiting
  const stillExists = db.prepare('SELECT id FROM scenes WHERE id = ?').get(scene.id)
  if (!stillExists) {
    console.log(`Scene ${scene.scene_number} was re-defined during breakdown, skipping`)
    continue
  }

  // Safe to insert shots now
  for (const shot of shots) {
    insertShot.run(shotId, scene.id, ...)
  }
}

Pattern 4: Entity Name Matching Between Phases

When the LLM extracts entities (characters, locations, props) in phase 1, later phases must reference them by exact name. Build name→ID maps and use fuzzy matching as fallback:

// Build maps from DB records
const charMap = {}
characters.forEach(c => { charMap[c.name] = c.id })

// LLM returns character names — look up IDs
for (const name of scene.characters_present || []) {
  const id = charMap[name]
  if (id) insertSceneChar.run(sceneId, id)
  // else: LLM used a slightly different name — log and skip
}

Tip: In the LLM prompt, always list the exact available names: Available characters: ${charNames}. This dramatically reduces name mismatches.

Pattern 5: Cascading Deletes for Re-analysis

When a user re-runs an earlier phase, all downstream data must be cleared. Order matters with foreign keys:

// Clear in reverse dependency order
db.prepare('DELETE FROM scene_characters WHERE scene_id IN (SELECT id FROM scenes WHERE project_id = ?)').run(projectId)
db.prepare('DELETE FROM scene_props WHERE scene_id IN (SELECT id FROM scenes WHERE project_id = ?)').run(projectId)
db.prepare('DELETE FROM scene_segments WHERE scene_id IN (SELECT id FROM scenes WHERE project_id = ?)').run(projectId)
db.prepare('DELETE FROM shots WHERE project_id = ?').run(projectId)
db.prepare('DELETE FROM scenes WHERE project_id = ?').run(projectId)

Pattern 6: Extraction-First Conversational Intake

An adaptive onboarding prompt is not enough if orchestration still behaves like a fixed survey. Return bounded structured understanding on every turn—even when incomplete—and preserve it when a server guard forces one final follow-up.

Use generic optional interaction metadata such as inputMode: text | choices | done, bounded choices: [{label, value}], and multiple. Text is the default, and the normal composer remains available when choices appear. Never encode a particular survey step into the transport contract or make the client decide whether to show controls by matching assistant prose. Keep prompt limits, response-schema limits, service normalization, signed payloads, and canonical persistence bounds identical. Test with a long multi-detail answer that leaves exactly one required fact missing, a generic non-channel choice question, and a model response that incorrectly copies the whole answer into one field.

See references/adaptive-conversational-intake.md for the generic turn contract, failure chain, boundary invariants, TDD cases, and live-provider/browser verification sequence.

Checklist

[ ] Conversational extraction preserves partial structured understanding across forced follow-ups
[ ] Free text is the default; bounded generic choices are optional and never replace the composer
[ ] Single- and multi-select choice behavior round-trips through explicit interaction metadata
[ ] UI controls use explicit interaction state, with prose matching only as a compatibility fallback
[ ] Model-generated choice labels/values are bounded, deduplicated, sanitized, and not treated as authorization
[ ] Model, service, token, and canonical schema field bounds are identical
[ ] A verbose combined answer fills multiple fields and skips redundant questions
[ ] All LLM JSON responses go through safeParseJSON (strip fences + repair truncation)
[ ] max_tokens set to 8192+ for any structured response with multiple objects
[ ] Numeric fields (timestamps, durations) computed from authoritative data, not LLM output
[ ] Async operations check that parent records still exist before inserting children
[ ] Entity names listed explicitly in LLM prompts for cross-phase consistency
[ ] Cascading deletes clear junction tables before parent tables
[ ] Re-run operations are idempotent (clear old data before inserting new)