---
name: llm-wiki
description: "Karpathy's LLM Wiki: build/query interlinked markdown KB."
version: 2.1.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [wiki, knowledge-base, research, notes, markdown, rag-alternative]
    category: research
    related_skills: [obsidian, arxiv]
---

# Karpathy's LLM Wiki

Build and maintain a persistent, compounding knowledge base as interlinked markdown files.
Based on [Andrej Karpathy's LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).

Unlike traditional RAG (which rediscovers knowledge from scratch per query), the wiki
compiles knowledge once and keeps it current. Cross-references are already there.
Contradictions have already been flagged. Synthesis reflects everything ingested.

**Division of labor:** The human curates sources and directs analysis. The agent
summarizes, cross-references, files, and maintains consistency.

## When This Skill Activates

Use this skill when the user:
- Asks to create, build, or start a wiki or knowledge base
- Asks to ingest, add, or process a source into their wiki
- Asks a question and an existing wiki is present at the configured path
- Asks to lint, audit, or health-check their wiki
- References their wiki, knowledge base, or "notes" in a research context

## Wiki Location

**Location:** Set via `WIKI_PATH` environment variable (e.g. in `~/.hermes/.env`).

If unset, defaults to `~/wiki`.

For Alex's multi-wiki VPS apps, do not assume a single vault or that every wiki is an Obsidian vault. Discover what actually exists before answering or wiring a picker: check the configured `WIKI_PATH`, then known standalone/source wiki roots such as `~/astro-sources-wiki`, `~/wiki-human-design`, and project wikis like `~/wiki-salmon-business`. Only include `~/obsidian-vaults/*` if that root exists; those vaults may be absent or intentionally removed. See `references/wiki-catalog-topology.md` for the catalog distinction between LLM wikis, Obsidian-facing vaults, and tenant/source/project wikis.

```bash
WIKI="${WIKI_PATH:-$HOME/wiki}"
```

The wiki is just a directory of markdown files — open it in Obsidian, VS Code, or
any editor. No database, no special tooling required.
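
The discovery step described above can be sketched as follows. This is a minimal illustration, not part of the skill itself: `discover_wikis` and its `candidates` parameter are hypothetical names, and treating "contains `SCHEMA.md` or `index.md`" as the test for a wiki root is an assumption.

```python
import os
from pathlib import Path


def discover_wikis(candidates, env=None):
    """Return existing wiki roots, configured WIKI_PATH first, without duplicates."""
    env = os.environ if env is None else env
    # Same fallback as the shell snippet: an unset or empty WIKI_PATH means ~/wiki.
    configured = Path(env.get("WIKI_PATH") or Path.home() / "wiki")
    found = []
    for root in [configured, *map(Path, candidates)]:
        root = root.expanduser()
        # Assumption: a directory counts as a wiki if it holds SCHEMA.md or index.md.
        is_wiki = root.is_dir() and any(
            (root / name).is_file() for name in ("SCHEMA.md", "index.md")
        )
        if is_wiki and root.resolve() not in found:
            found.append(root.resolve())
    return found
```

Missing roots are skipped silently, so passing something like `Path.home().glob("obsidian-vaults/*")` as extra candidates is safe even when that directory has been removed.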

## Architecture: Three Layers

```
wiki/
├── SCHEMA.md           # Conventions, structure rules, domain config
├── index.md            # Sectioned content catalog with one-line summaries
├── log.md              # Chronological action log (append-only, rotated past 500 entries)
├── raw/                # Layer 1: Immutable source material
│   ├── articles/       # Web articles, clippings
│   ├── papers/         # PDFs, arxiv papers
│   ├── transcripts/    # Meeting notes, interviews
│   └── assets/         # Images, diagrams referenced by sources
├── entities/           # Layer 2: Entity pages (people, orgs, products, models)
├── concepts/           # Layer 2: Concept/topic pages
├── comparisons/        # Layer 2: Side-by-side analyses
└── queries/            # Layer 2: Filed query results worth keeping
```

- **Layer 1 (Raw Sources):** Immutable. The agent reads these files but never modifies them. For large assets, use a hybrid pattern: keep small raw markdown/JSON/manifests local, but store large PDFs, rendered page images, audio, and video in S3-compatible object storage with local `.s3stub` pointer files and a `manifest.json`.
- **Layer 2 (The Wiki):** Agent-owned markdown files, created, updated, and cross-referenced by the agent.
- **Layer 3 (The Schema):** `SCHEMA.md` defines structure, conventions, and tag taxonomy.

## Git-Tracked Obsidian Vault Pattern

A strong adaptation of Brad Bonanno's self-updating wiki pattern is to treat the wiki as:
- a normal markdown vault
- tracked in git
- optionally opened in Obsidian
- updated by scheduled Hermes jobs that act like "context farmers"

Recommended additions for a production wiki repo:

```
wiki/
├── context/
│   ├── watchlists.md          # sources, channels, feeds, competitors, query seeds
│   └── farmers/
│       └── .state/            # last-run markers per farmer/job
├── raw/
├── concepts/
├── entities/
└── ...
```

Use this pattern when the user wants the wiki to keep growing without manual ingestion every day.

### Why this works

- The **human chooses sources** once.
- Hermes handles the repetitive fetching, normalization, and filing.
- Git becomes the audit log for every wiki update.
- Obsidian provides graph/backlinks/frontmatter UX on top of plain markdown.

### Recommended git workflow

- Keep the wiki in a **private** repo.
- Pull before automated writes when multiple machines or jobs may touch it.
- Commit every ingestion batch with a descriptive message.
- If the user also opens the vault locally in Obsidian, sync with git or Obsidian Sync.

For Hermes, scheduled updates map naturally to `cronjob`, not Claude's cloud scheduler.

## Context Farming with Hermes

A "context farmer" is just a recurring Hermes workflow that:
1. reads a watchlist or source config
2. fetches only new material since the last run
3. writes raw source files into `raw/`
4. updates entity/concept pages
5. records the run in `log.md`
6. updates a last-run state file

### Good farmer sources

- YouTube channels or playlists
- blog/RSS feeds
- competitor websites
- meeting transcript exports
- research feeds and paper alerts
- internal documents the user regularly drops into a folder

### Hermes-native equivalents

- Claude scheduled agents → `cronjob`
- Claude subagents / farmers → Hermes `cronjob` + optional `delegate_task`
- MCP source connectors → Hermes tools (`web_extract`, `web_search`, `terminal`, provider CLIs, APIs)

### Minimal farmer state pattern

Store last-run timestamps under a repo-local state path such as:

```
context/farmers/.state/youtube-last-run.txt
context/farmers/.state/research-last-run.txt
```

These can be gitignored if they are purely operational.
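
A minimal sketch of this state pattern, assuming each state file holds a single ISO 8601 UTC timestamp (the file format is an assumption; the helper names are illustrative):

```python
from datetime import datetime, timezone
from pathlib import Path


def read_last_run(state_file):
    """Return the stored timestamp, or None on the first run."""
    path = Path(state_file)
    if not path.exists():
        return None
    return datetime.fromisoformat(path.read_text(encoding="utf-8").strip())


def write_last_run(state_file, when=None):
    """Persist the run time as an ISO 8601 timestamp (UTC by default)."""
    path = Path(state_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    when = when or datetime.now(timezone.utc)
    path.write_text(when.isoformat() + "\n", encoding="utf-8")
    return when


def is_new(published, last_run):
    """An item is new if there was no previous run or it was published after it."""
    return last_run is None or published > last_run
```

A farmer would call `read_last_run` at the start, filter fetched items with `is_new`, and call `write_last_run` only after the wiki updates succeed, so a failed run is retried from the same point.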

### Suggested watchlist file

`context/watchlists.md` should capture what matters, not implementation details. Example sections:
- tracked YouTube channels
- tracked blogs or RSS feeds
- tracked companies / competitors
- key topics / keywords
- exclusions / noise filters

### Scheduling pattern

For recurring wiki updates, create one cron job per source class or one orchestrator job that fans out. Examples:
- every morning: ingest tracked YouTube channels
- every 2 hours: ingest monitored RSS/blog feeds
- every evening: synthesize new raw files into entity/concept pages

Use separate jobs when source classes have different failure modes or cadences.

## Resuming an Existing Wiki (CRITICAL — do this every session)

When the user has an existing wiki, **always orient yourself before doing anything**, and use token-bounded orientation for large wikis without sacrificing correctness:

① **Read `SCHEMA.md`:** understand the domain, conventions, and tag taxonomy.

② **Inspect `index.md`:** learn what pages exist and their summaries. For small indexes, reading the whole file is fine. For large or broad indexes, read enough structure to orient, then use targeted sections/searches for the current task rather than dumping every entry into the active context by default.

③ **Scan recent `log.md`:** read the recent activity window relevant to the task. For large logs, compute the line count first and read the last 20-50 entries or a targeted date/topic range; do not treat full-log reads as the default orientation step.

④ **For large raw corpora**, use deterministic summaries/scripts/searches to identify the relevant package/source first. Load full raw artifacts only when the current bounded task requires direct evidence from that specific file. Token reduction must never override source fidelity, provenance, QA, or correct wiki orientation.

```bash
WIKI="${WIKI_PATH:-$HOME/wiki}"
# Orientation reads at session start
read_file "$WIKI/SCHEMA.md"
# Then use index/log reads sized to the current task: full file for small wikis, targeted sections/ranges for large ones.
```

Only after orientation should you ingest, query, or lint. This prevents:
- Creating duplicate pages for entities that already exist
- Missing cross-references to existing content
- Contradicting the schema's conventions
- Repeating work already logged

For large wikis (100+ pages), also run a quick `search_files` for the topic
at hand before creating anything new.
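
For the bounded log scan in step ③, one possible approach is to split `log.md` on its `## [YYYY-MM-DD]` headings and keep only the tail. The sketch below assumes entries follow the log format defined later in this document; `recent_log_entries` is an illustrative helper, not a Hermes tool.

```python
import re

# An entry starts at a line like "## [2026-05-01] ingest | Source Title".
ENTRY = re.compile(r"^## \[\d{4}-\d{2}-\d{2}\] ", re.MULTILINE)


def recent_log_entries(log_text, limit=20):
    """Return the last `limit` entries, each starting at its '## [date]' heading."""
    if limit <= 0:
        return []
    starts = [m.start() for m in ENTRY.finditer(log_text)]
    bounds = starts + [len(log_text)]
    entries = [log_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return entries[-limit:]
```

The quoted format line in the log header starts with `>`, so it is not mistaken for an entry.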

## Initializing a New Wiki

When the user asks to create or start a wiki:

1. Determine the wiki path (from `$WIKI_PATH` env var, or ask the user; default `~/wiki`)
2. Create the directory structure above
3. Ask the user what domain the wiki covers — be specific
4. Write `SCHEMA.md` customized to the domain (see template below)
5. Write initial `index.md` with sectioned header
6. Write initial `log.md` with creation entry
7. Confirm the wiki is ready and suggest first sources to ingest
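
Steps 2, 5, and 6 could be scripted roughly as shown here. This is a sketch under assumptions: `init_wiki` is a hypothetical helper, the starter files are condensed versions of the templates in this document, and `SCHEMA.md` is deliberately left to step 4 because it needs the user's domain input.

```python
from datetime import date
from pathlib import Path

LAYOUT = [
    "raw/articles", "raw/papers", "raw/transcripts", "raw/assets",
    "entities", "concepts", "comparisons", "queries",
]


def init_wiki(root, domain, today=None):
    """Create the directory tree plus starter index.md and log.md; never overwrite."""
    root = Path(root).expanduser()
    today = today or date.today().isoformat()
    for sub in LAYOUT:
        (root / sub).mkdir(parents=True, exist_ok=True)
    starters = {
        "index.md": (
            "# Wiki Index\n\n"
            f"> Last updated: {today} | Total pages: 0\n\n"
            "## Entities\n\n## Concepts\n\n## Comparisons\n\n## Queries\n"
        ),
        "log.md": (
            "# Wiki Log\n\n"
            f"## [{today}] create | Wiki initialized\n"
            f"- Domain: {domain}\n"
            "- Structure created with SCHEMA.md, index.md, log.md\n"
        ),
    }
    created = []
    for name, text in starters.items():
        path = root / name
        if not path.exists():  # keep existing files intact on a re-run
            path.write_text(text, encoding="utf-8")
            created.append(name)
    return created
```

Because existing files are never overwritten, running it against an already initialized wiki returns an empty list and changes nothing.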

### SCHEMA.md Template

Adapt to the user's domain. The schema constrains agent behavior and ensures consistency:

```markdown
# Wiki Schema

## Domain
[What this wiki covers — e.g., "AI/ML research", "personal health", "startup intelligence"]

## Conventions
- File names: lowercase, hyphens, no spaces (e.g., `transformer-architecture.md`)
- Every wiki page starts with YAML frontmatter (see below)
- Use `[[wikilinks]]` to link between pages (minimum 2 outbound links per page)
- When updating a page, always bump the `updated` date
- Every new page must be added to `index.md` under the correct section
- Every action must be appended to `log.md`
- **Provenance markers:** On pages that synthesize 3+ sources, append `^[raw/articles/source-file.md]`
  at the end of paragraphs whose claims come from a specific source. This lets a reader trace each
  claim back without re-reading the whole raw file. Optional on single-source pages where the
  `sources:` frontmatter is enough.

## Frontmatter
  ~~~yaml
  ---
  title: Page Title
  created: YYYY-MM-DD
  updated: YYYY-MM-DD
  type: entity | concept | comparison | query | summary
  tags: [from taxonomy below]
  sources: [raw/articles/source-name.md]
  # Optional quality signals:
  confidence: high | medium | low        # how well-supported the claims are
  contested: true                        # set when the page has unresolved contradictions
  contradictions: [other-page-slug]      # pages this one conflicts with
  ---
  ~~~

`confidence` and `contested` are optional but recommended for opinion-heavy or fast-moving
topics. Lint surfaces `contested: true` and `confidence: low` pages for review so weak claims
don't silently harden into accepted wiki fact.

### raw/ Frontmatter

Raw sources ALSO get a small frontmatter block so re-ingests can detect drift:

~~~yaml
---
source_url: https://example.com/article   # original URL, if applicable
ingested: YYYY-MM-DD
sha256: <hex digest of the raw content below the frontmatter>
---
~~~

The `sha256:` lets a future re-ingest of the same URL skip processing when content is unchanged,
and flag drift when it has changed. Compute over the body only (everything after the closing
`---`), not the frontmatter itself.

## Tag Taxonomy
[Define 10-20 top-level tags for the domain. Add new tags here BEFORE using them.]

Example for AI/ML:
- Models: model, architecture, benchmark, training
- People/Orgs: person, company, lab, open-source
- Techniques: optimization, fine-tuning, inference, alignment, data
- Meta: comparison, timeline, controversy, prediction

Rule: every tag on a page must appear in this taxonomy. If a new tag is needed,
add it here first, then use it. This prevents tag sprawl.

## Page Thresholds
- **Create a page** when an entity/concept appears in 2+ sources OR is central to one source
- **Add to existing page** when a source mentions something already covered
- **DON'T create a page** for passing mentions, minor details, or things outside the domain
- **Split a page** when it exceeds ~200 lines — break into sub-topics with cross-links
- **Archive a page** when its content is fully superseded — move to `_archive/`, remove from index

## Entity Pages
One page per notable entity. Include:
- Overview / what it is
- Key facts and dates
- Relationships to other entities ([[wikilinks]])
- Source references

## Concept Pages
One page per concept or topic. Include:
- Definition / explanation
- Current state of knowledge
- Open questions or debates
- Related concepts ([[wikilinks]])

## Comparison Pages
Side-by-side analyses. Include:
- What is being compared and why
- Dimensions of comparison (table format preferred)
- Verdict or synthesis
- Sources

## Update Policy
When new information conflicts with existing content:
1. Check the dates — newer sources generally supersede older ones
2. If genuinely contradictory, note both positions with dates and sources
3. Mark the contradiction in frontmatter: `contradictions: [page-name]`
4. Flag for user review in the lint report
```

### index.md Template

The index is sectioned by type. Each entry is one line: wikilink + summary.

```markdown
# Wiki Index

> Content catalog. Every wiki page listed under its type with a one-line summary.
> Read this first to find relevant pages for any query.
> Last updated: YYYY-MM-DD | Total pages: N

## Entities
<!-- Alphabetical within section -->

## Concepts

## Comparisons

## Queries
```

**Scaling rule:** When any section exceeds 50 entries, split it into sub-sections
by first letter or sub-domain. When the index exceeds 200 entries total, create
a `_meta/topic-map.md` that groups pages by theme for faster navigation.
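
The scaling rule lends itself to a small check. The sketch below assumes every index entry is a single line containing a `[[wikilink]]`, as in the template above; the function names are illustrative.

```python
import re


def index_section_counts(index_text):
    """Count '[[wikilink]]' entry lines under each '## Section' heading."""
    counts, section = {}, None
    for line in index_text.splitlines():
        heading = re.match(r"^## (.+)$", line)
        if heading:
            section = heading.group(1).strip()
            counts[section] = 0
        elif section and "[[" in line and not line.lstrip().startswith("<!--"):
            counts[section] += 1
    return counts


def scaling_actions(counts, section_limit=50, total_limit=200):
    """Apply the thresholds from the scaling rule and list suggested actions."""
    actions = [
        f"split section '{name}' ({n} entries)"
        for name, n in counts.items()
        if n > section_limit
    ]
    total = sum(counts.values())
    if total > total_limit:
        actions.append(f"create _meta/topic-map.md ({total} entries)")
    return actions
```

Entries under `###` sub-sections still count toward their parent `##` section, so a section that has already been split keeps being reported; treat the output as a prompt for review rather than a hard failure.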

### log.md Template

```markdown
# Wiki Log

> Chronological record of all wiki actions. Append-only.
> Format: `## [YYYY-MM-DD] action | subject`
> Actions: ingest, update, query, lint, create, archive, delete
> When this file exceeds 500 entries, rotate: rename to log-YYYY.md, start fresh.

## [YYYY-MM-DD] create | Wiki initialized
- Domain: [domain]
- Structure created with SCHEMA.md, index.md, log.md
```
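
Appending and rotating can be combined in one helper. This is a sketch, not the skill's implementation: `append_log` is a hypothetical name, `YYYY` is assumed to mean the current year, and an existing archive is appended to rather than overwritten so that two rotations in the same year lose nothing.

```python
import re
from datetime import date
from pathlib import Path

ACTIONS = {"ingest", "update", "query", "lint", "create", "archive", "delete"}


def append_log(wiki, action, subject, files=(), today=None, max_entries=500):
    """Append one entry to log.md, rotating to log-YYYY.md once the limit is exceeded."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    today = today or date.today().isoformat()
    log = Path(wiki) / "log.md"
    text = log.read_text(encoding="utf-8") if log.exists() else "# Wiki Log\n"
    if len(re.findall(r"^## \[", text, re.MULTILINE)) > max_entries:
        # Assumption: YYYY is the current year; an existing archive is appended to.
        archive = log.with_name(f"log-{today[:4]}.md")
        with archive.open("a", encoding="utf-8") as fh:
            fh.write(text)
        text = "# Wiki Log\n"
    entry = f"\n## [{today}] {action} | {subject}\n" + "".join(f"- {f}\n" for f in files)
    log.write_text(text + entry, encoding="utf-8")
    return entry.strip()
```

For example, `append_log(wiki, "ingest", "Source Title", ["entities/a.md"])` writes the heading plus one bullet per touched file.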

## Core Operations

### 1. Ingest

When the user provides a source (URL, file, paste), integrate it into the wiki:

① **Capture the raw source:**
   - URL → use `web_extract` to get markdown, save to `raw/articles/`
   - PDF → use `web_extract` (handles PDFs), save to `raw/papers/`
   - Pasted text → save to appropriate `raw/` subdirectory
   - Name the file descriptively: `raw/articles/karpathy-llm-wiki-2026.md`
   - **Add raw frontmatter** (`source_url`, `ingested`, `sha256` of the body).
     On re-ingest of the same URL: recompute the sha256, compare to the stored value —
     skip if identical, flag drift and update if different. This is cheap enough to
     do on every re-ingest and catches silent source changes.

② **Discuss takeaways** with the user — what's interesting, what matters for
   the domain. (Skip this in automated/cron contexts — proceed directly.)

③ **Check what already exists** — search index.md and use `search_files` to find
   existing pages for mentioned entities/concepts. This is the difference between
   a growing wiki and a pile of duplicates.

④ **Write or update wiki pages:**
   - **New entities/concepts:** Create pages only if they meet the Page Thresholds
     in SCHEMA.md (2+ source mentions, or central to one source)
   - **Existing pages:** Add new information, update facts, bump `updated` date.
     When new info contradicts existing content, follow the Update Policy.
   - **Cross-reference:** Every new or updated page must link to at least 2 other
     pages via `[[wikilinks]]`. Check that existing pages link back.
   - **Tags:** Only use tags from the taxonomy in SCHEMA.md
   - **Provenance:** On pages synthesizing 3+ sources, append `^[raw/articles/source.md]`
     markers to paragraphs whose claims trace to a specific source.
   - **Confidence:** For opinion-heavy, fast-moving, or single-source claims, set
     `confidence: medium` or `low` in frontmatter. Don't mark `high` unless the
     claim is well-supported across multiple sources.

⑤ **Update navigation:**
   - Add new pages to `index.md` under the correct section, alphabetically
   - Update the "Total pages" count and "Last updated" date in index header
   - Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Title`
   - List every file created or updated in the log entry

⑥ **Report what changed** — list every file created or updated to the user.

A single source can trigger updates across 5-15 wiki pages. This is normal
and desired — it's the compounding effect.
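
The body-only hash from step ① can be sketched like this. The frontmatter reader is a deliberately simple line parser rather than a YAML parser, and the function names are illustrative assumptions.

```python
import hashlib


def split_frontmatter(text):
    """Split a raw file into (frontmatter dict of plain strings, body)."""
    if not text.startswith("---\n"):
        return {}, text
    end = text.find("\n---\n", 3)
    if end == -1:
        return {}, text
    meta = {}
    for line in text[4:end].splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Drop trailing "# comment" text such as the one in the template.
            meta[key.strip()] = value.split(" #")[0].strip()
    return meta, text[end + 5:]


def body_sha256(text):
    """Hash only the body, i.e. everything after the closing '---' line."""
    _, body = split_frontmatter(text)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def check_drift(stored_text, fetched_body):
    """Return 'unchanged' or 'drift' by comparing the stored hash to a fresh fetch."""
    meta, _ = split_frontmatter(stored_text)
    fresh = hashlib.sha256(fetched_body.encode("utf-8")).hexdigest()
    return "unchanged" if meta.get("sha256") == fresh else "drift"
```

The hash is sensitive to whitespace, so whatever writes the raw file should store the body exactly as it was hashed, including any leading blank line after the closing `---`.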

### 2. Query

When the user asks a question about the wiki's domain:

① **Read `index.md`** to identify relevant pages.

② **For wikis with 100+ pages**, also `search_files` across all `.md` files
   for key terms, because the index alone may miss relevant content.

③ **Read the relevant pages** using `read_file`.

④ **Synthesize an answer** from the compiled knowledge. Cite the wiki pages
   you drew from: "Based on [[page-a]] and [[page-b]]..."

⑤ **File valuable answers back:** if the answer is a substantial comparison,
   deep dive, or novel synthesis, create a page in `queries/` or `comparisons/`.
   Don't file trivial lookups; file only answers that would be painful to re-derive.

⑥ **Update log.md** with the query and whether it was filed.

### 3. Lint

When the user asks to lint, health-check, audit, or prepare for a large/source-sensitive ingestion, run or create a **repo-local QA gate** before bulk processing. For high-value corpora (lectures, books, PDFs, domain archives), do not treat lint as optional cleanup after the fact: add `scripts/wiki_qa.py` or equivalent first, document it in `docs/QA.md`, run it after every ingestion phase, and commit only when it passes. See `references/source-grounded-qa-gates.md` for the Human Design/large-media pattern.

For structured book/PDF ingests whose retrieval depends on headings or an expected source grid, add a deterministic heading/grid QA pass before saying the source is query-ready. Compare strict headings to fuzzy/OCR-normalized headings, report missing/duplicate grid entries, flag short sections and OCR noise lines, build a machine-readable heading index, and benchmark exact/natural-query lookup. Preserve a verbatim evidence layer even if compiled chapter/search artifacts get normalized. See `references/source-text-heading-grid-qa.md`.

① **Orphan pages:** Find pages with no inbound `[[wikilinks]]` from other pages.
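```python
# Use execute_code for this: programmatic scan across all wiki pages
import os, re
from collections import defaultdict
from pathlib import Path
wiki = Path(os.environ.get("WIKI_PATH") or Path.home() / "wiki")
# Scan all .md files in entities/, concepts/, comparisons/, queries/
dirs = ("entities", "concepts", "comparisons", "queries")
pages = {p.stem: p for d in dirs for p in (wiki / d).rglob("*.md")}
# Extract all [[wikilinks]] (dropping |alias and #heading) and build the inbound link map
inbound = defaultdict(set)
for slug, path in pages.items():
    for target in re.findall(r"\[\[([^\]|#]+)", path.read_text(encoding="utf-8")):
        inbound[target.strip()].add(slug)
# Pages with zero inbound links from other pages are orphans
orphans = sorted(s for s in pages if not inbound[s] - {s})
print("\n".join(orphans) or "No orphans found")
```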

② **Broken wikilinks:** Find `[[links]]` that point to pages that don't exist.

③ **Index completeness:** Every wiki page should appear in `index.md`. Compare
   the filesystem against index entries.

④ **Frontmatter validation:** Every wiki page must have all required fields
   (title, created, updated, type, tags, sources). Tags must be in the taxonomy.

⑤ **Stale content:** Pages whose `updated` date is >90 days older than the most
   recent source that mentions the same entities.

⑥ **Contradictions:** Pages on the same topic with conflicting claims. Look for
   pages that share tags/entities but state different facts. Surface all pages
   with `contested: true` or `contradictions:` frontmatter for user review.

⑦ **Quality signals:** List pages with `confidence: low` and any page that cites
   only a single source but has no confidence field set — these are candidates
   for either finding corroboration or demoting to `confidence: medium`.

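⑧ **Source drift:** For each file in `raw/` with a `sha256:` frontmatter field, recompute
   the hash of the body and flag mismatches. A mismatch means the stored body no longer
   matches its recorded hash, typically because the raw file was edited after ingest
   (shouldn't happen, since raw/ is immutable). Upstream changes at the source URL only
   show up when the source is re-fetched and compared during re-ingest. Not a hard
   error, but worth reporting.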

⑨ **Page size:** Flag pages over 200 lines — candidates for splitting.

⑩ **Tag audit:** List all tags in use, flag any not in the SCHEMA.md taxonomy.

⑪ **Log rotation:** If log.md exceeds 500 entries, rotate it.

⑫ **Report findings** with specific file paths and suggested actions, grouped by
   severity (broken links > orphans > source drift > contested pages > stale content > style issues).

⑬ **Append to log.md:** `## [YYYY-MM-DD] lint | N issues found`
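
Checks ②, ④, and ⑨ are mechanical enough to script together. The sketch below assumes wikilinks reference pages by filename stem and reads frontmatter keys line by line instead of parsing YAML; `lint` and the report keys are illustrative names.

```python
import re
from pathlib import Path

LAYER2 = ("entities", "concepts", "comparisons", "queries")
REQUIRED = ("title", "created", "updated", "type", "tags", "sources")
# Match [[target]], [[target|alias]] and [[target#heading]], but skip ![[embeds]].
LINK = re.compile(r"(?<!!)\[\[([^\]|#]+)")


def lint(wiki, max_lines=200):
    """Return findings for broken links, missing frontmatter fields and page size."""
    wiki = Path(wiki)
    pages = {p.stem: p for d in LAYER2 for p in sorted((wiki / d).rglob("*.md"))}
    report = {"broken_links": [], "missing_fields": [], "oversized": []}
    for slug, path in pages.items():
        text = path.read_text(encoding="utf-8")
        rel = path.relative_to(wiki).as_posix()
        for target in sorted({t.strip() for t in LINK.findall(text)}):
            if target not in pages:
                report["broken_links"].append((rel, target))
        # Top-level frontmatter keys, read line by line (not a full YAML parse).
        head = text.split("\n---", 1)[0] if text.startswith("---\n") else ""
        keys = set(re.findall(r"^([A-Za-z_]+):", head, re.MULTILINE))
        missing = [field for field in REQUIRED if field not in keys]
        if missing:
            report["missing_fields"].append((rel, missing))
        if len(text.splitlines()) > max_lines:
            report["oversized"].append(rel)
    return report
```

Each finding carries the page's path relative to the wiki root, which matches the "specific file paths" requirement in step ⑫.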

## Working with the Wiki

### Searching

```bash
# Find pages by content
search_files "transformer" path="$WIKI" file_glob="*.md"

# Find pages by filename
search_files "*.md" target="files" path="$WIKI"

# Find pages by tag
search_files "tags:.*alignment" path="$WIKI" file_glob="*.md"

# Recent activity
read_file "$WIKI/log.md" offset=<last 20 lines>
```

### Bulk Ingest

When ingesting multiple sources at once, batch the updates:
1. Read all sources first
2. Identify all entities and concepts across all sources
3. Check existing pages for all of them (one search pass, not N)
4. Create/update pages in one pass (avoids redundant updates)
5. Update index.md once at the end
6. Write a single log entry covering the batch

### Strategic Corrections / Project Reframes

When the user corrects the operating frame of a wiki-backed project — e.g. “this is execution-first, not research-first,” “we already have relationships/buyers,” or “organize around equipment/deployment, not validation” — treat it as a first-class wiki maintenance task, not a chat-only clarification. Create or update a concept page for the corrected thesis, update relevant object/inventory pages, patch the main plan/roadmap and swarm brief, and update index/log/action registers. If a deployed console/cockpit reads or hardcodes summaries from the wiki, patch and redeploy the app/API too so the UI does not continue surfacing stale priorities.

See `references/strategic-correction-reframe.md` for the tested pattern and verification checklist.

### Hybrid Raw Asset Storage

For wikis that ingest PDFs/books/media, preserve speed by keeping compiled markdown plus small raw markdown/JSON/manifests local, while storing large immutable source PDFs, rendered page images, scans, audio, and video in S3-compatible object storage. Use local `.s3stub` pointer files and `raw/books/<slug>/manifest.json`; fetch assets into `.cache/s3/` only when exact raw reinspection is needed, and verify with SHA-256 before use.

See `references/hybrid-s3-raw-assets.md` for the tested Alex VPS pattern, manifest shape, gitignore rules, and command examples.

### Compact Source Registry Citations

For large source-backed wikis, do not repeat long machine artifact paths (`raw/mega/...`, S3 paths, transcript JSON paths) throughout concept/query pages. Keep exact paths in manifests, source pages, and a `_meta/source-registry.json`; cite compact source IDs in synthesis pages instead, e.g. `^[src:gate-27-sacral-audio transcript 00:12:33-00:14:01]`. Use a dry-run migration that targets provenance markers on synthesis pages only, not raw/source/meta layers. After migration, remember that synthesis runs can still be large because generated page bodies and source inspection dominate tool-call/session payload; compact citations are context hygiene, not proof a cron is cheap. See `references/compact-source-registry-citations.md` for the HD wiki migration pattern, BG5/raw-article inclusion, measured-run caveats, and helper-script design.
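
Resolving compact markers back to full paths might look like the sketch below. The registry shape (`{source_id: {"path": ...}}`) is an assumption made for illustration; the real shape of `_meta/source-registry.json` is defined in the referenced file, not here.

```python
import re

# Matches markers such as ^[src:gate-27-sacral-audio transcript 00:12:33-00:14:01]
MARKER = re.compile(r"\^\[src:([A-Za-z0-9_-]+)([^\]]*)\]")


def resolve_citations(page_text, registry):
    """Map each compact '^[src:ID ...]' marker to its full path, or None if unknown."""
    resolved = []
    for match in MARKER.finditer(page_text):
        source_id, locator = match.group(1), match.group(2).strip()
        entry = registry.get(source_id)
        resolved.append({
            "source_id": source_id,
            "locator": locator,
            "path": entry["path"] if entry else None,  # hypothetical registry field
        })
    return resolved
```

Markers that resolve to `None` are a useful lint signal: the page cites a source ID the registry does not know about. Long-form markers like `^[raw/articles/a.md]` are ignored, so the function can run during a partial migration.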

For MEGA-hosted corpora, especially when raw source files must not persist on the VPS, see `references/mega-cloud-drive-ingest.md`. It captures the `megajs` workaround for public MEGA folder listing/download, golden-sample selection, transient `/tmp` download → S3 upload → local deletion workflow, and verification checks.

For large source-grounded corpora whose raw artifact paths are long and repeated, use `references/compact-source-registry.md`: keep exact full paths in manifests/source pages/a registry, but cite compact `source_id` aliases on concept/query pages to preserve provenance without bloating LLM context.

For large corpora that must keep ingesting autonomously beyond the current chat session, see `references/background-corpus-swarm-ingest.md`. It captures the run-until-complete Hermes cron swarm pattern: dedicated project space, state file, reports, package coverage ledger, orchestrator/digest jobs, and explicit stop conditions.

For productized multi-tenant ingestion UIs (e.g. Hermes Spawn), see `references/spawn-kb-ingestion-sessions.md`. Treat every KB ingest as a durable Hermes ingestion session with a visible transcript/status, scope note, final report, resumable artifacts, and interrupt support. Do not run hidden one-shot ingestion that can fail silently behind a raw queued source.

When the user wants the swarm actively processing now rather than waiting for a scheduled heartbeat, use `references/foreground-continuation-workers.md`. It captures the pattern of spawning a bounded `hermes chat` worker from the swarm workdir, verifying the live process/PID, leaving cron enabled, and committing/pushing clean wiki changes between bounded units.

For background swarms that touch S3, provider SDKs, or other tools under cron, see `references/autonomous-swarm-runtime-bootstrap.md`. It captures the self-contained `.env` + `.venv` + `scripts/bootstrap_runtime.sh` pattern, real S3 put/head/delete verification before ledger unblocking, JSON `.s3stub` cleanup, and prompt language that tells the orchestrator to install missing dependencies before declaring blockers.

For Human Design wiki swarm synthesis, do not impose fixed page-count targets. See `references/human-design-wiki-swarm.md` for Alex's corrected dynamic-synthesis rule: load the relevant extracted source context into GPT-5.5 and let source coverage determine whether the run creates one deep page, many atomic pages, query/comparison pages, source-page expansion, or candidate work items, while preserving provenance and QA cleanliness. That reference also captures the compounding synthesis rule: update existing Markdown pages when topics already exist, create new pages only with source support, and add/repair `[[wikilinks]]`.

For source-grounded public-web research that should become both an AI-generated report and wiki pages, see `references/source-grounded-web-research-to-wiki.md`. It captures the BG5 pass pattern: verify Alex's tiered web stack when requested, avoid pirated/manual mirrors, save dated raw captures with URL/SHA frontmatter, write the synthesis report under `queries/`, create concise concept pages that cite the report plus raw sources, update schema/index/log, and verify created files plus wikilinks before reporting back.

For a concrete Human Design golden-sample ingest, see `references/human-design-lyd-golden-sample.md`. It captures the RA LYD 7h lecture + 94-page slide PDF lessons: timestamp-free ASR is provisional, PowerPoint PDFs need slide-aware vision routing, `.cache/` raw media can still fail QA, and `--s3` failures may come from loading credentials for the wrong bucket.

For the completed high-fidelity LYD pass, see `references/human-design-lyd-timestamped-slide-ingest.md`. It captures the timestamped Venice Whisper workflow, 94-slide vision extraction, resumable missing-page reruns after provider 429s, candidate slide/transcript alignment, disk cleanup, and the QA pitfall where combined transcripts reset timestamps at each source-audio heading.

For turning a large MEGA/wiki effort into a long-running autonomous project, see `references/human-design-wiki-swarm.md`. It captures the dedicated `/home/avalon/hd-wiki-swarm` project-space pattern, Hermes cron orchestrator + daily digest jobs, state/report layout, role definitions, bounded-run policy, and the chart-analysis-oriented synthesis target for gates, lines, colors, tones, bases, variables, centers, channels, transits, and chart comparison.

For source-sensitive bulk ingests (long lectures, PDFs, slide decks, domain archives), add committed QA gates before bulk processing. See `references/source-grounded-qa-gates.md` for checks covering no-local-media policy, manifests/stubs, S3 HEAD verification, transcript timestamps, PDF rendered-page assets, in-body provenance markers, and avoiding ungrounded placeholder pages.

For single long strategy recordings or interviews that need raw transcript preservation plus structured wiki pages, use `references/long-audio-to-wiki-ingest.md`. It covers cloud-link acquisition when Telegram cannot cache oversized voice files, chunked faster-whisper transcription, watcher/idempotency pattern, parallel timestamp-slice analysis, wiki page synthesis, and disk-space pitfalls on Alex's VPS.

For long Telegram/voice recordings that should become a new wiki, use the durable incoming-folder + watcher + chunked `faster-whisper` pattern in `references/long-audio-telegram-wiki-ingest.md`. Check the gateway log for `Failed to cache voice: File is too big` before assuming the audio exists locally, and use alternate upload routes when Telegram bot download rejects the file.

### Background Swarm Ingest for Large Corpora

When the corpus is too large for a single session and the user wants autonomous progress until exhaustion, run the wiki through a dedicated background swarm/project space rather than ad-hoc chat turns. The pattern is: a swarm directory beside the wiki, a persistent state file, recursive inventory, package coverage ledger, reports per run, a local/noiseless Hermes cron orchestrator, and a separate daily digest job. The orchestrator should process bounded resumable units, but its global stop condition must be explicit: all ingestible packages/files are `complete` or intentionally `skip_*`/`blocked` with reasons. Do not stop at golden samples.

See `references/background-corpus-swarm-ingest.md` for the tested run-until-complete pattern and prompt requirements.

### Farmer-Friendly Ingest Convention

When a scheduled farmer writes into the wiki, add enough provenance that later sessions can tell human and automated updates apart. Good patterns:
- raw filenames that encode source + date, e.g. `raw/articles/youtube-nate-herck-2026-05-01.md`
- frontmatter fields such as `source: farmer/youtube`, `farmed: 2026-05-01T06:00:00Z`
- a matching log entry describing which farmer/job ran and which files changed

This keeps the wiki auditable as it compounds over time.
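
A farmer could build the filename and provenance fields like this. It is a sketch: `farmed_raw_file` and `slugify` are illustrative helpers, and the output covers only the farmer-specific fields, so the standard raw frontmatter (`source_url`, `ingested`, `sha256`) would still need to be added.

```python
import re
from datetime import datetime, timezone


def slugify(text):
    """Lowercase, hyphen-separated, ASCII letters and digits only."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def farmed_raw_file(farmer, source_name, body, when=None):
    """Return (relative path, file text) for a farmer-written raw article."""
    # Naive datetimes are interpreted as local time by astimezone().
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    path = f"raw/articles/{slugify(farmer)}-{slugify(source_name)}-{when:%Y-%m-%d}.md"
    frontmatter = f"---\nsource: farmer/{slugify(farmer)}\nfarmed: {stamp}\n---\n"
    return path, frontmatter + body
```

Calling it with farmer `"youtube"` and source `"Nate Herck"` for 2026-05-01 06:00 UTC reproduces the example filename above.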

### Chat-Product Memory Harvester Pattern

When adapting an LLM wiki into a chat product, do not let the agent silently write every interesting chat turn into durable memory. Put a memory-harvest pass after the assistant answer that considers both sides of the exchange: what the user shared and what the assistant synthesized. The pass should draft reviewable save proposals rather than directly mutating the wiki.

A lower-risk first step is a read-only wiki browser beside chat: discover vaults by `SCHEMA.md`/`index.md`, list markdown previews, and open selected files without enabling writes. Keep path resolution server-side and constrained to the selected vault. See `references/read-only-wiki-browser-chat-products.md` for the tested mobile PWA sidebar pattern.

Good proposal fields: `kind`, `title`, `summary`, `proposedMarkdown`, `targetPath`, `crossLinks`, `confidence`, `provenance` (`userMessageId`, `assistantMessageId`, timestamp), and `status` (`pending`, `approved`, `rejected`, `superseded`). Candidate kinds include durable person facts, life events, dream notes, reading conclusions, chart/HD corrections, transit windows, substantial query syntheses, and user corrections.

Before approving writes, orient to `SCHEMA.md`, `index.md`, recent `log.md`, and relevant pages; update existing pages before creating new ones; cross-link updates; and record contradictions instead of overwriting. In multi-tenant products, approval must go through constrained KB tools/path locking, not general filesystem access. Public UI should show simple human labels like “Possible memory update noticed,” not raw internal file paths, source chunks, skill ids, or hidden routing metadata.
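
One way to enforce the path constraint and the proposal shape before anything reaches review is sketched below. It assumes Python 3.9+ (for `Path.is_relative_to`) and that proposals arrive as plain dicts using the field names listed above; the helper names are hypothetical, and this is not a substitute for the product's own KB tools or path locking.

```python
from pathlib import Path

STATUSES = {"pending", "approved", "rejected", "superseded"}
FIELDS = {"kind", "title", "summary", "proposedMarkdown", "targetPath",
          "crossLinks", "confidence", "provenance", "status"}


def resolve_in_vault(vault, target_path):
    """Resolve a proposal's targetPath and refuse anything outside the vault."""
    vault = Path(vault).resolve()
    resolved = (vault / target_path).resolve()
    if not resolved.is_relative_to(vault) or resolved.suffix != ".md":
        raise ValueError(f"path escapes vault or is not markdown: {target_path}")
    return resolved


def validate_proposal(proposal, vault):
    """Return a list of problems; an empty list means the proposal is reviewable."""
    problems = [f"missing field: {f}" for f in sorted(FIELDS - set(proposal))]
    if proposal.get("status") not in STATUSES:
        problems.append("invalid status")
    try:
        resolve_in_vault(vault, proposal.get("targetPath", ""))
    except ValueError as exc:
        problems.append(str(exc))
    return problems
```

Resolving both paths before comparing them catches `../` traversal and absolute paths alike, which is why the check runs server-side rather than trusting a client-supplied path.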

### Archiving

When content is fully superseded or the domain scope changes:
1. Create `_archive/` directory if it doesn't exist
2. Move the page under `_archive/`, preserving its original relative path (e.g., `entities/old-page.md` becomes `_archive/entities/old-page.md`)
3. Remove from `index.md`
4. Update any pages that linked to it — replace wikilink with plain text + "(archived)"
5. Log the archive action
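
The five steps above can be sketched as one function. This is a simplified illustration under stated assumptions: wikilinks use the page's filename stem (`[[old-page]]` or `[[old-page|alias]]`), `index.md` holds one entry per line, and the log line format is hypothetical. `archive_page` is not an existing tool.

```python
import re
import shutil
import tempfile
from datetime import date
from pathlib import Path


def archive_page(wiki: Path, rel_page: str, today: date) -> list[Path]:
    """Archive `rel_page`; return the pages whose links were rewritten."""
    dest = wiki / "_archive" / rel_page
    dest.parent.mkdir(parents=True, exist_ok=True)           # step 1
    shutil.move(str(wiki / rel_page), str(dest))             # step 2

    name = Path(rel_page).stem
    link = re.compile(r"\[\[" + re.escape(name) + r"(?:\|([^\]]+))?\]\]")

    # Step 3: drop index lines that point at the archived page
    index = wiki / "index.md"
    kept = [ln for ln in index.read_text().splitlines() if not link.search(ln)]
    index.write_text("\n".join(kept) + "\n")

    # Step 4: replace wikilinks with plain text + "(archived)"
    touched = []
    for page in wiki.rglob("*.md"):
        parts = page.relative_to(wiki).parts
        if parts[0] in {"_archive", "raw", "index.md", "log.md"}:
            continue  # never rewrite archived pages or immutable raw sources
        text = page.read_text()
        new = link.sub(lambda m: f"{m.group(1) or name} (archived)", text)
        if new != text:
            page.write_text(new)
            touched.append(page)

    # Step 5: append a log entry (format is an assumption)
    with (wiki / "log.md").open("a") as log:
        log.write(f"## [{today.isoformat()}] archive | {rel_page}\n")
    return touched


# Demo against a throwaway wiki
with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp)
    (wiki / "entities").mkdir()
    (wiki / "entities" / "old-page.md").write_text("# Old\n")
    (wiki / "entities" / "other.md").write_text("See [[old-page]] for history.\n")
    (wiki / "index.md").write_text("- [[old-page]]\n- [[other]]\n")
    (wiki / "log.md").write_text("")
    archive_page(wiki, "entities/old-page.md", date(2026, 5, 1))
    other_text = (wiki / "entities" / "other.md").read_text()
    index_text = (wiki / "index.md").read_text()
    log_text = (wiki / "log.md").read_text()
    archived_exists = (wiki / "_archive" / "entities" / "old-page.md").exists()
```

The sketch does not handle links with section anchors such as `[[old-page#history]]`; a real implementation would need to decide how those should read after archiving.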

### Obsidian Integration

The wiki directory works as an Obsidian vault out of the box:
- `[[wikilinks]]` render as clickable links
- Graph View visualizes the knowledge network
- YAML frontmatter powers Dataview queries
- The `raw/assets/` folder holds images referenced via `![[image.png]]`

For best results:
- Set Obsidian's attachment folder to `raw/assets/`
- Enable "Wikilinks" in Obsidian settings (usually on by default)
- Install Dataview plugin for queries like `TABLE tags FROM "entities" WHERE contains(tags, "company")`

If using the Obsidian skill alongside this one, set `OBSIDIAN_VAULT_PATH` to the
same directory as the wiki path.

### Quartz Web Viewer Layer

For a hosted/web-native viewer, publish the Markdown vault with Quartz instead of running the Obsidian desktop app through VNC/noVNC. Quartz provides a fast static site with backlinks, search, and graph-style navigation while keeping the vault files as source of truth. Treat Quartz as the **viewer/publishing layer**, not the whole Karpathy LLM wiki pattern: the full pattern still requires ingestion/context-farmer jobs that create raw source records, synthesize entity/concept/comparison pages, and update `index.md`/`log.md`.

See `references/quartz-viewer-and-context-farmers.md` for the VPS multi-vault Quartz deployment pattern and the viewer-vs-farmer distinction learned from Alex's Obsidian migration.

### Obsidian Headless (servers and headless machines)

On machines without a display, use `obsidian-headless` instead of the desktop app.
It syncs vaults via Obsidian Sync without a GUI — perfect for agents running on
servers that write to the wiki while Obsidian desktop reads it on another device.

**Setup:**
```bash
# Requires Node.js 22+
npm install -g obsidian-headless

# Login (requires Obsidian account with Sync subscription)
ob login --email <email> --password '<password>'

# Create a remote vault for the wiki
ob sync-create-remote --name "LLM Wiki"

# Connect the wiki directory to the vault
cd ~/wiki
ob sync-setup --vault "<vault-id>"

# Initial sync
ob sync

# Continuous sync (foreground — use systemd for background)
ob sync --continuous
```

**Continuous background sync via systemd:**
```ini
# ~/.config/systemd/user/obsidian-wiki-sync.service
[Unit]
Description=Obsidian LLM Wiki Sync
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/path/to/ob sync --continuous
WorkingDirectory=/home/user/wiki
Restart=on-failure
RestartSec=10

[Install]
WantedBy=default.target
```

```bash
systemctl --user daemon-reload
systemctl --user enable --now obsidian-wiki-sync
# Enable linger so sync survives logout:
sudo loginctl enable-linger "$USER"
```

This lets the agent write to `~/wiki` on a server while you browse the same
vault in Obsidian on your laptop/phone — changes appear within seconds.

## Pitfalls

- **Never modify files in `raw/`** — sources are immutable. Corrections go in wiki pages.
- **Always orient first** — read SCHEMA + index + recent log before any operation in a new session.
  Skipping this causes duplicates and missed cross-references.
- **Always update index.md and log.md** — skipping this makes the wiki degrade. These are the
  navigational backbone.
- **Don't create pages for passing mentions** — follow the Page Thresholds in SCHEMA.md. A name
  appearing once in a footnote doesn't warrant an entity page.
- **Don't create pages without cross-references** — isolated pages are invisible. Every page must
  link to at least 2 other pages.
- **Frontmatter is required** — it enables search, filtering, and staleness detection.
- **Tags must come from the taxonomy** — freeform tags decay into noise. Add new tags to SCHEMA.md
  first, then use them.
- **Keep pages scannable** — a wiki page should be readable in 30 seconds. Split pages over
  200 lines. Move detailed analysis to dedicated deep-dive pages.
- **Ask before mass-updating** — if an ingest would touch 10+ existing pages, confirm
  the scope with the user first.
- **Rotate the log** — when log.md exceeds 500 entries, rename it `log-YYYY.md` and start fresh.
  The agent should check log size during lint.
- **Handle contradictions explicitly** — don't silently overwrite. Note both claims with dates,
  mark in frontmatter, flag for user review.
- **Do not create placeholder concept pages just to satisfy wikilinks** — if a concept such as `authority`, `type`, or `center` has not yet been source-grounded, leave it as plain text or a working note until an ingest actually supports it. Broken wikilinks should fail QA rather than encouraging ungrounded pages.
- **For high-value media/PDF ingests, prefer fidelity over speed/cost** — use timestamped transcription, rendered PDF page images, vision extraction for artifacted PDFs, and phase-level QA gates. Structural lint is necessary but not sufficient; include manual/LLM-assisted spot checks against the original audio/page images.
- **When the user corrects the project frame, propagate it through the wiki and any cockpit.** Do not leave a strategic correction buried in chat or in one plan page. Create a durable concept for the corrected thesis, update plans/swarm briefs/action registers/index/log, and if a live console has hardcoded recommendations or gates, patch/redeploy it. See `references/strategic-correction-reframe.md`.
- **Do not let validation artifacts become the ontology by accident.** Especially in business/execution wikis, compliance, buyer validation, and research gates are often guardrails. If the user says relationships, buyers, or operational access already exist, encode that as the starting stance while still tracking practical confirmations.
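
Several of the pitfalls above are mechanically checkable during lint. The sketch below covers only the numeric ones (at least 2 outbound links, 200-line pages, 500 log entries) plus a frontmatter presence check. The function names and the `## [` log-entry prefix are assumptions; adjust them to the conventions in `SCHEMA.md`.

```python
import re

# Match [[target]] or [[target|alias]], but not image embeds like ![[image.png]]
WIKILINK = re.compile(r"(?<!!)\[\[([^\]|#]+)")
MIN_LINKS, MAX_LINES, MAX_LOG_ENTRIES = 2, 200, 500


def lint_page(text: str) -> list[str]:
    """Return human-readable problems for one wiki page."""
    problems = []
    if not text.startswith("---\n"):
        problems.append("missing frontmatter")
    targets = {t.strip() for t in WIKILINK.findall(text)}  # distinct link targets
    if len(targets) < MIN_LINKS:
        problems.append(f"only {len(targets)} outbound link(s), need {MIN_LINKS}")
    n_lines = len(text.splitlines())
    if n_lines > MAX_LINES:
        problems.append(f"{n_lines} lines, split pages over {MAX_LINES}")
    return problems


def log_needs_rotation(log_text: str, entry_prefix: str = "## [") -> bool:
    """True when log.md has more entries than the rotation threshold."""
    entries = sum(1 for ln in log_text.splitlines() if ln.startswith(entry_prefix))
    return entries > MAX_LOG_ENTRIES


good = "---\ntitle: A\n---\nSee [[b]] and [[c|see C]].\n"
orphan = "No links here, just ![[image.png]].\n"
good_report = lint_page(good)
orphan_report = lint_page(orphan)
```

Structural checks like these catch drift early, but they say nothing about whether a page is source-grounded, so they complement the manual spot checks rather than replace them.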

## Related Tools

[llm-wiki-compiler](https://github.com/atomicmemory/llm-wiki-compiler) is a Node.js CLI, also
inspired by Karpathy's pattern, that compiles sources into a concept wiki. It's Obsidian-compatible,
so users who want a scheduled/CLI-driven compile pipeline can point it at the same vault this
skill maintains. Trade-offs: it owns page generation (replacing the agent's judgment on page
creation) and is tuned for small corpora. Use this skill when you want agent-in-the-loop curation;
use llm-wiki-compiler when you want a batch compile of a source directory.