Guide
April 3, 2026 | 12 min read

LLM Knowledge Bases: Karpathy's Self-Improving Second Brain

Andrej Karpathy shared his workflow for building personal knowledge bases with LLMs - no RAG needed. We break down the architecture, explain why it works, identify the best models for each step, and show how to build your own.

On April 2, 2026, Andrej Karpathy - co-founder of OpenAI, former Tesla AI lead, and one of the most influential voices in machine learning - shared a workflow that quietly redefines how we should think about personal knowledge management. His approach: use LLMs not as chatbots, but as compilers that transform raw information into structured, queryable knowledge bases.

The key insight that made the community sit up: no RAG needed. No vector databases, no embedding pipelines, no retrieval infrastructure. Just raw files, an LLM with a large enough context window, and markdown. The entire compiled knowledge base fits in the context of modern long-context models, which makes the RAG stack unnecessary for personal-scale knowledge management.

This is not a theoretical exercise. Karpathy is actively using this system for his own work, and the architecture he describes maps directly onto tools and models available today. Below, we break down every component of his workflow, explain the technical reasoning behind each design decision, identify the best models for each step, and show exactly how to build your own.

280 models with 128K+ context (available today)
72 models with 1M+ context (full-codebase scale)
10 free models (zero cost)
317 coding models tracked (live rankings)

Karpathy's Architecture: The Six-Stage Pipeline

Karpathy's system is not a single tool - it is a six-stage pipeline where each stage serves a distinct purpose. Understanding the architecture requires understanding what each stage does, why it exists, and what model capabilities it demands.

Stage 1: raw/ - The Ingestion Layer

A flat directory where you dump source materials in any format: PDFs, HTML pages, API docs, research papers, tweets, code snippets, transcripts. No preprocessing required - the LLM handles format normalization in the next stage. Karpathy treats this as a "read-later queue on steroids" where the barrier to adding information is zero. The key design principle: capture first, organize later. You do not curate at ingestion time because curation is exactly what the LLM is better at.

Stage 2: LLM Compilation - The Intelligence Layer

This is the core innovation. An LLM reads every document in raw/ and compiles the collection into structured markdown files organized by topic. Not summarization - compilation. The distinction matters: summarization loses detail, while compilation restructures information into a consistent format that preserves key details while adding cross-references, resolving contradictions between sources, and establishing a coherent ontology. The LLM acts as a knowledge compiler, loosely analogous to how a C compiler transforms source code into machine code: a different representation that keeps the meaning that matters.

Each compiled markdown page covers one topic and includes: a summary, key facts, source attribution, related topics (as wiki links), open questions, and areas where sources disagree. The LLM decides the taxonomy - you do not predefine categories.

Stage 3: Obsidian IDE - The Visualization Layer

The compiled markdown wiki is opened in Obsidian (or any markdown editor with backlink support). This turns the flat files into a navigable knowledge graph. Obsidian's graph view shows topic connections. Its backlinks panel shows which pages reference the current topic. You can browse and edit the wiki manually, but the primary interaction mode is through the LLM in the next stage. Karpathy emphasizes that the human reads the wiki, but the LLM writes it.

Stage 4: Q&A Against Wiki - The Query Layer

Instead of searching your knowledge base with keywords (grep, Ctrl+F), you ask it questions in natural language. The entire compiled wiki is passed as context to an LLM, and you query it conversationally. "What do the sources say about X?" "What are the open disagreements about Y?" "How does Z relate to W?" Because the wiki is already compiled and structured, the LLM can answer accurately and cite specific source pages. This is where the no-RAG claim becomes concrete: if your compiled wiki fits in the model's context window, you do not need retrieval at all. Nothing is filtered out before the model sees it, so every page is available for every question, subject only to how well the model attends across a long context.

Stage 5: Output Filing - The Learning Loop

When the Q&A session produces new insights, synthesized conclusions, or identified gaps, those outputs are filed back into the wiki. This is what makes the system self-improving. Every interaction can expand the knowledge base. A question that reveals a gap leads to new research. A synthesis that connects two topics creates a new wiki page. Over time, the knowledge base becomes denser and more interconnected - not because you manually organized it, but because each query cycle implicitly refines the structure.

Stage 6: Linting & Health Checks - The Quality Layer

Automated checks run periodically across the wiki: broken links between pages, orphaned topics not connected to the graph, stale information that references outdated sources, contradictions between pages, and coverage gaps where raw/ has material that has not yet been compiled. Think of it as CI/CD for your knowledge base. The linting step can itself be LLM-powered - pass the wiki through a model and ask "what is inconsistent, outdated, or missing?" This turns the knowledge base from a static artifact into a living system with built-in quality checks.

Why This Works Without RAG

The most provocative aspect of Karpathy's approach is the claim that RAG is unnecessary for personal knowledge management. This is not anti-RAG rhetoric - it is a practical observation about how context windows have changed the math. The reasoning breaks down into three arguments:

Argument 1: Personal KBs Fit in Context

A comprehensive personal knowledge base covering your entire professional domain can reach 500K to 2M tokens of compiled markdown, and many are far smaller. That upper range is roughly 375K to 1.5M words, or 750 to 3,000 pages of text. Today, 72 models support 1M+ tokens and 157 models support 256K+ in our live rankings. For most people, the entire compiled wiki fits in a single context window. RAG solves a problem that largely disappears at this scale.

Argument 2: Retrieval Introduces Lossy Failure Modes

RAG systems depend on embedding quality, chunk size, top-k selection, and re-ranking - each step can lose information. A query about "how does X relate to Y" might retrieve chunks about X and chunks about Y separately, but miss the paragraph that explicitly connects them. When the full knowledge base fits in context, the model sees everything simultaneously. Cross-references, contradictions, and implicit connections that RAG would miss become visible. For knowledge bases under 1M tokens, full-context avoids these retrieval losses entirely, as long as the model itself stays reliable at that context length.

Argument 3: The Compilation Step Does the Heavy Lifting

In a traditional RAG system, raw documents are chunked and embedded as-is. The retrieval quality depends on how well the chunks represent the original information. Karpathy's compilation step eliminates this problem: the LLM has already read every raw document, extracted the key information, and restructured it into a consistent format. The compiled wiki is already optimized for LLM consumption. It is shorter than the raw sources (compilation is lossy by intent), more structured, and internally consistent. You get the benefit of "retrieval" (smaller context) without the failure modes of chunk-based retrieval.

The boundary where this breaks down is organizational scale. A company's entire knowledge base (millions of documents) will not fit in any context window. For that, RAG remains essential. Karpathy's insight is that most individuals and small teams do not operate at that scale, and for them, the RAG infrastructure is pure overhead. See our Context Windows report for a deeper analysis of how context lengths have expanded across the industry.

The Context Window Math

How much knowledge actually fits? Here is the concrete math for different context window sizes:

| Context Window | Words | Pages (Book) | KB Size Estimate | Use Case |
| --- | --- | --- | --- | --- |
| 128K tokens | ~96K | ~190 | Small domain wiki | Single project or topic area |
| 256K tokens | ~192K | ~385 | Medium KB | Professional domain coverage |
| 1M tokens | ~750K | ~1,500 | Large personal KB | Multi-domain expert knowledge |
| 2M tokens | ~1.5M | ~3,000 | Comprehensive KB | Team or organizational wiki |
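The conversions in the table can be reproduced with a short sketch. The two ratios below (about 0.75 words per token and about 500 words per page) are rule-of-thumb assumptions inferred from the table's own numbers, not measured values, and the `capacity` helper is a hypothetical name used only for illustration.

```python
# Rule-of-thumb ratios inferred from the table (assumptions, not measurements).
WORDS_PER_TOKEN = 0.75   # typical for English prose; code and other languages differ
WORDS_PER_PAGE = 500     # a fairly dense book page


def capacity(context_tokens: int, reserve_tokens: int = 0) -> dict:
    """Estimate how much compiled text fits in a context window.

    reserve_tokens is space held back for the question and the answer.
    """
    usable = max(context_tokens - reserve_tokens, 0)
    words = usable * WORDS_PER_TOKEN
    return {
        "tokens": usable,
        "words": round(words),
        "pages": round(words / WORDS_PER_PAGE),
    }


if __name__ == "__main__":
    for window in (128_000, 256_000, 1_000_000, 2_000_000):
        print(window, capacity(window))
```

Passing a non-zero `reserve_tokens` gives a more realistic figure, since the question and the model's answer share the same window as the wiki.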

To put this in perspective: the entire Wikipedia article on "Machine Learning" (including all sub-sections) is about 15K tokens. A compiled KB of about 15 such articles (roughly 225K tokens) fits in a 256K-token window, and about 65 fit in 1M tokens. Most professionals' accumulated domain knowledge - even in deep technical fields - compiles to under 500K tokens of structured markdown.

Best Models for Knowledge Base Workflows

Different stages of Karpathy's pipeline have different model requirements. Here is how the current leaderboard maps to each stage:

For Stage 2 (Compilation): Long Context + High Quality

The compilation step needs to read long raw documents and produce structured output. This requires both a large context window and strong instruction-following. The model must maintain coherence across hundreds of pages of input while producing well-organized markdown.

Models ranked by composite score among those with 128K+ context windows. Explore all models.

For Stage 4 (Q&A): Reasoning + Full Context

The Q&A stage needs strong reasoning over the entire compiled wiki. This is where top coding models excel - they already handle complex multi-file reasoning tasks.

Top 5 by composite score. Full leaderboard.

For Daily Use: Cost-Effective Options

Running queries against your KB multiple times per day adds up. These models offer strong performance at minimal cost - critical for making the workflow sustainable. See our pricing report for the full cost landscape.

Free tier models. Filter by price.

Model Spotlight: Which LLM for Which Task?

Rather than using one model for everything, Karpathy's approach benefits from task-specific model selection. Different pipeline stages have different bottlenecks:

| Pipeline Stage | Key Requirement | Top Pick | Score | Why |
| --- | --- | --- | --- | --- |
| Compilation | Long context + structure | Claude Fable 5 | 97 | Best score among 128K+ context models |
| Q&A / Analysis | Reasoning depth | Claude Fable 5 | 97 | #1 overall for complex reasoning |
| Linting | Fast + affordable | Grok 4.20 | 88 | Good quality at low cost for batch jobs |
| Tool Building | Coding ability | Claude Fable 5 | 97 | Highest benchmark scores for code generation |
| Self-Hosting | Open weights + privacy | Gemma 4 31B | 80 | Apache 2.0, runs on consumer GPUs |

For a deeper look at open-weight options suitable for self-hosted knowledge base systems, see our Gemma 4 Review and explore the open-source model rankings.

Deep Dive: The Compilation Step

The compilation step is the most technically demanding part of the pipeline, and where Karpathy's insight is sharpest. Traditional note-taking tools put the organizational burden on the human: you read a paper, decide what matters, create a page, file it in a folder, and link it to related notes. This is cognitively expensive and breaks down at scale.

Karpathy's approach inverts the workflow: you dump everything into raw/, and the LLM makes all organizational decisions. The prompt for the compilation step typically includes:

Example Compilation Prompt Structure:

1. Read all files in raw/
2. Identify distinct topics and concepts
3. For each topic, create a markdown page with:
   - Title and one-line summary
   - Key facts and details (preserve specifics)
   - Source attribution (which raw files)
   - Related topics (as [[wiki links]])
   - Open questions or gaps
   - Areas where sources disagree
4. Create an index.md with topic hierarchy
5. Flag any raw files that don't fit existing topics

The critical insight: this prompt runs incrementally. When new raw files are added, the LLM receives the existing wiki plus the new files and updates the wiki accordingly. It may create new pages, update existing ones, or restructure the taxonomy entirely. This is where high output capacity matters - the model needs to generate potentially dozens of updated pages in a single pass.
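One way to make the incremental pass concrete is to track which raw files have already been compiled and send only the new or changed ones alongside the existing wiki. The sketch below is an assumption about how this could be wired up, not Karpathy's implementation: the manifest file, the function names, and the prompt layout are all hypothetical, and it assumes raw files are text-like (PDFs and other binary formats would need text extraction first). The actual model call is left out on purpose.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Content hash used to detect new or modified raw files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_manifest(manifest_path: Path) -> dict:
    """Manifest maps relative raw path -> digest at last compile (hypothetical format)."""
    return json.loads(manifest_path.read_text()) if manifest_path.exists() else {}


def find_changed(raw_dir: Path, manifest_path: Path) -> list:
    """Return raw files that are new or modified since the last compile."""
    seen = load_manifest(manifest_path)
    return [
        p for p in sorted(raw_dir.rglob("*"))
        if p.is_file() and seen.get(p.relative_to(raw_dir).as_posix()) != file_digest(p)
    ]


def record_compiled(raw_dir: Path, manifest_path: Path, files: list) -> None:
    """Mark files as compiled. Keep the manifest outside raw/ so it is not ingested."""
    seen = load_manifest(manifest_path)
    for p in files:
        seen[p.relative_to(raw_dir).as_posix()] = file_digest(p)
    manifest_path.write_text(json.dumps(seen, indent=2))


def build_prompt(instructions: str, wiki_text: str, new_files: list) -> str:
    """Assemble the incremental prompt: instructions, existing wiki, then new sources."""
    parts = [instructions, "## Existing wiki", wiki_text, "## New raw files"]
    for p in new_files:
        parts.append(f"### {p.name}\n{p.read_text(errors='replace')}")
    return "\n\n".join(parts)
```

Hashing file contents rather than comparing modification times means a re-saved but unchanged file does not trigger a recompile, which keeps token spend down on large raw/ directories.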

Models with high max output tokens are essential here. Among current models tracked in our rankings, Claude Fable 5 leads with strong output capacity. See our max tokens guide for a detailed comparison of output limits across models.

RAG vs Full-Context: When to Use What

Karpathy's approach is not universally superior to RAG. The right architecture depends on your scale and requirements:

| Dimension | Full-Context (Karpathy) | RAG Pipeline |
| --- | --- | --- |
| Scale | Personal / small team (under 2M tokens compiled) | Organizational / enterprise (unlimited) |
| Recall | Perfect - model sees everything | Dependent on embedding/retrieval quality |
| Cross-referencing | Excellent - all context visible | Limited to retrieved chunks |
| Infrastructure | Zero - just files + LLM API | Vector DB, embedding pipeline, re-ranker |
| Cost per query | Higher (full KB sent each time) | Lower (only relevant chunks) |
| Latency | Higher (processing full context) | Lower (smaller prompt) |
| Setup time | Minutes - create folders, write prompt | Hours/days - choose stack, configure pipeline |
| Maintenance | Near zero - just add files | Re-embedding, index updates, chunk tuning |

The sweet spot for Karpathy's approach: individuals and small teams with domain-specific knowledge bases under 500K compiled tokens. At this scale, the simplicity advantage is overwhelming. For larger collections, consider a hybrid: compile and use full-context for your "hot" knowledge (active projects, current research), and RAG for your "cold" archive.

Cost Analysis: Running a Knowledge Base

The full-context approach sends your entire wiki with every query. Here is what that costs with current API pricing across different knowledge base sizes:

| KB Size | Queries/Day | Free Tier | Budget ($5/M input) | Premium ($15/M input) |
| --- | --- | --- | --- | --- |
| 100K tokens | 10 | $0/day | $5.00/day | $15.00/day |
| 250K tokens | 10 | $0/day | $12.50/day | $37.50/day |
| 500K tokens | 10 | Rate limited | $25.00/day | $75.00/day |
| 1M tokens | 10 | Exceeds limits | $50.00/day | $150.00/day |

Each query resends the whole wiki, so daily input cost is KB size × queries per day × price per token. For knowledge bases under 250K tokens, free-tier models make this workflow essentially zero-cost. On paid tiers the bill grows quickly: at 500K tokens with a budget model, 10 daily queries send 5M input tokens, or about $25/day (roughly $750/month), so large KBs are best paired with free tiers, cheaper models, or self-hosting. Weigh this against the infrastructure costs of running a RAG pipeline (vector DB hosting, embedding compute, maintenance time). See our pricing trends report for the broader context of API cost evolution.

Self-hosting eliminates per-query costs entirely. An open-weight model like Gemma 4 31B running on a consumer GPU (RTX 4090, ~$1,500) has zero marginal cost per query. For heavy users doing 50+ queries per day, self-hosting pays for itself within months. Read our Gemma 4 deep dive for hardware requirements and deployment options.
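The arithmetic behind these estimates is simple enough to sketch. The functions below are illustrative helpers (hypothetical names) that count input tokens only; they deliberately ignore output tokens, electricity, and any provider-specific discounts, so treat the results as rough lower bounds on API spend.

```python
def query_cost(kb_tokens: int, price_per_million: float) -> float:
    """Input cost in dollars of sending the whole KB once."""
    return kb_tokens / 1_000_000 * price_per_million


def daily_cost(kb_tokens: int, queries_per_day: int, price_per_million: float) -> float:
    """Input cost in dollars per day when every query resends the full KB."""
    return query_cost(kb_tokens, price_per_million) * queries_per_day


def breakeven_days(hardware_cost: float, kb_tokens: int,
                   queries_per_day: int, price_per_million: float) -> float:
    """Days until a one-off hardware purchase equals cumulative API input spend."""
    per_day = daily_cost(kb_tokens, queries_per_day, price_per_million)
    return float("inf") if per_day <= 0 else hardware_cost / per_day


if __name__ == "__main__":
    # Example inputs: a 500K-token KB, 50 queries/day, $5 per million input tokens,
    # compared against a ~$1,500 GPU.
    print(daily_cost(500_000, 50, 5.0))
    print(breakeven_days(1_500, 500_000, 50, 5.0))
```

Because cost scales with KB size times query count, trimming the compiled wiki or batching several questions into one request are the two cheapest levers before switching models.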

Building Your Own: Step-by-Step

Here is a concrete implementation path based on Karpathy's architecture, using tools and models available today:

Step 1: Set Up the Directory Structure

kb/
  raw/            # Drop files here
  wiki/           # LLM-compiled output
  wiki/index.md   # Auto-generated topic index
  prompts/        # Your compilation/lint prompts
  scripts/        # Automation scripts

Open the wiki/ directory in Obsidian (free). Enable backlinks and graph view in Obsidian settings.
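If you prefer to script the setup, a minimal sketch using only the Python standard library could look like this. The `init_kb` name and the placeholder text written to index.md are assumptions for illustration; the directory names come from the layout above.

```python
from pathlib import Path

SUBDIRS = ("raw", "wiki", "prompts", "scripts")


def init_kb(root: Path) -> Path:
    """Create the kb/ layout. Safe to run repeatedly; existing files are left alone."""
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    index = root / "wiki" / "index.md"
    if not index.exists():
        # Placeholder only; the compilation step is expected to overwrite this.
        index.write_text("Index\n\n(Generated by the compilation step.)\n")
    return root


if __name__ == "__main__":
    print(init_kb(Path("kb")).resolve())
```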

Step 2: Seed with Initial Raw Materials

Start small. Pick one domain you know well - your current project's documentation, a research area you follow, or a technical stack you use daily. Drop 10-20 relevant files into raw/. Do not worry about format: PDFs, markdown, HTML, plain text all work. The LLM handles normalization. Start with materials you already understand so you can verify the compilation quality.

Step 3: Run the First Compilation

Use a high-quality model with a long context window. Feed it all raw files with the compilation prompt. Let it generate the wiki structure. Review the output: are topics correctly identified? Are cross-references sensible? Are there gaps? This first pass teaches you what the LLM does well and where you need to adjust the prompt. Expect to iterate 2-3 times on the compilation prompt before the output matches your expectations. Use top-rated coding models for best results on the initial compilation.

Step 4: Start Querying

Pass the entire wiki/ directory contents as context, then ask questions. Start with factual retrieval ("What do the sources say about X?") then move to synthesis ("How does X compare to Y based on what we know?") and then analysis ("Given what the KB contains, what are the biggest gaps in our understanding of Z?"). Each of these query types tests a different model capability. For data analysis and complex reasoning queries, use a top-tier reasoning model.
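A query script mostly has to do two things: concatenate the wiki pages into one context and confirm the result fits the model's window before sending it. The sketch below shows that assembly step only. The token estimate is a crude word-count heuristic (an assumption; a real tokenizer is more accurate), the page-wrapping format is one arbitrary choice among many, and the model call itself is omitted.

```python
from pathlib import Path


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming ~0.75 words per token."""
    return round(len(text.split()) / 0.75)


def load_wiki(wiki_dir: Path) -> str:
    """Concatenate every markdown page, labelled with its path so answers can cite it."""
    pages = []
    for page in sorted(wiki_dir.rglob("*.md")):
        rel = page.relative_to(wiki_dir).as_posix()
        pages.append(f'<page path="{rel}">\n{page.read_text()}\n</page>')
    return "\n\n".join(pages)


def build_query(wiki_dir: Path, question: str, context_limit: int,
                reserve_for_answer: int = 4_000) -> str:
    """Return the full prompt, or raise if the wiki will not fit in the window."""
    wiki = load_wiki(wiki_dir)
    needed = estimate_tokens(wiki) + estimate_tokens(question) + reserve_for_answer
    if needed > context_limit:
        raise ValueError(
            f"about {needed} tokens needed but the context limit is {context_limit}"
        )
    return (
        f"{wiki}\n\n"
        "Answer using only the wiki pages above and cite the page paths you relied on.\n\n"
        f"Question: {question}"
    )
```

Failing loudly when the wiki outgrows the window is deliberate: silently truncating pages would reintroduce the missing-context problem that the full-context approach is meant to avoid.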

Step 5: Close the Loop

When a query produces a valuable insight or identifies a gap, file the output back. New synthesis goes into wiki/ as a new page. Identified gaps go into a gaps.md file that guides your next round of raw/ ingestion. Over time, the wiki becomes your externalized memory - structured, searchable, and self-improving. This is the "second brain" that Karpathy describes: not a static archive, but a living knowledge system that grows smarter with every interaction.

Step 6: Automate Maintenance

Write simple scripts (bash, Python, or use an AI automation tool) that periodically lint the wiki. Check for: broken [[wiki links]], orphaned pages with no inbound links, pages not updated in 30+ days, conflicting facts between pages. Run these weekly. The linting step can use a cheaper model since it is pattern-matching rather than deep reasoning. This is where function calling capabilities shine - the model can output structured lint results that your scripts parse.
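The structural checks in that list need no model at all. Here is a minimal sketch of a linter for broken links, orphaned pages, and stale pages. It assumes Obsidian-style links that resolve by note name (`[[Page]]`, `[[Page|alias]]`, `[[Page#heading]]`) and treats a page named index as exempt from the orphan check; both are assumptions about your wiki's conventions. Detecting conflicting facts is left to an LLM pass.

```python
import re
import time
from pathlib import Path

# Captures the target of [[Target]], [[Target|alias]] and [[Target#heading]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(wiki_dir: Path, stale_days: int = 30, now=None) -> dict:
    """Report broken wiki links, pages with no inbound links, and stale pages."""
    pages = {p.stem: p for p in wiki_dir.rglob("*.md")}
    inbound = {name: 0 for name in pages}
    broken = []
    for name, path in pages.items():
        for target in LINK.findall(path.read_text()):
            target = target.strip()
            if target not in pages:
                broken.append((name, target))
            elif target != name:          # self-links do not count as inbound
                inbound[target] += 1
    now = time.time() if now is None else now
    cutoff = now - stale_days * 86_400    # seconds per day
    return {
        "broken_links": sorted(broken),
        "orphans": sorted(n for n, c in inbound.items() if c == 0 and n != "index"),
        "stale": sorted(n for n, p in pages.items() if p.stat().st_mtime < cutoff),
    }
```

The returned dictionary serializes directly to JSON, so a weekly cron job can write it to a report file or hand it to a cheaper model along with the flagged pages.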

Tool Ecosystem: What Already Exists

Karpathy's workflow maps onto existing tools, though no single tool implements the full pipeline yet:

| Tool | Pipeline Stage | What It Does |
| --- | --- | --- |
| Obsidian | Stage 3 (Visualization) | Markdown wiki viewer with graph view and backlinks |
| Claude Code / Cursor / Aider | Stage 2 + 4 (Compile + Q&A) | Code-aware LLM interfaces that can read/write local files |
| Markdownlint | Stage 6 (Linting) | Structural markdown validation |
| Obsidian Copilot / Smart Connections | Stage 4 (Q&A) | LLM-powered Q&A within Obsidian |
| GitHub Actions / Cron | Stage 5 + 6 (Loop + Lint) | Scheduled automation for compilation and health checks |
| Ollama / llama.cpp | All stages (Self-host) | Run open-weight models locally for zero API cost |

The most practical setup today: Claude Code or Cursor for compilation and Q&A (they can read entire directories), Obsidian for browsing, and a cron job for linting. This gives you the full pipeline with tools that already exist. For developer-focused use cases, explore our AI for debugging and AI for code review guides for how coding models handle complex file analysis.

Community Reactions and Extensions

Karpathy's post sparked significant discussion. The key reactions cluster around four themes:

"This is just Zettelkasten with LLMs"

Several practitioners noted similarities to the Zettelkasten method (atomic notes, cross-references, emergent structure). The key difference: in Zettelkasten, the human does all the organizing. In Karpathy's approach, the LLM handles organization while the human handles curation and queries. This dramatically lowers the activation energy for maintaining a knowledge base.

"What about hallucination in the compilation step?"

A valid concern. If the LLM introduces facts during compilation that are not in the raw sources, the entire KB becomes unreliable. The mitigation: require source attribution in every compiled page. During linting, verify that every claim maps to a raw source. Use models with strong grounding capabilities - models that score highly on factual accuracy benchmarks. See our benchmark analysis for how different models handle factual grounding.
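The cheapest part of this mitigation can be automated: confirm that every compiled page cites at least one raw file and that each cited file actually exists. The sketch below assumes pages reference sources by path, such as `raw/paper.pdf`, which is a hypothetical convention you would enforce in your compilation prompt. It checks that attribution is present and resolvable; it does not verify that individual claims are supported, which still needs a model or a human.

```python
import re
from pathlib import Path

# Matches source references written as raw/<relative path> (assumed convention).
SOURCE_REF = re.compile(r"raw/([\w./-]+)")


def check_attribution(wiki_dir: Path, raw_dir: Path) -> dict:
    """Map page name -> list of attribution problems. Pages without problems are omitted."""
    report = {}
    for page in sorted(wiki_dir.rglob("*.md")):
        if page.name == "index.md":       # the index lists topics, not sources
            continue
        # Strip a trailing period picked up when a reference ends a sentence.
        cited = {ref.rstrip(".") for ref in SOURCE_REF.findall(page.read_text())}
        problems = []
        if not cited:
            problems.append("no raw/ source cited")
        problems += [
            f"missing source: raw/{name}"
            for name in sorted(cited)
            if not (raw_dir / name).is_file()
        ]
        if problems:
            report[page.name] = problems
    return report
```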

"Why not just use NotebookLM?"

Google's NotebookLM implements parts of this workflow (document ingestion + Q&A), but it is a closed system: you cannot access the intermediate representation, customize the compilation prompt, extend the pipeline with your own tools, or self-host it. Karpathy's approach is fundamentally about owning the entire pipeline - every stage is inspectable, customizable, and portable.

"Will this replace traditional note-taking?"

Not entirely. Karpathy's approach excels at structured domain knowledge but is less suited for fleeting notes, personal reflections, or creative ideation. The most effective setup is likely a hybrid: traditional notes for thinking and creation, LLM knowledge bases for structured domain knowledge and reference material.

Privacy Considerations: Self-Hosting Your KB

A personal knowledge base potentially contains sensitive information: proprietary research, client data, personal notes, unreleased work. Sending this to a cloud API is a legitimate concern. Karpathy's architecture is fully compatible with self-hosted models:

Option 1: Fully Local with Open Weights

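Run the entire pipeline on your own hardware using open-weight models. For knowledge bases under 128K tokens, a model like Gemma 4 31B on a single RTX 4090 can handle both compilation and Q&A, but only in quantized form: 31B parameters at 16-bit precision take about 62 GB, well beyond the card's 24 GB, so a 4-bit quantization (for example a GGUF build) is what actually fits, usually with modest quality loss. Long contexts also consume VRAM for the KV cache, so larger KBs may need more aggressive quantization or more memory. Explore our open-source model rankings for the best self-hosting options.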

Option 2: Hybrid - Local for Sensitive, Cloud for Heavy

Use a local model for day-to-day Q&A queries (keeping your KB private) and a cloud model for periodic bulk compilation when you need maximum quality. The compiled wiki stays on your machine; only the raw-to-wiki compilation happens in the cloud, and you can strip sensitive details from raw files before that step.

Option 3: Enterprise API with Data Agreements

Most major providers (OpenAI, Anthropic, Google) offer enterprise tiers with data retention controls and no-training guarantees. For professional use where self-hosting is impractical, this provides a reasonable middle ground. Compare provider policies on our provider directory.

Advanced Patterns

Once the basic pipeline is running, several advanced patterns emerge:

Pattern 1: Multi-KB Composition

Maintain separate knowledge bases for different domains (e.g., "ML Research", "Product Management", "Personal Finance"). When a question spans domains, compose them by concatenating wikis into a single context. With 1M+ token context windows from models like Qwen3.6 Max Preview, you can combine multiple 200K-token KBs in a single query.
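Composition is just labelled concatenation plus a budget check. A minimal sketch, assuming each KB is a directory of markdown pages and using the same rough words-to-tokens heuristic as elsewhere in this guide (an approximation, not a tokenizer); `compose_kbs` is a hypothetical helper name.

```python
from pathlib import Path


def compose_kbs(kbs: dict, context_limit: int):
    """Concatenate several wikis under labelled sections.

    kbs maps a display name to that KB's wiki directory.
    Returns (combined_text, estimated_tokens_per_kb); raises if the total is over budget.
    """
    sections, sizes = [], {}
    for name, wiki_dir in kbs.items():
        text = "\n\n".join(p.read_text() for p in sorted(Path(wiki_dir).rglob("*.md")))
        sizes[name] = round(len(text.split()) / 0.75)   # ~0.75 words per token (assumption)
        sections.append(f"Knowledge base: {name}\n\n{text}")
    total = sum(sizes.values())
    if total > context_limit:
        raise ValueError(f"{total} estimated tokens exceed {context_limit}: {sizes}")
    return "\n\n".join(sections), sizes
```

Returning the per-KB sizes makes it easy to see which domain to drop or recompile more tightly when the combined context runs over budget.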

Pattern 2: Temporal Layers

Version your wiki with git. Each compilation pass is a commit. This gives you temporal queries: "How has our understanding of X changed over the last 3 months?" by diffing wiki versions. Git also provides a natural backup and rollback mechanism for when a compilation introduces errors.
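A thin wrapper around the git command line is enough to get one commit per compilation pass and a diff for temporal questions. This is a sketch under assumptions: it expects git to be installed with a user name and email configured, the repository root to contain the wiki/ directory, and the helper names are hypothetical. The resulting diff text is what you would pass to a model alongside a question like the one above.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside repo and return its stdout. Raises on a non-zero exit."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def commit_compilation(repo: Path, message: str) -> None:
    """Stage wiki/ and record one commit for this compilation pass.

    Note: git exits non-zero when there is nothing to commit, which raises here.
    """
    git(repo, "add", "-A", "wiki")
    git(repo, "commit", "-q", "-m", message)


def wiki_diff_since(repo: Path, when: str = "3 months ago") -> str:
    """Diff of wiki/ between the last commit before `when` and HEAD.

    Returns an empty string if the history does not reach back that far.
    """
    base = git(repo, "rev-list", "-1", f"--before={when}", "HEAD").strip()
    if not base:
        return ""
    return git(repo, "diff", base, "HEAD", "--", "wiki")
```

Rolling back a bad compilation is then an ordinary `git revert` or `git checkout` of the wiki/ directory at an earlier commit.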

Pattern 3: Active Research Agent

Extend the pipeline with a web search step: the linting stage identifies gaps, a web-search-capable model finds relevant new sources, those are filed into raw/, and the next compilation cycle incorporates them. This turns the KB into a semi-autonomous research assistant that actively seeks information to fill its own gaps. Models with function calling make this particularly clean to implement.

Pattern 4: Team Knowledge Bases

Multiple team members maintain their own raw/ directories but share a common wiki/. The compilation step merges inputs from all contributors, resolves conflicting information, and maintains a unified team knowledge base. Git merge workflows apply directly. This scales Karpathy's personal approach to small teams without introducing RAG infrastructure.

The Bigger Picture: Why This Matters

Karpathy's workflow is significant beyond personal productivity. It represents a shift in how we think about LLMs - from conversational assistants to infrastructure components. The LLM in this system is not answering questions on the fly; it is performing a well-defined compilation job with deterministic inputs and structured outputs. This is closer to a database ETL pipeline than a chatbot.

This reframing has implications for the model market:

Context window size becomes a hard requirement, not a nice-to-have. Models that cannot hold your KB in context are disqualified from the pipeline.
Output quality at long context lengths matters more than benchmark scores on short prompts. Models that degrade at 200K+ tokens are not suitable for compilation.
Pricing per input token directly affects the sustainability of daily querying. The race to zero on input pricing makes this workflow increasingly viable.
Open-weight models enable the privacy-conscious self-hosted option. Apache 2.0 licensing (like Gemma 4) is not just a philosophical preference - it is a functional requirement for sensitive KBs.
Function calling and structured output capabilities determine how well the linting and automation stages work.

Key Takeaways

Karpathy's approach uses LLMs as knowledge compilers, not chatbots - a fundamentally different mental model that produces better results.
No RAG needed for personal-scale knowledge bases. With 72 models supporting 1M+ tokens today, most compiled KBs fit entirely in context.
The six-stage pipeline (raw, compile, view, query, feedback, lint) is modular - you can start with stages 1-4 and add automation later.
Different pipeline stages benefit from different models: use top-tier models for compilation, affordable models for daily Q&A, fast models for linting.
Self-hosting with open-weight models removes per-query API costs and keeps sensitive material on your own hardware. A single consumer GPU handles personal-scale KBs.
The compilation step is the key innovation: it transforms unstructured raw materials into LLM-optimized structured knowledge.
Start small (10-20 raw files in one domain), iterate on the compilation prompt, then expand. The system improves with every query cycle.


Frequently Asked Questions

An LLM knowledge base is a system where a large language model compiles raw documents (PDFs, articles, notes) into structured markdown pages organized by topic. Instead of using traditional search or RAG, the entire compiled wiki is passed as context to the LLM for querying, giving perfect recall over your knowledge.

For personal-scale knowledge bases (under 1-2 million tokens), the entire compiled wiki fits in the context window of modern long-context models. This gives perfect recall without the lossy failure modes of chunk-based retrieval. RAG remains necessary for organizational-scale collections that exceed context limits.

Different pipeline stages benefit from different models. Compilation needs long context + high quality (128K+ tokens). Q&A needs strong reasoning. Linting can use affordable models. For self-hosted privacy, open-weight models like Gemma 4 work well. Check our live leaderboard for current rankings.

For knowledge bases under 250K tokens, free-tier models make the workflow zero-cost. At 500K tokens with budget pricing ($5/M input tokens), each query costs about $2.50 in input tokens, or about $25/day for 10 queries. Self-hosting an open-weight model on a consumer GPU eliminates per-query costs entirely.
