On April 2, 2026, Andrej Karpathy - founding member of OpenAI, former Tesla AI director, and one of the most influential voices in machine learning - shared a workflow that quietly redefines how we should think about personal knowledge management. His approach: use LLMs not as chatbots, but as compilers that transform raw information into structured, queryable knowledge bases.
The key insight that made the community sit up: no RAG needed. No vector databases, no embedding pipelines, no retrieval infrastructure. Just raw files, an LLM with a large enough context window, and markdown. The entire compiled knowledge base fits in context for modern long-context models - making the whole RAG stack unnecessary for personal-scale knowledge management.
This is not a theoretical exercise. Karpathy is actively using this system for his own work, and the architecture he describes maps directly onto tools and models available today. Below, we break down every component of his workflow, explain the technical reasoning behind each design decision, identify the best models for each step, and show exactly how to build your own.
Karpathy's Architecture: The Six-Stage Pipeline
Karpathy's system is not a single tool - it is a six-stage pipeline where each stage serves a distinct purpose. Understanding the architecture requires understanding what each stage does, why it exists, and what model capabilities it demands.
raw/ - The Ingestion Layer
A flat directory where you dump source materials in any format: PDFs, HTML pages, API docs, research papers, tweets, code snippets, transcripts. No preprocessing required - the LLM handles format normalization in the next stage. Karpathy treats this as a "read-later queue on steroids" where the barrier to adding information is zero. The key design principle: capture first, organize later. You do not curate at ingestion time because curation is exactly what the LLM is better at.
LLM Compilation - The Intelligence Layer
This is the core innovation. An LLM reads every document in raw/ and compiles them into structured markdown files organized by topic. Not summarization - compilation. The distinction matters: summarization loses detail, while compilation restructures information into a consistent format that preserves key details while adding cross-references, resolving contradictions between sources, and establishing a coherent ontology. The LLM acts as a knowledge compiler, analogous to how a C compiler transforms source code into machine code - different representation, same semantics.
Each compiled markdown page covers one topic and includes: a summary, key facts, source attribution, related topics (as wiki links), open questions, and areas where sources disagree. The LLM decides the taxonomy - you do not predefine categories.
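To make that format concrete, here is a hypothetical compiled page (the topic is drawn from this article; the raw filenames are invented for illustration), following the structure just described:

```markdown
# Full-Context vs RAG

One-line summary: For personal-scale KBs, sending the full compiled wiki as
context beats chunk-based retrieval on recall.

## Key Facts
- Compiled personal KBs typically run 500K-2M tokens.
- Retrieval can miss the paragraph that connects two separately retrieved topics.

## Sources
- raw/karpathy-thread.md
- raw/context-windows-report.html

## Related Topics
- [[Context Windows]]
- [[RAG Pipelines]]

## Open Questions
- At what KB size does a hybrid hot/cold split pay off?

## Disagreements
- (none yet)
```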
Obsidian IDE - The Visualization Layer
The compiled markdown wiki is opened in Obsidian (or any markdown editor with backlink support). This turns the flat files into a navigable knowledge graph. Obsidian's graph view shows topic connections. Its backlinks panel shows which pages reference the current topic. You can browse and edit the wiki manually, but the primary interaction mode is through the LLM in the next stage. Karpathy emphasizes that the human reads the wiki, but the LLM writes it.
Q&A Against Wiki - The Query Layer
Instead of searching your knowledge base with keywords (grep, Ctrl+F), you ask it questions in natural language. The entire compiled wiki is passed as context to an LLM, and you query it conversationally. "What do the sources say about X?" "What are the open disagreements about Y?" "How does Z relate to W?" Because the wiki is already compiled and structured, the LLM can answer with high accuracy and cite specific source pages. This is where the no-RAG claim becomes concrete: if your compiled wiki fits in the model's context window, you do not need retrieval at all. You get perfect recall over your entire knowledge base.
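The mechanics of "pass the entire wiki as context" are just string assembly. A minimal sketch, assuming a flat `wiki/` directory of markdown files (the function name and prompt wording are illustrative, not Karpathy's):

```python
from pathlib import Path

def build_query_prompt(wiki_dir: str, question: str) -> str:
    """Concatenate every compiled wiki page into one prompt, then append the question."""
    pages = []
    for path in sorted(Path(wiki_dir).glob("*.md")):
        # Label each page with its filename so the model can cite sources.
        pages.append(f"<!-- page: {path.name} -->\n{path.read_text()}")
    wiki_blob = "\n\n".join(pages)
    return (
        "You are answering questions over a compiled knowledge wiki.\n"
        "Cite the page filename for every claim.\n\n"
        f"{wiki_blob}\n\n"
        f"Question: {question}\n"
    )
```

The filename comments are what let the model answer "according to context-windows.md, ..." rather than giving unattributed claims.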
Output Filing - The Learning Loop
When the Q&A session produces new insights, synthesized conclusions, or identified gaps, those outputs are filed back into the wiki. This is what makes the system self-improving. Every interaction can expand the knowledge base. A question that reveals a gap leads to new research. A synthesis that connects two topics creates a new wiki page. Over time, the knowledge base becomes denser and more interconnected - not because you manually organized it, but because each query cycle implicitly refines the structure.
Linting & Health Checks - The Quality Layer
Automated checks run periodically across the wiki: broken links between pages, orphaned topics not connected to the graph, stale information that references outdated sources, contradictions between pages, and coverage gaps where raw/ has material that has not yet been compiled. Think of it as CI/CD for your knowledge base. The linting step can itself be LLM-powered - pass the wiki through a model and ask "what is inconsistent, outdated, or missing?" This turns the knowledge base from a static artifact into a living system with built-in quality guarantees.
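The structural half of these checks needs no LLM at all. A minimal sketch of a link-and-orphan lint over a flat wiki directory (the wiki-link regex assumes Obsidian-style `[[Target]]` links; schema is my own, not Karpathy's):

```python
import re
from pathlib import Path

# Capture the target of [[Target]], [[Target|alias]], and [[Target#heading]] links.
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def lint_wiki(wiki_dir: str) -> dict:
    """Structural lint: broken [[wiki links]] and orphaned pages with no inbound links."""
    pages = {p.stem: p.read_text() for p in Path(wiki_dir).glob("*.md")}
    linked = set()
    broken = []
    for name, text in pages.items():
        for target in WIKI_LINK.findall(text):
            target = target.strip()
            linked.add(target)
            if target not in pages:
                broken.append((name, target))
    # The index is expected to have no inbound links, so it is not an orphan.
    orphans = [n for n in pages if n not in linked and n != "index"]
    return {"broken_links": broken, "orphans": orphans}
```

Contradiction and staleness checks are the semantic half; those are where the LLM-powered pass earns its cost.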
Why This Works Without RAG
The most provocative aspect of Karpathy's approach is the claim that RAG is unnecessary for personal knowledge management. This is not anti-RAG rhetoric - it is a practical observation about how context windows have changed the math. The reasoning breaks down into three arguments:
Argument 1: Personal KBs Fit in Context
A typical personal knowledge base - even a comprehensive one covering your entire professional domain - is 500K to 2M tokens of compiled markdown. That is roughly 375K to 1.5M words, or 750 to 3,000 pages of text. Today, 41 models support 1M+ tokens and 101 models support 256K+ in our live rankings. For most people, the entire compiled wiki fits in a single context window with room to spare. RAG solves a problem that no longer exists at this scale.
Argument 2: Retrieval Introduces Lossy Failure Modes
RAG systems depend on embedding quality, chunk size, top-k selection, and re-ranking - each step can lose information. A query about "how does X relate to Y" might retrieve chunks about X and chunks about Y separately, but miss the paragraph that explicitly connects them. When the full knowledge base fits in context, the model sees everything simultaneously. Cross-references, contradictions, and implicit connections that RAG would miss become visible. For knowledge bases under 1M tokens, full-context generally dominates retrieval on recall quality.
Argument 3: The Compilation Step Does the Heavy Lifting
In a traditional RAG system, raw documents are chunked and embedded as-is. The retrieval quality depends on how well the chunks represent the original information. Karpathy's compilation step eliminates this problem: the LLM has already read every raw document, extracted the key information, and restructured it into a consistent format. The compiled wiki is already optimized for LLM consumption. It is shorter than the raw sources (compilation is lossy by intent), more structured, and internally consistent. You get the benefit of "retrieval" (smaller context) without the failure modes of chunk-based retrieval.
The boundary where this breaks down is organizational scale. A company's entire knowledge base (millions of documents) will not fit in any context window. For that, RAG remains essential. Karpathy's insight is that most individuals and small teams do not operate at that scale, and for them, the RAG infrastructure is pure overhead. See our Context Windows report for a deeper analysis of how context lengths have expanded across the industry.
The Context Window Math
How much knowledge actually fits? Here is the concrete math for different context window sizes:
| Context Window | Words | Pages (Book) | KB Size Estimate | Use Case |
|---|---|---|---|---|
| 128K tokens | ~96K | ~190 | Small domain wiki | Single project or topic area |
| 256K tokens | ~192K | ~385 | Medium KB | Professional domain coverage |
| 1M tokens | ~750K | ~1,500 | Large personal KB | Multi-domain expert knowledge |
| 2M tokens | ~1.5M | ~3,000 | Comprehensive KB | Team or organizational wiki |
To put this in perspective: the entire Wikipedia article on "Machine Learning" (including all sub-sections) is about 15K tokens, so roughly 17 such articles would fill a 256K window uncompiled. Compilation is what changes the math - each source is distilled to a fraction of its original length, so hundreds of topics fit. Most professionals' accumulated domain knowledge - even in deep technical fields - compiles to under 500K tokens of structured markdown.
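The conversions in the table above follow from two rules of thumb - roughly 0.75 words per token and 500 words per book page. A sketch of that arithmetic (the constants are the common estimates, not exact for any tokenizer):

```python
def kb_capacity(context_tokens: int, words_per_token: float = 0.75,
                words_per_page: int = 500) -> dict:
    """Rough capacity of a context window, using common words-per-token estimates."""
    words = int(context_tokens * words_per_token)
    return {"words": words, "book_pages": words // words_per_page}
```

For real budgeting, measure with the tokenizer of the model you actually use; markdown tables and code snippets tokenize less efficiently than prose.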
Best Models for Knowledge Base Workflows
Different stages of Karpathy's pipeline have different model requirements. Here is how the current leaderboard maps to each stage:
For Stage 2 (Compilation): Long Context + High Quality
The compilation step needs to read long raw documents and produce structured output. This requires both a large context window and strong instruction-following. The model must maintain coherence across hundreds of pages of input while producing well-organized markdown.
Models ranked by composite score among those with 128K+ context windows. Explore all models.
For Stage 4 (Q&A): Reasoning + Full Context
The Q&A stage needs strong reasoning over the entire compiled wiki. This is where top coding models excel - they already handle complex multi-file reasoning tasks.
Top 5 by composite score. Full leaderboard.
For Daily Use: Cost-Effective Options
Running queries against your KB multiple times per day adds up. These models offer strong performance at minimal cost - critical for making the workflow sustainable. See our pricing report for the full cost landscape.
Free tier models. Filter by price.
Model Spotlight: Which LLM for Which Task?
Rather than using one model for everything, Karpathy's approach benefits from task-specific model selection. Different pipeline stages have different bottlenecks:
| Pipeline Stage | Key Requirement | Top Pick | Score | Why |
|---|---|---|---|---|
| Compilation | Long context + structure | GPT-5.4 Pro | 91 | Best score among 128K+ context models |
| Q&A / Analysis | Reasoning depth | GPT-5.4 Pro | 91 | #1 overall for complex reasoning |
| Linting | Fast + affordable | Gemma 4 31B | 87 | Good quality at low cost for batch jobs |
| Tool Building | Coding ability | GPT-5.4 Pro | 91 | Highest benchmark scores for code generation |
| Self-Hosting | Open weights + privacy | Gemma 4 31B | 87 | Apache 2.0, runs on consumer GPUs |
For a deeper look at open-weight options suitable for self-hosted knowledge base systems, see our Gemma 4 Review and explore the open-source model rankings.
Deep Dive: The Compilation Step
The compilation step is the most technically demanding part of the pipeline, and where Karpathy's insight is sharpest. Traditional note-taking tools put the organizational burden on the human: you read a paper, decide what matters, create a page, file it in a folder, and link it to related notes. This is cognitively expensive and breaks down at scale.
Karpathy's approach inverts the workflow: you dump everything into raw/, and the LLM makes all organizational decisions. The prompt for the compilation step typically includes:
Example Compilation Prompt Structure:
1. Read all files in raw/
2. Identify distinct topics and concepts
3. For each topic, create a markdown page with:
- Title and one-line summary
- Key facts and details (preserve specifics)
- Source attribution (which raw files)
- Related topics (as [[wiki links]])
- Open questions or gaps
- Areas where sources disagree
4. Create an index.md with topic hierarchy
5. Flag any raw files that don't fit existing topics
The critical insight: this prompt runs incrementally. When new raw files are added, the LLM receives the existing wiki plus the new files and updates the wiki accordingly. It may create new pages, update existing ones, or restructure the taxonomy entirely. This is where high output capacity matters - the model needs to generate potentially dozens of updated pages in a single pass.
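An incremental compilation pass is again mostly prompt assembly: the standing instructions, the current wiki, and only the new raw files. A minimal sketch (function name and comment markers are illustrative):

```python
from pathlib import Path

def build_compile_prompt(wiki_dir: str, new_raw_files: list[str],
                         instructions: str) -> str:
    """Incremental compilation: existing wiki + new raw files + standing instructions."""
    wiki = "\n\n".join(
        f"<!-- wiki: {p.name} -->\n{p.read_text()}"
        for p in sorted(Path(wiki_dir).glob("*.md"))
    )
    raw = "\n\n".join(
        f"<!-- raw: {Path(f).name} -->\n{Path(f).read_text()}"
        for f in new_raw_files
    )
    return f"{instructions}\n\n## Existing wiki\n{wiki}\n\n## New raw material\n{raw}\n"
```

Passing the existing wiki alongside the new material is what lets the model update or restructure pages rather than blindly appending.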
Models with high max output tokens are essential here. Among current models tracked in our rankings, GPT-5.4 Pro leads with strong output capacity. See our max tokens guide for a detailed comparison of output limits across models.
RAG vs Full-Context: When to Use What
Karpathy's approach is not universally superior to RAG. The right architecture depends on your scale and requirements:
| Dimension | Full-Context (Karpathy) | RAG Pipeline |
|---|---|---|
| Scale | Personal / small team (under 2M tokens compiled) | Organizational / enterprise (unlimited) |
| Recall | Perfect - model sees everything | Dependent on embedding/retrieval quality |
| Cross-referencing | Excellent - all context visible | Limited to retrieved chunks |
| Infrastructure | Zero - just files + LLM API | Vector DB, embedding pipeline, re-ranker |
| Cost per query | Higher (full KB sent each time) | Lower (only relevant chunks) |
| Latency | Higher (processing full context) | Lower (smaller prompt) |
| Setup time | Minutes - create folders, write prompt | Hours/days - choose stack, configure pipeline |
| Maintenance | Near zero - just add files | Re-embedding, index updates, chunk tuning |
The sweet spot for Karpathy's approach: individuals and small teams with domain-specific knowledge bases under 500K compiled tokens. At this scale, the simplicity advantage is overwhelming. For larger collections, consider a hybrid: compile and use full-context for your "hot" knowledge (active projects, current research), and RAG for your "cold" archive.
Cost Analysis: Running a Knowledge Base
The full-context approach sends your entire wiki with every query. Here is what a single uncached query costs with current API pricing across different knowledge base sizes:

| KB Size | Free Tier | Budget ($5/M input), per query | Premium ($15/M input), per query |
|---|---|---|---|
| 100K tokens | $0 | $0.50 | $1.50 |
| 250K tokens | $0 | $1.25 | $3.75 |
| 500K tokens | Rate limited | $2.50 | $7.50 |
| 1M tokens | Exceeds limits | $5.00 | $15.00 |

For knowledge bases under 250K tokens, free-tier models make this workflow essentially zero-cost. At 500K tokens with a budget model, 10 daily queries come to about $25/day - roughly $750/month without caching. Prompt caching, which most major providers now offer at a steep discount for repeated input, cuts that substantially, since the wiki portion of the prompt is identical across queries. Compare this against the infrastructure costs of running a RAG pipeline (vector DB hosting, embedding compute, maintenance time). See our pricing trends report for the broader context of API cost evolution.
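The core of that cost model is one multiplication - the whole KB is billed as input on every uncached query. A sketch for budgeting (ignores output tokens and caching, which both change the picture):

```python
def daily_query_cost(kb_tokens: int, queries_per_day: int,
                     price_per_m_input: float) -> float:
    """Uncached full-context cost: the entire KB is sent as input on every query."""
    return kb_tokens / 1_000_000 * price_per_m_input * queries_per_day
```

Caching flips the dominant term: once the wiki prefix is cached, you pay the full rate only on the first query and a discounted rate on repeats.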
Self-hosting eliminates per-query API costs entirely. An open-weight model like Gemma 4 31B running on a consumer GPU (RTX 4090, ~$1,500) has near-zero marginal cost per query beyond electricity. For heavy users doing 50+ queries per day, self-hosting pays for itself within months. Read our Gemma 4 deep dive for hardware requirements and deployment options.
Building Your Own: Step-by-Step
Here is a concrete implementation path based on Karpathy's architecture, using tools and models available today:
Step 1: Set Up the Directory Structure
kb/
raw/ # Drop files here
wiki/ # LLM-compiled output
wiki/index.md # Auto-generated topic index
prompts/ # Your compilation/lint prompts
scripts/ # Automation scripts
Open the wiki/ directory in Obsidian (free). Enable backlinks and graph view in Obsidian settings.
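Scripting the skeleton keeps it reproducible across machines. A minimal sketch in Python (the layout matches Step 1; the seed index content is my own):

```python
from pathlib import Path

def init_kb(root: str = "kb") -> None:
    """Create the Step 1 directory skeleton, idempotently."""
    for sub in ("raw", "wiki", "prompts", "scripts"):
        Path(root, sub).mkdir(parents=True, exist_ok=True)
    index = Path(root, "wiki", "index.md")
    if not index.exists():
        # Seed an index so Obsidian has something to open on first launch.
        index.write_text("# Knowledge Base Index\n")
```

Running it again on an existing KB is a no-op, so it doubles as a sanity check before each compilation pass.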
Step 2: Seed with Initial Raw Materials
Start small. Pick one domain you know well - your current project's documentation, a research area you follow, or a technical stack you use daily. Drop 10-20 relevant files into raw/. Do not worry about format: PDFs, markdown, HTML, plain text all work. The LLM handles normalization. Start with materials you already understand so you can verify the compilation quality.
Step 3: Run the First Compilation
Use a high-quality model with a long context window. Feed it all raw files with the compilation prompt. Let it generate the wiki structure. Review the output: are topics correctly identified? Are cross-references sensible? Are there gaps? This first pass teaches you what the LLM does well and where you need to adjust the prompt. Expect to iterate 2-3 times on the compilation prompt before the output matches your expectations. Use top-rated coding models for best results on the initial compilation.
Step 4: Start Querying
Pass the entire wiki/ directory contents as context, then ask questions. Start with factual retrieval ("What do the sources say about X?") then move to synthesis ("How does X compare to Y based on what we know?") and then analysis ("Given what the KB contains, what are the biggest gaps in our understanding of Z?"). Each of these query types tests a different model capability. For data analysis and complex reasoning queries, use a top-tier reasoning model.
Step 5: Close the Loop
When a query produces a valuable insight or identifies a gap, file the output back. New synthesis goes into wiki/ as a new page. Identified gaps go into a gaps.md file that guides your next round of raw/ ingestion. Over time, the wiki becomes your externalized memory - structured, searchable, and self-improving. This is the "second brain" that Karpathy describes: not a static archive, but a living knowledge system that grows smarter with every interaction.
Step 6: Automate Maintenance
Write simple scripts (bash, Python, or use an AI automation tool) that periodically lint the wiki. Check for: broken [[wiki links]], orphaned pages with no inbound links, pages not updated in 30+ days, conflicting facts between pages. Run these weekly. The linting step can use a cheaper model since it is pattern-matching rather than deep reasoning. This is where function calling capabilities shine - the model can output structured lint results that your scripts parse.
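When the lint pass uses function calling, your script only needs to parse the structured result. A sketch assuming a simple report schema of my own invention (a JSON list of `{page, issue, severity}` objects - adapt to whatever schema you declare in your tool definition):

```python
import json

def parse_lint_report(model_output: str) -> list[dict]:
    """Parse a JSON lint report; drop entries missing the assumed required keys."""
    issues = json.loads(model_output)
    required = {"page", "issue", "severity"}
    # Keep only well-formed entries so a partially malformed response degrades gracefully.
    return [i for i in issues if required <= i.keys()]
```

Declaring the schema up front in the function-calling tool definition is what makes the model's output reliably machine-parseable here.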
Tool Ecosystem: What Already Exists
Karpathy's workflow maps onto existing tools, though no single tool implements the full pipeline yet:
| Tool | Pipeline Stage | What It Does |
|---|---|---|
| Obsidian | Stage 3 (Visualization) | Markdown wiki viewer with graph view and backlinks |
| Claude Code / Cursor / Aider | Stage 2 + 4 (Compile + Q&A) | Code-aware LLM interfaces that can read/write local files |
| Markdownlint | Stage 6 (Linting) | Structural markdown validation |
| Obsidian Copilot / Smart Connections | Stage 4 (Q&A) | LLM-powered Q&A within Obsidian |
| GitHub Actions / Cron | Stage 5 + 6 (Loop + Lint) | Scheduled automation for compilation and health checks |
| Ollama / llama.cpp | All stages (Self-host) | Run open-weight models locally for zero API cost |
The most practical setup today: Claude Code or Cursor for compilation and Q&A (they can read entire directories), Obsidian for browsing, and a cron job for linting. This gives you the full pipeline with tools that already exist. For developer-focused use cases, explore our AI for debugging and AI for code review guides for how coding models handle complex file analysis.
Community Reactions and Extensions
Karpathy's post sparked significant discussion. The key reactions cluster around four themes:
"This is just Zettelkasten with LLMs"
Several practitioners noted similarities to the Zettelkasten method (atomic notes, cross-references, emergent structure). The key difference: in Zettelkasten, the human does all the organizing. In Karpathy's approach, the LLM handles organization while the human handles curation and queries. This dramatically lowers the activation energy for maintaining a knowledge base.
"What about hallucination in the compilation step?"
A valid concern. If the LLM introduces facts during compilation that are not in the raw sources, the entire KB becomes unreliable. The mitigation: require source attribution in every compiled page. During linting, verify that every claim maps to a raw source. Use models with strong grounding capabilities - models that score highly on factual accuracy benchmarks. See our benchmark analysis for how different models handle factual grounding.
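The cheapest version of that lint check is purely mechanical: flag any compiled page that lacks a Sources section at all. A sketch, assuming pages use a markdown `## Sources` heading as in the compilation prompt (the heading convention is an assumption about your own output format):

```python
import re
from pathlib import Path

# Matches a "Sources" heading at any level, e.g. "## Sources" or "### Sources".
SOURCES_HEADING = re.compile(r"^#+\s*Sources\b", re.MULTILINE)

def pages_missing_sources(wiki_dir: str) -> list[str]:
    """Return compiled pages with no Sources section - candidates for hallucinated content."""
    missing = []
    for p in sorted(Path(wiki_dir).glob("*.md")):
        if p.stem == "index":
            continue  # the index aggregates other pages and needs no attribution
        if not SOURCES_HEADING.search(p.read_text()):
            missing.append(p.name)
    return missing
```

Verifying that each listed source actually supports each claim is the harder, LLM-powered half of the check; this only guarantees attribution exists at all.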
"Why not just use NotebookLM?"
Google's NotebookLM implements parts of this workflow (document ingestion + Q&A), but it is a closed system: you cannot access the intermediate representation, customize the compilation prompt, extend the pipeline with your own tools, or self-host it. Karpathy's approach is fundamentally about owning the entire pipeline - every stage is inspectable, customizable, and portable.
"Will this replace traditional note-taking?"
Not entirely. Karpathy's approach excels at structured domain knowledge but is less suited for fleeting notes, personal reflections, or creative ideation. The most effective setup is likely a hybrid: traditional notes for thinking and creation, LLM knowledge bases for structured domain knowledge and reference material.
Privacy Considerations: Self-Hosting Your KB
A personal knowledge base potentially contains sensitive information: proprietary research, client data, personal notes, unreleased work. Sending this to a cloud API is a legitimate concern. Karpathy's architecture is fully compatible with self-hosted models:
Option 1: Fully Local with Open Weights
Run the entire pipeline on your own hardware using open-weight models. For knowledge bases under 128K tokens, a model like Gemma 4 31B on a single RTX 4090 handles both compilation and Q&A. For larger KBs, quantized versions (INT4/GGUF) fit in less VRAM with minimal quality loss. Explore our open-source model rankings for the best self-hosting options.
Option 2: Hybrid - Local for Sensitive, Cloud for Heavy
Use a local model for day-to-day Q&A queries (keeping your KB private) and a cloud model for periodic bulk compilation when you need maximum quality. The compiled wiki stays on your machine; only the raw-to-wiki compilation happens in the cloud, and you can strip sensitive details from raw files before that step.
Option 3: Enterprise API with Data Agreements
Most major providers (OpenAI, Anthropic, Google) offer enterprise tiers with data retention controls and no-training guarantees. For professional use where self-hosting is impractical, this provides a reasonable middle ground. Compare provider policies on our provider directory.
Advanced Patterns
Once the basic pipeline is running, several advanced patterns emerge:
Pattern 1: Multi-KB Composition
Maintain separate knowledge bases for different domains (e.g., "ML Research", "Product Management", "Personal Finance"). When a question spans domains, compose them by concatenating wikis into a single context. With 1M+ token context windows from models like Qwen3.6 Plus (free), you can combine multiple 200K-token KBs in a single query.
Pattern 2: Temporal Layers
Version your wiki with git. Each compilation pass is a commit. This gives you temporal queries: "How has our understanding of X changed over the last 3 months?" by diffing wiki versions. Git also provides a natural backup and rollback mechanism for when a compilation introduces errors.
Pattern 3: Active Research Agent
Extend the pipeline with a web search step: the linting stage identifies gaps, a web-search-capable model finds relevant new sources, those are filed into raw/, and the next compilation cycle incorporates them. This turns the KB into a semi-autonomous research assistant that actively seeks information to fill its own gaps. Models with function calling make this particularly clean to implement.
Pattern 4: Team Knowledge Bases
Multiple team members maintain their own raw/ directories but share a common wiki/. The compilation step merges inputs from all contributors, resolves conflicting information, and maintains a unified team knowledge base. Git merge workflows apply directly. This scales Karpathy's personal approach to small teams without introducing RAG infrastructure.
The Bigger Picture: Why This Matters
Karpathy's workflow is significant beyond personal productivity. It represents a shift in how we think about LLMs - from conversational assistants to infrastructure components. The LLM in this system is not answering questions on the fly; it is performing a well-defined compilation job with deterministic inputs and structured outputs. This is closer to a database ETL pipeline than a chatbot.
This reframing has implications for the model market: compilation rewards long context and high output capacity, querying rewards reasoning depth, and linting rewards low cost - so demand fragments across task-specific models rather than concentrating on a single flagship, exactly the selection logic the spotlight table above reflects.
Key Takeaways
Related Reading
How context windows expanded from 4K to 2M tokens and what it means for practical applications.
Token pricing has plummeted - making full-context knowledge base queries increasingly affordable.
Apache 2.0 open weights for self-hosted knowledge base pipelines with zero API cost.
The largest context window in the market - fits even comprehensive multi-domain knowledge bases.
MoE architectures deliver frontier performance at lower cost - ideal for frequent KB queries.
Top coding models ranked - the same models that excel at code also excel at knowledge compilation.