There’s an assumption in long-form AI generation that goes something like this: once you exceed a few thousand words, you need RAG. Embeddings. A vector store. Some retrieval pipeline to keep the model from contradicting itself by scene 20.
We built a fiction generation pipeline that produces manuscripts up to novel length across seven genres and forty-scene arcs. After fifty-plus runs — the most recent of which came in at 90,000 words — we don’t use any of that. No embeddings. No vector database. No retrieval model.
But we do generate every scene more than once. We run a surgical editor over the output. We track foreshadow-payoff chains across the arc and measure tonal drift with arithmetic. The pipeline spends most of its budget on generation and curation. None of it on retrieval.
This isn’t a story about efficiency for its own sake. The infrastructure is lean because the generative process is deliberately excessive — and the budget has to go somewhere.
I. The Thesis
LLMs are fluent. They are not, by default, creative. Under pressure, they regress toward domain prototypes — safe phrasing, familiar arcs, hedonic drift. The research calls it Galton-style regression to the mean. We’ve measured it: left uncompensated, a long-arc novel drifts measurably toward positivity regardless of target tone. The model wants to resolve. It wants to comfort. The alignment training that makes it helpful makes it predictable.
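The arithmetic really is arithmetic. A minimal sketch of the kind of measurement involved, assuming each scene already carries a valence score in [-1, 1] produced elsewhere (say, by a scoring prompt); the function names and numbers here are invented for illustration:

```python
from statistics import mean

def tonal_drift(scene_valences: list[float], target: float) -> float:
    """Signed gap between the manuscript's running tone and its target.

    Positive drift means the story is trending more positive than
    intended, which is the regression-to-comfort failure mode.
    """
    return mean(scene_valences) - target

# A bleak arc targeting -0.4 whose scenes keep resolving upward:
scenes = [-0.5, -0.4, -0.2, -0.1, 0.1]
drift = tonal_drift(scenes, target=-0.4)
# drift > 0: a correction note would be injected into the next
# scene's context to push the tone back toward the target.
```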
The intervention can’t be a better prompt. It has to be structural. A pipeline where the default mode is overproduction — generate more than you keep, push past the tone floor by design, treat the editorial pass as the creative act rather than the generation.
That pipeline needs to spend its compute on generation passes, not on retrieval infrastructure. Every dollar that goes to a vector database is a dollar that doesn’t go to a second prose refinement pass or a surgical editor. The architecture has to be lean where it doesn’t matter so it can be excessive where it does.
II. Where the Excess Lives
The pipeline spends its generation budget on repetition. Each scene passes through more than one stage before it’s done: content generation under worldview constraint, a separate prose refinement pass, a structural audit over the complete manuscript, and a surgical editor that makes targeted fixes while keeping the output highly similar to the draft. That’s not efficiency. It’s deliberate overproduction with progressive curation.
The clearest example is the split between content and style. The first pass focuses on what happens, who acts, what changes. The second pass takes that draft and refines it for prose register — sentence rhythm, diction, the shape of interiority. Two passes, two temperatures, two objectives. The model doesn’t have to simultaneously satisfy worldview logic and prose aesthetics in a single generation. It produces more text than survives, and the refinement pass is where the style crystallizes. Most of what gets improved in the second pass isn’t the events. It’s the inside of the events — how the character experiences what’s happening.
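A minimal sketch of that split, with the model call left abstract. The prompts, temperatures, and function names are assumptions for illustration, not the pipeline's actual interface:

```python
from typing import Callable

def generate_scene(beat: str,
                   worldview: str,
                   call_model: Callable[[str, float], str]) -> str:
    """Two passes, two temperatures, two objectives (illustrative)."""
    # Pass 1: content under worldview constraint. What happens, who
    # acts, what changes. Hotter sampling for invention.
    draft = call_model(
        f"Worldview: {worldview}\nBeat: {beat}\nWrite the scene events.",
        0.9,
    )
    # Pass 2: prose refinement. Rhythm, diction, interiority. Cooler
    # sampling: the events are fixed, only the style moves.
    return call_model(
        f"Refine this draft's prose without changing its events:\n{draft}",
        0.4,
    )

# A toy stand-in for the model that tags its input makes the
# two-stage flow visible without a real API call:
def toy_model(prompt: str, temperature: float) -> str:
    return f"[t={temperature}] {prompt.splitlines()[-1][:20]}"
```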
The editor layer operates after generation, not during. It expects the output to contain tics — diagnostic appositives, explain-after-render constructions, essay endings — and it removes them surgically rather than constraining the model’s fluency at generation time. The creative act is the cut, not the first draft. In practice, the editor doesn’t rewrite prose. It removes the three words the model used to explain what it had already shown.
III. Where the Efficiency Lives
The infrastructure that supports all that generation is minimal. The context management layer — the part that would traditionally justify a vector database — is a structured in-memory lorebook. Pydantic models. Deterministic filtering. A token budget that grows logarithmically with scene count, not linearly.
It works because in sequential fiction generation, relevance is structurally determined. Each scene has a beat from the plot engine. The beat specifies which characters are present, which location, which dramatic question is active. The lorebook uses this to select context: entities matching the scene’s characters, knowledge states for who’s in the room, relationships between them, tiered summaries of prior scenes (recent full, mid-distance compressed, far-distance as one-liners), and a drift note if the tone is off-target.
No cosine similarity. No top-k retrieval. The beat tells you what’s relevant.
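A sketch of what beat-driven selection might look like. The source describes Pydantic models; plain dataclasses stand in here, and every field name is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Beat:
    scene_number: int
    characters: list[str]
    location: str
    dramatic_question: str

@dataclass
class Lorebook:
    entities: dict[str, str]             # character -> profile
    knowledge: dict[str, set[str]]       # character -> facts they know
    relationships: dict[frozenset, str]  # character pair -> note

    def context_for(self, beat: Beat) -> dict:
        """Select context deterministically: the beat is the query."""
        present = set(beat.characters)
        return {
            "entities": {c: self.entities[c] for c in beat.characters},
            "knowledge": {c: sorted(self.knowledge.get(c, set()))
                          for c in beat.characters},
            "relationships": {
                tuple(sorted(pair)): note
                for pair, note in self.relationships.items()
                if pair <= present  # both parties are in the room
            },
            "location": beat.location,
            "question": beat.dramatic_question,
        }
```

No scoring, no ranking: everything the filter needs is already in the plan.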
The tiered compression is the clever part, and it’s also the boring part. Old scenes collapse into short summaries. Recent scenes stay detailed. The model doesn’t need a full account of scene 4 when it’s generating scene 37 — it needs to know that scene 4 happened, and roughly what it established. Compression serves recency without losing continuity. The total cost of the history layer grows much slower than the story does.
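A sketch of that tiering, with the distance boundaries (five and fifteen scenes back) and compression rules invented for illustration:

```python
def tiered_history(summaries: list[str], current: int) -> list[str]:
    """Recent scenes full, mid-distance compressed, far as one-liners.

    summaries[i] holds the full summary of scene i + 1.
    """
    out = []
    for i, full in enumerate(summaries[:current - 1]):
        distance = current - (i + 1)
        if distance <= 5:
            out.append(full)                      # recent: full summary
        elif distance <= 15:
            out.append(full[:120])                # mid: truncated
        else:
            out.append(full.split(".")[0] + ".")  # far: first sentence
    return out
```

Generating scene 37, the model sees scene 36 in full and scene 4 as a single sentence: enough to know it happened and roughly what it established.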
This is lean. Intentionally. The lorebook’s job is to not be the bottleneck. It keeps the model oriented — who’s here, what they know, what happened recently, whether the tone is drifting — using structured data that fits in a fraction of the context window. The remaining window goes to the beat, the worldview anchor, the tonal constraints, the foreshadow tracking. The things that shape what the model produces, not what it remembers.
IV. What the Data Showed
We’ve measured this architecture against its earlier versions. The clearest lift came from introducing the lorebook at all — the baseline drift toward positive-tone resolution dropped from severe to mild. Adding knowledge gates — constraints that prevent characters from referencing things they haven’t learned yet — moved the result into the target range. A character referencing a document he wouldn’t read for another five scenes stops being a consistency bug and starts being an impossibility.
“Does character X know fact Y as of scene Z?” is a database query, not a similarity search.
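That query can be sketched as a plain lookup; the record structure and the example facts are assumptions for illustration:

```python
# (character, fact) -> scene in which the fact was learned
learned_at: dict[tuple[str, str], int] = {
    ("Thorn", "the archive exists"): 4,
    ("Thorn", "the document's contents"): 42,
}

def knows(character: str, fact: str, as_of_scene: int) -> bool:
    """Knowledge gate: has this character learned this fact yet?"""
    scene = learned_at.get((character, fact))
    return scene is not None and scene <= as_of_scene

# In scene 37, Thorn cannot reference the document he reads in scene 42.
assert knows("Thorn", "the archive exists", 37)
assert not knows("Thorn", "the document's contents", 37)
```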
The word count escalation problem — the model’s tendency to make every scene bigger than the last — was solved differently. Not with retrieval, but with dynamic ceilings that scale the generation budget to arc position. Early scenes get room. Mid-arc scenes stay tight. Climactic scenes get room again. Runaway growth collapses into controlled escalation that follows dramatic structure.
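One plausible shape for such a ceiling. Only the roomy-tight-roomy curve follows the text; the thresholds and word counts are invented:

```python
def scene_word_ceiling(scene: int, total: int,
                       base: int = 1800, peak: int = 2600) -> int:
    """Word budget keyed to arc position (illustrative numbers)."""
    position = scene / total          # 0..1 through the arc
    if position <= 0.15 or position >= 0.85:
        return peak                   # setup and climax get room
    return base                       # mid-arc stays tight
```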
None of this required retrieval. All of it required structure.
V. Where This Breaks
Honest answer about when you do need retrieval:
Scene count past a hundred. One-line summaries lose signal at extreme distance. At some point, “Thorn discovers the archive” tells you that something happened but not why it matters 60 scenes later. Semantic retrieval over full summaries becomes necessary when positional context alone can’t answer “find the scene where X happened.”
Large casts without tiering. Relationship tracking is quadratic — a big cast means a lot of pairs. Knowledge gates explode similarly. Character tiering (protagonists get full tracking, extras get a one-liner) solves this within a deterministic framework. Untiered large casts blow the budget.
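The blow-up is plain binomial counting. A sketch, with the cast sizes invented for illustration:

```python
from math import comb

def tracked_pairs(full_tier: int, total_cast: int) -> tuple[int, int]:
    """Relationship pairs tracked with tiering vs. without."""
    return comb(full_tier, 2), comb(total_cast, 2)

# Six fully-tracked protagonists out of a thirty-character cast:
tiered, untiered = tracked_pairs(full_tier=6, total_cast=30)
# 15 pairs get full relationship tracking instead of 435.
```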
Series continuity. The lorebook is per-story. Cross-manuscript queries — “what did we establish about this city in book 2?” — are genuine retrieval problems.
Non-linear timelines. Flashbacks and frame stories create three-way knowledge splits: what has happened, what has been presented, what each character knows. Deterministic filtering assumes sequential presentation. Non-linear structures need retrieval patterns that don’t map to recency tiers.
These are real limits. We’ll hit some of them eventually. When we do, the lorebook’s query interface is clean enough to slot a retrieval backend behind it without changing the generation pipeline. The infrastructure adapts; the creative architecture doesn’t need to.
VI. The Principle
RAG solves a specific problem: I have a large corpus and I don’t know which parts are relevant to this query. In structured fiction generation with a plot engine, you always know which parts are relevant. The beat is the query. The plan is the index.
But the deeper point isn’t about retrieval versus filtering. It’s about where compute should go when the goal is creative quality rather than creative safety.
If the model’s default failure mode is regression to the mean — safe phrasing, predictable arcs, hedonic resolution — then the intervention is at the generative layer. More passes. Deliberate overproduction. Surgical curation. The editorial cut as the creative act. That’s where the LLM calls need to go.
A lean context layer makes that possible. Every token you don’t spend on retrieval infrastructure is a token you can spend on a second prose pass, a foreshadow-payoff chain, a hope drift correction, an editor that removes the three words the model used to explain what it had already shown.
The infrastructure you don’t build is infrastructure that doesn’t compete with the work.
Fifty-plus runs. Seven genres. 90,000 words at the long end. The pipeline that produced The Unfinished Sentence, The Assayed Compact, The Residency, and The Merritt Certification — all with zero retrieval queries.