fastpaca
Users expect full message history. LLMs have context limits.
Fastpaca handles both automatically.
7.5k req/s (3 nodes) · p99 < 150 ms · Apache 2.0
╔══════════╗         ╔═ fastpaca ══════════════════════════╗
║          ║░        ║  ┏━━━━━━━━━━━┓     ┏━━━━━━━━━━━┓    ║░
║  client  ║░──API──▶║  ┃  Message  ┃────▶┃  Context  ┃    ║░
║          ║░        ║  ┃  History  ┃     ┃  Policy   ┃    ║░
╚══════════╝░        ║  ┗━━━━━━━━━━━┛     ┗━━━━━━━━━━━┛    ║░
 ░░░░░░░░░░░░        ╚═════════════════════════════════════╝░
                      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Users expect nothing to disappear from the history.
Overrun the limit and the provider rejects the call.
Fastpaca manages all of it automatically for you.
Every context sets its token budget and compaction policy up front. Use built-ins or roll your own.
const ctx = await fastpaca.context('chat_42', {
  budget: 1_000_000,
  trigger: 0.7,
  policy: { strategy: 'last_n', config: { limit: 400 } }
});

Append any message from your LLMs or your users.
await ctx.append({
  role: 'user',
  parts: [{ type: 'text', text: 'What changed in the latest release?' }]
});

Fetch the compacted context and hand it to your LLM.
const stream = ctx.stream((messages) => streamText({
  model: openai('gpt-4o-mini'),
  messages
}));

return stream.toResponse();

Set the policy to `manual` and use `needsCompaction` to check whether you have hit the configured budget, then manage compaction yourself.
const { needsCompaction, messages } = await ctx.context();

if (needsCompaction) {
  const { summary, remainingMessages } = await summarise(messages);
  await ctx.compact([
    { role: 'system', parts: [{ type: 'text', text: summary }] },
    ...remainingMessages
  ]);
}

Bring your own framework. Works natively with ai-sdk. Use LangChain, raw OpenAI/Anthropic calls, whatever you fancy.
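If you skip ai-sdk, the pattern stays the same: fetch the window, call your provider, append the reply. A minimal sketch with the raw OpenAI SDK, assuming ctx.context() returns the part-based messages shown above and flattening text parts into OpenAI's chat format:

import OpenAI from 'openai';

const client = new OpenAI();

// Assumption: ctx.context() returns the same part-based messages used above.
const { messages } = await ctx.context();

// Flatten text parts into OpenAI's chat format and call the model directly.
const completion = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: messages.map((m) => ({
    role: m.role,
    content: m.parts
      .filter((p) => p.type === 'text')
      .map((p) => p.text)
      .join('\n')
  }))
});

// Append the assistant reply so the durable log stays complete.
await ctx.append({
  role: 'assistant',
  parts: [{ type: 'text', text: completion.choices[0].message.content ?? '' }]
});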
Distributed consensus, idempotent appends, automatic failover. Scale nodes horizontally without risk.
Enforce token budgets with built-in compaction policies. Stay within limits automatically.
Single container to start. Add nodes to form a cluster with automatic failover. Optional Postgres write-behind.
Bring your own vector store to complement your LLM. Fastpaca manages conversation state, not embeddings.
Built specifically for LLMs. Optimized for token budgets and context windows.
Use it alongside whichever framework you prefer. Fastpaca handles context, you handle orchestration.
Self-host on your laptop, your Kubernetes cluster, or your VPC. Full ownership and control.
A durable log of messages and an LLM context window (the slice you send to your LLM). The log is append‑only; the window respects your token budget and policy. Learn more.
No. Fastpaca is backend‑only. Your server appends/fetches from Fastpaca and then calls your LLM provider directly. Quick Start.
You set a token budget and a trigger ratio (default 0.7). When usage crosses the trigger, the compaction policy of your choosing automatically compacts the context window contents. Details.
`last_n`: keep the latest N messages; `skip_parts`: drop tool* and reasoning parts, then apply `last_n`; `manual`: keep everything until the trigger trips, then decide how to rewrite. Strategies.
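A minimal sketch of a skip_parts context, assuming it takes the same limit-style config as the last_n example above:

// Assumption: skip_parts takes the same limit-style config as last_n.
const ctx = await fastpaca.context('support_7', {
  budget: 200_000,
  trigger: 0.7,
  policy: { strategy: 'skip_parts', config: { limit: 200 } }
});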
Yes. Update the context with a new policy and future compactions will use it. The message log remains intact. Changing policies.
Use `ctx.stream(...)` with your LLM call in ai-sdk. It forwards the window to your LLM and appends streamed parts back into the context. In other languages you will have to roll your own. Streaming.
Use `idempotency_key` for retries and `if_version` for optimistic concurrency (409 on mismatch). API reference.
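A minimal sketch of a guarded append. The field names come from the API above, but how the TypeScript client exposes them (here as a second options argument) and the version field on the window are assumptions:

// Assumption: the window reports its current version and append() accepts an options argument.
const { version } = await ctx.context();

try {
  await ctx.append(
    { role: 'user', parts: [{ type: 'text', text: 'Retry me safely' }] },
    { idempotency_key: 'append-7f3a', if_version: version }
  );
} catch (err) {
  // 409 means another writer appended first: re-read the context and retry.
}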
Store everything. Budget tokens. Compact automatically. Move on to building your product.
Use ai-sdk for inference. Use Fastpaca for context state. Bring your own LLM, framework, and frontend.