
fastpaca


Context infra for LLM apps

Users expect full message history. LLMs have context limits.

Fastpaca handles both automatically.

→ Get Started

7.5k req/s (3 nodes) · p99 < 150 ms · Apache 2.0

                      ╔═ fastpaca ════════════════════════╗
╔══════════╗          ║                                   ║░
║          ║░         ║  ┏━━━━━━━━━━━┓     ┏━━━━━━━━━━━┓  ║░
║  client  ║░───API──▶║  ┃  Message  ┃────▶┃  Context  ┃  ║░
║          ║░         ║  ┃  History  ┃     ┃  Policy   ┃  ║░
╚══════════╝░         ║  ┗━━━━━━━━━━━┛     ┗━━━━━━━━━━━┛  ║░
 ░░░░░░░░░░░░         ║                                   ║░
                      ╚═══════════════════════════════════╝░
                       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

What users expect to see

09:12 alice → can we review the feb roadmap decisions?
09:13 ben → pulling the changelog and doc links now.
09:18 alice → include every design note, i need the full thread.
09:45 audit-log → summary appended for leadership review.
History shown: Every turn

Users expect nothing to disappear from the history.

What LLMs need

SYSTEM SUMMARY → feb roadmap decisions condensed.
09:38 user → ok, now answer with everything considered.
09:38 assistant → drafting response…
context trimmed
Context window budget: 180k / 200k

Overrun the limit and the provider rejects the call.

Fastpaca manages all of it automatically for you.

  • Keep every message for your users to see.
  • Stay under strict context limits before calls fail.
  • Compact just enough detail to keep the LLM coherent.

How Fastpaca Works

1. Choose a budget & context policy

Every context sets its token budget and compaction policy up front. Use built-ins or roll your own.

const ctx = await fastpaca.context('chat_42', {
  budget: 1_000_000,
  trigger: 0.7,
  policy: { strategy: 'last_n', config: { limit: 400 } }
});

2. Append from your backend

Append any message from your LLMs or your users.

await ctx.append({
  role: 'user',
  parts: [{ type: 'text', text: 'What changed in the latest release?' }]
});

3. Call your LLM

Fetch the compacted context and hand it to your LLM.

const stream = ctx.stream((messages) => streamText({
  model: openai('gpt-4o-mini'),
  messages
}));

return stream.toResponse();

4. (optional) Compact on your terms

Set the policy to `manual`, then use `needsCompaction` to check whether you've hit the configured budget and manage compaction yourself.

const { needsCompaction, messages } = await ctx.context();
if (needsCompaction) {
  const { summary, remainingMessages } = await summarise(messages);
  await ctx.compact([
    { role: 'system', parts: [{ type: 'text', text: summary }] },
    ...remainingMessages
  ]);
}
→ View examples

What You Get

Stack agnostic

Bring your own framework. Works natively with ai-sdk. Use LangChain, raw OpenAI/Anthropic calls, whatever you fancy.
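
For example, here is a rough sketch of feeding the compacted window straight to the raw OpenAI SDK (the part-to-content mapping below is an assumption; adapt it to the message shapes you store):

import OpenAI from 'openai';

const openai = new OpenAI();

// Fetch the compacted window from Fastpaca, then call the provider directly.
const { messages } = await ctx.context();

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  // Flatten text parts into plain strings for the raw chat API.
  messages: messages.map((m) => ({
    role: m.role,
    content: m.parts.map((p) => (p.type === 'text' ? p.text : '')).join('')
  }))
});

// Append the assistant reply so the history stays complete.
await ctx.append({
  role: 'assistant',
  parts: [{ type: 'text', text: completion.choices[0].message.content ?? '' }]
});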

Horizontally scalable

Distributed consensus, idempotent appends, automatic failover. Scale nodes horizontally without risk.

Token-smart

Enforce token budgets with built-in compaction policies. Stay within limits automatically.

Self-hosted

Single container to start. Add nodes to form a cluster with automatic failover. Optional Postgres write-behind.

What Fastpaca is NOT

Not a vector DB

Bring your own to complement your LLM. Fastpaca manages conversation state, not embeddings.

Not generic chat infrastructure

Built specifically for LLMs. Optimized for token budgets and context windows.

Not an agent framework

Use it alongside whichever one you prefer. Fastpaca handles context, you handle orchestration.

Fastpaca is open, Apache-licensed context infrastructure you can run anywhere.

Self-host on your laptop, your Kubernetes cluster, or your VPC. Full ownership and control.

→ View on GitHub · → Quick Start

FAQ

What is a context in fastpaca?

A durable log of messages plus an LLM context window (the slice you send to your LLM). The log is append‑only; the window respects your token budget and policy. Learn more.

Do you call my LLM?

No. Fastpaca is backend‑only: your server appends to and fetches from Fastpaca, then calls your LLM provider directly. Quick Start.

How do token budgets and triggers work?

You set a token budget and a trigger ratio (default 0.7). When usage crosses the trigger, the compaction policy you chose automatically compacts the context window contents. Details.
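
Purely as an illustration (the option names match the quick-start example above; the numbers are made up):

// With a 200k budget and the default 0.7 trigger, compaction starts once
// the window crosses roughly 140_000 tokens (200_000 * 0.7).
const ctx = await fastpaca.context('support_7', {
  budget: 200_000,
  trigger: 0.7,
  policy: { strategy: 'last_n', config: { limit: 100 } }
});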

Which compaction policies are built in?

last_n keeps the latest N messages; skip_parts drops tool* and reasoning parts, then applies last_n; manual keeps everything until the trigger trips, then lets you decide how to rewrite. Strategies.
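
Roughly, the three look like this in the policy option (the skip_parts config shape is an assumption that mirrors last_n here; see the strategies docs for the exact fields):

// Keep only the latest 400 messages in the window.
policy: { strategy: 'last_n', config: { limit: 400 } }

// Drop tool* and reasoning parts first, then keep the latest 400 messages.
policy: { strategy: 'skip_parts', config: { limit: 400 } }

// Never compact automatically; check needsCompaction and rewrite yourself.
policy: { strategy: 'manual' }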

Can I change policy later?

Yes. Update the context with a new policy and future compactions will use it. The message log remains intact. Changing policies.
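
A minimal sketch, assuming the same fastpaca.context() call from the quick start also updates an existing context's policy:

// Re-declare chat_42 with a new policy; the message log is untouched and
// only future compactions pick up the new strategy.
const ctx = await fastpaca.context('chat_42', {
  budget: 1_000_000,
  trigger: 0.7,
  policy: { strategy: 'skip_parts', config: { limit: 400 } }
});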

How do I stream and still keep history?

Use ctx.stream(...) with your LLM call in ai-sdk. It forwards the window to your LLM and appends the streamed parts back into the context. In other languages you'll have to roll your own (see the sketch below). Streaming.
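
Outside ai-sdk, rolling your own means roughly: fetch the window, call your provider however you like, then append what came back. A sketch (callYourProvider is a hypothetical helper standing in for your LLM client):

const { messages } = await ctx.context();        // compacted window
const reply = await callYourProvider(messages);  // stream or complete, your choice
await ctx.append({
  role: 'assistant',
  parts: [{ type: 'text', text: reply }]
});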

How do I handle retries and concurrency?

Use idempotency_key for retries and if_version for optimistic concurrency (409 on mismatch). API reference.
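
A hedged sketch of both, assuming the SDK takes these as an options argument on append (the exact shape lives in the API reference):

await ctx.append(
  { role: 'user', parts: [{ type: 'text', text: 'Retry-safe message' }] },
  {
    idempotency_key: 'msg_7f3a', // same key on retry → the message is appended at most once
    if_version: 42               // rejected with 409 if the log has moved past version 42
  }
);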

Fastpaca makes context management as boring as it should be.

Store everything. Budget tokens. Compact automatically. Move on to building your product.

Use ai-sdk for inference. Use Fastpaca for context state. Bring your own LLM, framework, and frontend.

Open-source at github.com/fastpaca/fastpaca — Apache 2.0 · Self-host anywhere

© 2025 Fastpaca. All rights reserved.