# LLMWise — Full Platform Documentation

> Multi-model LLM API orchestration platform. One API key to access 30+ models from OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, and more. Orchestration modes: Chat, Compare, Blend, Judge, plus Failover routing. OpenAI-style messages, credit-based pay-per-use, no subscription.

- Base URL: https://llmwise.ai
- API base: https://llmwise.ai/api/v1
- Auth: Bearer token (mm_sk_ prefix) or Clerk JWT
- Streaming: Server-Sent Events (SSE)

## Supported Models

| ID | Name | Provider | Vision |
|----|------|----------|--------|
| auto | Auto (smart routing) | LLMWise | Yes |
| gpt-5.2 | GPT-5.2 | OpenAI | Yes |
| claude-sonnet-4.5 | Claude Sonnet 4.5 | Anthropic | Yes |
| gemini-3-flash | Gemini 3 Flash | Google | Yes |
| claude-haiku-4.5 | Claude Haiku 4.5 | Anthropic | No |
| deepseek-v3 | DeepSeek V3 | DeepSeek | No |
| llama-4-maverick | Llama 4 Maverick | Meta | No |
| mistral-large | Mistral Large | Mistral | No |
| grok-3 | Grok 3 | xAI | Yes |
| zai-glm-5 | GLM 5 | Z.ai | No |
| liquid-lfm-2.2-6b | LFM2 2.6B | LiquidAI | No |
| liquid-lfm-2.5-1.2b-thinking-free | LFM2.5 1.2B Thinking (Free) | LiquidAI | No |
| liquid-lfm2-8b-a1b | LFM2 8B A1B | LiquidAI | No |
| minimax-m2.5 | MiniMax M2.5 | MiniMax | No |
| llama-3.3-70b-instruct | Llama 3.3 70B Instruct | Meta | No |
| gpt-oss-20b | GPT OSS 20B | OpenAI | No |
| gpt-oss-120b | GPT OSS 120B | OpenAI | No |
| gpt-oss-safeguard-20b | GPT OSS Safeguard 20B | OpenAI | No |
| kimi-k2.5 | Kimi K2.5 | MoonshotAI | Yes |
| nemotron-3-nano-30b-a3b | Nemotron 3 Nano 30B | NVIDIA | No |
| nemotron-nano-12b-v2-vl | Nemotron Nano 12B VL | NVIDIA | Yes |
| claude-opus-4.6 | Claude Opus 4.6 | Anthropic | Yes |
| claude-opus-4.5 | Claude Opus 4.5 | Anthropic | Yes |
| arcee-coder-large | Arcee Coder Large | Arcee AI | No |
| arcee-trinity-large-preview-free | Arcee Trinity Large (Free) | Arcee AI | No |
| qwen3-coder-next | Qwen3 Coder Next | Qwen | No |
| olmo-3.1-32b-think | OLMo 3.1 32B Think | AllenAI | No |
| llama-guard-3-8b | Llama Guard 3 8B | Meta | No |
| gpt-4o-2024-08-06 | GPT-4o (2024-08-06) | OpenAI | Yes |
| gpt-audio | GPT Audio | OpenAI | No |
| openrouter-free | OpenRouter Free | OpenRouter | Yes |
| openrouter-auto | OpenRouter Auto | OpenRouter | Yes |

## Orchestration Modes

### Chat (1 credit)

Endpoint: POST /api/v1/chat

Single-model chat with OpenAI-style messages (role + content) and streaming SSE.

### Compare (3 credits)

Endpoint: POST /api/v1/compare

Same prompt hits 2-9 models simultaneously. Responses stream back with per-model latency, tokens, and cost.

### Blend (4 credits)

Endpoint: POST /api/v1/blend

Multiple models respond, then a synthesizer combines the strongest parts. Strategies: consensus, council, best_of, chain, moa, self_moa.

### Judge (5 credits)

Endpoint: POST /api/v1/judge

Contestant models compete on your prompt. A judge model scores, ranks, and explains why one wins.

### Failover Routing (1 credit)

Endpoint: POST /api/v1/chat (with routing parameter)

Primary model hits 429 or goes down? Auto-failover to the backup chain. Circuit breakers, health checks, zero downtime.

## Pricing

- Free Trial: 40 credits, 7-day expiry, no credit card required
- Pay-per-use: Add credits anytime, paid credits never expire
- Auto top-up: Optional automatic refill with monthly safety cap
- Enterprise: Custom limits, team billing, SLAs — contact sales@llmwise.ai

---

# Documentation

## Getting Started

### Quick Start Guide

## What you get immediately

Every new account receives **40 free credits** (7-day trial).
One credit = one Chat request. No credit card required to start.

- OpenAI-style messages format (role + content)
- Chat, Compare, Blend, Judge, and Mesh modes
- Unified usage + charged credits visibility
- Optimization and replay workflows for policy tuning

## 10-minute setup

1. Create an account at `/sign-up` — you receive 40 free credits instantly.
2. Generate an API key in `/keys`.
3. Open `/api-explorer` and run your first request.
4. Open `/chat` and test `Auto` mode.
5. Open `/usage` to confirm charged credits and response latency.

## First request

```bash
curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "optimization_goal": "balanced",
    "messages": [
      {"role": "user", "content": "Give me a launch checklist for an AI API product."}
    ],
    "stream": true
  }'
```

## What success looks like

In streaming mode, watch for a final `done` payload including:

- `finish_reason`
- `resolved_model`
- `credits_charged`
- `credits_remaining`

### Dashboard User Guide

## Dashboard map

## Mode behavior

## Suggested daily workflow

1. Start in Chat with `Auto` and the `Balanced` goal.
2. For critical prompts, run Compare before standardizing.
3. Use Blend/Judge only for high-value outputs.
4. Add a Mesh chain for reliability-sensitive flows.
5. Check Usage daily and Replay weekly.

## How to read the Usage page correctly

- **Charged credits**: what your wallet is billed.
- **Latency**: request performance for user experience.
- **Tokens**: workload profile for model selection decisions.

## API Core

### Authentication and API Keys

## Authentication model

LLMWise supports two authentication methods: API keys (`mm_sk_` prefix) and Clerk session JWTs. Both methods use the same `Authorization: Bearer <token>` header. The backend detects which method you are using by the token prefix.

## API key details

- **Prefix:** `mm_sk_` followed by 64 hex characters
- **Storage:** Keys are SHA-256 hashed before storage — the raw key is only shown once at generation time
- **One key per account** at a time. Generating a new key invalidates the previous one

### Key lifecycle

### Chat API Reference

## Endpoint

`POST /api/v1/chat`

## Request fields

## Streaming events

In single-model chat mode, SSE messages are plain JSON chunks that include a `delta` field (no explicit `event` field). In Mesh/failover mode (when `routing` is set, or when Auto uses an implicit fallback chain), chunks are wrapped in explicit events (`event: "route" | "chunk" | "trace"`), followed by a final `done` payload with billing metadata.

## Request example

```json
{
  "model": "auto",
  "cost_saver": true,
  "optimization_goal": "cost",
  "messages": [
    {"role": "user", "content": "Design retry logic for API failures."}
  ],
  "semantic_memory": true,
  "semantic_top_k": 4,
  "stream": true
}
```

## Done event example

```json
{
  "event": "done",
  "id": "request_uuid",
  "resolved_model": "deepseek-v3",
  "finish_reason": "stop",
  "credits_charged": 1,
  "credits_remaining": 2038
}
```

## Non-stream response example

```json
{
  "id": "request_uuid",
  "model": "gpt-5.2",
  "content": "...",
  "prompt_tokens": 42,
  "completion_tokens": 312,
  "latency_ms": 1180,
  "cost": 0.0039,
  "credits_charged": 1,
  "credits_remaining": 2038,
  "finish_reason": "stop",
  "mode": "chat"
}
```
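If you are not using an SDK, you can consume the stream directly over HTTP. Below is a minimal sketch with the `requests` library; it assumes standard SSE `data:` framing around the chunk shapes documented above, which is an assumption rather than a documented guarantee.

```python
import json
import os
import requests

# Minimal SSE reader for single-model Chat streaming. Assumes standard "data:" framing
# around the documented chunk shapes (delta chunks, then a final "done" payload).
resp = requests.post(
    "https://llmwise.ai/api/v1/chat",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "model": "gpt-5.2",
        "messages": [{"role": "user", "content": "Give me a launch checklist."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue                                   # skip keep-alives and non-data SSE fields
    chunk = json.loads(line[len("data:"):].strip())
    if chunk.get("delta"):
        print(chunk["delta"], end="", flush=True)  # incremental text
    if chunk.get("event") == "done" or chunk.get("done") is True:
        # Final billing payload (see the Done event example above).
        print("\ncredits_remaining:", chunk.get("credits_remaining"))
        break
```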
### Auto Routing and Optimization (Load Balancer Mode)

## What Auto does (in one sentence)

`model="auto"` turns LLMWise into a **load balancer for LLMs**: it picks the best primary model for each request and (optionally) applies an implicit fallback chain so transient failures do not break your flow.

## Auto decision flow

When you send a Chat request with `model="auto"`, the backend:

1. Builds a candidate model set (vision-safe if your messages contain images).
2. Loads your **optimization policy** (defaults + guardrails).
3. Resolves a goal: `balanced | cost | latency | reliability`.
4. Chooses a primary model using one of two strategies:
   - `historical_optimization`: uses your recent production traces when there is enough data.
   - `heuristic_routing`: uses a fast heuristic classifier when history is insufficient or policy disables history.

The final model is returned to you in `resolved_model` on the `done` event (streaming) or in the JSON response (non-stream).

## Auto as a load balancer (implicit failover)

Auto can also add a fallback chain even if you do not provide `routing`. This is controlled by your optimization policy:

- If `max_fallbacks > 0`, Auto will attach a fallback chain to the request.
- If `max_fallbacks = 0`, Auto will run as **single-model routing only** (no implicit failover).

When an implicit chain is active, LLMWise retries on retryable failures (429/5xx/timeouts), emits routing events (`route`, `trace`), and settles billing once a final model succeeds.

## Cost saver mode (shortcut)

If you send `cost_saver: true`, the server normalizes your request to:

- `model = "auto"`
- `optimization_goal = "cost"`

This is supported for `POST /api/v1/chat` only (not with explicit `routing`).

## What you see in streaming

In streaming mode (`stream: true`), you will see:

- **Delta chunks**: JSON objects with a `delta` field (text) and a `done` boolean.
- **Mesh/Auto failover events** (only when a fallback chain is active):
  - `event: "route"`: model attempts (trying/failed/skipped)
  - `event: "chunk"`: streamed deltas (event-wrapped)
  - `event: "trace"`: final routing summary
- **Final billing event**:
  - `event: "done"` with `credits_charged`, `credits_remaining`, and (when Auto is used) `resolved_model`, `auto_strategy`, `optimization_goal`.

## API examples

### cURL (Auto + cost saver)

```bash
curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "cost_saver": true,
    "messages": [{"role":"user","content":"Summarize this support thread."}],
    "stream": true
  }'
```

### Python (SDK)

```python
import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    optimization_goal="balanced",
    messages=[{"role": "user", "content": "Write a launch plan for a SaaS product."}],
):
    if ev.get("delta"):
        print(ev["delta"], end="", flush=True)
    if ev.get("event") == "done":
        print("\n\nresolved_model:", ev.get("resolved_model"))
        break
```

### TypeScript (SDK)

```ts
import { LLMWise } from "llmwise";

const client = new LLMWise(process.env.LLMWISE_API_KEY!);

for await (const ev of client.chatStream({
  model: "auto",
  optimization_goal: "cost",
  messages: [{ role: "user", content: "Draft a short outbound email to a CTO." }],
})) {
  if (ev.delta) process.stdout.write(ev.delta);
  if (ev.event === "done") {
    console.log("\nresolved_model:", (ev as any).resolved_model);
    break;
  }
}
```

### Compare / Blend / Judge API Reference

## Endpoint matrix

| Mode | Endpoint | Default charge |
|------|----------|----------------|
| Compare | POST /api/v1/compare | 3 credits |
| Blend | POST /api/v1/blend | 4 credits |
| Judge | POST /api/v1/judge | 5 credits |

## Compare behavior

- Runs all selected models concurrently.
- Emits per-model completion events.
- Emits summary metadata (`fastest`, `longest`).
- Refunds credits when all models fail.
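A Compare request body is not shown elsewhere in this reference, so here is a hedged sketch: the `models` and `messages` field names are assumed from the Blend examples later in this guide, and the streamed events are printed as-is because their exact shapes are not documented here.

```python
import json
import os
import requests

# Hedged Compare sketch: field names assumed from the Blend examples; events printed raw.
resp = requests.post(
    "https://llmwise.ai/api/v1/compare",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
        "messages": [{"role": "user", "content": "Explain vector clocks in two paragraphs."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if line and line.startswith("data:"):
        event = json.loads(line[len("data:"):].strip())
        print(event)  # per-model completions, then summary metadata (fastest, longest)
```

Compare accepts 2-9 models, reserves 3 credits per request, and refunds them only if every model fails.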
## Blend behavior

Blend supports six strategies:

- `consensus`
- `council`
- `best_of`
- `chain`
- `moa` (Mixture-of-Agents refinement layers)
- `self_moa` (Self-MoA: multiple candidates from one base model)

Notes:

- Most strategies require **2+ models**; passing a single model to them returns a 400 error.
- For `self_moa`, pass exactly **1 model** in `models[]` and set `samples` (2–8).
- For `moa`, set `layers` (1–3). Each layer refines answers using the previous layer as references.
- The judge model cannot be one of the contestants.

## Judge behavior

Judge mode collects contestant outputs, then prompts the judge model to return ranked JSON.

```json
{
  "event": "verdict",
  "winner": "claude-sonnet-4.5",
  "scores": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "..."},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "..."}
  ],
  "overall": "Claude's response was more complete and better structured."
}
```

## Failure semantics

### API Explorer Guide

## Why API Explorer exists

API Explorer is the fastest way to validate payload structure and endpoint behavior before writing SDK integration code.

- Mode-specific payload templates
- Live request execution with your API key
- Stream event inspector (delta chunks, `route`/`chunk`/`trace`, `done`, terminal errors)
- Raw and parsed output panes
- Product-scoped assistant for endpoint-specific snippet generation

## Typical debugging sequence

## Good assistant prompts

- "Generate a Node.js fetch example with retries for this payload."
- "Show a Python SSE parser for done events and finish_reason handling."
- "Explain why this request returned 402 and what user action fixes it."

## Tutorials

### Mesh Mode Tutorial (Failover Routing)

## When to use Mesh

Use Mesh mode for reliability-sensitive traffic where a single provider failure is not acceptable:

- Frequent 429 bursts
- Provider latency spikes
- High-value requests that must complete

## Mesh failover model

### Replay Lab Tutorial

## What Replay Lab does

Replay Lab simulates historical request traffic against your current policy to estimate impact before you change production behavior:

- Cost deltas
- Latency deltas
- Reliability and success-rate deltas

## Replay flow

### Prompt Regression Testing Tutorial

## What this feature covers

- Prebuilt prompt templates
- Custom suite creation
- Manual and scheduled test runs
- CSV export for historical tracking

## Workflow

### Blend Strategies & Orchestration Algorithms

LLMWise orchestrates multiple models through several algorithmic layers. This guide explains every strategy and algorithm in depth, with special focus on **Blend mode** — the most configurable.

## Blend mode overview

Blend sends your prompt to multiple models simultaneously, then feeds all responses into a **synthesizer** model that produces one final answer. The synthesis behavior changes depending on which **strategy** you choose.

All strategies follow the same two-phase execution: every selected model answers the prompt in parallel, then the synthesizer combines those responses into one final answer.
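Whatever strategy you pick, the HTTP call is the same; only the JSON body changes. Here is a minimal call sketch using the consensus payload from the next section. Only the request fields come from those examples; treating the response as a single JSON body (rather than a stream) and its shape are assumptions.

```python
import os
import requests

# Minimal Blend call using the consensus payload format shown below.
# The response is printed as raw JSON because its exact shape is not documented here.
payload = {
    "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
    "synthesizer": "claude-sonnet-4.5",
    "strategy": "consensus",
    "messages": [{"role": "user", "content": "Explain quantum entanglement"}],
}

resp = requests.post(
    "https://llmwise.ai/api/v1/blend",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json=payload,
    timeout=300,
)
resp.raise_for_status()
print(resp.json())  # synthesized answer plus metadata (shape assumed)
```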
## Strategy: Consensus

The default strategy. The synthesizer receives all source responses and is instructed to combine the strongest points while resolving any contradictions.

- Single-pass synthesis — no refinement layers
- Synthesizer decides which parts of each response to keep
- Contradictions are resolved by weighing the majority view

```json
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "consensus",
  "messages": [{"role": "user", "content": "Explain quantum entanglement"}]
}
```

## Strategy: Council

Structures the synthesis as a deliberation. The synthesizer produces:

1. **Final answer** — the synthesized conclusion
2. **Agreement points** — where all models aligned
3. **Disagreement points** — where models diverged, with analysis
4. **Follow-up questions** — areas that need further exploration

Best when you want transparency about model consensus vs. divergence.

## Strategy: Best-Of

The synthesizer picks the single best response, then enhances it with useful additions from the others. The quickest synthesis approach — minimal rewriting, focused on augmentation.

## Strategy: Chain

Iterative integration. The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Produces the most thorough output but may be longer.

## Strategy: MoA (Mixture of Agents)

The most sophisticated strategy. Inspired by the [Mixture-of-Agents](https://arxiv.org/abs/2406.04692) paper, MoA adds **refinement layers** where models can see and improve upon previous answers.

### How MoA layers work

1. **Layer 0**: Each model answers the prompt independently (same as other strategies).
2. **Layer 1+**: Each model receives the previous layer's answers as reference material, injected via system message. Models are instructed to improve upon, correct, and expand the references.
3. **Final synthesis**: The synthesizer combines all responses from the last completed layer.

### Reference injection

Previous-layer answers are injected into each model's context:

- **Total reference budget**: 12,000 characters across all references
- **Per-answer cap**: 3,200 characters (truncated if longer)
- **Injection method**: System message + follow-up user message containing formatted references

### Early stopping

If a layer produces zero successful responses, MoA keeps the previous layer's successes and skips to synthesis. This prevents total failure when models hit rate limits or errors.

```json
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "moa",
  "layers": 2,
  "messages": [{"role": "user", "content": "Design a rate limiter for a distributed system"}]
}
```

## Strategy: Self-MoA

Self-MoA generates diverse candidates from a **single model** by varying temperature and system prompts. This is useful when you trust one model but want to hedge against its variance.

### How it works

1. You provide exactly **1 model** in `models[]`.
2. Set `samples` (2–8, default 4) for how many candidates to generate.
3. Each candidate runs with a different **temperature offset** and **agent prompt**.
4. The synthesizer combines all candidates into one final answer.

### Temperature variation

Each candidate gets a different temperature to encourage diversity:

```
Base offsets: [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
Final temp = clamp(base_temp + offset, 0.2, 1.4)
```

For example, with `temperature: 0.7` and 4 samples:

- Candidate 1: temp 0.45 (conservative)
- Candidate 2: temp 0.70 (baseline)
- Candidate 3: temp 0.95 (creative)
- Candidate 4: temp 1.15 (exploratory)

### Agent prompt rotation

Six distinct system prompts rotate across candidates, each emphasizing a different quality, plus two more: **Clarity** (plain-language explanations) and **Skepticism** (challenge assumptions, flag weaknesses).

```json
{
  "models": ["claude-sonnet-4.5"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "self_moa",
  "samples": 4,
  "temperature": 0.7,
  "messages": [{"role": "user", "content": "Write a Python async rate limiter"}]
}
```

## Blend credit cost

Blend costs **4 credits** regardless of strategy or model count. Credits are reserved upfront and refunded if all source models fail. After completion, the actual provider cost is settled — you may receive a partial refund if the real cost was lower than the reservation.

## Compare mode algorithm

Compare runs 2–9 models concurrently and streams their responses side-by-side.

- All models stream via an `asyncio.Queue` — chunks are yielded in arrival order (not round-robin)
- Queue timeout: 120 seconds per chunk
- After all models finish, a **summary event** reports the fastest model and longest response
- Total latency = max(individual latencies) — the bottleneck is the slowest model
- Cost: 3 credits. Refunded if all models fail; partial status logged if some succeed.

## Judge mode algorithm

Judge runs a three-phase competitive evaluation.

### Scoring system

The judge produces structured JSON with rankings sorted by score descending:

```json
{
  "rankings": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "Most complete and well-structured"},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "Accurate but less organized"}
  ],
  "overall_analysis": "Claude's response covered more edge cases..."
}
```

**Default evaluation criteria**: accuracy, completeness, clarity, helpfulness, code quality. You can override these with the `criteria` parameter.

**Fallback scoring**: If the judge returns malformed JSON, default scores are assigned: `8.0 - (i * 0.5)` for each contestant in order, with a note that scores were auto-assigned.

Cost: 5 credits.

## Mesh mode: circuit breaker failover

When you use Mesh mode (chat with the `routing` parameter), LLMWise tries models in sequence with automatic failover powered by a circuit breaker.

### Circuit breaker state machine

Each model tracks health in-memory.

### Failover sequence

1. Try the **primary model** first.
2. If it fails (or its circuit is open), try **fallback 1**, then **fallback 2**, etc.
3. For each attempt: emit a `route` event (`trying`, `failed`, or `skipped`).
4. The first success stops the chain — no further fallbacks are tried.
5. After all attempts, emit a `trace` event summarizing the route.

### Latency tracking

Model latency is tracked with exponential smoothing:

```
avg_latency = (avg_latency * 0.8) + (new_latency * 0.2)
```

This favors recent measurements, so a model that recovers from a slow period will quickly show improved latency.
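The failover and latency rules above fit in a few lines of code. Here is a minimal sketch using only the thresholds stated in this guide (3 consecutive failures, a 30-second cooldown, 0.8/0.2 smoothing); the class and method names are illustrative, not the platform's actual implementation.

```python
import time

# Illustrative per-model circuit breaker + latency smoothing.
# Only the thresholds come from this guide; everything else is a sketch.
class ModelHealth:
    FAILURE_THRESHOLD = 3   # consecutive failures before the circuit opens
    OPEN_SECONDS = 30       # cooldown before a half-open test request

    def __init__(self):
        self.consecutive_failures = 0
        self.opened_at = None        # None means the circuit is closed
        self.avg_latency_ms = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                    # closed: normal traffic
        if time.monotonic() - self.opened_at >= self.OPEN_SECONDS:
            return True                                    # half-open: allow a test (simplified to "any" request)
        return False                                       # open: skip to the next fallback

    def record_success(self, latency_ms: float) -> None:
        self.consecutive_failures = 0
        self.opened_at = None                              # a successful test closes the circuit
        if self.avg_latency_ms is None:
            self.avg_latency_ms = latency_ms
        else:
            # Exponential smoothing: favor recent measurements (0.8 old, 0.2 new).
            self.avg_latency_ms = self.avg_latency_ms * 0.8 + latency_ms * 0.2

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.FAILURE_THRESHOLD:
            self.opened_at = time.monotonic()              # open (or reopen) the circuit
```

In the half-open state, a successful request resets the failure count and closes the circuit, while a failed test immediately reopens it for another 30 seconds.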
## Auto-router: heuristic classification

When you set `model: "auto"`, LLMWise classifies your query using **zero-latency regex matching** (no LLM call overhead) and routes to the best model.

### Policy-based routing

If you have an **optimization policy** enabled with sufficient historical data, the auto-router upgrades from regex heuristics to **historical optimization** — routing based on actual performance data from your past requests. See the next section.

## Optimization scoring algorithm

The optimization engine analyzes your historical request logs and recommends the best model + fallback chain for each goal.

### Goals and weight vectors

Each goal weights the three scoring dimensions (success rate, latency, and cost) differently.

### Scoring formula

For each eligible model (minimum 3 calls in the lookback window):

```
inv_latency = (max_latency - model_latency) / (max_latency - min_latency)
inv_cost = (max_cost - model_cost) / (max_cost - min_cost)
raw_score = (Ws * success_rate) + (Wl * inv_latency) + (Wc * inv_cost)
sample_factor = min(1.0, calls / 20)
score = raw_score * (0.7 + 0.3 * sample_factor)
```

The **sample factor** gives a small boost to models with more data — a model with 20+ calls gets the full score, while a model with only 3 calls is penalized roughly 25%. Preferred models get an additional `+0.04 * sample_factor` bonus.

### Confidence score

```
confidence = min(1.0, total_calls / 60)
```

At 60+ total calls across all models, confidence reaches 1.0 (full certainty). Below that, the recommendation carries a lower confidence signal.

### Guardrails

After scoring, models are filtered through policy guardrails:

- **Max latency**: Reject models above the threshold (e.g., 5000 ms)
- **Max cost**: Reject models above the per-request cost cap (e.g., $0.05)
- **Min success rate**: Reject models below the reliability threshold (e.g., 0.95)

The top model that passes all guardrails becomes the **recommended primary**. The next N models become the **fallback chain** (configurable, 0–6 fallbacks).

## Credit settlement algorithm

LLMWise uses a three-phase credit system: reserve, execute, settle.

### Settlement formula

Reserved credits are debited at request start. After execution, LLMWise reconciles that reserve against actual token usage.

- If usage is lower than the reserved credits, unused credits are refunded.
- If usage is higher, we charge only the difference.

BYOK requests are billed by the provider directly and remain at **0 credits**.

## Billing & Limits

### Billing and Credits

## Billing principle

Users are billed in **credits**, not raw provider token costs. One dollar buys 100 credits.

- The mode-level default charge is fixed per request (reserved upfront)
- After the request completes, a settlement step reconciles the actual provider cost
- Wallet balance is shown in `/credits`
- **Paid credits never expire**

## Free trial

Every new account receives **40 free credits** on signup. Free credits expire after **7 days**. Once expired, unused free credits are removed and the account moves to `free_expired` status. Purchase any credit pack to unlock your account — paid credits have no expiry.

## Default charges

Chat (including Mesh/failover chat): 1 credit. Compare: 3 credits. Blend: 4 credits. Judge: 5 credits.

## How settlement works

Credits are **reserved** before the request starts, then **settled** after the real provider cost is known. If the actual provider cost (plus margin) exceeds the reserved credits, additional credits are charged. If it costs less, the difference is refunded. All adjustments appear as separate transactions in your history.

## Top-up flow

Minimum top-up is $10. Maximum single top-up is $10,000.
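The reserve-and-settle flow described above reduces to simple arithmetic. Here is a minimal sketch, assuming the fixed reservations listed under Default charges and the 100-credits-per-dollar rate; the margin handling and round-up conversion are assumptions, not the exact billing code.

```python
import math

CREDITS_PER_DOLLAR = 100                                        # one dollar buys 100 credits
RESERVE = {"chat": 1, "compare": 3, "blend": 4, "judge": 5}     # fixed mode-level reservations

def settle(mode: str, provider_cost_usd: float, margin: float = 0.0) -> dict:
    """Reconcile the upfront reservation against the actual provider cost.

    Returns the post-request adjustment: a refund if the real cost was lower,
    an extra charge for the difference if it was higher.
    """
    reserved = RESERVE[mode]
    # Convert the real cost (plus any margin) into credits, rounding up (assumption).
    actual_credits = math.ceil(provider_cost_usd * (1 + margin) * CREDITS_PER_DOLLAR)
    if actual_credits <= reserved:
        return {"charged": actual_credits, "refund": reserved - actual_credits, "extra_charge": 0}
    return {"charged": actual_credits, "refund": 0, "extra_charge": actual_credits - reserved}

# Example: a Chat request reserved 1 credit and cost $0.0039 at the provider,
# matching the non-stream response example earlier (credits_charged: 1).
print(settle("chat", 0.0039))   # -> {'charged': 1, 'refund': 0, 'extra_charge': 0}
```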
## Auto top-up

Enable automatic refills so requests never fail due to low balance:

1. Complete one Stripe checkout to save a payment method.
2. Enable auto top-up in `/settings` and set your preferred amount.
3. Set a balance threshold — when credits drop below it, a top-up is triggered.
4. Set a monthly spending cap to control costs.

Auto top-ups are processed as off-session Stripe PaymentIntents using your saved payment method. Monthly spending is tracked and capped to prevent runaway charges.

## BYOK (Bring Your Own Key)

When a BYOK provider key is configured, requests route directly to the provider using your key. **BYOK requests skip credit charges entirely** — you pay the provider directly. This is useful when customer contracts require provider-direct billing.

### Rate Limits and Reliability

## Reliability stack

## Per-endpoint limits

All limits are per 60-second window. Paid users (any purchase history) get a 1.5x multiplier; free-tier users get a 0.6x multiplier.

## Dual-layer enforcement

Every request is checked against two independent counters:

1. **Per-user** — keyed by your user ID
2. **Per-IP** — keyed by your client IP address (via `X-Forwarded-For`)

IP-level limits are separate from user limits. Default IP limits: free = 120 req/min, paid = 360 req/min.

## Burst protection

A second short-window layer prevents request spikes. Within any 10-second window:

- **Free users:** 30 requests max
- **Paid users:** 90 requests max

If you exceed the burst limit, you receive a `429` with the message "Request burst detected."

## Response headers

Every API response includes rate-limit headers.

## Fail-open mode

By default, rate limiting runs in **fail-open** mode. If Redis is unavailable, requests are allowed through rather than blocked. This prevents a Redis outage from taking down your API access. Critical routes can be configured as fail-closed if needed.

## Circuit breaker (Mesh mode)

When using Mesh/failover routing, a per-model circuit breaker protects against cascading failures:

- **3 consecutive failures** → the circuit opens for 30 seconds
- During the open state, the model is skipped and the next fallback is tried
- After 30 seconds, **half-open**: one test request is allowed through
- A successful test closes the circuit; a failure reopens it

## Client retry baseline

```javascript
async function fetchWithRetry(url, init, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const res = await fetch(url, init);
    if (res.ok) return res;
    // Retry on rate limiting (429) and server errors (5xx).
    if (res.status === 429 || res.status >= 500) {
      // Honor Retry-After when present; otherwise use exponential backoff.
      const retryAfter = res.headers.get("Retry-After");
      const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : 300 * (2 ** attempt);
      await new Promise((r) => setTimeout(r, delay));
      continue;
    }
    // Other 4xx errors are not retryable.
    throw new Error("HTTP " + res.status);
  }
  throw new Error("Retries exhausted");
}
```

## Security & Data

### Privacy, Security, and Data Controls

## Control matrix

## Retention impact

## Managing privacy settings

Toggle controls via `PUT /api/v1/settings/privacy`:

```json
{
  "zero_retention_mode": true,
  "data_training_opt_in": false,
  "purge_existing_data": true
}
```

- `zero_retention_mode` — when enabled, all new requests skip prompt/response storage and semantic memory
- `data_training_opt_in` — explicit consent for training data collection (auto-disabled when zero-retention is on)
- `purge_existing_data` — when enabling zero-retention, purge previously stored data

Check current settings with `GET /api/v1/settings/privacy`.
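Here is a minimal sketch of toggling these controls from Python with the `requests` library, using the endpoints and payload shown above; the response bodies are not documented here, so they are simply printed.

```python
import os
import requests

BASE = "https://llmwise.ai/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"}

# Enable zero-retention mode and purge previously stored data (payload from the docs above).
resp = requests.put(
    f"{BASE}/settings/privacy",
    headers=HEADERS,
    json={
        "zero_retention_mode": True,
        "data_training_opt_in": False,
        "purge_existing_data": True,
    },
    timeout=30,
)
resp.raise_for_status()

# Read back the current settings to confirm the change took effect.
current = requests.get(f"{BASE}/settings/privacy", headers=HEADERS, timeout=30)
print(current.json())
```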
## Data purge

When you enable zero-retention mode with `purge_existing_data: true`, the following data is permanently removed:

- **Semantic memories** — all vector embeddings deleted
- **Training samples** — all opted-in training data deleted
- **Request logs** — prompt and response text redacted (metadata preserved for billing)
- **Conversations** — titles scrubbed

The API returns a count of affected records so you can verify the purge was complete.

## Enterprise baseline checklist

1. Enable zero-retention for regulated workloads.
2. Keep training opt-in disabled by default.
3. Rotate API and webhook secrets on a schedule.
4. Use BYOK when a customer contract requires provider-direct billing.
5. Verify purge counts after enabling zero-retention.

### Semantic Memory API Reference

## Endpoints

## Retrieval flow

## Search call example

```bash
curl -G https://llmwise.ai/api/v1/memory/search \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  --data-urlencode "q=What decision did we make about retries?" \
  --data-urlencode "top_k=4"
```

## Zero-retention behavior

When zero-retention mode is enabled, memory APIs report that memory is disabled and return no persisted entries.

## Operations

### Webhooks and System Sync

## Endpoints

## Clerk events handled

- `user.created` — create local user with signup bonus (40 free credits)
- `user.updated` — sync email and name changes
- `user.deleted` — deactivate user account

Clerk webhooks are verified using Svix signatures. If the auth middleware already auto-created the user before the webhook arrives, the webhook gracefully updates instead of duplicating.

## Stripe events handled

- `checkout.session.completed` — wallet top-up fulfillment
- `checkout.session.async_payment_succeeded` — delayed payment confirmation

Both events trigger the same fulfillment flow: validate metadata, check idempotency, and credit the user wallet. Events are deduplicated by `stripe_payment_id` to prevent double-crediting.

## Sync hardening

## Setup checklist

1. Configure webhook endpoints in the Clerk and Stripe dashboards.
2. Set webhook secrets in environment variables.
3. Send test events and verify logs.
4. Validate duplicate event handling.
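For step 4, the core idea is the idempotency check described under Stripe events: credit a wallet at most once per `stripe_payment_id`. Below is a minimal sketch of that check; the in-memory structures stand in for a real database with a uniqueness constraint, and the metadata keys are assumptions, not the platform's actual fulfillment code.

```python
# Illustrative idempotent fulfillment: credit a wallet at most once per payment id.
processed_payment_ids: set[str] = set()
wallets: dict[str, int] = {}

def fulfill_topup(event: dict) -> bool:
    """Return True if credits were applied, False if the event was a duplicate delivery."""
    session = event["data"]["object"]                # Stripe checkout.session payload
    payment_id = session["payment_intent"]           # dedup key (stripe_payment_id)
    if payment_id in processed_payment_ids:
        return False                                 # duplicate: do not credit twice
    user_id = session["metadata"]["user_id"]         # assumed metadata key
    credits = int(session["metadata"]["credits"])    # assumed metadata key
    wallets[user_id] = wallets.get(user_id, 0) + credits
    processed_payment_ids.add(payment_id)
    return True
```

Sending the same test event twice from the Stripe dashboard should therefore credit the wallet exactly once.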