API Core

Auto Routing and Optimization (Load Balancer Mode)

How Auto selects a primary model, adds an implicit fallback chain, and optimizes for cost/latency/reliability over time.

10 min read · Updated 2026-02-15
Quick Start
  1. Copy the request sample from this page.
  2. Run it in API Explorer with your key.
  3. Confirm the stream's done payload (finish_reason + charged credits).
  4. Move the same payload into your backend code.

What Auto does (in one sentence)

model="auto" turns LLMWise into a load balancer for LLMs: it picks the best primary model for each request and (optionally) applies an implicit fallback chain so transient failures do not break your flow.

Auto decision flow

When you send a Chat request with model="auto", the backend:

  1. Builds a candidate model set (vision-safe if your messages contain images).
  2. Loads your optimization policy (defaults + guardrails).
  3. Resolves a goal: balanced | cost | latency | reliability.
  4. Chooses a primary model using one of two strategies:
    • historical_optimization: uses your recent production traces when there is enough data.
    • heuristic_routing: uses a fast heuristic classifier when history is insufficient or policy disables history.

The final model is returned to you in resolved_model on the done event (streaming) or in the JSON response (non-stream).
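
For example, you can confirm which model Auto picked on a non-streaming call by reading resolved_model from the response body. A minimal sketch using plain HTTP rather than the SDK; it assumes stream: false yields a single JSON object containing resolved_model, per the description above:

import os
import requests

# Non-streaming sketch: with model="auto", the model Auto settled on
# is returned as resolved_model in the JSON response body.
resp = requests.post(
    "https://llmwise.ai/api/v1/chat",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "model": "auto",
        "optimization_goal": "balanced",
        "messages": [{"role": "user", "content": "Classify this support ticket."}],
        "stream": False,
    },
    timeout=60,
)
data = resp.json()
print("resolved_model:", data.get("resolved_model"))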

Auto as a load balancer (implicit failover)

Auto can also add a fallback chain even if you do not provide routing.

This is controlled by your optimization policy:

  • If max_fallbacks > 0, Auto will attach a fallback chain to the request.
  • If max_fallbacks = 0, Auto will run as single-model routing only (no implicit failover).

When an implicit chain is active, LLMWise retries on retryable failures (429/5xx/timeouts), emits routing events (route, trace), and settles billing once a final model succeeds.
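
To see the implicit chain at work, you can log the routing events as they arrive. A minimal sketch built on the Python SDK stream from the examples below; the exact payload inside route and trace events (which model was tried, why it failed) is not documented here, so the sketch just prints the raw events:

import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this incident report."}],
):
    event = ev.get("event")
    if event == "route":
        # Model attempts (trying/failed/skipped) while the fallback chain runs.
        print("route:", ev)
    elif event == "trace":
        # Final routing summary once a model has succeeded.
        print("trace:", ev)
    elif event == "done":
        # Billing settles only after the final model succeeds.
        print("credits_charged:", ev.get("credits_charged"))
        break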

Why we call this a moat

Auto is not just “pick a model”. It becomes hard to copy when the router learns from your real production traces (quality/cost/latency) and continuously improves fallback choices per workload.

Cost saver mode (shortcut)

If you send cost_saver: true, the server normalizes your request to:

  • model = "auto"
  • optimization_goal = "cost"

This is supported for POST /api/v1/chat only (not with explicit routing).
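
For illustration, the shorthand below is handled exactly as if you had spelled out the Auto settings yourself (a sketch based on the normalization rule above):

{
  "model": "auto",
  "cost_saver": true,
  "messages": [{"role": "user", "content": "Summarize this support thread."}]
}

is equivalent to:

{
  "model": "auto",
  "optimization_goal": "cost",
  "messages": [{"role": "user", "content": "Summarize this support thread."}]
}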

What you see in streaming

In streaming mode (stream: true), you will see the events below (a short consumption sketch follows this list):

  • delta chunks: JSON objects with a delta field (text) and a done boolean.
  • Mesh/Auto failover events (only when a fallback chain is active):
    • event: "route": model attempts (trying/failed/skipped)
    • event: "chunk": streamed deltas (event-wrapped)
    • event: "trace": final routing summary
  • final billing event:
    • event: "done" with credits_charged, credits_remaining, and (when Auto is used) resolved_model, auto_strategy, optimization_goal.

API examples

cURL (Auto + cost saver)

curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "cost_saver": true,
    "messages": [{"role":"user","content":"Summarize this support thread."}],
    "stream": true
  }'

Python (SDK)

import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    optimization_goal="balanced",
    messages=[{"role": "user", "content": "Write a launch plan for a SaaS product."}],
):
    if ev.get("delta"):
        print(ev["delta"], end="", flush=True)
    if ev.get("event") == "done":
        print("\\n\\nresolved_model:", ev.get("resolved_model"))
        break

TypeScript (SDK)

import { LLMWise } from "llmwise";

const client = new LLMWise(process.env.LLMWISE_API_KEY!);

for await (const ev of client.chatStream({
  model: "auto",
  optimization_goal: "cost",
  messages: [{ role: "user", content: "Draft a short outbound email to a CTO." }],
})) {
  if (ev.delta) process.stdout.write(ev.delta);
  if (ev.event === "done") {
    console.log("\\nresolved_model:", (ev as any).resolved_model);
    break;
  }
}