API Core

Auto Routing and Optimization (Load Balancer Mode)

How Auto selects a primary model, adds an implicit fallback chain, and optimizes for cost/latency/reliability over time.

10 min read · Updated 2026-02-15
Quick Start
  1. Copy the request sample from this page.
  2. Run it in API Explorer with your key.
  3. Confirm the stream's done payload (finish_reason + charged credits).
  4. Move the same payload into your backend code.

What Auto does (in one sentence)

model="auto" turns LLMWise into a load balancer for LLMs: it picks the best primary model for each request and (optionally) applies an implicit fallback chain so transient failures do not break your flow.

Auto decision flow

When you send a Chat request with model="auto", the backend:

  1. Builds a candidate model set (vision-safe if your messages contain images).
  2. Loads your optimization policy (defaults + guardrails).
  3. Resolves a goal: balanced | cost | latency | reliability.
  4. Chooses a primary model using one of two strategies:
    • historical_optimization: uses your recent production traces when there is enough data.
    • heuristic_routing: uses a fast heuristic classifier when history is insufficient or policy disables history.

The final model is returned to you in resolved_model on the done event (streaming) or in the JSON response (non-stream).
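
For example, you can confirm which model Auto picked on a non-streaming call by reading resolved_model from the response body. A minimal sketch using plain HTTP rather than the SDK; it assumes stream: false yields a single JSON object containing resolved_model, per the description above:

import os
import requests

# Non-streaming sketch: with model="auto", the model Auto settled on
# is returned as resolved_model in the JSON response body.
resp = requests.post(
    "https://llmwise.ai/api/v1/chat",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "model": "auto",
        "optimization_goal": "balanced",
        "messages": [{"role": "user", "content": "Classify this support ticket."}],
        "stream": False,
    },
    timeout=60,
)
data = resp.json()
print("resolved_model:", data.get("resolved_model"))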

Auto as a load balancer (implicit failover)

Auto can also add a fallback chain even if you do not provide routing.

This is controlled by your optimization policy:

  • If max_fallbacks > 0, Auto will attach a fallback chain to the request.
  • If max_fallbacks = 0, Auto will run as single-model routing only (no implicit failover).

When an implicit chain is active, LLMWise retries on retryable failures (429/5xx/timeouts), emits routing events (route, trace), and settles billing once a final model succeeds.
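
To see the implicit chain at work, you can log the routing events as they arrive. A minimal sketch built on the Python SDK stream from the examples below; the exact payload inside route and trace events (which model was tried, why it failed) is not documented here, so the sketch just prints the raw events:

import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this incident report."}],
):
    event = ev.get("event")
    if event == "route":
        # Model attempts (trying/failed/skipped) while the fallback chain runs.
        print("route:", ev)
    elif event == "trace":
        # Final routing summary once a model has succeeded.
        print("trace:", ev)
    elif event == "done":
        # Billing settles only after the final model succeeds.
        print("credits_charged:", ev.get("credits_charged"))
        break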

Why we call this a moat

Auto is not just “pick a model”. It becomes hard to copy when the router learns from your real production traces (quality/cost/latency) and continuously improves fallback choices per workload.

Cost saver mode (shortcut)

If you send cost_saver: true, the server normalizes your request to:

  • model = "auto"
  • optimization_goal = "cost"

This is supported for POST /api/v1/chat only (not with explicit routing).
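
For illustration, the shorthand below is handled exactly as if you had spelled out the Auto settings yourself (a sketch based on the normalization rule above):

{
  "model": "auto",
  "cost_saver": true,
  "messages": [{"role": "user", "content": "Summarize this support thread."}]
}

is equivalent to:

{
  "model": "auto",
  "optimization_goal": "cost",
  "messages": [{"role": "user", "content": "Summarize this support thread."}]
}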

What you see in streaming

In streaming mode (stream: true), you will see the events below (a short consumption sketch follows this list):

  • delta chunks: JSON objects with a delta field (text) and a done boolean.
  • Mesh/Auto failover events (only when a fallback chain is active):
    • event: "route": model attempts (trying/failed/skipped)
    • event: "chunk": streamed deltas (event-wrapped)
    • event: "trace": final routing summary
  • final billing event:
    • event: "done" with credits_charged, credits_remaining, and (when Auto is used) resolved_model, auto_strategy, optimization_goal.

API examples

cURL (Auto + cost saver)

curl -X POST https://llmwise.ai/api/v1/chat \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "cost_saver": true,
    "messages": [{"role":"user","content":"Summarize this support thread."}],
    "stream": true
  }'

Python (SDK)

import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    optimization_goal="balanced",
    messages=[{"role": "user", "content": "Write a launch plan for a SaaS product."}],
):
    if ev.get("delta"):
        print(ev["delta"], end="", flush=True)
    if ev.get("event") == "done":
        print("\\n\\nresolved_model:", ev.get("resolved_model"))
        break

TypeScript (SDK)

import { LLMWise } from "llmwise";

const client = new LLMWise(process.env.LLMWISE_API_KEY!);

for await (const ev of client.chatStream({
  model: "auto",
  optimization_goal: "cost",
  messages: [{ role: "user", content: "Draft a short outbound email to a CTO." }],
})) {
  if (ev.delta) process.stdout.write(ev.delta);
  if (ev.event === "done") {
    console.log("\\nresolved_model:", (ev as any).resolved_model);
    break;
  }
}