Auto Routing and Optimization (Load Balancer Mode)
How Auto selects a primary model, adds an implicit fallback chain, and optimizes for cost/latency/reliability over time.
- Copy the request sample from this page.
- Run it in API Explorer with your key.
- Confirm the streaming done payload (finish_reason and charged credits).
- Move the same payload into your backend code.
What Auto does (in one sentence)
model="auto" turns LLMWise into a load balancer for LLMs: it picks the best primary model for each request and (optionally) applies an implicit fallback chain so transient failures do not break your flow.
Auto decision flow
When you send a Chat request with model="auto", the backend:
- Builds a candidate model set (vision-safe if your messages contain images).
- Loads your optimization policy (defaults + guardrails).
- Resolves a goal: balanced | cost | latency | reliability.
- Chooses a primary model using one of two strategies (sketched below):
  - historical_optimization: uses your recent production traces when there is enough data.
  - heuristic_routing: uses a fast heuristic classifier when history is insufficient or your policy disables history.
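To make the flow concrete, here is a minimal, runnable sketch of the selection step. The model list, scores, and thresholds are invented for illustration; the real router runs server-side and is not exposed as code.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    supports_vision: bool
    cost: float        # relative cost
    latency: float     # relative latency
    error_rate: float  # observed failure rate

MODELS = [
    Model("fast-small", False, 0.2, 0.3, 0.02),
    Model("balanced-mid", True, 0.5, 0.5, 0.01),
    Model("strong-large", True, 1.0, 0.9, 0.005),
]

def choose_primary(has_images: bool, goal: str, history_samples: int,
                   min_samples: int = 50) -> tuple[Model, str]:
    # 1. Vision-safe candidate set.
    candidates = [m for m in MODELS if m.supports_vision or not has_images]
    # 2. Strategy: history when there is enough data, heuristic otherwise.
    strategy = ("historical_optimization" if history_samples >= min_samples
                else "heuristic_routing")
    # 3. Score candidates by goal (one toy scorer stands in for both strategies).
    score = {
        "cost": lambda m: m.cost,
        "latency": lambda m: m.latency,
        "reliability": lambda m: m.error_rate,
        "balanced": lambda m: m.cost + m.latency + 10 * m.error_rate,
    }[goal]
    return min(candidates, key=score), strategy

model, strategy = choose_primary(has_images=True, goal="cost", history_samples=10)
print(model.name, strategy)  # balanced-mid heuristic_routing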
The final model is returned to you in resolved_model on the done event (streaming) or in the JSON response (non-stream).
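For a non-streaming call you can read resolved_model straight from the response body. A minimal example with Python's requests library (endpoint and field name as documented on this page):

import os
import requests

resp = requests.post(
    "https://llmwise.ai/api/v1/chat",
    headers={"Authorization": f"Bearer {os.environ['LLMWISE_API_KEY']}"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Ping"}],
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("resolved_model"))  # the model Auto actually ran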
Auto as a load balancer (implicit failover)
Auto can also add a fallback chain even if you do not provide routing.
This is controlled by your optimization policy:
- If max_fallbacks > 0, Auto attaches a fallback chain to the request.
- If max_fallbacks = 0, Auto runs as single-model routing only (no implicit failover).
When an implicit chain is active, LLMWise retries on retryable failures (429/5xx/timeouts), emits routing events (route, trace), and settles billing once a final model succeeds.
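Conceptually, the server-side loop looks like the sketch below. The status-code set matches the retryable failures named above; send and the chain shape are placeholders for internals this page does not specify.

RETRYABLE = {429, 500, 502, 503, 504}  # plus timeouts in practice

def run_with_failover(chain, send):
    """chain: primary model followed by up to max_fallbacks fallbacks.
    send(model) -> (status_code, response). Illustrative only."""
    for model in chain:
        status, response = send(model)
        if status == 200:
            # Billing settles against the model that finally succeeded.
            return model, response
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable failure from {model}: {status}")
        # Retryable failure: emit a route event and try the next model.
    raise RuntimeError("all models in the fallback chain failed")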
Auto is not just "pick a model": the router learns from your real production traces (quality, cost, latency) and continuously improves its fallback choices per workload, which makes its routing hard to replicate elsewhere.
Cost saver mode (shortcut)
If you send cost_saver: true, the server normalizes your request to:
model = "auto"optimization_goal = "cost"
This is supported for POST /api/v1/chat only (not with explicit routing).
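Put differently, the two payloads below are routed identically (illustrative request bodies):

shorthand = {
    "model": "auto",
    "cost_saver": True,
    "messages": [{"role": "user", "content": "Summarize this thread."}],
}

# What the server normalizes it to before routing:
normalized = {
    "model": "auto",
    "optimization_goal": "cost",
    "messages": [{"role": "user", "content": "Summarize this thread."}],
}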
What you see in streaming
In streaming mode (stream: true), you will see:
- Delta chunks: JSON objects with a delta field (text) and a done boolean.
- Mesh/Auto failover events (only when a fallback chain is active):
  - event: "route": model attempts (trying/failed/skipped)
  - event: "chunk": streamed deltas (event-wrapped)
  - event: "trace": final routing summary
- Final billing event: event: "done" with credits_charged, credits_remaining, and (when Auto is used) resolved_model, auto_strategy, optimization_goal.

A minimal dispatcher for these events is sketched below.
API examples
cURL (Auto + cost saver)
curl -X POST https://llmwise.ai/api/v1/chat \
-H "Authorization: Bearer mm_sk_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"cost_saver": true,
"messages": [{"role":"user","content":"Summarize this support thread."}],
"stream": true
}'
Python (SDK)
import os
from llmwise import LLMWise

client = LLMWise(os.environ["LLMWISE_API_KEY"])

for ev in client.chat_stream(
    model="auto",
    optimization_goal="balanced",
    messages=[{"role": "user", "content": "Write a launch plan for a SaaS product."}],
):
    if ev.get("delta"):
        # Text deltas arrive incrementally; print them as they stream.
        print(ev["delta"], end="", flush=True)
    if ev.get("event") == "done":
        # Final billing event carries the model Auto actually used.
        print("\n\nresolved_model:", ev.get("resolved_model"))
        break
TypeScript (SDK)
import { LLMWise } from "llmwise";

const client = new LLMWise(process.env.LLMWISE_API_KEY!);

for await (const ev of client.chatStream({
  model: "auto",
  optimization_goal: "cost",
  messages: [{ role: "user", content: "Draft a short outbound email to a CTO." }],
})) {
  if (ev.delta) process.stdout.write(ev.delta); // incremental text deltas
  if (ev.event === "done") {
    // Final billing event carries the model Auto actually used.
    console.log("\nresolved_model:", (ev as any).resolved_model);
    break;
  }
}