Blend Strategies & Orchestration Algorithms
Deep dive into blend synthesis strategies (Consensus, Council, MoA, Self-MoA), mesh failover, auto-routing heuristics, and optimization scoring.
- Start from your current production prompt/request.
- Run the exact tutorial flow step-by-step once.
- Measure impact in Usage before rollout.
- Promote only when quality/cost/reliability metrics match target.
LLMWise orchestrates multiple models through several algorithmic layers. This guide explains every strategy and algorithm in depth, with special focus on Blend mode — the most configurable.
Blend mode overview
Blend sends your prompt to multiple models simultaneously, then feeds all responses into a synthesizer model that produces one final answer. The synthesis behavior changes depending on which strategy you choose.
| Strategy | Models | Layers | Best for |
|---|---|---|---|
| consensus | 2–6 | 1 | Combining strongest points, resolving contradictions |
| council | 2–6 | 1 | Structured debate: agreements, disagreements, follow-ups |
| best_of | 2–6 | 1 | Quick pick — enhance the best response with additions |
| chain | 2–6 | 1 | Iterative integration across all responses |
| moa | 2–6 | 1–3 | Multi-layer Mixture-of-Agents refinement |
| self_moa | 1 | 1 | Single model, multiple diverse candidates (2–8 samples) |
All strategies follow the same two-phase execution: the prompt is fanned out to every source model in parallel, then the collected responses are handed to the synthesizer.
Strategy: Consensus
The default strategy. The synthesizer receives all source responses and is instructed to combine the strongest points while resolving any contradictions.
- Single-pass synthesis — no refinement layers
- Synthesizer decides which parts of each response to keep
- Contradictions are resolved by weighing the majority view
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "consensus",
"messages": [{"role": "user", "content": "Explain quantum entanglement"}]
}
Strategy: Council
Structures the synthesis as a deliberation. The synthesizer produces:
- Final answer — the synthesized conclusion
- Agreement points — where all models aligned
- Disagreement points — where models diverged, with analysis
- Follow-up questions — areas that need further exploration
Best when you want transparency about model consensus vs. divergence.
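A council request uses the same body shape as the consensus example above; only the strategy value changes (the prompt here is illustrative):
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "council",
"messages": [{"role": "user", "content": "Should we adopt microservices or keep the monolith?"}]
}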
Strategy: Best-Of
The synthesizer picks the single best response, then enhances it with useful additions from the others. The quickest synthesis approach — minimal rewriting, focused on augmentation.
Strategy: Chain
Iterative integration. The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Produces the most thorough output but may be longer.
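Best-Of and Chain also reuse the consensus request shape; swap the strategy value to "best_of" or "chain" (the prompt below is illustrative):
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "chain",
"messages": [{"role": "user", "content": "Summarize the trade-offs of event sourcing"}]
}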
Strategy: MoA (Mixture of Agents)
The most sophisticated strategy. Inspired by the Mixture-of-Agents paper, MoA adds refinement layers where models can see and improve upon previous answers.
How MoA layers work
- Layer 0: Each model answers the prompt independently (same as other strategies).
- Layer 1+: Each model receives the previous layer's answers as reference material, injected via system message. Models are instructed to improve upon, correct, and expand the references.
- Final synthesis: The synthesizer combines all responses from the last completed layer.
Reference injection
Previous-layer answers are injected into each model's context:
- Total reference budget: 12,000 characters across all references
- Per-answer cap: 3,200 characters (truncated if longer)
- Injection method: System message + follow-up user message containing formatted references
Early stopping
If a layer produces zero successful responses, MoA keeps the previous layer's successes and skips to synthesis. This prevents total failure when models hit rate limits or errors.
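A minimal Python sketch of this layer loop, using the reference budgets and early-stopping rule described above; call_model and synthesize are hypothetical placeholders, not the actual LLMWise internals:
import asyncio

PER_ANSWER_CAP = 3_200      # characters kept from each previous-layer answer
TOTAL_REF_BUDGET = 12_000   # character budget across all injected references

def build_references(answers):
    # Truncate each answer and stop once the total budget is exhausted.
    refs, used = [], 0
    for i, text in enumerate(answers):
        snippet = text[:PER_ANSWER_CAP]
        if used + len(snippet) > TOTAL_REF_BUDGET:
            break
        refs.append(f"Reference {i + 1}:\n{snippet}")
        used += len(snippet)
    return "\n\n".join(refs)

async def run_moa(models, prompt, layers, call_model, synthesize):
    # Layer 0: each model answers independently.
    answers = [a for a in await asyncio.gather(*(call_model(m, prompt) for m in models)) if a]
    for _ in range(layers):
        refs = build_references(answers)
        new = [a for a in await asyncio.gather(
            *(call_model(m, prompt, references=refs) for m in models)) if a]
        if not new:          # early stopping: keep the previous layer's successes
            break
        answers = new
    return await synthesize(answers)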
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "moa",
"layers": 2,
"messages": [{"role": "user", "content": "Design a rate limiter for a distributed system"}]
}
Set layers from 1 to 3. Layer 0 (initial answers) always runs. So layers: 2 means 3 total rounds: initial + 2 refinement passes. More layers = better quality but higher latency and cost.
Strategy: Self-MoA
Self-MoA generates diverse candidates from a single model by varying temperature and system prompts. This is useful when you trust one model but want to hedge against its variance.
How it works
- You provide exactly 1 model in models[]
- Set samples (2–8, default 4) for how many candidates to generate
- Each candidate runs with a different temperature offset and agent prompt
- The synthesizer combines all candidates into one final answer
Temperature variation
Each candidate gets a different temperature to encourage diversity:
Base offsets: [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
Final temp = clamp(base_temp + offset, 0.2, 1.4)
For example, with temperature: 0.7 and 4 samples:
- Candidate 1: temp 0.45 (conservative)
- Candidate 2: temp 0.70 (baseline)
- Candidate 3: temp 0.95 (creative)
- Candidate 4: temp 1.15 (exploratory)
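The mapping from base temperature to candidate temperatures is easy to reproduce; a sketch using the offsets and clamp bounds listed above (values rounded for display):
OFFSETS = [-0.25, 0.0, 0.25, 0.45, 0.15, 0.35, -0.1, 0.3]

def candidate_temps(base_temp, samples):
    # Clamp each candidate's temperature into [0.2, 1.4].
    return [round(min(1.4, max(0.2, base_temp + OFFSETS[i])), 2) for i in range(samples)]

print(candidate_temps(0.7, 4))  # [0.45, 0.7, 0.95, 1.15]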
Agent prompt rotation
Six distinct system prompts rotate across candidates, each emphasizing a different quality; among them are Clarity (plain-language explanations) and Skepticism (challenge assumptions, flag weaknesses).
{
"models": ["claude-sonnet-4.5"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "self_moa",
"samples": 4,
"temperature": 0.7,
"messages": [{"role": "user", "content": "Write a Python async rate limiter"}]
}
Self-MoA requires exactly 1 model in models[]. Using 2+ models will return a validation error. The synthesizer can be the same model or a different one.
Blend credit cost
Every blend request costs 4 credits, regardless of strategy or model count. Credits are reserved upfront and refunded if all source models fail. After completion, the actual provider cost is settled — you may receive a partial refund if the real cost was lower than the reservation.
Compare mode algorithm
Compare runs 2–9 models concurrently and streams their responses side-by-side.
- All models stream via an asyncio.Queue — chunks are yielded in arrival order (not round-robin)
- Queue timeout: 120 seconds per chunk
- After all models finish, a summary event reports the fastest model and longest response
- Total latency = max(individual latencies) — bottleneck is the slowest model
- Cost: 3 credits. Refunded if all models fail; partial status logged if some succeed.
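A simplified sketch of the arrival-order merge, assuming each model exposes an async chunk stream; stream_model and the yielded event shape are illustrative, not the real API:
import asyncio

CHUNK_TIMEOUT = 120  # seconds to wait for the next chunk from any model

async def compare(models, prompt, stream_model):
    queue = asyncio.Queue()

    async def pump(model):
        async for chunk in stream_model(model, prompt):
            await queue.put((model, chunk))
        await queue.put((model, None))              # sentinel: this model is done

    tasks = [asyncio.create_task(pump(m)) for m in models]
    remaining = len(models)
    while remaining:
        model, chunk = await asyncio.wait_for(queue.get(), timeout=CHUNK_TIMEOUT)
        if chunk is None:
            remaining -= 1
        else:
            yield {"model": model, "chunk": chunk}  # chunks arrive in completion order
    await asyncio.gather(*tasks)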
Judge mode algorithm
Judge runs a three-phase competitive evaluation: the contestant models answer the prompt, a judge model scores each response against the evaluation criteria, and structured rankings are returned.
Scoring system
The judge produces structured JSON with rankings sorted by score descending:
{
"rankings": [
{"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "Most complete and well-structured"},
{"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "Accurate but less organized"}
],
"overall_analysis": "Claude response covered more edge cases..."
}
Default evaluation criteria: accuracy, completeness, clarity, helpfulness, code quality. You can override these with the criteria parameter.
Fallback scoring: if the judge returns malformed JSON, default scores of 8.0 - (i * 0.5) are assigned to contestants in order, with a note that scores were auto-assigned. Judge mode costs 5 credits.
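The fallback rule amounts to a few lines; a sketch assuming contestants are kept in their original order:
def fallback_scores(contestants):
    # Used only when the judge's JSON cannot be parsed.
    return [
        {"model": m, "rank": i + 1, "score": 8.0 - i * 0.5, "reasoning": "auto-assigned"}
        for i, m in enumerate(contestants)
    ]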
Mesh mode: circuit breaker failover
When you use mesh mode (chat with routing parameter), LLMWise tries models in sequence with automatic failover powered by a circuit breaker.
Circuit breaker state machine
Each model tracks health in-memory:
| State | Condition | Behavior |
|---|---|---|
| Healthy | consecutive_failures < 3 | Model available for requests |
| Open | 3+ consecutive failures | Model skipped for 30 seconds |
| Half-Open | 30s elapsed since circuit opened | One probe request allowed |
| Recovered | Probe succeeds | Reset to Healthy, consecutive_failures = 0 |
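A minimal in-memory breaker matching the table above (3 consecutive failures open the circuit, 30-second cooldown before a probe); the class and method names are illustrative:
import time

FAILURE_THRESHOLD = 3
COOLDOWN_SECONDS = 30

class ModelCircuit:
    def __init__(self):
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def available(self):
        if self.consecutive_failures < FAILURE_THRESHOLD:
            return True                                      # Healthy
        if time.time() - self.opened_at >= COOLDOWN_SECONDS:
            return True                                      # Half-Open: allow one probe
        return False                                         # Open: skip this model

    def record_success(self):
        self.consecutive_failures = 0                        # Recovered -> Healthy
        self.opened_at = 0.0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self.opened_at = time.time()                     # (re)open, restart cooldown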
Failover sequence
- Try primary model first
- If it fails (or circuit is open), try fallback 1, then fallback 2, etc.
- For each attempt: emit a route event (trying, failed, or skipped)
- First success stops the chain — no further fallbacks tried
- After all attempts, emit a trace event summarizing the route
Latency tracking
Model latency is tracked with exponential smoothing:
avg_latency = (avg_latency * 0.8) + (new_latency * 0.2)
This favors recent measurements, so a model that recovers from a slow period will quickly show improved latency.
Auto-router: heuristic classification
When you set model: "auto", LLMWise classifies your query using zero-latency regex matching (no LLM call overhead) and routes to the best model.
| Category | Pattern examples | Routed to |
|---|---|---|
| Code | function, debug, error, refactor, python, git, docker | gpt-5.2 |
| Math | equation, integral, derivative, probability, proof | claude-sonnet-4.5 |
| Creative | write poem, story, brainstorm, roleplay, screenplay | claude-sonnet-4.5 |
| Translation | translate, Spanish, French, Chinese, Japanese | gemini-3-flash |
| Quick fact | Short query (60 chars or less), no pattern match | gemini-3-flash |
| Analysis | Default fallback for everything else | gpt-5.2 |
| Vision | Any request with images attached | gpt-5.2 (override) |
If your message contains images, auto-router always picks a vision-capable model regardless of query content.
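A rough sketch of the zero-latency classification with abbreviated pattern lists; the actual internal regexes are more extensive, so treat these as illustrative:
import re

RULES = [
    (r"\b(function|debug|error|refactor|python|git|docker)\b", "gpt-5.2"),          # code
    (r"\b(equation|integral|derivative|probability|proof)\b", "claude-sonnet-4.5"),  # math
    (r"\b(poem|story|brainstorm|roleplay|screenplay)\b", "claude-sonnet-4.5"),       # creative
    (r"\b(translate|spanish|french|chinese|japanese)\b", "gemini-3-flash"),          # translation
]

def route(query, has_images=False):
    if has_images:
        return "gpt-5.2"              # vision override
    q = query.lower()
    for pattern, model in RULES:
        if re.search(pattern, q):
            return model
    if len(query) <= 60:
        return "gemini-3-flash"       # quick fact
    return "gpt-5.2"                  # analysis fallback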
Policy-based routing
If you have an optimization policy enabled with sufficient historical data, auto-router upgrades from regex heuristics to historical optimization — routing based on actual performance data from your past requests. See the next section.
Optimization scoring algorithm
The optimization engine analyzes your historical request logs and recommends the best model + fallback chain for each goal.
Goals and weight vectors
Each goal uses different weights for the three scoring dimensions:
| Goal | Success rate weight | Speed weight | Cost weight |
|---|---|---|---|
| Balanced | 0.50 | 0.30 | 0.20 |
| Latency (Speed) | 0.30 | 0.60 | 0.10 |
| Cost | 0.30 | 0.10 | 0.60 |
| Reliability | 0.75 | 0.20 | 0.05 |
Scoring formula
For each eligible model (minimum 3 calls in lookback window):
inv_latency = (max_latency - model_latency) / (max_latency - min_latency)
inv_cost = (max_cost - model_cost) / (max_cost - min_cost)
raw_score = (Ws * success_rate) + (Wl * inv_latency) + (Wc * inv_cost)
sample_factor = min(1.0, calls / 20)
score = raw_score * (0.7 + 0.3 * sample_factor)
The sample factor gives a small boost to models with more data: a model with 20+ calls in the window gets the full score, while models with fewer calls are penalized by up to 30% (roughly 25% at the 3-call minimum). Preferred models get an additional +0.04 * sample_factor bonus.
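The scoring formula translated into a small Python function; the stats field names are assumed for illustration, while the weights and constants come from the tables above:
def score_model(stats, weights, min_lat, max_lat, min_cost, max_cost, preferred=False):
    # stats: {"success_rate": float, "avg_latency": float, "avg_cost": float, "calls": int}
    ws, wl, wc = weights                              # e.g. (0.50, 0.30, 0.20) for Balanced
    inv_latency = (max_lat - stats["avg_latency"]) / ((max_lat - min_lat) or 1e-9)
    inv_cost = (max_cost - stats["avg_cost"]) / ((max_cost - min_cost) or 1e-9)
    raw = ws * stats["success_rate"] + wl * inv_latency + wc * inv_cost
    sample_factor = min(1.0, stats["calls"] / 20)
    score = raw * (0.7 + 0.3 * sample_factor)
    if preferred:
        score += 0.04 * sample_factor
    return score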
Confidence score
confidence = min(1.0, total_calls / 60)
At 60+ total calls across all models, confidence reaches 1.0 (full certainty). Below that, the recommendation carries a lower confidence signal.
Guardrails
After scoring, models are filtered through policy guardrails:
- Max latency: Reject models above threshold (e.g., 5000ms)
- Max cost: Reject models above per-request cost (e.g., $0.05)
- Min success rate: Reject models below reliability threshold (e.g., 0.95)
The top model that passes all guardrails becomes the recommended primary. The next N models become the fallback chain (configurable, 0–6 fallbacks).
Credit settlement algorithm
LLMWise uses a three-phase credit system: reserve, execute, settle.
Settlement formula
Reserved credits are debited at request start. After execution, LLMWise reconciles that reserve against actual token usage.
- If usage is lower than reserved credits, unused credits are refunded.
- If usage is higher, only the difference is charged.
- BYOK requests are billed directly by the provider and remain at 0 credits.
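The settlement rule in the bullets above reduces to a small function; the BYOK handling is collapsed into a flag and the names are illustrative:
def settle(reserved_credits, actual_credits, byok=False):
    # Returns the adjustment after a request: negative = refund, positive = extra charge.
    if byok:
        return 0.0                                   # BYOK stays at 0 credits
    if actual_credits < reserved_credits:
        return -(reserved_credits - actual_credits)  # refund unused credits
    return actual_credits - reserved_credits         # charge only the difference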