Blend Strategies & Orchestration Algorithms
Deep dive into blend synthesis strategies (Consensus, Council, MoA, Self-MoA), mesh failover, auto-routing heuristics, and optimization scoring.
- Start from your current production prompt/request.
- Run the exact tutorial flow step-by-step once.
- Measure impact in Usage before rollout.
- Promote only when quality/cost/reliability metrics match target.
LLMWise orchestrates multiple models through several algorithmic layers. This guide explains every strategy and algorithm in depth, with special focus on Blend mode — the most configurable.
Blend mode overview
Blend sends your prompt to multiple models simultaneously, then feeds all responses into a synthesizer model that produces one final answer. The synthesis behavior changes depending on which strategy you choose.
| Strategy | Models | Layers | Best for |
|---|---|---|---|
| consensus | 2–6 | 1 | Combining strongest points, resolving contradictions |
| council | 2–6 | 1 | Structured debate: agreements, disagreements, follow-ups |
| best_of | 2–6 | 1 | Quick pick — enhance the best response with additions |
| chain | 2–6 | 1 | Iterative integration across all responses |
| moa | 2–6 | 1–3 | Multi-layer Mixture-of-Agents refinement |
| self_moa | 1 | 1 | Single model, multiple diverse candidates (2–8 samples) |
All strategies follow the same two-phase execution: the prompt is fanned out to every source model in parallel, then the collected responses are handed to the synthesizer.
Strategy: Consensus
The default strategy. The synthesizer receives all source responses and is instructed to combine the strongest points while resolving any contradictions.
- Single-pass synthesis — no refinement layers
- Synthesizer decides which parts of each response to keep
- Contradictions are resolved by weighing the majority view
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "consensus",
"messages": [{"role": "user", "content": "Explain quantum entanglement"}]
}
Strategy: Council
Structures the synthesis as a deliberation. The synthesizer produces:
- Final answer — the synthesized conclusion
- Agreement points — where all models aligned
- Disagreement points — where models diverged, with analysis
- Follow-up questions — areas that need further exploration
Best when you want transparency about model consensus vs. divergence.
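A council request uses the same body shape as the consensus example above; only the strategy value changes (the prompt here is illustrative):
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "council",
"messages": [{"role": "user", "content": "Should we adopt microservices or keep the monolith?"}]
}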
Strategy: Best-Of
The synthesizer picks the single best response, then enhances it with useful additions from the others. The quickest synthesis approach — minimal rewriting, focused on augmentation.
Strategy: Chain
Iterative integration. The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Produces the most thorough output but may be longer.
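Best-Of and Chain also reuse the consensus request shape; swap the strategy value to "best_of" or "chain" (the prompt below is illustrative):
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "chain",
"messages": [{"role": "user", "content": "Summarize the trade-offs of event sourcing"}]
}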
Strategy: MoA (Mixture of Agents)
The most sophisticated strategy. Inspired by the Mixture-of-Agents paper, MoA adds refinement layers where models can see and improve upon previous answers.
How MoA layers work
- Layer 0: Each model answers the prompt independently (same as other strategies).
- Layer 1+: Each model receives the previous layer's answers as reference material, injected via system message. Models are instructed to improve upon, correct, and expand the references.
- Final synthesis: The synthesizer combines all responses from the last completed layer.
Reference injection
Previous-layer answers are injected into each model's context:
- Total reference budget: 12,000 characters across all references
- Per-answer cap: 3,200 characters (truncated if longer)
- Injection method: System message + follow-up user message containing formatted references
Early stopping
If a layer produces zero successful responses, MoA keeps the previous layer's successes and skips to synthesis. This prevents total failure when models hit rate limits or errors.
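A minimal Python sketch of this layer loop, using the reference budgets and early-stopping rule described above; call_model and synthesize are hypothetical placeholders, not the actual LLMWise internals:
import asyncio

PER_ANSWER_CAP = 3_200      # characters kept from each previous-layer answer
TOTAL_REF_BUDGET = 12_000   # character budget across all injected references

def build_references(answers):
    # Truncate each answer and stop once the total budget is exhausted.
    refs, used = [], 0
    for i, text in enumerate(answers):
        snippet = text[:PER_ANSWER_CAP]
        if used + len(snippet) > TOTAL_REF_BUDGET:
            break
        refs.append(f"Reference {i + 1}:\n{snippet}")
        used += len(snippet)
    return "\n\n".join(refs)

async def run_moa(models, prompt, layers, call_model, synthesize):
    # Layer 0: each model answers independently.
    answers = [a for a in await asyncio.gather(*(call_model(m, prompt) for m in models)) if a]
    for _ in range(layers):
        refs = build_references(answers)
        new = [a for a in await asyncio.gather(
            *(call_model(m, prompt, references=refs) for m in models)) if a]
        if not new:          # early stopping: keep the previous layer's successes
            break
        answers = new
    return await synthesize(answers)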
{
"models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "moa",
"layers": 2,
"messages": [{"role": "user", "content": "Design a rate limiter for a distributed system"}]
}
Set layers from 1 to 3. Layer 0 (initial answers) always runs. So layers: 2 means 3 total rounds: initial + 2 refinement passes. More layers = better quality but higher latency and cost.
Strategy: Self-MoA
Self-MoA generates diverse candidates from a single model by varying temperature and system prompts. This is useful when you trust one model but want to hedge against its variance.
How it works
- You provide exactly 1 model in models[]
- Set samples (2–8, default 4) for how many candidates to generate
- Each candidate runs with a different temperature offset and agent prompt
- The synthesizer combines all candidates into one final answer
Temperature variation
Each candidate gets a different temperature to encourage diversity:
Base offsets: [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
Final temp = clamp(base_temp + offset, 0.2, 1.4)
For example, with temperature: 0.7 and 4 samples:
- Candidate 1: temp 0.45 (conservative)
- Candidate 2: temp 0.70 (baseline)
- Candidate 3: temp 0.95 (creative)
- Candidate 4: temp 1.15 (exploratory)
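The mapping from base temperature to candidate temperatures is easy to reproduce; a sketch using the offsets and clamp bounds listed above (values rounded for display):
OFFSETS = [-0.25, 0.0, 0.25, 0.45, 0.15, 0.35, -0.1, 0.3]

def candidate_temps(base_temp, samples):
    # Clamp each candidate's temperature into [0.2, 1.4].
    return [round(min(1.4, max(0.2, base_temp + OFFSETS[i])), 2) for i in range(samples)]

print(candidate_temps(0.7, 4))  # [0.45, 0.7, 0.95, 1.15]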
Agent prompt rotation
Six distinct system prompts rotate across candidates, each emphasizing a different quality; among them are Clarity (plain-language explanations) and Skepticism (challenge assumptions, flag weaknesses).
{
"models": ["claude-sonnet-4.5"],
"synthesizer": "claude-sonnet-4.5",
"strategy": "self_moa",
"samples": 4,
"temperature": 0.7,
"messages": [{"role": "user", "content": "Write a Python async rate limiter"}]
}
Self-MoA requires exactly 1 model in models[]. Using 2+ models will return a validation error. The synthesizer can be the same model or a different one.
Blend credit cost
Every blend request costs 4 credits, regardless of strategy or model count. Credits are reserved upfront and refunded if all source models fail. After completion, the actual provider cost is settled — you may receive a partial refund if the real cost was lower than the reservation.
Compare mode algorithm
Compare runs 2–9 models concurrently and streams their responses side-by-side.
- All models stream via an asyncio.Queue — chunks are yielded in arrival order (not round-robin)
- Queue timeout: 120 seconds per chunk
- After all models finish, a summary event reports the fastest model and longest response
- Total latency = max(individual latencies) — bottleneck is the slowest model
- Cost: 3 credits. Refunded if all models fail; partial status logged if some succeed.
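A simplified sketch of the arrival-order merge, assuming each model exposes an async chunk stream; stream_model and the yielded event shape are illustrative, not the real API:
import asyncio

CHUNK_TIMEOUT = 120  # seconds to wait for the next chunk from any model

async def compare(models, prompt, stream_model):
    queue = asyncio.Queue()

    async def pump(model):
        async for chunk in stream_model(model, prompt):
            await queue.put((model, chunk))
        await queue.put((model, None))              # sentinel: this model is done

    tasks = [asyncio.create_task(pump(m)) for m in models]
    remaining = len(models)
    while remaining:
        model, chunk = await asyncio.wait_for(queue.get(), timeout=CHUNK_TIMEOUT)
        if chunk is None:
            remaining -= 1
        else:
            yield {"model": model, "chunk": chunk}  # chunks arrive in completion order
    await asyncio.gather(*tasks)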
Judge mode algorithm
Judge runs a three-phase competitive evaluation: the contestant models answer the prompt, a judge model scores each response against the evaluation criteria, and structured rankings are returned.
Scoring system
The judge produces structured JSON with rankings sorted by score descending:
{
"rankings": [
{"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "Most complete and well-structured"},
{"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "Accurate but less organized"}
],
"overall_analysis": "Claude response covered more edge cases..."
}
Default evaluation criteria: accuracy, completeness, clarity, helpfulness, code quality. You can override these with the criteria parameter.
Fallback scoring: if the judge returns malformed JSON, default scores of 8.0 - (i * 0.5) are assigned to contestants in order, with a note that scores were auto-assigned. Judge mode costs 5 credits.
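The fallback rule amounts to a few lines; a sketch assuming contestants are kept in their original order:
def fallback_scores(contestants):
    # Used only when the judge's JSON cannot be parsed.
    return [
        {"model": m, "rank": i + 1, "score": 8.0 - i * 0.5, "reasoning": "auto-assigned"}
        for i, m in enumerate(contestants)
    ]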
Mesh mode: circuit breaker failover
When you use mesh mode (chat with routing parameter), LLMWise tries models in sequence with automatic failover powered by a circuit breaker.
Circuit breaker state machine
Each model tracks health in-memory:
| State | Condition | Behavior |
|---|---|---|
| Healthy | consecutive_failures < 3 | Model available for requests |
| Open | 3+ consecutive failures | Model skipped for 30 seconds |
| Half-Open | 30s elapsed since circuit opened | One probe request allowed |
| Recovered | Probe succeeds | Reset to Healthy, consecutive_failures = 0 |
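A minimal in-memory breaker matching the table above (3 consecutive failures open the circuit, 30-second cooldown before a probe); the class and method names are illustrative:
import time

FAILURE_THRESHOLD = 3
COOLDOWN_SECONDS = 30

class ModelCircuit:
    def __init__(self):
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def available(self):
        if self.consecutive_failures < FAILURE_THRESHOLD:
            return True                                      # Healthy
        if time.time() - self.opened_at >= COOLDOWN_SECONDS:
            return True                                      # Half-Open: allow one probe
        return False                                         # Open: skip this model

    def record_success(self):
        self.consecutive_failures = 0                        # Recovered -> Healthy
        self.opened_at = 0.0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self.opened_at = time.time()                     # (re)open, restart cooldown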
Failover sequence
- Try primary model first
- If it fails (or circuit is open), try fallback 1, then fallback 2, etc.
- For each attempt: emit a route event (trying, failed, or skipped)
- First success stops the chain — no further fallbacks tried
- After all attempts, emit a trace event summarizing the route
Latency tracking
Model latency is tracked with exponential smoothing:
avg_latency = (avg_latency * 0.8) + (new_latency * 0.2)
This favors recent measurements, so a model that recovers from a slow period will quickly show improved latency.
Auto-router: heuristic classification
When you set model: "auto", LLMWise classifies your query using zero-latency regex matching (no LLM call overhead) and routes to the best model.
| Category | Pattern examples | Routed to |
|---|---|---|
| Code | function, debug, error, refactor, python, git, docker | gpt-5.2 |
| Math | equation, integral, derivative, probability, proof | claude-sonnet-4.5 |
| Creative | write poem, story, brainstorm, roleplay, screenplay | claude-sonnet-4.5 |
| Translation | translate, Spanish, French, Chinese, Japanese | gemini-3-flash |
| Quick fact | Short query (60 chars or less), no pattern match | gemini-3-flash |
| Analysis | Default fallback for everything else | gpt-5.2 |
| Vision | Any request with images attached | gpt-5.2 (override) |
If your message contains images, auto-router always picks a vision-capable model regardless of query content.
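A rough sketch of the zero-latency classification with abbreviated pattern lists; the actual internal regexes are more extensive, so treat these as illustrative:
import re

RULES = [
    (r"\b(function|debug|error|refactor|python|git|docker)\b", "gpt-5.2"),          # code
    (r"\b(equation|integral|derivative|probability|proof)\b", "claude-sonnet-4.5"),  # math
    (r"\b(poem|story|brainstorm|roleplay|screenplay)\b", "claude-sonnet-4.5"),       # creative
    (r"\b(translate|spanish|french|chinese|japanese)\b", "gemini-3-flash"),          # translation
]

def route(query, has_images=False):
    if has_images:
        return "gpt-5.2"              # vision override
    q = query.lower()
    for pattern, model in RULES:
        if re.search(pattern, q):
            return model
    if len(query) <= 60:
        return "gemini-3-flash"       # quick fact
    return "gpt-5.2"                  # analysis fallback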
Policy-based routing
If you have an optimization policy enabled with sufficient historical data, auto-router upgrades from regex heuristics to historical optimization — routing based on actual performance data from your past requests. See the next section.
Optimization scoring algorithm
The optimization engine analyzes your historical request logs and recommends the best model + fallback chain for each goal.
Goals and weight vectors
Each goal uses different weights for the three scoring dimensions:
| Goal | Success rate weight | Speed weight | Cost weight |
|---|---|---|---|
| Balanced | 0.50 | 0.30 | 0.20 |
| Latency (Speed) | 0.30 | 0.60 | 0.10 |
| Cost | 0.30 | 0.10 | 0.60 |
| Reliability | 0.75 | 0.20 | 0.05 |
Scoring formula
For each eligible model (minimum 3 calls in lookback window):
inv_latency = (max_latency - model_latency) / (max_latency - min_latency)
inv_cost = (max_cost - model_cost) / (max_cost - min_cost)
raw_score = (Ws * success_rate) + (Wl * inv_latency) + (Wc * inv_cost)
sample_factor = min(1.0, calls / 20)
score = raw_score * (0.7 + 0.3 * sample_factor)
The sample factor gives a small boost to models with more data: a model with 20+ calls in the window gets the full score, while models with fewer calls are penalized by up to 30% (roughly 25% at the 3-call minimum). Preferred models get an additional +0.04 * sample_factor bonus.
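The scoring formula translated into a small Python function; the stats field names are assumed for illustration, while the weights and constants come from the tables above:
def score_model(stats, weights, min_lat, max_lat, min_cost, max_cost, preferred=False):
    # stats: {"success_rate": float, "avg_latency": float, "avg_cost": float, "calls": int}
    ws, wl, wc = weights                              # e.g. (0.50, 0.30, 0.20) for Balanced
    inv_latency = (max_lat - stats["avg_latency"]) / ((max_lat - min_lat) or 1e-9)
    inv_cost = (max_cost - stats["avg_cost"]) / ((max_cost - min_cost) or 1e-9)
    raw = ws * stats["success_rate"] + wl * inv_latency + wc * inv_cost
    sample_factor = min(1.0, stats["calls"] / 20)
    score = raw * (0.7 + 0.3 * sample_factor)
    if preferred:
        score += 0.04 * sample_factor
    return score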
Confidence score
confidence = min(1.0, total_calls / 60)
At 60+ total calls across all models, confidence reaches 1.0 (full certainty). Below that, the recommendation carries a lower confidence signal.
Guardrails
After scoring, models are filtered through policy guardrails:
- Max latency: Reject models above threshold (e.g., 5000ms)
- Max cost: Reject models above per-request cost (e.g., $0.05)
- Min success rate: Reject models below reliability threshold (e.g., 0.95)
The top model that passes all guardrails becomes the recommended primary. The next N models become the fallback chain (configurable, 0–6 fallbacks).
Credit settlement algorithm
LLMWise uses a three-phase credit system: reserve, execute, settle.
Settlement formula
Reserved credits are debited at request start. After execution, LLMWise reconciles that reserve against actual token usage.
- If usage is lower than reserved credits, unused credits are refunded.
- If usage is higher, only the difference is charged.
- BYOK requests are billed directly by the provider and remain at 0 credits.
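The settlement rule in the bullets above reduces to a small function; the BYOK handling is collapsed into a flag and the names are illustrative:
def settle(reserved_credits, actual_credits, byok=False):
    # Returns the adjustment after a request: negative = refund, positive = extra charge.
    if byok:
        return 0.0                                   # BYOK stays at 0 credits
    if actual_credits < reserved_credits:
        return -(reserved_credits - actual_credits)  # refund unused credits
    return actual_credits - reserved_credits         # charge only the difference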