
Blend Strategies & Orchestration Algorithms

Deep dive into blend synthesis strategies (Consensus, Council, MoA, Self-MoA), mesh failover, auto-routing heuristics, and optimization scoring.

15 min · Updated 2026-02-15

14 deep-dive sections · 8 code samples
Quick Start
  1. Start from your current production prompt/request.
  2. Run the exact tutorial flow step-by-step once.
  3. Measure impact in Usage before rollout.
  4. Promote only when quality/cost/reliability metrics match target.

LLMWise orchestrates multiple models through several algorithmic layers. This guide explains every strategy and algorithm in depth, with special focus on Blend mode — the most configurable.

Blend mode overview

Blend sends your prompt to multiple models simultaneously, then feeds all responses into a synthesizer model that produces one final answer. The synthesis behavior changes depending on which strategy you choose.

| Strategy | Models | Layers | Best for |
| --- | --- | --- | --- |
| consensus | 2–6 | 1 | Combining strongest points, resolving contradictions |
| council | 2–6 | 1 | Structured debate: agreements, disagreements, follow-ups |
| best_of | 2–6 | 1 | Quick pick: enhance the best response with additions |
| chain | 2–6 | 1 | Iterative integration across all responses |
| moa | 2–6 | 1–3 | Multi-layer Mixture-of-Agents refinement |
| self_moa | 1 | 1 | Single model, multiple diverse candidates (2–8 samples) |

All strategies follow the same two-phase execution:

Blend execution phases

  1. Phase 1: Gather. All source models answer the prompt concurrently.
  2. Phase 2: Synthesize. The synthesizer model combines all successful responses using the selected strategy.
  3. Done. Return the synthesized answer, settle credits, and log the request.
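The two-phase flow can be sketched with asyncio. This is an illustrative outline only; `ask_model` is a hypothetical stand-in for a real provider call, not part of the LLMWise API.

```python
import asyncio

async def ask_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real provider call.
    await asyncio.sleep(0)  # placeholder for network latency
    return f"{model} answer to: {prompt}"

async def blend(models: list[str], synthesizer: str, prompt: str) -> str:
    # Phase 1: Gather -- all source models answer concurrently.
    results = await asyncio.gather(
        *(ask_model(m, prompt) for m in models), return_exceptions=True
    )
    answers = [r for r in results if isinstance(r, str)]  # keep successes only
    # Phase 2: Synthesize -- one model combines the successful answers.
    combined = "\n---\n".join(answers)
    return await ask_model(synthesizer, f"Synthesize:\n{combined}")

final = asyncio.run(blend(["model-a", "model-b"], "synth", "hi"))
```

Using `return_exceptions=True` mirrors the documented behavior that only successful responses reach the synthesizer.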

Strategy: Consensus

The default strategy. The synthesizer receives all source responses and is instructed to combine the strongest points while resolving any contradictions.

  • Single-pass synthesis — no refinement layers
  • Synthesizer decides which parts of each response to keep
  • Contradictions are resolved by weighing the majority view
{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "consensus",
  "messages": [{"role": "user", "content": "Explain quantum entanglement"}]
}

Strategy: Council

Structures the synthesis as a deliberation. The synthesizer produces:

  1. Final answer — the synthesized conclusion
  2. Agreement points — where all models aligned
  3. Disagreement points — where models diverged, with analysis
  4. Follow-up questions — areas that need further exploration

Best when you want transparency about model consensus vs. divergence.

Strategy: Best-Of

The synthesizer picks the single best response, then enhances it with useful additions from the others. This is the quickest synthesis approach: minimal rewriting, focused on augmentation.

Strategy: Chain

Iterative integration. The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Produces the most thorough output but may be longer.

Strategy: MoA (Mixture of Agents)

The most sophisticated strategy. Inspired by the Mixture-of-Agents paper, MoA adds refinement layers where models can see and improve upon previous answers.

MoA multi-layer refinement

  1. Layer 0: all models answer the original question independently.
  2. Layer 1: models receive Layer 0 answers as references and produce refined responses.
  3. Layer 2 (optional): further refinement using Layer 1 answers.
  4. Synthesize: final synthesis of all latest-layer responses.

How MoA layers work

  1. Layer 0: Each model answers the prompt independently (same as other strategies).
  2. Layer 1+: Each model receives the previous layer's answers as reference material, injected via system message. Models are instructed to improve upon, correct, and expand the references.
  3. Final synthesis: The synthesizer combines all responses from the last completed layer.

Reference injection

Previous-layer answers are injected into each model's context:

  • Total reference budget: 12,000 characters across all references
  • Per-answer cap: 3,200 characters (truncated if longer)
  • Injection method: System message + follow-up user message containing formatted references
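The two caps above can be sketched as a simple budgeted truncation. This is illustrative only; the exact reference formatting LLMWise uses internally is not documented here, and `build_references` is a hypothetical name.

```python
def build_references(answers: list[str],
                     per_cap: int = 3200,
                     total_budget: int = 12000) -> str:
    """Concatenate previous-layer answers under a per-answer cap
    and a total reference budget, as described above."""
    parts, used = [], 0
    for i, ans in enumerate(answers, 1):
        snippet = ans[:per_cap]                 # per-answer cap: 3,200 chars
        if used + len(snippet) > total_budget:  # total budget: 12,000 chars
            snippet = snippet[: total_budget - used]
        if not snippet:
            break
        parts.append(f"[Reference {i}]\n{snippet}")
        used += len(snippet)
    return "\n\n".join(parts)
```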

Early stopping

If a layer produces zero successful responses, MoA keeps the previous layer's successes and skips to synthesis. This prevents total failure when models hit rate limits or errors.

{
  "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "moa",
  "layers": 2,
  "messages": [{"role": "user", "content": "Design a rate limiter for a distributed system"}]
}
Layer count

Set layers from 1 to 3. Layer 0 (initial answers) always runs. So layers: 2 means 3 total rounds: initial + 2 refinement passes. More layers = better quality but higher latency and cost.

Strategy: Self-MoA

Self-MoA generates diverse candidates from a single model by varying temperature and system prompts. This is useful when you trust one model but want to hedge against its variance.

How it works

  1. You provide exactly 1 model in models[]
  2. Set samples (2–8, default 4) for how many candidates to generate
  3. Each candidate runs with a different temperature offset and agent prompt
  4. The synthesizer combines all candidates into one final answer

Temperature variation

Each candidate gets a different temperature to encourage diversity:

Base offsets: [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
Final temp = clamp(base_temp + offset, 0.2, 1.4)

For example, with temperature: 0.7 and 4 samples:

  • Candidate 1: temp 0.45 (conservative)
  • Candidate 2: temp 0.70 (baseline)
  • Candidate 3: temp 0.95 (creative)
  • Candidate 4: temp 1.15 (exploratory)
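The offset-and-clamp rule above can be expressed directly. This sketch assumes the offsets are applied in the listed order; `candidate_temps` is an illustrative name, not part of the API.

```python
# Base offsets from the documentation, applied in order per candidate.
OFFSETS = [-0.25, 0.0, 0.25, 0.45, 0.15, 0.35, -0.1, 0.3]

def candidate_temps(base: float, samples: int) -> list[float]:
    # Final temp = clamp(base + offset, 0.2, 1.4)
    return [round(min(1.4, max(0.2, base + OFFSETS[i])), 2)
            for i in range(samples)]

candidate_temps(0.7, 4)  # → [0.45, 0.7, 0.95, 1.15]
```

With a high base temperature the upper clamp kicks in, e.g. `candidate_temps(1.3, 4)` caps the last two candidates at 1.4.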

Agent prompt rotation

Six distinct system prompts rotate across candidates, each emphasizing a different quality:

Self-MoA agent perspectives

  • Correctness: focus on factual accuracy and precision
  • Structure: prioritize clear organization and logical flow
  • Edge Cases: consider corner cases and exceptions
  • Examples: emphasize concrete examples and illustrations
  • Clarity: plain-language explanations
  • Skepticism: challenge assumptions, flag weaknesses

{
  "models": ["claude-sonnet-4.5"],
  "synthesizer": "claude-sonnet-4.5",
  "strategy": "self_moa",
  "samples": 4,
  "temperature": 0.7,
  "messages": [{"role": "user", "content": "Write a Python async rate limiter"}]
}
Self-MoA constraints

Self-MoA requires exactly 1 model in models[]. Using 2+ models will return a validation error. The synthesizer can be the same model or a different one.

Blend credit cost

All blend strategies cost 4 credits regardless of strategy or model count. Credits are reserved upfront and refunded if all source models fail. After completion, the actual provider cost is settled — you may receive a partial refund if the real cost was lower than the reservation.

Compare mode algorithm

Compare runs 2–9 models concurrently and streams their responses side-by-side.

  • All models stream via an asyncio.Queue — chunks are yielded in arrival order (not round-robin)
  • Queue timeout: 120 seconds per chunk
  • After all models finish, a summary event reports the fastest model and longest response
  • Total latency = max(individual latencies) — bottleneck is the slowest model
  • Cost: 3 credits. Refunded if all models fail; partial status logged if some succeed.
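The arrival-order streaming described above can be sketched with `asyncio.Queue`. This is a simplified illustration; `stream_model` is a hypothetical stand-in for a real streaming provider call.

```python
import asyncio

async def stream_model(name: str, chunks: list[str],
                       queue: asyncio.Queue) -> None:
    # Hypothetical streamer: each model pushes chunks as they arrive.
    for c in chunks:
        await queue.put((name, c))
    await queue.put((name, None))  # sentinel: this model finished

async def compare(models: dict[str, list[str]]) -> list[tuple[str, str]]:
    queue: asyncio.Queue = asyncio.Queue()
    tasks = [asyncio.create_task(stream_model(m, cs, queue))
             for m, cs in models.items()]
    received, done = [], 0
    while done < len(models):
        # The docs specify a 120-second per-chunk queue timeout.
        name, chunk = await asyncio.wait_for(queue.get(), timeout=120)
        if chunk is None:
            done += 1
        else:
            received.append((name, chunk))  # arrival order, not round-robin
    await asyncio.gather(*tasks)
    return received

out = asyncio.run(compare({"m1": ["a", "b"], "m2": ["x"]}))
```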

Judge mode algorithm

Judge runs a three-phase competitive evaluation:

Judge three-phase flow

  1. Contest: 2–4 contestant models answer simultaneously. Needs at least 2 successes.
  2. Judging: the judge model evaluates all responses at temperature 0.3 for consistent scoring.
  3. Verdict: ranked scores (0–10), per-model reasoning, overall analysis, and the winner.

Scoring system

The judge produces structured JSON with rankings sorted by score descending:

{
  "rankings": [
    {"model": "claude-sonnet-4.5", "rank": 1, "score": 9.2, "reasoning": "Most complete and well-structured"},
    {"model": "gpt-5.2", "rank": 2, "score": 8.8, "reasoning": "Accurate but less organized"}
  ],
  "overall_analysis": "Claude response covered more edge cases..."
}

Default evaluation criteria: accuracy, completeness, clarity, helpfulness, code quality. You can override these with the criteria parameter.

Fallback scoring: If the judge returns malformed JSON, default scores are assigned: 8.0 - (i * 0.5) for each contestant in order, with a note that scores were auto-assigned. Cost: 5 credits.
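The fallback rule above can be sketched as follows; `parse_verdict` is an illustrative name, not part of the API.

```python
import json

def parse_verdict(raw: str, contestants: list[str]) -> list[dict]:
    """Parse the judge's JSON; on malformed output, assign the default
    scores 8.0 - (i * 0.5) in contestant order, as described above."""
    try:
        return json.loads(raw)["rankings"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return [{"model": m, "rank": i + 1, "score": 8.0 - i * 0.5,
                 "reasoning": "auto-assigned (judge output was malformed)"}
                for i, m in enumerate(contestants)]
```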

Mesh mode: circuit breaker failover

When you use mesh mode (chat with routing parameter), LLMWise tries models in sequence with automatic failover powered by a circuit breaker.

Circuit breaker state machine

Each model tracks health in-memory:

| State | Condition | Behavior |
| --- | --- | --- |
| Healthy | consecutive_failures < 3 | Model available for requests |
| Open | 3+ consecutive failures | Model skipped for 30 seconds |
| Half-Open | 30s elapsed since circuit opened | One probe request allowed |
| Recovered | Probe succeeds | Reset to Healthy, consecutive_failures = 0 |

Failover sequence

  1. Try primary model first
  2. If it fails (or circuit is open), try fallback 1, then fallback 2, etc.
  3. For each attempt: emit a route event (trying, failed, or skipped)
  4. First success stops the chain — no further fallbacks tried
  5. After all attempts, emit a trace event summarizing the route

Latency tracking

Model latency is tracked with exponential smoothing:

avg_latency = (avg_latency * 0.8) + (new_latency * 0.2)

This favors recent measurements, so a model that recovers from a slow period will quickly show improved latency.
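The smoothing update is a one-liner; the 0.8/0.2 weights come straight from the formula above.

```python
def update_latency(avg: float, new: float, alpha: float = 0.2) -> float:
    # Exponential smoothing: the newest sample carries 20% of the weight.
    return avg * (1 - alpha) + new * alpha

update_latency(1000.0, 500.0)  # → 900.0
```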

Auto-router: heuristic classification

When you set model: "auto", LLMWise classifies your query using zero-latency regex matching (no LLM call overhead) and routes to the best model.

| Category | Pattern examples | Routed to |
| --- | --- | --- |
| Code | function, debug, error, refactor, python, git, docker | gpt-5.2 |
| Math | equation, integral, derivative, probability, proof | claude-sonnet-4.5 |
| Creative | write poem, story, brainstorm, roleplay, screenplay | claude-sonnet-4.5 |
| Translation | translate, Spanish, French, Chinese, Japanese | gemini-3-flash |
| Quick fact | Short query (60 chars or less), no pattern match | gemini-3-flash |
| Analysis | Default fallback for everything else | gpt-5.2 |
| Vision | Any request with images attached | gpt-5.2 (override) |
Vision override

If your message contains images, auto-router always picks a vision-capable model regardless of query content.
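A rough sketch of this heuristic, assuming illustrative pattern lists (the real internal patterns are not published; `route` and `RULES` are hypothetical names):

```python
import re

# Illustrative pattern lists based on the table above.
RULES = [
    ("code", r"\b(function|debug|error|refactor|python|git|docker)\b", "gpt-5.2"),
    ("math", r"\b(equation|integral|derivative|probability|proof)\b", "claude-sonnet-4.5"),
    ("creative", r"\b(poem|story|brainstorm|roleplay|screenplay)\b", "claude-sonnet-4.5"),
    ("translation", r"\b(translate|spanish|french|chinese|japanese)\b", "gemini-3-flash"),
]

def route(query: str, has_images: bool = False) -> str:
    if has_images:
        return "gpt-5.2"              # vision override always wins
    q = query.lower()
    for _category, pattern, model in RULES:
        if re.search(pattern, q):
            return model
    if len(query) <= 60:
        return "gemini-3-flash"       # quick fact: short, no pattern match
    return "gpt-5.2"                  # analysis: default fallback
```

Because only regexes run, classification adds no LLM-call latency.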

Policy-based routing

If you have an optimization policy enabled with sufficient historical data, auto-router upgrades from regex heuristics to historical optimization — routing based on actual performance data from your past requests. See the next section.

Optimization scoring algorithm

The optimization engine analyzes your historical request logs and recommends the best model + fallback chain for each goal.

Goals and weight vectors

Each goal uses different weights for the three scoring dimensions:

| Goal | Success rate weight | Speed weight | Cost weight |
| --- | --- | --- | --- |
| Balanced | 0.50 | 0.30 | 0.20 |
| Latency (Speed) | 0.30 | 0.60 | 0.10 |
| Cost | 0.30 | 0.10 | 0.60 |
| Reliability | 0.75 | 0.20 | 0.05 |

Scoring formula

For each eligible model (minimum 3 calls in lookback window):

inv_latency = (max_latency - model_latency) / (max_latency - min_latency)
inv_cost    = (max_cost - model_cost) / (max_cost - min_cost)

raw_score = (Ws * success_rate) + (Wl * inv_latency) + (Wc * inv_cost)

sample_factor = min(1.0, calls / 20)
score = raw_score * (0.7 + 0.3 * sample_factor)

The sample factor rewards models with more data: a model with 20+ calls keeps its full raw score, while a model with only 3 calls (sample_factor = 0.15) is multiplied by 0.745, roughly a 25% penalty; the penalty approaches 30% as the sample size shrinks. Preferred models get an additional +0.04 * sample_factor bonus.

Confidence score

confidence = min(1.0, total_calls / 60)

At 60+ total calls across all models, confidence reaches 1.0 (full certainty). Below that, the recommendation carries a lower confidence signal.
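The scoring and confidence formulas above can be combined into one sketch. `score_models` is an illustrative name; the real engine's data shapes are assumptions.

```python
def score_models(stats: dict[str, dict],
                 weights: tuple[float, float, float]) -> dict:
    """stats: model -> {"success_rate", "latency", "cost", "calls"}.
    weights: (Ws, Wl, Wc) from the goal table above."""
    ws, wl, wc = weights
    # Eligibility: minimum 3 calls in the lookback window.
    eligible = {m: s for m, s in stats.items() if s["calls"] >= 3}
    lats = [s["latency"] for s in eligible.values()]
    costs = [s["cost"] for s in eligible.values()]

    def inv(value: float, lo: float, hi: float) -> float:
        # Normalize so lower latency/cost scores higher.
        return 1.0 if hi == lo else (hi - value) / (hi - lo)

    scores = {}
    for m, s in eligible.items():
        raw = (ws * s["success_rate"]
               + wl * inv(s["latency"], min(lats), max(lats))
               + wc * inv(s["cost"], min(costs), max(costs)))
        sample_factor = min(1.0, s["calls"] / 20)
        scores[m] = raw * (0.7 + 0.3 * sample_factor)
    confidence = min(1.0, sum(s["calls"] for s in eligible.values()) / 60)
    return {"scores": scores, "confidence": confidence}

result = score_models(
    {"fast": {"success_rate": 1.0, "latency": 100, "cost": 0.01, "calls": 20},
     "slow": {"success_rate": 0.9, "latency": 200, "cost": 0.02, "calls": 5}},
    weights=(0.5, 0.3, 0.2),  # balanced goal
)
```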

Guardrails

After scoring, models are filtered through policy guardrails:

  • Max latency: Reject models above threshold (e.g., 5000ms)
  • Max cost: Reject models above per-request cost (e.g., $0.05)
  • Min success rate: Reject models below reliability threshold (e.g., 0.95)

The top model that passes all guardrails becomes the recommended primary. The next N models become the fallback chain (configurable, 0–6 fallbacks).

Credit settlement algorithm

LLMWise uses a three-phase credit system:

Credit lifecycle

  1. Reserve: deduct fixed credits upfront (chat = 1, compare = 3, blend = 4, judge = 5).
  2. Execute: call the model(s) and track actual provider cost.
  3. Settle: reconcile reserved vs. actual cost; refund overpayment or charge the gap.

Settlement formula

Reserved credits are debited at request start. After execution, LLMWise reconciles that reserve against actual token usage.

  • If usage is lower than reserved credits, unused credits are refunded.
  • If usage is higher, we charge only the difference. BYOK requests keep provider-facing billing and remain on 0 credits.
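The reconciliation reduces to a single signed delta. This is a sketch under stated assumptions: `settle` is a hypothetical name, and treating BYOK as a zero-credit no-op is our reading of the note above, not a documented API.

```python
def settle(reserved: float, actual: float, byok: bool = False) -> float:
    """Credit delta after execution: positive = refund to the user,
    negative = additional charge for the gap."""
    if byok:
        return 0.0            # assumption: BYOK bills the provider directly,
                              # so no credits move
    return reserved - actual  # refund unused credits, or charge the gap
```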