Engineering notes · April 16, 2026

Three models, one answer.

How I rebuilt MIT and Google Brain’s multi-agent debate paper as a chat product, end to end on Convex.

By Layken Varholdt · ~12 min read

The single-model ceiling

Every consumer chat interface is one model talking to you. You ask, it answers, you decide whether to trust it. There’s no second voice in the room to push back, and the model itself has no real incentive to disagree with its own first attempt.

In 2023, a group at MIT and Google Brain published a paper called Improving Factuality and Reasoning in Language Models through Multiagent Debate. The thesis is small and obvious in hindsight: if you have multiple model instances, you can make them critique each other, and the answer that survives a round of debate is meaningfully better than the answer any single instance produced alone.

I wanted to know what happens when you take that protocol off the lab benchmark and put it inside a chat app a real person can use. So I built Mesh Mind.

The research in one minute

Du et al. set up N agents (model instances), give them all the same question, and run the protocol in rounds. Round one: every agent answers independently. Round two onward: each agent sees its peers’ answers and is asked to critique and revise. After a few rounds, the answers converge.
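The round structure is simple enough to sketch in a few lines of TypeScript. This is my own minimal rendering of the protocol, not code from the paper or from Mesh Mind; `ask` stands in for whatever LLM call you have.

```typescript
// Minimal sketch of the Du et al. debate loop. `ask` is injected so the
// protocol stays model-agnostic; any LLM client fits behind it.
type Ask = (agent: string, prompt: string) => Promise<string>;

async function debate(
  agents: string[],
  question: string,
  rounds: number,
  ask: Ask
): Promise<string[]> {
  // Round 1: every agent answers independently, in parallel.
  let answers = await Promise.all(agents.map((a) => ask(a, question)));

  // Rounds 2..N: each agent sees its peers' answers and revises.
  for (let r = 1; r < rounds; r++) {
    answers = await Promise.all(
      agents.map((a, i) => {
        const peers = answers.filter((_, j) => j !== i).join("\n---\n");
        return ask(
          a,
          `${question}\n\nPeer answers:\n${peers}\n\nCritique and revise your answer.`
        );
      })
    );
  }
  return answers;
}
```

With `rounds = 2` and three agents this makes six model calls total, which is exactly the cost profile the rest of this post wrestles with.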

Their reference setup was 3 agents over 2 rounds, the diminishing-returns frontier within their compute budget. Accuracy keeps climbing past that, but you pay for it. They tested on arithmetic, GSM8K grade-school math, chess move optimization, and factual validity benchmarks, and saw consistent improvements over single-agent baselines. They also showed it works across model boundaries: ChatGPT and Bard could debate productively.

Two things made the protocol attractive to me as a product: it is simple enough to run inside a single chat request, and it works across model boundaries, which means the debaters don't all have to come from one provider.

What Mesh Mind is

[Screenshot: Mesh Mind chat UI showing three model cards mid-debate]
The model picker on the left, three per-model status cards across the top, and the synthesized answer plus the structured summary table below.

Mesh Mind is a Next.js + Convex chat app where you pick a master model and up to two secondary models from a roster spanning GPT-5, Claude Opus 4.5, Gemini 3 Pro, Grok-4, plus a few open-source entries via Groq. When you submit a message, the app runs initial answers in parallel, then feeds each model its peers’ answers and asks for a revised response, then has the master model synthesize the refined answers into one final reply.

A separate, cheaper model writes a structured JSON summary of who agreed, who disagreed, and who changed their mind. That renders below the answer as a table you can actually inspect.

The whole thing streams. Each model has its own status card that advances through the run's stages, from initial answer through debate to complete, with output line-chunked under a 500ms throttle. Cost is tracked per request, including reasoning tokens, against a weekly budget.

Mapping the paper to the product

Here’s how each piece of the paper landed in the codebase:

| Paper concept | Mesh Mind implementation |
| --- | --- |
| N agents | 1 master + up to 2 secondaries (N = 2 or 3) |
| Identical agents | Heterogeneous: pick from OpenAI, Anthropic, Google, xAI, or Groq-hosted OSS |
| Round 1: initial answers | generateModelResponse runs in parallel per model, each on its own sub-thread |
| Round 2: see peers and revise | generateDebateResponse builds a peer-aware prompt and asks for a revised final answer |
| Convergence / final answer | generateSynthesisResponse runs on the master thread and merges the refined answers |
| Evaluation / analysis | generateRunSummary uses a cheap summary agent + Zod schema to render a structured table |

Where I extended the paper

Cross-lab debate. The paper debated identical models, or pairs like ChatGPT and Bard. Mesh Mind lets you put GPT-5, Claude Opus 4.5, and Gemini 3 Pro in the same conversation. The diversity of training data is itself a signal. When three models trained on different corpora converge on the same answer, that means something.

An explicit synthesis step. The paper lets answers converge through repeated rounds. Mesh Mind adds a final synthesis pass on the master model so the user gets one clean answer instead of three competing ones to interpret.

Structured summary for auditability. A separate summary agent emits a Zod-validated JSON object so the UI can show agreements, disagreements, and changed positions as a table. It turns “the models debated” from a vibe into something you can actually inspect.
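The shape of that summary object, as described later in the walkthrough, is roughly the following. In the app it's enforced with a Zod schema; this sketch uses plain TypeScript types and a hand-rolled guard so it stands alone, and any field beyond the four named in this post (originalPrompt, overview, crossModel, perModel) is illustrative.

```typescript
// Approximate shape of the debate summary. The real app validates this
// with Zod before persisting; here a plain type guard plays that role.
interface RunSummary {
  originalPrompt: string;
  overview: string;
  crossModel: {
    agreements: string[];
    disagreements: string[];
    convergenceSummary: string;
  };
  perModel: { model: string; changedPosition: boolean; finalStance: string }[];
}

function isRunSummary(x: unknown): x is RunSummary {
  const s = x as RunSummary;
  return (
    typeof s === "object" && s !== null &&
    typeof s.originalPrompt === "string" &&
    typeof s.overview === "string" &&
    typeof s.crossModel === "object" && s.crossModel !== null &&
    Array.isArray(s.crossModel.agreements) &&
    Array.isArray(s.crossModel.disagreements) &&
    typeof s.crossModel.convergenceSummary === "string" &&
    Array.isArray(s.perModel)
  );
}
```

The value of validating at this boundary is that the table component can render the object blindly: anything that reaches the UI has already passed the schema.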

Real-time streaming. Every stage streams via Convex Agent stream deltas. Per-model status cards update live. The user is never staring at a spinner wondering what’s happening.

Cost is first-class. A usageHandler records tokens and estimates USD against a weekly budget on every request. Debate is roughly three times the cost of single-model inference. If you don’t make that visible, you’ll find out the hard way.

Where I stayed close

The two-round protocol is the same. Du et al. landed on 2 rounds for compute reasons and noted that gains continue past that. The same tradeoff fits chat latency, so I left it alone.

The debate prompt wording quotes the paper’s framing almost verbatim. Models are asked to critically re-evaluate their initial answer in light of the peer responses, defend it if they hold their position, and explain why if they change their mind.

The workflow, step by step

All orchestration lives in convex/workflows.ts, defined with @convex-dev/workflow’s WorkflowManager. The entry point from the UI is startMultiModelGeneration in convex/chat.ts.

  1. Kickoff

    The client creates a thread, navigates immediately to /chat/[threadId] (optimistic), and calls the entry point with the thread ID, prompt, master model, secondaries, and any file IDs. The server checks auth, rate limits, weekly budget, and whether the master model can handle the attached files.

  2. Sub-threads per model

    Each model gets its own ephemeral Convex Agent thread. This is one of the design decisions I’m happiest with. Keeping each model’s conversation isolated means the master thread stays a clean user-facing conversation, and you can deep-link into “what did Claude say on round 1?” from the per-model cards.

  3. Record the run

    A multiModelRuns doc is inserted, indexed by masterMessageId and masterThreadId. Every model’s state starts at status: "initial".

  4. Round 1 — initial responses in parallel

    generateModelResponse fires once per model (master included), all in Promise.all. Each one streams via thread.streamText, line-chunked with a 500ms throttle, then awaits the stream to flush. Status flips to debate.

  5. Round 2 — the debate

    generateDebateResponse fires once per model, again in parallel. Each model receives a prompt with the other models’ initial answers, never its own. The prompt is roughly:

    Here are the solutions to the problem from other agents.
    Your task is to critically re-evaluate your own initial answer
    in light of these other perspectives.
    
    Response from {OtherModel1}: ...
    Response from {OtherModel2}: ...
    
    Using the reasoning from these other agents as additional advice,
    provide an updated and improved final response to the original
    question. If the other agents' reasoning has convinced you to
    change your mind, explain why. If you maintain your original
    position, justify it against the alternatives.

    The debate prompt is saved as a real user message on each model’s sub-thread, so the whole exchange is replayable and auditable. Status moves to debate complete.
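The peer-filtering is the important detail: each model receives everyone else's round-1 answer but never its own. A hypothetical helper (the names are mine, not the actual codebase's) looks like this:

```typescript
interface InitialAnswer {
  model: string;
  answer: string;
}

// Builds the round-2 debate prompt for one model, excluding its own
// round-1 answer. The wording follows the paper's framing.
function buildDebatePrompt(self: string, answers: InitialAnswer[]): string {
  const peers = answers
    .filter((a) => a.model !== self)
    .map((a) => `Response from ${a.model}: ${a.answer}`)
    .join("\n\n");
  return [
    "Here are the solutions to the problem from other agents.",
    "Your task is to critically re-evaluate your own initial answer",
    "in light of these other perspectives.",
    "",
    peers,
    "",
    "Using the reasoning from these other agents as additional advice,",
    "provide an updated and improved final response to the original",
    "question.",
  ].join("\n");
}
```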

  6. Finalization in parallel

    Two things kick off as soon as the refined answers are in. The first is generateSynthesisResponse, running on the master thread with the master model. It builds a synthesis prompt that includes all three refined answers, tagged with model name and which one was Primary, and asks the master to produce one definitive reply.

    The second is generateRunSummary, running on a throwaway thread with a cheap OSS summary model. It calls thread.generateObject with a Zod schema enforcing the shape: originalPrompt, overview, crossModel (agreements, disagreements, convergence summary), and a perModel[] array. The JSON gets persisted and the UI renders it as the summary table. The ephemeral thread is deleted in a finally block.
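The tagging in the synthesis prompt can be sketched the same way. This is an illustration of the idea, not the codebase's actual prompt builder: each refined answer is labeled with its model, and the master's own answer is marked Primary.

```typescript
interface RefinedAnswer {
  model: string;
  answer: string;
  isPrimary: boolean; // true for the master model's own refined answer
}

// Sketch of the synthesis prompt: all refined answers are included,
// tagged with their model and whether they came from the master.
function buildSynthesisPrompt(question: string, refined: RefinedAnswer[]): string {
  const tagged = refined
    .map((r) => `[${r.model}${r.isPrimary ? " (Primary)" : ""}]\n${r.answer}`)
    .join("\n\n");
  return (
    `Original question: ${question}\n\n` +
    `Refined answers after debate:\n\n${tagged}\n\n` +
    `Synthesize these into one definitive reply.`
  );
}
```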

  7. Activity tracking

    A threadActivities table tracks activeCount and isGenerating per thread/user. It’s incremented on kickoff and decremented in the synthesis finally, which means the global “currently generating” indicators in the sidebar don’t get stuck if something throws halfway through.
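The increment-then-decrement-in-finally pattern is worth spelling out, because it's what keeps the indicator honest. Here's the shape of it with an in-memory map standing in for the threadActivities table:

```typescript
// In-memory stand-in for the threadActivities counter. The pattern:
// increment on kickoff, decrement in a finally, so a thrown error can
// never leave the "currently generating" indicator stuck on.
const activity = new Map<string, number>();

async function withActivity<T>(
  threadId: string,
  work: () => Promise<T>
): Promise<T> {
  activity.set(threadId, (activity.get(threadId) ?? 0) + 1);
  try {
    return await work();
  } finally {
    activity.set(threadId, (activity.get(threadId) ?? 1) - 1);
  }
}
```

The same guarantee holds whether `work` resolves, rejects, or throws synchronously, because `finally` runs in every case.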

Why Convex

This stack works because of three things Convex gives you for free.

@convex-dev/workflow makes the multi-step orchestration a first-class primitive. You define steps, fan out with Promise.all, and the runtime handles step boundaries and retries. I don’t have to write any of the bookkeeping that usually fills 60% of a workflow file.

@convex-dev/agent handles streaming. Stream deltas are line-chunked and throttled. The client subscribes to the thread and the deltas appear. I never had to build a streaming protocol or a websocket layer.

generateObject plus Zod gives me structured outputs the frontend can trust. The summary table can’t render junk because the backend won’t accept junk: the schema validates before the function returns.

The hard parts

The architecture wasn’t the hard part. The seams were.

Making three streams feel like one UI took real care. Each model card has its own status, but they all need to render in sync, animate cleanly, and not lock the layout when one finishes early. The fix was to render the cards from a single subscription on the run document and let the per-model status drive each card independently, with no shared state at the parent level.

Sub-thread auth was tricky. Each model’s ephemeral thread is owned by the same user, but it’s not directly visible in their sidebar. Auth checks needed to walk from the sub-thread back up to the master run, and from the master run back up to the user. I ended up centralizing that in a single authorizeThreadAccess helper that every action calls before doing anything.

Cost legibility is harder than you’d expect. Token counts come back in different shapes from different providers. Some return reasoningTokens separately, some don’t return them at all. The pricing table is hardcoded and refreshed when models change. The week-rolling budget is enforced both at request time (cheap check) and after the fact (real spend reconciliation against usageEvents).
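A normalizer is the piece that tames this. The sketch below is mine, not the app's usageHandler, and the per-million-token prices in the test are placeholders, not real rates; the shape it illustrates is folding optional reasoning tokens into output before pricing.

```typescript
// Providers report usage in different shapes; some break out reasoning
// tokens, some omit them entirely. Normalize before pricing.
interface RawUsage {
  inputTokens: number;
  outputTokens: number;
  reasoningTokens?: number; // absent for providers that don't report it
}

interface Pricing {
  inputPerM: number;  // USD per 1M input tokens (placeholder values)
  outputPerM: number; // USD per 1M output tokens; reasoning billed as output
}

function estimateUsd(usage: RawUsage, price: Pricing): number {
  const output = usage.outputTokens + (usage.reasoningTokens ?? 0);
  return (
    (usage.inputTokens / 1e6) * price.inputPerM +
    (output / 1e6) * price.outputPerM
  );
}
```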

Errors are first-class for a good reason: when one model fails mid-debate, the user needs to see which model failed and what it said about why. Every generation action has a try/catch that flips status to error and captures errorMessage, which the UI renders inline on the per-model card. Nothing is hidden in a server log somewhere.
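The pattern generalizes to a small wrapper. This is a sketch of the idea, not the app's actual action code: failures become visible state on the run document instead of disappearing into server logs.

```typescript
// A model's terminal state: either it finished, or it failed with a
// message the per-model card can render inline.
type ModelStatus =
  | { kind: "complete" }
  | { kind: "error"; errorMessage: string };

async function runWithStatus(work: () => Promise<void>): Promise<ModelStatus> {
  try {
    await work();
    return { kind: "complete" };
  } catch (e) {
    return {
      kind: "error",
      errorMessage: e instanceof Error ? e.message : String(e),
    };
  }
}
```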

See it run

A short walkthrough of the app, including the per-model cards updating live and the structured summary table populating after the synthesis step.

What’s next

A few things I want to try.

More rounds. The paper noted accuracy keeps climbing past 2 rounds. The latency cost is real, but for non-chat use cases (research questions, document synthesis) it might be worth it.

Per-stage metrics in the summary. Right now the structured summary captures what changed; I want it to also capture how much: semantic distance between initial and refined answers per model, time-to-converge, that kind of thing.

User-pickable debate prompt styles. The current prompt is the paper’s framing. I’m curious whether something more adversarial (“steelman the strongest counterargument to your initial answer”) would change the convergence behavior.

Credit

Massive credit to the authors of the original paper: Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving Factuality and Reasoning in Language Models through Multiagent Debate is the entire intellectual frame for Mesh Mind. The landing page is at composable-models.github.io/llm_debate and worth your time.

Mesh Mind is live at meshmind.chat. Source is at github.com/LaykenV/master-prompt.