Orchestrate 3 AI Tools: ChatGPT Drafts, Claude Reviews, GPT-Image Adds Visuals

Last Tuesday a 1,400-word post went from blank doc to published draft in 18 minutes. The bill was $0.41. Three different AI (artificial intelligence) models did the work — not because I wanted to be clever, but because every time I tried to do the whole job in one of them, the output got worse in a predictable way.

The job is a content production pipeline: write a B2B (business-to-business) marketing post, tighten the logic, generate two inline visuals. I used to run the whole thing in ChatGPT. Then I ran the whole thing in Claude. Then I stopped arguing with myself and split the work. The 3-model relay is what stuck.

This is the build — the three hand-off prompts, the JSON (JavaScript Object Notation, a structured data format machines can parse) contract that travels between stages, the actual cost and timing, and the one failure mode that cost me a published post before I learned to defend against it.

Why single-model loses

The case for the relay is not "AI agents are cool." It is that each frontier model is genuinely good at one thing and mediocre at the other two, and asking one model to do all three is a quality tax on every step.

ChatGPT drafts fast and on-brand. It is the fastest first-draft machine I have used. But ask it to review a draft and the output turns sycophantic ("Great point! Maybe expand on this in a follow-up..."). It does not know how to cut. It defaults to "add more." That is the wrong instinct for a reviewer.

Claude is the opposite. Hand it a draft and ask for a review and it will find every vague claim, every redundant sentence, every "in today's fast-paced world" filler line. It is genuinely good at tightening. But ask it to write a 1,200-word first draft and you get a post that reads like a 2018 SaaS (Software as a Service) blog — measured, hedged, and somehow flavorless. It thinks too much before it writes.

GPT-Image is in a third category. Neither text model is good at generating consistent inline visuals at speed. ChatGPT can do it but the style drifts every call. Claude cannot do it at all — it has no image generation. GPT-Image is the only one of the three where image generation is the actual product, not a side feature.

So the relay is not orchestration for its own sake. It is three specialists, each doing the one thing they are actually built for.

Stage 1 — ChatGPT drafts

The first prompt is the longest and the most prescriptive. ChatGPT is the only model of the three that will write a usable first draft from a short brief, so the prompt is doing the work the other two will not do.

Role: B2B marketing writer with a 15-year practitioner's voice.
Topic: {TOPIC}
Audience: {AUDIENCE}
Length: ~1200 words.
Constraints:
- No "in today's world" openers. Start with a concrete result, a number, or a scene.
- No more than 2 sentences per paragraph on average.
- No bullet lists longer than 5 items. Prefer prose.
- Common acronyms (LLM, API, JSON, GPT) are fine; explain anything else on first use.
- Do NOT write a conclusion paragraph that summarizes the post. End on the strongest point.
Output: a JSON object only. No prose around it.
{
  "title": "...",
  "slug": "kebab-case-here",
  "h2_outline": ["...", "...", "..."],
  "body_markdown": "...",
  "suggested_visuals": [
    {"position": "after_h2_2", "concept": "concrete scene, no text overlay"},
    {"position": "after_h2_4", "concept": "..."}
  ]
}

That last block — the JSON shape — is the contract. The model is told to return only the JSON, no prose wrapper, so the orchestrator parses it without a cleanup pass. The suggested_visuals array is what makes Stage 3 possible. Without it, GPT-Image would be guessing where visuals belong.
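A minimal sketch of the "parse without a cleanup pass" step, assuming the orchestrator is written in Python and receives the Stage 1 reply as a plain string. The function name `parse_draft_reply` and the error messages are hypothetical; the required keys are the ones in the prompt above.

```python
import json

# Keys taken from the Stage 1 output shape in the prompt above.
REQUIRED_DRAFT_KEYS = {"title", "slug", "h2_outline", "body_markdown", "suggested_visuals"}


def parse_draft_reply(reply_text: str) -> dict:
    """Parse a Stage 1 reply that is supposed to be a bare JSON object."""
    try:
        draft = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        # A prose wrapper such as "Here is your draft: {...}" ends up here.
        raise ValueError(f"Stage 1 reply is not bare JSON: {exc}") from exc

    if not isinstance(draft, dict):
        raise ValueError("Stage 1 reply must be a JSON object")

    missing = REQUIRED_DRAFT_KEYS - set(draft)
    if missing:
        raise ValueError(f"Stage 1 reply is missing keys: {sorted(missing)}")

    # Stage 3 depends on every visual carrying both a position and a concept.
    for visual in draft["suggested_visuals"]:
        if not isinstance(visual, dict) or not {"position", "concept"} <= set(visual):
            raise ValueError(f"Malformed visual entry: {visual!r}")

    return draft
```

Rejecting a wrapped reply outright, rather than trying to strip the prose, keeps the failure visible at the stage that caused it.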

This stage runs in ~4 minutes for a 1,200-word post. Cost: about $0.12 (GPT-5.5 at $5/$30 per million tokens).

Stage 2 — Claude reviews

The second prompt is short on purpose. Claude's job is to cut, not to write. The prompt has to defend against the model drifting into "helpful editor" mode and rewriting whole sections.

You are reviewing a draft. The next stage will add visuals. A human will publish.
Your ONLY job: tighten logic and cut fluff.
You are NOT allowed to: rewrite paragraphs, expand points, add new ideas, change the angle,
generate visuals, or write a new title.
For each issue, return:
{"location": "h2_2_para_3", "issue": "vague claim, no number", "fix": "cut or replace with concrete number"}
Then return the revised body in:
{"final_markdown": "...", "issues_found": N, "cuts_made_chars": N}

The "you are NOT allowed to" list is the load-bearing part. Without it, Claude defaults to improving the post in whatever direction it thinks is best, which usually means rewriting the intro and adding a paragraph in the middle. That is editor behavior, not reviewer behavior. The negative list pins it to reviewer behavior.
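One way to check that the reviewer stayed in cut-only mode is to measure the output instead of trusting it. The sketch below assumes `cuts_made_chars` means the net drop in character count between the draft and `final_markdown`; that reading, the function name, and the 50-character tolerance are my assumptions, not part of the prompt.

```python
def check_review_only_cut(original: str, review: dict, tolerance: int = 50) -> list[str]:
    """Return a list of problems; an empty list means the review looks like a pure cut."""
    problems = []
    revised = review["final_markdown"]

    # Positive when the reviewer removed text, negative when it added text.
    actual_cut = len(original) - len(revised)
    if actual_cut < 0:
        problems.append(f"revised body grew by {-actual_cut} chars; reviewer may have expanded")

    # Assumption: cuts_made_chars is the net character reduction.
    reported = review["cuts_made_chars"]
    if abs(reported - actual_cut) > tolerance:
        problems.append(f"reported cuts ({reported}) differ from measured cuts ({actual_cut})")

    return problems
```

A non-empty result is a reason to look at the review by hand, not proof that the reviewer rewrote anything.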

This stage takes ~6 minutes and costs roughly $0.08 (Claude Opus 4.7 at $5/$25 per million tokens).

Stage 3 — GPT-Image adds visuals

The image stage is the simplest. The brief comes from Stage 1's suggested_visuals array, with the position field preserved so the orchestrator knows where to drop each image.

Style guide: {style_description}. Consistent across all images in this post.
Generate {N} 16:9 inline visuals at quality "medium".
Concepts:
1. (after_h2_2) {concept_1}
2. (after_h2_4) {concept_2}
Output: image URLs or local paths. Match the style guide on color, line weight, and font.
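Filling that template from the Stage 1 array can be sketched as below. This assumes the orchestrator is Python and that `suggested_visuals` has already been parsed into a list of dicts; `build_image_prompt` is a hypothetical helper name.

```python
def build_image_prompt(style_description: str, suggested_visuals: list[dict]) -> str:
    """Render the Stage 3 brief, keeping each visual's position next to its concept."""
    lines = [
        f"Style guide: {style_description}. Consistent across all images in this post.",
        f'Generate {len(suggested_visuals)} 16:9 inline visuals at quality "medium".',
        "Concepts:",
    ]
    # The position tag travels with the concept so the orchestrator can place the image later.
    for number, visual in enumerate(suggested_visuals, start=1):
        lines.append(f"{number}. ({visual['position']}) {visual['concept']}")
    lines.append(
        "Output: image URLs or local paths. "
        "Match the style guide on color, line weight, and font."
    )
    return "\n".join(lines)
```

Because the count and the numbered list both come from the same array, the prompt cannot ask for two images while describing three.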

GPT-Image 2 charges $30 per million image output tokens, about $0.07 per medium-quality 1024×1024 image. Two inline visuals: ~$0.14. Total stage cost: ~$0.21.

The full relay, end to end: ~$0.41 per post, ~18 minutes including human review.
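The per-post total is just the three stage figures added up. A quick check in Python, using only the numbers quoted above (the variable names are mine; the Stage 3 figure is taken as stated, with the two images accounting for $0.14 of it):

```python
# Per-stage costs as quoted in this post, in US dollars.
STAGE_COSTS = {"draft": 0.12, "review": 0.08, "visuals": 0.21}

IMAGE_PRICE = 0.07  # per medium-quality image, as quoted above
IMAGE_COUNT = 2

# Round to cents to avoid floating-point noise in the displayed figures.
images_only = round(IMAGE_PRICE * IMAGE_COUNT, 2)
total = round(sum(STAGE_COSTS.values()), 2)

print(f"images only: ${images_only:.2f}")
print(f"full relay:  ${total:.2f}")
```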

The JSON contract

The thing that makes the relay actually work is the JSON envelope. It is small and boring, which is exactly the point.

{
  "stage": "draft",
  "post_id": "uuid-here",
  "title": "...",
  "slug": "...",
  "h2_outline": ["...", "..."],
  "body_markdown": "...",
  "suggested_visuals": [
    {"position": "after_h2_2", "concept": "..."}
  ],
  "next_stage": "review"
}

Two rules keep it from breaking. (1) Every stage appends its own fields and never overwrites an earlier one: stage 2's revised body goes into a new field (body_markdown_v2), and stage 1's body_markdown stays untouched. (2) The orchestrator validates the schema before passing the envelope on. A missing key fails loudly, not silently.

The "append, never mutate" rule is the one I learned the hard way. The first version of this pipeline let stage 2 overwrite stage 1's body_markdown in place. Stage 3 could not compare the original draft to the review, and I could not audit which cuts Claude actually made. Now stage 2 returns body_markdown_v2 and cuts_made_chars.
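Both rules fit in a few lines. This is a sketch, not the pipeline's actual code: it assumes a Python orchestrator, assumes the orchestrator stores Stage 2's revised body under body_markdown_v2, and treats `stage` and `next_stage` as routing fields that are allowed to change. The helper names and the "visuals" stage label are hypothetical.

```python
import copy

# Keys from the envelope shown above.
REQUIRED_KEYS = {
    "stage", "post_id", "title", "slug", "h2_outline",
    "body_markdown", "suggested_visuals", "next_stage",
}


def validate_envelope(envelope: dict) -> None:
    """Fail loudly if any contract key is missing."""
    missing = REQUIRED_KEYS - set(envelope)
    if missing:
        raise KeyError(f"envelope missing keys: {sorted(missing)}")


def append_stage_output(envelope: dict, new_fields: dict, stage: str, next_stage: str) -> dict:
    """Return a new envelope with new_fields added; refuse to overwrite existing content."""
    validate_envelope(envelope)
    clash = set(new_fields) & set(envelope)
    if clash:
        raise ValueError(f"stage {stage!r} tried to overwrite: {sorted(clash)}")

    updated = copy.deepcopy(envelope)  # leave the caller's envelope untouched
    updated.update(new_fields)
    # Assumption: only the routing fields may change between stages.
    updated["stage"] = stage
    updated["next_stage"] = next_stage
    return updated
```

With this in place, a stage 2 result is added as `append_stage_output(envelope, {"body_markdown_v2": revised, "cuts_made_chars": n}, stage="review", next_stage="visuals")`, and the original draft stays available for the audit.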

18 minutes vs 35 — the timing comparison

The same post drafted, reviewed, and illustrated in a single model (ChatGPT doing all three jobs):

| Workflow | Time | Cost | Issue |
| --- | --- | --- | --- |
| Single-model (ChatGPT does all three) | ~35 min | ~$0.28 | Reviewer mode was sycophantic; visuals drifted in style between images |
| 3-model relay | ~18 min | ~$0.41 | Operationally more steps; no other quality issues |

Single-model is cheaper per post but slower: each step needs a different system prompt, and I had to write the review and image prompts by hand into the ChatGPT conversation. The relay is faster because each stage is a focused call to the model best suited for it.

The 17-minute time saving is real but it is not the main win. The main win is that the reviewer's actual cuts ship, the visuals look like they belong in the same post, and I do not have to fight the model to behave like three specialists in one session.

The failure mode — model identity confusion

This is the one that cost me a published post. The mistake was in the system prompts.

The first version of stage 2's prompt looked like this:

You are part of a 3-stage content production team. Stage 1 writes the draft
(which you are reviewing). Stage 3 will add visuals (which you should not do).
Your job is to review the draft...

It sounded reasonable. It was wrong. Claude received the prompt, understood the whole team, and then — in roughly 1 in 8 reviews — started to anticipate stage 3. It would write comments like "Note for the image stage: this paragraph would work well with a chart of..." or preemptively suggest a visual concept inline. The review itself was still tight. But the output was no longer a clean review. It was a review plus a partial stage 3.

The downstream cost was not theoretical. One of those "helpful" reviews included a Claude-generated visual concept that contradicted the actual concept GPT-Image eventually produced. The post shipped with an inline caption that described a different image from the one rendered next to it. I caught it 20 minutes after publishing. The fix was deleting the post, republishing without that paragraph, and changing the prompt.

The fix: tell each model about its own role, not the team's. Stage 2's prompt was rewritten to start with "You are a reviewer. You review drafts. You do not generate visuals. You do not write content. You do not see stage 3." Stage 3's prompt was rewritten to start with "You are an image generator. You generate images. You do not review text. You do not see stage 1 or stage 2." Stage 1's prompt was tightened to "You are a writer. You write drafts. You do not review. You do not generate images."

The principle: each model is told what it is, not what the team is. Once I stopped describing the relay from the inside and started describing each model from its own point of view, the cross-stage contamination stopped. It has not happened again in 47 posts.
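The prompt fix can be backed by a cheap guard that scans the review for stage 3 language before the envelope moves on. The patterns below are hypothetical, drawn only from the two symptoms described above, and the function name is mine; treat it as a tripwire, not a classifier.

```python
import re

# Hypothetical patterns, based on the symptoms described in this section.
CONTAMINATION_PATTERNS = [
    re.compile(r"note for the image stage", re.IGNORECASE),
    re.compile(r"\b(stage 3|image stage|visual concept)\b", re.IGNORECASE),
]


def find_cross_stage_leaks(review_markdown: str) -> list[str]:
    """Return the lines of a review that look like they were written for another stage."""
    hits = []
    for line in review_markdown.splitlines():
        if any(pattern.search(line) for pattern in CONTAMINATION_PATTERNS):
            hits.append(line.strip())
    return hits
```

A post that is genuinely about image pipelines will trip this, so a hit should hold the post for human review instead of rejecting it.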

When this is the wrong tool

If your post is under 600 words or has no inline visuals, the relay is overhead — a single ChatGPT call handles it in under 10 minutes for ~$0.10. The relay only earns its keep on 1,000+ word posts with at least one inline visual.

And if you cannot define the JSON contract, do not build this. The contract is the load-bearing piece. Without it, you have three models talking past each other.

The 3-model relay is not a workflow tax. It is a division of labor that pays you back in output quality. The hardest part is not orchestrating three APIs. It is writing each prompt as if the model knows nothing about the other two.