I Asked ChatGPT for 100 Email Subject Lines. Only 5 Were Worth Testing.

Last summer I was prepping a flash sale email for a B2B course I run. Forty thousand subscribers, two-day window, the kind of send where a 1% open-rate swing is a meaningful number. I had a list of three subject lines I liked, and I almost shipped them.

Then I thought: "What if I just ask ChatGPT for 100?"

So I did. Thirty seconds later I had 100 subject lines. Most of them were useless. But five of them made it into the test, and one, a variant I would never have written on my own, beat my best hand-written line by 19% on opens (38.2% vs. 32.1%, a relative lift).

This is the workflow I use now. Not because 100 is a magic number, but because 100 is what forces you to build a filter. And the filter is the part that matters.

The prompt I actually used

I gave ChatGPT five things, in this order:

  • The email's job (sell a B2B course, flash sale, 48 hours)
  • The audience (mid-career marketers, 30–45, US + EU)
  • The product's price point ($497, with a $100 early-bird discount)
  • The single biggest reader pain point I knew about (they kept saying in DMs: "I don't know which AI tool to learn first")
  • Three negative constraints: no all-caps, no "FREE!!!" spam patterns, no emoji unless it was a clear win

Then I asked for 100 subject lines, split into 5 emotional angles — 20 each:

  1. Curiosity
  2. Specific benefit / number
  3. Urgency / scarcity
  4. Contrarian / counterintuitive
  5. Question / personalization

Output: a table with the subject line + the angle + the character count. That last column matters more than most people think.

Two things to flag. First, 100 is a forcing function. At 20 candidates you get a lot of "How to..." or "X tips for..." that all blur together; at 100 the model is forced into variety because it has run out of its default patterns. Second, splitting the request into 5 angles means I can score each angle independently instead of staring at 100 lines of text and picking by gut.
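For anyone who wants to script this step, here is a minimal Python sketch that assembles the prompt from the five inputs and the five angles. The `Brief` class, the `build_prompt` function, and the exact wording of the prompt are illustrative assumptions, not the text that was actually sent to ChatGPT.

```python
from dataclasses import dataclass

# The five emotional angles listed above.
ANGLES = [
    "Curiosity",
    "Specific benefit / number",
    "Urgency / scarcity",
    "Contrarian / counterintuitive",
    "Question / personalization",
]


@dataclass
class Brief:
    """The five inputs given to the model (hypothetical container)."""
    job: str
    audience: str
    price_point: str
    pain_point: str
    constraints: list[str]


def build_prompt(brief: Brief, total: int = 100) -> str:
    """Assemble one prompt asking for `total` lines split evenly by angle."""
    per_angle, remainder = divmod(total, len(ANGLES))
    if remainder:
        raise ValueError("total must divide evenly across the angles")
    angles = "\n".join(f"{i}. {name}" for i, name in enumerate(ANGLES, 1))
    constraints = "\n".join(f"- {c}" for c in brief.constraints)
    return (
        f"Email's job: {brief.job}\n"
        f"Audience: {brief.audience}\n"
        f"Price point: {brief.price_point}\n"
        f"Biggest reader pain point: {brief.pain_point}\n"
        f"Constraints:\n{constraints}\n\n"
        f"Write {total} email subject lines, {per_angle} for each angle:\n"
        f"{angles}\n\n"
        "Return a table with columns: subject line, angle, character count."
    )


brief = Brief(
    job="sell a B2B course, flash sale, 48 hours",
    audience="mid-career marketers, 30-45, US + EU",
    price_point="$497, with a $100 early-bird discount",
    pain_point="I don't know which AI tool to learn first",
    constraints=[
        "no all-caps",
        'no "FREE!!!" spam patterns',
        "no emoji unless it is a clear win",
    ],
)
prompt = build_prompt(brief)
print(prompt)
```

Keeping the brief as data makes it easy to reuse the same prompt for the next send and change only the fields that differ.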

The 5 that made the test

After a first-pass eyeball review, I cut the 100 down to 25. Then I ran them through a 4-criterion rubric (below) and got to 5. Here are the 5 I tested, with what I expected from each going in:

1. "The 48-hour AI tool stack (pick one and ship it by Friday)"

  • Angle: Specific benefit + timeframe
  • Why I picked it: It names the outcome ("pick one and ship it") and the time pressure is concrete, unlike "limited time!", which everyone ignores. The "pick one" part is the real hook. Most readers feel overwhelmed by AI tool lists, and naming a single action is the antidote.

2. "Why we're not selling a course this week"

  • Angle: Contrarian
  • Why I picked it: It subverts the expected "buy my course" framing. The reader's first reaction is "wait, what?" — and that gap is what gets the open. I was honestly nervous about this one, but it fit a broader pattern I've seen: counterintuitive subject lines that connect back to the email body outperform standalone curiosity bait by roughly 2x in my data.

3. "[Last 24 hrs] The AI tool stack that replaces 4 SaaS subscriptions"

  • Angle: Urgency + specific outcome
  • Why I picked it: Bracketed "Last 24 hrs" is a clear visual cue that survives mobile truncation. The "replaces 4 subscriptions" is a number-and-benefit, not a vague claim. This was the variant I expected to win overall.

4. "Should you learn ChatGPT, Claude, or Gemini first? (we asked 1,200 marketers)"

  • Angle: Question + social proof
  • Why I picked it: The question structure pulls the reader in. "We asked 1,200 marketers" is concrete — not "we surveyed experts." This was the underdog of the 5 for me.

5. "Hi {first_name}, your AI tool stack is on sale"

  • Angle: Personalization
  • Why I picked it: It's the most boring of the 5 on paper. I included it because personalization tokens (Klaviyo, Mailchimp, Customer.io all support them) consistently lift opens by 5–10% in my data. The hypothesis: not a winner on its own, but a useful baseline to see what the token alone does.

The 4-criterion filter I use to cut 100 → 5

Before I read the list, I write down the criteria. This is the part most people skip, and the part that matters most. Here's the rubric:

1. Skim-test. Read the subject line at the speed you'd scan an inbox. Would you open it over a competitor's email? If you have to think, it's out. 1–5.

2. Length. Under 50 characters is the sweet spot. Mobile is the majority of opens now, and anything over ~55 gets truncated on iPhone in a way that hides the hook. 1–5.

3. Specificity. Does it name a number, a time, a tool, a place, or a result? Vague wins are usually losses. "Boost your productivity" loses to "Cut your reporting time by 40%." 1–5.

4. Sender-voice fit. Could you imagine yourself sending this? If it sounds like a "growth hacker" but you write like a 50-year-old consultant, your audience will feel the mismatch. 1–5.

Total 20. I keep anything that hits 16+ for the test. Usually that's 3–7 lines.

Two caveats. The rubric caps at 5, not 10 — LLMs as judges tend to inflate scores if you let them. And I re-rank the top 3 by hand, because the model's tie-breakers are useless. The score gets you to a shortlist; the hand-rank gets you to a decision.
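The arithmetic of the rubric is simple enough to sketch in Python: four scores from 1 to 5, a total out of 20, and a cutoff at 16. The class and function names are hypothetical, and the scores in the example are made-up placeholders for illustration, not scores from the actual test.

```python
from dataclasses import dataclass

CRITERIA = ("skim", "length", "specificity", "voice")
KEEP_THRESHOLD = 16  # out of a possible 20


@dataclass
class ScoredLine:
    line: str
    skim: int
    length: int
    specificity: int
    voice: int

    def __post_init__(self):
        # Enforce the 1-5 cap on every criterion.
        for name in CRITERIA:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be 1-5, got {value}")

    @property
    def total(self) -> int:
        return sum(getattr(self, name) for name in CRITERIA)

    @property
    def char_count(self) -> int:
        return len(self.line)


def shortlist(candidates):
    """Keep lines at or above the threshold, highest total first."""
    kept = [c for c in candidates if c.total >= KEEP_THRESHOLD]
    return sorted(kept, key=lambda c: c.total, reverse=True)


# Placeholder scores, invented for this example only.
candidates = [
    ScoredLine("Boost your productivity", skim=2, length=5, specificity=1, voice=3),
    ScoredLine("Cut your reporting time by 40%", skim=4, length=5, specificity=5, voice=4),
]
kept = shortlist(candidates)
for c in kept:
    print(c.total, c.char_count, c.line)
```

The sort only produces a shortlist. Ties and the final order of the top few still get settled by hand.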

How I ran the test

Standard A/B/n test in Customer.io:

  • 40% of the list in the test: I had 5 finalists plus 2 of my own originals, so 7 arms at ~5.7% each
  • Random holdout of 5% for control measurement
  • One-shot send, 9am recipient-local time on a Tuesday
  • Open at 24h, click at 48h, conversion at 7 days (the metric that actually pays the rent)

I did not run a sequential test. Sequential tests are statistically brutal for email — you need thousands of impressions per arm, and you have to wait for the cumulative effect. A 9am Tuesday send is a fresh 24-hour window; you can compare arms directly at hour 24.
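In practice the email platform handles the split, but the logic is worth seeing once. Here is a minimal Python sketch of a random split into 7 arms plus a holdout, assuming a 40,000-subscriber list, the ~5.7% per-arm share, and the 5% holdout. The `assign_arms` function and the group names are hypothetical, and this is not how Customer.io implements it internally.

```python
import random


def assign_arms(subscriber_ids, arm_names, arm_share, holdout_share, seed=42):
    """Shuffle the list once, then slice it into arms, a holdout, and the rest."""
    ids = list(subscriber_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible

    arm_size = round(len(ids) * arm_share)
    holdout_size = round(len(ids) * holdout_share)
    if arm_size * len(arm_names) + holdout_size > len(ids):
        raise ValueError("arms plus holdout exceed the list size")

    groups, cursor = {}, 0
    for name in arm_names:
        groups[name] = ids[cursor:cursor + arm_size]
        cursor += arm_size
    groups["holdout"] = ids[cursor:cursor + holdout_size]
    # Everyone not in an arm or the holdout; what they receive is up to you.
    groups["unassigned"] = ids[cursor + holdout_size:]
    return groups


arm_names = [f"arm_{i}" for i in range(1, 8)]  # 5 finalists + 2 originals
groups = assign_arms(range(40_000), arm_names, arm_share=0.057, holdout_share=0.05)
for name, members in groups.items():
    print(name, len(members))
```

Shuffling once and slicing guarantees that no subscriber lands in two groups, which is the property that makes the arms comparable at hour 24.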

The result

The two originals I wrote by hand finished 3rd and 4th. The winner was #2, "Why we're not selling a course this week", at a 38.2% open rate vs. 32.1% for my best original. The CTR spread was smaller (4.1% vs. 3.7%). The contrarian hook paid off at the open, but click behavior was flatter, which makes sense: the subject line delivered the open, and the body had to carry the rest.

The "stack that replaces 4 subscriptions" (#3) came in second at 35.8%, comfortably ahead of both originals. It's the kind of variant that's easy to scale across product lines because the structure is reusable.

The personalization-only one (#5) was dead last of the 7 at 27.4%. I had been wrong about the token baseline. Without a hook, "Hi {first_name}" doesn't move the needle. Personalization is a multiplier on a good line, not a substitute for one.

The question variant (#4) was a letdown — 30.2%, below the originals. The hypothesis on why: questions that promise survey-style social proof ("1,200 marketers") have decayed as a hook over the last 18 months. The format reads as "we made a report" more than "we have news." Worth keeping in the back pocket for evergreen content, not flash sales.
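To sanity-check a gap like 38.2% vs. 32.1%, a quick sketch in Python: relative lift, plus a pooled two-proportion z-test. The arm size of 2,280 is an assumption (5.7% of 40,000); the per-arm counts are not reported above. The test also treats this as a single pairwise comparison, so with 7 arms the p-value should be read as optimistic.

```python
from math import sqrt
from statistics import NormalDist


def relative_lift(variant_rate, baseline_rate):
    """Lift of the variant over the baseline, as a fraction of the baseline."""
    return (variant_rate - baseline_rate) / baseline_rate


def two_proportion_z(p_a, n_a, p_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


ARM_SIZE = 2_280  # assumed: 5.7% of a 40,000-subscriber list

lift = relative_lift(0.382, 0.321)  # winner vs. best original
z, p_value = two_proportion_z(0.382, ARM_SIZE, 0.321, ARM_SIZE)
print(f"relative lift: {lift:.1%}")
print(f"z = {z:.2f}, p = {p_value:.5f}")
```

The same two functions work for clicks and conversions. Those rates are much smaller, so expect far weaker evidence from the same arm sizes.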

The part most people get wrong

The instinct is to test the best-sounding subject line. That is the wrong test.

A subject line you love is a subject line that is calibrated to your taste, not your reader's. Your taste is biased toward clever wordplay, internal references, and (if you're a marketer) copy that sounds "right" by industry standards. None of those are open-rate drivers in the inbox.

The 100 → 5 → test pipeline is what breaks that bias. You ask for 100 because the model can produce them faster than you can read them. You cut to 5 with criteria, not vibes. You test because the only authority that matters is the inbox — not your gut, not your CMO, not the agency's "senior strategist."

I now run this for every send over 10K subscribers. The cost is roughly $0.20 in ChatGPT tokens and 30 minutes of my time per send. The lift is usually a few percentage points on opens, sometimes more. It is the cheapest experiment I run, and the only one that consistently repays the time.

What I'd skip

Don't run this on small sends (under 5K). The statistical noise will eat your result, and the prompt-time will dwarf any test winner's impact. Don't run it on transactional emails (password reset, order confirmation) — the rules are different, and "creative" subject lines on transactional email are usually net-negative on trust. And don't run more than 5 arms in a single send — beyond 5, you're splitting your audience so thin that no arm reaches significance in a 24h window.
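A rough way to see why small sends and many arms both fail is to estimate how many recipients each arm needs. The Python sketch below uses the standard normal-approximation formula for comparing two proportions. The 5% significance level and 80% power are conventional defaults, and the 32% baseline and 35% target are illustrative inputs; none of these values come from the test above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate recipients per arm needed to detect p_base vs. p_variant."""
    if p_base == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_base - p_variant) ** 2)


# Illustrative inputs: detect a 3-point lift over a 32% open rate.
needed = sample_size_per_arm(0.32, 0.35)

# A 5,000-subscriber list split evenly across 5 arms.
available = 5_000 // 5

print(f"needed per arm: {needed}, available per arm: {available}")
```

Every extra arm shrinks the available number while the needed number stays the same, so arms are the first thing to cut when the list is small.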

If you do these three things (generate 100, filter to 5 with a written rubric, test as a single send), you will stop debating subject lines in your head and start letting the inbox answer. The model doesn't replace the judgment. It replaces the blank page.

And the line that surprised me? "Why we're not selling a course this week." I would not have written that on my own. That's the whole point.