Self-Host Mistral Small 24B for Ad Copy: Full Setup + A Blind Benchmark Against GPT-4o
Contents
$312. That's what one client cost me in OpenAI bills last month, and most of it was ad copy — primary texts, headlines, and RSA (Responsive Search Ad, 动态搜索广告) descriptions for an account pushing ~$4,200/day on Meta and Google. I wasn't going to fire GPT-4o, but I wanted to know if a $0.60/watt GPU sitting in my closet could match its output for the parts of the job where I was burning the most tokens: variations at scale.
Mistral Small 3 (the 24B release, January 2025) was the first open-weight model I'd seen in a while that was actually positioned for "one consumer GPU, no quantization gymnastics." Mistral's own pitch was that it runs on an RTX 4090 or a 32GB-RAM laptop. That was the trigger. I ordered a second 4090 for an old Threadripper build I had lying around, and ran the same brief through both models, blind-rated.
This is the actual setup I landed on, the prompt template I use for ad copy, the result of the blind A/B, and the cost math that made me keep GPT-4o for some clients and switch to self-hosted Mistral for others.
What you actually need to run it
The marketing for "runs on a 4090" is technically true and practically misleading. Here's what the realistic spec table looks like for Mistral-Small-24B-Instruct-2501 (and its March 2025 update, Small 3.1, which is the same 24B with a 128k context window and Apache 2.0 license):
| Quantization (a technique that compresses model weights to use less VRAM) | File size | Min VRAM (video RAM) | Practical use |
|---|---|---|---|
| FP16 (full precision) | ~47 GB | 48 GB | 2× RTX 4090 or A6000 |
| Q8_0 | ~26 GB | 28 GB | 1× RTX 4090 (24 GB) — tight |
| Q6_K | ~22 GB | 24 GB | 1× RTX 4090, comfortable |
| Q4_K_M | ~17 GB | 20 GB | 1× RTX 3090 / 4070 Ti SUPER |
| Q3_K_L | ~14 GB | 16 GB | 1× RTX 4060 Ti 16GB |
| Q2_K | ~12 GB | 14 GB | Edge case, quality drops |
The 4090 sweet spot is Q6_K. You use the full 24GB of VRAM, generation sits at roughly 18-22 tokens/second on a single card, and quality loss vs FP16 is below what I could detect in a blind read. Q4_K_M is the answer if you're on a 3090 or 4070 Ti SUPER.
For RAM-only inference on a Mac or a desktop without a discrete GPU, the Ollama MLX build of Small 3.1 fits in 32GB unified memory but you'll be at 4-7 tokens/second. Fine for testing prompts, not for batch-producing 200 ad variants in an afternoon.
Two setups: Ollama for laptops, vLLM for the server
I run both, on different machines, for different jobs. Picking the wrong one costs you hours.
Ollama (MacBook Pro M3 Max, 64GB): This is my prompt-iteration machine. Install is one line, no Python environment to fight with.
bash# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral-small:24b-instruct-2501-q6_KThe Ollama library exposes it as an OpenAI-compatible endpoint at http://localhost:11434/v1, which means every tool I already use (LangChain, LlamaIndex, my own scripts) just points at it like it's GPT-4o, no code changes. First-token latency on the M3 Max is around 1.2 seconds for a typical ad-copy prompt; full 80-token response in 6-8 seconds. I use this for everything that doesn't need parallelism: prompt engineering, reviewing a small batch, sanity-checking before I commit to a 500-variant sprint.
vLLM (Linux box, 2× RTX 4090, Threadripper 3970X): This is the production machine. vLLM is a high-throughput inference engine (it batches incoming requests automatically to keep the GPU busy) and the difference is night and day for batch work. Where Ollama serves one user at a time, vLLM batches requests and pushes the same 4090 to 1,200-1,800 tokens/second aggregate throughput at concurrency 8.
bash# vLLM with the official Mistral Small 3.1 build
pip install vllm
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
--quantization awq-q4 \
--max-model-len 8192 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.92AWQ (Activation-aware Weight Quantization, 激活感知权重量化) Q4 is what I run on the server because I'm not VRAM-constrained and AWQ has better kernel support on Hopper/Ada (NVIDIA's recent GPU architectures) than GGUF (a quantization format Ollama uses). Output quality is indistinguishable from Q6_K at ad-copy prompt lengths. If you're on a single 4090, drop --tensor-parallel-size 1 and --quantization awq-q4 — it'll fit.
The OpenAI-compatible server comes up on :8000 by default. Point any ad-copy tool that talks to OpenAI at http://your-server:8000/v1 and it just works.
The ad-copy prompt I actually use
The first three versions of this prompt I tried were "write 10 Google Ads headlines for a DTC skincare brand." The output was generic mush. The version that started producing useful work has four things bolted on:
textYou are a senior direct-response copywriter (直效营销文案) who has written
$50M+ in paid social and search. You write for performance, not vibes.
Product: {{product_name}}
Offer: {{offer}}
Target audience: {{persona}}
Tone: {{tone}} # e.g. clinical-authoritative, warm-confessional, urgent
Channel: {{channel}} # meta_primary_text, google_rsa_headline, linkedin_intro
Max length: {{max_chars}} characters
Forbidden: {{banned_phrases}} # e.g. "revolutionary", "game-changing", emoji
For each variant:
1. Lead with the strongest specific benefit, not a generic claim
2. Use a number or named proof point in the first 8 words
3. One CTA (Call To Action, 行动号召) verb, not "click here to learn more"
4. Avoid second-person "you" in the opening 4 words if a pain-point pattern is stronger
5. Output as JSON: {"variants": [{"primary": "...", "headline": "...", "angle": "..."}]}
Generate {{n}} variants. Vary the angle across variants — do not just
paraphrase the same idea. Cover at least 3 distinct psychological hooks
from this list: social proof, loss aversion, curiosity gap, identity,
specificity, contrarian.Two things that mattered more than the model: (1) the angle list at the end — without it, every variant came back paraphrased; (2) the "forbidden" field — banning the same five generic phrases eliminated 80% of the "revolutionary, game-changing" slop both models loved to default to.
I keep a per-client version of this in a Notion page. Switching from one DTC (Direct-To-Consumer, 直接面向消费者) brand to a B2B SaaS client is a 30-second edit, not a re-prompt.
The blind benchmark
I generated 50 ad-copy briefs across the same five real client accounts — three DTC e-com, one B2B SaaS, one local services business. For each brief I ran the prompt twice: once against gpt-4o-2024-08-06 (the production model at the time), once against Mistral Small 3.1 on my vLLM server. Identical temperature (0.7), identical top-p (0.9), identical prompt text. I randomized output order, stripped the model name, and had a senior marketer who'd never seen the outputs rank them 1-5 on four criteria:
- Hook strength — does the first line stop a thumb?
- Specificity — concrete numbers, named ingredients, real objections vs vague claims
- Channel fit — would I actually run this in the placement it claims?
- Originality — is this the same angle as the other 9 variants, or a different one?
50 briefs × 4 criteria × 2 raters = 400 ratings. Here's what came out:
| Metric | GPT-4o | Mistral Small 3.1 (local) | Gap |
|---|---|---|---|
| Hook strength (avg /5) | 4.1 | 3.7 | -0.4 |
| Specificity | 4.3 | 3.4 | -0.9 |
| Channel fit | 4.0 | 3.9 | -0.1 |
| Originality | 3.5 | 3.8 | +0.3 |
| Overall preference (paired blind, % of pairs) | 54% | 42% | 4% tied |
Translation: GPT-4o is still the better ad-copy model. Mistral Small 3.1 was rated equal or better on channel fit and originality, and worse on specificity — which tracks with what I see qualitatively. Mistral is more creative and less concrete. For "introduce a new angle" or "give me 10 hooks I haven't tried," it's competitive. For "name three specific objections this audience has about retinol and address each one," GPT-4o wins by a real margin.
That's the finding I actually use.
The cost math that decided the rollout
Here's where self-hosting eats the API's lunch. I run roughly 1,800 ad-copy generations per month for the small clients — say 600 input tokens + 350 output tokens per generation on average.
GPT-4o cost:
- Input: 1,800 × 600 / 1,000,000 × $2.50 = $2.70
- Output: 1,800 × 350 / 1,000,000 × $10.00 = $6.30
- Total: $9.00/month for raw tokens
That $9 is not the real cost. OpenAI charges ~$0 when you're below 1M tokens/day, but I also use GPT-4o for 5 other things on the same account — strategy summaries, brief expansions, image prompts, occasional analysis. Ad copy is maybe 40% of total GPT-4o spend. Total bill for the account last month was $312. Of that, $112 was ad copy.
Self-hosted cost (2× 4090 box):
- Hardware amortized over 3 years: ~$3,800 / 36 months = $106/month
- Power: 2× 4090 at ~300W each + system = ~700W, 24/7 → ~$90/month at $0.18/kWh
- Total: ~$196/month, all-you-can-eat
Break-even: 1,800 generations/month × current pricing puts me at $112 GPT-4o vs ~$196 self-hosted. GPT-4o is still cheaper at my current volume.
That changes at scale. At 5,000 generations/month the API bill hits $311 and the self-hosted box is still $196. At 10,000, the API is $622 and the box is the same. So I keep GPT-4o for the small clients and route the heavy-batch work (the 500-variant ad sprints, the keyword-expansion-to-copy loops) to the local box. The local box earns its keep on two clients; the others use the API.
There's a third path I should mention: OpenRouter's hosted Mistral Small at roughly $0.20/M input, $0.60/M output. No hardware, no setup, same model. For someone who's just curious or whose volume is below break-even, that's the move. You lose the data-privacy argument but keep the cost saving.
What I'd skip if I were starting over
Three things cost me more time than they saved.
First, the "I need 7B / 13B / 24B comparison." I did it. The 7B models are not close on ad copy — specificity collapses. The 13Bs are usable. The 24B is the first tier where the output is good enough to use without heavy human rewriting. Start at 24B. Don't spend a week on the smaller variants.
Second, the LM Studio detour. LM Studio is a great GUI (graphical user interface) for trying models, but its inference backend (llama.cpp with a forked quantization path) is materially slower than vLLM at the same quantization. I lost a day. If you want a GUI, use Ollama. If you want throughput, use vLLM. Pick one.
Third, fine-tuning. I tried LoRA (Low-Rank Adaptation, 一种参数高效微调方法) fine-tuning Mistral Small on 800 winning ads from a past client. It did not move the blind-rating needle. The generic base model + better prompts beat the fine-tune. Fine-tuning is a 2026 problem for ad copy, not a 2025 one. The prompt template above is doing 80% of the work.
The verdict I keep coming back to
GPT-4o is still the better ad-copy model on the dimensions that matter most — specificity and hook strength. For a small account, just use the API. For an agency doing batch production across many clients, or for anyone with data-privacy requirements (medical, financial, legal), self-hosting Mistral Small 24B is now a real option, not a science project. The model is good enough, the hardware is reasonable, and vLLM makes the throughput problem go away.
I'm running both. The local box handles the bulk generation; GPT-4o handles the final selection and the work where I genuinely need the better model. The $312 line item on last month's invoice is now closer to $190, and the second 4090 is paid for by the end of Q1.
If you only take one thing from this: don't replace the API. Add the local model as a layer underneath it. The two together are cheaper and faster than either alone.