Turn a YouTube Video Into a 2,000-Word Blog Post (Without the AI Smell)
Last month I helped a B2B SaaS (software as a service) founder convert one 24-minute customer interview into a 2,100-word blog post that now ranks on page one for a search term he'd been chasing for two years. The interview already existed. The post took about 90 minutes to produce. It had taken him two years to get there because he'd been ignoring a content library worth probably 40 future blog posts: converting video transcripts to writing felt like more work than starting from scratch.
It isn't, if you know the pipeline.
The trap most marketers fall into is treating a transcript like a draft. It isn't. A transcript is a raw material — closer to ore than to metal. If you ask Claude 3.5 Sonnet or GPT-4o to "turn this transcript into a blog post," you'll get something readable but generic, full of the spoken-to-written translation tics that make AI content sound like AI content: contractions get dropped, hedges get added, every paragraph opens with a throat-clearing transition. The fix is doing the conversion in stages, and being explicit about what each stage does.
Here's the pipeline I now run for clients with a video-first content library.
Step 1: Get a clean transcript, not a fancy one
You do not need to pay Rev $1.50 a minute. For 90% of jobs, one of three free or nearly free sources is enough:
- YouTube auto-captions: Open any YouTube video, click the three-dot menu, then "Show transcript." Copy. Paste. The text is lowercase with no punctuation, but the words are right about 95% of the time.
- Whisper (OpenAI's open-source speech-to-text): Better punctuation than YouTube. It does not label speakers on its own, so a multi-speaker recording needs a separate diarization tool if you want to know who said what. Free if you run it locally; about $0.006/minute via the API.
- Descript: If you're already editing the video in Descript, the transcript is sitting there waiting for you.
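To put those per-minute prices in perspective, here is the arithmetic for the 24-minute interview from the introduction. This is a minimal sketch: the function name is made up for illustration, and the rates are simply the figures quoted in this post, which may have changed since.

```python
def transcription_cost(minutes: float, rate_per_minute: float) -> float:
    """Return the cost in dollars of transcribing `minutes` of audio."""
    return round(minutes * rate_per_minute, 2)

REV_RATE = 1.50            # dollars per minute, as quoted in this post
WHISPER_API_RATE = 0.006   # dollars per minute, as quoted in this post

video_minutes = 24
print(transcription_cost(video_minutes, REV_RATE))          # 36.0
print(transcription_cost(video_minutes, WHISPER_API_RATE))  # 0.14
```

For a single video the difference is small change. Across a 40-video library it is the difference between a line item and a rounding error.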
The thing nobody tells you: don't clean the transcript before you feed it to the model. The disfluencies — the "ums," "you know what I mean," "so basically" — are signal, not noise. They cluster around the speaker's weak points. The confident, well-thought sentences come out cleanly. If you scrub the disfluencies in advance, you erase the texture the model uses to figure out which paragraphs deserve to be in the post and which were the speaker thinking out loud.
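If you want to see that clustering for yourself before handing the transcript to a model, you can count fillers per passage. This is a rough sketch and not a step in the pipeline: the filler list and the `filler_density` name are my own assumptions, and a word like "like" will over-count because it is also a normal verb.

```python
import re

# Hypothetical filler list; extend it for the speaker you are working with.
FILLERS = ["um", "uh", "you know", "like", "so basically", "i mean"]

def filler_density(passage: str) -> float:
    """Return fillers per 100 words for one passage of raw transcript."""
    text = passage.lower()
    words = re.findall(r"[a-z0-9']+", text)
    if not words:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(filler) + r"\b", text))
        for filler in FILLERS
    )
    return round(100 * hits / len(words), 1)

hesitant = "So the thing about, um, content velocity is, like, you know, it's not."
confident = "We cut them to one piece a week and traffic went up 3x in four months."

print(filler_density(hesitant) > filler_density(confident))  # True
```

A high score does not mean "cut this passage." It means "look here first" when you review the outline from the next step.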
Step 2: Extract the spine before you rewrite anything
This is the step most pipelines skip, and it's why most transcript-to-post outputs feel like loose summaries. Before any rewriting happens, ask the model to surface the argument, not the content.
You're reading a 24-minute video transcript. Before any rewriting, do this:
1. Identify the central claim the speaker is making. One sentence.
2. List 3-5 sub-claims that support it, in the order the speaker made them.
3. For each sub-claim, list the concrete examples, numbers, or stories the
speaker used to back it up. Quote exact phrasing from the transcript.
4. Flag any sub-claim that is asserted but never supported. We may cut it
from the post.
Do not summarize yet. Do not rewrite. Just map the argument.
What comes back is essentially an outline of the post, but built from what the speaker actually said, not from what a model assumes the topic should cover. Item 4's flag matters. Speakers improvise. They make claims they can't back up. In a video, an unsupported claim lands fine because the next sentence rolls past it. In a blog post, the same claim sits motionless on the page and gets challenged in the comments.
About a third of the time, this step also surfaces the real thesis of the video, which is rarely the one the speaker promised in the intro. Use what the transcript reveals, not what the title declared.
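If you keep the argument map in a structured form, the unsupported sub-claims can be separated mechanically before you decide what to cut. The sketch below assumes a hypothetical shape (a `SubClaim` with a `claim` and a list of `evidence` quotes). Nothing in the prompt above guarantees the model returns exactly this, and the two entries are invented for illustration from the sample transcript later in this post.

```python
from dataclasses import dataclass, field

@dataclass
class SubClaim:
    """One supporting claim plus the exact transcript quotes that back it."""
    claim: str
    evidence: list = field(default_factory=list)

def split_supported(sub_claims):
    """Separate sub-claims with at least one quote from those with none."""
    supported = [s for s in sub_claims if s.evidence]
    unsupported = [s for s in sub_claims if not s.evidence]
    return supported, unsupported

# Illustrative outline only; a real one comes from the model's answer.
spine = [
    SubClaim(
        "Publishing less can raise traffic",
        ["We cut them to one piece a week and traffic went up like 3x in four months."],
    ),
    SubClaim("Everyone thinks more is better"),  # asserted, no example recorded
]

keep, review = split_supported(spine)
print([s.claim for s in review])  # ['Everyone thinks more is better']
```

The `review` list is a prompt for your judgment, not an automatic delete. Some unsupported claims are worth keeping if you can supply the evidence yourself.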
Step 3: Spoken-to-written translation, section by section
Now you rewrite — but not in one big pass. A one-shot conversion is exactly where the model smooths out your voice and turns the post into AI-flavored sludge. Do it section by section, mapped to the sub-claims from Step 2, with this prompt:
Below is a section of transcript that maps to sub-claim #2 from the outline.
Convert it to a blog paragraph or two. Rules:
- Keep the speaker's word choices wherever they're distinctive. If they said
"ratchet down," do not change it to "reduce."
- Convert spoken-only constructions to written ones:
"so what I mean by that is..." → cut, just say the thing.
"and the other thing is..." → start a new paragraph.
"you know..." → cut.
- If the speaker used a number, keep the number. Do not round, do not soften.
- If the speaker told a 4-sentence story, keep it as a 4-sentence story.
Do not compress it into a clause.
- Length: roughly 1.4x the transcript word count for this section. Spoken
language is dense in ideas; written language needs more scaffolding for
the same point to land.
Output the rewritten section only. No headers, no commentary.
The 1.4x rule is the one to internalize. New writers in this workflow always compress the transcript, because spoken filler makes the source feel longer than it really is. The actual content is dense and needs more room on the page, not less. Compress it and you lose the examples that made the point land in the first place.
Run this prompt once per sub-claim. Five sub-claims means five passes. It's slower than a one-shot conversion, and the output is markedly cleaner.
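The per-section loop and the 1.4x length target are easy to script so you aren't counting words by hand. Here's a minimal sketch. `target_length` and `build_section_prompts` are names I made up, the prompt text is abbreviated, and the model call itself is left out on purpose because it depends on which API you use.

```python
LENGTH_FACTOR = 1.4  # rule of thumb from the prompt above

def target_length(section_text: str, factor: float = LENGTH_FACTOR) -> int:
    """Return the target word count for a rewritten section."""
    return round(len(section_text.split()) * factor)

def build_section_prompts(sections: dict) -> list:
    """Build one rewrite prompt per sub-claim, in outline order.

    `sections` maps a sub-claim number to the transcript text for it.
    """
    prompts = []
    for number in sorted(sections):
        text = sections[number]
        prompts.append(
            f"Below is a section of transcript that maps to sub-claim #{number} "
            "from the outline.\n"
            "Convert it to a blog paragraph or two.\n"
            f"Length: roughly {target_length(text)} words.\n\n"
            f"{text}"
        )
    return prompts

# Placeholder text standing in for two transcript sections.
sections = {2: "word " * 200, 1: "word " * 100}
prompts = build_section_prompts(sections)
print(len(prompts))                # 2, one pass per sub-claim
print(target_length(sections[1]))  # 140
```

Putting a concrete word count in each prompt tends to be easier for a model to follow than a multiplier, which is the only design choice worth noting here.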
Step 4: Add the things video can do without and blog posts can't
A video can rely on the speaker's face, tone, and pace to carry meaning. A blog post has to do that work with structure. After the section-by-section rewrite, add the things that don't exist in the source:
- A real opening hook. Most video intros are warm-ups: "Hey guys, today we're going to talk about…" Nobody reads that. Cut it. Replace paragraph one with the most surprising sentence from the body of the transcript, pulled up to the top.
- Headers every 250–400 words. Scannable structure matters more for a blog than the words inside it. Readers skim before they read.
- At least one table, code block, or bulleted list if the content earns it. Don't fake structure where the content is prose. But the speaker often said something list-shaped ("there's three ways to do this") that played fine as spoken prose and reads much better as bullets on the page.
- A pull-quote or summary line for the central claim. Bold it. Readers who only read the headers and the bold sentences should still leave with the argument.
One model call to suggest these, not apply them:
Here's the converted draft. Suggest:
- 1 alternative opening pulled from a specific sentence later in the post
- Header locations every 250-400 words
- Any sections that should become a list, table, or code block
- One sentence to bold as the post's central claim
Do not rewrite. Just suggest. I'll apply.
You apply. Not the model. This is the editorial pass and it should stay human.
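If you'd rather sanity-check the header suggestions yourself, the spacing rule is simple enough to compute. This is a sketch under one assumption: paragraphs are short enough (well under 150 words each) that starting a new section once the running count passes 250 keeps sections inside the 250 to 400 range. The function name is hypothetical, and it only proposes positions; placing the headers stays with you.

```python
def suggest_header_positions(paragraphs, min_words=250):
    """Return indexes of paragraphs that should start a new section.

    A header is suggested before a paragraph once the section in
    progress already holds at least `min_words` words.
    """
    positions = []
    running = 0
    for index, paragraph in enumerate(paragraphs):
        if running >= min_words:
            positions.append(index)
            running = 0
        running += len(paragraph.split())
    return positions

# Eight placeholder paragraphs of 100 words each.
draft = ["word " * 100 for _ in range(8)]
print(suggest_header_positions(draft))  # [3, 6]
```

With that draft the sections come out at 300, 300, and 200 words, so the last one is a candidate for merging or padding during the human pass.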
Step 5: SEO without breaking the voice
Optional, but if the post needs to rank, do this last — never before the rewrite, or the keyword starts driving the prose:
The post's primary keyword target is "[your keyword]."
Without changing the voice or adding new claims, suggest:
- One revised title (under 60 chars) that uses the keyword naturally
- A meta description (under 155 chars) drawn from the post's actual content
- 3 places in the body where the keyword (or a close variant) could fit
without making a sentence read awkwardly
- 2 internal-link anchor opportunities, with the suggested anchor text
If any suggestion would require softening the speaker's claims to sound more
"on-brand," skip it. Voice beats keyword density.
The last sentence is the one that matters. SEO-tuned prompts that aren't constrained will quietly water down a sharp claim to fit a keyword. Don't let them. A page-three post that sounds like a person is worth more long-term than a page-one post that sounds like a template.
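Models are unreliable at counting characters, so it's worth checking the length limits from the prompt yourself. A minimal sketch follows. The function name, the sample keyword, and the sample meta description are all made up for illustration, and the 60 and 155 character limits are just the ones stated in the prompt above.

```python
def check_seo_fields(title: str, meta_description: str, keyword: str) -> dict:
    """Check the title and meta description against the prompt's limits."""
    return {
        "title_under_60": len(title) < 60,
        "meta_under_155": len(meta_description) < 155,
        "keyword_in_title": keyword.lower() in title.lower(),
    }

# Hypothetical values for illustration.
result = check_seo_fields(
    title="Turn a YouTube Video Into a 2,000-Word Blog Post",
    meta_description=(
        "How one 24-minute customer interview became a 2,100-word post "
        "in about 90 minutes."
    ),
    keyword="youtube video",
)
print(all(result.values()))  # True
```

This only checks the mechanical limits. Whether the title still sounds like the speaker is the part no script can answer.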
A real before/after
One paragraph from a recent client video, run through the full pipeline.
Raw transcript:
So the thing about, um, content velocity is, like, everyone thinks more is better, right? And it's, it's not. It's really not. I had this client last year, they were publishing four times a week and getting basically zero organic traffic. We cut them to one piece a week and traffic went up like 3x in four months. Same writer. Same niche. Just less stuff but each one was actually good.
Naive AI rewrite (what you get from "turn this into a blog paragraph"):
When it comes to content velocity, many marketers assume that publishing more frequently will yield better results. However, this is often a misconception. For instance, one client was publishing four times per week with minimal organic traffic. After reducing their cadence to once per week and improving quality, their traffic increased threefold over four months.
Pipeline output:
Everyone thinks content velocity means more. It doesn't. A client of mine was publishing four times a week and getting basically zero organic traffic. We cut them to one piece a week. Traffic went up 3x in four months. Same writer, same niche — just less stuff, each one actually good.
The third version is the shortest of the three: roughly 50 words, against 71 in the raw transcript and 55 in the naive rewrite. It's also the only one where you can hear a person making a point. The pipeline is what kept the speaker's "basically zero" instead of letting the model translate it into "minimal."
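You can turn that "did the speaker's wording survive" check into a quick script. This is a sketch, not a quality score: the `dropped_phrases` name is hypothetical, the phrases are hand-picked from the raw transcript above, and the two rewrites are excerpts of the versions shown.

```python
def dropped_phrases(phrases, rewrite):
    """Return the phrases that no longer appear verbatim in the rewrite."""
    lowered = rewrite.lower()
    return [p for p in phrases if p.lower() not in lowered]

# Hand-picked from the raw transcript; choosing them is a human call.
must_keep = ["basically zero", "four times a week", "3x"]

naive = (
    "For instance, one client was publishing four times per week with minimal "
    "organic traffic. After reducing their cadence to once per week and "
    "improving quality, their traffic increased threefold over four months."
)
pipeline = (
    "A client of mine was publishing four times a week and getting basically "
    "zero organic traffic. We cut them to one piece a week. Traffic went up 3x "
    "in four months."
)

print(dropped_phrases(must_keep, naive))     # ['basically zero', 'four times a week', '3x']
print(dropped_phrases(must_keep, pipeline))  # []
```

An empty list doesn't prove the rewrite is good, but a long one is a reliable sign the model has paraphrased the voice away.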
Where this breaks
A few honest failure modes I've watched this hit:
The video is bad. No pipeline rescues a transcript with no actual argument inside it. If Step 2 produces a thin outline, the post will be thin too. Pick a different video.
The speaker is much funnier than they are clear. Comedy in spoken delivery rarely translates. If the original got its laughs from timing and facial expression, the converted post will feel flat. You'll need to add written-medium humor — tight sentence rhythm, surprising endings, the occasional one-line paragraph — rather than try to preserve the spoken jokes verbatim.
Two-speaker interviews need a different approach. The pipeline above assumes one voice. For interviews, split the transcript by speaker, run Step 2 separately on each speaker's contributions, then weave them together in Step 3. Don't try to merge two voices into one, or you'll lose what made the conversation worth recording.
Long videos eat context windows. A 90-minute webinar transcript will run 12,000+ words and start to push against the limits of what a model can hold in working memory cleanly. Break it into 20-minute chunks. Run the pipeline per chunk. Stitch in a final pass.
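Chunking by the clock is straightforward if your transcript carries timestamps. The sketch below assumes segments arrive as `(start_seconds, text)` pairs, which is a format I'm assuming for illustration; adapt the parsing to whatever your caption export actually looks like. It cuts strictly on time, so a chunk boundary can land mid-thought and may need a manual nudge.

```python
import math

def chunk_by_minutes(segments, chunk_minutes=20):
    """Group (start_seconds, text) segments into fixed-length time chunks."""
    chunk_seconds = chunk_minutes * 60
    buckets = {}
    for start_seconds, text in segments:
        index = int(start_seconds // chunk_seconds)
        buckets.setdefault(index, []).append(text)
    return [" ".join(buckets[i]) for i in sorted(buckets)]

# A 90-minute webinar in 20-minute chunks means five pipeline runs.
print(math.ceil(90 / 20))  # 5

# Placeholder segments for illustration.
segments = [
    (0, "Welcome."),
    (600, "First point."),
    (1250, "Second point."),
    (2400, "Third point."),
]
print(chunk_by_minutes(segments))
# ['Welcome. First point.', 'Second point.', 'Third point.']
```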
What I'd do differently if I were starting fresh
If I were rebuilding this workflow from scratch today, I'd build the prompts into a Claude Project with a pinned style guide and the speaker's three best written pieces sitting in the context, so the "spoken-to-written" rules and the voice samples load by default on every conversation. The 90-minute conversion drops to about 40 minutes once that's in place. The first ten posts you do this way are slower than writing from scratch. By post fifteen, the asset library starts paying you back in time you'd otherwise spend staring at a blank doc.
The point of this pipeline isn't to mass-produce content. It's to stop ignoring video work you've already done because the conversion felt harder than it actually is. Most marketers I know have a folder of webinar recordings, podcast appearances, and customer interviews that nobody is reading — because nobody is reading recordings. The text is in there. The pipeline just gets it out.