YouTube Whisper Audio Transcription Fallback — Plan¶
Motivation¶
The existing YouTube scraper (scraper/youtube.py) relies on markitdown / youtube-transcript-api
to pull auto-generated captions. Newly published videos often have no captions yet — YouTube's
auto-captioning lags by hours or days — so those videos are silently skipped. This plan adds a
fallback: when no transcript is available, download the audio with yt-dlp and transcribe it
locally using OpenAI Whisper. The output text flows into the existing ticker-mention extraction
and sentiment pipeline unchanged.
Revised data flow:
_extract_transcript(video_id)
├── markitdown (youtube-transcript-api) → text ← existing, unchanged
└── [fallback if None]
yt-dlp → audio temp file (mp3)
→ Whisper model → text
→ delete temp file
→ ticker mentions → NewsArticle rows
Scope¶
In scope:
- Add _transcribe_audio(video_id) fallback in scraper/youtube.py
- Add YOUTUBE_WHISPER_MODEL and YOUTUBE_MAX_DURATION_MINUTES to config.py
- Add openai-whisper and yt-dlp to requirements.txt
- Lazy-load the Whisper model on first use (avoid loading at import time)
Out of scope:
- Replacing markitdown — it stays as the primary path (free, instant, no compute)
- Real-time streaming transcription
- Whisper via external API (OpenAI / Groq) — local model is sufficient for this workload
- GPU setup / CUDA configuration — Whisper auto-detects; this plan does not prescribe hardware
Approach¶
1. requirements.txt¶
Add:
openai-whisper
yt-dlp
openai-whisper pulls in torch (CPU build by default). If the deployment machine has a
CUDA-capable GPU, install torch with the appropriate CUDA index URL separately — Whisper
will use it automatically via torch.cuda.is_available().
ffmpeg must be available on the system (apt install ffmpeg or brew install ffmpeg).
Add a note to docs/implementation/new-machine-setup.md.
2. config.py¶
Add to the YouTube section:
YOUTUBE_WHISPER_MODEL = "small" # tiny | base | small | medium | large
YOUTUBE_MAX_DURATION_MINUTES = 40 # skip videos longer than this
Model tradeoff reference:
| Model | Size | Approx speed (CPU) | Notes |
|---|---|---|---|
| tiny | 75 MB | ~10× realtime | Low accuracy, not recommended |
| base | 145 MB | ~7× realtime | Decent for clear speech |
| small | 466 MB | ~4× realtime | Good default — recommended |
| medium | 1.5 GB | ~2× realtime | Better accuracy, slower |
| large | 2.9 GB | ~1× realtime | Best quality, heavy on RAM/CPU |
For a 20-min finance video on small: ~5 min on CPU, ~30s on a mid-range GPU.
3. scraper/youtube.py¶
3a. Lazy-load Whisper model¶
_whisper_model = None
def _get_whisper_model():
global _whisper_model
if _whisper_model is None:
import whisper
from config import YOUTUBE_WHISPER_MODEL
_whisper_model = whisper.load_model(YOUTUBE_WHISPER_MODEL)
return _whisper_model
Load on first call, not at import time — keeps startup fast and avoids loading torch if Whisper is never needed (e.g. all videos already have captions).
3b. _get_video_duration(video_id) -> int | None¶
Use yt-dlp with --no-download --print duration to fetch duration in seconds before
downloading audio. Return None if the call fails (treat as unknown, allow through).
def _get_video_duration(video_id: str) -> int | None:
import subprocess
url = f"https://www.youtube.com/watch?v={video_id}"
try:
result = subprocess.run(
["yt-dlp", "--no-download", "--print", "duration", url],
capture_output=True, text=True, timeout=15
)
return int(result.stdout.strip())
except Exception:
return None
3c. _transcribe_audio(video_id) -> str | None¶
def _transcribe_audio(video_id: str) -> str | None:
import subprocess, tempfile, os
from config import YOUTUBE_MAX_DURATION_MINUTES
duration = _get_video_duration(video_id)
if duration is not None and duration > YOUTUBE_MAX_DURATION_MINUTES * 60:
logger.debug(f"Skipping {video_id}: duration {duration}s exceeds cap")
return None
url = f"https://www.youtube.com/watch?v={video_id}"
with tempfile.TemporaryDirectory() as tmpdir:
audio_path = os.path.join(tmpdir, "audio.mp3")
try:
subprocess.run(
[
"yt-dlp", "-x", "--audio-format", "mp3",
"--audio-quality", "5", # ~128 kbps — sufficient for speech
"-o", audio_path, url,
],
capture_output=True, timeout=120, check=True
)
except Exception as exc:
logger.debug(f"yt-dlp failed for {video_id}: {exc}")
return None
try:
model = _get_whisper_model()
result = model.transcribe(audio_path)
return result.get("text") or None
except Exception as exc:
logger.warning(f"Whisper transcription failed for {video_id}: {exc}")
return None
3d. Update _extract_transcript()¶
def _extract_transcript(video_id: str) -> str | None:
# Primary: markitdown (instant, no compute)
try:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert(f"https://www.youtube.com/watch?v={video_id}")
if result.text_content:
return result.text_content
except Exception as exc:
logger.debug(f"Transcript unavailable for {video_id}: {exc}")
# Fallback: download audio + Whisper
logger.debug(f"Falling back to Whisper for {video_id}")
return _transcribe_audio(video_id)
No changes needed in fetch_youtube_for_tickers() — it just calls _extract_transcript().
4. docs/implementation/new-machine-setup.md¶
Add a note that ffmpeg is required for the Whisper fallback:
ffmpeg # required for YouTube audio transcription (yt-dlp + Whisper)
Open Questions¶
- GPU availability: The deployment laptop GPU model is unknown. If CUDA is available,
Whisper will use it automatically — no code change needed. Verify with
python -c "import torch; print(torch.cuda.is_available())"after setup. - Concurrency: The scheduler runs
fetch_youtubewithmax_instances=1, so only one job runs at a time. Whisper transcription is blocking — this is fine. - Korean audio: Whisper supports Korean natively. For KR channels, consider setting
model.transcribe(audio_path, language="ko")to skip language detection and improve speed. AddYOUTUBE_WHISPER_LANGUAGE_HINT: dict[str, str]to config if this is needed later. - yt-dlp bot detection: YouTube increasingly blocks automated downloads. If
yt-dlpstarts failing with 403s, it may require cookie passing (--cookies-from-browser chrome). Out of scope for now — treat as a known risk.
Status¶
draft