YouTube Whisper Audio Transcription Fallback — Plan

Motivation

The existing YouTube scraper (scraper/youtube.py) relies on markitdown / youtube-transcript-api to pull auto-generated captions. Newly published videos often have no captions yet — YouTube's auto-captioning lags by hours or days — so those videos are silently skipped. This plan adds a fallback: when no transcript is available, download the audio with yt-dlp and transcribe it locally using OpenAI Whisper. The output text flows into the existing ticker-mention extraction and sentiment pipeline unchanged.

Revised data flow:

_extract_transcript(video_id)
  ├── markitdown (youtube-transcript-api) → text      ← existing, unchanged
  └── [fallback if None]
        yt-dlp → audio temp file (mp3)
          → Whisper model → text
            → delete temp file
              → ticker mentions → NewsArticle rows

Scope

In scope:

  • Add _transcribe_audio(video_id) fallback in scraper/youtube.py
  • Add YOUTUBE_WHISPER_MODEL and YOUTUBE_MAX_DURATION_MINUTES to config.py
  • Add openai-whisper and yt-dlp to requirements.txt
  • Lazy-load the Whisper model on first use (avoid loading at import time)

Out of scope:

  • Replacing markitdown: it stays as the primary path (free, instant, no compute)
  • Real-time streaming transcription
  • Whisper via external API (OpenAI / Groq): a local model is sufficient for this workload
  • GPU setup / CUDA configuration: Whisper auto-detects; this plan does not prescribe hardware

Approach

1. requirements.txt

Add:

openai-whisper
yt-dlp

openai-whisper pulls in torch. Whether the default wheel ships with CUDA support depends on the platform, so on a machine with a CUDA-capable GPU, check torch.cuda.is_available() after installing and, if it returns False, reinstall torch with the appropriate CUDA index URL. Whisper selects the device automatically via torch.cuda.is_available().

ffmpeg must be available on the system (apt install ffmpeg or brew install ffmpeg). Add a note to docs/implementation/new-machine-setup.md.
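
Because yt-dlp and ffmpeg are both external executables, a startup check could report a missing tool before the first fallback attempt. The following is a minimal sketch and not part of the plan as written; the names missing_tools and REQUIRED_TOOLS are hypothetical, and it assumes both tools are invoked from PATH.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumption: both tools are invoked as executables found on PATH.
REQUIRED_TOOLS = ("ffmpeg", "yt-dlp")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list:
    """Return the names of required executables that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

Calling missing_tools() once at scheduler startup and logging a warning when the list is non-empty would surface a missing ffmpeg before it shows up as a yt-dlp failure in the debug log.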

2. config.py

Add to the YouTube section:

YOUTUBE_WHISPER_MODEL = "small"          # tiny | base | small | medium | large
YOUTUBE_MAX_DURATION_MINUTES = 40        # skip videos longer than this

Model tradeoff reference:

Model    Size     Approx. speed (CPU)   Notes
tiny     75 MB    ~10× realtime         Low accuracy, not recommended
base     145 MB   ~7× realtime          Decent for clear speech
small    466 MB   ~4× realtime          Good default (recommended)
medium   1.5 GB   ~2× realtime          Better accuracy, slower
large    2.9 GB   ~1× realtime          Best quality, heavy on RAM/CPU

For a 20-min finance video on small: ~5 min on CPU, ~30s on a mid-range GPU.
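
The CPU estimate is the video length divided by the realtime factor from the model tradeoff reference. A small sketch of that arithmetic follows; the dictionary and function names are hypothetical, and the factors are the plan's rough figures rather than measurements.

```python
# Approximate CPU speed relative to realtime, taken from the model tradeoff reference.
CPU_REALTIME_FACTOR = {"tiny": 10, "base": 7, "small": 4, "medium": 2, "large": 1}


def estimated_cpu_minutes(video_minutes: float, model: str = "small") -> float:
    """Estimate transcription wall time: video length divided by the speed factor."""
    return video_minutes / CPU_REALTIME_FACTOR[model]


print(estimated_cpu_minutes(20))   # 20 / 4 = 5.0
print(estimated_cpu_minutes(40))   # worst case at the 40-minute cap: 10.0
```

At the default cap of 40 minutes, the small model would therefore take roughly 10 minutes per video on CPU, which is worth keeping in mind when sizing the scheduler interval.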

3. scraper/youtube.py

3a. Lazy-load Whisper model

_whisper_model = None

def _get_whisper_model():
    global _whisper_model
    if _whisper_model is None:
        import whisper
        from config import YOUTUBE_WHISPER_MODEL
        _whisper_model = whisper.load_model(YOUTUBE_WHISPER_MODEL)
    return _whisper_model

Load on first call, not at import time — keeps startup fast and avoids loading torch if Whisper is never needed (e.g. all videos already have captions).
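
The load-once behaviour can be checked without importing Whisper at all. The sketch below uses functools.lru_cache in place of the module-level global; make_lazy_loader and fake_load are hypothetical names, and fake_load stands in for whisper.load_model so the example runs without torch.

```python
from functools import lru_cache
from typing import Any, Callable


def make_lazy_loader(load: Callable[[], Any]) -> Callable[[], Any]:
    """Wrap `load` so it runs on the first call only; later calls reuse the result."""

    @lru_cache(maxsize=1)
    def get() -> Any:
        return load()

    return get


load_calls = []


def fake_load() -> object:
    # Stand-in for whisper.load_model(...), so the sketch runs without torch.
    load_calls.append("loaded")
    return object()


get_model = make_lazy_loader(fake_load)
first = get_model()    # triggers the one and only load
second = get_model()   # served from the cache
```

Either form gives the same guarantee: nothing is loaded until the first video actually needs the fallback.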

3b. _get_video_duration(video_id) -> int | None

Use yt-dlp with --no-download --print duration to fetch duration in seconds before downloading audio. Return None if the call fails (treat as unknown, allow through).

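def _get_video_duration(video_id: str) -> int | None:
    """Return the video length in seconds, or None if it cannot be determined."""
    import subprocess
    url = f"https://www.youtube.com/watch?v={video_id}"
    try:
        # --print already implies simulation; --no-download makes the intent explicit.
        result = subprocess.run(
            ["yt-dlp", "--no-download", "--print", "duration", url],
            capture_output=True, text=True, timeout=15
        )
        # Empty or non-numeric output (for example "NA") raises ValueError -> None.
        return int(float(result.stdout.strip()))
    except (subprocess.SubprocessError, OSError, ValueError):
        return None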
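
The "unknown means allow through" rule is easy to get backwards, so it may help to isolate it in one predicate. This is a sketch only; exceeds_duration_cap is a hypothetical helper, and the plan itself inlines the same comparison in _transcribe_audio().

```python
from typing import Optional


def exceeds_duration_cap(duration_s: Optional[int], cap_minutes: int) -> bool:
    """True only when the duration is known and longer than the cap."""
    if duration_s is None:
        return False  # unknown duration: allow the video through
    return duration_s > cap_minutes * 60
```

With the default cap of 40 minutes, a video of exactly 2400 seconds is still transcribed; only longer videos are skipped.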

3c. _transcribe_audio(video_id) -> str | None

def _transcribe_audio(video_id: str) -> str | None:
    import subprocess, tempfile, os
    from config import YOUTUBE_MAX_DURATION_MINUTES

    duration = _get_video_duration(video_id)
    if duration is not None and duration > YOUTUBE_MAX_DURATION_MINUTES * 60:
        logger.debug(f"Skipping {video_id}: duration {duration}s exceeds cap")
        return None

    url = f"https://www.youtube.com/watch?v={video_id}"
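    with tempfile.TemporaryDirectory() as tmpdir:
        # Let yt-dlp fill in the extension; after -x conversion the file is audio.mp3.
        # The directory and its contents are removed when the with-block exits.
        out_template = os.path.join(tmpdir, "audio.%(ext)s")
        audio_path = os.path.join(tmpdir, "audio.mp3")
        try:
            subprocess.run(
                [
                    "yt-dlp", "-x", "--audio-format", "mp3",
                    "--audio-quality", "5",   # ~128 kbps, sufficient for speech
                    "-o", out_template, url,
                ],
                capture_output=True, timeout=120, check=True
            )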
        except Exception as exc:
            logger.debug(f"yt-dlp failed for {video_id}: {exc}")
            return None

        try:
            model = _get_whisper_model()
            result = model.transcribe(audio_path)
            return result.get("text") or None
        except Exception as exc:
            logger.warning(f"Whisper transcription failed for {video_id}: {exc}")
            return None
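
Whisper's transcribe() returns a dict whose "text" entry holds the full transcript. A whitespace-only string is truthy in Python, so the expression result.get("text") or None would pass it on to the ticker extraction step. A small normalising helper, sketched here under the hypothetical name transcript_text, would treat blank output the same as a failed transcription.

```python
from typing import Any, Mapping, Optional


def transcript_text(result: Mapping[str, Any]) -> Optional[str]:
    """Return the stripped transcript, or None if it is missing or blank."""
    text = (result.get("text") or "").strip()
    return text or None
```

Inside _transcribe_audio() this would replace the final return with return transcript_text(result).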

3d. Update _extract_transcript()

def _extract_transcript(video_id: str) -> str | None:
    # Primary: markitdown (instant, no compute)
    try:
        from markitdown import MarkItDown
        md = MarkItDown()
        result = md.convert(f"https://www.youtube.com/watch?v={video_id}")
        if result.text_content:
            return result.text_content
    except Exception as exc:
        logger.debug(f"Transcript unavailable for {video_id}: {exc}")

    # Fallback: download audio + Whisper
    logger.debug(f"Falling back to Whisper for {video_id}")
    return _transcribe_audio(video_id)

No changes needed in fetch_youtube_for_tickers() — it just calls _extract_transcript().
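
The primary-then-fallback order can be exercised without network access or a Whisper model by expressing it as a chain of callables. This is an illustrative sketch, not the plan's implementation; first_transcript and the Extractor alias are hypothetical names.

```python
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[str]]


def first_transcript(video_id: str, extractors: List[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(video_id)
        except Exception:
            continue  # a failing source falls through to the next one
        if text:
            return text
    return None
```

In a unit test, the markitdown path and _transcribe_audio can be replaced by plain functions that return fixed strings, raise, or return None, which covers the cases the scheduler is likely to hit.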

4. docs/implementation/new-machine-setup.md

Add a note that ffmpeg is required for the Whisper fallback:

ffmpeg          # required for YouTube audio transcription (yt-dlp + Whisper)

Open Questions

  • GPU availability: The deployment laptop GPU model is unknown. If CUDA is available, Whisper will use it automatically — no code change needed. Verify with python -c "import torch; print(torch.cuda.is_available())" after setup.
  • Concurrency: The scheduler runs fetch_youtube with max_instances=1, so only one job runs at a time. Whisper transcription is blocking — this is fine.
  • Korean audio: Whisper supports Korean natively. For KR channels, consider setting model.transcribe(audio_path, language="ko") to skip language detection and improve speed. Add YOUTUBE_WHISPER_LANGUAGE_HINT: dict[str, str] to config if this is needed later.
  • yt-dlp bot detection: YouTube increasingly blocks automated downloads. If yt-dlp starts failing with 403s, it may require cookie passing (--cookies-from-browser chrome). Out of scope for now — treat as a known risk.
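
If the Korean language hint is adopted later, the lookup could be as small as the sketch below. It assumes hints are keyed by YouTube channel ID, which the plan does not specify; the channel ID shown is a placeholder and transcribe_options is a hypothetical helper.

```python
from typing import Dict, Optional

# Assumption: hints are keyed by YouTube channel ID; the ID below is a placeholder.
YOUTUBE_WHISPER_LANGUAGE_HINT: Dict[str, str] = {
    "UC_EXAMPLE_KR_CHANNEL": "ko",
}


def transcribe_options(
    channel_id: Optional[str],
    hints: Dict[str, str] = YOUTUBE_WHISPER_LANGUAGE_HINT,
) -> Dict[str, str]:
    """Build keyword arguments for model.transcribe(); empty means auto-detect."""
    language = hints.get(channel_id) if channel_id else None
    return {"language": language} if language else {}
```

The call site would then read model.transcribe(audio_path, **transcribe_options(channel_id)). Since _transcribe_audio() currently receives only video_id, passing a channel ID through would need a small signature change.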

Status

draft