Plan: YouTube Finance Channel Transcript Scraper

Context

Finance YouTubers surface retail sentiment and trend opinions that lag behind institutional news but often precede price momentum. This plan adds a new scraper that pulls video transcripts from a curated list of YouTube channels (both US and Korean), extracts ticker mentions, and feeds them into the existing NewsArticle → sentiment pipeline — zero new DB tables, zero changes to the ML model.

Data flow:

YouTube RSS feed → recent video IDs
  → markitdown (youtube-transcript-api) → raw transcript text
    → ticker-mention extraction → per-(video, ticker) excerpt
      → NewsArticle row (source="YouTube/…") → score_unscored_articles()

score_unscored_articles() already dispatches by article.language, so Korean articles (language="ko") route to analyze_korean() / ko_lexicon automatically — no changes needed in sentiment/analyzer.py.


Dependencies

Add to requirements.txt:

markitdown[youtube]   # Microsoft markitdown with YouTube transcript support
feedparser            # RSS feed parsing (YouTube channel feeds)

markitdown[youtube] installs youtube-transcript-api under the hood. No API key needed — it pulls auto-generated captions directly from YouTube.


config.py Changes

# --- YouTube Channels ---
# Each entry: (channel_id, display_name, market, language)
# channel_id: the UCxxxxxx part of https://www.youtube.com/channel/UCxxxxxx
# market: "US" or "KR"
# language: "en" or "ko"
YOUTUBE_CHANNELS: list[tuple[str, str, str, str]] = [
    # US finance channels
    # ("UCxxxxxx", "Channel Name", "US", "en"),

    # Korean finance channels
    # ("UCxxxxxx", "채널명", "KR", "ko"),
]

YOUTUBE_MAX_VIDEOS_PER_CHANNEL = 10    # recent videos to check per cycle
YOUTUBE_MENTION_WINDOW_CHARS   = 400   # chars of context around each ticker mention
SCHEDULE_YOUTUBE_INTERVAL      = 6 * 60 * 60  # 6 hours in seconds


# Korean company name → ticker mapping for transcript matching.
# 6-digit codes never appear in speech; spoken names do.
# Keys are canonical tickers; values are name aliases to match in transcript text.
KR_TICKER_KO_NAMES: dict[str, list[str]] = {
    "005930": ["삼성전자", "Samsung Electronics", "Samsung"],
    "000660": ["SK하이닉스", "SK Hynix", "하이닉스"],
    "035420": ["네이버", "NAVER", "Naver"],
    "035720": ["카카오", "Kakao"],
    "051910": ["LG화학", "LG Chem"],
    "373220": ["LG에너지솔루션", "LG Energy Solution", "LG에너지"],
    "006400": ["삼성SDI", "Samsung SDI"],
    "009150": ["삼성전기", "Samsung Electro-Mechanics"],
    "000990": ["DB하이텍", "DB HiTek"],
    "042700": ["한미반도체", "Hanmi Semiconductor"],
    "402340": ["SK스퀘어", "SK Square"],
    "012450": ["한화에어로스페이스", "Hanwha Aerospace", "한화에어로"],
    "042660": ["한화오션", "Hanwha Ocean"],
    "010120": ["LS ELECTRIC", "LS Electric", "엘에스일렉트릭"],
    "207940": ["삼성바이오로직스", "Samsung Biologics", "삼바"],
}

Add SCHEDULE_YOUTUBE_INTERVAL to the imports block in scheduler/jobs.py.


New File: scraper/youtube.py

Responsibilities

| Function | Purpose |
|---|---|
| _fetch_channel_videos(channel_id, max_videos) | Hit the public RSS feed, return [(video_id, title, published_at)] |
| _extract_transcript(video_id) | markitdown call, return plain text or None |
| _build_aliases(market) | Build {ticker: [alias, …]} for all active tickers in a market |
| _find_mentions(text, ticker, aliases, window) | Regex scan, return a list of excerpt strings |
| fetch_youtube_for_tickers(db) | Orchestrate: channels → videos → tickers → save NewsArticle rows |

Key implementation notes

RSS URL (no API key needed):

RSS_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

Returns up to 15 most recent uploads. Parse with feedparser.
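
A minimal sketch of _fetch_channel_videos along these lines, assuming feedparser exposes the feed's yt:videoId element as entry.yt_videoid (it does for YouTube Atom feeds) and that published_parsed carries the upload time:

import feedparser
from datetime import datetime

def _fetch_channel_videos(channel_id: str, max_videos: int) -> list[tuple[str, str, datetime | None]]:
    """Return [(video_id, title, published_at)] for the channel's most recent uploads."""
    feed = feedparser.parse(RSS_URL.format(channel_id=channel_id))  # RSS_URL as defined above
    videos = []
    for entry in feed.entries[:max_videos]:
        # feedparser maps the Atom <yt:videoId> element to entry.yt_videoid
        video_id = getattr(entry, "yt_videoid", None)
        if not video_id:
            continue
        published = entry.get("published_parsed")
        published_at = datetime(*published[:6]) if published else None
        videos.append((video_id, entry.title, published_at))
    return videos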

Transcript extraction:

from markitdown import MarkItDown
_md = MarkItDown()

def _extract_transcript(video_id: str) -> str | None:
    url = f"https://www.youtube.com/watch?v={video_id}"
    try:
        result = _md.convert(url)
        return result.text_content or None
    except Exception:
        return None   # no captions available — skip silently

Ticker alias building, US channels (source: TICKER_NAMES in config.py):

  • Always include the bare symbol (AAPL) and $SYMBOL ($AAPL)
  • Include the English display name only if it is ≥4 chars (avoids false positives on short names like "ARM" or "MU")
  • Match with a word-boundary regex, \b{re.escape(alias)}\b, case-insensitive

# Result: {"AAPL": ["AAPL", "$AAPL", "Apple"], "NVDA": ["NVDA", "$NVDA", "NVIDIA"], …}
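
A sketch of the US half of _build_aliases; the shape of TICKER_NAMES (symbol → English display name) is an assumption here, and the helper name is illustrative:

from config import TICKER_NAMES  # assumed shape: {"AAPL": "Apple", "NVDA": "NVIDIA", ...}

def _build_us_aliases() -> dict[str, list[str]]:
    """Aliases per US ticker: bare symbol, $SYMBOL, plus the display name when long enough."""
    aliases: dict[str, list[str]] = {}
    for ticker, name in TICKER_NAMES.items():
        entries = [ticker, f"${ticker}"]
        # Names under 4 chars ("ARM", "MU") collide with ordinary words, so skip them
        if name and len(name) >= 4:
            entries.append(name)
        aliases[ticker] = entries
    return aliases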

Ticker alias building, KR channels (source: KR_TICKER_KO_NAMES in config.py). Do not include the bare 6-digit code; it won't appear in speech.

  • Use all name variants from KR_TICKER_KO_NAMES[ticker]
  • Korean word boundaries behave differently from Latin ones (particles attach directly to names, so 삼성전자 also appears as 삼성전자는, 삼성전자가, …), so match without \b for Korean-script aliases; keep \b for Latin-script aliases

# Result: {"005930": ["삼성전자", "Samsung Electronics", "Samsung"], …}
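
A sketch of _find_mentions applying that boundary rule (signature taken from the table above; the Hangul-range check is just one way to detect Korean-script aliases):

import re

def _find_mentions(text: str, ticker: str, aliases: dict[str, list[str]], window: int) -> list[str]:
    """Return excerpts of ±window chars around each alias hit for the given ticker."""
    excerpts: list[str] = []
    for alias in aliases.get(ticker, []):
        if re.search(r"[가-힣]", alias):
            # Korean-script alias: no \b, since a particle usually follows the name directly
            pattern = re.escape(alias)
        else:
            # Latin-script alias: word boundaries prevent substring hits
            pattern = rf"\b{re.escape(alias)}\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            excerpts.append(text[start:end].strip())
    return excerpts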

Dedup URL scheme: Because one video can mention multiple tickers and NewsArticle.url is UNIQUE, use:

https://www.youtube.com/watch?v={video_id}#{ticker}

One row per (video, ticker) pair. The existing duplicate-check loop handles re-runs.
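
For illustration, the URL helper and a dedup check along these lines; the NewsArticle import path and query shape are assumed to match the existing scrapers:

from db.models import NewsArticle  # assumed import path

def _article_url(video_id: str, ticker: str) -> str:
    # The #{ticker} fragment keeps the link playable while making each (video, ticker) row unique
    return f"https://www.youtube.com/watch?v={video_id}#{ticker}"

def _already_saved(db, url: str) -> bool:
    return db.query(NewsArticle).filter(NewsArticle.url == url).first() is not None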

NewsArticle fields:

| Field | Value |
|---|---|
| title | "{video_title} [{ticker}]" |
| summary | Concatenated excerpts (≤3 mentions) |
| url | https://www.youtube.com/watch?v={video_id}#{ticker} |
| source | "YouTube/{channel_name}" |
| ticker | Ticker symbol (e.g. "NVDA" or "005930") |
| market | From channel config ("US" or "KR") |
| language | From channel config ("en" or "ko") |
| published_at | Video publish date from RSS |

Skip conditions (all handled inside fetch_youtube_for_tickers; see the sketch below):

  • Transcript unavailable → skip the video entirely (log at DEBUG)
  • No ticker mentions in the transcript → skip that video (expected for off-topic content)
  • Already exists in DB (URL dedup) → skip silently
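
Putting the pieces together, a sketch of fetch_youtube_for_tickers using the helpers above; the excerpt separator and some field handling are assumptions rather than settled API:

import logging

from config import (
    YOUTUBE_CHANNELS,
    YOUTUBE_MAX_VIDEOS_PER_CHANNEL,
    YOUTUBE_MENTION_WINDOW_CHARS,
)
from db.models import NewsArticle  # assumed import path

logger = logging.getLogger(__name__)

def fetch_youtube_for_tickers(db) -> int:
    """Channels → videos → transcripts → ticker mentions → NewsArticle rows. Returns rows saved."""
    saved = 0
    for channel_id, channel_name, market, language in YOUTUBE_CHANNELS:
        aliases = _build_aliases(market)
        videos = _fetch_channel_videos(channel_id, YOUTUBE_MAX_VIDEOS_PER_CHANNEL)
        for video_id, video_title, published_at in videos:
            text = _extract_transcript(video_id)
            if not text:
                logger.debug(f"No transcript for {video_id}; skipping")
                continue
            for ticker in aliases:
                excerpts = _find_mentions(text, ticker, aliases, YOUTUBE_MENTION_WINDOW_CHARS)
                if not excerpts:
                    continue
                url = _article_url(video_id, ticker)
                if _already_saved(db, url):
                    continue
                db.add(NewsArticle(
                    title=f"{video_title} [{ticker}]",
                    summary="\n\n".join(excerpts[:3]),  # separator between excerpts is an assumption
                    url=url,
                    source=f"YouTube/{channel_name}",
                    ticker=ticker,
                    market=market,
                    language=language,
                    published_at=published_at,
                ))
                saved += 1
    db.commit()
    return saved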


Sentiment routing — no code changes needed

score_unscored_articles() in sentiment/analyzer.py already dispatches by article.language:

result = analyze_text(text, language=article.language or "en")
| Channel type | language stored | Analyzer invoked |
|---|---|---|
| US finance channel | "en" | VADER + finance lexicon |
| KR finance channel | "ko" | ko_lexicon keyword scorer |

Korean transcript excerpts (e.g. "삼성전자 급등, 반도체 수주 기대", roughly "Samsung Electronics surges on hopes of semiconductor orders") will score correctly against the existing KO_POSITIVE / KO_NEGATIVE sets.


scheduler/jobs.py Changes

New job function

def job_fetch_youtube():
    """Fetch transcripts from YouTube finance channels and score ticker mentions."""
    from scraper.youtube import fetch_youtube_for_tickers
    from sentiment.analyzer import score_unscored_articles

    db = _db()
    try:
        total = fetch_youtube_for_tickers(db)
        logger.info(f"YouTube: saved {total} new ticker-mention articles")
        score_unscored_articles(db)
    except Exception as e:
        logger.error(f"job_fetch_youtube error: {e}")
    finally:
        db.close()

Add to build_scheduler()

scheduler.add_job(
    job_fetch_youtube,
    trigger=IntervalTrigger(seconds=SCHEDULE_YOUTUBE_INTERVAL),
    id="fetch_youtube",
    name="Fetch YouTube Transcripts",
    replace_existing=True,
    max_instances=1,
)

Add to run_initial_load()

try:
    logger.info("Fetching YouTube transcripts...")
    job_fetch_youtube()
except Exception as e:
    logger.error(f"Initial YouTube fetch failed: {e}")

CLAUDE.md Change

Add a row to the Scheduler Jobs table:

| fetch_youtube | 6 hours | YouTube channel transcripts (US + KR) → ticker mentions |


What Does NOT Change

  • db/models.py — no new tables; NewsArticle.market and NewsArticle.language already exist
  • sentiment/analyzer.py — language dispatch already handles "ko" vs "en"
  • models/predictor.py — sentiment features aggregate over all articles regardless of source
  • Dashboard — YouTube articles appear in News Feed with source shown as "YouTube/…"

Limitations and Mitigations

| Limitation | Mitigation |
|---|---|
| Korean auto-captions are noisier than English | Excerpt window of ±400 chars focuses on the relevant passage; ko_lexicon is keyword-based, so random filler words score neutral |
| 6-digit KR codes never appear in speech | KR_TICKER_KO_NAMES maps spoken names (삼성전자, Samsung) to the canonical ticker |
| Korean word boundaries differ from Latin | No \b for Korean-script aliases; keep \b for Latin-script aliases in the same map |
| Some videos have no captions | Silent skip, logged at DEBUG rather than WARNING |
| Transcripts are long (~10–20k chars) | Score only the mention excerpts, not the full transcript |
| YouTube may throttle transcript requests | The 6-hour interval plus max_videos=10 keeps the request rate very low |
| Channel IDs must be found manually | Lookup: channel page → View Source → search for "channelId" |

Verification

  1. Add at least one channel ID to YOUTUBE_CHANNELS in config.py (US or KR).
  2. Run manually:

     from scraper.youtube import fetch_youtube_for_tickers
     from db.database import SessionLocal

     db = SessionLocal()
     n = fetch_youtube_for_tickers(db)
     print(f"Saved {n} articles")
     db.close()

  3. Check that articles are saved with the correct market/language:

     SELECT title, source, ticker, market, language
     FROM news_articles
     WHERE source LIKE 'YouTube%'
     ORDER BY published_at DESC
     LIMIT 10;

  4. Confirm sentiment scores and that the correct model was used:

     SELECT a.language, s.model_used, s.score, s.label, a.title
     FROM news_articles a
     JOIN sentiment_scores s ON s.article_id = a.id
     WHERE a.source LIKE 'YouTube%'
     LIMIT 10;
     -- KR rows should show model_used = 'ko_lexicon'
     -- US rows should show model_used = 'vader_finance'

  5. Restart the dashboard and confirm YouTube articles appear in the News Feed tab for both markets.