# Plan: YouTube Finance Channel Transcript Scraper

## Context
Finance YouTubers surface retail sentiment and trend opinions that lag behind institutional
news but often precede price momentum. This plan adds a new scraper that pulls video
transcripts from a curated list of YouTube channels (both US and Korean), extracts ticker
mentions, and feeds them into the existing NewsArticle → sentiment pipeline — zero new DB
tables, zero changes to the ML model.
Data flow:

```
YouTube RSS feed → recent video IDs
  → markitdown (youtube-transcript-api) → raw transcript text
  → ticker-mention extraction → per-(video, ticker) excerpt
  → NewsArticle row (source="YouTube/…") → score_unscored_articles()
```
score_unscored_articles() already dispatches by article.language, so Korean articles
(language="ko") route to analyze_korean() / ko_lexicon automatically — no changes
needed in sentiment/analyzer.py.
## Dependencies

Add to `requirements.txt`:

```
markitdown[youtube]   # Microsoft markitdown with YouTube transcript support
feedparser            # RSS feed parsing (YouTube channel feeds)
```
markitdown[youtube] installs youtube-transcript-api under the hood. No API key needed —
it pulls auto-generated captions directly from YouTube.
## config.py Changes

```python
# --- YouTube Channels ---
# Each entry: (channel_id, display_name, market, language)
# channel_id: the UCxxxxxx part of https://www.youtube.com/channel/UCxxxxxx
# market: "US" or "KR"
# language: "en" or "ko"
YOUTUBE_CHANNELS: list[tuple[str, str, str, str]] = [
    # US finance channels
    # ("UCxxxxxx", "Channel Name", "US", "en"),
    # Korean finance channels
    # ("UCxxxxxx", "채널명", "KR", "ko"),
]

YOUTUBE_MAX_VIDEOS_PER_CHANNEL = 10      # recent videos to check per cycle
YOUTUBE_MENTION_WINDOW_CHARS = 400       # chars of context around each ticker mention
SCHEDULE_YOUTUBE_INTERVAL = 6 * 60 * 60  # 6 hours in seconds

# Korean company name → ticker mapping for transcript matching.
# 6-digit codes never appear in speech; spoken names do.
# Keys are canonical tickers; values are name aliases to match in transcript text.
KR_TICKER_KO_NAMES: dict[str, list[str]] = {
    "005930": ["삼성전자", "Samsung Electronics", "Samsung"],
    "000660": ["SK하이닉스", "SK Hynix", "하이닉스"],
    "035420": ["네이버", "NAVER", "Naver"],
    "035720": ["카카오", "Kakao"],
    "051910": ["LG화학", "LG Chem"],
    "373220": ["LG에너지솔루션", "LG Energy Solution", "LG에너지"],
    "006400": ["삼성SDI", "Samsung SDI"],
    "009150": ["삼성전기", "Samsung Electro-Mechanics"],
    "000990": ["DB하이텍", "DB HiTek"],
    "042700": ["한미반도체", "Hanmi Semiconductor"],
    "402340": ["SK스퀘어", "SK Square"],
    "012450": ["한화에어로스페이스", "Hanwha Aerospace", "한화에어로"],
    "042660": ["한화오션", "Hanwha Ocean"],
    "010120": ["LS ELECTRIC", "LS Electric", "엘에스일렉트릭"],
    "207940": ["삼성바이오로직스", "Samsung Biologics", "삼바"],
}
```
Add SCHEDULE_YOUTUBE_INTERVAL to the imports block in scheduler/jobs.py.
## New File: scraper/youtube.py

### Responsibilities

| Function | Purpose |
|---|---|
| `_fetch_channel_videos(channel_id, max_videos)` | Hit the public RSS feed, return `[(video_id, title, published_at)]` |
| `_extract_transcript(video_id)` | markitdown call, return plain text or `None` |
| `_build_aliases(market)` | Build `{ticker: [alias, …]}` for all active tickers in a market |
| `_find_mentions(text, ticker, aliases, window)` | Regex scan, return list of excerpt strings |
| `fetch_youtube_for_tickers(db)` | Orchestrate: channels → videos → tickers → save NewsArticles |
### Key implementation notes

RSS URL (no API key needed):

```python
RSS_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"
```

The feed returns up to 15 most recent uploads; parse it with feedparser, as in the sketch below.
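A minimal sketch of `_fetch_channel_videos` along these lines; the `yt:video:<id>` entry-id format and the fallback timestamp are assumptions, and variable names are illustrative:

```python
import datetime as dt

import feedparser

RSS_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

def _fetch_channel_videos(channel_id: str, max_videos: int) -> list[tuple[str, str, dt.datetime]]:
    """Return [(video_id, title, published_at)] for the channel's most recent uploads."""
    feed = feedparser.parse(RSS_URL.format(channel_id=channel_id))
    videos: list[tuple[str, str, dt.datetime]] = []
    for entry in feed.entries[:max_videos]:
        # YouTube feed entry ids look like "yt:video:<video_id>" (assumption).
        video_id = entry.get("id", "").rsplit(":", 1)[-1]
        if not video_id:
            continue
        # feedparser exposes the publish date as a time.struct_time.
        published = (
            dt.datetime(*entry.published_parsed[:6])
            if entry.get("published_parsed")
            else dt.datetime.utcnow()
        )
        videos.append((video_id, entry.get("title", ""), published))
    return videos
```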
Transcript extraction:

```python
from markitdown import MarkItDown

_md = MarkItDown()

def _extract_transcript(video_id: str) -> str | None:
    url = f"https://www.youtube.com/watch?v={video_id}"
    try:
        result = _md.convert(url)
        return result.text_content or None
    except Exception:
        return None  # no captions available — skip silently
```
Ticker alias building — US channels:

Source: `TICKER_NAMES` in `config.py`.

- Always include the bare symbol (`AAPL`) and `$SYMBOL` (`$AAPL`)
- Include the English display name if ≥ 4 chars (avoids false positives on short names like "ARM" or "MU")
- Word-boundary regex: `\b{re.escape(alias)}\b`, case-insensitive

```python
# Result: {"AAPL": ["AAPL", "$AAPL", "Apple"], "NVDA": ["NVDA", "$NVDA", "NVIDIA"], …}
```
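As a sketch, the US branch of `_build_aliases` could look like the following, assuming `TICKER_NAMES` in `config.py` is a `{symbol: display_name}` dict; that shape, and the helper name `_build_us_aliases`, are assumptions here:

```python
from config import TICKER_NAMES  # assumed shape: {"AAPL": "Apple", "NVDA": "NVIDIA", ...}

def _build_us_aliases() -> dict[str, list[str]]:
    """Build {ticker: [alias, ...]} for US tickers from TICKER_NAMES."""
    aliases: dict[str, list[str]] = {}
    for symbol, name in TICKER_NAMES.items():
        entries = [symbol, f"${symbol}"]  # bare symbol and $SYMBOL
        if name and len(name) >= 4:       # skip short display names to avoid false positives
            entries.append(name)
        aliases[symbol] = entries
    return aliases
```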
Ticker alias building — KR channels:

Source: `KR_TICKER_KO_NAMES` in `config.py`. Do not include the bare 6-digit code — it won't appear in speech.

- Use all name variants from `KR_TICKER_KO_NAMES[ticker]`
- Korean particles attach directly to names (e.g. 삼성전자가, 삼성전자는), so `\b` boundaries are unreliable for Hangul; match Korean-script aliases without `\b` and keep `\b` for Latin-script aliases

```python
# Result: {"005930": ["삼성전자", "Samsung Electronics", "Samsung"], …}
```
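A sketch of `_find_mentions` applying this boundary rule and the ±`YOUTUBE_MENTION_WINDOW_CHARS` excerpt window. Two details are assumptions: the Hangul-detection heuristic, and only anchoring `\b` next to word characters (a literal `\b$AAPL\b` would never match, since `\b` needs a word character on one side). The `ticker` parameter is kept only to mirror the signature in the Responsibilities table.

```python
import re

def _find_mentions(text: str, ticker: str, aliases: list[str], window: int) -> list[str]:
    """Return excerpts of ±window chars around each alias hit in the transcript."""
    excerpts: list[str] = []
    for alias in aliases:
        escaped = re.escape(alias)
        # Hangul syllables block U+AC00–U+D7A3: Korean-script aliases match without \b.
        if any("\uac00" <= ch <= "\ud7a3" for ch in alias):
            pattern = escaped
        else:
            # Only anchor \b on sides that begin/end with a word character,
            # otherwise aliases like "$AAPL" could never match.
            prefix = r"\b" if alias[:1].isalnum() else ""
            suffix = r"\b" if alias[-1:].isalnum() else ""
            pattern = prefix + escaped + suffix
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            excerpts.append(text[start:end].strip())
    return excerpts
```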
Dedup URL scheme:

Because one video can mention multiple tickers and `NewsArticle.url` is UNIQUE, use:

```
https://www.youtube.com/watch?v={video_id}#{ticker}
```

One row per (video, ticker) pair. The existing duplicate-check loop handles re-runs.
NewsArticle fields:
| Field | Value |
|---|---|
| title | "{video_title} [{ticker}]" |
| summary | Concatenated excerpts (≤3 mentions, joined by …) |
| url | https://www.youtube.com/watch?v={video_id}#{ticker} |
| source | "YouTube/{channel_name}" |
| ticker | ticker symbol (e.g. "NVDA" or "005930") |
| market | from channel config ("US" or "KR") |
| language | from channel config ("en" or "ko") |
| published_at | video publish date from RSS |
Skip conditions:

- Transcript unavailable → skip the video entirely (log at DEBUG)
- No ticker mentions in transcript → skip that video (expected for off-topic content)
- Already exists in DB (URL dedup) → skip silently
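Putting the pieces together, a sketch of `fetch_youtube_for_tickers` under the assumptions above: a SQLAlchemy-style session, a `db.models.NewsArticle` import path, and the helper functions from the Responsibilities table. Exact names and paths should follow the existing scraper modules.

```python
import logging

from config import (
    YOUTUBE_CHANNELS,
    YOUTUBE_MAX_VIDEOS_PER_CHANNEL,
    YOUTUBE_MENTION_WINDOW_CHARS,
)
from db.models import NewsArticle  # assumed import path

logger = logging.getLogger(__name__)

def fetch_youtube_for_tickers(db) -> int:
    """Scan configured channels and save one NewsArticle per (video, ticker) pair."""
    saved = 0
    for channel_id, channel_name, market, language in YOUTUBE_CHANNELS:
        aliases_by_ticker = _build_aliases(market)
        videos = _fetch_channel_videos(channel_id, YOUTUBE_MAX_VIDEOS_PER_CHANNEL)
        for video_id, title, published_at in videos:
            transcript = _extract_transcript(video_id)
            if not transcript:
                logger.debug(f"No transcript for {video_id}, skipping")
                continue
            for ticker, aliases in aliases_by_ticker.items():
                excerpts = _find_mentions(transcript, ticker, aliases, YOUTUBE_MENTION_WINDOW_CHARS)
                if not excerpts:
                    continue  # expected: video doesn't discuss this ticker
                url = f"https://www.youtube.com/watch?v={video_id}#{ticker}"
                if db.query(NewsArticle).filter(NewsArticle.url == url).first():
                    continue  # URL dedup: already saved on a previous run
                db.add(NewsArticle(
                    title=f"{title} [{ticker}]",
                    summary=" … ".join(excerpts[:3]),
                    url=url,
                    source=f"YouTube/{channel_name}",
                    ticker=ticker,
                    market=market,
                    language=language,
                    published_at=published_at,
                ))
                saved += 1
    db.commit()
    return saved
```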
## Sentiment routing — no code changes needed

`score_unscored_articles()` in `sentiment/analyzer.py` already dispatches by `article.language`:

```python
result = analyze_text(text, language=article.language or "en")
```
| Channel type | language stored | Analyzer invoked |
|---|---|---|
| US finance channel | `"en"` | VADER + finance lexicon |
| KR finance channel | `"ko"` | ko_lexicon keyword scorer |
Korean transcript excerpts (e.g. "삼성전자 급등, 반도체 수주 기대", roughly "Samsung Electronics surges, semiconductor order hopes") will score correctly against the existing KO_POSITIVE / KO_NEGATIVE sets.
## scheduler/jobs.py Changes

### New job function

```python
def job_fetch_youtube():
    """Fetch transcripts from YouTube finance channels and score ticker mentions."""
    from scraper.youtube import fetch_youtube_for_tickers
    from sentiment.analyzer import score_unscored_articles

    db = _db()
    try:
        total = fetch_youtube_for_tickers(db)
        logger.info(f"YouTube: saved {total} new ticker-mention articles")
        score_unscored_articles(db)
    except Exception as e:
        logger.error(f"job_fetch_youtube error: {e}")
    finally:
        db.close()
```
### Add to build_scheduler()

```python
scheduler.add_job(
    job_fetch_youtube,
    trigger=IntervalTrigger(seconds=SCHEDULE_YOUTUBE_INTERVAL),
    id="fetch_youtube",
    name="Fetch YouTube Transcripts",
    replace_existing=True,
    max_instances=1,
)
```
### Add to run_initial_load()

```python
try:
    logger.info("Fetching YouTube transcripts...")
    job_fetch_youtube()
except Exception as e:
    logger.error(f"Initial YouTube fetch failed: {e}")
```
## CLAUDE.md Change

Add a row to the Scheduler Jobs table:

```
| fetch_youtube | 6 hours | YouTube channel transcripts (US + KR) → ticker mentions |
```
## What Does NOT Change

- `db/models.py` — no new tables; `NewsArticle.market` and `NewsArticle.language` already exist
- `sentiment/analyzer.py` — language dispatch already handles `"ko"` vs `"en"`
- `models/predictor.py` — sentiment features aggregate over all articles regardless of source
- Dashboard — YouTube articles appear in the News Feed with source shown as `"YouTube/…"`
## Limitations and Mitigations
| Limitation | Mitigation |
|---|---|
| Korean auto-captions noisier than English | Excerpt window ±400 chars focuses on the relevant passage; ko_lexicon is keyword-based so random filler words score neutral |
| 6-digit KR codes never appear in speech | KR_TICKER_KO_NAMES maps spoken names (삼성전자, Samsung) to the canonical ticker |
| Korean word boundaries differ from Latin | Use no \b for Korean-script aliases; keep \b for Latin-script aliases in the same map |
| Some videos have no captions | Silent skip — logged at DEBUG, not WARNING |
| Transcripts long (~10–20k chars) | Score only the mention excerpts, not the full transcript |
| YouTube may throttle transcript requests | 6-hour interval + max_videos=10 keeps rate very low |
| Channel IDs must be found manually | Lookup: channel page → View Source → search "channelId" (see the sketch below) |
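If a programmatic lookup is preferred over View Source, a hypothetical helper along these lines can pull the id out of a channel page. It assumes the page source embeds `"channelId":"UC…"` (which is what the manual lookup relies on); YouTube may serve a consent page in some regions, so treat it as best-effort:

```python
import re
import urllib.request

def find_channel_id(channel_url: str) -> str | None:
    """Best-effort scrape of the canonical UC… channel id from a channel page."""
    req = urllib.request.Request(channel_url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", errors="ignore")
    match = re.search(r'"channelId":"(UC[0-9A-Za-z_-]{22})"', html)
    return match.group(1) if match else None
```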
## Verification

- Add at least one channel ID to `YOUTUBE_CHANNELS` in `config.py` (US or KR)
- Run manually:

  ```python
  from scraper.youtube import fetch_youtube_for_tickers
  from db.database import SessionLocal

  db = SessionLocal()
  n = fetch_youtube_for_tickers(db)
  print(f"Saved {n} articles")
  db.close()
  ```

- Check articles saved with correct market/language:

  ```sql
  SELECT title, source, ticker, market, language
  FROM news_articles
  WHERE source LIKE 'YouTube%'
  ORDER BY published_at DESC
  LIMIT 10;
  ```

- Confirm sentiment scores and correct model used:

  ```sql
  SELECT a.language, s.model_used, s.score, s.label, a.title
  FROM news_articles a
  JOIN sentiment_scores s ON s.article_id = a.id
  WHERE a.source LIKE 'YouTube%'
  LIMIT 10;
  -- KR rows should show model_used = 'ko_lexicon'
  -- US rows should show model_used = 'vader_finance'
  ```

- Restart dashboard — YouTube articles appear in the News Feed tab for both markets