
Data Ingestion — Implementation

Current State

Working

  • Hourly price refresh: US via yfinance, KR via FinanceDataReader
  • US news via Finnhub (15 min cycle, all active US tickers)
  • KR news via NAVER Finance scraper (30 min cycle)
  • YouTube transcripts via scraper/youtube.py (6 hour scheduled cycle; also manually triggerable from the NewsFeed page via the "YouTuber Perspectives" button)
  • ~~US retail sentiment via StockTwits~~ — disabled (public API restricted; see docs/spec/radar.md)
  • Headline translation via Azure Translator with DB caching
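The refresh cadences above can be collected into a single sketch. The dict name, job keys, and helper below are illustrative only; the intervals come from this document, and the real schedule lives in scheduler/jobs.py:

```python
# Refresh cadences described above, in minutes. Names are illustrative;
# only the intervals are taken from the docs.
JOB_INTERVALS_MIN = {
    "prices": 60,     # hourly price refresh (yfinance / FinanceDataReader)
    "us_news": 15,    # Finnhub, all active US tickers
    "kr_news": 30,    # NAVER Finance scraper
    "youtube": 360,   # YouTube transcripts (also manually triggerable)
}

def cycles_per_day(job: str) -> int:
    """How many times a job runs per 24h at its configured interval."""
    return (24 * 60) // JOB_INTERVALS_MIN[job]
```

For example, the US news job runs 96 times per day at a 15-minute interval.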

Wired but Unused

  • POLYGON_API_KEY is loaded in config.py but no integration is implemented

Known Limitations

| Area | Limitation |
| --- | --- |
| KR news scraper | BeautifulSoup scrape of NAVER Finance — fragile; HTML structure changes can break it silently |
| Azure translation | Free tier: 2M chars/month. If the quota is exceeded, headlines silently revert to the original text; no queuing or retry |
| Data gaps | Missing prices are not forward-filled in the DB — ML feature code handles gaps by defaulting missing sentiment to 0.0 |
| YouTube scraper | Relies on youtube-transcript-api; channels without auto-generated captions yield no transcripts |
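The gap-handling convention from the table can be made concrete with a minimal sketch. The function name and shape are hypothetical; only the default-to-0.0 behavior is from this document:

```python
from typing import Optional

def sentiment_feature(raw: Optional[float]) -> float:
    """Treat a missing sentiment value as neutral (0.0) at feature-build
    time, rather than forward-filling gaps in the DB. Illustrative helper;
    the real logic lives in the ML feature code."""
    return raw if raw is not None else 0.0
```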

Headline Translation

  • scraper/translator.py calls the Azure Translator API in batches
  • title_en / title_ko columns are None until batch_translate() is called
  • FastAPI news routes must handle None translation fields — return null to the frontend and let it fall back to the original title
  • Old provider was DeepL; swapped to Azure Translator (free tier: 2M chars/month vs. DeepL's 500k)
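The null-fallback contract above can be sketched as follows. The helper names and the dict shape are assumptions; the field names mirror the title_en / title_ko columns, and the None-means-untranslated behavior is from this document:

```python
from typing import Optional

def serialize_news_item(title: str,
                        title_en: Optional[str],
                        title_ko: Optional[str]) -> dict:
    """None translation fields pass through unchanged and become JSON null,
    signalling the frontend to fall back to the original title."""
    return {"title": title, "title_en": title_en, "title_ko": title_ko}

def display_title(item: dict, lang: str = "en") -> str:
    """Frontend-side fallback: use the translated title when present."""
    return item.get(f"title_{lang}") or item["title"]
```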

YouTube Transcript Ingestion

  • scraper/youtube.py — fetch_youtube_for_tickers(db_session) fetches recent videos from the channels in YOUTUBE_CHANNELS (config.py), extracts transcript text, stores it as NewsArticle rows, and returns the count of new articles saved
  • The youtube job is triggerable on-demand from the NewsFeed page ("YouTuber Perspectives" button) — calls POST /api/jobs/trigger with {"job": "youtube"} and polls /api/jobs/status for live log output
  • Scoring runs immediately after ingestion via score_unscored_articles()
  • Channel list lives in YOUTUBE_CHANNELS in config.py; see docs/spec/youtube-channels.md for the curated list
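The on-demand trigger flow above can be sketched client-side. The endpoint paths and {"job": "youtube"} payload are from this document; the polling helper and the {"running": ..., "log": ...} status shape are assumptions:

```python
import json

# Payload for POST /api/jobs/trigger, per the docs above.
TRIGGER_PAYLOAD = json.dumps({"job": "youtube"})

def poll_until_done(fetch_status, max_polls: int = 20) -> list:
    """Poll a status callable until the job reports it has finished.

    `fetch_status` stands in for a GET /api/jobs/status call and is
    assumed to return a dict like {"running": bool, "log": [...]}.
    Returns the last log seen.
    """
    log = []
    for _ in range(max_polls):
        status = fetch_status()
        log = status.get("log", log)
        if not status.get("running", False):
            break
    return log
```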

How to Add a New News Source

  1. Create scraper/<source>.py — implement a function that inserts NewsArticle rows
  2. Add a scheduler job in scheduler/jobs.py with an appropriate interval
  3. Call score_unscored_articles() after ingestion in the job function
  4. Add the job to the initial_load() function if it should run on startup
  5. Update docs/spec/data-ingestion.md with the new source's spec
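Steps 1-3 above can be sketched as a skeleton. NewsArticle and score_unscored_articles exist in the real codebase but are stubbed or elided here so the sketch is self-contained; the source name and fetch logic are placeholders:

```python
def score_unscored_articles(db_session) -> None:
    """Stub for the real scorer that runs after every ingestion (step 3)."""

def fetch_example_source(db_session) -> int:
    """Step 1: scrape the source, insert NewsArticle rows, return the count.
    The feed parsing is elided; build NewsArticle(...) objects here."""
    new_rows = []  # placeholder: parsed articles for this source
    for row in new_rows:
        db_session.add(row)
    db_session.commit()
    return len(new_rows)

def example_source_job(db_session) -> None:
    """Steps 2-3: job body registered in scheduler/jobs.py on an interval."""
    if fetch_example_source(db_session):
        score_unscored_articles(db_session)
```

For step 4, the same job function is what initial_load() would call on startup.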