Data Ingestion — Spec
Prices
- US: yfinance, hourly refresh, 5-day lookback window per cycle
- KR: FinanceDataReader, same schedule and window
- Missing prices are not forward-filled in the DB — store only what was fetched; the ML layer handles gaps
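The store-only-what-was-fetched rule could look roughly like the sketch below. It assumes a hypothetical prices table keyed on (ticker, ts) and bars already parsed into (ticker, timestamp, close) tuples; the table, column, and function names are illustrative and do not come from this spec. Bars with a missing close are skipped rather than filled, and re-fetching the overlapping 5-day window overwrites existing rows instead of duplicating them.

```python
import math
import sqlite3
from datetime import datetime, timedelta, timezone

LOOKBACK_DAYS = 5  # from the spec: 5-day lookback window per cycle


def lookback_start(now: datetime) -> datetime:
    """Start of the fetch window for one refresh cycle."""
    return now - timedelta(days=LOOKBACK_DAYS)


def store_bars(conn, bars):
    """Store fetched bars as-is; skip missing closes instead of forward-filling.

    `bars` is an iterable of (ticker, timezone-aware datetime, close) tuples.
    Returns the number of rows written.
    """
    rows = [
        (ticker, ts.astimezone(timezone.utc).isoformat(), close)
        for ticker, ts, close in bars
        if close is not None and not math.isnan(close)
    ]
    # OR REPLACE makes the overlapping lookback window idempotent across cycles.
    conn.executemany(
        "INSERT OR REPLACE INTO prices (ticker, ts, close) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (ticker TEXT, ts TEXT, close REAL, PRIMARY KEY (ticker, ts))"
)
```

Gaps stay visible as absent rows, which leaves the choice of imputation strategy to the ML layer.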
US News
- Source: Finnhub REST API
- Schedule: every 15 minutes, all active US tickers in one cycle
- Rate limit: 60 requests/minute (free tier) — well within limits for a single cycle over the full ticker set
- Articles stored as NewsArticle rows; duplicates prevented by unique constraint on url
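One way to keep a cycle under the 60 requests/minute limit regardless of how many tickers are active is to space calls at a fixed minimum interval. The sketch below assumes one request per ticker; MinIntervalLimiter, run_cycle, and the fetch_news callable are hypothetical names, not part of the existing scraper or of the Finnhub client.

```python
import time


class MinIntervalLimiter:
    """Spaces calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call made yet

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay


def run_cycle(tickers, fetch_news, limiter):
    """Call fetch_news (hypothetical) once per ticker, paced by the limiter."""
    results = {}
    for ticker in tickers:
        limiter.wait()
        results[ticker] = fetch_news(ticker)
    return results
```

At one request per second, a cycle over N tickers takes roughly N seconds, so under this pacing the 15-minute schedule would accommodate up to about 900 tickers.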
KR News
- Source: NAVER Finance HTML scrape (BeautifulSoup)
- Schedule: every 30 minutes, all active KR tickers
StockTwits ⚠️ Disabled
- Status: Disabled. The public unauthenticated API (api.stocktwits.com/api/2/streams/symbol/) is unreliable and has been progressively restricted
- Scraper code preserved in scraper/stocktwits.py; scheduler job commented out in scheduler/jobs.py
- See docs/spec/radar.md for rework options
YouTube Transcripts
- Source: YouTube channel transcripts for tracked channels (see docs/spec/youtube-channels.md)
- Schedule: every 6 hours
- Extracts ticker mentions from transcripts; stored as NewsArticle rows
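The spec does not say how ticker mentions are extracted. As a rough illustration, one simple approach is alias matching: a lookup from lowercase company names and symbols to tickers, matched on word boundaries with an optional leading $. The extract_mentions function and the alias map are assumptions made for this sketch.

```python
import re
from collections import Counter


def extract_mentions(transcript, aliases):
    """Count mentions per ticker in a transcript.

    `aliases` maps a lowercase name or symbol to its ticker, e.g.
    {"apple": "AAPL", "aapl": "AAPL"}. Matching is case-insensitive and
    requires non-word characters (or the text edge) on both sides, so
    "pineapple" does not count as "apple".
    """
    counts = Counter()
    for alias, ticker in aliases.items():
        pattern = r"(?<!\w)\$?" + re.escape(alias) + r"(?!\w)"
        hits = re.findall(pattern, transcript, flags=re.IGNORECASE)
        counts[ticker] += len(hits)
    # Drop tickers that were never mentioned.
    return {ticker: n for ticker, n in counts.items() if n > 0}
```

Short symbols that are also ordinary words (for example ALL or IT) would produce false positives under this scheme, so a real alias list would need to exclude them or match them case-sensitively.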
Headline Translation
- Provider: Azure Cognitive Services Translator (optional)
- Free tier: 2M characters/month
- Translations cached in title_en / title_ko columns to avoid repeated API calls
- Batch mode: one API call per page load for all missing translations
- Graceful fallback: if the API call fails, headlines fall back to the original text without raising an error
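The cache, batch, and fallback rules above can be combined in one helper, sketched here under some assumptions: articles are plain dicts with title, language, and title_en / title_ko keys, and translate_batch is a hypothetical callable that wraps the translation provider and returns one string per input. Neither name comes from the actual codebase.

```python
def translate_missing(articles, translate_batch, target="en"):
    """Return display titles in `target`, translating uncached ones in one call.

    Results are cached on each article under title_<target>. If the
    provider call fails, nothing is cached and original titles are shown.
    """
    column = f"title_{target}"
    missing = [
        a for a in articles
        if not a.get(column) and a["language"] != target
    ]
    if missing:
        try:
            # One batched call for every headline missing a translation.
            translated = translate_batch([a["title"] for a in missing], target)
        except Exception:
            translated = []  # graceful fallback: keep originals
        if len(translated) == len(missing):
            for article, text in zip(missing, translated):
                article[column] = text  # cache for later page loads
    return [a.get(column) or a["title"] for a in articles]
```

A second call over the same articles makes no provider request, because every translated title is already cached on the article.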
Constraints
- All timestamps stored as UTC
- url is the deduplication key on news_articles; it prevents duplicates across repeated scrape cycles
- The language field ("en" or "ko") is set by the scraper and used downstream by sentiment and translation
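The three constraints can be illustrated together with SQLite. This is a minimal sketch: only the news_articles table name and the url and language fields come from this spec, while the remaining columns and the save_article helper are assumptions.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema; only news_articles, url and language are from the spec.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news_articles ("
    " id INTEGER PRIMARY KEY,"
    " url TEXT NOT NULL UNIQUE,"
    " title TEXT NOT NULL,"
    " language TEXT NOT NULL,"
    " published_at TEXT NOT NULL)"
)


def save_article(conn, url, title, language, published_at):
    """Insert one article; return True if stored, False if url already existed."""
    if language not in ("en", "ko"):
        raise ValueError(f"unsupported language: {language!r}")
    if published_at.tzinfo is None:
        raise ValueError("published_at must be timezone-aware")
    # Normalise to UTC before storing.
    utc_iso = published_at.astimezone(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT OR IGNORE INTO news_articles (url, title, language, published_at)"
        " VALUES (?, ?, ?, ?)",
        (url, title, language, utc_iso),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the unique constraint does the deduplication, a scraper can re-submit everything it sees on each cycle without first checking what is already stored.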