Data Ingestion — Spec

Prices

  • US: yfinance, hourly refresh, 5-day lookback window per cycle
  • KR: FinanceDataReader, same schedule and window
  • Missing prices are not forward-filled in the DB — store only what was fetched; the ML layer handles gaps
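
Because each hourly cycle re-fetches a 5-day window, consecutive cycles overlap and the write has to be idempotent. The sketch below shows one way to do that while storing only fetched rows. It assumes a SQLite table named prices keyed by (ticker, ts); the table, its columns, and the helper names are hypothetical and not taken from the project schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(days=5)  # 5-day lookback window per cycle (from the spec)


def lookback_start(now: datetime) -> datetime:
    """Start of the fetch window for a cycle running at `now`."""
    return now - LOOKBACK


def create_prices_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per (ticker, timestamp) that was actually fetched.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "ticker TEXT NOT NULL, ts TEXT NOT NULL, close REAL NOT NULL, "
        "PRIMARY KEY (ticker, ts))"
    )


def store_prices(conn: sqlite3.Connection, ticker: str, rows) -> None:
    """Write fetched (ts_iso, close) pairs.

    Rows already present from an earlier, overlapping cycle are replaced.
    Timestamps with no fetched row are simply left absent: no forward fill.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO prices (ticker, ts, close) VALUES (?, ?, ?)",
        [(ticker, ts, close) for ts, close in rows],
    )
    conn.commit()
```

Running store_prices twice with overlapping rows leaves one row per timestamp, and a bar the source never returned stays missing for the ML layer to handle.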

US News

  • Source: Finnhub REST API
  • Schedule: every 15 minutes, all active US tickers in one cycle
  • Rate limit: 60 requests/minute (free tier), i.e. up to 900 requests per 15-minute cycle; requests must be paced, since an unpaced burst over more than 60 tickers would exceed the per-minute limit
  • Articles stored as NewsArticle rows; duplicates prevented by unique constraint on url
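
The rate-limit arithmetic above can be sketched as follows, assuming one request per ticker per cycle (an assumption; the spec does not state the request count). The helper names are hypothetical, and the sleep and clock parameters exist only so the pacing can be exercised without real waiting.

```python
import time

RATE_LIMIT_PER_MIN = 60  # Finnhub free tier (from the spec)
CYCLE_MINUTES = 15       # news cycle interval (from the spec)


def max_tickers_per_cycle(requests_per_ticker: int = 1) -> int:
    """Upper bound on tickers one cycle can cover without exceeding the limit."""
    return RATE_LIMIT_PER_MIN * CYCLE_MINUTES // requests_per_ticker


def paced(items, min_interval=60.0 / RATE_LIMIT_PER_MIN,
          sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

A fetch loop would then read `for ticker in paced(active_us_tickers): ...`, where active_us_tickers is whatever list the scheduler supplies.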

KR News

  • Source: NAVER Finance HTML scrape (BeautifulSoup)
  • Schedule: every 30 minutes, all active KR tickers

StockTwits ⚠️ Disabled

  • Status: Disabled — public unauthenticated API (api.stocktwits.com/api/2/streams/symbol/) is unreliable and has been progressively restricted
  • Scraper code preserved in scraper/stocktwits.py; scheduler job commented out in scheduler/jobs.py
  • See docs/spec/radar.md for rework options

YouTube Transcripts

  • Source: YouTube channel transcripts for tracked channels (see docs/spec/youtube-channels.md)
  • Schedule: every 6 hours
  • Extracts ticker mentions from transcripts; stored as NewsArticle rows
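
The spec does not say how mentions are extracted. As an illustration only, the sketch below counts mentions by looking up each transcript word in an alias table that maps lowercase names and symbols to tickers. The function name and the alias table are hypothetical; the real extractor may work differently.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z0-9]+")


def count_ticker_mentions(transcript: str, aliases: dict[str, str]) -> Counter:
    """Count mentions per ticker.

    `aliases` maps a lowercase single-word alias (symbol or company name)
    to its ticker. Multi-word company names are not handled in this sketch.
    """
    counts: Counter = Counter()
    for word in WORD_RE.findall(transcript):
        ticker = aliases.get(word.lower())
        if ticker is not None:
            counts[ticker] += 1
    return counts
```

Matching case-insensitively matters here because auto-generated captions often drop capitalization, so a symbol-only uppercase match would miss most mentions.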

Headline Translation

  • Provider: Azure Cognitive Services Translator (optional)
  • Free tier: 2M characters/month
  • Translations cached in title_en / title_ko columns to avoid repeated API calls
  • Batch mode: one API call per page load for all missing translations
  • Graceful fallback: if the API call fails, headlines are shown in their original language and the page does not crash
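
The cache, batch, and fallback rules above fit together roughly as in the sketch below. It assumes articles are dicts with a title key plus a cache column such as title_en, and that translate_batch is a caller-supplied function wrapping the translation provider. All of these names are hypothetical, and no real Azure call is shown.

```python
from typing import Callable


def fill_translations(
    articles: list[dict],
    target_col: str,
    translate_batch: Callable[[list[str]], list[str]],
) -> None:
    """Translate every headline whose cache column is empty, in one call."""
    missing = [a for a in articles if not a.get(target_col)]
    if not missing:
        return  # everything is cached: no API call at all
    try:
        # One batched call per page load covers all missing translations.
        translated = translate_batch([a["title"] for a in missing])
    except Exception:
        return  # graceful fallback: leave the cache empty, do not crash
    for article, text in zip(missing, translated):
        article[target_col] = text  # cached for later page loads


def display_title(article: dict, target_col: str) -> str:
    """Cached translation if present, otherwise the original headline."""
    return article.get(target_col) or article["title"]
```

Only headlines with an empty cache column reach the provider, which is what keeps character usage inside the monthly free tier.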

Constraints

  • All timestamps stored as UTC
  • url is the deduplication key on news_articles — prevents duplicates across repeated scrape cycles
  • language field ("en" or "ko") is set by the scraper and used downstream by sentiment and translation
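
A minimal sketch of how the three constraints could be enforced at insert time, using SQLite for illustration. Only the news_articles table name and the url and language fields come from the spec; the other columns and the helper names are assumptions.

```python
import sqlite3
from datetime import datetime, timezone


def create_news_table(conn: sqlite3.Connection) -> None:
    # UNIQUE on url is the deduplication key across scrape cycles.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news_articles ("
        "url TEXT NOT NULL UNIQUE, title TEXT NOT NULL, "
        "published_at TEXT NOT NULL, language TEXT NOT NULL)"
    )


def to_utc_iso(dt: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC ISO-8601 string."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach the source timezone first")
    return dt.astimezone(timezone.utc).isoformat()


def insert_article(conn, url, title, published_at, language) -> bool:
    """Insert one article; return False if the url was already stored."""
    if language not in ("en", "ko"):
        raise ValueError(f"unsupported language: {language!r}")
    cur = conn.execute(
        "INSERT OR IGNORE INTO news_articles "
        "(url, title, published_at, language) VALUES (?, ?, ?, ?)",
        (url, title, to_utc_iso(published_at), language),
    )
    conn.commit()
    return cur.rowcount == 1
```

Rejecting naive datetimes forces each scraper to state its source timezone (for example KST for NAVER) before conversion, so a local time is never silently stored as UTC.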