
Data Ingestion — Implementation

Current State

Working

  • Hourly price refresh: US via yfinance, KR via FinanceDataReader
  • US news via Finnhub (15 min cycle, all active US tickers)
  • KR news via NAVER Finance scraper (30 min cycle)
  • YouTube transcripts via scraper/youtube.py (6 hour scheduled cycle; also manually triggerable from the NewsFeed page via the "YouTuber Perspectives" button)
  • ~~US retail sentiment via StockTwits~~ — disabled (public API restricted; see docs/spec/radar.md)
  • Headline translation via Azure Translator with DB caching
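The refresh cadences above can be collected into a single sketch. The dict name, job keys, and helper below are illustrative only; the intervals come from this document, and the real schedule lives in scheduler/jobs.py:

```python
# Refresh cadences described above, in minutes. Names are illustrative;
# only the intervals are taken from the docs.
JOB_INTERVALS_MIN = {
    "prices": 60,     # hourly price refresh (yfinance / FinanceDataReader)
    "us_news": 15,    # Finnhub, all active US tickers
    "kr_news": 30,    # NAVER Finance scraper
    "youtube": 360,   # YouTube transcripts (also manually triggerable)
}

def cycles_per_day(job: str) -> int:
    """How many times a job runs per 24h at its configured interval."""
    return (24 * 60) // JOB_INTERVALS_MIN[job]
```

For example, the US news job runs 96 times per day at a 15-minute interval.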

Wired but Unused

  • POLYGON_API_KEY is loaded in config.py but no integration is implemented

Known Limitations

| Area | Limitation |
| --- | --- |
| KR news scraper | BeautifulSoup scrape of NAVER Finance — fragile; HTML structure changes can break it silently |
| Azure translation | Free tier: 2M chars/month. If the quota is exceeded, headlines silently revert to the original text; no queuing or retry |
| Data gaps | Missing prices are not forward-filled in the DB — ML feature code handles gaps by defaulting missing sentiment to 0.0 |
| YouTube scraper | Relies on youtube-transcript-api; channels without auto-generated captions yield no transcripts |
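The gap-handling convention from the table can be made concrete with a minimal sketch. The function name and shape are hypothetical; only the default-to-0.0 behavior is from this document:

```python
from typing import Optional

def sentiment_feature(raw: Optional[float]) -> float:
    """Treat a missing sentiment value as neutral (0.0) at feature-build
    time, rather than forward-filling gaps in the DB. Illustrative helper;
    the real logic lives in the ML feature code."""
    return raw if raw is not None else 0.0
```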

Headline Translation

  • scraper/translator.py calls the Azure Translator API in batches
  • title_en / title_ko columns are None until batch_translate() is called
  • FastAPI news routes must handle None translation fields — return null to the frontend and let it fall back to the original title
  • Old provider was DeepL; swapped to Azure Translator (free tier: 2M chars/month vs. DeepL's 500k)
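The null-fallback contract above can be sketched as follows. The helper names and the dict shape are assumptions; the field names mirror the title_en / title_ko columns, and the None-means-untranslated behavior is from this document:

```python
from typing import Optional

def serialize_news_item(title: str,
                        title_en: Optional[str],
                        title_ko: Optional[str]) -> dict:
    """None translation fields pass through unchanged and become JSON null,
    signalling the frontend to fall back to the original title."""
    return {"title": title, "title_en": title_en, "title_ko": title_ko}

def display_title(item: dict, lang: str = "en") -> str:
    """Frontend-side fallback: use the translated title when present."""
    return item.get(f"title_{lang}") or item["title"]
```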

YouTube Transcript Ingestion

  • scraper/youtube.py — fetch_youtube_for_tickers(db_session) fetches recent videos from the channels in YOUTUBE_CHANNELS (config.py), extracts transcript text, stores it as NewsArticle rows, and returns the count of new articles saved
  • The youtube job is triggerable on-demand from the NewsFeed page ("YouTuber Perspectives" button) — calls POST /api/jobs/trigger with {"job": "youtube"} and polls /api/jobs/status for live log output
  • Scoring runs immediately after ingestion via score_unscored_articles()
  • Channel list lives in YOUTUBE_CHANNELS in config.py; see docs/spec/youtube-channels.md for the curated list
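The on-demand trigger flow above can be sketched client-side. The endpoint paths and {"job": "youtube"} payload are from this document; the polling helper and the {"running": ..., "log": ...} status shape are assumptions:

```python
import json

# Payload for POST /api/jobs/trigger, per the docs above.
TRIGGER_PAYLOAD = json.dumps({"job": "youtube"})

def poll_until_done(fetch_status, max_polls: int = 20) -> list:
    """Poll a status callable until the job reports it has finished.

    `fetch_status` stands in for a GET /api/jobs/status call and is
    assumed to return a dict like {"running": bool, "log": [...]}.
    Returns the last log seen.
    """
    log = []
    for _ in range(max_polls):
        status = fetch_status()
        log = status.get("log", log)
        if not status.get("running", False):
            break
    return log
```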

How to Add a New News Source

  1. Create scraper/<source>.py — implement a function that inserts NewsArticle rows
  2. Add a scheduler job in scheduler/jobs.py with an appropriate interval
  3. Call score_unscored_articles() after ingestion in the job function
  4. Add the job to the initial_load() function if it should run on startup
  5. Update docs/spec/data-ingestion.md with the new source's spec
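Steps 1-3 above can be sketched as a skeleton. NewsArticle and score_unscored_articles exist in the real codebase but are stubbed or elided here so the sketch is self-contained; the source name and fetch logic are placeholders:

```python
def score_unscored_articles(db_session) -> None:
    """Stub for the real scorer that runs after every ingestion (step 3)."""

def fetch_example_source(db_session) -> int:
    """Step 1: scrape the source, insert NewsArticle rows, return the count.
    The feed parsing is elided; build NewsArticle(...) objects here."""
    new_rows = []  # placeholder: parsed articles for this source
    for row in new_rows:
        db_session.add(row)
    db_session.commit()
    return len(new_rows)

def example_source_job(db_session) -> None:
    """Steps 2-3: job body registered in scheduler/jobs.py on an interval."""
    if fetch_example_source(db_session):
        score_unscored_articles(db_session)
```

For step 4, the same job function is what initial_load() would call on startup.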