# Data Ingestion — Implementation

## Current State

### Working
- Hourly price refresh: US via yfinance, KR via FinanceDataReader
- US news via Finnhub (15 min cycle, all active US tickers)
- KR news via NAVER Finance scraper (30 min cycle)
- YouTube transcripts via `scraper/youtube.py` (6 hour scheduled cycle; also manually triggerable from the NewsFeed page via the "YouTuber Perspectives" button)
- ~~US retail sentiment via StockTwits~~ — disabled (public API restricted; see `docs/spec/radar.md`)
- Headline translation via Azure Translator with DB caching
### Wired but Unused

- `POLYGON_API_KEY` is loaded in `config.py`, but no integration is implemented
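A minimal sketch of how a future Polygon integration could guard on the unused key. No such scraper exists yet; the `polygon_enabled()` helper is hypothetical, and the `os.getenv` line only mirrors how `config.py` presumably loads the key:

```python
import os

# Assumption: config.py reads the key from the environment like this.
POLYGON_API_KEY = os.getenv("POLYGON_API_KEY", "")

def polygon_enabled() -> bool:
    """Hypothetical guard so a future Polygon scraper can no-op when the key is absent."""
    return bool(POLYGON_API_KEY)
```

A guard like this would let the scheduler register the job unconditionally while the job itself skips work until the key is configured.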
## Known Limitations
| Area | Limitation |
|---|---|
| KR news scraper | BeautifulSoup scrape of NAVER Finance — fragile; HTML structure changes can break it silently |
| Azure translation | Free tier: 2M chars/month. If quota exceeded, headlines silently revert to original. No queuing or retry |
| Data gaps | Missing prices are not forward-filled in the DB — ML feature code handles gaps by defaulting missing sentiment to 0.0 |
| YouTube scraper | Relies on youtube-transcript-api; channels without auto-generated captions yield no transcripts |
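The data-gap convention in the table above can be sketched as follows. The helper name and the list-comprehension usage are illustrative, not the actual ML feature code; the only behavior taken from this doc is that a missing sentiment defaults to 0.0 (neutral) rather than being imputed or forward-filled:

```python
from typing import Optional

def sentiment_feature(score: Optional[float]) -> float:
    """Return the article sentiment, treating a missing score as neutral (0.0)."""
    return score if score is not None else 0.0

# A gap in the DB (None) becomes a neutral feature value.
features = [sentiment_feature(s) for s in [0.8, None, -0.3]]
# features == [0.8, 0.0, -0.3]
```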
## Headline Translation
- `scraper/translator.py` calls the Azure Translator API in batches
- `title_en`/`title_ko` columns are `None` until `batch_translate()` is called
- FastAPI news routes must handle `None` translation fields — return `null` to the frontend and let it fall back to the original title
- Old provider was DeepL; swapped to Azure Translator (free tier: 2M chars/month vs. DeepL's 500k)
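A small sketch of the `None`-handling convention for news responses. The `NewsItem` shape here is hypothetical (the real route presumably serializes an ORM row); the point is only that untranslated columns serialize to JSON `null`, which the frontend treats as "fall back to `title`":

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class NewsItem:
    title: str                      # original headline, always present
    title_en: Optional[str] = None  # None until batch_translate() runs
    title_ko: Optional[str] = None

item = NewsItem(title="원본 헤드라인")
# None fields become JSON null; the frontend falls back to `title`.
body = json.dumps(asdict(item), ensure_ascii=False)
```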
## YouTube Transcript Ingestion
- `scraper/youtube.py` → `fetch_youtube_for_tickers(db_session)` — fetches recent videos from channels in `YOUTUBE_CHANNELS` (`config.py`), extracts transcript text, stores them as `NewsArticle` rows, and returns the count of new articles saved
- The `youtube` job is triggerable on-demand from the NewsFeed page ("YouTuber Perspectives" button) — calls `POST /api/jobs/trigger` with `{"job": "youtube"}` and polls `/api/jobs/status` for live log output
- Scoring runs immediately after ingestion via `score_unscored_articles()`
- Channel list lives in `YOUTUBE_CHANNELS` in `config.py`; see `docs/spec/youtube-channels.md` for the curated list
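The transcript-extraction step can be sketched like this, assuming the classic `youtube-transcript-api` interface where `get_transcript(video_id)` returns a list of `{"text", "start", "duration"}` segments. The `join_transcript` helper is hypothetical, not necessarily what `scraper/youtube.py` does:

```python
# In scraper/youtube.py the segments would come from something like:
#   from youtube_transcript_api import YouTubeTranscriptApi
#   segments = YouTubeTranscriptApi.get_transcript(video_id)
def join_transcript(segments: list[dict]) -> str:
    """Concatenate caption segments into one article body string."""
    return " ".join(seg["text"].strip() for seg in segments if seg.get("text"))

text = join_transcript([
    {"text": "Nvidia earnings", "start": 0.0, "duration": 2.1},
    {"text": "beat estimates", "start": 2.1, "duration": 1.8},
])
# text == "Nvidia earnings beat estimates"
```

Note the limitation from the table above: channels without auto-generated captions raise errors or yield empty segment lists, so the caller must tolerate videos with no transcript.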
## How to Add a New News Source
- Create `scraper/<source>.py` — implement a function that inserts `NewsArticle` rows
- Add a scheduler job in `scheduler/jobs.py` with an appropriate interval
- Call `score_unscored_articles()` after ingestion in the job function
- Add the job to `initial_load()` if the source should run on startup
- Update `docs/spec/data-ingestion.md` with the new source's spec
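The steps above can be sketched as a skeleton. Everything here except `score_unscored_articles()` (named in this doc) is hypothetical — the source name, the function names, and the scheduler registration comment are placeholders, not actual project code:

```python
# scraper/example_source.py (hypothetical new source)
def fetch_example_source(db_session) -> int:
    """Fetch headlines, insert NewsArticle rows, return count of new rows."""
    new_count = 0
    # for item in <source API or scrape>:
    #     db_session.add(NewsArticle(...))
    #     new_count += 1
    # db_session.commit()
    return new_count

# scheduler/jobs.py — register at an appropriate interval, e.g. with APScheduler:
#   scheduler.add_job(run_example_job, "interval", minutes=30)
def run_example_job(db_session, score_unscored_articles) -> None:
    """Ingest, then score immediately, matching the existing jobs."""
    saved = fetch_example_source(db_session)
    if saved:
        score_unscored_articles()
```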