Data Ingestion — Spec

Prices

US: yfinance, hourly refresh, 5-day lookback window per cycle
KR: FinanceDataReader, same schedule and window
Missing prices are not forward-filled in the DB — store only what was fetched; the ML layer handles gaps
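The window and the no-forward-fill rule above can be sketched as two small helpers. This is a minimal illustration, not the project's fetcher: the function names and the bar dictionary shape (`ts`, `close`) are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK_DAYS = 5  # from the spec: 5-day lookback window per cycle


def lookback_window(now: datetime, days: int = LOOKBACK_DAYS) -> tuple[datetime, datetime]:
    """Return the (start, end) of the fetch window in UTC.

    `now` is expected to be timezone-aware.
    """
    end = now.astimezone(timezone.utc)
    return end - timedelta(days=days), end


def rows_to_store(bars: list[dict]) -> list[dict]:
    """Keep only bars that were actually fetched; gaps are never filled."""
    # Hypothetical bar shape: {"ts": ..., "close": float or None}
    return [bar for bar in bars if bar.get("close") is not None]
```

Because each hourly cycle re-reads the last five days, a bar that was missing in one cycle can still be picked up by a later one without any filling logic in the DB layer.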

News (Finnhub)

Source: Finnhub REST API
Schedule: every 15 minutes, all active US tickers in one cycle
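Rate limit: 60 requests/minute (free tier); assuming one request per ticker, a cycle stays within the limit as long as calls are paced to at most 60 per minute, which allows up to 900 tickers per 15-minute cycle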
Articles stored as NewsArticle rows; duplicates prevented by unique constraint on url
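The rate-limit arithmetic can be checked with a short sketch. It assumes one Finnhub request per ticker per cycle, which the spec does not state explicitly; `minute_batches` and `cycle_minutes` are hypothetical helper names.

```python
from typing import Iterator, Sequence

REQUESTS_PER_MINUTE = 60  # Finnhub free-tier limit from the spec


def minute_batches(tickers: Sequence[str],
                   per_minute: int = REQUESTS_PER_MINUTE) -> Iterator[list[str]]:
    """Yield chunks of tickers, each small enough to send within one minute."""
    for i in range(0, len(tickers), per_minute):
        yield list(tickers[i:i + per_minute])


def cycle_minutes(n_tickers: int, per_minute: int = REQUESTS_PER_MINUTE) -> int:
    """Minutes needed for one cycle at one request per ticker (ceiling division)."""
    return -(-n_tickers // per_minute)
```

A scheduler could send one batch, wait for the remainder of the minute, then send the next. As long as `cycle_minutes` stays below 15, a cycle finishes before the next one starts.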

Reddit

Sources: r/wallstreetbets and r/stocks via Reddit's public JSON API (/r/{sub}/new.json)
Schedule: every hour, both subreddits per cycle
No authentication required — uses a custom User-Agent header only
Ticker extraction: word-boundary regex matching against all active US tickers (≥3-char symbols only to avoid false positives on 1–2 char tickers like A, IT)
A post mentioning multiple tickers creates one NewsArticle per (post, ticker) with URL = {post_url}#{TICKER} to satisfy the unique constraint
Stored with source="reddit/{subreddit}", market="US", language="en"
Configurable via REDDIT_SUBREDDITS (list) and REDDIT_POSTS_PER_SUB (int) in config.py
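The ticker extraction and the one-row-per-(post, ticker) rule might look like the following sketch. The function names and the returned dictionary shape are assumptions for illustration, and matching is assumed to be case-sensitive on uppercase symbols, which the spec does not specify.

```python
import re
from typing import Optional

MIN_TICKER_LEN = 3  # spec: skip 1-2 char symbols such as A or IT


def build_ticker_pattern(active_tickers: list[str]) -> Optional[re.Pattern]:
    """Compile one word-boundary regex covering every eligible ticker."""
    symbols = sorted({t for t in active_tickers if len(t) >= MIN_TICKER_LEN})
    if not symbols:
        return None
    return re.compile(r"\b(" + "|".join(re.escape(s) for s in symbols) + r")\b")


def articles_for_post(post_url: str, title: str, body: str,
                      pattern: Optional[re.Pattern], subreddit: str) -> list[dict]:
    """Return one article dict per distinct ticker mentioned in the post."""
    if pattern is None:
        return []
    tickers = sorted(set(pattern.findall(f"{title}\n{body}")))
    return [
        {
            "ticker": ticker,
            "title": title,
            # The fragment keeps each (post, ticker) URL unique.
            "url": f"{post_url}#{ticker}",
            "source": f"reddit/{subreddit}",
            "market": "US",
            "language": "en",
        }
        for ticker in tickers
    ]
```

Since `$` is not a word character, the word boundary also matches cashtag mentions such as `$NVDA`.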

Hacker News

Source: HN Algolia search API (https://hn.algolia.com/api/v1/search) — no authentication required
Schedule: every 6 hours
Coverage: 14 US tech tickers (AAPL, MSFT, GOOGL, NVDA, META, AMZN, TSLA, AMD, INTC, NFLX, ORCL, CRM, ADBE, QCOM) — hard-coded in scraper/hackernews.py as HN_TICKERS; these are the tickers most likely to generate signal-bearing HN discussion
Each search fetches the last 24 hours of stories (numericFilters=created_at_i>{ts}, hitsPerPage=20)
Stored with source="HackerNews", market="US", language="en", summary=None (HN stories are links, not summaries)
URL format: https://news.ycombinator.com/item?id={objectID} — unique per story, used as dedup key
0.2 s inter-ticker delay to be polite to the Algolia API
Articles feed into the standard score_unscored_articles() sentiment pipeline after each run
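As a rough sketch of the request and the row mapping described above: the helper names are hypothetical, and the `query` and `tags=story` parameters are assumptions about how the scraper restricts results to stories. The endpoint, `numericFilters`, and `hitsPerPage` come from the spec.

```python
import time
from typing import Optional
from urllib.parse import urlencode

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"
LOOKBACK_SECONDS = 24 * 60 * 60  # last 24 hours of stories
HITS_PER_PAGE = 20


def build_search_url(ticker: str, now_ts: Optional[float] = None) -> str:
    """Build the Algolia search URL for one ticker."""
    now_ts = time.time() if now_ts is None else now_ts
    since = int(now_ts) - LOOKBACK_SECONDS
    params = {
        "query": ticker,
        "tags": "story",  # assumption: limit hits to stories
        "numericFilters": f"created_at_i>{since}",
        "hitsPerPage": HITS_PER_PAGE,
    }
    return f"{HN_SEARCH_URL}?{urlencode(params)}"


def hit_to_article(hit: dict, ticker: str) -> dict:
    """Map one Algolia hit to the fields stored on a NewsArticle row."""
    return {
        "ticker": ticker,
        "title": hit.get("title"),
        "url": f"https://news.ycombinator.com/item?id={hit['objectID']}",
        "source": "HackerNews",
        "market": "US",
        "language": "en",
        "summary": None,  # HN stories are links, not summaries
    }
```

A caller would loop over `HN_TICKERS`, fetch each URL, and sleep 0.2 s between tickers.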

StockTwits

Status: Disabled — public unauthenticated API (api.stocktwits.com/api/2/streams/symbol/) is unreliable and has been progressively restricted
Scraper code preserved in scraper/stocktwits.py; scheduler job commented out in scheduler/jobs.py
See docs/spec/radar.md for rework options

YouTube

Source: YouTube channel transcripts for tracked channels (see docs/spec/youtube-channels.md)
Schedule: every 6 hours
Extracts ticker mentions from transcripts; stored as NewsArticle rows

Translation

Provider: Azure Cognitive Services Translator (optional)
Free tier: 2M characters/month
Translations cached in title_en / title_ko columns to avoid repeated API calls
Batch mode: one API call per page load for all missing translations
Graceful fallback: if API fails, headlines revert to original without crashing
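The caching, batching, and fallback behaviour can be illustrated with a provider-agnostic sketch. `translate_batch` stands in for whatever wraps the Azure Translator call; it, the function names, and the row shape are assumptions for the example.

```python
from typing import Callable


def fill_missing_translations(rows: list[dict], target_column: str,
                              translate_batch: Callable[[list[str]], list[str]]) -> int:
    """Translate titles whose cached column is empty, using one batch call.

    Returns the number of rows updated. On any provider failure the rows
    are left untouched so the caller can fall back to the original title.
    """
    missing = [row for row in rows if not row.get(target_column)]
    if not missing:
        return 0  # everything is cached; no API call is made
    try:
        translated = translate_batch([row["title"] for row in missing])
    except Exception:
        return 0  # graceful fallback: keep original headlines
    if len(translated) != len(missing):
        return 0  # unexpected response shape; treat as a failure
    for row, text in zip(missing, translated):
        row[target_column] = text  # cache, e.g. in title_en or title_ko
    return len(missing)


def display_title(row: dict, target_column: str) -> str:
    """Prefer the cached translation, else show the original headline."""
    return row.get(target_column) or row["title"]
```

Sending only the uncached titles in a single call keeps character usage against the monthly free-tier quota proportional to new headlines, not to page views.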

Fundamentals (US)

Source: financialdatasets.ai REST API
Schedule: every 24 hours
Data: quarterly income statements (revenue, net income, EPS) and balance sheets (total debt, total equity); trailing P/E via yfinance
Scraper: scraper/financialdatasets.py → upserts Fundamentals rows with market="US"

Fundamentals (KR)

Source: Korea FSS (Financial Supervisory Service) DART (Data Analysis, Retrieval and Transfer System) public API at opendart.fss.or.kr
Schedule: every 24 hours
Data: annual consolidated financial statements — revenue (매출액), net income (당기순이익), total debt (부채총계), total equity (자본총계); EPS and P/E not available directly from DART and are stored as None
Scraper: scraper/dart_fetcher.py → upserts Fundamentals rows with market="KR"
Requires DART_API_KEY in .env; job is a no-op when key is absent
Corp code mapping (KRX 6-digit code → 8-digit DART corp_code) is downloaded as a ZIP/XML on first run and cached locally for 30 days at data/dart_cache/CORPCODE.xml
Fetches the most recent full fiscal year; falls back to the prior year if the latest report has not yet been filed
Amounts are stored in KRW millions (the unit DART uses); growth ratios computed by build_features() are dimensionless and unit-agnostic
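Two of the rules above, the 30-day corp code cache and the prior-year fallback, can be sketched as follows. This is an illustration only: `fetch_year` is a hypothetical callable standing in for the real DART request, and the function names are not taken from scraper/dart_fetcher.py.

```python
from pathlib import Path
from typing import Callable, Optional

CACHE_PATH = Path("data/dart_cache/CORPCODE.xml")
CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # cached locally for 30 days


def cache_is_fresh(path: Path, now: float) -> bool:
    """True if the cached corp code file exists and is younger than the TTL."""
    return path.exists() and (now - path.stat().st_mtime) < CACHE_TTL_SECONDS


def fetch_latest_annual(corp_code: str, latest_year: int,
                        fetch_year: Callable[[str, int], Optional[dict]]) -> Optional[dict]:
    """Try the most recent fiscal year, then fall back to the prior year."""
    for year in (latest_year, latest_year - 1):
        report = fetch_year(corp_code, year)
        if report:  # None or empty means the report is not filed yet
            return report
    return None
```

A caller would check `cache_is_fresh(CACHE_PATH, time.time())` before deciding whether to download the ZIP again, and would skip the whole job when `DART_API_KEY` is absent.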

Conventions

All timestamps stored as UTC
url is the deduplication key on news_articles — prevents duplicates across repeated scrape cycles
language field ("en" or "ko") is set by the scraper and used downstream by sentiment and translation
Fundamentals rows are keyed on (ticker, period_end); both US and KR rows share the same table
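The keying rules above map naturally onto database constraints. The sketch below uses an in-memory SQLite database purely for illustration. The `news_articles` table name and the `url`, `language`, and (ticker, period_end) keys come from the spec; the `fundamentals` table name, the remaining columns, and the helper names are assumptions.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE news_articles (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,          -- deduplication key
    title TEXT NOT NULL,
    language TEXT NOT NULL CHECK (language IN ('en', 'ko')),
    published_at TEXT NOT NULL         -- ISO 8601, always UTC
);
CREATE TABLE fundamentals (
    ticker TEXT NOT NULL,
    period_end TEXT NOT NULL,
    market TEXT NOT NULL,              -- 'US' or 'KR' rows share the table
    revenue REAL,
    PRIMARY KEY (ticker, period_end)
);
""")


def insert_article(url: str, title: str, language: str, published_at: datetime) -> bool:
    """Insert an article; return False if the URL was already stored."""
    ts = published_at.astimezone(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT OR IGNORE INTO news_articles (url, title, language, published_at) "
        "VALUES (?, ?, ?, ?)",
        (url, title, language, ts),
    )
    return cur.rowcount == 1


def upsert_fundamentals(ticker: str, period_end: str, market: str, revenue: float) -> None:
    """Insert a row, or update it if (ticker, period_end) already exists."""
    conn.execute(
        "INSERT INTO fundamentals (ticker, period_end, market, revenue) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT (ticker, period_end) DO UPDATE SET "
        "market = excluded.market, revenue = excluded.revenue",
        (ticker, period_end, market, revenue),
    )
```

Letting the database enforce uniqueness means repeated scrape cycles can re-submit the same articles and periods without any pre-check in the scrapers. The `ON CONFLICT ... DO UPDATE` form requires SQLite 3.24 or later.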