# Architecture — Spec
## Data Flow

    Finnhub API ─────────→ scraper/finnhub.py ─────────────────┐
    NAVER Finance ───────→ scraper/naver_finance.py ───────────┼──→ NewsArticle (DB)
    YouTube Channels ────→ scraper/youtube.py ─────────────────┤
    StockTwits API ──────→ scraper/stocktwits.py [DISABLED] ───┘

    yfinance ────────────→ data/stock_fetcher.py ──────────────┐
    FinanceDataReader ───→ data/stock_fetcher.py ──────────────┼──→ StockPrice (DB)

    NewsArticle ────────→ sentiment/analyzer.py ──→ SentimentScore (DB)
         │                                                │
         └──→ Azure Translator                            │
              (optional: translate headlines)             │
                                                          │
    StockPrice + SentimentScore ──────────────────────────┤
                                                          ▼
                              models/predictor.py      ──→ Prediction (DB)
                              models/trend_analyzer.py ──→ TrendingTopic (DB)
                                                          │
                                                          ▼
                              api/main.py (FastAPI :8000) ← reads DB directly
                                │  all routes under /api prefix
                                │  serves React SPA from frontend/dist (production)
                                ↓
                              React frontend (Vite :5173 dev / served from :8000 prod)
                                — TanStack Query, TanStack Table, Recharts, shadcn/ui, lucide-react, date-fns, Tailwind

    M-STOCK screenshots ──→ portfolio/parser.py (vision) ──→ PortfolioSnapshot + PortfolioHolding (DB)
                                                                          ↑
                                             api/routes/portfolio.py ────┴──→ React MyPortfolio page

    Ticker (DB) ────────────────────────────────────────────────────→ all pipeline consumers

All pipeline consumers (scrapers, price fetcher, predictor, scheduler) read the active ticker list from the `Ticker` table via `db/tickers.py` at runtime — not from `config.py`. Adding a ticker in the Manage Tickers UI takes effect on the next pipeline run with no restart.
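
A minimal sketch of the helpers `db/tickers.py` exposes, assuming a `SessionLocal` factory in `db/database.py` and `symbol`/`market`/`active` columns on `Ticker` (the exact column names are assumptions):

```python
# db/tickers.py (sketch) — runtime lookup of the active ticker list.
# Assumes Ticker has `symbol`, `market`, and `active` columns; adjust to the real model.
from db.database import SessionLocal
from db.models import Ticker


def get_active_tickers(market: str | None = None) -> list[str]:
    """Return symbols of active tickers, optionally filtered by market ('US' or 'KR')."""
    with SessionLocal() as session:
        query = session.query(Ticker).filter(Ticker.active.is_(True))
        if market is not None:
            query = query.filter(Ticker.market == market)
        return [t.symbol for t in query.all()]


def get_ticker_map() -> dict[str, str]:
    """Return {symbol: market} for every active ticker."""
    with SessionLocal() as session:
        rows = session.query(Ticker).filter(Ticker.active.is_(True)).all()
        return {t.symbol: t.market for t in rows}
```

Because callers hit the DB on every run, a ticker activated or deactivated in the UI is reflected the next time any job fires.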
## Module Dependencies
| Module | Depends on | Notes |
|---|---|---|
| `config.py` | `.env` via dotenv | No internal deps; holds API keys, thresholds, schedules, and seed ticker data |
| `db/database.py` | `config.py` | SQLAlchemy engine + session factory; seeds `Ticker` table from config on `init_db()` |
| `db/models.py` | `db/database.py` | ORM models; `utcnow()` defined here |
| `db/tickers.py` | `db/models.py` | `get_active_tickers()`, `get_ticker_map()` — single source of truth for live ticker lists |
| `data/stock_fetcher.py` | `db/` | US via yfinance; KR via FinanceDataReader; caller provides ticker list |
| `scraper/finnhub.py` | `db/`, `config.py` | Finnhub REST API; 60 req/min free tier; reads ticker list from DB |
| `scraper/naver_finance.py` | `db/` | BeautifulSoup scrape; reads ticker list from DB |
| `scraper/youtube.py` | `db/`, `config.py` | YouTube transcript fetch; channel list from `YOUTUBE_CHANNELS` in `config.py` |
| `scraper/stocktwits.py` | `db/` | StockTwits public stream — DISABLED (API restricted); code preserved for future rework |
| `scraper/translator.py` | `db/`, `config.py` | Azure Translator batch translation with DB caching |
| `sentiment/analyzer.py` | `db/` | EN: ProsusAI/finbert (VADER fallback); KO: snunlp/KR-FinBert-SC (keyword lexicon fallback); see the analyzer sketch after this table |
| `models/volatility_analyzer.py` | `db/` | Realized vol (5/20/60-day), GARCH(1,1) 1-day forecast, K-means volatility regime |
| `models/predictor.py` | `db/`, `config.py`, `models/volatility_analyzer.py` | RF training + inference + rule-based fallback |
| `models/buffett.py` | `db/`, `models/volatility_analyzer.py` | Value-investing scorecard |
| `models/trend_analyzer.py` | `db/`, `config.py` | Sector/keyword trend aggregation, ticker heat ranking, trending alerts |
| `scheduler/jobs.py` | all modules above | 8 APScheduler BackgroundScheduler jobs |
| `portfolio/parser.py` | `db/models.py`, `config.py`, google-generativeai SDK | Gemini 2.5 Flash vision parser for Mirae Asset screenshots |
| `db/auth.py` | `db/models.py` | Auth helpers: bcrypt verification, `UserSession` token creation/validation |
| `api/main.py` | `api/auth.py`, `api/routes/*`, `db/` | FastAPI app; all routers mounted under `/api` prefix; CORS origins configurable via `CORS_ORIGINS` env var; serves React SPA from `frontend/dist` when present; see the `api/main.py` sketch after this table |
| `api/auth.py` | `db/auth.py`, `db/` | `/auth/login`, `/auth/me`, `/auth/logout` |
| `api/routes/` | domain modules | prices, news, sentiment, predictions, trends, portfolio, tickers, jobs, admin |
| `frontend/src/App.tsx` | React Router, TanStack Query, `api/auth.ts` | SPA entry point; AuthGuard; routes all 9 pages |
| `frontend/src/api/` | `client.ts`, domain modules | Thin fetch wrapper; domain modules call FastAPI endpoints |
| `frontend/src/pages/` | React components | 9 pages |
| `main.py` | `scheduler/`, `api/`, `db/` | Init DB → seed tickers → initial load → start scheduler → launch FastAPI |
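
The fallback behaviour in the `sentiment/analyzer.py` row can be pictured with a short sketch. This is an illustration rather than the project's code: it assumes the Hugging Face `transformers` pipeline API and the `vaderSentiment` package, and the label-to-score mapping is an assumption.

```python
# sentiment/analyzer.py (sketch) — EN scoring: FinBERT first, VADER as the fallback.
from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

try:
    # ProsusAI/finbert classifies text as positive / negative / neutral.
    _finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")
except Exception:  # model download or torch issues — fall back to VADER
    _finbert = None

_vader = SentimentIntensityAnalyzer()


def score_english(text: str) -> float:
    """Return a sentiment score in [-1, 1] for an English headline."""
    if _finbert is not None:
        result = _finbert(text[:512])[0]  # rough guard against very long inputs
        sign = {"positive": 1, "negative": -1, "neutral": 0}[result["label"].lower()]
        return sign * result["score"]
    return _vader.polarity_scores(text)["compound"]  # compound is already in [-1, 1]
```

The Korean path with snunlp/KR-FinBert-SC and the keyword-lexicon fallback follows the same shape.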
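
Similarly, the `api/main.py` row can be sketched as below. The router imports and the `frontend/dist` path resolution are illustrative assumptions; the `/api` prefix, the `CORS_ORIGINS` env var, and the SPA mount come from the table.

```python
# api/main.py (sketch) — /api-prefixed routers, CORS from env, optional SPA serving.
import os
from pathlib import Path

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles

from api import auth
from api.routes import news, prices  # ...and the remaining domain routers

app = FastAPI(title="Stock Analysis API")

# CORS origins come from the CORS_ORIGINS env var (comma-separated list).
origins = [o.strip() for o in os.getenv("CORS_ORIGINS", "http://localhost:5173").split(",")]
app.add_middleware(CORSMiddleware, allow_origins=origins, allow_credentials=True,
                   allow_methods=["*"], allow_headers=["*"])

# Every router is mounted under the /api prefix.
for router in (auth.router, news.router, prices.router):
    app.include_router(router, prefix="/api")

# In production the built React SPA is served from frontend/dist when it exists.
dist = Path(__file__).resolve().parents[1] / "frontend" / "dist"
if dist.is_dir():
    app.mount("/", StaticFiles(directory=str(dist), html=True), name="spa")
```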
## Scheduler Job Interactions
Jobs share the same DB session factory and run independently at fixed intervals. Each job queries Ticker at runtime so newly added tickers are included automatically.
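
A sketch of how these jobs might be registered; the module-level function names are assumptions, while the job IDs and intervals mirror the table below.

```python
# scheduler/jobs.py (sketch) — interval jobs sharing one process and one DB session factory.
from apscheduler.schedulers.background import BackgroundScheduler

from data import stock_fetcher          # hypothetical job entry points
from scraper import finnhub, naver_finance


def start_scheduler() -> BackgroundScheduler:
    scheduler = BackgroundScheduler(timezone="UTC")
    # Each job re-reads the active ticker list from the Ticker table when it fires,
    # so tickers added via the UI are picked up on the next run without a restart.
    scheduler.add_job(finnhub.fetch_us_news, "interval", minutes=15, id="fetch_us_news")
    scheduler.add_job(naver_finance.fetch_kr_news, "interval", minutes=30, id="fetch_kr_news")
    scheduler.add_job(stock_fetcher.update_prices, "interval", hours=1, id="update_prices")
    # ...the remaining jobs follow the same pattern (see the table below).
    scheduler.start()
    return scheduler
```
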
| Job ID | Interval | Writes | Reads |
|---|---|---|---|
| `fetch_us_news` | 15 min | `NewsArticle`, `SentimentScore` | Finnhub API |
| ~~`fetch_stocktwits`~~ | disabled | — | StockTwits public API restricted; job commented out |
| `fetch_kr_news` | 30 min | `NewsArticle`, `SentimentScore` | NAVER HTML |
| `update_prices` | 1 hour | `StockPrice` | yfinance / FDR |
| `compute_trends` | 30 min | `TrendingTopic` | `NewsArticle` |
| `run_predictions` | 6 hours | `Prediction` | `StockPrice`, `SentimentScore` (inference only) |
| `check_retrain` | 6 hours | `rf_*.pkl` | `Prediction`, `PredictionResult` (accuracy-gated) |
| `fetch_youtube` | 6 hours | `NewsArticle`, `SentimentScore` | YouTube transcript API |
| `evaluate_predictions` | 24 hours | `PredictionResult` | `Prediction`, `StockPrice` |
| `fetch_fundamentals` | 24 hours | `FundamentalData` | financialdatasets.ai API |
| `prune_old_data` | 24 hours | deletes from all tables | all tables |
News fetch jobs call `score_unscored_articles()` in a batched while-loop (`batch_size=500`) immediately after each ingestion cycle.
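
That batching might look like the following sketch; it assumes `score_unscored_articles()` returns the number of articles it scored in the batch, which is an assumption about its return value.

```python
# Sketch of the post-ingestion scoring loop used by the news fetch jobs.
from sentiment.analyzer import score_unscored_articles

BATCH_SIZE = 500


def score_new_articles() -> int:
    """Score freshly ingested articles in batches until none remain unscored."""
    total = 0
    while True:
        scored = score_unscored_articles(batch_size=BATCH_SIZE)  # assumed to return a count
        total += scored
        if scored < BATCH_SIZE:  # a partial (or empty) batch means nothing is left
            break
    return total
```
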
## Key Design Decisions
- DB-backed ticker registry — the `Ticker` table is the single source of truth; `config.py` holds seed data only; all runtime consumers read from DB via `db/tickers.py`
- Soft delete on tickers — `active=False` excludes a ticker from the pipeline while preserving its historical data
- `DATABASE_URL` env var selects the backend — production uses Supabase PostgreSQL; falls back to SQLite when unset (local dev only); see the sketch after this list
- One RF model per horizon, shared across all tickers — individual ticker models would have insufficient training samples
- Rule-based fallback — the prediction pipeline works on day one, before any trained model exists
- URL as unique key on `news_articles` — prevents duplicates across repeated scrape cycles
- Ticker + market discriminator pattern — always filter by both `ticker` and `market` in cross-market queries
- Batch sentiment scoring — scoring runs once per prediction cycle rather than inline during ingestion
- No `.KS` suffix — FinanceDataReader accepts bare KRX codes natively
- Translation cache in DB — `title_en` and `title_ko` avoid repeated API calls
- Token-based session auth — `UserSession` rows; FastAPI sets an `httponly; samesite=lax` cookie on login
- Retention-based pruning — `prune_old_data` runs daily; retention windows are tuned so ML always has sufficient history
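
A sketch of the `DATABASE_URL` selection described above; the SQLite filename and the engine options are assumptions.

```python
# db/database.py (sketch) — DATABASE_URL picks the backend; SQLite is the local-dev fallback.
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Production sets DATABASE_URL to the Supabase PostgreSQL DSN; unset means local SQLite.
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./stock_analysis.db")

engine = create_engine(
    DATABASE_URL,
    # SQLite needs this flag because scheduler jobs and API requests touch the DB
    # from different threads; PostgreSQL does not.
    connect_args={"check_same_thread": False} if DATABASE_URL.startswith("sqlite") else {},
    pool_pre_ping=True,
)
SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False)
```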