
Architecture — Spec

Data Flow

Finnhub API ────────→ scraper/finnhub.py ───────────────┐
NAVER Finance ──────→ scraper/naver_finance.py ─────────┼──→ NewsArticle (DB)
YouTube Channels ───→ scraper/youtube.py ───────────────┤      │
StockTwits API ─────→ scraper/stocktwits.py [DISABLED] ─┘      │
                                                               │
yfinance ───────────→ data/stock_fetcher.py ──────────┐        ▼
FinanceDataReader ──→ data/stock_fetcher.py ──────────┼──→ StockPrice (DB)
                                                      │         │
                                                      │         ▼
                                   NewsArticle → sentiment/analyzer.py → SentimentScore (DB)
                                       ↓                                          │
                                Azure Translator ←───────────────────────────────┤
                                (optional: translate headlines)                  │
                                                  StockPrice + SentimentScore ──┤
                                                                                ▼
                                                               models/predictor.py → Prediction (DB)
                                                               models/trend_analyzer.py → TrendingTopic (DB)
                                                                                │
                                                                                ▼
                          api/main.py (FastAPI :8000) ← reads DB directly
                            │  all routes under /api prefix
                            │  serves React SPA from frontend/dist (production)
                                    ↓
                          React frontend (Vite :5173 dev / served from :8000 prod)
                          — TanStack Query, TanStack Table, Recharts, shadcn/ui, lucide-react, date-fns, Tailwind

M-STOCK screenshots → portfolio/parser.py (vision) → PortfolioSnapshot + PortfolioHolding (DB)
                                                                                ↑
                                                               api/routes/portfolio.py → React MyPortfolio page

Ticker (DB) ──────────────────────────────────────────────────────────────────────→ all pipeline consumers

All pipeline consumers (scrapers, price fetcher, predictor, scheduler) read the active ticker list from the Ticker table via db/tickers.py at runtime — not from config.py. Adding a ticker in the Manage Tickers UI takes effect on the next pipeline run with no restart.

Module Dependencies

| Module | Depends on | Notes |
|---|---|---|
| config.py | .env via dotenv | No internal deps; holds API keys, thresholds, schedules, and seed ticker data |
| db/database.py | config.py | SQLAlchemy engine + session factory; seeds Ticker table from config on init_db() |
| db/models.py | db/database.py | ORM models; utcnow() defined here |
| db/tickers.py | db/models.py | get_active_tickers(), get_ticker_map() — single source of truth for live ticker lists |
| data/stock_fetcher.py | db/ | US via yfinance; KR via FinanceDataReader; caller provides ticker list |
| scraper/finnhub.py | db/, config.py | Finnhub REST API; 60 req/min free tier; reads ticker list from DB |
| scraper/naver_finance.py | db/ | BeautifulSoup scrape; reads ticker list from DB |
| scraper/youtube.py | db/, config.py | YouTube transcript fetch; channel list from YOUTUBE_CHANNELS in config.py |
| scraper/stocktwits.py | db/ | StockTwits public stream — DISABLED (API restricted); code preserved for future rework |
| scraper/translator.py | db/, config.py | Azure Translator batch translation with DB caching |
| sentiment/analyzer.py | db/ | EN: ProsusAI/finbert (VADER fallback); KO: snunlp/KR-FinBert-SC (keyword lexicon fallback) |
| models/volatility_analyzer.py | db/ | Realized vol (5/20/60-day), GARCH(1,1) 1-day forecast, K-means volatility regime |
| models/predictor.py | db/, config.py, models/volatility_analyzer.py | RF training + inference + rule-based fallback |
| models/buffett.py | db/, models/volatility_analyzer.py | Value-investing scorecard |
| models/trend_analyzer.py | db/, config.py | Sector/keyword trend aggregation, ticker heat ranking, trending alerts |
| scheduler/jobs.py | all modules above | APScheduler BackgroundScheduler jobs (see Scheduler Job Interactions) |
| portfolio/parser.py | db/models.py, config.py, google-generativeai SDK | Gemini 2.5 Flash vision parser for Mirae Asset screenshots |
| db/auth.py | db/models.py | Auth helpers: bcrypt verification, UserSession token creation/validation |
| api/main.py | api/auth.py, api/routes/*, db/ | FastAPI app; all routers mounted under /api prefix; CORS origins configurable via CORS_ORIGINS env var; serves React SPA from frontend/dist when present |
| api/auth.py | db/auth.py, db/ | /auth/login, /auth/me, /auth/logout |
| api/routes/ | domain modules | prices, news, sentiment, predictions, trends, portfolio, tickers, jobs, admin |
| frontend/src/App.tsx | React Router, TanStack Query, api/auth.ts | SPA entry point; AuthGuard; routes all 9 pages |
| frontend/src/api/ | client.ts, domain modules | Thin fetch wrapper; domain modules call FastAPI endpoints |
| frontend/src/pages/ | React components | 9 pages |
| main.py | scheduler/, api/, db/ | Init DB → seed tickers → initial load → start scheduler → launch FastAPI |
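The main.py startup order can be sketched as a plain sequence; the function names below are illustrative stand-ins for the real entry points in db/, scheduler/, and api/:

```python
from typing import Callable

def startup(init_db: Callable[[], None],
            initial_load: Callable[[], None],
            start_scheduler: Callable[[], None],
            launch_api: Callable[[], None]) -> None:
    """Boot order from main.py: each step assumes the previous one finished."""
    init_db()          # create tables, seed Ticker from config.py
    initial_load()     # one-off fetch so the dashboard isn't empty on first run
    start_scheduler()  # APScheduler BackgroundScheduler (non-blocking)
    launch_api()       # FastAPI on :8000 (blocks the main thread)
```

The scheduler must start before the (blocking) API server, and the DB must exist before either touches it.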

Scheduler Job Interactions

Jobs share the same DB session factory and run independently at fixed intervals. Each job queries Ticker at runtime so newly added tickers are included automatically.

| Job ID | Interval | Writes | Reads |
|---|---|---|---|
| fetch_us_news | 15 min | NewsArticle, SentimentScore | Finnhub API |
| ~~fetch_stocktwits~~ | disabled | — | StockTwits public API restricted; job commented out |
| fetch_kr_news | 30 min | NewsArticle, SentimentScore | NAVER HTML |
| update_prices | 1 hour | StockPrice | yfinance / FDR |
| compute_trends | 30 min | TrendingTopic | NewsArticle |
| run_predictions | 6 hours | Prediction | StockPrice, SentimentScore (inference only) |
| check_retrain | 6 hours | rf_*.pkl | Prediction, PredictionResult (accuracy-gated) |
| fetch_youtube | 6 hours | NewsArticle, SentimentScore | YouTube transcript API |
| evaluate_predictions | 24 hours | PredictionResult | Prediction, StockPrice |
| fetch_fundamentals | 24 hours | FundamentalData | financialdatasets.ai API |
| prune_old_data | 24 hours | deletes from all tables | all tables |
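The wiring in scheduler/jobs.py can be sketched data-first. The intervals mirror the table above; the registration helper and the funcs mapping are hypothetical, but the add_job(func, "interval", minutes=..., id=...) shape matches APScheduler's real API:

```python
# Job id -> interval in minutes, taken from the table above.
JOBS = {
    "fetch_us_news": 15,
    "fetch_kr_news": 30,
    "update_prices": 60,
    "compute_trends": 30,
    "run_predictions": 360,
    "check_retrain": 360,
    "fetch_youtube": 360,
    "evaluate_predictions": 1440,
    "fetch_fundamentals": 1440,
    "prune_old_data": 1440,
}

def register_jobs(scheduler, funcs):
    """Attach each job callable at its fixed interval.

    `scheduler` is anything with APScheduler's add_job signature;
    `funcs` maps job id -> the real fetch/compute callable.
    """
    for job_id, minutes in JOBS.items():
        scheduler.add_job(funcs[job_id], "interval", minutes=minutes, id=job_id)
```

Keeping the schedule in one dict makes the interval table above trivially auditable against the code.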

News fetch jobs call score_unscored_articles() in a batched while-loop (batch_size=500) immediately after each ingestion cycle.
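The loop's shape, sketched with injected callables (fetch_batch and score are stand-ins; the real function queries NewsArticle rows lacking a SentimentScore and persists scores per batch):

```python
BATCH_SIZE = 500  # batch_size used by the news fetch jobs

def score_unscored_articles(fetch_batch, score, batch_size=BATCH_SIZE):
    """Drain all unscored articles in fixed-size batches.

    fetch_batch(n) returns up to n unscored articles (empty when drained);
    score(batch) writes one SentimentScore per article. Returns the total
    number of articles scored.
    """
    total = 0
    while True:
        batch = fetch_batch(batch_size)
        if not batch:
            break
        score(batch)
        total += len(batch)
    return total
```

Bounding each batch at 500 keeps peak memory flat regardless of how many articles an ingestion cycle produced.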

Key Design Decisions

  • DB-backed ticker registry — the Ticker table is the single source of truth; config.py holds seed data only; all runtime consumers read from DB via db/tickers.py
  • Soft delete on tickers — active=False excludes a ticker from the pipeline while preserving its historical data
  • DATABASE_URL env var selects the backend — production uses Supabase PostgreSQL; falls back to SQLite when unset (local dev only)
  • One RF model per horizon, shared across all tickers — individual ticker models would have insufficient training samples
  • Rule-based fallback — prediction pipeline works on day one before any trained model exists
  • URL as unique key on news_articles — prevents duplicates across repeated scrape cycles
  • Ticker + market discriminator pattern — always filter by both ticker and market in cross-market queries
  • Batch sentiment scoring — articles are scored in batches immediately after each ingestion cycle rather than inline per article
  • No .KS suffix — FinanceDataReader accepts bare KRX codes natively
  • Translation cache in DB — title_en and title_ko columns avoid repeated API calls
  • Token-based session auth — UserSession rows; FastAPI sets an httponly, samesite=lax cookie on login
  • Retention-based pruning — prune_old_data runs daily; retention windows are tuned so the ML models always have sufficient history
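The DATABASE_URL switch can be sketched as follows; make_engine and the default path are hypothetical names, but the env-var fallback and the SQLite connect_args concern are the real design point:

```python
import os

from sqlalchemy import create_engine

def make_engine(default_sqlite_path="app.db"):
    """Supabase PostgreSQL when DATABASE_URL is set; SQLite otherwise."""
    url = os.getenv("DATABASE_URL", f"sqlite:///{default_sqlite_path}")
    if url.startswith("sqlite"):
        # SQLite connections are shared across FastAPI workers and
        # scheduler threads, so disable the same-thread check.
        return create_engine(url, connect_args={"check_same_thread": False})
    # Postgres: pre-ping recycles connections dropped by the pooler.
    return create_engine(url, pool_pre_ping=True)
```

Because the selection happens at engine-creation time, local dev needs no configuration at all: unset DATABASE_URL and everything lands in a local SQLite file.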