Building InterCaribbean: A Caribbean News & Stock Market Aggregator

What is InterCaribbean?

InterCaribbean is a web application I built that aggregates financial, political, and energy news from across the Caribbean, alongside live stock market data from six Caribbean stock exchanges. It is live at intercaribbeanapps.com.

The core problem I was trying to solve: there is no single place to go for Caribbean financial news and stock data. If you want to follow the TTSE (Trinidad & Tobago Stock Exchange), check the JSE (Jamaica), see what is happening with energy in T&T, and read regional business news all at once, you are bouncing between a dozen different websites. I wanted to fix that.


The Stack

Layer              Technology
Backend API        Python, FastAPI
Database           SQLite (WAL mode)
Scraping           requests, BeautifulSoup, feedparser, Playwright
Scheduling         APScheduler
Monitoring         Sentry
Frontend           Vanilla JavaScript (SPA)
Containerisation   Docker
Deployment         Railway

I deliberately kept the stack lean. No ORMs, no message queues, no Redis, just a FastAPI app, a SQLite file, and a set of scraper modules. The architecture tool I ran on the project flagged the file sizes as something to watch, and I agree: as the project grows, the scraper files in particular will likely need to be split out. But for the current scope, keeping everything flat and readable was the right call.


Architecture Overview

The project is four Python modules, each with a clear responsibility:

main.py      — FastAPI app, routing, caching, scheduler setup
scraper.py   — News scraping (RSS + HTML), relevance filtering, classification
stocks.py    — Stock market scraping (6 exchanges), sentiment analysis
database.py  — SQLite schema, upserts, queries
static/      — SPA frontend (HTML, CSS, JS)

The News Scraper

14 Sources Across 8 Territories

The scraper pulls from 14 news outlets covering Trinidad & Tobago, Jamaica, Barbados, Guyana, St. Lucia, St. Vincent, and the broader Caribbean region. Each source is defined as a configuration object:

{
    "id": "newsday",
    "name": "Newsday TT",
    "color": "#1a3a6e",
    "country": "TT",
    "scrape_type": "rss",
    "rss_urls": [
        "https://newsday.co.tt/category/business/feed/",
        "https://newsday.co.tt/feed/",
    ],
}

Two scrape strategies exist: RSS (for most sources) and custom HTML (for sources that do not offer RSS or whose feeds are unreliable).

Relevance Filtering

Regional news sources cover everything from crime to sports to lifestyle. I only want finance, energy, and politics. Before any article is written to the database, it must pass a relevance check: a compiled regex pattern that tests the article title and description against 60+ domain-specific terms.

_RELEVANCE_TERMS = [
    "economy", "economic", "finance", "financial", "budget", "banking",
    "ttse", "ttsec", "gdp", "inflation", "interest rate",
    "petrotrin", "bptt", "ngc", "lng", "natural gas", "crude oil",
    "prime minister", "president", "parliament", "cabinet",
    "caricom", "oecs", "caribbean",
    # ... and ~50 more
]

Articles that do not match are discarded at scrape time and never touch the database.
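A simplified sketch of how such a filter could be compiled, not the production code. It assumes case-insensitive, word-boundary matching (neither is stated above), uses a shortened term list, and the names `build_relevance_pattern` and `is_relevant` are hypothetical.

```python
import re

# Shortened, illustrative term list; the real list has 60+ entries.
RELEVANCE_TERMS = ["economy", "budget", "ttse", "natural gas", "parliament", "caricom"]


def build_relevance_pattern(terms):
    """Compile one case-insensitive alternation wrapped in word boundaries."""
    # Longest terms first so multi-word phrases are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)


RELEVANCE_RE = build_relevance_pattern(RELEVANCE_TERMS)


def is_relevant(title, description=""):
    """Return True if the title or description mentions any relevance term."""
    return RELEVANCE_RE.search(f"{title} {description}") is not None
```

Compiling the pattern once at import time means each article costs a single regex search rather than one search per term.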

Category Classification

Articles that pass the relevance filter are classified into one of four categories: business, energy, politics, or general. The classifier uses a two-tier scoring system:

  • Anchor terms (highly specific): phrases like "ministry of finance", "monetary policy", "heritage petroleum" — each anchor match scores 2 points.
  • Common terms (broad): words like "economy", "election", "oil" — each common match scores 1 point.

The category with the highest score above a threshold of 2 wins. Below the threshold, the article falls into general. This was more accurate than a pure keyword count because it weights specificity.
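The scoring can be sketched roughly as follows. This is an illustration under assumptions, not the real classifier: the term lists are tiny placeholders, the threshold is treated as inclusive (a score of exactly 2 wins), and ties go to whichever category is listed first.

```python
import re

# Placeholder term lists for illustration only; the real lists are longer.
CATEGORY_TERMS = {
    "business": {"anchor": ["ministry of finance", "monetary policy"], "common": ["economy"]},
    "energy": {"anchor": ["heritage petroleum"], "common": ["oil"]},
    "politics": {"anchor": ["house of representatives"], "common": ["election", "parliament"]},
}
ANCHOR_POINTS = 2
COMMON_POINTS = 1
THRESHOLD = 2  # assumption: inclusive, so a score of 2 is enough


def _count(term, text):
    """Count whole-word occurrences of a term in already-lowercased text."""
    return len(re.findall(rf"\b{re.escape(term)}\b", text))


def classify(title, description=""):
    """Return the best-scoring category, or 'general' below the threshold."""
    text = f"{title} {description}".lower()
    scores = {}
    for category, terms in CATEGORY_TERMS.items():
        score = sum(ANCHOR_POINTS * _count(t, text) for t in terms["anchor"])
        score += sum(COMMON_POINTS * _count(t, text) for t in terms["common"])
        scores[category] = score
    best = max(scores, key=scores.get)  # first category wins a tie
    return best if scores[best] >= THRESHOLD else "general"
```

With these placeholder lists, a headline mentioning only "oil" scores 1 for energy and stays in general, while a single anchor phrase is enough to classify an article on its own.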

Custom HTML Scrapers

Two sources required custom HTML scrapers:

Trinidad Guardian uses the Blox CMS with versioned URL paths like /business-6.3.0.41be6ee2cc. These paths change with every CMS upgrade, so the scraper maintains a list of known versioned paths with stable fallbacks (/business/, /news/, /). It tries each in order, stops at the first that yields articles, and logs a structured warning if it falls through to the homepage — that signals a CMS version change.
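The fallback logic can be sketched like this. It is a hedged illustration: `scrape_first_working_path` and the injected `fetch_articles` callable are hypothetical names, and the path list simply reuses the examples from the paragraph above.

```python
import logging

logger = logging.getLogger("scraper.guardian")

# Versioned path first, then the stable fallbacks, homepage last.
CANDIDATE_PATHS = ("/business-6.3.0.41be6ee2cc", "/business/", "/news/", "/")


def scrape_first_working_path(fetch_articles, paths=CANDIDATE_PATHS):
    """Try each path in order; return (path, articles) for the first that yields any."""
    for path in paths:
        try:
            articles = fetch_articles(path)
        except Exception as exc:  # a dead path should not abort the whole run
            logger.info("guardian path failed: %s (%s)", path, exc)
            continue
        if articles:
            if path == "/":
                # Reaching the homepage suggests the versioned paths have changed.
                logger.warning("guardian_fallback_to_homepage", extra={"tried": list(paths)})
            return path, articles
    return None, []
```

Passing the fetch function in as an argument keeps the ordering logic testable without any network access.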

Stabroek News (Guyana) embeds article dates in URL paths (/YYYY/MM/DD/slug), so rather than fetching meta tags for dates, the scraper extracts the date directly from the URL with a regex match — faster and more reliable than parsing a date string that may not be present.
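A minimal sketch of the URL date extraction, assuming the /YYYY/MM/DD/ segment appears exactly once in the path. The function name `date_from_url` and the example URL in the usage note are hypothetical.

```python
import re
from datetime import date

# Matches a /YYYY/MM/DD/ segment anywhere in the URL path.
_URL_DATE_RE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def date_from_url(url):
    """Return the date embedded in the URL, or None if absent or invalid."""
    match = _URL_DATE_RE.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 40
        return None
```

For example, `date_from_url("https://example.com/2024/03/15/some-slug/")` gives `date(2024, 3, 15)`, with no extra HTTP request or HTML parsing involved.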

Parallel Fetching

All 14 sources are scraped in parallel using ThreadPoolExecutor (up to 6 at a time with the pool size shown here), with a timeout of 180 seconds across the whole run. Within a source, article pages that need og:image extraction are also fetched in parallel (up to 6 workers). A slow or hung source cannot block the rest from completing.

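from concurrent.futures import ThreadPoolExecutor, as_completed

total = 0
with ThreadPoolExecutor(max_workers=6) as pool:
    futures = {pool.submit(_safe_scrape, src): src for src in SOURCES}
    # as_completed raises TimeoutError once the whole run exceeds 180 seconds
    for future in as_completed(futures, timeout=180):
        total += future.result()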

The 180-second timeout was something I arrived at through observation, not calculation. On early runs I noticed that a single source could hang indefinitely when a site was down, and because I was running sources sequentially at the time, it would hold up everything. Moving to parallel execution with a hard timeout was a significant improvement.
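One caveat with this pattern: leaving a `with ThreadPoolExecutor(...)` block waits for running workers to finish, so a hung thread can still delay the exit after the timeout fires. The sketch below shows one way to handle that, under assumptions: `scrape_all` and its parameters are hypothetical names, and `cancel_futures` needs Python 3.9 or later. Threads that are already running cannot be killed; they are simply no longer waited on.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def scrape_all(sources, scrape_one, max_workers=6, timeout=180):
    """Run scrape_one(source) in parallel; return (total, unfinished_sources)."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(scrape_one, src): src for src in sources}
    total = 0
    finished = set()
    try:
        for future in as_completed(futures, timeout=timeout):
            finished.add(future)
            try:
                total += future.result()
            except Exception:
                pass  # a failed source contributes nothing to the total
    except FuturesTimeout:
        pass  # deadline hit; whatever did not finish is reported below
    finally:
        # Do not block on stragglers; work that has not started is cancelled.
        pool.shutdown(wait=False, cancel_futures=True)
    unfinished = [src for fut, src in futures.items() if fut not in finished]
    return total, unfinished
```

Returning the unfinished sources makes it easy to log which site was slow on a given run.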


The Stock Market Scraper

Six Exchanges

The app tracks equities, preference shares, bonds, and funds across six Caribbean exchanges:

Exchange   Country/Region        Method
TTSE       Trinidad & Tobago     Playwright (Chromium)
JSE        Jamaica               requests + BeautifulSoup
ECSE       Eastern Caribbean     requests + BeautifulSoup
BSE        Barbados              requests + BeautifulSoup
BISX       Bahamas               requests + BeautifulSoup
BSX        Bermuda               requests + BeautifulSoup

Five of the six exchanges can be scraped with a simple HTTP request and HTML parsing. The TTSE is the exception.

The TTSE Problem

The Trinidad & Tobago Stock Exchange website (stockex.co.tt) is protected by the Sucuri WAF, which uses a JavaScript challenge to block non-browser clients. A plain requests.get() returns a challenge page, not the market data. The solution was Playwright, which drives a headless Chromium browser that executes JavaScript and navigates the page just like a real user.

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, args=["--no-sandbox", ...])
    page = browser.new_context(user_agent="Chrome/124.0.0.0 ...").new_page()
    page.goto("https://www.stockex.co.tt/market-quote/", wait_until="networkidle")
    page.wait_for_timeout(3_000)
    for table in page.query_selector_all("table"):
        # parse price tables

Playwright is also used for TTSE per-stock historical price data and dividend history, visiting each of the 35 listed securities’ detail pages individually.

Subprocess Isolation

Running Chromium inside the same Python process as FastAPI creates resource contention: Chrome is CPU and memory hungry, and a full TTSE scrape (35 detail pages) takes several minutes. Having that compete with incoming HTTP requests was a problem.

The solution was to run Playwright scrapes as child processes rather than threads within the FastAPI process. When the scheduler fires the stock scrape job, it calls:

result = subprocess.run(
    [sys.executable, "-c", _SCRAPE_SCRIPT, "stocks"],
    env=os.environ.copy(),
    capture_output=False,
)

The parent FastAPI process blocks until the child exits, then busts the relevant caches. Chrome’s memory lives and dies entirely in the child process. FastAPI never sees the overhead.

This was a problem that took me a while to diagnose. Early on the API would become slow and unresponsive during stock scrape runs. I initially assumed it was database contention, then suspected the scheduler itself. It was only when I started monitoring process memory that I realised Chrome was the culprit. The subprocess approach was the cleanest fix I could find that did not require a full job queue like Celery.
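A self-contained sketch of the same idea with a timeout and an exit-code check added. Both additions are assumptions on my part rather than something the snippet above does, and `CHILD_SCRIPT` is a hypothetical stand-in for the real scrape script.

```python
import subprocess
import sys

# Hypothetical stand-in for _SCRAPE_SCRIPT: reads the job name from argv.
CHILD_SCRIPT = "import sys; print('scraping', sys.argv[1])"


def run_scrape_job(job, script=CHILD_SCRIPT, timeout=600):
    """Run one scrape job in a child interpreter; return True on a clean exit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", script, job],
            timeout=timeout,  # the child is killed if it runs past the deadline
            check=False,      # inspect the return code instead of raising
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

The parent only needs the boolean to decide whether to invalidate caches; everything the child allocated is released when it exits.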

Sentiment Analysis

Each stock gets a sentiment score derived from articles in the database that mention the company. The system:

  1. Strips legal suffixes from company names (Ltd, Inc, Plc) to build a meaningful search phrase.
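  2. Runs a broad SQL LIKE query as a cheap first pass to get candidate articles from the past 60 days. (A substring LIKE with a leading wildcard cannot use an ordinary index, so this step is a coarse filter rather than an index lookup.)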
  3. Applies a Python-side word-boundary regex to eliminate substring false positives — for example, the ticker AML should not match articles about “anti-AML policy.”
  4. Counts positive and negative keyword matches across the candidate articles.
  5. Produces a score from -1.0 (bearish) to +1.0 (bullish) with a neutral band between -0.2 and +0.2.

Tickers under 4 characters are never used as standalone search terms: they are too short and cause too many coincidental matches. Only the company name phrase is used for short tickers.
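The steps above can be sketched roughly as follows, skipping the SQL pre-filter. Treat it as an illustration under assumptions: the keyword sets are placeholders, the score formula (pos - neg) / (pos + neg) is my guess at one reasonable normalisation, the neutral band is treated as inclusive, and all function names are hypothetical.

```python
import re

_SUFFIX_RE = re.compile(r"[\s,]+(?:limited|ltd|inc|plc)\.?$", re.IGNORECASE)
POSITIVE = {"profit", "growth", "record", "upgrade"}    # placeholder keywords
NEGATIVE = {"loss", "decline", "lawsuit", "downgrade"}  # placeholder keywords


def search_phrase(company_name):
    """Strip one trailing legal suffix, e.g. 'Example Holdings Ltd' -> 'Example Holdings'."""
    return _SUFFIX_RE.sub("", company_name.strip())


def mentions(phrase, text):
    """Whole-word match, so the phrase does not match inside a longer word."""
    return re.search(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE) is not None


def sentiment(phrase, articles):
    """Return (score, label) with score in [-1.0, 1.0]."""
    pos = neg = 0
    for text in articles:
        if not mentions(phrase, text):
            continue  # drop candidates that only matched as a substring
        words = re.findall(r"[a-z]+", text.lower())
        pos += sum(w in POSITIVE for w in words)
        neg += sum(w in NEGATIVE for w in words)
    score = 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)
    label = "bullish" if score > 0.2 else "bearish" if score < -0.2 else "neutral"
    return score, label
```

A stock with no matching articles, or no keyword hits, lands at 0.0 and is reported as neutral rather than raising a division error.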


The Database

SQLite with WAL Mode

I chose SQLite over PostgreSQL because this application has one writer (the scraper) and many readers (API requests). SQLite’s WAL (Write-Ahead Logging) mode lets readers and the writer proceed at the same time without blocking each other, which is exactly the access pattern here.

The database is tuned at connection time:

conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA cache_size=-8000")    # 8 MB page cache
conn.execute("PRAGMA temp_store=MEMORY")   # temp tables in RAM
conn.execute("PRAGMA mmap_size=67108864")  # 64 MB memory-mapped I/O
conn.execute("PRAGMA synchronous=NORMAL")  # safe with WAL, avoids fsync on every write

For a single-server deployment this performs well under load. The whole database is one file, which also makes backups trivial.

Schema

Five tables:

  • articles — scraped news articles with title, URL, description, source, category, image, and published date. URL is the unique key; duplicate articles from re-scrapes are dropped with INSERT OR IGNORE.
  • stocks — current price, change, volume, sentiment, and dividend summary per ticker/exchange.
  • stock_history — historical price records per ticker/exchange, deduplicated by a 30-minute write window to prevent duplicate entries on server restarts.
  • dividends — dividend events per ticker/exchange/ex-date. Unique on (ticker, exchange, ex_date) so re-scraping is idempotent.
  • scrape_log — health record of every scrape run per source, used by the admin status endpoint.
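To illustrate the idempotent writes, here is a cut-down sketch against an in-memory database. The column lists are trimmed, the helper names are hypothetical, and using INSERT OR IGNORE for dividends is an assumption (the real code may upsert instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (
        id    INTEGER PRIMARY KEY,
        url   TEXT NOT NULL UNIQUE,   -- URL is the dedup key
        title TEXT NOT NULL
    );
    CREATE TABLE dividends (
        ticker   TEXT NOT NULL,
        exchange TEXT NOT NULL,
        ex_date  TEXT NOT NULL,
        amount   REAL,
        UNIQUE (ticker, exchange, ex_date)
    );
    """
)


def save_article(conn, url, title):
    """Insert an article; return True only if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)", (url, title)
    )
    conn.commit()
    return cur.rowcount == 1


def save_dividend(conn, ticker, exchange, ex_date, amount):
    """Insert a dividend event; a repeat of the same key is silently skipped."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO dividends (ticker, exchange, ex_date, amount) "
        "VALUES (?, ?, ?, ?)",
        (ticker, exchange, ex_date, amount),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the uniqueness rule lives in the schema, a scrape that restarts halfway can replay every row without creating duplicates.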

Schema Migrations

The database uses a migration pattern based on ALTER TABLE ... ADD COLUMN with duplicate column error suppression:

for table, col, defn in _migrations:
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} {defn}")
        conn.commit()
    except sqlite3.OperationalError as e:
        if "duplicate column" not in str(e).lower():
            sentry_sdk.capture_exception(e)

It is not a migration framework, but it is simple and works reliably for additive changes. Destructive changes (dropping or renaming columns) would need a separate approach.
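An alternative to matching on the error message is to ask SQLite which columns already exist before altering the table. This is a sketch of that variant, not what the app currently does; the function names are hypothetical, and the table and column names must come from trusted constants because they are interpolated into the SQL.

```python
import sqlite3


def column_exists(conn, table, column):
    """Check PRAGMA table_info, which returns one row per column (name at index 1)."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


def add_column_if_missing(conn, table, column, definition):
    """Additive migration step; return True if the column was added."""
    if column_exists(conn, table, column):
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
    conn.commit()
    return True
```

This avoids depending on the exact wording of the "duplicate column" error, at the cost of one extra query per migration entry at startup.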


The API

FastAPI exposes a set of JSON endpoints consumed by the SPA:

Endpoint                    Purpose                                                          Cache TTL
GET /api/bootstrap          Sources + first page of TT articles + stats in one round-trip   5 min
GET /api/articles           Filtered, paginated article list                                5 min
GET /api/stocks             Stock list with filtering                                       30 min
GET /api/stocks/detail      Single stock with history + related news                        30 min
GET /api/stocks/dividends   Dividend history (independent short TTL)                        5 min
GET /api/stats              Database stats                                                  5 min

Caching uses cachetools.TTLCache with a threading lock, which keeps hot endpoints fast without a Redis dependency. The bootstrap endpoint was specifically designed to load the initial page in a single HTTP round-trip: sources, articles, and stats together, avoiding three separate requests on first load.
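To show the pattern without the third-party dependency, here is a standard-library stand-in for a lock-guarded TTL cache. It is a sketch, not cachetools itself: `SimpleTTLCache` and its methods are hypothetical, and unlike `cachetools.TTLCache` it has no size limit.

```python
import threading
import time


class SimpleTTLCache:
    """Tiny illustrative TTL cache guarded by a lock."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._items = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return a fresh cached value, or call compute() and store its result."""
        now = self._clock()
        with self._lock:
            hit = self._items.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
        # Computed outside the lock so a slow query does not block other readers.
        value = compute()
        with self._lock:
            self._items[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key=None):
        """Drop one key, or everything when no key is given."""
        with self._lock:
            if key is None:
                self._items.clear()
            else:
                self._items.pop(key, None)
```

An `invalidate()` call after a scrape finishes is the kind of cache busting the scheduler would need, so readers see new data before the TTL would have expired. Computing outside the lock means two concurrent misses may both run the query; that trade-off is acceptable when the query is idempotent.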

GZip middleware compresses all text and JSON responses above 500 bytes. Static files get a Cache-Control: public, max-age=3600 header via a custom ASGI middleware layer (not BaseHTTPMiddleware, which has a known buffering conflict with GZipMiddleware).
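A pure ASGI middleware for the header can be quite small. The sketch below is an illustration under assumptions: the class name `StaticCacheControl` and the `/static/` path prefix are hypothetical, and it only wraps `send` to add the header on the response start message.

```python
class StaticCacheControl:
    """Pure ASGI middleware: sets Cache-Control on responses under a path prefix."""

    def __init__(self, app, prefix="/static/", value="public, max-age=3600"):
        self.app = app
        self.prefix = prefix
        self.header = (b"cache-control", value.encode("latin-1"))

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request for a static path.
        if scope["type"] != "http" or not scope["path"].startswith(self.prefix):
            await self.app(scope, receive, send)
            return

        async def send_with_header(message):
            if message["type"] == "http.response.start":
                # Replace any existing Cache-Control header with ours.
                headers = [
                    h for h in message.get("headers", [])
                    if h[0].lower() != b"cache-control"
                ]
                headers.append(self.header)
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_header)
```

Because it never touches the response body, a middleware like this streams messages straight through and leaves compression to whatever layer sits around it.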


The Frontend

The frontend is a single-page application written in vanilla JavaScript, no framework. The HTML has a tab bar for each country plus a stocks tab, a filter/search bar, an article grid, and a stock detail modal.

A few implementation decisions worth noting:

  • Chart.js is loaded on demand. The stocks tab is not always the first thing a user opens, so the Chart.js library (~200KB) is only injected into the DOM the first time the Stocks tab is clicked. This keeps the initial page load fast.
  • The bootstrap endpoint means the home screen renders after a single API call instead of three sequential ones.
  • The stock modal shows price history as a Chart.js line chart, dividends as a table, news sentiment with a visual score bar, and related news articles — all fetched in one detail endpoint call, with dividends fetched separately to keep their TTL independent.

Observability

Sentry is integrated at three levels:

  • FastAPI: unhandled exceptions in route handlers become Sentry events.
  • APScheduler: job failures are captured with job name and trigger metadata.
  • Logging: logger.warning() calls become breadcrumbs; logger.error() calls become full Sentry issues automatically via LoggingIntegration.

The scheduled jobs also use Sentry Cron Monitors, which detect when a job does not fire on schedule — useful for catching cases where the scheduler starts but a job silently never runs.

An admin status endpoint (/api/admin/status) shows database row counts per table, the last scrape run per source, and sample rows for spot-checking, all protected by a secret key.


Deployment

The Docker image is built from Microsoft’s official Playwright Python image, which ships with Chromium and all system dependencies pre-installed. Without this, getting Playwright’s Chromium to run in a container requires installing a long list of shared libraries manually.

FROM mcr.microsoft.com/playwright/python:v1.49.0-noble
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && playwright install chromium
COPY . .
CMD uvicorn main:app --host 0.0.0.0 --port ${PORT}

The app runs on Railway, which injects $PORT as an environment variable. The scheduler timezone is configurable via TZ — set to America/Port_of_Spain on the server so the stock scrape fires at 6:00 PM Trinidad time on weekdays, shortly after the TTSE market closes.


What I Would Do Differently

Split the scraper files earlier. stocks.py is over 1,000 lines with six exchange scrapers in one file. Each exchange scraper should probably live in its own module, with a shared base of utility functions. I kept it in one file while I was learning the patterns, but it is now large enough that navigating it is cumbersome.

Use a proper migration tool. The ALTER TABLE approach works for adding columns, but it is fragile for anything more complex. SQLite supports ATTACH, CREATE TABLE ... AS SELECT, and full migrations via tools like Alembic — I should wire that up before the schema needs any destructive changes.

PostgreSQL for serious growth. SQLite WAL mode handles this workload well at current scale, but it does have a ceiling: a single write lock, no concurrent writes, and a file that lives on the application server. If traffic grows significantly or I move to a multi-instance deployment, migrating to PostgreSQL would be the natural next step. The SQLAlchemy-compatible query patterns I am using now would make that transition manageable.


Closing Thoughts

This project forced me to think about problems I had not encountered before: scraping JavaScript-rendered pages inside a web server, designing a classifier without training data, keeping a background scrape from degrading API response times, and writing idempotent data pipelines that can restart safely at any point.

The biggest lesson for me was that building something end-to-end on your own, even something small, exposes gaps that study alone does not. I had read about things like WAL mode, process isolation, and TTL caches, but actually implementing them in a context where the failure is visible taught me far more than any course did. The site going blank because the TTSE scraper was eating all the server's memory at 6 PM was a memorable lesson.

The application is live at intercaribbeanapps.com. All the scraping, categorisation, sentiment scoring, and market data happens automatically in the background. I will be writing follow-up posts on specific components, the TTSE Playwright scraper and the sentiment classifier in particular, as those had the most interesting problems to solve.

This post is licensed under CC BY 4.0 by the author.