Drug shortages in the US hit record levels in 2024. Patients couldn't get cancer treatments. Children's antibiotics ran out. Hospitals scrambled for basic anaesthetics. And yet, the pharmaceutical industry still forecasts demand using methods that would embarrass a first-year data science student — last quarter's sales, plus a gut feeling.
I decided to do something about it. PharmaSight is a pharmaceutical intelligence platform that fuses publicly available data from 14 different sources into a unified ML forecasting system. The thesis is simple: drug demand doesn't exist in a vacuum. It's shaped by disease outbreaks, supply disruptions, regulatory changes, patent cliffs, and public narrative. If we can capture all of these signals and feed them into a model, we should be able to forecast demand far better than looking at historical sales alone.
Last night, I built the entire data ingestion layer. Here's what I learned.
The Data Landscape
The first surprise was how much pharmaceutical data is publicly available. The US government publishes an extraordinary amount of information about drugs — who prescribes them, who manufactures them, when patents expire, which drugs are in shortage, and what adverse events have been reported. You just have to know where to look.
I ended up pulling from 14 sources across five categories:
Demand signals — Medicaid State Drug Utilization Data is the backbone. Every prescription filled through Medicaid across all 50 states, broken down by drug and quarter, going back to 2019. That's 25.3 million rows covering 70,592 unique drugs and $872 billion in reimbursements.
Disease drivers — CDC FluView provides weekly influenza surveillance data at the regional level. When flu spikes, demand for antivirals and antibiotics follows. 3,650 data points going back to 2018, with weighted ILI rates ranging from 0.13% to 13.40%.
Supply-side signals — The FDA publishes real-time data on drug shortages (1,683 records, with 1,148 currently active), recall enforcement reports (17,428 records going back to 2006), and adverse event reports through FAERS (389,809 records). I also pulled the complete drug approval history from Drugs@FDA (50,859 products dating to 1939) and the Orange Book (47,780 products with 20,174 patent records). The Orange Book is particularly valuable because patent expiry dates are known years in advance — they're leading indicators of when generic competition will enter and branded demand will collapse.
Regulatory intelligence — The Federal Register API gave me 13,389 regulatory documents from FDA, CMS, DEA, and HHS — including 981 final rules and 609 proposed rules. Regulations.gov added another 1,749 rulemaking dockets. These aren't just background noise: when CMS changes Medicaid reimbursement policy or the FDA finalises new manufacturing requirements, the demand impact is direct and measurable.
News and sentiment — RSS feeds from BioPharma Dive, Fierce Pharma, and Endpoints News provide a daily pulse of industry news. Twitter/X gives us real-time public discourse around drug shortages, FDA actions, and pricing. ClinicalTrials.gov contributes 10,000 Phase III and Phase IV trials — a window into future market changes.
The Hardest Part: Making It All Connect
Downloading data is the easy part. The real engineering challenge is making 14 sources talk to each other. Every source uses different identifiers, different formats, and different time grains.
The universal join key in pharmaceutical data is the NDC — National Drug Code — which identifies a specific drug from a specific manufacturer in a specific package size. The catch is that there isn't one format. The FDA assigns 10-digit codes in three different segment patterns (4-4-2, 5-3-2, or 5-4-1 for labeler-product-package), while CMS and billing systems zero-pad them to a standardised 11-digit 5-4-2 form. Some sources include hyphens, some don't, and the segment lengths vary from source to source.
I built a harmonisation utility that standardises every NDC to the 11-digit 5-4-2 format. On the Medicaid data, it achieved a 100% match rate — every single row got a valid NDC. That was a relief.
The second challenge was name matching. Medicaid truncates drug names — "AMOXICILLIN" becomes "AMOXICILLI", "ATORVASTATIN" becomes "ATORVASTAT". A naive exact-match join between Medicaid and the FDA product database matched only 18.6% of records. As a second pass, I added a prefix-matching rule: if "AMOXICILLI" is a prefix of exactly one ingredient in the product database, we match it. That brought the enrichment rate up to 71%, recovering nearly 10 million additional matched rows.
What Surprised Me
49% of Medicaid records are suppressed. When a specific drug in a specific state in a specific quarter has fewer than 11 prescriptions, Medicaid blanks out the numbers for privacy. That's nearly half the dataset. It means our data is systematically biased toward high-volume drugs in large states. This is a methodological challenge I'll need to address in the modelling phase.
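One cheap diagnostic before modelling is to quantify that bias directly. A sketch with hypothetical column names (the real SDUD extract has its own schema), treating a blanked-out count as a suppression flag:

```python
import pandas as pd

# Hypothetical columns: NaN prescriptions = suppressed cell (<11 scripts)
df = pd.DataFrame({
    "state": ["CA", "CA", "WY", "WY"],
    "prescriptions": [1204.0, 387.0, None, 15.0],
})
df["suppressed"] = df["prescriptions"].isna()

# Per-state suppression rate makes the small-state skew visible:
# low-population states lose a much larger share of their rows.
rate = df.groupby("state")["suppressed"].mean()
```

On the toy data, `rate` is 0.0 for CA and 0.5 for WY — exactly the pattern that biases the dataset toward high-volume drugs in large states.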
1,148 drugs are currently in shortage. That's not a historical number — that's right now. The top reasons are demand increases, active ingredient shortages, and manufacturing delays. When Drug A goes into shortage, demand for Drug B (the therapeutic alternative) spikes. Modelling this substitution effect is one of PharmaSight's key research questions.
The regulatory corpus is massive. 13,389 Federal Register documents from pharma-relevant agencies in 6 years. Most pharmaceutical forecasting papers completely ignore regulatory signals. My hypothesis is that proposed rules (published months before they take effect) serve as leading indicators for demand shifts — and that this signal has been hiding in plain sight.
The Star Schema
All of this data converges into a star schema — a central fact table surrounded by dimension and feature tables:
- fact_demand — 18.7 million rows: date × state × drug × prescriptions × reimbursement
- dim_product — 47,780 products with approval type, therapeutic equivalence, patent expiry, and generic competitor counts
- dim_geography — 54 states and territories with HHS regions
Feature tables for disease, supply, safety, regulation, and news signals will be built next, each joining to the fact table through NDC and date.
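The join pattern is the standard star-schema one: left joins from the fact table out to each dimension, so no demand rows are dropped when a lookup misses. A toy sketch with made-up rows and column names:

```python
import pandas as pd

fact_demand = pd.DataFrame({
    "ndc": ["00002143380"], "state": ["CA"], "quarter": ["2024Q1"],
    "prescriptions": [1204], "reimbursement": [48160.0],
})
dim_product = pd.DataFrame({
    "ndc": ["00002143380"], "ingredient": ["AMOXICILLIN"],
    "generic_competitors": [6],
})
dim_geography = pd.DataFrame({"state": ["CA"], "hhs_region": [9]})

# Left joins keep every fact row even when a dimension lookup misses
wide = (fact_demand
        .merge(dim_product, on="ndc", how="left")
        .merge(dim_geography, on="state", how="left"))
```

The feature tables slot into the same pattern, keyed on NDC and date instead of NDC alone.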
What Comes Next
The data pipeline is done. Next up:
- Feature engineering — turning raw supply events, disease rates, and regulatory documents into model-ready features
- NLP pipeline — extracting drug mentions, event types, and sentiment from regulatory text and news articles using SciSpacy and transformer models, all running locally on a laptop GPU
- Model comparison — Seasonal Naive → LightGBM → XGBoost → Temporal Fusion Transformer → N-BEATS, with an ablation study measuring the marginal predictive value of each signal type
- The research question — does incorporating news sentiment and regulatory text mining significantly improve pharmaceutical demand forecasting? And if so, which signals matter most, for which drugs, and at what lead time?
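The Seasonal Naive baseline at the head of that model ladder is worth pinning down, since every other model has to beat it to justify its complexity. A minimal sketch for quarterly data (function name and shapes are mine):

```python
import pandas as pd

def seasonal_naive(history: pd.Series, season: int = 4, horizon: int = 4) -> list:
    """Forecast each future period as the value one season earlier.
    With quarterly data and season=4, this repeats last year's pattern."""
    last_season = list(history.iloc[-season:])
    return [last_season[h % season] for h in range(horizon)]

scripts = pd.Series([10, 20, 30, 40, 12, 22, 32, 42])  # eight quarters
seasonal_naive(scripts)  # next year = a copy of [12, 22, 32, 42]
```

It encodes no signal beyond seasonality, which is exactly what makes it the right yardstick for the ablation study: any lift from shortage, regulatory, or news features has to show up as error reduction against this.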
The entire codebase is open source. If you're interested in pharmaceutical data, NLP, or time series forecasting, I'd love to collaborate. Reach out through the contact form on this site or connect on LinkedIn.