Epstein Suite

System Status: Active Ingestion

Live processing: OCR, AI summaries, and data indexing in progress across ~3.5 million newly released pages.

Last updated: December 23, 2025

How the Epstein Suite runs

Public-source intelligence demands transparency and resilience. This page explains the key technologies behind our PHP suite, ingestion infrastructure, and secure data handling, without exposing any private credentials or operational secrets.

Suite at a glance

PHP-first productivity interface

  • PHP 8.4 with strict typing, PSR-12 formatting, and Composer autoloading.
  • Custom MVC-inspired layout: standalone page scripts combined with shared navigation and layout components.
  • TailwindCSS utilities layered with Material-inspired cards for a unified productivity-suite UI.
  • Vanilla JS helpers plus lightweight Alpine-style behaviors for modals, filters, and async actions.
  • Media served through a hardened delivery layer that sanitizes file paths before streaming anything to the browser.
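
The path check described in the last bullet can be sketched as follows. This is a minimal illustration rather than the suite's actual delivery code: it is written in Python for consistency with the other examples on this page even though the interface itself is PHP, and `MEDIA_ROOT` and `resolve_media_path` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

# Hypothetical storage root; the real location is not published.
MEDIA_ROOT = Path("/srv/media")


def resolve_media_path(requested: str, root: Path = MEDIA_ROOT) -> Optional[Path]:
    """Return an absolute path strictly inside `root`, or None if the request escapes it."""
    # Reject embedded NUL bytes before touching the filesystem API.
    if "\x00" in requested:
        return None
    root_resolved = root.resolve()
    # Treat the request as relative to the root, then collapse ".." and symlinks.
    candidate = (root_resolved / requested.lstrip("/")).resolve()
    # Accept only paths that have the root as one of their ancestors.
    if root_resolved not in candidate.parents:
        return None
    return candidate
```

A request such as `docs/a.pdf` resolves to a path under the root, while `../etc/passwd` or an empty string returns `None`, so the caller can answer with a 404 instead of streaming a file.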

Automation

Python ingestion & AI summarization

  • Python 3.11 virtual environments power scraping, downloads, OCR, and AI pipelines.
  • Playwright drives browser-based scraping; pdf2image and PyMuPDF turn PDF pages into images; Pillow cleans up scans before Tesseract runs OCR.
  • OpenAI GPT-4o generates summaries and extracts entities, using Structured Outputs so responses conform to a fixed JSON schema.
  • Scripts use MySQL Connector to record ingestion status, AI logs, and entity relationships.
  • CLI scripts support batching, worker pools, and rate limiting to respect public sources.
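
One way to combine a worker pool with rate limiting, as the last bullet describes, is sketched below. It assumes a thread pool and a fixed minimum interval between requests; `RateLimiter`, `process_batch`, and the default values are hypothetical and not taken from the suite's scripts.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space calls at least `interval` seconds apart across all threads."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other workers can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def process_batch(items, handler, workers=4, interval=0.5):
    """Run `handler` over `items` in a pool, starting at most one call per interval."""
    limiter = RateLimiter(interval)

    def task(item):
        limiter.wait()
        return handler(item)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(task, items))
```

The pool size controls how many downloads can be in flight at once, while the limiter caps how often new requests start, so the two settings can be tuned independently for each public source.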

Data backbone

Database & storage layers

  • MySQL 8.0 (InnoDB) hosts documents, AI summaries, entities, emails, and flight logs.
  • Full-text indexes enable high-signal search across OCR pages and metadata.
  • Redundant storage keeps documents, previews, and logs segmented from the public web root.
  • Edge caches reduce repeated upstream hits and can be invalidated through internal admin tools.
  • Backups mirror critical datasets to off-site object storage via boto3-powered maintenance scripts.
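
A full-text lookup of the kind mentioned above could be issued as a parameterized `MATCH ... AGAINST` query. The sketch below only builds the statement and its parameters; the `documents` table, the `title` and `ocr_text` columns, and the `build_search` helper are assumptions for illustration, not the suite's real schema.

```python
from typing import Tuple

# Hypothetical table and column names. MySQL requires a FULLTEXT index
# covering exactly the columns listed in MATCH().
SEARCH_SQL = (
    "SELECT id, title, "
    "MATCH(title, ocr_text) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
    "FROM documents "
    "WHERE MATCH(title, ocr_text) AGAINST (%s IN NATURAL LANGUAGE MODE) "
    "ORDER BY score DESC LIMIT %s"
)


def build_search(term: str, limit: int = 20) -> Tuple[str, tuple]:
    """Return (sql, params) for a relevance-ranked full-text search."""
    # Collapse runs of whitespace; the term itself is never spliced into the SQL.
    cleaned = " ".join(term.split())
    if not cleaned:
        raise ValueError("search term must not be empty")
    # Clamp the page size so a caller cannot request an unbounded result set.
    limit = max(1, min(int(limit), 100))
    return SEARCH_SQL, (cleaned, cleaned, limit)
```

With MySQL Connector/Python the pair would be passed as `cursor.execute(sql, params)`, which keeps user input out of the statement text in the same spirit as the prepared statements used on the PHP side.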

Hosting & ops

Infrastructure & security guardrails

  • AlmaLinux 8 VPS running Apache + PHP-FPM; HTTPS enforced via managed certificates.
  • Automation jobs execute over SSH/cron with virtualenv isolation and environment-scoped secrets.
  • Strict separation of configuration secrets, prepared statements everywhere, CSRF protection, and explicit anti-doxxing rules.
  • Robots policies keep private endpoints out of search indexes, while admin tools require signed keys plus IP allowlists.
  • Operational monitoring relies on structured logs and audit trails, with roadmap items for automated alerting.
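
The combination of signed keys and IP allowlists can be illustrated with a short sketch. It assumes an HMAC-SHA256 signature over the request body, which is one common reading of "signed keys" and not a confirmed detail of the suite; the secret, the network range (a reserved documentation range), and the function names are all hypothetical.

```python
import hashlib
import hmac
import ipaddress

# Example values only: 203.0.113.0/24 is reserved for documentation,
# and a real secret would come from the environment, never from source code.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
SECRET = b"example-only-secret"


def sign(message: bytes, secret: bytes = SECRET) -> str:
    """Return the hex HMAC-SHA256 signature of `message`."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def ip_allowed(remote_addr: str, networks=ALLOWED_NETWORKS) -> bool:
    """True if `remote_addr` parses as an IP inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False
    return any(addr in net for net in networks)


def authorize(remote_addr: str, message: bytes, signature: str,
              secret: bytes = SECRET) -> bool:
    """Require both an allowlisted source address and a valid signature."""
    if not ip_allowed(remote_addr):
        return False
    expected = sign(message, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Checking the address first rejects most unwanted traffic cheaply, and requiring the signature as well means that a request from inside the allowlisted network still fails without the key.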

Privacy & Safety Standards

Every ingestion job respects DOJ/FBI access rules, honors removal requests, and avoids exposing redacted victims or minors. We ship updates through documented scripts, log every OCR/scrape run, and review anomalies manually before publishing new material.

  • Strict PDO queries
  • Rate-limited scraping
  • Tesseract preprocessing
  • OpenAI Structured Outputs

Want to go deeper?

Engineers can review our full TECH.md internally for contributor details. Public readers can browse the Drive to see how these systems surface raw DOJ documents.