Last updated: April 2, 2026

What this page is

A private progress report showing how the SCALES PACER-tools infrastructure was extended into a domain-specific court analytics platform for bankruptcy. Updated as new tools and findings develop. Not public, not indexed.

What PACER-tools Made Possible

Starting from the SCALES docket scraper and parser, we built a bankruptcy-specific analytics pipeline that now spans 347 tools, 5.1 million FJC cases, and 11,000+ enriched dockets. Every analytical layer depends on PACER-tools as the ingestion foundation.

347 analysis tools built (Python)
5.1M FJC cases loaded (94 districts)
11,038 dockets enriched (SCALES + eyecite + LexNLP)
27,685 docket entries parsed
6 analytical modules on the pipeline

Dependency tree

Data sources
SCALES PACER-tools -- docket scraping, HTML parsing, party extraction
eyecite (Free Law Project) -- citation extraction from docket text
LexNLP -- money/amount extraction from fee applications
FJC IDB -- 5.1M case records (public download)

Enrichment layer (OBP)
├ Case Enricher -- unified pipeline: SCALES parser + eyecite + LexNLP + event classification
├ Docket event classifier -- maps raw entries to canonical event types
└ Timing extractor -- computes intervals between events

Analytical modules (OBP)
├ Portfolio Health -- per-attorney outcome analysis, confidence intervals, red flags
├ Docket Velocity -- filing-to-confirmation, MTD-to-dismissal, stay relief response time
├ Template Detector -- fingerprints docket sequences to find copy-paste filing patterns
├ Early Warning -- predictive risk scorer for active cases (MTDs, OSCs, stale dockets)
├ Citation Profiler -- boilerplate vs. substantive statute usage by attorney
└ Mill Composite Scorer -- 3-axis detection model (review + FJC cross-link)

Analytical Modules (Detail)

1. Portfolio Health

Per-attorney outcome analysis across all cases in the dataset. Computes discharge rate, dismissal rate, chapter mix, confidence intervals, and red flag counts. Outputs letter-grade scorecards (A through F) with peer comparison. Uses FJC data for the full 5.1M-case view and SCALES-parsed dockets for the detailed view.

CLI: run.py check [attorney_name]
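The interval method is not stated above. One common choice for a rate computed from case counts is the Wilson score interval, so the following is a minimal sketch under that assumption. The function names and outcome labels are hypothetical, and the A-through-F grade thresholds are omitted because they are not given here.

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no cases: the rate is unconstrained
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def discharge_summary(outcomes: list[str]) -> dict:
    """Summarize one attorney's cases; the outcome labels are hypothetical."""
    total = len(outcomes)
    discharged = sum(1 for o in outcomes if o == "discharged")
    dismissed = sum(1 for o in outcomes if o == "dismissed")
    return {
        "cases": total,
        "discharge_rate": discharged / total if total else None,
        "dismissal_rate": dismissed / total if total else None,
        "discharge_ci95": wilson_interval(discharged, total),
    }

summary = discharge_summary(["discharged"] * 50 + ["dismissed"] * 50)
print(summary["discharge_ci95"])
```

An interval of this kind widens for small portfolios, which keeps a handful of dismissals from dominating a scorecard.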

2. Docket Velocity

Measures timing between key docket events: filing-to-confirmation, MTD-to-dismissal, stay relief response time, plan modification cadence. Compares individual attorneys against district and national baselines. Reveals whether an attorney engages with the case or lets it drift.

Finding: One subject firm's cases take a median 103.5 days to confirmation vs. 67 days for same-court peers -- about 54% longer.
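As a rough illustration of how such intervals can be computed, the sketch below measures days between two event types on a docket, takes a median across cases, and reproduces the 54% figure from the two medians quoted above. The event type names and function names are hypothetical, not the Timing extractor's actual labels.

```python
from datetime import date
from statistics import median
from typing import Optional

def days_between(events, start: str, end: str) -> Optional[int]:
    """Days from the earliest `start` event to the earliest `end` event on or after it.

    `events` is a list of (event_type, date) pairs.
    """
    start_date = min((d for t, d in events if t == start), default=None)
    if start_date is None:
        return None
    end_date = min((d for t, d in events if t == end and d >= start_date), default=None)
    return None if end_date is None else (end_date - start_date).days

def median_interval(cases, start: str, end: str) -> Optional[float]:
    """Median interval across cases, skipping cases missing either event."""
    spans = [s for s in (days_between(ev, start, end) for ev in cases) if s is not None]
    return median(spans) if spans else None

def relative_delay(subject_days: float, peer_days: float) -> float:
    """Percent by which the subject's median exceeds the peer median."""
    return (subject_days - peer_days) / peer_days * 100

case = [("petition_filed", date(2024, 1, 2)), ("plan_confirmed", date(2024, 4, 1))]
print(days_between(case, "petition_filed", "plan_confirmed"))
print(round(relative_delay(103.5, 67), 1))
```

With the quoted medians, (103.5 - 67) / 67 comes to roughly 54.5%.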

3. Template Detector

Fingerprints docket entry sequences (event type + order) and clusters cases with identical patterns. Identifies copy-paste filing behavior -- a structural indicator of high-volume/low-touch practice. Uses SCALES-parsed docket entries as input.

Finding: 16 cases spanning 4 attorneys share an identical docket event sequence across 19 years and two firms. Same filings, same order, same cadence.
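A minimal sketch of sequence fingerprinting, assuming each case has already been reduced to an ordered list of canonical event types. The hashing scheme, function names, and event labels are assumptions for illustration; the actual Template Detector may fingerprint differently (for example, by also encoding cadence).

```python
import hashlib
from collections import defaultdict

def fingerprint(event_types: list[str]) -> str:
    """Hash an ordered event sequence; order matters, so reordering changes the hash."""
    joined = "\x1f".join(event_types)  # unit separator avoids accidental collisions
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

def cluster_by_fingerprint(cases: dict[str, list[str]], min_size: int = 2) -> list[list[str]]:
    """Group case IDs whose event sequences are identical."""
    groups = defaultdict(list)
    for case_id, sequence in cases.items():
        groups[fingerprint(sequence)].append(case_id)
    return [sorted(ids) for ids in groups.values() if len(ids) >= min_size]

cases = {
    "case-a": ["petition", "plan_filed", "motion_to_dismiss"],
    "case-b": ["petition", "plan_filed", "motion_to_dismiss"],
    "case-c": ["petition", "motion_to_dismiss", "plan_filed"],
}
print(cluster_by_fingerprint(cases))
```

Exact matching is the strictest version of the idea: it only groups cases with the same events in the same order, so near-duplicates would need a looser similarity measure.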

4. Early Warning

Predictive risk scorer for active (non-disposed) cases. Signals: unconfirmed plan past 90 days, multiple motions to dismiss, orders to show cause, stale docket (no activity 60+ days), missing pay advices. Weighted composite score produces LOW/MODERATE/HIGH/CRITICAL ratings.

Finding: 35 active cases scored CRITICAL from a single firm during the initial scan.
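The sketch below shows one way a weighted composite over the five listed signals could map to the four ratings. The weights and cutoffs are hypothetical placeholders chosen only for illustration, and the field names are assumptions; only the signal definitions and rating labels come from the description above.

```python
from dataclasses import dataclass

@dataclass
class CaseSignals:
    plan_confirmed: bool
    days_since_filing: int
    motions_to_dismiss: int
    orders_to_show_cause: int
    days_since_last_entry: int
    pay_advices_filed: bool

# Hypothetical weights and cutoffs; the real model's values are not stated.
WEIGHTS = {
    "unconfirmed_plan_90d": 3,
    "multiple_mtd": 3,
    "order_to_show_cause": 2,
    "stale_docket_60d": 2,
    "missing_pay_advices": 1,
}
CUTOFFS = [(8, "CRITICAL"), (5, "HIGH"), (2, "MODERATE")]

def active_signals(c: CaseSignals) -> list[str]:
    """Return the names of the signals that fire for this case."""
    checks = {
        "unconfirmed_plan_90d": not c.plan_confirmed and c.days_since_filing > 90,
        "multiple_mtd": c.motions_to_dismiss >= 2,
        "order_to_show_cause": c.orders_to_show_cause >= 1,
        "stale_docket_60d": c.days_since_last_entry >= 60,
        "missing_pay_advices": not c.pay_advices_filed,
    }
    return [name for name, fired in checks.items() if fired]

def risk_rating(c: CaseSignals) -> tuple[int, str]:
    """Sum the weights of active signals and map the score to a rating."""
    score = sum(WEIGHTS[name] for name in active_signals(c))
    for cutoff, label in CUTOFFS:
        if score >= cutoff:
            return score, label
    return score, "LOW"

print(risk_rating(CaseSignals(False, 120, 2, 1, 75, False)))
```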

5. Citation Profiler

Uses eyecite to extract statute citations from docket entry text. Classifies each citation as boilerplate (11 U.S.C. 301, 362, etc.) vs. substantive (506, 1325, 1328). Scores attorneys on absence of key provisions -- attorneys who never cite 1325(a)(5) or 506(a) in Chapter 13 cases are flagged.
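To illustrate the classification step only, the sketch below uses a simple regular expression as a stand-in for eyecite, so it does not reflect eyecite's API or coverage. The section sets are taken from the examples above and are deliberately partial; the function names are hypothetical.

```python
import re

# Partial lists, taken from the examples in the text.
BOILERPLATE = {"301", "362"}
SUBSTANTIVE = {"506", "1325", "1328"}
KEY_PROVISIONS = [("1325", "(a)(5)"), ("506", "(a)")]

# Matches forms like "11 U.S.C. § 362(d)", "11 U.S.C. 1325(a)(5)", "11 USC 301".
TITLE_11 = re.compile(r"\b11\s+U\.?S\.?C\.?\s*(?:§+\s*)?(\d+)((?:\([0-9A-Za-z]+\))*)")

def extract_sections(text: str) -> list[tuple[str, str]]:
    """Return (section, subsections) pairs for Title 11 citations in the text."""
    return TITLE_11.findall(text)

def classify(section: str) -> str:
    if section in BOILERPLATE:
        return "boilerplate"
    if section in SUBSTANTIVE:
        return "substantive"
    return "other"

def missing_key_provisions(citations: list[tuple[str, str]]) -> list[str]:
    """List key provisions that never appear among an attorney's citations."""
    missing = []
    for section, sub in KEY_PROVISIONS:
        cited = any(s == section and subs.startswith(sub) for s, subs in citations)
        if not cited:
            missing.append(section + sub)
    return missing

text = ("Motion for relief under 11 U.S.C. § 362(d); plan meets "
        "11 U.S.C. 1325(a)(5). Petition filed under 11 USC 301.")
found = extract_sections(text)
print([(sec + sub, classify(sec)) for sec, sub in found])
print(missing_key_provisions(found))
```

In this sample the attorney cites 1325(a)(5) but never 506(a), so only the latter would be flagged.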

6. Mill Composite Scorer (new -- April 2026)

3-axis detection model cross-linking independent signals: (1) Google review suppression, (2) review account padding, (3) Ch.13 dismissal rate delta from district baseline. Scored 38 firms. Identifies two mill archetypes: review-suppressed and unsuppressed. District adjustment is critical -- Ch.13 dismissal-rate baselines range from 21% to 82% across districts.

Open question: Can this model predict mills out-of-sample? Requires attorney-level national case data (AACER or equivalent) for the outcome axis to work across all 94 districts.
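A small worked example of why the district adjustment on the third axis matters, assuming the delta is a simple difference between the firm's Ch.13 dismissal rate and its district baseline. The function name is hypothetical, and how the three axes are combined is not shown because it is not described here.

```python
def dismissal_delta_pp(firm_dismissed: int, firm_total: int, district_rate: float) -> float:
    """Firm Ch.13 dismissal rate minus its district baseline, in percentage points."""
    if firm_total == 0:
        raise ValueError("firm has no Ch.13 cases")
    return (firm_dismissed / firm_total - district_rate) * 100

# The same 50% raw dismissal rate, measured against the two extreme baselines.
low_baseline = dismissal_delta_pp(50, 100, 0.21)
high_baseline = dismissal_delta_pp(50, 100, 0.82)
print(round(low_baseline, 1), round(high_baseline, 1))
```

A firm dismissing half its cases sits 29 points above a 21% baseline but 32 points below an 82% baseline, so an unadjusted rate would flag or clear the firm depending only on where it practices.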

NLP and Text Analysis Layer

The enrichment pipeline combines three text analysis libraries into a single pass over each docket:

Library	Source	What it extracts	Integration
SCALES PACER-tools	Georgia Tech / SCALES-OKN	Docket structure, party names, attorney names, filing metadata	Foundation -- every tool depends on this
eyecite	Free Law Project	Statute citations, case law references, reporter citations	Bundled locally, feeds Citation Profiler
LexNLP (LexPredict)	Elevate / LexPredict	Dollar amounts, dates, durations, ratios from unstructured text	Bundled locally, feeds fee extraction

All three are bundled in the project's lib/ directory -- no external API calls, no cloud dependencies. The enrichment pass produces a normalized JSON structure per case with events, citations, amounts, and timing data.
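For orientation, the per-case record might look something like the sketch below. Only the four groups (events, citations, amounts, timing) come from the description above; every field name, the case ID, and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EnrichedCase:
    # Field names are hypothetical; the text names only the four groups.
    case_id: str
    events: list = field(default_factory=list)     # classified docket events
    citations: list = field(default_factory=list)  # extracted citation strings
    amounts: list = field(default_factory=list)    # extracted dollar amounts
    timing: dict = field(default_factory=dict)     # intervals between events, in days

def to_json(case: EnrichedCase) -> str:
    """Serialize one enriched case to a stable, human-readable JSON string."""
    return json.dumps(asdict(case), indent=2, sort_keys=True)

example = EnrichedCase(
    case_id="example-0001",
    events=[{"date": "2024-01-02", "type": "petition_filed"}],
    citations=["11 U.S.C. 362"],
    amounts=[4500.0],
    timing={"filing_to_confirmation_days": 90},
)
print(to_json(example))
```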

PACER Fee Exemption -- The Bottleneck

Institutional affiliation unlocks national access

The Administrative Office of the U.S. Courts processes PACER fee exemptions for researchers affiliated with educational institutions. An independent researcher submitted a multi-court exemption application and was told the AO can only process exemptions with institutional backing. The door was left open for resubmission with an affiliation.

With a fee exemption, the entire 94-district pipeline becomes viable at zero per-page cost. Without it, scaling from 2 districts to 94 would cost an estimated $15,000-25,000 at $0.10/page.
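As a quick check on scale, the estimate can be converted back into page counts at the quoted rate. This is plain arithmetic on the figures above and ignores any fee caps or waivers; the function name is hypothetical.

```python
PER_PAGE_CENTS = 10  # $0.10 per page, from the text

def pages_for_budget(budget_dollars: int) -> int:
    """Number of pages a budget covers at the per-page rate (integer cents)."""
    return budget_dollars * 100 // PER_PAGE_CENTS

low, high = pages_for_budget(15_000), pages_for_budget(25_000)
print(low, high)
```

The $15,000-25,000 range therefore corresponds to roughly 150,000 to 250,000 billed pages.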

This is the single largest bottleneck in the project. The FJC data (free) shows the problem. PACER data (paywalled) confirms the specific violations. The tools are ready to run nationally -- the constraint is access cost.

What institutional partnership would unlock:

PACER fee exemption across all 94 bankruptcy courts
National-scale docket enrichment (currently 11,038 cases; target: 100,000+)
Template detection across districts (currently limited to 2 districts)
Early warning system deployed nationally
Publishable dataset with institutional credibility

SCALES-Specific Integration Opportunities

What could make both projects stronger

Bankruptcy domain module for SCALES: The 6 analytical modules could become a bankruptcy-specific extension of the SCALES platform. The event classifier and template detector are generalizable to any federal court docket.
Cross-domain template detection: The docket fingerprinting technique works on any court document sequence. Applying it to civil litigation could identify template-practice firms in personal injury, debt collection, and foreclosure.
Enrichment pipeline contribution: The eyecite + LexNLP integration layer could be contributed upstream to SCALES as a standard enrichment pass.
National bankruptcy outcome dataset: The FJC + enriched docket combination is the largest open attorney-outcome dataset for bankruptcy. Could be published through SCALES-OKN infrastructure.
PACER fee exemption: An institutional affiliation with Georgia Tech / SCALES would unlock the AO exemption pathway for national scaling.

Key Findings (Summary)

Finding	Method	Scale
392,412 prior filers received Ch.13 discharges with no verifiable 1328(f) check	FJC cross-tabulation (PRFILE x DIESSION)	National, 5.1M cases
Two distinct mill archetypes identifiable from public review data	Google review analysis + FJC cross-reference	38 firms scored
Ch.13 dismissal rate varies nearly 4x across districts (21%-82%)	FJC baseline computation	30 districts with sufficient data
Template filing patterns persist across firms and decades	Docket sequence fingerprinting (SCALES parser)	125 deep-mined cases
One firm's dismissal rate is 9pp above controls (p = 6.2 x 10^-38)	Tiered comparison with confidence intervals	56,256 comparison cases
Rules Committee accepted proposed 1328(f) verification amendment	Formal submission, docketed as 26-BK-3	National policy

Progress Log

April 2, 2026

Built 3-axis composite mill scorer. Cross-links Google review signals with FJC outcome data. District-adjusted baselines prevent false positives from structural variation. 38 firms scored. Open question: AACER data needed for national Axis 3 coverage.

April 1, 2026

Google review scraping and forensic audit pipeline. CDP-based scraper + FJC cross-reference engine. Discovered two mill archetypes from review patterns alone.

March 30, 2026

National audit tool. 9-screen detection model. 93-district heatmap. Validated against manually-identified subjects.

March 27, 2026

Open Bankruptcy Project launched. 501(c)(3) filed. GitHub organization created. 139 domains, 2,353 pages. All tools open-source.

March 25, 2026

First academic validation. Empirical legal scholar at a top law school reviewed the methodology -- no red flags. Co-authorship interest. AACER dataset access discussed.

March 17-23, 2026

Rules Committee submission accepted (26-BK-3). Proposed amendment to Rule 4004: mandatory 1328(f) verification with docket notation. Published on uscourts.gov.

March 15, 2026

Initial outreach to SCALES team. Shared what PACER-tools made possible. CC'd contact@scales-okn.org and engineering@scales-okn.org.

Open Research Questions

Cross-domain template detection: Does the docket fingerprinting technique generalize to civil litigation, foreclosure, and debt collection mills? The method is domain-agnostic -- it fingerprints event sequences, not bankruptcy-specific content.
Enrichment pipeline upstream contribution: Would a SCALES PR adding eyecite + LexNLP enrichment be useful? The integration is clean -- single-pass, no external dependencies.
Attorney outcome benchmarking at scale: With AACER or PACER fee exemption, every consumer bankruptcy attorney in the country could be scored against district peers. Is there appetite for this as a SCALES module?
NLP on docket entry text: Currently the event classifier uses keyword matching. A transformer-based classifier trained on the 27,685 parsed entries could improve accuracy and generalize to other court types.
Publication path: The combined dataset (FJC + enriched dockets + review signals) is novel. What's the right venue -- JELS, law review, computer science (NLP/legal tech)?
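For context on the NLP question above, a keyword-matching event classifier can be as small as the sketch below, which would also serve as the baseline a transformer-based classifier has to beat. The canonical event names and patterns are hypothetical, not the project's actual rule set.

```python
import re

# Ordered rules: the first matching pattern wins. Names and patterns are illustrative.
RULES = [
    ("order_to_show_cause", re.compile(r"\border to show cause\b", re.I)),
    ("motion_to_dismiss", re.compile(r"\bmotion to dismiss\b", re.I)),
    ("stay_relief_motion", re.compile(r"\brelief from (?:the automatic )?stay\b", re.I)),
    ("plan_confirmed", re.compile(r"\border confirming\b.*\bplan\b", re.I)),
    ("discharge", re.compile(r"\border (?:of )?discharg", re.I)),
]

def classify_entry(text: str) -> str:
    """Map raw docket entry text to a canonical event type, or 'other'."""
    for event_type, pattern in RULES:
        if pattern.search(text):
            return event_type
    return "other"

for entry in [
    "Order Confirming Chapter 13 Plan",
    "Trustee's Motion to Dismiss Case",
    "Notice of Appearance",
]:
    print(entry, "->", classify_entry(entry))
```

Rule order matters in a scheme like this: an entry such as an order ruling on a motion to dismiss matches the motion rule first, which is one kind of error a learned classifier could reduce.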