What this page is
A private progress report showing how the SCALES PACER-tools infrastructure was extended into a domain-specific court analytics platform for bankruptcy. Updated as new tools and findings develop. Not public, not indexed.
What PACER-tools Made Possible
Starting from the SCALES docket scraper and parser, we built a bankruptcy-specific analytics pipeline that now spans 347 tools, 5.1 million FJC cases, and 11,000+ enriched dockets. Every analytical layer depends on PACER-tools as the ingestion foundation.
Dependency tree
SCALES PACER-tools -- docket scraping, HTML parsing, party extraction
eyecite (Free Law Project) -- citation extraction from docket text
LexNLP -- money/amount extraction from fee applications
FJC IDB -- 5.1M case records (public download)
Enrichment layer (OBP)
├ Case Enricher -- unified pipeline: SCALES parser + eyecite + LexNLP + event classification
├ Docket event classifier -- maps raw entries to canonical event types
└ Timing extractor -- computes intervals between events
Analytical modules (OBP)
├ Portfolio Health -- per-attorney outcome analysis, confidence intervals, red flags
├ Docket Velocity -- filing-to-confirmation, MTD-to-dismissal, stay relief response time
├ Template Detector -- fingerprints docket sequences to find copy-paste filing patterns
├ Early Warning -- predictive risk scorer for active cases (MTDs, OSCs, stale dockets)
├ Citation Profiler -- boilerplate vs. substantive statute usage by attorney
└ Mill Composite Scorer -- 3-axis detection model (review + FJC cross-link)
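As a rough illustration of the docket event classifier in the enrichment layer, which the research questions further down describe as keyword-based, the sketch below maps raw entry text to canonical event types with ordered regular expressions. The rule list, the label names, and the `classify_entry` function are hypothetical; the project's actual rules are not shown in this report.

```python
import re

# Hypothetical keyword rules; the first matching rule wins, so more
# specific patterns are listed before more general ones.
RULES = [
    ("ORDER_TO_SHOW_CAUSE", re.compile(r"\border to show cause\b", re.I)),
    ("MOTION_TO_DISMISS", re.compile(r"\bmotion to dismiss\b", re.I)),
    ("STAY_RELIEF_MOTION", re.compile(r"\brelief from (the )?(automatic )?stay\b", re.I)),
    ("PLAN_CONFIRMED", re.compile(r"\border confirming\b.*\bplan\b", re.I)),
    ("PLAN_FILED", re.compile(r"\bchapter 13 plan\b", re.I)),
    ("PETITION", re.compile(r"\bvoluntary petition\b", re.I)),
]


def classify_entry(text: str) -> str:
    """Return the canonical event type for one raw docket entry."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "OTHER"


samples = [
    "Order Confirming Chapter 13 Plan",
    "Trustee's Motion to Dismiss Case",
    "Motion for Relief from Stay",
    "Notice of Appearance",
]
labels = [classify_entry(s) for s in samples]
```

Rule order matters in a design like this: "Order Confirming Chapter 13 Plan" also contains "Chapter 13 Plan", so the confirmation rule has to come before the plan-filed rule.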
Analytical Modules (Detail)
Portfolio Health
Per-attorney outcome analysis across all cases in the dataset. Computes discharge rate, dismissal rate, chapter mix, confidence intervals, and red flag counts. Outputs letter-grade scorecards (A through F) with peer comparison. It uses FJC data for the full 5.1M-case view and SCALES-parsed dockets for the detailed view.
CLI: run.py check [attorney_name]
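A minimal sketch of how a per-attorney rate and its confidence interval might be computed. The Wilson score interval is an assumption chosen for illustration; this report does not say which interval the tool uses. The function name and the sample counts are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical attorney: 40 discharges out of 100 closed cases.
low, high = wilson_interval(40, 100)
print(f"discharge rate 40.0%, 95% CI {low:.1%} to {high:.1%}")
```

The Wilson interval behaves better than the plain normal approximation for attorneys with few cases or rates near 0% or 100%, which is why it is a plausible choice for a scorecard.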
Docket Velocity
Measures timing between key docket events: filing-to-confirmation, MTD-to-dismissal, stay relief response time, plan modification cadence. Compares individual attorneys against district and national baselines. Reveals whether an attorney engages with the case or lets it drift.
Finding: One subject firm's cases take a median 103.5 days to confirmation vs. 67 days for same-court peers -- roughly 54% longer.
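The interval and delay arithmetic can be sketched as follows. The date pairs are hypothetical and only illustrate the mechanics; the two medians passed to `percent_delay` are the ones reported in the finding above. Function names are illustrative, not the tool's.

```python
from datetime import date
from statistics import median


def days_between(start: date, end: date) -> int:
    """Whole days elapsed between two docket events."""
    return (end - start).days


def percent_delay(subject_median: float, peer_median: float) -> float:
    """How much longer the subject takes than peers, as a percentage."""
    return (subject_median - peer_median) / peer_median * 100


# Hypothetical (filed, confirmed) date pairs, for illustration only.
cases = [
    (date(2023, 1, 3), date(2023, 4, 10)),
    (date(2023, 2, 1), date(2023, 5, 20)),
    (date(2023, 3, 15), date(2023, 6, 30)),
]
intervals = [days_between(filed, confirmed) for filed, confirmed in cases]
sample_median = median(intervals)

# Medians reported in the finding: 103.5 days vs. 67 days.
delay = percent_delay(103.5, 67)
print(f"{delay:.1f}% longer than same-court peers")
```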
Template Detector
Fingerprints docket entry sequences (event type + order) and clusters cases with identical patterns. Identifies copy-paste filing behavior -- a structural indicator of high-volume/low-touch practice. Uses SCALES-parsed docket entries as input.
Finding: 16 cases spanning 4 attorneys share an identical docket event sequence across 19 years and two firms. Same filings, same order, same cadence.
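One way to fingerprint event sequences is to hash the ordered list of canonical event types and group cases by hash. This is a sketch under that assumption; the hashing scheme, function names, case IDs, and event labels are all hypothetical. It covers event type and order only, not cadence.

```python
import hashlib
from collections import defaultdict


def fingerprint(events: list[str]) -> str:
    """Hash an ordered sequence of canonical event types."""
    # The unit separator keeps ["A", "BC"] distinct from ["AB", "C"].
    joined = "\x1f".join(events)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def cluster_by_fingerprint(
    dockets: dict[str, list[str]], min_size: int = 2
) -> dict[str, list[str]]:
    """Group case IDs whose event sequences are identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for case_id, events in dockets.items():
        groups[fingerprint(events)].append(case_id)
    return {fp: ids for fp, ids in groups.items() if len(ids) >= min_size}


# Hypothetical dockets, already reduced to canonical event types.
dockets = {
    "case-A": ["PETITION", "PLAN_FILED", "MOTION_TO_DISMISS", "DISMISSED"],
    "case-B": ["PETITION", "PLAN_FILED", "MOTION_TO_DISMISS", "DISMISSED"],
    "case-C": ["PETITION", "PLAN_FILED", "PLAN_CONFIRMED", "DISCHARGED"],
}
clusters = cluster_by_fingerprint(dockets)
```

Exact-match hashing only finds identical sequences. Near-duplicate detection would need a different technique, such as edit distance over the event lists.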
Early Warning
Predictive risk scorer for active (non-disposed) cases. Signals: unconfirmed plan past 90 days, multiple motions to dismiss, orders to show cause, stale docket (no activity 60+ days), missing pay advices. Weighted composite score produces LOW/MODERATE/HIGH/CRITICAL ratings.
Finding: 35 active cases scored CRITICAL from a single firm during the initial scan.
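A weighted composite over these signals might look like the sketch below. The signal list and the 90-day and 60-day thresholds come from the description above; the weights, the rating cutoffs, and all names are hypothetical placeholders, since the report does not give the real values.

```python
from dataclasses import dataclass


@dataclass
class CaseSignals:
    days_since_filing: int
    plan_confirmed: bool
    motions_to_dismiss: int
    orders_to_show_cause: int
    days_since_last_entry: int
    pay_advices_filed: bool


# Hypothetical weights and cutoffs, chosen only for illustration.
WEIGHTS = {
    "unconfirmed_90": 3,
    "multiple_mtd": 3,
    "osc": 2,
    "stale_60": 2,
    "missing_pay_advices": 1,
}
CUTOFFS = [(8, "CRITICAL"), (5, "HIGH"), (2, "MODERATE")]


def risk_score(s: CaseSignals) -> int:
    """Sum the weights of every signal that fires for this case."""
    fired = {
        "unconfirmed_90": not s.plan_confirmed and s.days_since_filing > 90,
        "multiple_mtd": s.motions_to_dismiss >= 2,
        "osc": s.orders_to_show_cause >= 1,
        "stale_60": s.days_since_last_entry >= 60,
        "missing_pay_advices": not s.pay_advices_filed,
    }
    return sum(WEIGHTS[name] for name, hit in fired.items() if hit)


def risk_rating(score: int) -> str:
    """Map a composite score to LOW/MODERATE/HIGH/CRITICAL."""
    for cutoff, label in CUTOFFS:
        if score >= cutoff:
            return label
    return "LOW"


example = CaseSignals(120, False, 2, 1, 75, True)
score = risk_score(example)
rating = risk_rating(score)
```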
Citation Profiler
Uses eyecite to extract statute citations from docket entry text. Classifies each citation as boilerplate (11 U.S.C. 301, 362, etc.) vs. substantive (506, 1325, 1328). Scores attorneys on absence of key provisions -- attorneys who never cite 1325(a)(5) or 506(a) in Chapter 13 cases are flagged.
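The classification step can be illustrated without eyecite. The sketch below swaps in a simple regular expression as a stand-in extractor so that it runs with the standard library alone; it is not the eyecite API. The section sets follow the examples given above, and the function and variable names are hypothetical.

```python
import re

# Section sets taken from the examples in the text; the full lists are assumptions.
BOILERPLATE_SECTIONS = {"301", "362"}
SUBSTANTIVE_SECTIONS = {"506", "1325", "1328"}
KEY_PROVISIONS = ("1325(a)(5)", "506(a)")

# Stand-in for eyecite: matches "11 U.S.C. § 1325(a)(5)" or "11 U.S.C. 362".
CITE_RE = re.compile(r"11\s+U\.S\.C\.\s*(?:§+\s*)?(\d+)((?:\([0-9A-Za-z]+\))*)")


def profile_citations(entries: list[str]) -> tuple[dict[str, int], list[str]]:
    """Count citation classes and list key provisions that never appear."""
    counts = {"boilerplate": 0, "substantive": 0, "other": 0}
    cited: set[str] = set()
    for text in entries:
        for match in CITE_RE.finditer(text):
            section = match.group(1)
            cited.add(section + match.group(2))
            if section in BOILERPLATE_SECTIONS:
                counts["boilerplate"] += 1
            elif section in SUBSTANTIVE_SECTIONS:
                counts["substantive"] += 1
            else:
                counts["other"] += 1
    missing = [p for p in KEY_PROVISIONS if not any(c.startswith(p) for c in cited)]
    return counts, missing


# Hypothetical docket entry text.
entries = [
    "Motion for relief from stay under 11 U.S.C. § 362(d)",
    "Objection to claim, 11 U.S.C. 506(a)",
    "Voluntary petition under 11 U.S.C. § 301",
]
counts, missing = profile_citations(entries)
```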
Mill Composite Scorer
3-axis detection model cross-linking independent signals: (1) Google review suppression, (2) review account padding, (3) Ch.13 dismissal rate delta from district baseline. Scored 38 firms. Identifies two mill archetypes: review-suppressed and unsuppressed. District adjustment is critical -- Ch.13 baselines range from 21% to 82% across districts.
Open question: Can this model predict mills out-of-sample? Requires attorney-level national case data (AACER or equivalent) for the outcome axis to work across all 94 districts.
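To show why district adjustment matters for the third axis, the sketch below computes a firm's dismissal rate delta in percentage points against a district baseline. The two baselines are the extremes quoted above (21% and 82%); the firm counts and the function name are hypothetical.

```python
def dismissal_delta_pp(
    firm_dismissed: int, firm_total: int, district_baseline_rate: float
) -> float:
    """Firm Ch.13 dismissal rate minus district baseline, in percentage points."""
    if firm_total <= 0:
        raise ValueError("firm_total must be positive")
    firm_rate = firm_dismissed / firm_total
    return (firm_rate - district_baseline_rate) * 100


# Hypothetical firm with 60 dismissals in 100 Ch.13 cases, placed in
# the lowest-baseline and the highest-baseline district.
in_low_baseline_district = dismissal_delta_pp(60, 100, 0.21)
in_high_baseline_district = dismissal_delta_pp(60, 100, 0.82)
print(f"{in_low_baseline_district:+.0f}pp vs. {in_high_baseline_district:+.0f}pp")
```

The same raw 60% dismissal rate is far above baseline in one district and well below it in the other, which is why an unadjusted rate would produce false positives.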
NLP and Text Analysis Layer
The enrichment pipeline combines three text analysis libraries into a single pass over each docket:
| Library | Source | What it extracts | Integration |
|---|---|---|---|
| SCALES PACER-tools | Georgia Tech / SCALES-OKN | Docket structure, party names, attorney names, filing metadata | Foundation -- every tool depends on this |
| eyecite | Free Law Project | Statute citations, case law references, reporter citations | Bundled locally, feeds Citation Profiler |
| LexNLP (LexPredict) | Elevate / LexPredict | Dollar amounts, dates, durations, ratios from unstructured text | Bundled locally, feeds fee extraction |
All three are bundled in the project's lib/ directory -- no external API calls, no cloud dependencies. The enrichment pass produces a normalized JSON structure per case with events, citations, amounts, and timing data.
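A per-case record of this kind could be shaped roughly as follows. Only the four top-level groups (events, citations, amounts, timing) come from the description above; every field name, type, and sample value is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    entry_number: int
    date_filed: str  # ISO 8601 date string
    event_type: str  # canonical type from the event classifier


@dataclass
class EnrichedCase:
    case_id: str
    events: list[Event] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)
    amounts: list[float] = field(default_factory=list)
    timing: dict[str, int] = field(default_factory=dict)  # interval name -> days


# Hypothetical case, for illustration only.
case = EnrichedCase(
    case_id="example-0001",
    events=[
        Event(1, "2023-01-03", "PETITION"),
        Event(14, "2023-04-10", "PLAN_CONFIRMED"),
    ],
    citations=["11 U.S.C. 1325(a)(5)"],
    amounts=[4500.00],
    timing={"filing_to_confirmation_days": 97},
)
payload = json.dumps(asdict(case), indent=2)
```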
PACER Fee Exemption -- The Bottleneck
Institutional affiliation unlocks national access
The Administrative Office of the U.S. Courts processes PACER fee exemptions for researchers affiliated with educational institutions. An independent researcher submitted a multi-court exemption application and was told the AO can only process exemptions with institutional backing. The door was left open for resubmission with an affiliation.
With a fee exemption, the entire 94-district pipeline becomes viable at zero per-page cost. Without it, scaling from 2 districts to 94 would cost an estimated $15,000-25,000 at $0.10/page.
This is the single largest bottleneck in the project. The FJC data (free) shows the problem. PACER data (paywalled) confirms the specific violations. The tools are ready to run nationally -- the constraint is access cost.
What institutional partnership would unlock:
- PACER fee exemption across all 94 bankruptcy courts
- National-scale docket enrichment (currently 11,038 cases; target: 100,000+)
- Template detection across districts (currently limited to 2 districts)
- Early warning system deployed nationally
- Publishable dataset with institutional credibility
SCALES-Specific Integration Opportunities
What could make both projects stronger
- Bankruptcy domain module for SCALES: The 6 analytical modules could become a bankruptcy-specific extension of the SCALES platform. The event classifier and template detector are generalizable to any federal court docket.
- Cross-domain template detection: The docket fingerprinting technique works on any court document sequence. Applying it to civil litigation could identify template-practice firms in personal injury, debt collection, and foreclosure.
- Enrichment pipeline contribution: The eyecite + LexNLP integration layer could be contributed upstream to SCALES as a standard enrichment pass.
- National bankruptcy outcome dataset: The FJC + enriched docket combination is, to our knowledge, the largest open attorney-outcome dataset for bankruptcy. It could be published through SCALES-OKN infrastructure.
- PACER fee exemption: An institutional affiliation with Georgia Tech / SCALES would unlock the AO exemption pathway for national scaling.
Key Findings (Summary)
| Finding | Method | Scale |
|---|---|---|
| 392,412 prior filers received Ch.13 discharges with no verifiable 1328(f) check | FJC cross-tabulation (PRFILE x DIESSION) | National, 5.1M cases |
| Two distinct mill archetypes identifiable from public review data | Google review analysis + FJC cross-reference | 38 firms scored |
| Ch.13 dismissal rate varies nearly 4x across districts (21%-82%) | FJC baseline computation | 30 districts with sufficient data |
| Template filing patterns persist across firms and decades | Docket sequence fingerprinting (SCALES parser) | 125 deep-mined cases |
| One firm's dismissal rate is +9pp above controls (p = 6.2 x 10^-38) | Tiered comparison with confidence intervals | 56,256 comparison cases |
| Rules Committee accepted proposed 1328(f) verification amendment | Formal submission, docketed as 26-BK-3 | National policy |
Progress Log
Built 3-axis composite mill scorer. Cross-links Google review signals with FJC outcome data. District-adjusted baselines prevent false positives from structural variation. 38 firms scored. Open question: AACER data needed for national Axis 3 coverage.
Google review scraping and forensic audit pipeline. CDP-based scraper + FJC cross-reference engine. Discovered two mill archetypes from review patterns alone.
National audit tool. 9-screen detection model. 93-district heatmap. Validated against manually-identified subjects.
Open Bankruptcy Project launched. 501(c)(3) filed. GitHub organization created. 139 domains, 2,353 pages. All tools open-source.
First academic validation. Empirical legal scholar at a top law school reviewed the methodology -- no red flags. Co-authorship interest. AACER dataset access discussed.
Rules Committee submission accepted (26-BK-3). Proposed amendment to Rule 4004: mandatory 1328(f) verification with docket notation. Published on uscourts.gov.
Initial outreach to SCALES team. Shared what PACER-tools made possible. CC'd contact@scales-okn.org and engineering@scales-okn.org.
Open Research Questions
- Cross-domain template detection: Does the docket fingerprinting technique generalize to civil litigation, foreclosure, and debt collection mills? The method is domain-agnostic -- it fingerprints event sequences, not bankruptcy-specific content.
- Enrichment pipeline upstream contribution: Would a SCALES PR adding eyecite + LexNLP enrichment be useful? The integration is clean -- single-pass, no external dependencies.
- Attorney outcome benchmarking at scale: With AACER or PACER fee exemption, every consumer bankruptcy attorney in the country could be scored against district peers. Is there appetite for this as a SCALES module?
- NLP on docket entry text: Currently the event classifier uses keyword matching. A transformer-based classifier trained on the 27,685 parsed entries could improve accuracy and generalize to other court types.
- Publication path: The combined dataset (FJC + enriched dockets + review signals) is novel. What's the right venue -- JELS, law review, computer science (NLP/legal tech)?