Lab Notebook PRIVATE

Open Bankruptcy Project -- Research Progress Log

Last updated: April 2, 2026

What this page is

A running research notebook documenting new tools, datasets, and findings as they develop. Not public, not indexed, not linked from anywhere. Updated between conversations. Intended as a shared reference for research collaborators -- see what's been built, what the data says, and where the gaps are.

Current State of the Toolkit

5.1M FJC cases loaded (94 districts)
347 analysis tools (Python scripts)
38 firms scored (review + FJC cross-ref)
264 verified 1328(f) violations
30 districts with Ch.13 baselines

New: 3-Axis Composite Mill Detection Model

Built April 2, 2026

The core question: can you identify bankruptcy mills from public data alone, without relying on insider knowledge or media reports? We now have a working 3-axis model that cross-links independent signals.

Axis 1: Suppression

Google review 1-star percentage

Mills suppress or prevent negative reviews. Control group median: ~11%. Suppressed mills: 0-3%.

Weight: 0-35 points

Axis 2: Solicitation

1-review Google account ratio

Mills pad reviews with single-use accounts. Legitimate firms: 0-15%. Mills: 15-25%+.

Weight: 0-35 points
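The two review-based inputs (Axes 1 and 2) are simple proportions over a firm's scraped reviews. A minimal sketch of how they could be computed, assuming each scraped review carries a star rating and the reviewer's total review count; the `Review` fields and function names are hypothetical and are not taken from review_scrape.py:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Review:
    stars: int                  # 1-5 star rating (hypothetical field name)
    reviewer_review_count: int  # total reviews posted by this Google account


def one_star_pct(reviews: Sequence[Review]) -> float:
    """Axis 1 input: percentage of reviews that are 1-star."""
    if not reviews:
        raise ValueError("no reviews to score")
    return 100.0 * sum(r.stars == 1 for r in reviews) / len(reviews)


def single_review_account_pct(reviews: Sequence[Review]) -> float:
    """Axis 2 input: percentage of reviews from accounts with exactly one review."""
    if not reviews:
        raise ValueError("no reviews to score")
    return 100.0 * sum(r.reviewer_review_count == 1 for r in reviews) / len(reviews)


# Placeholder reviews for illustration only, not real data.
sample = [Review(5, 1), Review(5, 1), Review(1, 12), Review(4, 3)]
print(one_star_pct(sample))               # 25.0
print(single_review_account_pct(sample))  # 50.0
```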

Axis 3: Outcome

Ch.13 dismissal rate delta from district baseline

Not raw dismissal rate -- the delta from the firm's own district baseline. A 48% rate means different things in a 27% district vs. a 54% district.

Weight: 0-30 points
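The three weights sum to 100 points (35 + 35 + 30). This page does not record how each raw metric is mapped onto its point range, so the sketch below assumes simple clamped linear ramps. The ramp endpoints (`control_one_star`, `single_acct_cap`, `delta_cap`) are illustrative assumptions, not the values used in mill_composite_score.py:

```python
def ramp(value: float, lo: float, hi: float) -> float:
    """Linearly map value from [lo, hi] onto [0, 1], clamped at both ends."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))


def composite_score(one_star_pct: float,
                    single_acct_pct: float,
                    delta_pts: float,
                    control_one_star: float = 11.0,  # control median from Axis 1
                    single_acct_cap: float = 25.0,   # assumed saturation point
                    delta_cap: float = 20.0) -> float:  # assumed saturation point
    """Hypothetical 0-100 composite: fewer 1-star reviews, more single-review
    accounts, and a larger positive dismissal delta all raise the score."""
    suppression = 35.0 * (1.0 - ramp(one_star_pct, 0.0, control_one_star))
    solicitation = 35.0 * ramp(single_acct_pct, 0.0, single_acct_cap)
    outcome = 30.0 * ramp(delta_pts, 0.0, delta_cap)
    return suppression + solicitation + outcome
```

Under these assumptions a firm at or below its district baseline earns nothing on Axis 3, and each axis saturates at its weight, so no single signal can dominate the total.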

Why district-adjusted?

Raw Ch.13 dismissal rates vary dramatically by district. Without adjustment, a firm performing at the national average gets scored identically whether it's in a low-dismissal district or a high-dismissal one. The delta isolates firm-specific underperformance from structural variation.
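The adjustment itself is a subtraction in percentage points. A small worked sketch using the 48% example above (the function name is hypothetical):

```python
def dismissal_delta(firm_rate_pct: float, district_baseline_pct: float) -> float:
    """Percentage-point gap between a firm's Ch.13 dismissal rate and its district baseline."""
    return firm_rate_pct - district_baseline_pct


# The same 48% raw rate in two different districts:
print(dismissal_delta(48.0, 27.0))  # 21.0  (well above local peers)
print(dismissal_delta(48.0, 54.0))  # -6.0  (better than local peers)
```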

District baselines computed from the FJC dataset range from ~21% to ~82%, a nearly 4x spread. Examples from districts with 10,000+ closed Ch.13 cases:

District Type                   Ch.13 Baseline   Closed Cases
Low-dismissal district          ~27%             16,000+
Mid-range district              ~43%             16,000+
National average (all Ch.13)    ~45%             5.1M
High-dismissal district         ~63%             17,000+
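A district baseline of this kind is the share of closed Ch.13 cases in the district that ended in dismissal. A minimal sketch, assuming the case data has already been reduced to (district, was_dismissed) pairs for closed Ch.13 cases. That input shape and the `min_closed` cutoff are assumptions for illustration, not the FJC schema:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def district_baselines(cases: Iterable[Tuple[str, bool]],
                       min_closed: int = 10_000) -> Dict[str, float]:
    """Return {district: dismissal rate in percent} for districts with enough closed cases."""
    closed: Counter = Counter()
    dismissed: Counter = Counter()
    for district, was_dismissed in cases:
        closed[district] += 1
        if was_dismissed:
            dismissed[district] += 1
    # Skip thin districts so a handful of cases cannot set the baseline.
    return {d: 100.0 * dismissed[d] / n for d, n in closed.items() if n >= min_closed}


# Placeholder rows with a tiny threshold, for illustration only.
rows = [("A", True), ("A", False), ("A", False), ("A", False), ("B", True)]
print(district_baselines(rows, min_closed=2))  # {'A': 25.0}
```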

Archetype discovery

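The model identifies two distinct mill archetypes through the cross-link of Axes 1 and 2: review-suppressed mills (0-2% one-star reviews) and unsuppressed mills (10-27% one-star reviews). Both show elevated one-review Google accounts (15-25%).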

Both archetypes can be confirmed by Axis 3 (outcome delta), but this axis requires attorney-level case data in the firm's actual operating district.

Sample results

38 firms scored: NACBA board members, circuit leaders, known mills, and controls. Firms with fewer than 10 reviews receive a 60% confidence discount; fewer than 20 receive a 30% discount.
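The small-sample discount can be sketched as a multiplier on the raw score. This assumes the discount is applied multiplicatively to the composite (a 60% discount keeps 40% of the score); the page does not say whether it applies to the whole score or only to the review-based axes:

```python
def confidence_multiplier(review_count: int) -> float:
    """Fraction of the raw score retained, given the number of reviews."""
    if review_count < 10:
        return 0.4  # 60% confidence discount
    if review_count < 20:
        return 0.7  # 30% confidence discount
    return 1.0      # no discount


def discounted_score(raw_score: float, review_count: int) -> float:
    """Apply the small-sample discount to a raw composite score."""
    return raw_score * confidence_multiplier(review_count)
```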

Firm Type                       Reviews   1-Star %   1-Acct %   Delta        Score Range
Known suppressed mill (A)       734       0.3%       23.6%      +3.2         70-75
Known suppressed mill (B)       324       1.2%       14.2%      +4.0         47-48
Known unsuppressed mill (C)     1,500     26.5%      21.6%      +6.8         38
Legitimate firm (control avg)   18-280    7-23%      3-15%      -3 to -15    0-15

Key finding: The reviewer sample for Axis 3 understates the real outcome gap, because the clients who leave Google reviews skew toward satisfied ones. In one case, the composite scorer showed a +4.0 delta from the reviewer sample, but direct PACER docket mining of the same firm's full caseload revealed an actual Ch.13 dismissal rate of 89% -- more than 3x the district baseline. The reviewer sample sees the happy clients; the docket sees everyone.

Where the Model Breaks Down

Data coverage gap: Axis 3 depends on attorney-level case data

What AACER would unlock

AACER provides attorney-level case data nationally: attorney names (first and last), firm names, filing dates, dispositions, prior filing history, case-level detail across all 94 districts. With AACER:

Tool Inventory (Research-Relevant)

ToolWhat it doesData source
mill_composite_score.py3-axis composite scorer with district baselinesGoogle reviews + FJC 5.1M cases
review_scrape.pyGoogle Maps review scraper (CDP automation)Google Maps
review_audit.pyCross-reference reviewers against FJC casesScraped reviews + FJC
run.py check [name]Attorney scorecard: caseload, dismissal rate, chapter mix, red flagsFJC 5.1M cases
run.py scanBlind outlier detection across all attorneys in datasetFJC
run.py predict [name]ML mill probability (logistic regression, 7 features)FJC
run.py compareSubject firm vs. same-district controlsFJC
run.py worst NWorst N attorneys by composite scoreFJC
1328(f) ScreenerClient-side SQL.js tool checking discharge eligibilityPACER case data (user-provided)
FJC MCP ServerNatural language queries against FJC dataFJC 5.1M cases

Research Log

April 2, 2026

Built 3-axis composite mill scorer. Cross-linked review suppression, review padding, and Ch.13 dismissal rate delta from district baseline. Scored 38 firms (NACBA board/leaders, known mills, controls). Discovery: district adjustment is essential -- without it, firms in low-baseline districts appear similar to firms in high-baseline districts despite vastly different performance relative to local peers. Identified data gap: FJC coverage insufficient for Axis 3 in most districts. AACER data would close the gap.

April 1, 2026

Built Google review scraping and audit pipeline. CDP-based scraper for Google Maps reviews. Forensic cross-reference engine matching reviewer names against 5.1M FJC bankruptcy cases (nickname expansion, fuzzy matching). Scraped 38 firms. Discovered two mill archetypes: review-suppressed (0-2% one-star) and unsuppressed (10-27% one-star). Both show elevated one-review Google accounts (15-25%).
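The reviewer-to-debtor name matching (nickname expansion plus fuzzy matching) can be sketched with the standard library alone. The nickname table, the 0.85 threshold, and all function names below are illustrative assumptions, not the logic in review_audit.py:

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

# Tiny illustrative nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def normalize(name: str) -> str:
    """Lowercase, drop periods, and expand known nicknames token by token."""
    tokens = name.lower().replace(".", "").split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(reviewer: str, debtor_names: Iterable[str],
               threshold: float = 0.85) -> Optional[str]:
    """Return the closest debtor name, or None if nothing clears the threshold."""
    scored = [(name_similarity(reviewer, d), d) for d in debtor_names]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= threshold else None


# Placeholder names, not real reviewers or debtors.
print(best_match("Bill Smith", ["William Smith", "Robert Jones"]))  # William Smith
```

A name-only match is weak evidence by itself, since common names collide heavily across 5.1M cases; any production matcher would presumably add district and date constraints.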

March 30, 2026

Built national audit tool. 9-screen detection model (inspired by Madoff-style red flags: outsized returns without visible mechanism). 93-district heatmap visualization. Automated detection of all previously-known mill attorneys validated against manual identification.

March 25, 2026

First research call. Methodology reviewed -- no red flags identified. Co-authorship interest expressed. AACER dataset access discussed as path to national coverage. LoPucki introduction mentioned.

March 17-23, 2026

Rules Committee submission accepted. Proposed amendment to Rule 4004 (mandatory 1328(f) verification with docket notation). Assigned docket number 26-BK-3. Published on uscourts.gov.

Open Research Questions

  1. Can the composite model predict mills out-of-sample? Train on known mills, test against unlabeled firms. Requires AACER for Axis 3 coverage nationally.
  2. What is the true 1328(f) violation rate? The 392,412 figure is the universe of cases requiring verification. With AACER date data, the actual violation count could be computed directly.
  3. Is there an attorney-level correlation between dismissal rate and 1328(f) violation rate? Hypothesis: high-dismissal attorneys are more likely to file barred cases because they churn clients through repeat filings.
  4. Does review manipulation correlate with other misconduct indicators? Axes 1-2 are cheap to compute (no PACER costs). If they predict Axis 3 outcomes, review data alone could serve as a national screening tool.
  5. District-level structural variation: Why does the Ch.13 dismissal baseline range from 21% to 82% across districts? Judicial culture, debtor demographics, trustee behavior, or attorney quality? The FJC data can answer this with the right controls.

What Comes Next

With AACER access

Without AACER access