1328f.org

Consumer Bankruptcy Research & Accountability

Convergence Study Design

Methodology Validation - Open Design - Prepared March 25, 2026 - Draft for discussion

Status

This is an open research design, not a finished protocol. It is intended as a starting point for collaboration. The framework, phases, and validation criteria below are all open to revision by a research partner with expertise in empirical study design.

1. The Problem

A single independent researcher built a dataset of 264 confirmed Section 1328(f) violations across 7 federal judicial districts. The dataset was produced using open-source tools applied to public federal court records.

The central question any reviewer will ask:

How do we know these results aren't an artifact of one person's methodology, selection bias, or errors?

The convergence study answers this by separating the person who built the tools from the people validating the results.

2. The Core Idea

Two independent data streams are being generated right now:

| Stream | Source | Controller | Status |
| --- | --- | --- | --- |
| Ground truth | 264 verified cases, 7 districts | Dataset builder (private) | Complete |
| Independent verification | GitHub cloners running the screener | Self-selected public users | In progress (198 unique cloners) |

If independent users running the same tool against the same public records in overlapping districts produce the same results, the findings are externally validated without requiring trust in any single researcher.

If results diverge, the study identifies exactly where and why, which is equally valuable.

3. Why This Is Novel

4. Proposed Phases

Phase 1 - Baseline

Characterize existing data streams

Determine which districts the 198 GitHub cloners have run the screener against. Identify overlap with the 7-district ground truth set. Measure: how many independent verification data points exist today, without any additional collection?
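The baseline measurement amounts to a set intersection between the ground-truth districts and the districts that independent users have actually screened. The sketch below is illustrative only: it assumes that independent runs can be collected as records with a `district` field, which this design has not yet established (a clone of the repository does not by itself report a run). The function name, field names, and district codes are hypothetical.

```python
from collections import Counter

def district_overlap(ground_truth_districts, independent_runs):
    """Return {district: number of independent runs} for districts in both streams."""
    ground_truth = set(ground_truth_districts)
    # Count how many independent runs were reported for each district.
    runs_per_district = Counter(run["district"] for run in independent_runs)
    shared = sorted(ground_truth & set(runs_per_district))
    return {district: runs_per_district[district] for district in shared}

# Hypothetical inputs; real district codes and run records would replace these.
ground_truth_districts = ["D1", "D2", "D3"]
independent_runs = [
    {"user": "u1", "district": "D2"},
    {"user": "u2", "district": "D2"},
    {"user": "u3", "district": "D9"},
]
overlap = district_overlap(ground_truth_districts, independent_runs)
print(overlap)
```

Districts that appear only in the independent stream (here `D9`) are excluded from the overlap but could still be candidates for Phase 3.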

Phase 2 - Overlap test

Compare results in shared districts

In districts screened by both the ground truth and independent users, compare results case by case. Metrics: agreement rate, false positive rate, false negative rate. If agreement exceeds a pre-specified threshold (to be determined by research design), the tool is validated.
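The case-by-case comparison can be sketched as a confusion-matrix tally over the cases both streams screened. This is a minimal illustration under stated assumptions, not the screener's actual output format: it assumes each stream yields a mapping from case ID to a boolean violation flag, and that the ground-truth stream also records cases that were screened and cleared (without those, false positives cannot be counted). It uses the conventional definitions FP / (FP + TN) and FN / (FN + TP); if the study instead wants the share of flagged cases that are wrong, the false positive denominator would be FP + TP. All names are hypothetical.

```python
def compare_case_results(ground_truth, independent):
    """Tally agreement between two {case_id: is_violation} mappings on shared cases."""
    shared = ground_truth.keys() & independent.keys()
    tp = fp = tn = fn = 0
    for case_id in shared:
        truth, flagged = ground_truth[case_id], independent[case_id]
        if truth and flagged:
            tp += 1
        elif truth and not flagged:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    n = len(shared)
    return {
        "n_shared": n,
        "agreement_rate": (tp + tn) / n if n else None,
        # Rates are None when their denominator is empty rather than guessed.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }

# Hypothetical case IDs and flags for illustration.
ground_truth = {"a": True, "b": True, "c": False, "d": False}
independent = {"a": True, "b": False, "c": False, "d": True, "e": True}
metrics = compare_case_results(ground_truth, independent)
print(metrics)
```

Case `e` is ignored because it has no ground-truth counterpart; such cases are candidates for manual review rather than inputs to the agreement rate.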

Phase 3 - Expansion

Extend to new districts with institutional PACER access

Using fee-exempt access through a university affiliation, run the screener against a stratified random sample of the 391,951-case verification universe. Sample design: stratify by district, filing year, and prior-filer discharge rate. Target: statistically representative coverage across all 94 districts.
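One way to draw such a sample is proportional allocation within each stratum, sketched below. The sketch makes several assumptions that a research partner may well change: the field names (`district`, `filing_year`, `rate_band`) are hypothetical, the prior-filer discharge rate is assumed to have been binned into bands beforehand, the sampling fraction is arbitrary, and every stratum contributes at least one case so that small districts are not dropped.

```python
import random
from collections import defaultdict

def stratified_sample(cases, strata_keys, fraction, seed=0):
    """Draw a proportional random sample from each stratum (at least one case each)."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced and audited
    strata = defaultdict(list)
    for case in cases:
        strata[tuple(case[key] for key in strata_keys)].append(case)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical universe: 2 districts x 2 filing years, 10 cases per stratum.
cases = [
    {"case_id": f"{d}-{y}-{i}", "district": d, "filing_year": y, "rate_band": "low"}
    for d in ("D1", "D2")
    for y in (2019, 2020)
    for i in range(10)
]
sample = stratified_sample(cases, ("district", "filing_year", "rate_band"), fraction=0.2)
print(len(sample))
```

Recording the seed alongside the sample lets a second party regenerate the identical draw, which fits the design's goal of not requiring trust in a single researcher.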

Phase 4 - Publication

Two papers, not one

Paper 1 (methods): The convergence framework itself. Can crowdsourced verification of court records produce reliable results? What are the conditions for validity? Applicable beyond 1328(f) to any court-record verification task.

Paper 2 (findings): The national 1328(f) violation rate, estimated from the expanded sample. Geographic variation analysis. Policy implications for Rule 4004 and discharge eligibility verification.

5. Validation Criteria (Open for Discussion)

What constitutes "convergence"? Proposed thresholds, subject to revision:

| Metric | Proposed threshold | Notes |
| --- | --- | --- |
| Case-level agreement rate | ≥ 95% | Independent result matches ground truth for same case |
| District-level violation rate | Within 5 percentage points | Independent district estimate vs. ground truth district estimate |
| False positive rate | ≤ 2% | Cases flagged as violations that are not (dates outside bar window) |
| False negative rate | To be measured | Cases missed by the screener that are actual violations |

These thresholds are placeholders. A research partner with experience in validation study design should set the actual criteria before data collection begins.
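Fixing the criteria in code before data collection is one way to make them pre-specified and auditable. The sketch below encodes the placeholder values from the table above as defaults; the class, function, and parameter names are hypothetical, and the false negative rate is left out because no threshold has been proposed for it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Placeholder values from the draft; a research partner would set the real ones.
    min_agreement: float = 0.95
    max_false_positive_rate: float = 0.02
    max_district_gap_pp: float = 5.0

def check_convergence(agreement, false_positive_rate, district_rates, t=Thresholds()):
    """Return a pass/fail flag per criterion.

    district_rates maps district -> (independent_pct, ground_truth_pct),
    both expressed in percent.
    """
    gaps = {d: abs(ind - gt) for d, (ind, gt) in district_rates.items()}
    return {
        "agreement": agreement >= t.min_agreement,
        "false_positive_rate": false_positive_rate <= t.max_false_positive_rate,
        "district_rates": all(gap <= t.max_district_gap_pp for gap in gaps.values()),
    }

# Hypothetical measurements for illustration only.
passing = check_convergence(0.96, 0.01, {"D1": (12.0, 9.5)})
failing = check_convergence(0.90, 0.01, {"D1": (12.0, 2.0)})
print(passing, failing)
```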

6. Known Threats to Validity

| Threat | Description | Mitigation |
| --- | --- | --- |
| Selection bias | GitHub cloners are self-selected. They found the tool through Reddit, search, or word of mouth. They may not be representative. | Phase 3 uses stratified random sampling with institutional access, removing self-selection entirely. |
| Tool error | The screener could have bugs that produce systematic errors. | The 264-case ground truth was manually verified against PACER dockets. Any screener error that contradicts manual verification would surface in Phase 2. |
| PACER data quality | PACER records may contain errors (incorrect dates, missing cases, miscoded chapters). | This affects all PACER-based research equally. The convergence design tests whether independent users encounter the same data quality, not whether PACER itself is perfect. |
| Temporal drift | PACER records can be amended. A case screened in March may show different data than the same case screened in June. | Timestamp all screenings. Compare only results generated within the same time window. |
| Non-independence | Some GitHub cloners may share results with each other, compromising independence. | The screener output is deterministic. If two users get the same result, it is because the data is the same, not because they compared notes. Non-independence does not affect a deterministic tool. |
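The temporal drift mitigation (timestamp every screening, compare only within the same window) can be sketched as a filter applied before any agreement metric is computed. The record layout and the 30-day window below are assumptions chosen for illustration; the design does not yet specify either.

```python
from datetime import datetime, timedelta

def comparable_pairs(ground_truth, independent, max_gap=timedelta(days=30)):
    """Return (case_id, gt_flag, ind_flag) for shared cases screened within max_gap."""
    pairs = []
    for case_id in sorted(ground_truth.keys() & independent.keys()):
        gt, ind = ground_truth[case_id], independent[case_id]
        # Skip pairs screened too far apart: the docket may have been amended in between.
        if abs(gt["screened_at"] - ind["screened_at"]) <= max_gap:
            pairs.append((case_id, gt["violation"], ind["violation"]))
    return pairs

# Hypothetical screening records keyed by case ID.
ground_truth = {
    "case-1": {"screened_at": datetime(2026, 3, 10), "violation": True},
    "case-2": {"screened_at": datetime(2026, 3, 10), "violation": True},
}
independent = {
    "case-1": {"screened_at": datetime(2026, 3, 20), "violation": True},
    "case-2": {"screened_at": datetime(2026, 6, 15), "violation": False},
}
pairs = comparable_pairs(ground_truth, independent)
print(pairs)
```

Here `case-2` is set aside rather than counted as a disagreement, since the two screenings are roughly three months apart and the record may have changed.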

7. What a Research Partner Brings

The dataset builder built the infrastructure. The research partner designs the study. Specifically:

8. What Exists Today

| Component | Status |
| --- | --- |
| Ground truth dataset (264 cases, 7 districts) | Complete |
| Screening tool (open-source, deterministic) | Live, ranking #1 nationally |
| FJC national dataset (4.9M Ch. 13 cases) | Loaded, queryable |
| 391,951 verification universe identified | Complete |
| RSS real-time monitoring (all 94 districts) | Running |
| RECAP enrichment pipeline (16,000+ cases) | Running |
| Independent GitHub cloners | 198 unique as of March 25, 2026 |
| Research design for convergence test | This document (draft) |
| Institutional PACER access | Not yet available |
| Formal study protocol | Awaiting research partner |

This design is open

Every element on this page - the phases, the thresholds, the threats, the publication strategy - is a proposal, not a decision. The purpose of this document is to show that a rigorous validation framework is possible and that the infrastructure to execute it already exists. The research design itself should be shaped by someone with the expertise to do it right.