Preprint · 46 pages · ~45 min read

Measuring Cross-Channel Disagreement in LLM Evaluation

CCD-Bench: frozen diagnostic corpus for parser, judge, and human disagreement on identical transcripts

nirav.rohra@utdallas.edu

Abstract

LLM evaluation often collapses heterogeneous evidence into scalars or judge scores, hiding when parsers, judges, and humans disagree on the same transcript. We release CCD-Bench, a frozen diagnostic corpus (eighteen behavioral suites, twelve heterogeneous checkpoints) with joint parser outputs, multi-judge scores, and replayable rows; suites were frozen before aggregate analysis with null and error rows retained. The central empirical claim is finite-population and protocol-specific: on n=642 suite-stratified transcript rows after QC (all eighteen suites), parser-only violation flags align far better with blinded human-majority labels than the frozen three-judge Llama panel at τ=0, while transcript-preserving judge replacements improve recall but leave substantial false negatives unless parsers enter fusion rules. We publish harness code, prompts, manifests, and frozen JSON trees for reproduction.

Contributions

  • Conceptual. Cross-channel disagreement (CCD), an evaluation inconsistency index (EII), and an account of when scalar fusion cannot preserve incompatible channel orderings.
  • Apparatus. Per-run audit ledger with integrity hooks for identifier-aligned parser, judge, and human channels.
  • CCD-Bench. Eighteen suites spanning seven recurring risk constructs; five flagship suites anchor the judge-panel audit; stratified blinded human coding on n=642 pooled rows.
  • Evidence. Judge misses concentrate in parser-positive cells humans usually validate; replacing judges moves recall but does not eliminate disagreement with humans or parsers.
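
The parser, judge, and human contingency behind the Evidence item can be sketched as follows. This is a minimal illustration, not the released harness: the `Row` schema, its field names, and the helper functions are hypothetical, and the toy rows are not CCD-Bench data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One transcript row with three binary violation flags (hypothetical schema)."""
    row_id: str
    parser: bool  # parser flagged a violation
    judge: bool   # judge panel flagged a violation
    human: bool   # blinded human-majority label


def contingency(rows):
    """Count rows in each (parser, judge, human) cell."""
    return Counter((r.parser, r.judge, r.human) for r in rows)


def parser_pos_judge_neg(rows):
    """Return (cell size, human-validated count) for parser-positive, judge-negative rows."""
    cell = [r for r in rows if r.parser and not r.judge]
    return len(cell), sum(r.human for r in cell)


# Toy rows for illustration only; these are not CCD-Bench data.
rows = [
    Row("a", True, False, True),
    Row("b", True, False, False),
    Row("c", True, True, True),
    Row("d", False, False, False),
    Row("e", True, False, True),
]

size, validated = parser_pos_judge_neg(rows)
print(f"parser+/judge- rows: {size}, human-validated: {validated}")
```

Keeping all three flags on the same identifier-aligned row is what makes the cell counts auditable; a scalar score per row would discard which channel disagreed.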

Headline results (n = 642)

Finite-population, protocol-specific metrics on stratified blinded human-majority labels across all eighteen CCD-Bench suites. Full tables and BCa intervals in the paper.

  • Parser-only recall vs. human majority: 87.3% (95% Wilson CI [83.2, 90.5] on n=642 stratified rows).
  • Judge-only recall vs. human majority: 27.8% (frozen three-judge Llama panel at τ=0; CI [23.2, 33.0]).
  • Parser-positive / judge-negative rows: 243, of which 203 (83.5%) were validated as violations by blinded human-majority labels.
  • Parser ∨ judge fusion FNR: 7.6% (illustrative fusion rule; it gains recall at the cost of a higher false-positive rate).
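
As a rough sketch of how such figures can be computed, the snippet below implements the standard Wilson score interval, recall against human-majority labels, and a parser ∨ judge fusion rule. The function names and toy flags are assumptions made for illustration; the paper's own scripts and exact denominators may differ.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def recall(flags, labels):
    """Fraction of human-positive rows that the channel flagged."""
    hits = [f for f, y in zip(flags, labels) if y]
    if not hits:
        raise ValueError("no human-positive rows")
    return sum(hits) / len(hits)


def or_fusion(parser_flags, judge_flags):
    """Flag a row if either channel flags it (parser OR judge)."""
    return [p or j for p, j in zip(parser_flags, judge_flags)]


# Share of parser-positive / judge-negative rows validated by humans (from the text).
share = 203 / 243
low, high = wilson_interval(203, 243)
print(f"validated share: {100 * share:.1f}%")

# Toy flags for illustration only; all four rows are human-positive.
parser = [True, True, False, False]
judge = [False, True, True, False]
human = [True, True, True, True]
fused = or_fusion(parser, judge)
fnr = 1 - recall(fused, human)
print(f"fused recall: {recall(fused, human):.2f}, FNR: {fnr:.2f}")
```

An OR rule can only add flags, so its recall is at least that of either channel alone; the same property means it inherits every false positive from both channels.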

CCD-Bench coverage

  • Authority conflict
  • Prompt injection
  • False-premise honesty
  • Information control / redaction
  • Autonomy / continuity framing
  • Contained dual-use code
  • Multilingual / framing robustness

CCD-Bench comprises eighteen frozen behavioral suites and twelve heterogeneous checkpoints, with joint parser outputs and multi-judge scores on replayable rows. Suites were frozen before aggregate analysis; null, refusal, and error rows were retained. The corpus is intended as wind-tunnel diagnostic metrology, not natural-user prevalence estimation.

Paper outline

  1. Introduction. Why scalar benchmarks hide cross-channel splits on identical transcripts.
  2. Framework & CCD. Parser vs. judge channels, EII, and positioning vs. prior agent benchmarks.
  3. CCD-Bench protocol. Eighteen frozen suites, twelve checkpoints, anti-cherry-picking design.
  4. Empirical results. Parser–judge–human contingency tables on n=642 stratified rows.
  5. Judge ablations. Transcript-preserving rescoring with parser-aware and API judges.
  6. Reproducibility. Public harness, frozen JSON trees, and table-regeneration scripts.

How to cite

@misc{rohra2026ccd,
  author       = {Nirav Rohra},
  title        = {Measuring Cross-Channel Disagreement in LLM Evaluation},
  year         = {2026},
  howpublished = {Preprint},
  institution  = {University of Texas at Dallas},
  url          = {https://niravrohra.com/research}
}

Reproducibility artifacts and harness details are described in the full paper (Section 8).

About the author

Nirav Rohra researches mechanistic interpretability and LLM evaluation safety metrology. Founder of Honrly, AI/cyber intern at Zebra Technologies, CS (AI) at UT Dallas. More work and contact details are on the main portfolio.