LongBench v2 Evaluation

Full benchmark report for Otsofy adaptive compression evaluated on GPT-4o-mini.

Date: December 15, 2025
Model: GPT-4o-mini
Samples: 230 questions
Total API calls: 126,500

Results Summary

Cutoff     Mean Accuracy   Std Dev   Δ vs Baseline   Token Reduction   Significance
baseline   28.2%           ±0.6%     –               –                 –
0.1        28.4%           ±0.7%     +0.2%           10.3%             no
0.2        27.7%           ±0.7%     -0.5%           15.5%             yes (worse)
0.3        29.2%           ±1.0%     +1.0%           23.4%             yes (better)
0.4        29.1%           ±0.7%     +0.9%           24.6%             yes (better)
0.5        28.9%           ±0.7%     +0.7%           31.4%             yes (better)
0.6        28.5%           ±0.9%     +0.3%           35.6%             yes (better)
0.7        27.8%           ±1.0%     -0.4%           42.4%             yes (worse)
0.8        29.0%           ±0.8%     +0.8%           52.4%             yes (better)
0.9        29.2%           ±0.8%     +1.1%           66.1%             yes (better)
0.95       27.7%           ±0.6%     -0.5%           77.4%             yes (worse)

Significance: Two-sample t-test, |t| > 2 (~95% confidence)
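The significance criterion above can be sketched in a few lines of standard-library Python (Welch's two-sample t statistic). The per-run accuracy lists below are made-up placeholders, not the benchmark's actual 50-run data:

```python
import math

def welch_t(xs, ys):
    """Two-sample (Welch's) t statistic for two lists of per-run accuracies."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # Unbiased sample variances
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Hypothetical per-run accuracies; the real evaluation uses 50 runs per arm
baseline  = [0.280, 0.275, 0.285, 0.282, 0.279]
cutoff_09 = [0.292, 0.295, 0.290, 0.293, 0.291]
t = welch_t(cutoff_09, baseline)
print(f"t = {t:.2f}, significant at ~95%: {abs(t) > 2}")
```

The report's |t| > 2 threshold approximates the usual 1.96 critical value for ~95% confidence at these sample sizes.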

Key Observations

  1. Six cutoffs showed statistically significant accuracy improvements over baseline (0.3, 0.4, 0.5, 0.6, 0.8, 0.9).
  2. Best accuracy improvement: the 0.9 cutoff, at +1.1% with 66% token reduction.
  3. Non-monotonic relationship: some cutoffs (0.2, 0.7) underperform their neighbors despite removing fewer tokens. This is likely an artifact of how specific thresholds interact with the importance-score distribution in this dataset, rather than a fundamental property of the compression; different benchmarks may show slight dips at different thresholds.
  4. Aggressive compression (0.95) removes too much context and loses accuracy.

Recommended Configurations

importance_cutoff = 0.3 (conservative: preserves more context)

  • +1.0% accuracy improvement
  • 23% token reduction
  • Lower risk of removing important tokens

importance_cutoff = 0.9 (aggressive: maximum cost savings)

  • +1.1% accuracy improvement
  • 66% token reduction
  • Best for cost-sensitive workloads

Both configurations showed statistically significant accuracy improvements across 50 runs.

Bootstrap Analysis

10,000 bootstrap iterations. Baseline: 50 runs, mean accuracy 0.2817.

Config   Mean     Diff      95% CI                P(better)   Verdict
0.1      0.2842   +0.0024   [-0.0001, +0.0050]     96.91%     not significant
0.2      0.2770   -0.0048   [-0.0074, -0.0022]      0.01%     worse
0.3      0.2917   +0.0100   [+0.0070, +0.0132]    100.00%     better
0.4      0.2907   +0.0090   [+0.0063, +0.0116]    100.00%     better
0.5      0.2890   +0.0073   [+0.0047, +0.0100]    100.00%     better
0.6      0.2850   +0.0033   [+0.0003, +0.0062]     98.42%     better
0.7      0.2781   -0.0037   [-0.0070, -0.0004]      1.28%     worse
0.8      0.2901   +0.0084   [+0.0056, +0.0114]    100.00%     better
0.9      0.2924   +0.0107   [+0.0079, +0.0135]    100.00%     better
0.95     0.2768   -0.0049   [-0.0073, -0.0025]      0.01%     worse
Significantly BETTER (6):

  • 0.3: +0.0100 (100.0% prob)
  • 0.4: +0.0090 (100.0% prob)
  • 0.5: +0.0073 (100.0% prob)
  • 0.6: +0.0033 (98.4% prob)
  • 0.8: +0.0084 (100.0% prob)
  • 0.9: +0.0107 (100.0% prob)

Significantly WORSE (3):

  • 0.2: -0.0048 (0.0% prob)
  • 0.7: -0.0037 (1.3% prob)
  • 0.95: -0.0049 (0.0% prob)
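The bootstrap procedure can be sketched as follows: resample the per-run accuracies with replacement, take the percentile confidence interval of the difference in means, and report the fraction of resamples where the configuration beats baseline. This matches the report's description (10,000 iterations over 50 runs per arm), but the data passed in is illustrative:

```python
import random
import statistics

def bootstrap_diff(config_runs, baseline_runs, iters=10_000, seed=0):
    """Bootstrap the difference in mean accuracy (config minus baseline).

    Returns (ci_low, ci_high, p_better): a 95% percentile interval for the
    difference and the probability that the config's resampled mean is higher.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        c = [rng.choice(config_runs) for _ in config_runs]
        b = [rng.choice(baseline_runs) for _ in baseline_runs]
        diffs.append(statistics.mean(c) - statistics.mean(b))
    diffs.sort()
    ci_low = diffs[int(0.025 * iters)]        # 2.5th percentile
    ci_high = diffs[int(0.975 * iters) - 1]   # 97.5th percentile
    p_better = sum(d > 0 for d in diffs) / iters
    return ci_low, ci_high, p_better
```

A configuration is "significantly better" under this scheme when the whole 95% interval lies above zero, which is what the verdict lists above encode.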

Methodology

  • Dataset: LongBench v2 multiple-choice questions (paper)
  • Sampling: 230 questions stratified from 503, filtered to ≤100k tokens
  • Compression: Otsofy adaptive compression with importance_cutoff parameter
  • Token counting: tiktoken (gpt-4o-mini encoding)
  • Runs: 50 independent evaluations per configuration
  • Temperature: 0 (near-deterministic)
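The stratified-sampling step (230 questions drawn from 503) might look like the sketch below; the `key` grouping field and the proportional-allocation scheme are assumptions for illustration, not the benchmark's actual code:

```python
import random
from collections import defaultdict

def stratified_sample(questions, n, key, seed=0):
    """Proportional stratified sample of about `n` items, stratified by `key`.

    Each stratum contributes a share of the sample proportional to its size.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[q[key]].append(q)
    sample = []
    for group in strata.values():
        # Proportional allocation; rounding may leave the total slightly off n
        k = round(n * len(group) / len(questions))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

In the report's setup, the ≤100k-token filter (via tiktoken) would be applied before this sampling step, so every stratum only contains questions that fit the context budget.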

Limitations

  • Results specific to GPT-4o-mini; may differ for other models
  • LongBench v2 subset (230/503 questions due to token limits)
  • Effect sizes are small (~1%); practical significance depends on use case

Token counts calculated with tiktoken. Compression performed using Otsofy.