# LongBench v2 Evaluation

Full benchmark report for Otsofy adaptive compression evaluated on GPT-4o-mini.
## Results Summary
| Cutoff | Mean Accuracy | Std Dev | Δ vs Baseline | Token Reduction | Significance |
|---|---|---|---|---|---|
| baseline | 28.2% | ±0.6% | — | — | |
| 0.1 | 28.4% | ±0.7% | +0.2% | 10.3% | |
| 0.2 | 27.7% | ±0.7% | -0.5% | 15.5% | |
| 0.3 | 29.2% | ±1.0% | +1.0% | 23.4% | ✓ |
| 0.4 | 29.1% | ±0.7% | +0.9% | 24.6% | ✓ |
| 0.5 | 28.9% | ±0.7% | +0.7% | 31.4% | ✓ |
| 0.6 | 28.5% | ±0.9% | +0.3% | 35.6% | ✓ |
| 0.7 | 27.8% | ±1.0% | -0.4% | 42.4% | |
| 0.8 | 29.0% | ±0.8% | +0.8% | 52.4% | ✓ |
| 0.9 | 29.2% | ±0.8% | +1.1% | 66.1% | ✓ |
| 0.95 | 27.7% | ±0.6% | -0.5% | 77.4% | |
Significance: Two-sample t-test, |t| > 2 (~95% confidence)
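The significance test can be sketched directly from the summary statistics in the table. This is a minimal illustration, not the evaluation harness itself, and the rounded means and standard deviations make the resulting t value approximate:

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch two-sample t statistic computed from summary statistics."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se

# 0.9 cutoff (29.2% ± 0.8%) vs baseline (28.2% ± 0.6%), 50 runs each
t = welch_t(0.292, 0.008, 50, 0.282, 0.006, 50)
significant = abs(t) > 2  # the report's threshold (~95% confidence)
```

With these rounded inputs, t comes out around 7, comfortably past the |t| > 2 threshold.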
## Key Observations

- Six cutoffs showed statistically significant accuracy improvements over baseline (0.3, 0.4, 0.5, 0.6, 0.8, 0.9).
- Best accuracy improvement: the 0.9 cutoff, at +1.1% with 66% token reduction.
- Non-monotonic relationship: some cutoffs (0.2, 0.7) underperform their neighbors despite removing fewer tokens. This is likely an artifact of how specific thresholds interact with the importance score distribution in this dataset, rather than a fundamental property of compression; different benchmarks may show slight dips at different thresholds.
- Aggressive compression (0.95) removes too much context and loses accuracy.
## Recommended Configurations

**Conservative: `importance_cutoff = 0.3`** (preserves more context)

- +1.0% accuracy improvement
- 23% token reduction
- Lower risk of removing important tokens

**Aggressive: `importance_cutoff = 0.9`** (maximum cost savings)

- +1.1% accuracy improvement
- 66% token reduction
- Best for cost-sensitive workloads
Both configurations showed statistically significant accuracy improvements across 50 runs.
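To put the token reduction in cost terms, a back-of-the-envelope calculation follows. The per-million-token input price used here is an illustrative assumption, not a figure from this report:

```python
# ASSUMPTION: illustrative input price per 1M tokens, not from the report.
PRICE_PER_1M_INPUT_USD = 0.15

def monthly_savings_usd(input_tokens_per_month, token_reduction):
    """USD saved per month given a fractional token reduction (e.g. 0.66)."""
    full_cost = input_tokens_per_month / 1_000_000 * PRICE_PER_1M_INPUT_USD
    return full_cost * token_reduction

# 0.9 cutoff (66% token reduction) at 1B input tokens/month
saving = monthly_savings_usd(1_000_000_000, 0.66)  # ≈ $99 of $150/month
```

At these assumed rates, the aggressive configuration cuts roughly two thirds of input spend while slightly improving accuracy.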
## Bootstrap Analysis
10,000 bootstrap iterations. Baseline: 50 runs, mean accuracy 0.2817.
| Config | Mean | Diff | 95% CI | P(better) | Verdict |
|---|---|---|---|---|---|
| 0.1 | 0.2842 | +0.0024 | [-0.0001, +0.0050] | 96.91% | Not significant |
| 0.2 | 0.2770 | -0.0048 | [-0.0074, -0.0022] | 0.01% | Worse |
| 0.3 | 0.2917 | +0.0100 | [+0.0070, +0.0132] | 100.00% | Better |
| 0.4 | 0.2907 | +0.0090 | [+0.0063, +0.0116] | 100.00% | Better |
| 0.5 | 0.2890 | +0.0073 | [+0.0047, +0.0100] | 100.00% | Better |
| 0.6 | 0.2850 | +0.0033 | [+0.0003, +0.0062] | 98.42% | Better |
| 0.7 | 0.2781 | -0.0037 | [-0.0070, -0.0004] | 1.28% | Worse |
| 0.8 | 0.2901 | +0.0084 | [+0.0056, +0.0114] | 100.00% | Better |
| 0.9 | 0.2924 | +0.0107 | [+0.0079, +0.0135] | 100.00% | Better |
| 0.95 | 0.2768 | -0.0049 | [-0.0073, -0.0025] | 0.01% | Worse |

Verdicts mark configurations whose 95% CI excludes zero.
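The bootstrap procedure above can be sketched as a two-sample resampling of run-level accuracies. This is an illustrative version under stated assumptions, not the report's actual harness, and the demo data below is synthetic:

```python
import random

def bootstrap_diff(treated, baseline, iters=10_000, seed=0):
    """Two-sample bootstrap: resample each group's run-level accuracies
    with replacement, record the difference of means, and return the
    95% CI plus the fraction of iterations favoring `treated`."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        t = rng.choices(treated, k=len(treated))
        b = rng.choices(baseline, k=len(baseline))
        diffs.append(sum(t) / len(t) - sum(b) / len(b))
    diffs.sort()
    ci = (diffs[int(0.025 * iters)], diffs[int(0.975 * iters)])
    p_better = sum(d > 0 for d in diffs) / iters
    return ci, p_better

# Demo on synthetic run-level accuracies (NOT the report's raw data)
demo = random.Random(1)
treated = [demo.gauss(0.292, 0.008) for _ in range(50)]
baseline = [demo.gauss(0.282, 0.006) for _ in range(50)]
(lo, hi), p_better = bootstrap_diff(treated, baseline, iters=2000)
```

A clear separation between the groups, as in the 0.9-cutoff results, yields a P(better) near 1 and a CI that excludes zero.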
## Methodology

- Dataset: LongBench v2 multiple-choice questions (paper)
- Sampling: 230 questions stratified from 503, filtered to ≤100k tokens
- Compression: Otsofy adaptive compression with the `importance_cutoff` parameter
- Token counting: tiktoken (gpt-4o-mini encoding)
- Runs: 50 independent evaluations per configuration
- Temperature: 0 (near-deterministic)
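The stratified sampling step might look roughly like the following. The `stratified_sample` helper and the question format are hypothetical (the report doesn't show its sampling code), and proportional rounding can make the total deviate slightly from the target:

```python
import random
from collections import defaultdict

def stratified_sample(questions, n_total, stratum_of, seed=0):
    """Proportionally sample n_total questions across strata
    (e.g. LongBench v2 task categories). Hypothetical helper:
    rounding per stratum can shift the total slightly."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[stratum_of(q)].append(q)
    sample = []
    for _, items in sorted(strata.items()):
        k = round(n_total * len(items) / len(questions))
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample

# Toy example: 90 questions across 3 equal categories, target 30
toy = [{"id": i, "category": i % 3} for i in range(90)]
picked = stratified_sample(toy, 30, lambda q: q["category"])
```

In the report's setting, the ≤100k-token filter would be applied before sampling, so each stratum contains only questions that fit the context budget.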
## Limitations

- Results are specific to GPT-4o-mini and may differ for other models
- LongBench v2 subset (230 of 503 questions, due to token limits)
- Effect sizes are small (~1%); practical significance depends on use case
Token counts calculated with tiktoken. Compression performed using Otsofy.