LongBench v2 Evaluation

Full benchmark report for Otsofy adaptive compression evaluated on GPT-4o-mini.

Date: December 15, 2025
Model: GPT-4o-mini
Samples: 230 questions
Total API calls: 126,500

Results Summary

Cutoff     Mean Accuracy   Std Dev   Δ vs Baseline   Token Reduction   Significance
baseline   28.2%           ±0.6%     –               –                 –
0.1        28.4%           ±0.7%     +0.2%           10.3%             no
0.2        27.7%           ±0.7%     -0.5%           15.5%             yes (worse)
0.3        29.2%           ±1.0%     +1.0%           23.4%             yes (better)
0.4        29.1%           ±0.7%     +0.9%           24.6%             yes (better)
0.5        28.9%           ±0.7%     +0.7%           31.4%             yes (better)
0.6        28.5%           ±0.9%     +0.3%           35.6%             yes (better)
0.7        27.8%           ±1.0%     -0.4%           42.4%             yes (worse)
0.8        29.0%           ±0.8%     +0.8%           52.4%             yes (better)
0.9        29.2%           ±0.8%     +1.1%           66.1%             yes (better)
0.95       27.7%           ±0.6%     -0.5%           77.4%             yes (worse)

Significance: Two-sample t-test, |t| > 2 (~95% confidence)
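The significance criterion above can be sketched in a few lines of standard-library Python (Welch's two-sample t statistic). The per-run accuracy lists below are made-up placeholders, not the benchmark's actual 50-run data:

```python
import math

def welch_t(xs, ys):
    """Two-sample (Welch's) t statistic for two lists of per-run accuracies."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # Unbiased sample variances
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Hypothetical per-run accuracies; the real evaluation uses 50 runs per arm
baseline  = [0.280, 0.275, 0.285, 0.282, 0.279]
cutoff_09 = [0.292, 0.295, 0.290, 0.293, 0.291]
t = welch_t(cutoff_09, baseline)
print(f"t = {t:.2f}, significant at ~95%: {abs(t) > 2}")
```

The report's |t| > 2 threshold approximates the usual 1.96 critical value for ~95% confidence at these sample sizes.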

Key Observations

  1. Six cutoffs showed statistically significant accuracy improvements over baseline (0.3, 0.4, 0.5, 0.6, 0.8, 0.9).
  2. Best accuracy improvement: the 0.9 cutoff, at +1.1% with 66% token reduction.
  3. Non-monotonic relationship: some cutoffs (0.2, 0.7) underperform their neighbors despite removing fewer tokens. This is likely an artifact of how specific thresholds interact with the importance-score distribution in this dataset, rather than a fundamental property of the compression; different benchmarks may show slight dips at different thresholds.
  4. Aggressive compression (0.95) removes too much context and loses accuracy.

Recommended Configurations

importance_cutoff = 0.3 (conservative: preserves more context)

  • +1.0% accuracy improvement
  • 23% token reduction
  • Lower risk of removing important tokens

importance_cutoff = 0.9 (aggressive: maximum cost savings)

  • +1.1% accuracy improvement
  • 66% token reduction
  • Best for cost-sensitive workloads

Both configurations showed statistically significant accuracy improvements across 50 runs.

Bootstrap Analysis

10,000 bootstrap iterations. Baseline: 50 runs, mean accuracy 0.2817.

Config   Mean     Diff      95% CI                P(better)   Verdict
0.1      0.2842   +0.0024   [-0.0001, +0.0050]     96.91%     not significant
0.2      0.2770   -0.0048   [-0.0074, -0.0022]      0.01%     worse
0.3      0.2917   +0.0100   [+0.0070, +0.0132]    100.00%     better
0.4      0.2907   +0.0090   [+0.0063, +0.0116]    100.00%     better
0.5      0.2890   +0.0073   [+0.0047, +0.0100]    100.00%     better
0.6      0.2850   +0.0033   [+0.0003, +0.0062]     98.42%     better
0.7      0.2781   -0.0037   [-0.0070, -0.0004]      1.28%     worse
0.8      0.2901   +0.0084   [+0.0056, +0.0114]    100.00%     better
0.9      0.2924   +0.0107   [+0.0079, +0.0135]    100.00%     better
0.95     0.2768   -0.0049   [-0.0073, -0.0025]      0.01%     worse
Significantly BETTER (6):

  • 0.3: +0.0100 (100.0% prob)
  • 0.4: +0.0090 (100.0% prob)
  • 0.5: +0.0073 (100.0% prob)
  • 0.6: +0.0033 (98.4% prob)
  • 0.8: +0.0084 (100.0% prob)
  • 0.9: +0.0107 (100.0% prob)

Significantly WORSE (3):

  • 0.2: -0.0048 (0.0% prob)
  • 0.7: -0.0037 (1.3% prob)
  • 0.95: -0.0049 (0.0% prob)
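The bootstrap procedure can be sketched as follows: resample the per-run accuracies with replacement, take the percentile confidence interval of the difference in means, and report the fraction of resamples where the configuration beats baseline. This matches the report's description (10,000 iterations over 50 runs per arm), but the data passed in is illustrative:

```python
import random
import statistics

def bootstrap_diff(config_runs, baseline_runs, iters=10_000, seed=0):
    """Bootstrap the difference in mean accuracy (config minus baseline).

    Returns (ci_low, ci_high, p_better): a 95% percentile interval for the
    difference and the probability that the config's resampled mean is higher.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        c = [rng.choice(config_runs) for _ in config_runs]
        b = [rng.choice(baseline_runs) for _ in baseline_runs]
        diffs.append(statistics.mean(c) - statistics.mean(b))
    diffs.sort()
    ci_low = diffs[int(0.025 * iters)]        # 2.5th percentile
    ci_high = diffs[int(0.975 * iters) - 1]   # 97.5th percentile
    p_better = sum(d > 0 for d in diffs) / iters
    return ci_low, ci_high, p_better
```

A configuration is "significantly better" under this scheme when the whole 95% interval lies above zero, which is what the verdict lists above encode.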

Methodology

  • Dataset: LongBench v2 multiple-choice questions (paper)
  • Sampling: 230 questions stratified from 503, filtered to ≤100k tokens
  • Compression: Otsofy adaptive compression with importance_cutoff parameter
  • Token counting: tiktoken (gpt-4o-mini encoding)
  • Runs: 50 independent evaluations per configuration
  • Temperature: 0 (near-deterministic)
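The stratified-sampling step (230 questions drawn from 503) might look like the sketch below; the `key` grouping field and the proportional-allocation scheme are assumptions for illustration, not the benchmark's actual code:

```python
import random
from collections import defaultdict

def stratified_sample(questions, n, key, seed=0):
    """Proportional stratified sample of about `n` items, stratified by `key`.

    Each stratum contributes a share of the sample proportional to its size.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[q[key]].append(q)
    sample = []
    for group in strata.values():
        # Proportional allocation; rounding may leave the total slightly off n
        k = round(n * len(group) / len(questions))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

In the report's setup, the ≤100k-token filter (via tiktoken) would be applied before this sampling step, so every stratum only contains questions that fit the context budget.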

Limitations

  • Results specific to GPT-4o-mini; may differ for other models
  • LongBench v2 subset (230/503 questions due to token limits)
  • Effect sizes are small (~1%); practical significance depends on use case

Token counts calculated with tiktoken. Compression performed using Otsofy.