docs[fastlanes]: add transpose optimization plan and results

17c7783
Draft

feat[fastlanes]: add optimized 1024-bit transpose implementations #6135

CodSpeed HQ / CodSpeed Performance Analysis failed Jan 26, 2026 in 0s

Performance Regression: -44.35%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 9 regressed benchmarks
✅ 1251 untouched benchmarks
🆕 16 new benchmarks
⏩ 1290 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

|  | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
|  | WallTime | u8_FoR[10M] | 5.7 µs | 10.2 µs | -44.35% |
|  | WallTime | u16_FoR[10M] | 7.7 µs | 10.5 µs | -26.71% |
|  | Simulation | canonical_into_non_nullable[(10000, 100, 0.1)] | 3.7 ms | 4.5 ms | -18.26% |
| 🆕 | Simulation | transpose_baseline_throughput | N/A | 2.5 ms | N/A |
| 🆕 | Simulation | transpose_best_throughput | N/A | 92.8 µs | N/A |
| 🆕 | Simulation | transpose_baseline | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | untranspose_best | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | transpose_scalar_throughput | N/A | 661 µs | N/A |
| 🆕 | Simulation | transpose_scalar | N/A | 3.4 µs | N/A |
| 🆕 | Simulation | transpose_best | N/A | 2 µs | N/A |
| 🆕 | Simulation | untranspose_scalar | N/A | 3.2 µs | N/A |
| 🆕 | Simulation | transpose_scalar_fast_throughput | N/A | 64.2 µs | N/A |
|  | Simulation | canonical_into_non_nullable[(10000, 100, 0.0)] | 1.9 ms | 2.7 ms | -29.9% |
| 🆕 | Simulation | untranspose_baseline | N/A | 10.9 µs | N/A |
|  | Simulation | canonical_into_nullable[(10000, 10, 0.0)] | 528.5 µs | 445.6 µs | +18.61% |
| 🆕 | Simulation | transpose_avx2 | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | untranspose_bmi2 | N/A | 2.7 µs | N/A |
| 🆕 | Simulation | transpose_avx2_throughput | N/A | 314.3 µs | N/A |
|  | Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 4.9 ms | 4.1 ms | +19.6% |
|  | Simulation | canonical_into_non_nullable[(10000, 100, 0.01)] | 2.1 ms | 3 ms | -27.53% |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
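The Efficiency column is not defined in this report. The figures shown appear consistent with a relative change computed as BASE / HEAD - 1, so a slower HEAD yields a negative value. The sketch below works under that assumption; the `efficiency` helper is hypothetical and not part of CodSpeed's API, and because the report rounds the displayed timings, recomputed values only approximate the reported percentages.

```python
def efficiency(base: float, head: float) -> float:
    """Relative change in percent, assuming efficiency = BASE / HEAD - 1.

    Negative means HEAD is slower than BASE (a regression).
    Both timings must be in the same unit.
    """
    return (base / head - 1.0) * 100.0


# (BASE, HEAD) timings copied from the report; each pair shares a unit.
rows = {
    "u8_FoR[10M]": (5.7, 10.2),                                   # µs
    "u16_FoR[10M]": (7.7, 10.5),                                  # µs
    "canonical_into_nullable[(10000, 10, 0.0)]": (528.5, 445.6),  # µs
}

for name, (base, head) in rows.items():
    # Results land close to the reported -44.35%, -26.71% and +18.61%,
    # differing slightly because the displayed timings are rounded.
    print(f"{name}: {efficiency(base, head):+.2f}%")
```

Read this way, the headline -44.35% for `u8_FoR[10M]` corresponds to HEAD taking roughly 1.8 times as long as BASE, not to a 44% increase in runtime.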


Comparing claude/bitpacking-transpose-optimization-tM1U4 (17c7783) with develop (1a6ece1)

Footnotes

  1. 1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.