feat[fastlanes]: add optimized 1024-bit transpose implementations #6135
Performance Regression: -44.35%
⚠️ Unknown Walltime execution environment detected
Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.
For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.
⚡ 2 improved benchmarks
❌ 9 regressed benchmarks
✅ 1251 untouched benchmarks
🆕 16 new benchmarks
⏩ 1290 skipped benchmarks [1]
⚠️ Please fix the performance issues or acknowledge them on CodSpeed.
Performance Changes
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | WallTime | u8_FoR[10M] | 5.7 µs | 10.2 µs | -44.35% |
| ❌ | WallTime | u16_FoR[10M] | 7.7 µs | 10.5 µs | -26.71% |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.1)] | 3.7 ms | 4.5 ms | -18.26% |
| 🆕 | Simulation | transpose_baseline_throughput | N/A | 2.5 ms | N/A |
| 🆕 | Simulation | transpose_best_throughput | N/A | 92.8 µs | N/A |
| 🆕 | Simulation | transpose_baseline | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | untranspose_best | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | transpose_scalar_throughput | N/A | 661 µs | N/A |
| 🆕 | Simulation | transpose_scalar | N/A | 3.4 µs | N/A |
| 🆕 | Simulation | transpose_best | N/A | 2 µs | N/A |
| 🆕 | Simulation | untranspose_scalar | N/A | 3.2 µs | N/A |
| 🆕 | Simulation | transpose_scalar_fast_throughput | N/A | 64.2 µs | N/A |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.0)] | 1.9 ms | 2.7 ms | -29.9% |
| 🆕 | Simulation | untranspose_baseline | N/A | 10.9 µs | N/A |
| ⚡ | Simulation | canonical_into_nullable[(10000, 10, 0.0)] | 528.5 µs | 445.6 µs | +18.61% |
| 🆕 | Simulation | transpose_avx2 | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | untranspose_bmi2 | N/A | 2.7 µs | N/A |
| 🆕 | Simulation | transpose_avx2_throughput | N/A | 314.3 µs | N/A |
| ⚡ | Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 4.9 ms | 4.1 ms | +19.6% |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.01)] | 2.1 ms | 3 ms | -27.53% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
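For readers checking the Efficiency column by hand: the reported values appear consistent with `BASE / HEAD - 1`, so a negative number means HEAD is slower. This formula is inferred from the numbers in the table, not taken from CodSpeed documentation, and the helper name `efficiency_pct` is hypothetical. A minimal sketch:

```python
def efficiency_pct(base: float, head: float) -> float:
    """Relative speed change from BASE to HEAD, in percent.

    Assumed formula (inferred from the table): BASE / HEAD - 1.
    Negative means HEAD takes longer than BASE.
    """
    return (base / head - 1.0) * 100.0


# Rounded timings copied from the table; each pair shares a unit (µs).
rows = {
    "u16_FoR[10M]": (7.7, 10.5),
    "canonical_into_nullable[(10000, 10, 0.0)]": (528.5, 445.6),
}

for name, (base, head) in rows.items():
    print(f"{name}: {efficiency_pct(base, head):+.2f}%")
```

Because the table shows rounded timings, recomputed values can differ from the reported percentages by a few hundredths of a point.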
Comparing claude/bitpacking-transpose-optimization-tM1U4 (17c7783) with develop (1a6ece1)
Footnotes
1. 1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in the CodSpeed app to remove them from the performance reports.