feat[fastlanes]: add optimized 1024-bit transpose implementations #6135

Performance Regression: -44.35%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 9 regressed benchmarks
✅ 1251 untouched benchmarks
🆕 16 new benchmarks
⏩ 1290 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | WallTime | `u8_FoR[10M]` | 5.7 µs | 10.2 µs | -44.35% |
| ❌ | WallTime | `u16_FoR[10M]` | 7.7 µs | 10.5 µs | -26.71% |
| ❌ | Simulation | `canonical_into_non_nullable[(10000, 100, 0.1)]` | 3.7 ms | 4.5 ms | -18.26% |
| 🆕 | Simulation | `transpose_baseline_throughput` | N/A | 2.5 ms | N/A |
| 🆕 | Simulation | `transpose_best_throughput` | N/A | 92.8 µs | N/A |
| 🆕 | Simulation | `transpose_baseline` | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | `untranspose_best` | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | `transpose_scalar_throughput` | N/A | 661 µs | N/A |
| 🆕 | Simulation | `transpose_scalar` | N/A | 3.4 µs | N/A |
| 🆕 | Simulation | `transpose_best` | N/A | 2 µs | N/A |
| 🆕 | Simulation | `untranspose_scalar` | N/A | 3.2 µs | N/A |
| 🆕 | Simulation | `transpose_scalar_fast_throughput` | N/A | 64.2 µs | N/A |
| ❌ | Simulation | `canonical_into_non_nullable[(10000, 100, 0.0)]` | 1.9 ms | 2.7 ms | -29.9% |
| 🆕 | Simulation | `untranspose_baseline` | N/A | 10.9 µs | N/A |
| ⚡ | Simulation | `canonical_into_nullable[(10000, 10, 0.0)]` | 528.5 µs | 445.6 µs | +18.61% |
| 🆕 | Simulation | `transpose_avx2` | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | `untranspose_bmi2` | N/A | 2.7 µs | N/A |
| 🆕 | Simulation | `transpose_avx2_throughput` | N/A | 314.3 µs | N/A |
| ⚡ | Simulation | `canonical_into_nullable[(10000, 100, 0.0)]` | 4.9 ms | 4.1 ms | +19.6% |
| ❌ | Simulation | `canonical_into_non_nullable[(10000, 100, 0.01)]` | 2.1 ms | 3 ms | -27.53% |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
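The Efficiency column is easy to misread, so here is a minimal sketch of how the percentages appear to be derived. The formula `BASE / HEAD - 1` is an assumption inferred from the rows shown in this report, not taken from CodSpeed's documentation, and the helper name `efficiency` is hypothetical. The inputs are the rounded times displayed in the table, so the results only match the reported values to within a fraction of a percentage point.

```python
def efficiency(base: float, head: float) -> float:
    """Relative change of HEAD against BASE, as a fraction.

    Assumed formula (inferred from the report's rows): base / head - 1.
    Negative means HEAD is slower, positive means HEAD is faster.
    """
    if head <= 0:
        raise ValueError("head time must be positive")
    return base / head - 1.0


# (BASE, HEAD) times in microseconds, copied from the rounded table values.
rows = {
    "u8_FoR[10M]": (5.7, 10.2),
    "u16_FoR[10M]": (7.7, 10.5),
    "canonical_into_nullable[(10000, 10, 0.0)]": (528.5, 445.6),
}

for name, (base, head) in rows.items():
    print(f"{name}: {efficiency(base, head):+.2%}")
```

Under this reading the change is measured relative to `HEAD`, so the -44.35% headline for `u8_FoR[10M]` corresponds to HEAD taking roughly 1.79x as long as BASE (10.2 µs against 5.7 µs), not to a 44% longer runtime.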

Comparing `claude/bitpacking-transpose-optimization-tM1U4` (17c7783) with `develop` (1a6ece1)

¹ 1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

docs[fastlanes]: add transpose optimization plan and results
