Skip to content

Conversation

@romshark
Copy link
Contributor

@romshark romshark commented Feb 16, 2025

Handling larger bitsets in 8-batches is more efficient on modern CPUs.
I assume it's related to instruction-level parallelism.
This technique can effectively be applied to most bitset methods and functions.

goos: darwin
goarch: arm64
pkg: github.com/KernelPryanic/bitmask
cpu: Apple M1 Max
                    │   old.txt   │              new.txt               │
                    │   sec/op    │   sec/op     vs base               │
BitSet_Xor/empty-10   2.498n ± 4%   2.493n ± 3%        ~ (p=0.372 n=6)
BitSet_Xor/5-10       2.491n ± 1%   2.492n ± 1%        ~ (p=0.729 n=6)
BitSet_Xor/10k-10     76.10n ± 1%   49.79n ± 1%  -34.57% (p=0.002 n=6)
BitSet_Xor/1m-10      8.453µ ± 0%   5.112µ ± 1%  -39.52% (p=0.002 n=6)
geomean               44.73n        35.46n       -20.72%

                    │   old.txt    │              new.txt               │
                    │     B/op     │    B/op     vs base                │
BitSet_Xor/empty-10   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/5-10       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/10k-10     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/1m-10      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                          ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                    │   old.txt    │              new.txt               │
                    │  allocs/op   │ allocs/op   vs base                │
BitSet_Xor/empty-10   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/5-10       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/10k-10     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/1m-10      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                          ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Handling larger bitsets in 8-batches is more efficient on modern CPUs.
I assume it's related to instruction-level parallelism.

goos: darwin
goarch: arm64
pkg: github.com/KernelPryanic/bitmask
cpu: Apple M1 Max
                    │   old.txt   │              new.txt               │
                    │   sec/op    │   sec/op     vs base               │
BitSet_Xor/empty-10   2.498n ± 4%   2.493n ± 3%        ~ (p=0.372 n=6)
BitSet_Xor/5-10       2.491n ± 1%   2.492n ± 1%        ~ (p=0.729 n=6)
BitSet_Xor/10k-10     76.10n ± 1%   49.79n ± 1%  -34.57% (p=0.002 n=6)
BitSet_Xor/1m-10      8.453µ ± 0%   5.112µ ± 1%  -39.52% (p=0.002 n=6)
geomean               44.73n        35.46n       -20.72%

                    │   old.txt    │              new.txt               │
                    │     B/op     │    B/op     vs base                │
BitSet_Xor/empty-10   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/5-10       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/10k-10     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/1m-10      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                          ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                    │   old.txt    │              new.txt               │
                    │  allocs/op   │ allocs/op   vs base                │
BitSet_Xor/empty-10   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/5-10       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/10k-10     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
BitSet_Xor/1m-10      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                          ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean
@KernelPryanic KernelPryanic merged commit 63daa84 into KernelPryanic:main Feb 16, 2025
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants