
@salvatoredipietro

We would like to add support for using the sb instruction instead of isb on the ARMv9 architecture.
Based on our microbenchmark (patch attached), sb is ~30% faster than isb (ratio=1.710:1), and the change does not seem to introduce any regression in the spin_cpu_spinwait() function (isb_spin=8740725us with the patch vs. standard_spin=8739722us without it).
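
For readers unfamiliar with the idea, the following is a minimal sketch of how a spin-wait step could choose between sb and isb at run time. It is not the code in the attached patch: the helper names (`cpu_has_sb_sketch`, `spin_pause_sketch`, `spin_n_sketch`) are hypothetical, detection via `getauxval(AT_HWCAP)` and `HWCAP_SB` is an assumption that only applies to Linux on AArch64, and on other targets the sketch falls back to a plain compiler barrier.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_SB
#define HWCAP_SB (1UL << 29) /* assumed value, as in the Linux arm64 uapi headers */
#endif
#endif

/* Hypothetical helper: does the CPU advertise the SB instruction? */
static bool cpu_has_sb_sketch(void) {
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_SB) != 0;
#else
    return false; /* unknown platform: assume SB is unavailable */
#endif
}

/* Hypothetical helper: one spin-wait step using sb or isb. */
static inline void spin_pause_sketch(bool use_sb) {
#if defined(__aarch64__)
    if (use_sb) {
        /* Raw encoding of SB so that older assemblers accept it. */
        __asm__ volatile(".inst 0xd50330ff" ::: "memory");
    } else {
        __asm__ volatile("isb" ::: "memory");
    }
#else
    (void)use_sb;
    __asm__ volatile("" ::: "memory"); /* compiler barrier only */
#endif
}

/* Run n spin steps and return how many were executed. */
static uint64_t spin_n_sketch(uint64_t n) {
    bool use_sb = cpu_has_sb_sketch(); /* detect once, outside the loop */
    uint64_t done = 0;
    for (uint64_t i = 0; i < n; i++) {
        spin_pause_sketch(use_sb);
        done++;
    }
    return done;
}
```

Detecting the feature once and passing the result into the loop keeps the per-iteration cost to a single predictable branch; a real implementation might instead cache the result in a global or select the instruction at build time.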

# Jemalloc build
$ make clean all ; ./autogen.sh && ./configure && make -j4

# Test on m8g.2xlarge with patch
$ make tests_stress && ./test/stress/arm_spin_bench
Running on ARM64 architecture
SB instruction is supported
1000000 iterations, isb_spin=8740725us (8740.725 ns/iter), standard_spin=5110839us (5110.839 ns/iter), time consumption ratio=1.710:1

# Test on m8g.2xlarge without patch 
1000000 iterations, isb_spin=8739722us (8739.722 ns/iter), standard_spin=8739722us (8739.722 ns/iter), time consumption ratio=1.000:1

Original post: jemalloc#2843

@meta-cla meta-cla bot added the cla signed label Jan 8, 2026
@lexprfuncall lexprfuncall self-assigned this Jan 17, 2026
@lexprfuncall

Thanks for the PR! Do you have any thoughts on the use of a higher-throughput instruction for a delay loop? Intuitively, you need to waste some time, so the difference between an sb and an isb could be either a win or a loss depending on the number of iterations.

As an aside, the approach suggested by ARM involves using a more precise delay loop to account for the differences. That might require a slightly bigger change to the spin_adaptive implementation:

https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
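
To make the concern concrete, here is a rough sketch of a delay loop that is bounded by elapsed time rather than by an iteration count, so the total delay does not depend on how fast the pause instruction is. This is only an illustration under stated assumptions, not the method from the linked post and not jemalloc code: the names `now_ns_sketch`, `cpu_relax_sketch`, and `delay_for_ns_sketch` are hypothetical, and `clock_gettime(CLOCK_MONOTONIC)` stands in for whatever cheaper time source a real implementation would pick.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds. */
static uint64_t now_ns_sketch(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: one pause step (isb on AArch64, barrier elsewhere). */
static inline void cpu_relax_sketch(void) {
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

/*
 * Spin until at least ns nanoseconds have elapsed.
 * Returns the number of pause steps executed, which varies with the
 * cost of the pause instruction while the wall-clock delay does not.
 */
static uint64_t delay_for_ns_sketch(uint64_t ns) {
    uint64_t start = now_ns_sketch();
    uint64_t iters = 0;
    while (now_ns_sketch() - start < ns) {
        cpu_relax_sketch();
        iters++;
    }
    return iters;
}
```

With a loop of this shape, swapping isb for sb changes how many iterations fit into the delay rather than how long the delay lasts. Calling `clock_gettime` on every iteration has its own overhead, so this form is only meant to show the idea.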
