[generator.c] optimize copy_remaining_bytes #924
This PR focuses on optimizing `copy_remaining_bytes`. The `MEMCPY(s, search->ptr, char, len);` generates a function call because `len` is not constant. However, we know that `len` is between 6 (now 4) and `vec_len - 1` bytes.

Instead of the `MEMCPY`, if available, we use `__builtin_memcpy` with a constant length, which ends up emitting direct load and store instructions. The copies are structured to cover anywhere from 4 to 15 bytes by copying overlapping byte ranges, so exactly the right number of bytes ends up written. The `__builtin_memcpy` is important, at least for clang on macOS: with a plain `memcpy`, the compiler is smart enough to recognize that the only difference between the branches is a length of either 8 or 4, so it uses a conditional select to pick the length and then loads and stores. That is quite a bit slower than the `__builtin_memcpy`.
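As a minimal sketch of that structure (the function name and exact branch layout here are my own illustration, not necessarily the code in this PR):

```c
#include <stddef.h>

/* Copy len bytes, 4 <= len <= 15, using only constant-length
 * __builtin_memcpy calls so the compiler emits plain load/store pairs
 * instead of a call to memcpy. The two copies may overlap; together they
 * cover exactly len bytes. */
static inline void copy_4_to_15_bytes(char *dst, const char *src, size_t len)
{
    if (len >= 8) {
        /* 8..15 bytes: one 8-byte copy from the front, one from the back. */
        __builtin_memcpy(dst, src, 8);
        __builtin_memcpy(dst + len - 8, src + len - 8, 8);
    } else {
        /* 4..7 bytes: the same trick with 4-byte copies. */
        __builtin_memcpy(dst, src, 4);
        __builtin_memcpy(dst + len - 4, src + len - 4, 4);
    }
}
```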
Additionally, I noticed that the `memset(s, 'X', vec_len);` generates three instructions. This is because `'X'` repeated across a register (`0x5858585858585858`) cannot be represented as an immediate in AArch64/ARM64 assembly, whereas a space (`0x20`) can be. It doesn't really matter what filler character is used as long as it doesn't need to be escaped. Using a space, clang now generates a shorter sequence. I realize this only saves a single instruction and doesn't make much of a difference, but I'll take it.
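In code form, the filler change is just this (a sketch, not a verbatim diff; `buf` and `vec_len` stand in for the scratch buffer and vector width used in `generator.c`):

```c
#include <string.h>

static inline void fill_scratch(char *buf, size_t vec_len)
{
    /* was: memset(buf, 'X', vec_len);
     * 'X' repeated is 0x5858585858585858, which AArch64 cannot encode as a
     * single logical immediate, so clang needs an extra instruction to
     * materialize it. A space repeated is 0x2020202020202020, which is
     * encodable, so the fill is one instruction shorter. Any filler byte
     * that never needs escaping would do. */
    memset(buf, ' ', vec_len);
}
```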
The `__builtin_memcpy` certainly introduces a level of complexity I wouldn't normally entertain, but the performance improvements were quite surprising. Here are the results of running a benchmark on my M1 MacBook Air; the percentages are similar on my M4 MacBook Pro. As always, the percentages vary a bit between runs, but this run is fairly typical.

The numbers were shocking enough that I thought I had broken something. I added a few more tests, in addition to running this shell script to verify the output between the default `json` gem that comes with Ruby and the current version. I excluded `canada.json` as some of the numbers are output with slightly different precision.

I did lower `SIMD_MINIMUM_THRESHOLD` to `4`, as the copy is now almost free and that seems to change the math a bit for when it makes sense to fall back to the lookup table. Additionally, since the "else" branch copies overlapping 4-byte chunks, 4 seemed like the logical minimum threshold.

I have thought about how to clean this up a little, and have an idea involving a helper in `simd.h` that `generator.c` would then use.
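As a rough sketch of what that packaging could look like (the macro name and guard below are illustrative assumptions, not the actual idea):

```c
/* simd.h (hypothetical): hide the compiler-specific builtin behind one
 * macro so generator.c never mentions __builtin_memcpy directly. */
#include <string.h>  /* for the memcpy fallback */

#if defined(__GNUC__) || defined(__clang__)
#  define SMALL_MEMCPY(dst, src, n_const) __builtin_memcpy((dst), (src), (n_const))
#else
#  define SMALL_MEMCPY(dst, src, n_const) memcpy((dst), (src), (n_const))
#endif
```

`generator.c` would then use `SMALL_MEMCPY` for the constant-length overlapping copies instead of open-coding the `__builtin_memcpy` calls.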