
@samyron
Contributor

@samyron samyron commented Jan 16, 2026

This PR focuses on optimizing copy_remaining_bytes. The MEMCPY(s, search->ptr, char, len); call compiles to an actual function call because len is not a compile-time constant. However, we know that len is between 6 (now 4) and vec_len - 1 bytes.

Instead of the MEMCPY, if available, we use __builtin_memcpy with a constant length, which the compiler lowers to direct load and store instructions. The copies are structured to handle between 4 and 15 bytes by copying overlapping, fixed-size byte ranges that together cover exactly the right number of bytes. Using __builtin_memcpy matters, at least for clang on macOS: with plain memcpy, the compiler recognizes that the only difference between the two branches is a constant size of 8 or 4, emits a conditional select to pick that size, and then does the loads and stores. That is quite a bit slower than the __builtin_memcpy version.

Additionally, I noticed that memset(s, 'X', vec_len); generates three instructions:

mov x8, #0x1818181818181818
orr x8, x8, #0x4040404040404040
stp x8, x8, [x23]

This is because a repeated 'X' (0x5858585858585858) cannot be represented as an immediate in AArch64/ARM64 assembly, while a repeated space (0x2020202020202020) can. It doesn't really matter which filler character is used, as long as it doesn't need to be escaped. Using a space, clang now generates this:

mov x10, #0x2020202020202020
stp x10, x10, [x8]

I realize this only saves a single instruction and doesn't make much of a difference, but I'll take it.

The __builtin_memcpy certainly introduces a level of complexity I wouldn't normally entertain, but the performance improvements were quite surprising. Here are the results of running a benchmark on my M1 MacBook Air; the percentages are similar on my M4 MacBook Pro. As always, the numbers vary a bit between runs, but this run is fairly typical.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.570k i/100ms
Calculating -------------------------------------
               after     26.661k (± 1.4%) i/s   (37.51 μs/i) -    133.640k in   5.013636s

Comparison:
              before:    25175.5 i/s
               after:    26660.6 i/s - 1.06x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   133.000 i/100ms
Calculating -------------------------------------
               after      1.338k (± 0.7%) i/s  (747.18 μs/i) -      6.783k in   5.068387s

Comparison:
              before:     1273.7 i/s
               after:     1338.4 i/s - 1.05x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   269.000 i/100ms
Calculating -------------------------------------
               after      2.694k (± 2.1%) i/s  (371.21 μs/i) -     13.719k in   5.094830s

Comparison:
              before:     2509.0 i/s
               after:     2693.9 i/s - 1.07x  faster


== Encoding ohai.json (20145 bytes)
ruby 3.4.8 (2025-12-17 revision 995b59f666) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     3.418k i/100ms
Calculating -------------------------------------
               after     34.115k (± 0.8%) i/s   (29.31 μs/i) -    170.900k in   5.009885s

Comparison:
              before:    31162.9 i/s
               after:    34114.9 i/s - 1.09x  faster

The numbers were surprising enough that I thought I had broken something. I added a few more tests, and also ran this shell script to verify that the output matches between the json gem bundled with Ruby and this branch.

#!/bin/bash

all_match=true

for file in ./benchmark/data/*.json; do
    case "$(basename "$file")" in canada.json) continue ;; esac
    echo "Testing: $(basename "$file")"

    dev_hash=$(ruby -Ilib:ext -e "require 'json'; puts JSON.generate(JSON.parse(File.read('$file')));" | sha1sum | cut -d' ' -f1)
    rel_hash=$(ruby -e "require 'json'; puts JSON.generate(JSON.parse(File.read('$file')));" | sha1sum | cut -d' ' -f1)

    if [ "$dev_hash" = "$rel_hash" ]; then
        echo "  ✓ $dev_hash"
    else
        echo "  ✗ MISMATCH: dev=$dev_hash rel=$rel_hash"
        all_match=false
    fi
done

$all_match && echo -e "\n✓ All tests passed!" || echo -e "\n✗ Some tests failed!"
% bash ./verify-changes.sh 
Testing: activitypub-pretty.json
  ✓ 421ab39e9ee7eed1392a24ddde4deee218abeae2
Testing: activitypub.json
  ✓ 421ab39e9ee7eed1392a24ddde4deee218abeae2
Testing: citm_catalog.json
  ✓ 09b74151f3e4310e9339be0ff1c0d8b316dbdabd
Testing: github_events.json
  ✓ d5248cad8f4e573145009842e9f0116919e2b4f0
Testing: integers-pretty.json
  ✓ 6ffad71396cc4207b53e34eaf02c5bf1a6f65d92
Testing: integers.json
  ✓ 7274a0b8c6c55fa0478b5a5f80433ea31b51c123
Testing: ohai.json
  ✓ 8dde8bd01ef7baac0330da5a094ab1781bfbe142
Testing: semanticscholar-corpus.json
  ✓ 52e8b47615c1cbcc538d47d55d999eadfaffdf84
Testing: twitter.json
  ✓ ccab6079fd5cf8408f7bf4ff832729d4fc59f9e8
Testing: twitterescaped.json
  ✓ ccab6079fd5cf8408f7bf4ff832729d4fc59f9e8
Testing: update-center.json
  ✓ 12d1ba8473163a50c6632821c7228b4f067642a2

✓ All tests passed!

I excluded canada.json because some of its numbers are output with slightly different precision.

I did lower SIMD_MINIMUM_THRESHOLD to 4: the copy is now almost free, which changes the math a bit for when it makes sense to fall back to the lookup table. Additionally, since the else branch copies overlapping 4-byte chunks, 4 seemed like the logical minimum threshold.

I have thought about how to clean this up a little and have this idea:

in simd.h:

#if defined(__has_builtin)
#if __has_builtin(__builtin_memcpy)
#define EXPLICIT_MEMCPY(dst, src, n) __builtin_memcpy(dst, src, n)
#define SIMD_MINIMUM_THRESHOLD 4
#endif
#endif

#ifndef SIMD_MINIMUM_THRESHOLD
#define SIMD_MINIMUM_THRESHOLD 6
#endif

Then in the generator.c:

#ifdef EXPLICIT_MEMCPY
<new optimized code>
#else
MEMCPY(s, search->ptr, char, len);
#endif

…n copy_remaining_bytes to avoid a branch to MEMCPY. Additionally, use a space as the padding byte instead of an 'X' so it can be represented directly on AArch64 with a single instruction.
@byroot
Member

byroot commented Jan 16, 2026

if available, we use __builtin_memcpy with a constant length which ends up emitting direct load and store instructions.

Have you tried the same with MEMCPY or memcpy? I know that compilers can do a similar optimization when Ruby's MEMCMP has a convenient constant size too (just to get rid of the ifdefs, etc.).

Comment on lines +306 to +307
#if defined(__has_builtin)
#if __has_builtin(__builtin_memcpy)
Member


Suggested change
#if defined(__has_builtin)
#if __has_builtin(__builtin_memcpy)
#if defined(__has_builtin) && __has_builtin(__builtin_memcpy)

// to escape, then everything ends up in the correct spot. Otherwise it was convenient temporary storage.
#if defined(__has_builtin)
#if __has_builtin(__builtin_memcpy)
if (vec_len == 16 && len >= 4) {
Member


I'm thinking we should add RBIMPL_ASSERT_OR_ASSUME(len < 16)

@byroot
Member

byroot commented Jan 16, 2026

Ok, I have yet to finish my coffee, but just to see if I get it:

copy_remaining_bytes only ever copies up to 16B.

Have you considered always copying 16B regardless?
