Thank you for the heads-up and the detailed analysis, Charles.

A 3x speedup on x86-64 is very significant.

How do the alternate methods compare for smaller vectors?
I think it would be enlightening to see benchmark results for
8/16/32/64/128 bytes as well, to make sure that there is no regression.

Another issue is that HBase usually does not compare whole ByteBuffers,
only sub-ranges of them, which is a problem for mismatch().
I guess ByteBuffer.slice() can be used to get around that, but slicing
has its own overhead, which has to be taken into account.
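For illustration, a sub-range comparison via mismatch() could look
roughly like the sketch below. The helper name and signature are made up
here, chosen to mirror ByteBufferUtils.compareTo, and slice(index, length)
needs Java 13+; the extra view objects are exactly the overhead in question:

```java
import java.nio.ByteBuffer;

// Sketch only: compare sub-ranges of two ByteBuffers via mismatch() by
// slicing views first. The helper name/signature is hypothetical.
// slice(index, length) (Java 13+) creates a view without copying, but
// the extra object creation may matter on hot paths.
public class SliceMismatchSketch {

  static int compareTo(ByteBuffer a, int aOff, int aLen,
                       ByteBuffer b, int bOff, int bLen) {
    ByteBuffer s1 = a.slice(aOff, aLen);
    ByteBuffer s2 = b.slice(bOff, bLen);
    int i = s1.mismatch(s2);
    if (i >= 0 && i < Math.min(aLen, bLen)) {
      // Differing byte found inside both ranges: compare unsigned.
      return (s1.get(i) & 0xFF) - (s2.get(i) & 0xFF);
    }
    // Ranges are equal, or one is a prefix of the other.
    return aLen - bLen;
  }

  public static void main(String[] args) {
    ByteBuffer x = ByteBuffer.wrap(new byte[] { 0, 1, 2, 3, 4 });
    ByteBuffer y = ByteBuffer.wrap(new byte[] { 9, 1, 2, 3, 8 });
    System.out.println(compareTo(x, 1, 3, y, 1, 3)); // 0 (equal sub-ranges)
    System.out.println(compareTo(x, 1, 4, y, 1, 4)); // -4 (4 vs 8 at the end)
  }
}
```

Benchmarking something like this against the whole-buffer mismatch() call
would show how much the slicing actually costs.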

Looking at the Vector API code, it seems to be able to handle sub-ranges
without additional overhead, simply by setting the offsets when loading.

ByteBufferUtils already supports multiple comparator implementations, so
adding new ones based on mismatch() or the vector ops should be
relatively easy.

I could also see a comparator that dynamically chooses the
implementation based on the length to be compared, in case the new
methods regress in performance on smaller vectors.

getBestComparer() could choose based on the architecture or AVX
support, and/or could be configurable, so this would not have to
negatively affect ARM (or other architectures).
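As a rough illustration of both ideas (all names here are hypothetical,
not actual HBase API), the selection could look like:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a getBestComparer()-style selection: pick an
// implementation once based on os.arch, and optionally wrap it in a
// length-based dispatcher. None of these names are actual HBase API.
public class ComparerSelectionSketch {

  interface Comparer {
    int compareTo(ByteBuffer a, int aOff, int aLen,
                  ByteBuffer b, int bOff, int bLen);
  }

  // Simple byte-at-a-time fallback, standing in for the existing
  // ByteBufferUtils implementation.
  static final Comparer PURE_JAVA = (a, aOff, aLen, b, bOff, bLen) -> {
    int n = Math.min(aLen, bLen);
    for (int i = 0; i < n; i++) {
      int d = (a.get(aOff + i) & 0xFF) - (b.get(bOff + i) & 0xFF);
      if (d != 0) return d;
    }
    return aLen - bLen;
  };

  // mismatch()-based variant over slices (Java 13+ for slice(index, length)).
  static final Comparer MISMATCH = (a, aOff, aLen, b, bOff, bLen) -> {
    ByteBuffer s1 = a.slice(aOff, aLen);
    ByteBuffer s2 = b.slice(bOff, bLen);
    int i = s1.mismatch(s2);
    if (i >= 0 && i < Math.min(aLen, bLen)) {
      return (s1.get(i) & 0xFF) - (s2.get(i) & 0xFF);
    }
    return aLen - bLen;
  };

  // Dispatch on length at call time, in case the vectorized path
  // regresses for short keys. The threshold would come from benchmarks.
  static Comparer lengthAdaptive(int threshold, Comparer small, Comparer large) {
    return (a, aOff, aLen, b, bOff, bLen) ->
      (Math.min(aLen, bLen) < threshold ? small : large)
        .compareTo(a, aOff, aLen, b, bOff, bLen);
  }

  static Comparer getBestComparer() {
    String arch = System.getProperty("os.arch", "");
    // mismatch() benefits from an AVX-backed intrinsic on x86-64;
    // elsewhere keep the existing path. Could also be configurable.
    if (arch.equals("amd64") || arch.equals("x86_64")) {
      return lengthAdaptive(64, PURE_JAVA, MISMATCH);
    }
    return PURE_JAVA;
  }

  public static void main(String[] args) {
    ByteBuffer a = ByteBuffer.wrap(new byte[] { 1, 2, 3, 4, 5 });
    ByteBuffer b = ByteBuffer.wrap(new byte[] { 1, 2, 3, 9, 5 });
    System.out.println(getBestComparer().compareTo(a, 0, 5, b, 0, 5) < 0); // true
  }
}
```

The selection happens once, so the per-call cost on the hot path is just
the length check (or nothing, if the dispatcher is skipped).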

The next step would be implementing Comparators based on either or both
of the new methods, and benchmarking them via the ByteBufferUtils API.

Istvan



On Fri, Oct 3, 2025 at 6:29 PM Charles Connell <[email protected]>
wrote:

> Hi folks,
>
> TLDR: there might be a faster way to compare byte buffers than
> ByteBufferUtils.
>
> I thought it may be of interest to others if I share my findings about
> row key comparison performance. Row key comparison can represent a
> significant part of a RegionServer's CPU cycles. In CPU profiles taken
> at my company, 2% of CPU cycles is typical, but some extreme cases go
> past 20%. If there is a way to speed this up, it would be great. I'm
> also aware that the current row key comparator, which uses Java's
> Unsafe, is holding back parts of the HBase build on Java 8.
>
> I did some micro-benchmarks, focused on comparing direct ByteBuffers.
> I tested HBase's ByteBufferUtils against the JDK's
> ByteBuffer.mismatch() method, and against the incubating Vector API.
> ByteBufferUtils compares 8 bytes per loop iteration, but both of these
> alternative methods can compare more bytes per iteration.
>
> I ran my benchmark on two Amazon Web Services servers. One was type
> i8g.large, which has two Neoverse V2 arm64 cores. The other was
> i7ie.large, which has two Emerald Rapids x86-64 cores. Both of these
> CPUs are about the latest and greatest that you can get in their
> respective architectures. My JVM was Temurin-25+36. The benchmark code
> is pasted below, which is adapted from a test in the OpenJDK source
> tree.
>
> Scores to follow. The scores are the number of byte array comparisons
> done per second. Higher is better. I'm just showing results for
> buffers of size 257, which is long enough to benefit from
> vectorization, but not very long, and I think represents a realistic
> row key length. The first half of each buffer being compared always
> contains matching bytes, and the second halves do not.
>
> Benchmark             Mode   Cnt            Score         Error  Units
> (arm64)
> byteBufferMismatch    thrpt   25   86,281,436.335 ±  890778.433  ops/s
>   # no JVM intrinsic, compares 8 bytes per loop in Java, but C2 may
> vectorize it further, not sure
> byteBufferUtils       thrpt   25  132,360,665.737 ±  765164.120  ops/s
>   # compares 8 bytes per loop, confirmed that C2 does not vectorize it
> further
> vector                thrpt   25   81,494,739.828 ± 2139303.751  ops/s
>   # uses NEON registers, compares 16 bytes per loop
>
> (x86-64)
> byteBufferMismatch    thrpt   25   69,825,596.046 ± 1136949.216  ops/s
>   # JVM intrinsic, uses AVX512 registers, compares 64 bytes per loop
> byteBufferUtils       thrpt   25   20,096,803.307 ±  785102.258  ops/s
>   # compares 8 bytes per loop
> vector                thrpt   25   69,006,936.897 ±  617447.660  ops/s
>   # I assume uses AVX512 registers, compares 64 bytes per loop
>
> On x86, ByteBuffer.mismatch() and the Vector API perform the best. On
> arm64, ByteBufferUtils performs the best. Every arm64 score is better
> than every x86 score, which I guess says good things about the
> Neoverse processors. The Vector API doesn't beat
> ByteBuffer.mismatch(), so no need to consider that further. I assume
> that the most popular architecture for running HBase is x86, so x86
> users might appreciate a switch to ByteBuffer.mismatch(), which has
> been available since Java 11. Selfishly, this would be bad for my
> company, which primarily uses arm64. I want to provide this
> information in case it's useful to spark a discussion.
>
> Charles Connell
> Hubspot Inc
>
> Benchmark code:
>
> /*
>  *  Copyright (c) 2021, 2024, Oracle and/or its affiliates. All rights
> reserved.
>  *  DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
>  *
>  *  This code is free software; you can redistribute it and/or modify it
>  *  under the terms of the GNU General Public License version 2 only, as
>  *  published by the Free Software Foundation.
>  *
>  *  This code is distributed in the hope that it will be useful, but
> WITHOUT
>  *  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>  *  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>  *  version 2 for more details (a copy is included in the LICENSE file that
>  *  accompanied this code).
>  *
>  *  You should have received a copy of the GNU General Public License
> version
>  *  2 along with this work; if not, write to the Free Software Foundation,
>  *  Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
>  *
>  *  Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
>  *  or visit www.oracle.com if you need additional information or have any
>  *  questions.
>  *
>  */
>
> package cconnell;
>
> import java.io.IOException;
> import java.lang.foreign.MemorySegment;
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import java.util.random.RandomGenerator;
> import jdk.incubator.vector.ByteVector;
> import jdk.incubator.vector.VectorMask;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
> import org.apache.hadoop.hbase.util.ByteBufferUtils;
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.Fork;
> import org.openjdk.jmh.annotations.Param;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.Setup;
> import org.openjdk.jmh.annotations.State;
> import org.openjdk.jmh.runner.RunnerException;
>
> @State(Scope.Benchmark)
> @Fork(jvmArgs = {"--add-modules=jdk.incubator.vector",
> "-Dsun.misc.unsafe.memory.access=allow"})
> public class ByteArrayCompareBenchmark {
>
>   public static void main(String[] args) throws RunnerException,
> IOException {
>     org.openjdk.jmh.Main.main(args);
>   }
>
>   @Param({ "10", "257" })
>   private int size;
>
>   private ByteBuffer buffer1;
>   private ByteBuffer buffer2;
>   private MemorySegment memorySegment1;
>   private MemorySegment memorySegment2;
>
>   private static final VectorSpecies<Byte> BYTE_SPECIES_PREFERRED =
>     ByteVector.SPECIES_PREFERRED;
>
>   @Setup
>   public void setup() {
>     RandomGenerator random = RandomGenerator.getDefault();
>     int common = (int) (0.5 * size);
>     byte[] commonBytes = new byte[common];
>     random.nextBytes(commonBytes);
>
>     buffer1 = ByteBuffer.allocateDirect(common + size);
>     buffer1.put(commonBytes);
>     byte[] arr = new byte[size];
>     random.nextBytes(arr);
>     buffer1.put(arr);
>     buffer1.flip();
>     memorySegment1 = MemorySegment.ofBuffer(buffer1);
>
>     buffer2 = ByteBuffer.allocateDirect(common + size);
>     buffer2.put(commonBytes);
>     arr = new byte[size];
>     random.nextBytes(arr);
>     buffer2.put(arr);
>     buffer2.flip();
>     memorySegment2 = MemorySegment.ofBuffer(buffer2);
>   }
>
>   @Benchmark
>   public int byteBufferMismatch() {
>     int index = buffer1.mismatch(buffer2);
>     if (index >= 0) {
>       return (buffer1.get(index) & 255) - (buffer2.get(index) & 255);
>     } else {
>       return 0;
>     }
>   }
>
>   @Benchmark
>   public int byteBufferUtils() {
>     return ByteBufferUtils.compareTo(
>       buffer1, 0, buffer1.limit(),
>       buffer2, 0, buffer2.limit()
>     );
>   }
>
>   @Benchmark
>   public int vector() {
>     int length = Math.min(buffer1.limit(), buffer2.limit());
>     int offset = 0;
>     for (
>       ;
>       offset < BYTE_SPECIES_PREFERRED.loopBound(length);
>       offset += BYTE_SPECIES_PREFERRED.length()
>     ) {
>       ByteVector vector1 = ByteVector.fromMemorySegment(
>         BYTE_SPECIES_PREFERRED,
>         memorySegment1,
>         offset,
>         ByteOrder.LITTLE_ENDIAN
>       );
>       ByteVector vector2 = ByteVector.fromMemorySegment(
>         BYTE_SPECIES_PREFERRED,
>         memorySegment2,
>         offset,
>         ByteOrder.LITTLE_ENDIAN
>       );
>       VectorMask<Byte> mask = vector1.compare(VectorOperators.NE, vector2);
>       if (mask.anyTrue()) {
>         int index = offset + mask.firstTrue();
>         return (buffer1.get(index) & 255) - (buffer2.get(index) & 255);
>       }
>     }
>     // process the tail
>     int mismatch = -1;
>     for (int i = offset; i < length; ++i) {
>       if (buffer1.get(i) != buffer2.get(i)) {
>         mismatch = i;
>         break;
>       }
>     }
>     if (mismatch >= 0) {
>       return (buffer1.get(mismatch) & 255) - (buffer2.get(mismatch) & 255);
>     } else {
>       return 0;
>     }
>   }
> }
>
>

-- 
*István Tóth* | Sr. Staff Software Engineer
*Email*: [email protected]