Thanks, Maurizio.
On 12/9/24 03:42, Maurizio Cimadamore wrote:
Great work Vlad!
The simdsort part seems a more "classic" FFM binding - where you have a
method handle per entry point. That seems to fit the design of FFM
rather well. In the second case (SVML/SLEEF), usage of FFM is limited to
building a "table of entry points" (e.g. we're just using SymbolLookup +
MemorySegment here -- the invocation part is intrinsified as part of the
new VectorSupport methods).
I'd say that both the simdsort and SVML/SLEEF cases are slightly off from
the sweet spot the FFM API is designed for, since all 3 libraries heavily
rely on CPU dispatching.
If it helps, it might be possible to define a custom (JDK-internal)
family of value layouts for vector types. Then we could enhance the
Linker classification to support such layouts. This means you could call
into native functions with vector parameter and return types using the
Linker API more directly. Not sure if it would give you the same
performance, but it's also an approach worth exploring.
FTR I experimented a bit with vector calling convention support, but,
as the Vector API is implemented now, it introduced a significant
amount of complexity on both sides, so I decided to keep the vector
intrinsics for now. It already enables significant simplifications in
the Vector API.
Still, it would be convenient to eventually get vector support in FFM.
Re. support for custom calling conventions to call into hotspot stubs
from Java, this might be possible - our story for supporting calling
conventions other than the system calling convention is that there
should be a dedicated linker instance per calling convention. So, if the
JVM defines its own calling convention for its stubs there should
probably be a custom Linker implementation that is used to call into
such stubs - which uses the machinery in the Linker implementation (e.g.
Bindings) to classify the incoming function descriptors and determine
the shuffle sequence for a particular call. This should all be
doable (at least inside the JDK) - it's just a matter of "writing more code".
Interesting. Thanks for the details.
I agree with Paul that, as we move more stuff to use Panama, we will
need to look more at the avenues available to us to claim back some of
the additional warm up cost introduced by the use of var/method handles.
This is probably part of a bigger exploration on warmup and FFM.
In the case of C2 intrinsics it may be less of an issue: the additional
startup cost may be quickly recuperated during warmup, because the
optimized implementation is available earlier.
Best regards,
Vladimir Ivanov
On 06/12/2024 23:18, Vladimir Ivanov wrote:
Recently, a trend emerged to use native libraries to back intrinsics
in the HotSpot JVM. SVML stubs for the Vector API paved the road, and
they were soon followed by the SLEEF and simdsort libraries.
After examining their support, I must confess that it doesn't look
pretty. It introduces significant accidental complexity on the JVM side.
HotSpot has to be taught about every entry point in each library in an
ad-hoc manner. It's inherently unsafe, error-prone to implement, and
hard to maintain: the JVM makes a lot of assumptions about an entry
point based solely on its symbolic name, and each library has its own
naming conventions. Overall, the current approach doesn't scale well.
Fortunately, the new FFI API (java.lang.foreign) was finalized in JDK 22.
It provides enough functionality to interact with native libraries from
Java in a performant manner.
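For reference, here is a minimal sketch of the two usage patterns
involved (a method handle per entry point vs. merely resolving a table
of entry points), using `strlen` from the C runtime as a stand-in for a
real library entry point; the class and method names are mine:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class FfmBindingSketch {
    static final Linker LINKER = Linker.nativeLinker();
    static final SymbolLookup STDLIB = LINKER.defaultLookup();

    // "Classic" binding: one downcall method handle per entry point.
    static final MethodHandle STRLEN = LINKER.downcallHandle(
            STDLIB.find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    static long strlenOf(String s) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cStr = arena.allocateFrom(s); // NUL-terminated C string
            return (long) STRLEN.invokeExact(cStr);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(strlenOf("hello")); // prints 5

        // The other style: just resolve an address ("table of entry points");
        // the invocation itself would happen elsewhere (e.g., in an intrinsic).
        MemorySegment entry = STDLIB.find("strlen").orElseThrow();
        System.out.println(entry.address() != 0); // prints true
    }
}
```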
I did an exercise to migrate all 3 libraries away from intrinsics and
the results look promising:
simdsort: https://github.com/openjdk/jdk/pull/22621
SVML/SLEEF: https://github.com/openjdk/jdk/pull/22619
As of now, java.lang.foreign lacks vector calling convention support,
so the actual calls into SVML/SLEEF are still backed by intrinsics.
But it still enables a major cleanup on the JVM side.
Also, I coded library headers and used jextract to produce an initial
sketch of the library API in Java, and it worked really well. Eventually,
it can be incorporated into the JDK build process to ensure consistency
between the native and Java parts of the library API.
Performance-wise, it is on par with the current (intrinsic-based)
implementation.
One open question relates to CPU dispatching.
Each library exposes multiple functions with different requirements on
CPU ISA extension support (e.g., no AVX vs. AVX2 vs. AVX512, NEON vs.
SVE). Right now, it's the JVM's responsibility, but once the JVM gets
out of the loop, the library itself should make the decision. I
experimented with 2 approaches: (1) perform CPU dispatching while
linking the library from Java code (as illustrated in the
aforementioned PRs); or (2) call into the native library to query it
for the right entry point [1] [2] [3]. In both cases, it depends on an
additional API to sense the JVM/hardware capabilities (exposed on
jdk.internal.misc.VM for now).
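A minimal sketch of approach (1): the symbol names and the hard-coded
capability flags below are made up for illustration, standing in for
the bits that would come from jdk.internal.misc.VM and be resolved via
SymbolLookup:

```java
public class CpuDispatchSketch {
    // Capability flags; hard-coded here, but queried from the JVM in
    // reality (e.g., via an internal API on jdk.internal.misc.VM).
    record CpuFeatures(boolean avx512, boolean avx2) {}

    // Pick the symbol name to resolve with SymbolLookup based on the
    // CPU features; the symbol names are hypothetical.
    static String selectSortEntry(CpuFeatures cpu) {
        if (cpu.avx512()) return "avx512_sort";
        if (cpu.avx2())   return "avx2_sort";
        return "fallback_sort";
    }

    public static void main(String[] args) {
        System.out.println(selectSortEntry(new CpuFeatures(false, true))); // prints avx2_sort
    }
}
```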
Let me know if you have any questions/suggestions/concerns. Thanks!
I plan to eventually start publishing PRs to upstream this work.
Best regards,
Vladimir Ivanov
[1] https://github.com/openjdk/jdk/commit/b6e6f2e20772e86fbf9088bcef01391461c17f11
[2] https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/share/classes/java/util/SIMDSortLibrary.java
[3] https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/linux/native/libsimdsort/simdsort.c