On 12/9/24 07:55, Paul Sandoz wrote:
Some further observations.

- This arguably makes it harder for the auto-vectorizer to access the SVML/SLEEF 
functionality. However, in some cases, we cannot provide the same guarantees 
(IIRC mainly around monotonicity) as the scalar operations in Math.

I'm not too optimistic about auto-vectorization unless the very same stubs are shared between scalar and vectorized code. Our previous experience with FP operations strongly indicates that users expect FP operations to produce reproducible (bitwise-identical) results across the same run.

Moreover, migration to FFI enables usage of SVML/SLEEF across all execution modes, which should make it easier to reason about Vector API usages.

- There is an open bug to adjust the simd sort behavior on AMD zen 4 cores due 
to poor performance of an AVX 512 instruction. The simplest solution is to fall 
back to AVX2. That may be simpler to manage in Java? (I was looking at the 
HotSpot code).

For now, the patch guards AVX512 entries with a VM.isIntelCPU() check. To distinguish between AMD Zen 4 and Zen 5, either a new platform-sensing check is needed, or x86-specific platform sensing has to be reimplemented in Java on top of CPUID information.
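As a rough sketch of what such a guard looks like on the Java side (the method and its inputs are illustrative placeholders, not the actual patch's API):

```java
// Hypothetical sketch: pick a library entry point based on CPU checks.
// symbolFor and the _avx512/_avx2 suffixes are illustrative names only.
public class Avx512Guard {
    static String symbolFor(String base, boolean hasAvx512, boolean isIntelCpu) {
        // Take the AVX512 entry only on Intel CPUs; AMD parts fall back to
        // AVX2 due to the AVX512 performance issue mentioned above.
        if (hasAvx512 && isIntelCpu) {
            return base + "_avx512";
        }
        return base + "_avx2";
    }

    public static void main(String[] args) {
        System.out.println(symbolFor("sort", true, false)); // prints sort_avx2
    }
}
```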

Best regards,
Vladimir Ivanov

On Dec 6, 2024, at 4:48 PM, Vladimir Ivanov <vladimir.x.iva...@oracle.com> 
wrote:

Thanks, Paul.

Excellent work, very happy to see more of this moved to Java leveraging Panama 
features. The Java code looks very organized.
I am wondering if this technique can be applied to stubs dynamically generated 
by HotSpot via some sort of special library lookup, e.g., for crypto.

It's an interesting idea. A JVM could expose individual symbols so they can be 
looked up, but a more promising approach is to expose a whole table of generated 
stubs through a native call into the JVM (similar to simdsort_link [1]).

The problematic part is that stubs don't have to obey the platform ABI. Some of 
them deliberately rely on very restrictive calling conventions (e.g., no 
caller-saved registers), which makes calling them from generated code much 
simpler and cheaper.

In the longer term, custom calling conventions for each entry point can be coded 
if there's enough java.lang.foreign support present. (So, an entry point 
returned by the JVM consists of an entry address accompanied by an appropriate 
invoker.)


Do you have a sense of the differences in static memory footprint and startup 
cost? Things I imagine Leyden could help with.

Are you asking about simdsort/SVML/SLEEF case here?

Yes.


I didn't measure, but initialization costs will definitely be higher (compared 
to the JVM-only solution). In absolute numbers it should be negligible though (the 
libraries expose only a small number of entry points).


Regarding CPU dispatching, my preference would be to do it in Java. Less native 
logic.

Fair enough. The nice thing about doing CPU dispatching on the native library side 
is that all those cryptic naming conventions don't show up on the Java side [2], 
but IMO it requires too much ceremony, so I kept it on the Java side for now.


This may also be useful to help determine whether we can/should expose 
capabilities in the Vector API regarding what is optimally supported or not.

IMO the Vector API (as it is implemented now) would benefit from a higher-level 
C2-specific API.


Ok.

Paul.


I presume it also does not preclude some sort of jlink plugin that strips 
unused methods from the native libraries, something which may be trickier if 
done in the native library itself?

Good point. It may be the case, but I don't have enough experience with native 
library stripping to comment on it.

Best regards,
Vladimir Ivanov

[1] 
https://github.com/openjdk/jdk/commit/b6e6f2e20772e86fbf9088bcef01391461c17f11


[2] 
https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/linux/native/libsimdsort/simdsort.c

Paul.
On Dec 6, 2024, at 3:18 PM, Vladimir Ivanov <vladimir.x.iva...@oracle.com> 
wrote:

Recently, a trend has emerged to use native libraries to back intrinsics in the 
HotSpot JVM. SVML stubs for the Vector API paved the road, soon followed by the 
SLEEF and simdsort libraries.

After examining their support, I must confess that it doesn't look pretty. It 
introduces significant accidental complexity on the JVM side: HotSpot has to be 
taught about every entry point in each library in an ad-hoc manner. It's 
inherently unsafe, error-prone to implement, and hard to maintain: the JVM makes a 
lot of assumptions about an entry point based solely on its symbolic name, and 
each library has its own naming conventions. Overall, the current approach doesn't 
scale well.

Fortunately, the new FFI API (java.lang.foreign) was finalized in JDK 22. It 
provides enough functionality to interact with native libraries from Java in a 
performant manner.
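To make the mechanism concrete, here is a minimal java.lang.foreign downcall, using the C library's scalar cos as a stand-in for an SVML/SLEEF entry point (the class name is mine; the API calls themselves are standard JDK 22 java.lang.foreign):

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class NativeMathCall {
    // Resolve a native symbol and bind it to a MethodHandle via the native linker.
    static MethodHandle linkCos() {
        Linker linker = Linker.nativeLinker();
        MemorySegment addr = linker.defaultLookup().find("cos").orElseThrow();
        return linker.downcallHandle(
                addr,
                FunctionDescriptor.of(ValueLayout.JAVA_DOUBLE, ValueLayout.JAVA_DOUBLE));
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle cos = linkCos();
        // invokeExact requires exact argument/return types matching the descriptor.
        System.out.println((double) cos.invokeExact(0.0)); // prints 1.0
    }
}
```

The same pattern (lookup + FunctionDescriptor + downcallHandle) is what backs the library migrations in the PRs, minus the vector calling convention support mentioned below.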

I did an exercise to migrate all 3 libraries away from intrinsics and the 
results look promising:

  simdsort: https://github.com/openjdk/jdk/pull/22621

  SVML/SLEEF: https://github.com/openjdk/jdk/pull/22619

As of now, java.lang.foreign lacks vector calling convention support, so the 
actual calls into SVML/SLEEF are still backed by intrinsics. But it still 
enables a major cleanup on the JVM side.

Also, I wrote library headers and used jextract to produce an initial sketch of 
the library API in Java, and it worked really well. Eventually, this can be 
incorporated into the JDK build process to ensure consistency between the native 
and Java parts of the library API.

Performance-wise, it is on par with the current (intrinsic-based) implementation.

One open question relates to CPU dispatching.

Each library exposes multiple functions with different requirements on CPU 
ISA extension support (e.g., no AVX vs. AVX2 vs. AVX512, NEON vs. SVE). Right now, 
it's the JVM's responsibility, but once the JVM gets out of the loop, the library 
itself should make the decision. I experimented with 2 approaches: (1) perform CPU 
dispatching while linking the library from Java code (as illustrated in the 
aforementioned PRs); or (2) call into the native library to query it for the 
right entry point [1] [2] [3]. In both cases, it depends on additional API to 
sense the JVM/hardware capabilities (exposed on jdk.internal.misc.VM for now).
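Approach (1) can be sketched roughly as follows: resolve the strongest entry point the CPU supports at link time, falling back when a symbol is absent. The symbol names below are illustrative placeholders; for demonstration the fallback path is exercised against libc, where the "AVX512" name deliberately does not exist:

```java
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.util.Optional;

public class DispatchSketch {
    // Sketch of approach (1): pick the entry point in Java at link time,
    // preferring the AVX512 symbol when the CPU supports it.
    static Optional<MemorySegment> findBest(SymbolLookup lib, boolean hasAvx512,
                                            String avx512Sym, String avx2Sym) {
        if (hasAvx512) {
            Optional<MemorySegment> s = lib.find(avx512Sym);
            if (s.isPresent()) {
                return s;
            }
        }
        return lib.find(avx2Sym); // fall back to the baseline entry
    }

    public static void main(String[] args) {
        SymbolLookup libc = Linker.nativeLinker().defaultLookup();
        // libc symbols stand in for hypothetical simdsort entries; the first
        // name is missing, so the lookup falls through to "qsort".
        boolean found = findBest(libc, true, "no_such_symbol", "qsort").isPresent();
        System.out.println(found); // prints true
    }
}
```

The "hasAvx512" flag is where the jdk.internal.misc.VM capability-sensing API mentioned above would plug in.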

Let me know if you have any questions/suggestions/concerns. Thanks!

I plan to eventually start publishing PRs to upstream this work.

Best regards,
Vladimir Ivanov

[1] 
https://github.com/openjdk/jdk/commit/b6e6f2e20772e86fbf9088bcef01391461c17f11

[2] 
https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/share/classes/java/util/SIMDSortLibrary.java

[3] 
https://github.com/iwanowww/jdk/blob/09234832b6419e54c4fc182e77f6214b36afa4c5/src/java.base/linux/native/libsimdsort/simdsort.c



