On Mon, 18 Nov 2024 23:11:20 GMT, Sandhya Viswanathan 
<sviswanat...@openjdk.org> wrote:

>> Hi All,
>> 
>> This patch adds C2 compiler support for various Float16 operations added by 
>> [PR#22128](https://github.com/openjdk/jdk/pull/22128)
>> 
>> Following is a summary of the changes included with this patch:
>> 
>> 1. Detection of various Float16 operations through inline expansion or 
>> pattern folding idealizations.
>> 2. Float16 operations like add, sub, mul, div, max, and min are inferred 
>> through pattern folding idealization.
>> 3. Float16 SQRT and FMA operations are inferred through inline expansion, 
>> and their corresponding entry points are defined in the newly added 
>> Float16Math class.
>>       -    These intrinsics receive unwrapped short arguments encoding IEEE 
>> 754 binary16 values.
>> 4. New specialized IR nodes for Float16 operations, associated 
>> idealizations, and constant folding routines.
>> 5. New Ideal type for constant and non-constant Float16 IR nodes. Please 
>> refer to the 
>> [FAQs](https://github.com/openjdk/jdk/pull/21490#issuecomment-2482867818) 
>> for more details.
>> 6. Since Float16 uses short as its storage type, raw FP16 values are always 
>> loaded into a general purpose register, whereas FP16 ISA instructions 
>> generally operate over floating point registers; the compiler therefore 
>> injects reinterpretation IR before and after Float16 operation nodes to move 
>> the short value into a floating point register and back.
>> 7. New idealization routines to optimize redundant reinterpretation chains, 
>> e.g. HF2S + S2HF = HF.
>> 8. Auto-vectorization of the newly supported scalar operations.
>> 9. X86 and AARCH64 backend implementation for all supported intrinsics.
>> 10. Functional and performance validation tests.
>> 
>> **Missing Pieces:**
>> **- AARCH64 Backend.**
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 3974:
> 
>> 3972:   generate_libm_stubs();
>> 3973: 
>> 3974:   StubRoutines::_fmod = generate_libmFmod(); // from 
>> stubGenerator_x86_64_fmod.cpp
> 
> Good to retain the is_intrinsic_available checks.

I have reinstated it; its removal was an artifact of my commit.

> src/hotspot/cpu/x86/x86.ad line 4518:
> 
>> 4516: #ifdef _LP64
>> 4517: instruct ReplS_imm(vec dst, immH con, rRegI rtmp) %{
>> 4518:   predicate(VM_Version::supports_avx512_fp16() && 
>> Matcher::vector_element_basic_type(n) == T_SHORT);
> 
> I have a question about the predicate for ReplS_imm. What happens if the 
> predicate is false? There doesn't seem to be any other instruct rule to cover 
> that situation. Also I don't see any check in match rule supported on 
> Replicate node.

We only create half-float constants (ConH) if the target supports the FP16 ISA. 
These constants are generated by the Value transforms associated with 
FP16-specific IR, whose creation is in turn guarded by target-specific 
match_rule_supported checks.
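
To make that concrete, here is a minimal sketch of such a guard (the Op_*HF 
opcode names are illustrative placeholders, not necessarily the ones in the 
patch): the backend-level match_rule_supported hook rejects FP16 IR outright 
when avx512_fp16 is unavailable, so the ConH constants those nodes would fold 
to are never created on such targets.

```c++
// Sketch only (x86.ad source-block style); the Op_*HF names stand in for
// the Float16 IR opcodes added by this patch.
bool Matcher::match_rule_supported(int opcode) {
  switch (opcode) {
    case Op_AddHF:              // hypothetical FP16 opcodes
    case Op_SubHF:
    case Op_ReinterpretS2HF:
      if (!VM_Version::supports_avx512_fp16()) {
        return false;           // FP16 IR (and hence ConH) is never created
      }
      break;
    default:
      break;
  }
  return true;                  // remaining target checks elided
}
```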

> src/hotspot/cpu/x86/x86.ad line 10964:
> 
>> 10962:   match(Set dst (SubVHF src1 src2));
>> 10963:   format %{ "evbinopfp16_reg $dst, $src1, $src2" %}
>> 10964:   ins_cost(450);
> 
> Why ins_cost 450 here for reg version and 150 for mem version of binOps?  
> Whereas sqrt above has 150 cost for both reg and mem version. Good to be 
> consistent.

Cost does not play much of a role here, so I removed it for consistency. The 
matching algorithm is a BURS-style, two-pass algorithm: the state tree is 
constructed during a bottom-up walk of the expressions, and each state captures 
the cost associated with the different possible reductions. Actual selection 
happens during a top-down walk of the state tree; it is at this stage that we 
pick the minimum-cost reduction from the set of reductions producing the same 
kind of result operand. Once selected, the matcher follows the low-cost path of 
the state tree, so the associated costs guide the selector in choosing among 
the active reductions. In general it is advisable to assign a lower cost to the 
memory-variant patterns on CISC targets, since that way we can avoid emitting 
an explicit load.
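
As an illustration of that last point, a hypothetical pair of rules (names and 
operands are placeholders and the encodings are elided; this is not the actual 
patch) shows how giving the memory variant the lower ins_cost steers the 
selector towards folding the load into the arithmetic instruction:

```
// Illustrative sketch only.
instruct addHF_reg(vec dst, vec src1, vec src2) %{
  match(Set dst (AddVHF src1 src2));
  ins_cost(150);
  // ...
%}

instruct addHF_mem(vec dst, vec src1, memory src2) %{
  match(Set dst (AddVHF src1 (LoadVector src2)));
  ins_cost(100);   // cheaper than the reg-reg form, so the matcher prefers it
  // ...
%}
```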

> src/hotspot/cpu/x86/x86.ad line 11015:
> 
>> 11013:   ins_encode %{
>> 11014:     int vlen_enc = vector_length_encoding(this);
>> 11015:     __ evfmadd132ph($dst$$XMMRegister, $src2$$XMMRegister, 
>> $src1$$XMMRegister, vlen_enc);
> 
> Wondering if for auto vectorization the natural fma form is dst = dst + src1 
> * src2 i.e.
>      match(Set dst (FmaVHF dst (Binary src1 src2)));
> which then leads to fmadd231.

The ISA supports multiple FMA flavors; the current scheme is in line with the 
wiring of the inputs that is done before matching.
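
For reference, the operand forms of the three x86 FMA flavors (from the Intel 
SDM, quoted here only as background) all compute the same fused a*b + c, so 
either wiring can be matched to a single instruction:

```c++
// vfmadd132ph xmm1, xmm2, xmm3/mem : xmm1 = xmm1 * xmm3 + xmm2
// vfmadd213ph xmm1, xmm2, xmm3/mem : xmm1 = xmm2 * xmm1 + xmm3
// vfmadd231ph xmm1, xmm2, xmm3/mem : xmm1 = xmm2 * xmm3 + xmm1
// Which flavor a rule emits simply follows from which Fma input the matcher
// has bound to the destination register.
```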

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21490#discussion_r1847906271
PR Review Comment: https://git.openjdk.org/jdk/pull/21490#discussion_r1847906153
PR Review Comment: https://git.openjdk.org/jdk/pull/21490#discussion_r1847907028
PR Review Comment: https://git.openjdk.org/jdk/pull/21490#discussion_r1847906530
