On Tue, 21 Oct 2025 01:27:06 GMT, Jatin Bhateja <[email protected]> wrote:

> Merge from lworld in lworld+vector branch.
> 
> Apart from the usual merge conflict resolution, the patch also contains bug 
> fixes for some special handling around multifields.
> To segregate the code changes and maintain the history[1][2][3][4][5], these 
> bug fixes were initially checked into a temporary private merge branch and 
> are now part of the merge. 
> 
> 
> [1] Special copy constructor used for initializing base and synthetic 
> multifields 
>     
> https://github.com/jatin-bhateja/valhalla/commit/0fe3153e79251fe661ea1bd368472f2f75cb360d
> [2] Fix ciField population for multifields 
>     
> https://github.com/jatin-bhateja/valhalla/commit/4a26aff14801e010ea238e8d5a52a52465c6f849
> [3] Multifield handling during type domain buildup
>      
> https://github.com/jatin-bhateja/valhalla/commit/74ca0245c5d9f5eb186baa962e11d3f5c8f4ed35
> [4] prevent store forwarding across CPU memory barrier
>      
> https://github.com/jatin-bhateja/valhalla/commit/cfc71acd89b2e3233d788565f1cf1061651a70f9
> [5] Fix to accommodate additional padding during offset computation of 
> multifields
>      
> https://github.com/jatin-bhateja/valhalla/commit/75ec6296531839ab375efa58b909827269994449
> 
> After the latest merge, all the VectorAPI and Valhalla JTREG tests pass 
> with -XX:-UseNonAtomicValueFlattening at various AVX levels. With 
> -XX:+UseNonAtomicValueFlattening there are some failures, which will be 
> addressed after the merge commit.

For the record, copy-pasting the comments from the closed PR 
https://github.com/openjdk/valhalla/pull/1593

Hi @liach, for now, the VectorAPI fallback makes heavy use of Unsafe APIs for 
explicit larval transitions.
A whole lot of discussion went into the usage of the @multifield annotation 
for contiguous backing-storage allocation. Sharing some links for your 
reference.

https://cr.openjdk.org/~jrose/values/multi-field.html
https://cr.openjdk.org/~jbhateja/Valhalla/Valhalla-VectorAPI-OracleF2F-3-Apr_2024.pptx
 
----------

Hi @liach , 

Currently, Unsafe.put* APIs expect to operate on a mutable value, without 
Unsafe.makePrivateBuffer, there is no way to transition a value object to 
larval state.

<img width="800" height="400" alt="image" src="https://github.com/user-attachments/assets/af826cda-55e1-4b0c-a2ea-62592f7623d6" />


Here is a typical update kernel from the n-ary operation fallback implementation.

<img width="500" height="200" alt="image" src="https://github.com/user-attachments/assets/4a31baa7-52b8-4e0b-8c42-924407bb5665" />



**Here are some relevant FAQs on the need for the multifield annotation.**

Q. Why do we need a @multifield annotated field in VectorPayloads rather than 
just retaining the array-backed backing storage?
A. Even currently, Vector instances are immutable: each modification or 
application of an operation generates a new vector.
   Each new vector has distinct backing storage in the form of an array; thus, 
no two vectors ever share their backing storage, which makes vectors immutable 
quantities.

     Vector<Float>  newVector  =  Vec1.lanewise(VectorOperators.ADD, Vec2);

Since arrays are always allocated on the heap, they carry an identity, namely 
the distinct heap address of each new backing-storage array.

This contradicts the philosophy of value type instances, which are 
identity-free; the compiler treats two values with the same contents as 
equivalent entities and is free to substitute one for the other.

By replacing the existing array-backed storage with @multifield annotated 
storage, we ensure that the payload adheres to true value semantics. A 
@multifield is seen as a bundle of fields, the encapsulating payload is a value 
class, and, unlike an array, a multifield is never allocated explicit heap 
storage.

Here is an example:

<img width="400" height="500" alt="image" src="https://github.com/user-attachments/assets/14cb3fbc-15cf-461e-846e-b200f744f793" />


Even though Payload is a value class, two of its instances with identical 
backing-storage contents are not equal, because arrays have identity.
By treating vectors as value objects, we expect two vectors with the same 
contents to be equal.
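To make the identity problem concrete, here is a small sketch that runs on stock Java (no Valhalla build needed); `ArrayIdentityDemo` and `sameIdentity` are illustrative names, not code from the branch:

```java
import java.util.Arrays;

public class ArrayIdentityDemo {
    static boolean sameIdentity(float[] a, float[] b) {
        return a == b; // compares heap addresses, not contents
    }

    public static void main(String[] args) {
        float[] a = {0f, 1f, 2f, 3f};
        float[] b = {0f, 1f, 2f, 3f};
        // Same lane contents, but the two arrays are distinct heap objects:
        System.out.println(Arrays.equals(a, b)); // true
        System.out.println(sameIdentity(a, b));  // false
    }
}
```

This is exactly the mismatch described above: array contents can be equal while the arrays themselves never are.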

Q.  Is there any alternative to @multifield?
A.  All we need to ensure is that the backing storage has no identity.  Thus, 
we could have multiple primitive type fields in the payload, one for each lane 
of the vector. 

 
<img width="450" height="310" alt="image" src="https://github.com/user-attachments/assets/af977c64-6373-4533-aceb-6d283cecd094" />

Consider the modified payload class ‘TrueValuePayload’ above: here we create a 
separate primitive type field for each lane of the vector, so two vector 
values with the same contents are treated as equivalent entities.

With @multifield we intend to create identity-free backing storage.
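On stock Java, records give a rough approximation of this content-based equality, since a record's generated equals/hashCode compare components rather than identity. The sketch below mirrors the per-lane-field idea; the names are illustrative, not from the branch:

```java
// Rough stock-Java analogue of the per-lane-field payload: a record's
// equals() compares component contents, mimicking identity-free semantics.
record LanePayload(float f0, float f1, float f2, float f3) {}

public class PayloadEqualityDemo {
    public static void main(String[] args) {
        LanePayload p1 = new LanePayload(0f, 1f, 2f, 3f);
        LanePayload p2 = new LanePayload(0f, 1f, 2f, 3f);
        System.out.println(p1.equals(p2)); // true: same lane contents
        System.out.println(p1 == p2);      // false: records still have identity
    }
}
```

Unlike a true value class, a record still carries identity (`==` is false above); only Valhalla makes the two notions coincide.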


Q.  What are the complications with explicit fields for each lane in the 
container payload?
A.  First, at load time, build_layout randomizes the field layout, but that is 
not a major concern since fields can always be updated using their effective 
offsets. **For a multifield update, the existing reflective name-based APIs 
don't suffice; we need an Unsafe API that accepts offsets, since a multifield 
has hidden/synthetic fields which can only be accessed by adding offsets 
relative to the base field.**

A bigger concern is that the C2 compiler will see the scalar fields as 
distinct inputs to the InlineTypeNode. The compiler creates an InlineTypeNode 
for each instance of a value class and scalarizes its fields; with distinct 
scalar fields we would observe multiple inputs to the IR, one for each lane. 
Each scalar field would be of a primitive type, while the type of the 
InlineTypeNode is the payload type, since an inline type node corresponds to a 
value object. We expect to deal in a vector-typed field, i.e., one that 
carries the ideal type TypeVect.

Vector IR is forwarded to its users every time we perform a vector_unbox 
operation over a VectorPayload; this way, vector IR inputs are directly 
forwarded to the vector operation nodes.

Keeping multiple scalar fields, one for each lane, in the VectorPayload class, 
while creating vector IR for the bundle of lanes, would require adding kludges 
to the code. Consider the following case:

     var v1 = FloatVector.fromArray(SP, arr1, index);
     var v2 = FloatVector.fromArray(SP, arr2, index);
     res = v1.lanewise(VectorOperators.ADD, v2);

    v1 and v2 are concrete vector class instances, where the vector class is a 
value type. With stock Valhalla, the IR corresponding to the fromArray API 
will look like the following:

                                           
<img width="374" height="224" alt="image" src="https://github.com/user-attachments/assets/5f923efe-5b42-4e47-b04f-13abd4cfeee8" />


While the expected IR shape should look like the following:

 
<img width="354" height="313" alt="image" src="https://github.com/user-attachments/assets/26d1ce93-f4d4-48fd-afaf-1b32a0904d2d" />


------------

Hi @liach ,

Here is a quick sync-up on the merge status.

I started the merge yesterday, but it may take a few more days, as there have 
been lots of changes in the code over the last four months. I am studying the 
code changes and documenting the new code model in a crystallized form. The 
following is our new model for larval objects.

Core concept (new model):-
  - A value object can be updated when it is in the larval state, either 
explicitly through makePrivateBuffer or implicitly during new object creation.
  - A value in a larval state must be in a non-scalarized form, i.e., it should 
exist as an oop.
  - Updates to a larval value field are done by emitting an explicit store 
operation/IR. This mandates an oop form of the larval object; this is a major 
change from the earlier model, where putfield could directly update the 
incoming edge of the scalarized InlineTypeNode IR. Given that a value object is 
an immutable quantity, any update is only possible by creating a new value, 
thereby going through a temporary larval state.
  - A value object is transitioned to a non-larval state after the constructor 
call or after finishPrivateBuffer.
  - For the vector API, the backing storage is in the form of a @Multifield 
payload. Since only the base field can be exposed in Java code, and the rest 
of the synthetic fields are internally generated based on annotation 
processing during class file parsing, we definitely need an Unsafe API to 
update all the synthetic fields.
  - The new flexible constructor does allow updates to fields before the super 
call, but we still cannot pass uninitializedThis to an external initialization 
routine (like an Unsafe API) from a constructor. As a result, currently the 
VectorPayload*MF constructors only initialize the base multifield through a 
putfield bytecode. [JDK-8368810](https://bugs.openjdk.org/browse/JDK-8368810) 
tracks this issue.

In the current context, we have the following options to perform vector updates 
in the fallback implementation:-

Premise: value objects are immutable, and two vector values with the same lane 
contents must be treated as equal.


  **a/ Current update loop:-**

      res = Unsafe.makePrivateBuffer(arg1);
      for (int i = 0; i < laneCount; i++) {
          larg1 = Unsafe.getFloat(arg1, offset[i]);
          larg2 = Unsafe.getFloat(arg2, offset[i]);
          Unsafe.putFloat(res, offset[i], larg1 + larg2); // res is in oop form since it is in a mutable state
      }
      res = Unsafe.finishPrivateBuffer(res); // re-create scalarized representation from larval state

         
The lifetime of a larval object is very restricted today: for Java application 
compilation, javac does not emit any bytecode between the new and its 
subsequent initializing constructor call. Using handcrafted bytecode, we can 
have a gap between an uninitialized allocation and its initializing 
constructor, but the JVM's runtime bytecode verification rules catch any 
undefined behavior. Unsafe operations are outside the purview of JVM bytecode 
verification.

The problem arises when the lifetime of the larval object being acted upon by 
an Unsafe API spans a block or even a loop of bytecode; in such cases, we may 
need to handle larval and non-larval merges OR emit an undefined behavior 
error.

  Q. How performant are the Unsafe get/put operations?
  A. There are intrinsic and native implementations of the get/put APIs. Given 
that it is an unsafe operation, the compiler does not emit any out-of-bounds 
checks; that is the user's responsibility. In that case, an Unsafe get simply 
computes an effective address (base + offset) and emits load/store 
instructions. Is a loop involving Unsafe operations auto-vectorized? Yes, 
since the auto-vectorizer looks for contiguity in the address across iterations.
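A plain-array analogue of that base+offset access pattern (no Unsafe, just unit-stride indexing) shows the kind of loop C2's auto-vectorizer recognizes; `LaneKernel` and `addLanes` are illustrative names:

```java
public class LaneKernel {
    // Contiguous, unit-stride loads and stores: the same effective-address
    // pattern (base + i * elementSize) the auto-vectorizer looks for
    // across iterations.
    static float[] addLanes(float[] a, float[] b) {
        float[] res = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            res[i] = a[i] + b[i];
        }
        return res;
    }

    public static void main(String[] args) {
        float[] r = addLanes(new float[]{1f, 2f, 3f, 4f},
                             new float[]{10f, 20f, 30f, 40f});
        System.out.println(java.util.Arrays.toString(r)); // [11.0, 22.0, 33.0, 44.0]
    }
}
```

The Unsafe-based kernel above reduces to the same address arithmetic once intrinsified, which is why it remains auto-vectorizable.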

Consider the impact of the long-lived larval res in the above update loop on 
intrinsification:


               <statistics type='intrinsic'>
                      Compiler intrinsic usage:
                         2 (50.0%) _getFloat (worked)
                         1 (25.0%) _putFloat (worked)
                         1 (25.0%) _makePrivateBuffer (worked)
                         0 ( 0.0%) _finishPrivateBuffer (failed)
                         4 (100.0%) total (worked,failed)
                    </statistics> 



res was made a larval quantity using makePrivateBuffer, but scalarization was 
triggered at gen_checkcast, which observed a value type oop that happened to 
be in the larval state. As a result, finishPrivateBuffer receives an 
InlineTypeNode while it always expects a larval value oop, and hence 
intrinsification fails. This has a side effect on auto-vectorization, which 
does not observe an IR expression for the store and its associated address 
computation since we take the native implementation route. Hence, the 
performance of the fallback implementation degrades.

Q. What are the possible solutions here?
A. The compiler should never try to scalarize the larval oop at a checkcast 
bytecode; this handling exists because type profiling sharpens the oop's type 
to a value type, thus facilitating scalarization. Another solution is to make 
the finishPrivateBuffer intrinsic more robust; it should simply return the 
InlineTypeNode after setting Allocation->_larval to false. This is attempted 
by an unmerged patch[1].
All in all, there are loopholes if we increase the life span of the larval 
object; in this case, it spans the loop.

  **b/ Other larval lifetime limiting options:-**

In the fallback implementation, we intend to operate at the lane level; it may 
turn out that the generated IR is auto-vectorizable, but that is purely a 
performance optimization by the compiler and was not the intent of the user in 
the first place. We read the input lanes using an Unsafe get; we need Unsafe 
here since synthetic multifields are not part of user-visible Java code and 
have no symbol or type information in the constant pool. Thus, with 
multifields, we must use Unsafe APIs. Unsafe.get does not mandate a larval 
transition of the input value vector. What needs to change? To limit the 
larval lifetime, we need to allocate a temporary buffer before the loop for 
storing the result of the computation loop.

We need a new Unsafe API that accepts the temporary buffer and the result 
value object. This API will internally buffer the scalarized value, update its 
fields from the temporary result storage, and then re-materialize the 
scalarized representation from the updated buffer.

  Q. Will this API again have both native and intrinsic implementations?
  A. Yes, it will definitely have a native implementation, and the intrinsic 
implementation should make use of vector IR if the target supports the 
required vector size, OR it will need to emit scalar load/store IR. We can be 
intelligent in the intrinsic implementation and optimize this copy operation 
using a narrower vector.

  Q. Will this be performant compared to the existing solution?
  A. The additional allocation of a temporary buffer is costly; we can bank on 
escape analysis and loop unrolling to eliminate this allocation, but given 
that the temporary storage is passed to a native routine, it escapes the 
scope. With a vector IR-based intrinsic implementation, we need a contiguous 
memory view of the temporary storage; otherwise, we may need to emit multiple 
scalar inserts into a vector, which is very costly. Overall, we don't expect 
the new implementation to be performant, mainly due to the additional 
allocation penalty. BTW, even today, vectors are immutable and vector factory 
routines are used in the fallback implementation to allocate memory for 
backing storage; thus we allocate not only a new concrete vector but also its 
backing storage. With current Valhalla and flat payloads we need only one 
allocation for the vector and its backing storage; with the new approach we 
may have to give away that benefit in exchange for a reduced larval lifetime. 
It may turn out that the new approach is inferior in performance to the 
existing VectorAPI-Valhalla solution, but eventually at par with mainline 
support.

So it's tempting to try out the new solution to limit the complexity due to 
long-lifespan larval objects.

   **c/ Alternative approach: create a new immutable value with each iteration 
of the update loop; this is closest to the new code model.**

      res = __default_value__;
      for (int i = 0; i < laneCount; i++) {
          larg1 = Unsafe.getFloat(arg1, offset[i]);
          larg2 = Unsafe.getFloat(arg2, offset[i]);
          res   = Unsafe.updateLane(res, larg1 + larg2); // res is a fresh immutable value each iteration
      }
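Since Unsafe.updateLane above is only a proposed API, here is a hedged stock-Java sketch of what each iteration costs under approach c/: every lane update allocates a whole new immutable payload, copying all unmodified lanes. `Payload4`, `withLane`, and `addLanes` are hypothetical names for illustration:

```java
public class CopyPerUpdateDemo {
    // Immutable four-lane payload; any "update" allocates a fresh instance.
    record Payload4(float f0, float f1, float f2, float f3) {
        Payload4 withLane(int lane, float v) {
            // Copies the three unmodified lanes on every call: the
            // per-iteration save/restore penalty described above.
            return new Payload4(lane == 0 ? v : f0, lane == 1 ? v : f1,
                                lane == 2 ? v : f2, lane == 3 ? v : f3);
        }
    }

    static Payload4 addLanes(float[] a, float[] b) {
        Payload4 res = new Payload4(0f, 0f, 0f, 0f); // stands in for __default_value__
        for (int i = 0; i < 4; i++) {
            res = res.withLane(i, a[i] + b[i]); // new immutable value each iteration
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(addLanes(new float[]{1, 2, 3, 4},
                                    new float[]{10, 20, 30, 40}));
    }
}
```

One full-width copy per lane update is exactly the overhead that makes this variant the most expensive of the three options.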


This also limits the lifetime of the larval value object, but on the downside 
it performs buffering and scalarization on each iteration. When we buffer, we 
not only allocate new temporary storage but also emit field-level stores to 
dump the contents into it; this is very expensive if done for each lane 
update, given that we are only updating one lane of the vector but still 
paying the penalty to save and restore all unmodified lanes per iteration.

While this does limit the larval life span, it carries a huge performance 
penalty in comparison to both of the existing approaches: larval lifetime 
spanning the loop, and a single update with a temporary storage allocation.

In the vector API, the fallback implementation is wrapped within the intrinsic 
entry point. As long as intrinsification is successful, we are not hit by the 
fallback; a poor fallback may delay the warmup stage, but for long-running 
SIMD kernels, warmup is never an issue as long as intrinsifications are 
successful.

I have created a prototype application [2] mimicking the vector API use case 
for you. Please give it a try with the existing lworld+vector branch after 
integrating the patch[1].

**While I am merging and understanding the implications of the new code model, 
it would help if you could share your plan/test sample for the latest API to 
limit the scope of larval transitions, keeping in view the above context.**

**Some notes on the implications of patch[1]**
Currently, the compiler does not handle a vector calling convention arising 
from scalarization. This is a concern with the generic handling of 
multifields, where the return value is a container of a multifield and the 
multifield value is held in a vector register; in such a case, the caller will 
expect to receive the value in a vector register. With the vector API we deal 
in abstract vectors across method boundaries, which are never scalarized. In 
any case, extending the calling convention is on our list of items for 
Valhalla.

To circumvent this issue seen with the test case, we can disable scalarization 
of return values using -XX:-InlineTypeReturnedAsFields.

[1] 
https://github.com/jatin-bhateja/external_staging/blob/main/Backup/lworld_vector_performance_uplift_patch_30_9_2025.diff

[2] 
https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/valhalla/valhalla_401/valhalla_vector_api_mf_proto.java

--------------------------------

Hi @liach ,

I am working on refreshing the lworld+vector branch and also thinking through 
some post-merge improvements.
For the record, adding my notes in the comments.

Post-merge improvements:-
    - Replace the VectorBox and VectorBoxAllocate nodes with InlineTypeNode.
    - New ideal transforms to sweep out intermediate logic after inline expansion.
  
  Code snippet:-


Source:-
       vec1 = fromArray(SRC1);             - [1]
       vec2 = fromArray(SRC2);             - [2]
       res = vec1.lanewise(MUL, vec2);     - [3]
       res.intoArray(RES);                 - [4]


  Intermediate Representation for [1][2]
       
       LoadVector 
          |
          v
       InlineTypeNode (VectorPayload)
          |
          v
       InlineTypeNode (Double256Vector)
          |
          v
       CastPP (NotNull) 
          |
          v
       CheckCastPP (Double256Vector)
        |
        v     
       [x] -> AddP -> LoadVector
        |                  |
        |                  v
        |             InlineTypeNode (VectorPayload)      [[ make_from_oop ]]
        |                  |
        | oop              v
        |-----------> InlineTypeNode (Double256Vector)
                           |
                           v
                          [y]


  Intermediate Representation for [3]

           [y]         [y]
            |           |
            v           v
      VectorUnbox   VectorUnbox
            |         /
            |        /
            |       /
            |      /
            |     /
            |    /
            |   /
            [Mul]
              |
              v
        InlineTypeNode (VectorPayloadMF)
              |
              v
        InlineTypeNode (Double256Vector)
              |
              v
            CastPP
              |
              v
          CheckCastPP
              |
              v    
             [x] -> AddP -> LoadVector
              |                  |
              |                  v
              |             InlineTypeNode (VectorPayload)      [[ make_from_oop ]]
              |                  |
              | oop              v
              |-----------> InlineTypeNode (Double256Vector)
                                 |
                                 v
                                [y]
 
 Intermediate Representation for [4]

         [y]
          |
          v
    VectorUnboxNode
          |
          v
     StoreVector

New Transforms for boxing + unboxing optimizations :-
-------------------------------------------------------------

- Unboxing will simply be reduced to fetching the correct vector IR from its 
preceding input graph snippet.
- In contexts where boxing is needed, we will bank on process_inline_types 
for buffering.

   Q. Do we have sufficient infrastructure for lazy buffering during 
process_inline_types in the downstream flow?
   A. No, buffering is done upfront as it needs precise memory and control 
edges. Currently, the compiler performs eager buffering in scenarios that 
mandate non-scalarized forms of value objects, i.e., non-flat field 
assignment or passing value type arguments when InlineTypePassFieldsAsArgs 
is not set.
       
     Q. What if we do upfront buffering of the InlineTypeNode corresponding to 
vector boxes?
     A. A buffer and its associated InlineTypeNode must be in sync at all 
times. Any field update of a value object creates a new value; this can only 
be achieved through a new value instance allocation, i.e., it must go through 
an intermediate larval state. All the fields of a value instance are final; 
thus, we cannot use putfield to update their contents, and getfield returns 
the initializing value of the value instance field.
         
        MyValue (value type)
            - float f1;
            - float f2;
            - float f3;
            - float f4;

        val1 = new MyValue(0.0f, 1.0f, 2.0f, 3.0f);
            -->  new MyValue (oop)
                  |
                  --> make_from_oop(oop) post construction

        val2 = new MyValue(val1.f1, 2.0f, val1.f3, val1.f4)  // getfield val1.f1 will pass 0.0f
                  |                                          // getfield val1.f3 will pass 2.0f
                  |                                          // getfield val1.f4 will pass 3.0f
                  |
                  --> make_from_oop(oop) post construction

        return val2.f1 + val2.f2; will return 0.0f + 2.0f; thus there will be
        no consumers of the InlineTypeNodes, and they will be swept out along
        with their allocations.
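The arithmetic in the sketch above can be checked on stock Java with a plain final-field class standing in for the value type; `FinalFieldDemo` and `demo` are illustrative names:

```java
public class FinalFieldDemo {
    // Stand-in for the value type: all fields final, set once in the constructor.
    static final class MyValue {
        final float f1, f2, f3, f4;
        MyValue(float f1, float f2, float f3, float f4) {
            this.f1 = f1; this.f2 = f2; this.f3 = f3; this.f4 = f4;
        }
    }

    static float demo() {
        MyValue val1 = new MyValue(0.0f, 1.0f, 2.0f, 3.0f);
        // Reads of final fields return their initializing values:
        MyValue val2 = new MyValue(val1.f1, 2.0f, val1.f3, val1.f4);
        return val2.f1 + val2.f2; // 0.0f + 2.0f
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2.0
    }
}
```

Since the result is fully determined by the initializing constants, a compiler can fold the whole chain and sweep the intermediate nodes, as the sketch says.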

      Q. In the new code model, is an InlineTypeNode always backed by an 
allocation?
      A. An inline type node is created from the initializing this pointer 
passed as Parm0 of the constructor, unless it is created through 
make_from_multi, i.e., from scalarized argument parameters.


(gdb) l
889         if (cg->method()->is_object_constructor() && receiver != nullptr && 
gvn().type(receiver)->is_inlinetypeptr()) {
890           InlineTypeNode* non_larval = InlineTypeNode::make_from_oop(this, 
receiver, gvn().type(receiver)->inline_klass());
891           // Relinquish the oop input, we will delay the allocation to the 
point it is needed, see the
892           // comments in InlineTypeNode::Ideal for more details
893           non_larval = non_larval->clone_if_required(&gvn(), nullptr);
894           non_larval->set_oop(gvn(), null());
895           non_larval->set_is_buffered(gvn(), false);
896           non_larval = gvn().transform(non_larval)->as_InlineType();
897           map()->replace_edge(receiver, non_larval);
898         }


     
  [Action Item] 
        - Try to always link the backing oop to the newly scalarized 
InlineTypeNode; since the new code model always creates a new larval value on 
every field modification, the oop and scalarized nodes are always in sync.

  [Results] 
         Working on standalone tests; need to regress through the performance 
and validation suites.
            - Functional validation is almost fine.
            - Also, most of the InlineTypeNodes are swept out unless there is 
a specific context that needs a materialized value.

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1688#issuecomment-3640741548
