[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

carll at gcc dot gnu.org via Gcc-bugs Thu, 13 Jun 2024 08:27:12 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115466


--- Comment #4 from Carl Love <carll at gcc dot gnu.org> ---
comment 1

Yes, I can confirm if I add the alignment statement to the declarations it
works fine.  

I originally tried to use the built-in as part of a re-write of a function in
the Milvus AI source code.  The function fvec_L2sqr_batch_4_ref() accounts for
about 90% of the workload run time.  The code is:

fvec_L2sqr_batch_4_ref_v1(const float* x, const float* y0, const float* y1, 
   const float* y2, const float* y3, const size_t d, float& dis0, float& dis1,
   float& dis2, float& dis3) {

    float d0 = 0;
    float d1 = 0;
    float d2 = 0;
    float d3 = 0;
    for (size_t i = 0; i < d; ++i) {
        const float q0 = x[i] - y0[i];
        const float q1 = x[i] - y1[i];
        const float q2 = x[i] - y2[i];
        const float q3 = x[i] - y3[i];

        d0 += q0 * q0;
        d1 += q1 * q1;
        d2 += q2 * q2;
        d3 += q3 * q3;
    }
    dis0 = d0;
    dis1 = d1;
    dis2 = d2;
    dis3 = d3;
}

When compiled with -O3, it does generate vsx instructions.  But I noticed that
it was not using the vsx multiply add instructions.  So, I tried rewriting it
to explicitly load a vector with the vec_ld built-in, followed by vec_sub and
vec_madd.  Which by the way gives a 45% reduction in the execution time for my
standalone test.  

At this point, not sure where the arguments get defined so adding the alignment
to the declaration is not so easy. 

That said, Peter mentioned the vec_xl built-in which does seem to work.  The
vec_xl does not require the data to be aligned from what I see in the PVIPR.


comment 3, Segher

I was looking at the PVIPR document when I chose the built-in for my rewrite. 
Looking at the documentation, it does say "Load a 16-byte vector from the
memory address specified by the displacement and the pointer, ignoring the
low-order bits of the calculated address."  In retrospect, I should have picked
up on the ignoring of the low-order bits to imply the addresses needed  to be
aligned.  It would really be good if the documentation explicitly said the data
must be 16-bye aligned.  That said, my bad for not reading/understanding the
documentation well enough.

The vec_xl documentation does not say anything about ignoring the lower bits of
the address.  So, in my case that is a better load built-in to use so I don't
have to try and find all the declarations for the arrays that could be passed
to the function.

It would be great to update the PVIPR to be more explicit about the alignment
needs.


Sorry everyone for the noise.  I think we can close the issue as "User error".

[Bug target/115466] rs6000 vec_ld built-in works on BE but not LE

Reply via email to