Hi Tim, I'll discuss the loads here for simplicity; the situation for stores is analogous.
There are a couple of differences between lvx and lxvd2x. The most important one is that lxvd2x supports unaligned loads, while lvx does not. You'll note that lvx zeroes out the lower 4 bits of the effective address in order to force an aligned load.

lxvd2x loads two doublewords into a vector register using big-endian element order, regardless of whether the processor is running in big-endian or little-endian mode. That is, the first doubleword from memory goes into the high-order bits of the vector register, and the second doubleword goes into the low-order bits. This is semantically incorrect for little-endian, so the xxpermdi swaps the doublewords in the register to correct for this.

At optimization level -O1 and higher, gcc will remove many of the xxpermdi instructions that are added to correct for LE semantics. In many vector computations, it does not matter in which lanes the computations are performed, so we don't have to perform the swaps.

For unaligned loads where we are unable to remove the swaps, lxvd2x/xxpermdi is still better than the alternative using lvx. With lvx, an unaligned load requires a four-instruction sequence to load the two aligned quadwords that contain the desired data, set up a permutation control vector, and combine the desired pieces of the two aligned quadwords into a vector register. This can be pipelined in a loop so that only one load occurs per loop iteration, but that requires additional vector copies. The four-instruction sequence takes longer and increases vector register pressure more than an lxvd2x/xxpermdi pair.

When the data is known to be aligned, lvx performs the same as lxvd2x if we are able to remove the permutes, and is preferable to lxvd2x if not.

There are cases where we do not yet use lvx in lieu of lxvd2x when we could do so and improve performance. For example, saving and restoring vector parameters in a function prolog and epilog does not yet always use lvx. This is a performance opportunity we plan to address in the future.
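
To make the address masking and the doubleword ordering concrete, here is a small portable C model of the behaviour described above. It is only a sketch and does not use the real instructions: the vreg_model type and the function names are made up for illustration, the 128-bit register is modelled as two 64-bit halves, and each doubleword is assumed to be read in the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as two halves. */
typedef struct {
    uint64_t hi; /* high-order doubleword of the register */
    uint64_t lo; /* low-order doubleword of the register  */
} vreg_model;

/* lvx ignores the low 4 bits of the effective address, so the
   load always starts on a 16-byte boundary. */
static uintptr_t lvx_effective_address(uintptr_t ea)
{
    return ea & ~(uintptr_t)0xF;
}

/* lxvd2x: no alignment requirement; the first doubleword in memory
   lands in the high-order half, the second in the low-order half. */
static vreg_model model_lxvd2x(const void *mem)
{
    vreg_model r;
    memcpy(&r.hi, mem, sizeof r.hi);
    memcpy(&r.lo, (const unsigned char *)mem + 8, sizeof r.lo);
    return r;
}

/* The xxpermdi used for the LE fix-up: swap the two doublewords. */
static vreg_model model_xxpermdi_swap(vreg_model r)
{
    vreg_model s = { r.lo, r.hi };
    return s;
}

/* Little-endian load as the compiler emits it before any swap
   removal: lxvd2x followed by xxpermdi. Afterwards the first
   doubleword from memory sits in the low-order half. */
static vreg_model model_le_load(const void *mem)
{
    assert(mem != NULL);
    return model_xxpermdi_swap(model_lxvd2x(mem));
}
```

In this model, calling model_le_load on a buffer holding doublewords A then B yields lo == A and hi == B, which is the little-endian element order; model_lxvd2x alone yields the opposite. Note that model_lxvd2x accepts any address, whereas lvx_effective_address shows how an lvx of a misaligned pointer would silently load from the enclosing aligned quadword instead.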

A rule of thumb for your purposes is that if you can guarantee that you are using aligned data, you should use vec_ld and vec_st, and otherwise you should use vec_vsx_ld and vec_vsx_st. Depending on your application, it may be worthwhile to copy your data into an aligned buffer before performing vector calculations on it. GCC provides attributes, such as __attribute__((aligned(16))), that allow you to specify alignment on a 16-byte boundary.

Note that the above discussion presumes POWER8, which is the only POWER hardware that currently supports little-endian distributions and applications. Unaligned loads and stores were less efficient on earlier processors, so the tradeoffs differ.

I hope this is helpful!

Bill Schmidt, Ph.D.
IBM Linux Technology Center

You wrote:
> I have a issue/question using VMX/VSX on Power8 processor on a little endian
> system.
> Using intrinsics function, if I perform an operation with vec_vsx_ld(...) -
> vec_vsx_st(), the compiler will add
> a permutation, and then perform an operations (memory correctly aligned)
> lxvd2x ...
> xxpermdi ...
> operations ...
> xxpermdi
> stxvd2x ...
> If I use vec_ld() - vec_st()
> lvx
> operations ...
> stvx
> Reading the ISA, I do not see a real difference between this 2 instructions (
> or I miss it)
> So my 3 questions are:
> Why do I have permutations ?
> What is the cost of these permutations ?
> What is the difference vec_vsx_ld and vec_ld for the performance ?