> > Do I understand correctly that the "only" issue is memory vs. register
> > element ordering? Thus a fixup could be as simple as extra shuffles
> > inserted after vector memory loads and before vector memory stores?
> > (with the hope of RTL optimizers optimizing those)?
>
> It's not even necessary to use explicit shuffles -- NEON has perfectly
> good instructions for loading/storing vectors in the "right" order, in
> the form of vld1 & vst1. I'm afraid the solution to this problem might
> have been staring us in the face for years, which is simply to forbid
> vldr/vstr/vldm/vstm (the instructions which lead to weird element
> permutations in BE mode) for loading/storing NEON vectors altogether.
> That way the vectorizer gets what it wants, the intrinsics can continue
> to use __builtin_shuffle exactly as they are doing, and we get to
> remove all the bits which fiddle vector element numbering in BE mode in
> the ARM backend.
>
> I can't exactly remember why we didn't do that to start with. I think
> the problem was ABI-related, or to do with transferring NEON vectors
> to/from ARM registers when it was necessary to do that... I'm planning
> to do some archaeology to try to see if I can figure out a definitive
> answer.

The ABI-defined vector types (uint32x4_t etc.) are defined to be in
vldm/vstm order.

Paul