> > > I can't exactly remember why we didn't do that to start with. I
> > > think the problem was ABI-related, or to do with transferring NEON
> > > vectors to/from ARM registers when it was necessary to do that...
> > > I'm planning to do some archaeology to try to see if I can figure
> > > out a definitive answer.
> >
> > The ABI defined vector types (uint32x4_t etc) are defined to be in
> > vldm/vstm order.
>
> There's no conflict with the ABI-defined vector order -- the ABI
> (looking at AAPCS, IHI 0042D) describes "containerized" vectors which
> should be used to pass and return vector quantities at ABI boundaries,
> but I couldn't find any further restrictions. Internally to a function,
> we are still free to use vld1/vst1 vector ordering. Using
> "containerized"/opaque transfers, the bit pattern of a vector in one
> function (using vld1/vst1 ordering internally) will of course remain
> unchanged if passed to another function and using the same ordering
> there also.

Ah, ok. If you make the ABI-defined types distinct from the GCC generic
vector types (as used by the vectorizer), then in principle that should
work. I agree that current GCC probably does not have the infrastructure
to do that, and some of the vector code plays a bit fast and loose with
type conversions/subregs.

Remember that it's not just function arguments: it's any interface
shared between functions, including structures and global variables.

> Actually making that work (especially efficiently) with GCC is a
> slightly different matter. Let's call vldm/vstm-ordered vectors
> "containerized" format, and vld1/vst1-ordered vectors "array" format. We
> need to do introduce the concept of marshalling vector arguments from
> array format to containerized format when passing them to a function,
> and unmarshalling those vector arguments back the other way on function
> entry. AFAICT, GCC does not have suitable infrastructure for
> implementing such functionality at present: consider that e.g. vectors
> passed by value on the stack should use containerized format, which
> means the called function cannot simply dereference the stack pointer
> to read the vector:

IIRC I/we tried to do something very similar (possibly the other way
around) by abusing the unaligned load mechanism. I don't remember why
that failed.

Paul
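
For concreteness, the marshalling scheme quoted above can be modelled in
plain C. This is only a sketch under stated assumptions, not how GCC does
or would implement it: the vec4 type and all function names are
hypothetical, no NEON intrinsics are used, and the lane permutation
(swapping the two 32-bit lanes within each 64-bit half, roughly what one
might expect on a big-endian target) is an assumption for illustration
rather than something taken from the AAPCS. It shows only the shape of the
idea: marshal at the call, unmarshal on entry, and do the reverse for the
return value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4;

/* ASSUMPTION (illustrative only): the "array" and "containerized" formats
   differ by swapping the two lanes inside each 64-bit half. Applying the
   swap twice restores the original, so it serves for both directions. */
static vec4 swap_within_halves(vec4 v)
{
    vec4 r = { { v.lane[1], v.lane[0], v.lane[3], v.lane[2] } };
    return r;
}

/* Array format -> containerized format, done by the caller before a call. */
static vec4 marshal_to_container(vec4 array_fmt)
{
    return swap_within_halves(array_fmt);
}

/* Containerized format -> array format, done by the callee on entry. */
static vec4 unmarshal_to_array(vec4 container_fmt)
{
    return swap_within_halves(container_fmt);
}

/* Callee: receives and returns containerized vectors at the boundary,
   but works in array format internally. */
static vec4 callee_add_one(vec4 container_arg)
{
    vec4 v = unmarshal_to_array(container_arg);
    for (int i = 0; i < 4; i++)
        v.lane[i] += 1;
    return marshal_to_container(v);
}

/* Caller: holds array-format data and converts at the call boundary. */
static vec4 caller(vec4 array_fmt)
{
    vec4 result = callee_add_one(marshal_to_container(array_fmt));
    return unmarshal_to_array(result);
}

/* Round-trip check: marshalling then unmarshalling must be the identity. */
static void roundtrip_check(vec4 v)
{
    vec4 back = unmarshal_to_array(marshal_to_container(v));
    assert(memcmp(&back, &v, sizeof v) == 0);
}
```

The same conversion would be needed wherever a vector crosses a shared
interface, so in this model a vector stored in a structure or global
variable would also be kept in containerized form and converted on each
access.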