On 17/09/12 20:04, Christophe Lyon wrote: > On 17 September 2012 20:04, Richard Earnshaw <rearn...@arm.com> wrote: >> On 17/09/12 16:50, Christophe Lyon wrote: >>> On 17 September 2012 17:21, Richard Earnshaw <rearn...@arm.com> wrote: >>>> On 17/09/12 16:13, Christophe Lyon wrote: >>>>> On 17 September 2012 14:56, Richard Earnshaw <rearn...@arm.com> wrote: >>>>>> On 05/09/12 23:14, Christophe Lyon wrote: >>>>>>> Hello, >>>>>>> >>>>>>> Although the recent optimization I have committed to use Neon vext >>>>>>> instruction for suitable builtin_shuffle calls does not support >>>>>>> big-endian yet, I have written a patch to the existing testcases such >>>>>>> they now support big-endian mode. >>>>>>> >>>>>>> I think it's worth improving these tests since writing the right masks >>>>>>> for big-endian (such that the program computes the same results as in >>>>>>> little-endian) is not always straightforward. >>>>>>> >>>>>>> In particular: >>>>>>> * I have added some comments in a few tests were it took me a while to >>>>>>> find the right mask. >>>>>>> * In the case of the test which is executed, I had to force the >>>>>>> noinline attribute on the helper functions, otherwise the computed >>>>>>> results are wrong in big-endian. It is probably an overkill workaround >>>>>>> but it works :-) >>>>>>> I am going to file a bugzilla for this problem. >>>>>>> >>>>>>> I have checked that replacing calls to builtin_shuffle by the expected >>>>>>> Neon vext variant produces the expected results in big-endian mode, >>>>>>> and I arranged the big-endian masks to get the same results. >>>>>>> >>>>>>> Christophe.= >>>>>>> >>>>>>> >>>>>>> neon-vext-big-endian-tests.patch >>>>>>> >>>>>>> >>>>>>> N ¬n‡r¥ªíÂ)emçhÂyhi× ¢w^™©Ý >>>>>>> >>>>>> >>>>>> >>>>>> I'm not sure about this. Looking at the documentation in the manual for >>>>>> builtin_suffle makes no mention of the results/behaviour being endian >>>>>> dependent, which makes me wonder why this test needs to be. >>>>>> >>>>>> R. >>>>> >>>>> >>>>> Indeed, but I had to modify the mask value in order to get the same >>>>> results in big and little-endian. >>>>> >>>>> If the mask should be the same (it would be much more confortable for >>>>> the developers indeed), then GCC needs to be changed/fixed. >>>>> >>>> >>>> That's what I'm trying to establish. I suspect that there is a bug in >>>> GCC for all big-endian code here. >>>> >>>> What happens for a test of uint8x8_t? >>>> >>> >>> Well, in my sample testcase in little-endian, I used mask = {2, 3, 4, >>> 5, 6, 7, 8, 9}, which can be optimized into vext #2. >>> >>> In big-endian mode, explicitly forcing use of vext #2 leads to the >>> right result, but to achieve it using builtin_shuffle, I had to change >>> the mask into {14, 15, 0, 1, 2, 3, 4, 5}. >>> >>> I did read the thread starting at >>> http://gcc.gnu.org/ml/gcc-patches/2011-10/msg01133.html and the >>> threads it references, and I must admit that I got a bit confused :-) >>> >>> IMHO, it's currently impossible for a GCC user to write code using >>> vector initializers that would be portable on big and little endian >>> targets. It's too much of a headache.... >>> >>> It was also a purpose of this patch: have someone react if it looked >>> inappropriate. >>> >>> Thanks for the review, >>> >>> Christophe. >>> >> >> I think for big-endian, __builtin_shuffle needs to expand to (for 64-bit >> vectors) >> >> vrev64.<size> mask >> vext >> >> and for 128-bit vectors >> >> vrev64.<size> mask >> vswap mask<top-dword>, mask<low-dword> >> vext ... >> >> Obviously, if you've got a literal you can simplify the operations down >> to something that doesn't need the extra instructions. >> >> R. >> > > Does this mean that the currently generated code is wrong (I mean when > no optimization is performed by the compiler, as it is currently the > case with GCC in big-endian) ?
Quite possibly. In order to determine what is right, we first need to understand the specification. My reading of that is that the semantics should be endian independent, but I was hoping that someone would know for certain and be able to chip in. The alternative would be to try the code on PPC and x86 to check the behaviour there for big and little endian respectively, but I don't have access to a PPC board and I'd rather not trust simulators for some of this. > When the input mask is known by the compiler, it generates a series a > moves to perform the shuffle operation. > Which theoretically should, I think, be doing the transform I described above. R.