On 11/14/2013 02:30 AM, Jakub Jelinek wrote:
> As discussed earlier, if we strictly follow the Intel ABI for simds,
> we run into various issues.  The clones then have to use __regcall calling
> convention which e.g. mandates that on x86_64 up to 16 vector arguments
> are passed in xmm/ymm registers (problem, because the dynamic linker
> during lazy binding can clobber ymm8 through ymm15), requires up to 16
> vector values returned in xmm/ymm registers (for e.g.
> #pragma omp declare simd simdlen(16)
> _Complex double foo (double);
> ) - we don't have infrastructure for that plus we'd need to teach backend(s)
> about that new calling convention, and declares {x,y}mm4-7 for 32-bit
> and {x,y}mm8-15 for 64-bit to be call saved (on 64-bit again there is a
> problem with that because the dynamic linker may clobber that, plus
> it is an issue for bt/up in the debugger (we don't save/restore those in
> unwind info and how big vectors would we save; note, elementals aren't
> allowed to throw or setjmp/longjmp (the standard doesn't mention
> setcontext/swapcontext etc. though)).

Sadly, the last time I reviewed Intel's document, I only looked at the
mangling itself, and ignored the calling convention addition.

I agree with you that the __regcall convention is broken as written.
I think we should ignore it until it gets fixed.

> So, shall we just use different ISA letters to make it clear we are ABI
> incompatible with ICC?

Yes, that is also prudent.

> I wonder if the generic representation
> just shouldn't be ISA 'a', which would pass all non-uniform/non-linear
> arguments as pointers to array of simdlen elements, and ditto for return
> value through first hidden argument.  For x86_64/i?86, because (at least on
> a tiny benchmark I've tried) the pointer arguments variant is somewhat
> slower, we would use ISA 'b', 'c', 'd' for SSE2/AVX/AVX2 (shall we do
> anything for AVX512-F too?) if simdlen is in between 2 and 16, otherwise
> we'd use 'a' and arrays too.

Pointers are certainly a decent fallback that would always be compatible,
but I wonder if we need go that far.

Each target will have a (set of) natural simdlen to which it vectorizes.
This is the set returned by autovectorize_vector_sizes.  That means we've
got registers of those sizes, and probably parameter passing of those
sizes will be efficient.

It's easy to split input parameters into multiples, as you've done; no
reason this can't apply generically.

It's the return value wider than the register size that's tricky.  Here
I think we may be best off returning a struct/array and letting the base
calling convention handle it.  Normally that _will_ be via a pointer, but
sometimes that pointer will be in some special non-parameter register.
Thus I think we're best off not performing the hidden argument conversion
manually.

We could generically use log2(vector_byte_size) + 'a' as the abi letter.

I'll look at the patches themselves later.


r~
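
For illustration only, the log2(vector_byte_size) + 'a' lettering and the
splitting of a simdlen-wide argument into register-sized pieces could be
sketched roughly as below.  The helper names simd_abi_letter and
simd_arg_pieces are hypothetical, not existing GCC functions, and the
sketch assumes the vector size in bytes is a power of two.

```c
#include <assert.h>

/* Hypothetical helper (not a GCC function): map a vector size in bytes
   to an ABI letter computed as log2 (vector_byte_size) + 'a'.
   Assumes the size is a nonzero power of two.  */
static char
simd_abi_letter (unsigned int vector_byte_size)
{
  unsigned int log2_size = 0;

  assert (vector_byte_size != 0
          && (vector_byte_size & (vector_byte_size - 1)) == 0);

  /* Count how many times the size can be halved before reaching 1.  */
  while (vector_byte_size > 1)
    {
      vector_byte_size >>= 1;
      log2_size++;
    }
  return (char) ('a' + log2_size);
}

/* Hypothetical helper: number of vector_byte_size-wide pieces needed to
   pass one argument of simdlen elements, each elt_byte_size bytes wide.
   The division rounds up, so a short argument still takes one piece.  */
static unsigned int
simd_arg_pieces (unsigned int simdlen, unsigned int elt_byte_size,
                 unsigned int vector_byte_size)
{
  unsigned int total_bytes = simdlen * elt_byte_size;

  return (total_bytes + vector_byte_size - 1) / vector_byte_size;
}
```

Under these assumptions a 16-byte vector size maps to 'e' and a 32-byte
one to 'f', and a simdlen(16) argument of double with 32-byte vectors
splits into 4 pieces.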