On 11/14/2013 02:30 AM, Jakub Jelinek wrote:
> As discussed earlier, if we strictly follow the Intel ABI for simds,
> we run into various issues.  The clones then have to use __regcall calling
> convention which e.g. mandates that on x86_64 up to 16 vector arguments
> are passed in xmm/ymm registers (problem, because the dynamic linker
> during lazy binding can clobber ymm8 through ymm15), requires up to 16
> vector values returned in xmm/ymm registers (for e.g.
> #pragma omp declare simd simdlen(16)
> _Complex double foo (double);
> ) - we don't have infrastructure for that plus we'd need to teach backend(s)
> about that new calling convention, and declares {x,y}mm4-7 for 32-bit
> and {x,y}mm8-15 for 64-bit to be call saved (on 64-bit again there is a
> problem with that because the dynamic linker may clobber that, plus
> it is an issue for bt/up in the debugger (we don't save/restore those in
> unwind info and how big vectors would we save; note, elementals aren't
> allowed to throw or setjmp/longjmp (the standard doesn't mention
> setcontext/swapcontext etc. though)).

Sadly, the last time I reviewed Intel's document, I only looked at the mangling
itself, and ignored the calling convention addition.

I agree with you that the __regcall convention is broken as written.
I think we should ignore it until it gets fixed.

> So, shall we just use different ISA letters to make it clear we are ABI
> incompatible with ICC?

Yes, that is also prudent.

> I wonder if the generic representation
> just shouldn't be ISA 'a', which would pass all non-uniform/non-linear
> arguments as pointers to array of simdlen elements, and ditto for return
> value through first hidden argument.  For x86_64/i?86, because (at least on
> a tiny benchmark I've tried) the pointer arguments variant is somewhat
> slower, we would use ISA 'b', 'c', 'd' for SSE2/AVX/AVX2 (shall we do
> anything for AVX512-F too?) if simdlen is in between 2 and 16, otherwise
> we'd use 'a' and arrays too.

Pointers are certainly a decent fallback that would always be compatible,
but I wonder if we need to go that far.

Each target will have a (set of) natural simdlen to which it vectorizes.  This
is the set returned by autovectorize_vector_sizes.  That means we've got
registers of those sizes, and probably parameter passing of those sizes will be
efficient.  It's easy to split input parameters into multiples, as you've done;
no reason this can't apply generically.
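
To make the splitting concrete, here is a rough sketch of the arithmetic only.
The helper name simd_arg_pieces and its parameters are made up for
illustration and are not part of any patch; it assumes simdlen times the
element size is either no larger than one vector or an exact multiple of the
vector size.

```c
#include <assert.h>

/* Hypothetical helper (not from the patches): how many vector-sized
   pieces one non-uniform, non-linear argument is split into.  */
unsigned int
simd_arg_pieces (unsigned int simdlen, unsigned int elt_bytes,
                 unsigned int vector_byte_size)
{
  assert (simdlen != 0 && elt_bytes != 0 && vector_byte_size != 0);

  unsigned int total_bytes = simdlen * elt_bytes;

  /* An argument no wider than one vector still takes one piece.  */
  if (total_bytes <= vector_byte_size)
    return 1;

  /* Assumption of this sketch: wider arguments split evenly.  */
  assert (total_bytes % vector_byte_size == 0);
  return total_bytes / vector_byte_size;
}
```

For example, a double argument with simdlen 8 and 32-byte vectors would be
passed as two pieces under this scheme.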

It's the return value wider than the register size that's tricky.  Here I think
we may be best off returning a struct/array and letting the base calling
convention handle it.  Normally that _will_ be via a pointer, but sometimes
that pointer will be in some special non-parameter register.  Thus I think
we're best off not performing the hidden argument conversion manually.
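
A minimal sketch of what I mean, in plain C.  The type simd_ret, the constant
SIMDLEN and the function square_simd_clone are hypothetical names; the input
is taken through a pointer only to keep the example portable, which is not a
statement about how inputs should be passed.  The point is that the wide
result is returned by value as a struct, and the base calling convention
decides where any hidden pointer lives.

```c
#include <assert.h>

#define SIMDLEN 8

/* Hypothetical wide return type: SIMDLEN doubles wrapped in a struct,
   so the existing aggregate-return rules apply.  */
struct simd_ret
{
  double lane[SIMDLEN];
};

/* Hypothetical clone of a scalar "double f (double x) { return x * x; }".
   No hidden return argument is written out by hand here.  */
struct simd_ret
square_simd_clone (const double *x)
{
  assert (x != 0);

  struct simd_ret r;
  for (int i = 0; i < SIMDLEN; i++)
    r.lane[i] = x[i] * x[i];
  return r;
}
```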

We could generically use log2(vector_byte_size) + 'a' as the abi letter.
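
Spelled out as a sketch (simd_abi_letter is a hypothetical name, and the
sketch assumes the vector size is a power of two):

```c
#include <assert.h>

/* Hypothetical helper: map a vector size in bytes to an ABI letter
   as log2 (vector_byte_size) + 'a'.  */
char
simd_abi_letter (unsigned int vector_byte_size)
{
  /* Assumption of this sketch: only power-of-two sizes are valid.  */
  assert (vector_byte_size != 0
          && (vector_byte_size & (vector_byte_size - 1)) == 0);

  /* Count how many times the size can be halved, i.e. its log2.  */
  unsigned int shift = 0;
  while (vector_byte_size > 1)
    {
      vector_byte_size >>= 1;
      shift++;
    }
  return (char) ('a' + shift);
}
```

With this formula 16-byte vectors get 'e', 32-byte vectors 'f', and 64-byte
vectors 'g'.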

I'll look at the patches themselves later.


r~
