rjmccall added a comment.

In D108643#3000223 <https://reviews.llvm.org/D108643#3000223>, @erichkeane 
wrote:

> In D108643#3000193 <https://reviews.llvm.org/D108643#3000193>, @rjmccall 
> wrote:
>
>> In D108643#2999776 <https://reviews.llvm.org/D108643#2999776>, @erichkeane 
>> wrote:
>>
>>>> In D108643#2965852 <https://reviews.llvm.org/D108643#2965852>, @rjmccall 
>>>> wrote:
>>>>
>>>> The choice that high bits are unspecified rather than extended is an 
>>>> interesting one.  Can you speak to that?  That's good for +, -, *, &, |, 
>>>> ^, <<, and narrowing conversions, but bad for ==, <, /, >>, and widening 
>>>> conversions.
>>>
>>> So we chose this for a few reasons:
>>>
>>> 1- Consistency with struct padding bits.  It seemed strange to specify the 
>>> padding bits here, when the rest of the standards/ABI don't specify padding 
>>> bits.
>>
>> I think it's a mistake to think of these as padding bits.  Implementations 
>> on general-purpose hardware will be doing operations in word-sized chunks; 
>> these are the high bits of the most significant word.
>
> I guess that's fair?
>
>>> 2- Flexibility of implementation: Requiring zeroing is a more constraining 
>>> decision, which limits implementation to having to set these bits.  By 
>>> leaving it unspecified, the implementation is free to zero them out if it 
>>> feels it is worth-while.
>>
>> This is a trade-off.  Extending constrains the implementation of operations 
>> that produce noise in the high bits.  Not extending constrains the 
>> implementation of operations that are affected by noise in the high bits.  
>> I'm willing to believe that the trade-off favors leaving the bits undefined, 
>> but that's why I'm asking, to see if you've actually evaluated this 
>> trade-off, because it kind of sounds like you've evaluated one side of it.
>
> Perhaps I'm missing something... the intent in the ABI/standard was to leave 
> it unconstrained so that an implementation could choose to zero it out if 
> possible.  We DID consider zeroing the padding bits internally.  On our 
> hardware (see the reasoning below) there was no performance advantage.

The question is whether you can rely on extension at places that receive an 
arbitrary ABI-compatible value, like function parameters or loads.  If nobody 
who receives such a value can rely on extension having been done, then the ABI 
is not meaningfully "leaving it unconstrained": it is constraining some places 
by the lack of constraint elsewhere.  That is why this is a trade-off.

>>> I'll note that our backends choose NOT to zero them out when not necessary, 
>>> since (so I'm told) 'masked' compares are trivial in most processors.
>>
>> They're trivial to implement in custom hardware, of course, but what 
>> existing ISAs actually provide masked compare instructions?  Is this a 
>> common feature I'm simply totally unaware of?  In practice I think this will 
>> be 1-2 extra instructions in every comparison.
>
> Intel's and AMD's x86-64 both do masked compares, at least in microcode, from 
> what I was told.  I'll have to see if it was @craig.topper or @andrew.w.kaylor 
> (perhaps someone else?) who said we'd observed similar behavior on Arm.

Okay, but this is a general-purpose feature being added to general-purpose 
targets.  Clang cannot directly emit AMD x86-64 microcode, and I doubt that an 
`and; and; cmp` instruction sequence gets fused into a masked compare in 
microcode.  Those masked comparisons are probably just used to implement 
8/16/32-bit comparisons, selected directly on encountering a compare 
instruction of that width.

>>> 3- Implementability on FPGAs: Since this was our motivating example, 
>>> forcing an FPGA to zero out these bits when dealing with an interaction 
>>> with a byte-aligned processor would have incredible performance overhead.
>>
>> How on earth does making the store unit zero/sign-extend have "incredible 
>> performance overhead"?  This is totally trivial technologically.  It's not 
>> like you're otherwise sending 17-bit stores out on the bus.
>
> My understanding is, it IS in fact as if you're putting them out on a bus: 
> extending these bits when pushing to our x86 core from our FPGA core ends up 
> requiring more gates, longer reads during transitions between processors, and 
> longer 'interconnects'.  My understanding is that the bus between the x86_64 
> cores and FPGA cores is of limited size, so having the FPGA pack them is 
> highly beneficial to throughput.  That said, my understanding of this is 
> limited to the ELI5 explanation of our former FPGA architect.

This still doesn't make any sense.  If you're transmitting a `_BitInt(17)` as 
exactly 17 bits on a dedicated FPGA<->x86 bus, then of course continue to do 
that.  The ABI rules govern the representation of values in the places that 
affect the interoperation of code, such as calling conventions and in-memory 
representations.  They do not cover bus protocols.

>> I'm not sure it's appropriate to think of this as primarily an FPGA feature 
>> when in fact it's being added to standard targets.
>
> I don't believe I _AM_ thinking of this primarily as an FPGA feature; it is a 
> motivating example for one of our architectures.  Just as standards shouldn't 
> unfairly punish 32-bit devices, they shouldn't punish FPGAs.

This entire discussion is about what the ABI rules should be for implementing 
this feature on general-purpose devices that don't directly support e.g. 
17-bit arithmetic.  Custom hardware that does support native 17-bit arithmetic 
obviously doesn't need to care about those parts of the ABI and is not being 
"punished".  At some point, 17-bit values will come from that specialized 
hardware and get exposed to general-purpose hardware by e.g. being written into 
a GPR; this is the first point at which the ABI even starts dreaming of being 
involved.  Now, it's still okay under a mandatory-extension ABI if that GPR has 
its upper bits undefined: you're in the exact same situation as you would be 
after an addition, where it's fine to turn around and use that in some other 
operation that doesn't care about the upper bits (like a multiplication), but 
if you want to use it in something that cares about those bits (like a 
comparison), you need to zero/sign-extend them away first.  The only difference 
between an ABI that leaves the upper bits undefined and one that mandates 
extension is that places which might expose the value outside of the function — 
like returning the value, passing the value as an argument, and writing the 
value into a pointer — have to be considered places that care about the upper 
bits; and then you get to rely on that before you do things like comparisons.

Again, I'm not trying to insist that a mandatory-extension ABI is the right way 
to go.  I just want to make sure that we've done a fair investigation into this 
trade-off.  Right now, my concern is that it sounds like that investigation 
invented a lot of extra constraints for mandatory-extension ABIs, like that 
somehow mandatory extension meant that you would need to transmit a bunch of 
zero bits between your FPGA and the main CPU.  I am not a hardware specialist, 
but I know enough to know that this doesn't check out.

>>> 4- Ease of implementation: Forcing LLVM to zero out these bits would either 
>>> mean we had to do quite a bit of work in our CodeGen to zero them out, or 
>>> modify most of the backends to not zero padding bits in these cases. Since 
>>> there isn't a particular performance benefit (see #2) we didn't think it 
>>> would be worthwhile.
>>
>> The obvious lowering would be for clang to use i17 as the scalar type 
>> lowering but i32 as the "in-memory" lowering, then make sure that the 
>> backends are reasonably intelligent about folding extend/trunc operations 
>> around operations that aren't sensitive / don't produce noise.
>
> I guess we could?  That sounds like more work for opt for what seems, to me, 
> to be little gain.

I have a lot of concerns about turning "whatever LLVM does when you pass an i17 
as an argument" into platform ABI.  My experience is that LLVM does a lot of 
things that you wouldn't expect when you push it outside of simple cases like 
power-of-two integers.  Different targets may even use different rules, because 
the IR specification doesn't define this stuff.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108643/new/

https://reviews.llvm.org/D108643
