fhahn added a comment.

In D99152#2655304 <https://reviews.llvm.org/D99152#2655304>, @lebedev.ri wrote:

> In D99152#2655274 <https://reviews.llvm.org/D99152#2655274>, @fhahn wrote:
>
>> In D99152#2649520 <https://reviews.llvm.org/D99152#2649520>, @LuoYuanke 
>> wrote:
>>
>>> In D99152#2647681 <https://reviews.llvm.org/D99152#2647681>, @fhahn wrote:
>>>
>>>> I can't see any `load <256 x i32>` in the linked example, just a store. 
>>>> Could you check the example?
>>>
>>> I create another example at https://gcc.godbolt.org/z/v6od5ceEz. In bar() 
>>> function, you can see the `load <256 x i32>*` in the IR. The bar() function 
>>> is actually buggy code, because tilec is not initialized by amx intrinsics. 
>>> We want user call amx intrinsic to load/store tile explicitly. Ideally 
>>> front-end can detect the issue and report error.
>>
>> Thanks AFAIK in those cases the conversion intrinsic makes sense to use, 
>> because you effectively need to convert between 2 types in a non-trivial 
>> way. @lebedev.ri WDYT?
>
> I'm not sure. I think first and foremost the `load`/`store` miscompile should 
> be addressed.
> I think the rest is confusing because it seems to me that the only reason why 
> that `bitcast`
> is needed is not correctness reason, but as an opaque optimization barrier.

I'm not sure if the loads and store are actually incorrect. `_tile1024i ` is 
defined as `typedef int _tile1024i __attribute__((__vector_size__(1024), 
__aligned__(64)));` and the loads/stores are for assignments/reads from 
variables that have that type, which is `<256 x i32>` in IR. So it is not 
obvious to me why the loads/stores would be wrong, as long as `_tile1024i` is 
defined as it is (if it would be a different type, that all changes).

As a consequence,  `__builtin_ia32_tilezero_internal` & the other builtins need 
to be defined as returning  `_tile1024i` / `<256 x i32>`. I don't think there's 
any other way to specify this, unless you have a dedicated AMX type in the 
frontend. IIUC the current lowering is to call an intrinsic that returns 
`x86_amx` and then a `bitcast` is used for the conversion to the result type 
`<256 x i32>`, with the (incorrect) assumption that the `bitcast` does complex 
conversion between types. Another consequence of the builtins returning 
`_tile1024i` / `<256 x i32>` is that the conversion from the intrinsic result 
to `<256 x i32>` should happen at the place where Clang emits the call to the 
intrinsic directly, not patched up later as it is done now.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99152/new/

https://reviews.llvm.org/D99152

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to