On 2020/9/10 18:08, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
> <seg...@kernel.crashing.org> wrote:
>>
>> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
>>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
>>> <seg...@kernel.crashing.org> wrote:
>>>>
>>>> Hi!
>>>>
>>>> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
>>>>> Hmm, yeah - I guess that's what should be addressed first then.
>>>>> I'm quite sure that in case 'v' is not on the stack but in memory like
>>>>> in my case a SImode store is better than what we get from
>>>>> vec_insert - in fact vec_insert will likely introduce a RMW cycle
>>>>> which is prone to inserting store-data-races?
>>>>
>>>> The other way around -- if it is in memory, and was stored as vector
>>>> recently, then reading back something shorter from it is prone to
>>>> SHL/LHS problems. There is nothing special about the stack here, except
>>>> of course it is more likely to have been stored recently if on the
>>>> stack. So it depends how often it has been stored recently which option
>>>> is best. On newer CPUs, although they can avoid SHL/LHS flushes more
>>>> often, the penalty is relatively bigger, so memory does not often win.
>>>>
>>>> I.e.: it needs to be measured. Intuition is often wrong here.
>>>
>>> But the current method would simply do a direct store to memory
>>> without a preceding read of the whole vector.
>>
>> The problem is even worse the other way: you do a short store here, but
>> so a full vector read later. If the store and read are far apart, that
>> is fine, but if they are close (that is on the order of fifty or more
>> insns), there can be problems.
>
> Sure, but you can't simply load/store a whole vector when the code didn't
> unless you know it will not introduce data races and it will not trap (thus
> the whole vector needs to be at least naturally aligned).
>
> Also if there's a preceding short store you will now load the whole vector
> to avoid the short store ... catch-22
>
>> There often are problems over function calls (where the compiler cannot
>> usually *see* how something is used).
>
> Yep. The best way would be to use small loads and larger stores
> which is what CPUs usually tend to handle fine (with alignment
> constraints, etc.). Of course that's not what either of the "solutions"
> can do.
>
> That said, since you seem to be "first" in having an instruction
> to insert into a vector at a variable position the idea that we'd
> have to spill anyway for this to be expanded and thus we expand
> the vector to a stack location in the first place falls down. And
> that's where I'd first try to improve things.
>
> So what can the CPU actually do?
>
Not sure whether this reflects the issues you discussed above.
I constructed the test cases below and measured with and without this patch.
Only for the "a+c" combination (which means store only) does performance get
worse with this patch, due to the extra load/store (about 50% slower).
For the "v = vec_insert (i, v, n);" usage, v is always loaded after the store,
so this patch always gives a big benefit.
But for the "v[n % 4] = i;" usage, it depends on whether "v" is used soon
afterwards, either inside the function or outside it. Does this mean that
unifying the two usages into the same GIMPLE code is sometimes not a good
idea? Or is it possible to detect whether the destination "v" of the
generated IFN ".VEC_INSERT (&v, i_4(D), _1);" is used not far away, inside
or outside of the function?
#define TYPE int

vector TYPE v = {1, 2, 3, 4}; // a. global vector.
vector TYPE s = {4, 3, 2, 1};

__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
  v[n % 4] = i;
  return u;
}

int
main ()
{
  //vector TYPE v = {1, 2, 3, 4}; // b. local vector.
  //vector TYPE s = {4, 3, 2, 1};
  vector TYPE r = {0};
  for (long k = 0; k < 3989478945; k++)
    {
      r += test (s, 254.0f, k); // c. access different vector.  store only.
      //r += test (v, 254.0f, k); // d. access same vector.  load after
				  // store in callee function.
      //test (s, 254.0f, k); r += v; // e. load after store out of function.
    }
  return r[0];
}
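For context, the two lowering strategies the thread is comparing can be
sketched in portable C (the names vec4i, insert_direct and insert_rmw are
illustrative only, not GCC internals or target intrinsics):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t e[4]; } vec4i;

/* Strategy 1: direct element store, as in "v[n % 4] = i;".
   Touches only 4 bytes of memory; cheap when the full vector is not
   reloaded soon afterwards, but a later full-vector load of v can hit
   a store-hit-load hazard against this narrow store.  */
static void
insert_direct (vec4i *v, int32_t i, size_t n)
{
  v->e[n % 4] = i;
}

/* Strategy 2: read-modify-write of the whole vector, modelling what a
   variable-position vector-insert instruction does once the vector is
   in a register.  It writes back all 16 bytes, including lanes the
   source never stored to -- which is why it can introduce a store data
   race on those untouched lanes.  */
static void
insert_rmw (vec4i *v, int32_t i, size_t n)
{
  vec4i tmp = *v;       /* full-vector load */
  tmp.e[n % 4] = i;     /* lane update "in register" */
  *v = tmp;             /* full-vector store */
}
```

Both produce the same final vector contents in single-threaded code; the
difference is purely in which bytes of memory get written and when, which
is what drives the store-hit-load and data-race trade-offs above.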
Thanks,
Xionghu