On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
<seg...@kernel.crashing.org> wrote:
>
> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
> > <seg...@kernel.crashing.org> wrote:
> > >
> > > Hi!
> > >
> > > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > > > Hmm, yeah - I guess that's what should be addressed first then.
> > > > I'm quite sure that in case 'v' is not on the stack but in memory like
> > > > in my case a SImode store is better than what we get from
> > > > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > > > which is prone to inserting store-data-races?
> > >
> > > The other way around -- if it is in memory, and was stored as vector
> > > recently, then reading back something shorter from it is prone to
> > > SHL/LHS problems.  There is nothing special about the stack here, except
> > > of course it is more likely to have been stored recently if on the
> > > stack.  So it depends how often it has been stored recently which option
> > > is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> > > often, the penalty is relatively bigger, so memory does not often win.
> > >
> > > I.e.: it needs to be measured.  Intuition is often wrong here.
> >
> > But the current method would simply do a direct store to memory
> > without a preceding read of the whole vector.
>
> The problem is even worse the other way: you do a short store here, but
> do a full vector read later.  If the store and read are far apart, that
> is fine, but if they are close (that is on the order of fifty or more
> insns), there can be problems.

Sure, but you can't simply load/store a whole vector when the code
didn't, unless you know it will not introduce data races and it will
not trap (thus the whole vector needs to be at least naturally aligned).

Also, if there's a preceding short store you will now load the whole
vector to avoid the short store ... catch-22.

> There often are problems over function calls (where the compiler cannot
> usually *see* how something is used).

Yep.  The best way would be to use small loads and larger stores,
which is what CPUs usually tend to handle fine (with alignment
constraints, etc.).  Of course that's not what either of the
"solutions" can do.

That said, since you seem to be "first" in having an instruction
to insert into a vector at a variable position, the idea that we'd
have to spill anyway for this to be expanded, and thus that we expand
the vector to a stack location in the first place, falls down.  And
that's where I'd first try to improve things.

So what can the CPU actually do?

Richard.

>
>
> Segher
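
For readers following the thread, the two strategies being compared can
be sketched in plain C.  This is only an illustration under stated
assumptions, not what GCC actually expands to: the type vec4i and the
functions insert_narrow and insert_rmw are hypothetical names invented
here, and a real compiler is free to turn either function into different
machine code.

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 4 };

/* Hypothetical 16-byte vector type, used only for this illustration. */
typedef struct { int32_t lane[LANES]; } vec4i;

/* Narrow (SImode-like) store: touches only the 4 bytes of one lane.
   A full-vector load shortly afterwards may hit the store-forwarding
   problems discussed in the thread. */
static void insert_narrow(vec4i *v, unsigned idx, int32_t x)
{
    v->lane[idx % LANES] = x;
}

/* Whole-vector read-modify-write: reads all 16 bytes, replaces one
   lane, and writes all 16 bytes back.  It also rewrites lanes that the
   source code never stored to, which is where a store data race can be
   introduced if another thread updates one of those lanes meanwhile. */
static void insert_rmw(vec4i *v, unsigned idx, int32_t x)
{
    vec4i tmp;
    memcpy(&tmp, v, sizeof tmp);   /* full-width load  */
    tmp.lane[idx % LANES] = x;     /* modify one lane  */
    memcpy(v, &tmp, sizeof tmp);   /* full-width store */
}
```

Both functions leave the same value in memory when run single-threaded;
they differ only in which bytes are read and written, and that is the
point the discussion turns on.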