Hi Segher, on 2021/9/23 上午6:36, Segher Boessenkool wrote: > Hi! > > On Tue, Sep 21, 2021 at 11:24:08AM +0800, Kewen.Lin wrote: >> on 2021/9/18 上午6:01, Segher Boessenkool wrote: >>> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote: >>>> The way with nunits * stmt_cost can get one much exaggerated >>>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, >>>> that's why we need one bound. To make it scale, this patch >>>> doesn't use nunits * stmt_cost any more, but it still keeps >>>> nunits since there are actually nunits scalar loads there. So >>>> it uses one cost adjusted from stmt_cost, since the current >>>> stmt_cost sort of considers nunits, we can stablize the cost >>>> for big nunits and retain the cost for small nunits. After >>>> some tries, this patch gets the adjusted cost as: >>>> >>>> stmt_cost / (log2(nunits) * log2(nunits)) >>> >>> So for V16QI it gives *16/(4*4) so *1 >>> V8HI it gives *8/(3*3) so *8/9 >>> V4SI it gives *4/(2*2) so *1 >>> V2DI it gives *2/(1*1) so *2 >>> and for V1TI it gives *1/(0*0) which is UB (no, does not crash for us, >>> just gives wildly wrong answers; the div returns 0 on recent systems). >> >> I don't expected we will have V1TI for strided/elementwise load, >> if it's one unit vector, it's the whole vector itself. >> Besides, the below assertion should exclude it already. > > Yes. But ignoring the UB for unexpectedly large vector components, the > 1 / 1.111 / 1 / 2 scoring does not make much sense. The formulas > "look" smooth and even sort of reasonable, but as soon as you look at > what it *means*, and realise the domain if the function is discrete > (only four or five possible inputs), and then see how the function > behaves on that... Hrm :-) > >>> This of course is assuming nunits will always be a power of 2, but I'm >>> sure that we have many other places in the compiler assuming that >>> already, so that is fine. And if one day this stops being true we will >>> get a nice ICE, pretty much the best we could hope for. >> >> Yeah, exact_log2 returns -1 for non power of 2 input, for example: > > Exactly. > >>>> + unsigned int adjusted_cost = stmt_cost / nunits_sq; >>> >>> But this can divide by 0. Or are we somehow guaranteed that nunits >>> will never be 1? Yes the log2 check above, sure, but that ICEs if this >>> is violated; is there anything that actually guarantees it is true? >> >> As I mentioned above, I don't expect we can have nunits 1 strided/ew load, >> and the ICE should check this and ensure dividing by zero never happens. :) > > Can you assert that *directly* then please? >
Fix in v2. >>> A magic crazy formula like this is no good. If you want to make the >>> cost of everything but V2D* be the same, and that of V2D* be twice that, >>> that is a weird heuristic, but we can live with that perhaps. But that >>> beats completely unexplained (and unexplainable) magic! >>> >>> Sorry. >> >> That's all right, thanks for the comments! let's improve it. :) > > I like that spirit :-) > >> How about just assigning 2 for V2DI and 1 for the others for the >> penalized_cost_per_load with some detailed commentary, it should have >> the same effect with this "magic crazy formula", but I guess it can >> be more clear. > > That is fine yes! (Well, V2DF the same I guess? Or you'll need very > detailed commentary :-) ) > > It is fine to say "this is just a heuristic without much supporting > theory" in places. That is what most of our --param= are as well, for > example. If counting two-element vectors as twice as expensive as all > other vectors helps performance, then so be it: if there is no better > way to cost things (or we do not know one), then what else are we to do? > > Thanks a lot for the suggestion, I just posted v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html BR, Kewen