I actually wrote a patch a while back (apparently Feb 20) that fixed the load-squash issue. I kind of abandoned it, but it was able to run a few benchmarks (I never ran the regression tests on it). I'll revive it and see if it passes the regression tests.

All it did was force the load to be repeatedly replayed until it was no longer blocked, rather than squashing the entire pipeline. I remember incrWb() and decrWb() were the most annoying part of writing it. As a side note, I've found that in general, increasing tgts_per_mshr to something unlikely to be hit largely eliminates the issue (this is why I abandoned the patch). You still limit the number of outstanding cache lines via the number of MSHRs, but you don't squash just because a bunch of loads all accessed the same line. This is probably a good temporary solution.
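Concretely, that workaround is a one-parameter change in the cache configuration. The sketch below uses gem5's classic-cache BaseCache Python parameters (mshrs, tgts_per_mshr); the class name and values are illustrative only, and other required cache parameters are omitted for brevity:

    # Sketch of the tgts_per_mshr workaround described above.  Parameter names
    # follow gem5's classic-cache BaseCache object; values are illustrative.
    from m5.objects import BaseCache

    class L1DCache(BaseCache):
        size = '32kB'
        assoc = 2
        mshrs = 4              # still bounds the number of distinct outstanding lines
        tgts_per_mshr = 64     # high enough that same-line loads rarely exhaust the targets

With a large target count, the MSHR count alone limits the number of outstanding misses, which is the behavior described above.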
On Tue, May 13, 2014 at 3:09 AM, Vamsi Krishna via gem5-users <gem5-users@gem5.org> wrote:

> Hello All,
>
> As Paul was mentioning, I tried to come up with a small analysis of how the number of writeback buffers affects the performance of the PARSEC benchmarks when it is increased to 5x the default size. I found that the 2-wide processor improved by 22%, the 4-wide processor by 7%, and the 8-wide processor by 0.6% in performance on average. This is mainly because of the increased effective issue width due to the increased availability of buffers. If modeled correctly, only the effective writeback width should be affected, not the effective issue width. Long-latency instructions like load misses will reduce the issue width until the load completes. Processors with smaller widths seem to suffer significantly because of this.
>
> Regarding the issue where multiple accesses to the same block cause pipeline flushes, I posted this question earlier (http://comments.gmane.org/gmane.comp.emulators.m5.users/16657), but unfortunately the thread did not proceed further. It has a huge impact on performance: up to 40% in the PARSEC benchmarks on the 8-wide processor, 29% on the 4-wide processor, and 13% on the 2-wide processor on average. It would be great to have a fix for this in gem5 because it causes unusually high flushing activity in the pipeline and affects speculation.
>
> Thanks,
> Vamsi Krishna
>
> On Mon, May 12, 2014 at 9:39 PM, Steve Reinhardt via gem5-users <gem5-users@gem5.org> wrote:
>
>> Paul,
>>
>> Are you talking about the issue where multiple accesses to the same block cause Ruby to tell the core to retry, which in turn causes a pipeline flush? We've seen that too and have a patch that we've been intending to post... this discussion (and the earlier one about store prefetching) has inspired me to try to get that process started again.
>>
>> Thanks for speaking up. I'd much rather have people point out problems, or better yet post patches for them, than stockpile them for a WDDD paper ;-).
>>
>> Steve
>>
>> On Mon, May 12, 2014 at 7:07 PM, Paul V. Gratz via gem5-users <gem5-users@gem5.org> wrote:
>>
>>> Hi All,
>>> Agreed, thanks for confirming we were not missing something. As some follow-up, my student has some data on this, which he'll post here shortly, showing the performance impact he sees for this issue; it is quite large for 2-wide OOO. I was thinking it might be something along those lines (or something about the bypass network width), but it seems like grabbing the buffers at issue time is probably too conservative (as opposed to grabbing them at completion and stalling the functional unit if you can't get one).
>>>
>>> I believe Karu Sankaralingham at Wisc also found this and a few other issues; they have a related paper at WDDD this year.
>>> We also found a problem where multiple outstanding loads to the same address cause heavy flushing in O3 with Ruby, which has a similarly large performance impact; we'll start another thread on that shortly.
>>> Thanks!
>>> Paul
>>>
>>> On Mon, May 12, 2014 at 3:51 PM, Mitch Hayenga via gem5-users <gem5-users@gem5.org> wrote:
>>>
>>>> *"Realistically, to me, it seems like those buffers would be distributed among the function units anyway, not a global resource, so having a global limit doesn't make a lot of sense. Does anyone else out there agree or disagree?"*
>>>>
>>>> I believe that's more or less correct, with wbWidth probably meant to be the number of write ports on the register file and wbDepth the number of pipe stages for a multi-cycle writeback.
>>>>
>>>> I don't fully agree that it should be distributed at the functional-unit level, as you could imagine designs with a higher issue width and more functional units than register-file write ports, essentially allowing more instructions to be issued on a given cycle as long as they did not all complete on the same cycle.
>>>>
>>>> Going back to Paul's issue (loads holding writeback slots on misses): the "proper" way to do it would probably be to reserve a slot assuming an L1 cache hit latency, give up the slot on a miss, and have an early signal that a load miss is coming back from the cache so that you could reserve a writeback slot in parallel with doing all the other necessary work for a load (CAMing against the store queue, etc.). But this would likely be annoying to implement.
>>>>
>>>> *In general though, yes, this seems like something not worth modeling in gem5, as the potential negative impacts of its current implementation outweigh the benefits, and the benefits of fully modeling it are likely small.*
>>>>
>>>> On Mon, May 12, 2014 at 2:08 PM, Arthur Perais via gem5-users <gem5-users@gem5.org> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have no specific knowledge of what the buffers are modeling or what they should be modeling, but I too encountered this issue some time ago. Setting a high wbDepth is what I do to work around it (actually, 3 is sufficient for me), because performance does suffer quite a lot in some cases (and even more for narrow-issue cores if wbWidth == issueWidth, I would expect).
>>>>>
>>>>> On 12/05/2014 19:39, Steve Reinhardt via gem5-users wrote:
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> I assume you're talking about the 'wbMax' variable? I don't recall it specifically myself, but after looking at the code a bit, the best I can come up with is that there's assumed to be a finite number of buffers somewhere that hold results from the function units before they write back to the reg file. Realistically, to me, it seems like those buffers would be distributed among the function units anyway, not a global resource, so having a global limit doesn't make a lot of sense. Does anyone else out there agree or disagree?
>>>>>
>>>>> It doesn't seem to relate to any structure that's directly modeled in the code, i.e., I think you could rip the whole thing out (incrWb(), decrWb(), wbOutstanding, wbMax) without breaking anything in the model... which would be a good thing if in fact everyone else is either suffering unaware or just working around it by setting a large value for wbDepth.
>>>>>
>>>>> That said, we've done some internal performance correlation work, and I don't recall this being an issue, for whatever that's worth. I know ARM has done some correlation work too; have you run into this?
>>>>>
>>>>> Steve
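A rough, standalone illustration of the accounting being discussed here, as a simplification rather than the gem5 source: it assumes, as this thread implies, a single counter capped at wbWidth * wbDepth that incrWb() bumps at issue and decrWb() releases at writeback, so instructions that hold entries across a long-latency miss can stall issue:

    # Simplified, standalone illustration of the writeback-buffer accounting
    # discussed above (not the gem5 source).  Assumes the cap is wbWidth * wbDepth
    # and that the counter moves at issue and at writeback, as the thread implies.
    class WritebackBuffers:
        def __init__(self, wb_width, wb_depth):
            self.wb_max = wb_width * wb_depth  # global cap on in-flight results
            self.outstanding = 0               # roughly "wbOutstanding"

        def can_issue(self):
            # Issue stalls once every buffer is held, even if the holders are
            # long-latency loads that will not write back for many cycles.
            return self.outstanding < self.wb_max

        def incr_wb(self):   # corresponds to the incrWb() call made at issue
            assert self.outstanding < self.wb_max
            self.outstanding += 1

        def decr_wb(self):   # corresponds to the decrWb() call made at writeback
            assert self.outstanding > 0
            self.outstanding -= 1

Seen this way, raising wbDepth (Arthur's workaround) simply raises the cap without changing the issue-time reservation policy that Paul is questioning.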
>>>>> On Fri, May 9, 2014 at 7:45 AM, Paul V. Gratz via gem5-users <gem5-users@gem5.org> wrote:
>>>>>
>>>>>> Hi All,
>>>>>> Doing some digging on performance issues in the O3 model, we and others have run into the allocation of the writeback buffer having a big performance impact. Basically, a writeback buffer is grabbed at issue time and held until completion. With the default assumptions about the number of available writeback buffers (x * issue width, where x is 1 by default), the buffers often end up bottlenecking the effective issue width (particularly when long-latency loads grab up all the WB buffers). What are these structures trying to model? I can see limiting the number of instructions allowed to complete and write back/bypass in a cycle, but this ends up being much more conservative than that, if that is the intent. If not, why does it do this? We can easily make the number of WB buffers high, but we want to understand what is going on here first...
>>>>>> Thanks!
>>>>>> Paul
>>>>>>
>>>>>> --
>>>>>> -----------------------------------------
>>>>>> Paul V. Gratz
>>>>>> Assistant Professor
>>>>>> ECE Dept, Texas A&M University
>>>>>> Office: 333M WERC
>>>>>> Phone: 979-488-4551
>>>>>> http://cesg.tamu.edu/faculty/paul-gratz/
>>>>>
>>>>> --
>>>>> Arthur Perais
>>>>> INRIA Bretagne Atlantique
>>>>> Bâtiment 12E, Bureau E303, Campus de Beaulieu
>>>>> 35042 Rennes, France
>
> --
> Regards,
> Vamsi Krishna
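For completeness, the other interim workaround discussed in this thread (Arthur's and Vamsi's) is also just a config change: give the O3 model more writeback buffers so they stop throttling effective issue width. The sketch below assumes a DerivO3CPU-based config; the parameter names are the ones debated above, and the values are illustrative only:

    # Sketch of the wbDepth/wbWidth workaround discussed in this thread.
    # Parameter names follow the DerivO3CPU Python object; values are illustrative.
    from m5.objects import DerivO3CPU

    cpu = DerivO3CPU()
    cpu.issueWidth = 4
    cpu.wbWidth = 4        # results written back (and bypassed) per cycle
    cpu.wbDepth = 4        # buffers scale with wbWidth * wbDepth; Arthur reports 3
                           # is enough for him, and Vamsi's runs used 5x the default

Neither change addresses the same-block flush issue by itself; that is what the tgts_per_mshr sketch near the top of this message targets.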