Hi Arthur,

I have been working with the source code of O3CPU for a few months, but I
had never touched the part where memory operations are handled. After your
message, I got curious about how superscalar processors implement these
operations and did some digging in the source code and textbooks.

Here's an excerpt from the textbook "Modern Processor Design - Fundamentals
of Superscalar Processors" by Lipasti & Shen, Beta Edition, 2003 (p. 186):

"For a store instruction, instead of updating the memory at completion, it
is possible to move the data to the store buffer at completion. The store
buffer is a FIFO buffer that buffers architecturally completed store
instructions. Each of these store instructions is then retired, i.e.,
updates the memory, when the memory bus is available. *The purpose of the
store buffer is to allow stores to be retired when the memory bus is not
busy, thus giving priority to loads that need to access the memory bus*. We
use the term *completion* to refer to the updating of the CPU state and the
term *retiring* to refer to the updating of the memory state. With the
store buffer, an instruction can be architecturally complete but not yet
retired to memory."
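
To make the completion-vs-retiring split concrete, here is a toy sketch of such a store buffer (hypothetical names and structure, nothing to do with the actual gem5 classes): a store "completes" by entering the FIFO, and only "retires" (updates memory) in a cycle where no load needs the bus.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <utility>

// Toy model of the textbook's store buffer (hypothetical, not gem5 code).
struct StoreBuffer {
    std::queue<std::pair<uint32_t, uint8_t>> fifo; // (addr, data) of completed stores

    // Completion: the CPU state is done with the store; memory is not.
    void complete(uint32_t addr, uint8_t data) { fifo.push({addr, data}); }

    // Retiring: update memory with at most one store, but only when the
    // memory bus is not claimed by a load this cycle (loads get priority).
    bool retire(uint8_t *memory, bool busBusyWithLoad) {
        if (busBusyWithLoad || fifo.empty())
            return false;
        auto [addr, data] = fifo.front();
        fifo.pop();
        memory[addr] = data; // memory state is updated only now
        return true;
    }
};
```

The key property is visible in the model: a store can sit in the buffer, architecturally complete, for as many cycles as loads keep the bus busy.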

I'm posting this because it looks exactly like the model that is currently
implemented in gem5. I noticed that both loads and stores are executed by
the LSQ, by calling ldstQueue->executeLoad(inst) and
ldstQueue->executeStore(inst) (lines 1244 and 1260 in iew_impl.hh), but
they work a bit differently. Loads 'go to memory' first, at execution, and
therefore increment used cache ports. Stores are actually 'sent to memory'
not at execution, but at writeback, which happens when
ldstQueue->writebackStores() is called (iew_impl.hh:1502). Notice the
comment: "Writeback any stores using any leftover bandwidth", i.e., the
remaining available cache ports.
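
As a rough sketch of that accounting (hypothetical names; the real logic lives in lsq_unit_impl.hh and iew_impl.hh): loads bump the shared counter at execute without checking it, being throttled only by the load FUs, while stores at writeback consume whatever bandwidth is left.

```cpp
#include <cassert>

// Toy model of the shared cache-port counter (hypothetical, not the
// actual gem5 classes).
struct CachePortCounter {
    int cachePorts;     // configured D-cache ports per cycle
    int usedPorts = 0;  // shared counter, reset every cycle

    void newCycle() { usedPorts = 0; }

    // A load going to memory at execute: it increments the counter but
    // never checks it -- exactly the asymmetry discussed in this thread.
    void executeLoad() { ++usedPorts; }

    // Store writeback: send pending stores only while ports remain,
    // i.e., "using any leftover bandwidth". Returns how many were sent.
    int writebackStores(int pendingStores) {
        int sent = 0;
        while (pendingStores > 0 && usedPorts < cachePorts) {
            --pendingStores;
            ++usedPorts;
            ++sent;
        }
        return sent;
    }
};
```

This reproduces Arthur's observation below: with cachePorts = 1 and one executed load, no store can ever write back that cycle.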

The only thing missing in the current implementation, as I understand it,
is a check for loadFUs >= cachePorts. Since loads are executed first, there
is otherwise no need for them to test for available cache ports. So, as far
as I can tell, the arbitration mechanism you mentioned is already
implemented.

Your patch is correct, but it does change the model a little bit.
Previously, cache ports could be used by both loads and stores (with
preference for loads) and now we have dedicated ports for loads (= the
amount of load FUs) and dedicated ports for stores (= cacheStorePorts). I'm
not sure whether modern superscalar processors implement dedicated or
shared cache ports (maybe someone working more closely with caches would
know that?).
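
For what it's worth, here is how I picture the two alternatives side by side (again a purely hypothetical sketch, not gem5 code): the old shared model serves loads first out of one pool, while the dedicated model caps loads and stores independently.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical sketches of the two port models discussed above.

// Shared: one pool of cachePorts, with preference for loads; stores
// only get whatever the loads left over.
struct SharedPorts {
    int cachePorts;
    // Returns {loadsDone, storesDone} for one cycle.
    std::pair<int, int> arbitrate(int loads, int stores) const {
        int l = std::min(loads, cachePorts);
        int s = std::min(stores, cachePorts - l); // leftover only
        return {l, s};
    }
};

// Dedicated: loads capped by the number of load FUs, stores by
// cacheStorePorts, with no interaction between the two.
struct DedicatedPorts {
    int loadFUs;
    int cacheStorePorts;
    std::pair<int, int> arbitrate(int loads, int stores) const {
        return {std::min(loads, loadFUs),
                std::min(stores, cacheStorePorts)};
    }
};
```

The difference shows up exactly when loads saturate the shared pool: in the shared model the stores starve, in the dedicated model they are unaffected.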

Regards,

On Tue, Apr 26, 2016 at 8:32 AM, Arthur Perais <[email protected]>
wrote:

> Alright, I was also waiting for someone else to comment, but I will try to
> submit a patch this week.
>
> Best,
>
> Arthur.
>
>
> Le 25/04/2016 17:52, Andreas Hansson a écrit :
>
> Hi Arthur,
>
> I just wanted to re-iterate that the solution you suggest sounds good.
> Could you also make sure that the comments (and possibly variable names)
> are updated to reflect the change?
>
> Thanks,
>
> Andreas
>
> From: gem5-users <[email protected]> on behalf of Andreas
> Hansson <[email protected]>
> Reply-To: gem5 users mailing list <[email protected]>
> Date: Saturday, 23 April 2016 at 12:18
> To: gem5 users mailing list <[email protected]>
> Subject: Re: [gem5-users] o3cpu: cache ports
>
> Hi Arthur,
>
> I agree with your observations, but it would be good if someone more
> familiar with the o3 model could chime in.
>
> Andreas
>
> From: gem5-users <[email protected]> on behalf of Arthur Perais
> <[email protected]>
> Reply-To: gem5 users mailing list <[email protected]>
> Date: Tuesday, 19 April 2016 at 10:41
> To: "[email protected]" <[email protected]>
> Subject: [gem5-users] o3cpu: cache ports
>
> Hi all,
>
> In the O3 LSQ there is a variable called "cachePorts" which controls the
> number of stores that can be made each cycle (see lines 790-795 in
> lsq_unit_impl.hh).
> cachePorts defaults to 200 (see O3CPU.py), so in practice, there is no
> limit on the number of stores that are written back to the D-Cache, and
> everything works out fine.
>
> Now, silly me wanted to be a bit more realistic and set cachePorts to one,
> so that I could issue one store per cycle to the D-Cache only.
> In a few SPEC programs, this caused the SQFullEvent stat to get very
> high, which I assumed was reasonable because, well, fewer stores per cycle.
> However, after looking into it, I found that the variable "usedPorts"
> (which allows stores to WB only while it is less than "cachePorts") is
> increased by stores when they WB (which is fine), but also by *loads* when
> they access the D-Cache (see lines 768 and 814 in lsq_unit.hh). However,
> the number of loads that can access the D-Cache each cycle is controlled
> by the number of load functional units, and not at all by "cachePorts".
>
> This means that if I set cachePorts to 1, and I have 2 load FUs, I can do
> 2 loads per cycle, but as soon as I do one load, then I cannot writeback
> any store this cycle (because "usedPorts" will already be 1 or 2 when gem5
> enters writebackStores() in lsq_unit_impl.hh). On the other hand, if I set
> cachePorts to 3 I can do 2 loads and one store per cycle, but I can also WB
> three stores in a single cycle, which is not what I wanted to be able to do.
>
> This should be addressed by not increasing "usedPorts" when loads access
> the D-Cache and being explicit about what variable constrains what (i.e.,
> loads are constrained by load FUs and stores by "cachePorts"), or by also
> constraining loads on "cachePorts" (which would be hard, since arbitration
> would potentially be needed between loads and stores, and since store WBs
> happen after load accesses in gem5, this can get messy). As of now, this is
> a bit of both, and performance looks fine at first, but it's really not.
>
> I can write a small patch for the first solution (don't increase
> "usedPorts" on load accesses), but I am not sure this corresponds to the
> philosophy of the code. What do you think would be the best course of
> action?
>
> Best,
>
> Arthur.
>
> --
> Arthur Perais
> INRIA Bretagne Atlantique
> Bâtiment 12E, Bureau E303, Campus de Beaulieu
> 35042 Rennes, France
>
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium. Thank you.
>
> _______________________________________________
> gem5-users mailing list
> [email protected]
> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>
>
>
> --
> Arthur Perais
> INRIA Bretagne Atlantique
> Bâtiment 12E, Bureau E303, Campus de Beaulieu
> 35042 Rennes, France
>
>
>



-- 
Marcelo Brandalero
PhD student
Programa de Pós Graduação em Computação
Universidade Federal do Rio Grande do Sul
