Re: [gem5-users] A Patch for DRAMsim2 Integration

Gabriel Michael Black Wed, 02 May 2012 14:35:00 -0700

Yes, thanks for your perseverance. I've been meaning to reply but I haven't found the time to look at your email carefully. I'll try to do that soon.


Gabe


Quoting Ali Saidi <sa...@umich.edu>:



Hi Andrew,

Thanks for digging into this. I think there is an issue
somewhere, but I'm still not sure where.

Ali

On 01.05.2012 23:34, Andrew Cebulski wrote:

Okay, I'm positive now that the issue lies with delayed translations
that are squashed before finishing.

On the data or instruction side? You seem to allude to data in the
paragraph below, but then instructions in the latter text.

It seems to me like speculative loads/stores are being executed, rather
than waiting for the instructions to commit. Once the instructions begin
getting (speculatively) executed in the TLB, a reference is left there,
which seems hard to root out and dereference after the instruction ends
up being squashed. At least, I have not been able to find that in the
source code as of yet. Can anyone clarify this?

There should only be one translation outstanding from each instruction
and data side walker. Any nested transactions should be queued in the
walker. Until one finishes, I'm not sure how multiple would ever be
outstanding.
R

ses

linearly for varying periods of time:

http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[1]

After enabling the TLB debug flag, I see that the linear increase in
instructions in flight is proportional to the number of TLB misses.
These TLB misses have a much larger delay (resulting in translation
delays) because DramSim2 models the memory system more accurately. It
seems that with the classic memory system, TLB misses often do not have
translation delays. For whatever reason, it would also seem that every
instruction that has a TLB miss is also eventually squashed...


From a data side perspective this is reasonable. While

a miss is outstanding at
structions will stop committing and thus the
instructions in flight will begin to rise until the miss is satisfied.


Here's a summary of outputs from my trace. These two DPRINTF messages
appear on the rising slopes (repeated up until the peak):
TLB Miss

This is interesting/odd. I don't know a good reason why (1) a miss would
be outstanding to both address 0 and address 4 at the same time. In
almost all cases these pages are marked as no-access to detect
segfaults. Perhaps there is an issue where the
g into a loop faulting on
a bad access and then faulting again on the fault handler. I could
imagine this would happen if there was some corruption in the memory
system (for example the timings in dramsim exposing a bug in the cache
models or something).

At the peak, the following message appears (from fetch) almost every
tick for (what I believe to be) every single one of the table walkers
that were squashed:

Fetch is waiting ITLB walk to finish!

There must be another walk in flight? The instruction side
will only have one fault outstanding at once. Successive branch
mispredicts will re-direct

ht thing."

The problem is that

these ITLB table walks are for instructions that were squashed as
much
on cycles earlier, and since been removed from the CPU's
instruction list.

I'm not following here.

Any help will be greatly
appreciated in solving this problem. I've hit a roadblock with getting
Ruby working with ARM, most likely due to the fact that ARM has disjoint
m

r. I brought this up in my last email about trying to get Ruby

working. Therefore, I'm trying to get this DramSim2 integration fixed so
I can start modeling FS with DRAM memory.

Brad/Steve/Nilay anyone have
a suggestion on how to make this work?

Note that these problems also
occur in Soplex from the Spec CP

en't tested on other benchmarks.

Thanks,

Andrew

On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <af...@drexel.edu [2]> wrote:

Hey Gabe,
Thanks for this...very helpful. I just recently got back into debugging
this problem. I made a
small
c/base/refcnt.hh to allow me to return the current count of
references to a DynInst object.
 I then modified existing DPRINTFs to
also print out reference counts, then added some of my own when I needed
extra

What's happening is that it progresses as far as getting executed in the
IEW once, but a delayed translation occurs, deferring the store. By the
time it reenters the IEW, the IQ has marked the instruction as squashed.
Everything progresses as usual from here on out, with one exception.
When the instruction is removed from the CPU's instruction list, there
is one reference count hanging.

I've added in some additional debugging for my traces to help narrow
down where this reference is coming from. As far as I can tell, it's
because of a call to initiateAcc() within the executeStore function in
the lsq unit. Please see the following two traces. The first trace shows
what I just discussed. The second trace is another memory store
instruction that got squashed; however, it was squashed upon its first
entry into the IEW, so it never started execution.

http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out [21]

http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out [22]

Let me know if you have any ideas based on these two instruction traces.
I do not understand how the initiateAcc function results in another
reference, but maybe someone else does.... Since I don't see how it
makes a reference, it's hard to find out how to make sure it gets
dereferenced...
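To make the suspected mechanism concrete: any object that copies a
reference-counted pointer and outlives the squash keeps the instruction
alive after the CPU drops it from its list. Below is a minimal Python
sketch of that general pattern only. The DynInst and PendingTranslation
classes are hypothetical stand-ins, not gem5 code, and the sketch does
not claim that initiateAcc() behaves this way.

```python
import gc
import weakref


class DynInst:
    """Hypothetical stand-in for a dynamic instruction."""

    def __init__(self, seq_num):
        self.seq_num = seq_num
        self.squashed = False


class PendingTranslation:
    """Hypothetical delayed translation that holds on to its instruction."""

    def __init__(self, inst):
        self.inst = inst  # strong reference, comparable to a DynInstPtr copy

    def finish(self):
        self.inst = None  # dropping the reference releases the instruction


inst_list = [DynInst(42)]
probe = weakref.ref(inst_list[0])  # observes liveness without owning
walker_queue = [PendingTranslation(inst_list[0])]

inst_list[0].squashed = True
inst_list.clear()  # the CPU removes the squashed instruction from its list
gc.collect()
alive_after_removal = probe() is not None  # the translation still holds it

walker_queue[0].finish()
gc.collect()
alive_after_finish = probe() is not None  # freed once the holder lets go

print(alive_after_removal, alive_after_finish)
```

If the holder never reaches its equivalent of finish() for a squashed
instruction, the object stays allocated, which would match a count that
is stuck at one.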

Unfortunately, I haven't been able to add a DPRINTF in
src/base/refcnt.hh ...this would make things more clear (i.e. exactly
when references/dereferences occur). Let me know if you have any advice
on this...if it's possible. I can't seem to get the right include files,
and likely the right SConscript compile order...

Thanks,
Andrew

On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <gbl...@eecs.umich.edu [23]> wrote:

Without digging into things too deeply, it looks like you may be leaking
references to dynamic instructions. The CPU may think it's done with
one, but until that final reference is removed, the object will hang
around forever. I think I've had problems before where the reference
count ended up off by one somehow and instructions would start piling
up. It's also possible that a clog develops in O3's pipeline and some
internal structure stops letting instructions through and starts
accumulating them. Either of these problems will be annoying to track
down, but with enough digging I've been able to fix these sorts of
things.


This may have more to do with O3 not handling the benchmark you're
running well rather than a problem with your new DRAM model. There may
be some interaction between the two, though, where the new memory makes
the timing line up to cause O3 to behave poorly. What you can do is
instrument dynamic instruction creation and destruction and reference
counting (try printing "this" for both the reference counting wrapper
and the dyn inst itself) and turn it on as close as you can to where
things go bad tick wise. Then look for an instruction which gets lost,
and look for where its reference count is incremented and decremented.
It should be relatively easy to pair up where references are created and
destroyed, and you should be able to identify the reference which never
goes away. Then you need to figure out where that reference is being
created. After that, you should have enough information to identify why
the reference counting isn't being done correctly. It's arduous, but
that's the only way.
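The pairing step described above can be automated once the increments
and decrements are in a trace. The sketch below assumes each event has
already been parsed into a (tick, op, holder, inst) tuple, where holder
is the "this" of the reference counting wrapper and inst is the "this"
of the dyn inst. That tuple layout and the function name are
hypothetical; gem5 does not emit this format by itself.

```python
from collections import defaultdict


def unmatched_references(events):
    """Return {inst: [(holder, tick), ...]} for increments never undone.

    Each event is (tick, op, holder, inst) with op equal to "inc" or "dec".
    This layout is an assumption made for the sketch.
    """
    live = defaultdict(list)
    for tick, op, holder, inst in sorted(events):
        if op == "inc":
            live[inst].append((holder, tick))
        elif op == "dec":
            # Cancel the earliest outstanding increment from the same holder.
            for i, (h, _) in enumerate(live[inst]):
                if h == holder:
                    del live[inst][i]
                    break
    return {inst: refs for inst, refs in live.items() if refs}


# Synthetic events: holder 0xb0 increments 0x1000 and never decrements.
events = [
    (100, "inc", "0xa0", "0x1000"),
    (110, "inc", "0xb0", "0x1000"),
    (120, "dec", "0xa0", "0x1000"),
    (130, "inc", "0xc0", "0x2000"),
    (140, "dec", "0xc0", "0x2000"),
]
leaks = unmatched_references(events)
print(leaks)
```

Whatever remains in the result is a candidate for the reference that
never goes away, together with the tick at which it was taken.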


It's important to also make sure reference counts aren't decremented to
zero prematurely. I had a problem once where that happened and the
memory behind the object was updated by something that didn't know it
was dead. The memory had since been reallocated to another object of the
same type, so that other object reflected what happened to the phantom
one. If I remember, that manifested as something weird like an add
causing a page fault or something.


Gabe

On 04/07/12 18:21, Andrew Cebulski wrote:

Hi all,
I've looked into this problem some more, and have put together a couple
of traces. I've been becoming more familiar with how gem5 handles
dynamic instructions, in particular how it destroys them. I have two
traces to compare, one with the physical memory, and the other with the
integrated dramsim2 dram memory. I also have two plots showing
instruction counts over time (sim ticks). All of these are linked at the
end of the email.

First, I'm going to go into what I've been able to interpret regarding
how instructions are destroyed. In particular, I compare when DynInsts
vs. DynInstPtrs are destroyed/removed from the cpu. I separate these
because I've seen a difference, as I discuss later. These explanations
are fairly non-existent on the wiki. There is a section header waiting
to be filled...

From what I have been able to gather from the code, there is a list of
all the instructions in flight in cpu/o3/cpu.cc called instList, with
the type DynInstPtr. There are three conditions under which instructions
are cleaned from this list:

1.) The ROB retires its head instruction

2.) Fetch receives a rob squashing signal from the commit, resulting in
removing any instruction not in the ROB

3.) Decode detects an incorrect branch prediction, resulting in removal
of all instructions back to the bad seq num.

Once all five stages have completed, the CPU cleans up all the removed
in-flight instructions. This line in particular in cleanUpRemovedInsts()
in cpu/o3/cpu.cc destroys a DynInstPtr:

instList.erase(removeList.front());

When I turn on the debug flag O3CPU, I see the message "Removing
instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and
pcState after all 5 cpu stages have completed, and one of the conditions
above is met. I also see what tick it occurs on.

When I turn on the DynInst debug flag, I see when instructions are
created and destroyed (cpu/base_dyn_inst_impl.hh) and at what tick. From
analyzing the trace files, I've gathered that this takes into account
that instructions have different execution lengths. So if a memory
instruction is removed from the instList (DynInstPtr) on one tick, the
DynInst for that memory instruction may be destroyed much later (i.e. 1M
ticks later). I have yet to determine how this is implemented.

Now for the problem.

What I'm seeing when I run dramsim2 dram memory is a significant
difference between the size of the instList (of DynInstPtr objects) and
the dynamic instruction count (of DynInst objects). The benchmark I'm
running is libquantum from SPEC 2006. For the first roughly 130B ticks,
the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh shadows
the instList size in o3/cpu.cc (figure linked below) very closely.
Around tick 130B after libquantum started, it starts hitting what I'm
assuming are loops (therefore branch prediction), resulting in some
behavior that seems to imply improper instruction handling (i.e. more
instructions in flight than allowed by ROB).
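The point where the two counts stop tracking each other can also be
located programmatically rather than by eye. The following is a rough
sketch that assumes both series are available as lists of (tick, value)
pairs sorted by tick. The function name, the slack threshold and the
sample numbers are all made up for illustration.

```python
from bisect import bisect_right


def first_divergence(inst_count, inst_list_size, slack):
    """Return the first tick where inst_count exceeds the most recent
    instList size by more than `slack`, or None if that never happens.

    Both arguments are lists of (tick, value) pairs sorted by tick.
    """
    ticks = [t for t, _ in inst_list_size]
    for tick, count in inst_count:
        idx = bisect_right(ticks, tick) - 1
        if idx < 0:
            continue  # no instList sample at or before this tick yet
        if count - inst_list_size[idx][1] > slack:
            return tick
    return None


# Synthetic samples, not taken from the traces in this thread.
inst_count = [(10, 180), (20, 190), (30, 260), (40, 400)]
inst_list_size = [(5, 178), (25, 192), (35, 192)]
diverged_at = first_divergence(inst_count, inst_list_size, slack=50)
print(diverged_at)
```

Turning on the more verbose debug flags shortly before the reported tick
keeps the trace small while still covering the instructions that get
lost.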

I wasn't able to sync up the physical and dramsim2 traces exactly by
trace, but they should represent roughly the same area of execution.
They don't execute the same due to dramsim2 modeling the memory
differently (i.e. latency and other delays).

I've shared both traces on my public Dropbox here --

http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz [14]

http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz [15]

Here are a couple of plots of tick versus instruction count, with
respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and
instList.size() in cpu/o3/cpu.cc. --

http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png [16]

http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png [17]

Note that I added the printout of the instList size to an existing O3CPU
DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.

Here are the commands I ran to parse the traces into data files to
analyze in MATLAB and create the plots:

zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out

zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
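For anyone who would rather do this filtering in a script than in a
shell pipeline, the commands above translate roughly to the Python
sketch below. It uses plain substring matching instead of a regular
expression and skips lines with fewer than 11 fields (awk would print an
empty field instead). The extract_fields name and the demo lines are
invented for illustration and are not real gem5 output.

```python
import gzip
import os
import tempfile


def extract_fields(path, must_contain, columns=(1, 11)):
    """Yield the selected 1-based whitespace-separated fields of each line
    in a gzip-compressed trace that contains every required substring."""
    with gzip.open(path, "rt", errors="replace") as trace:
        for line in trace:
            if all(word in line for word in must_contain):
                fields = line.split()
                if len(fields) >= max(columns):
                    yield tuple(fields[c - 1] for c in columns)


# Demo on a tiny synthetic trace (made-up lines, not real gem5 output).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.out.gz")
    with gzip.open(demo, "wt") as f:
        f.write("1000: cpu: DynInst: a b c d e destroyed count 57\n")
        f.write("1001: cpu: DynInst: a b c d e created count 58\n")
        f.write("1002: cpu: Fetch: short line\n")
    rows = list(extract_fields(demo, ("DynInst", "destroyed")))

print(rows)
```

As with awk, the first field keeps its trailing colon, so it needs to be
stripped before the tick is converted to a number.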

It seems to me like the problem might lie in gem5, but has just been
exposed by integrating this more detailed memory model, dramsim2, into
gem5. Either that, or there are some timing errors in how dramsim2 was
integrated. I doubt this, however, since those first 190B ticks executed
used the dramsim2 memory. I believe the problem is a combination of
memory instructions + complex loops (branch prediction), resulting in
improper destroying of instructions.

I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
There are 192 ROB entries, which is why the instList size generally has
a max of about 192 instructions. The dynamic instruction counts (seen in
the dramsim2 plot) seem to also imply that instructions are incorrectly
being removed from the ROB, and then from the cpu's instruction list in
cpu.cc, which allows more and more instructions to be added to the
system (possibly from a bad branch).

I appreciate any help in debugging this and further figuring out the
root problem, just let me know if you need anything else from me. I
don't have much more time at the moment to debug, but I can take any
advice for quick changes and/or additional traces, then send the results
back to the list for discussion.

Thanks,
Andrew

P.S. Paul - I did try decreasing the size of the dramsim2 transaction
(and even command) queue from 512 to 32. The same instructions problem
occurred. It basically just decreased the execution time.


On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <sa...@umich.edu [18]> wrote:

The error is that there are more than 1500 instructions currently in
flight in the system. It could mean several things:

1. The value is somewhat arbitrarily defined and maybe there are more
than 1500 in your system at one time?

2. Instructions aren't being destroyed correctly

You could try to run a debug binary so you'll get a list of instructions
when it happens or increase the number, which may be appropriate for
certain situations (but 1500 is quite a few inflight instructions).


Ali

On 13.03.2012 10:56, Andrew Cebulski wrote:

Hi Xiangyu,
I just started looking into this some more. So at first I thought it was
due to updating to a more recent revision, but then I went back to
revision 8643, added your patch, built and ran....and now get the error
with it too (when running ARM_FS/gem5.opt). I'm testing now to see if an
update to SWIG might have resulted in this error; maybe someone on the
mailing list would know if that's possible. The difference is 1.3.40 vs.
2.0.3, both of which are supported according to the dependencies wiki
page.

Just for completeness, here's the error from revision 8643:

build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void

BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion
`cpu->instcount


I have not tried running with gem5.debug, so I will be doing that today.
Maybe this is an assertion that is occurring due to an optimization.
That would mean it wouldn't be triggered in gem5.debug since it runs
without optimizations. Have you tested all of debug, opt and fast with
your tests?

Thanks,

Andrew


On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <riosher...@gmail.com [11]> wrote:

Hi Andrew,

I didn't see this error in my simulations. May I ask which gem5 version
you are using? I find some of the latest code updates do not comply with
my changes. I am still using the DRAMsim2 patch on Gem5 repo8643, and
have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
PARSEC2 on ARM_SE.


Thank you!


Best,

Xiangyu

FROM: Andrew Cebulski [mailto:af...@drexel.edu [8]]
SENT: Thursday, March 08, 2012 6:52 PM
TO: gem5 users mailing list
CC: riosher...@gmail.com [9]; sa...@umich.edu [10]
SUBJECT: Re: [gem5-users] A Patch for DRAMsim2 Integration

Xiangyu,

I've been having an issue recently with the number of instructions I've
been seeing committed to the CPU (I have a separate thread on this). It
turns out the issue seems to be coming from this patch you created to
integrate DramSim2 with Gem5. Unfortunately, I've been running with
gem5.fast, not gem5.opt. So up until now, I haven't been seeing
assertions. I thought I'd run it with gem5.opt or debug back in
December, but I must not have. My runs on the Arm O3 cpu fail with this
assertion:

build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
[with Impl = O3CPUImpl]: Assertion `cpu->instcount

-Andrew

Date: Sun, 18 Dec 2011 01:48:58 -0800
From: "Dong, Xiangyu" <riosher...@gmail.com [3]>
To: "gem5 users mailing list" <gem5-users@gem5.org [4]>
Subject: [gem5-users] A Patch for DRAMsim2 Integration
Message-ID: gmail.com [5]>
Content-Type: text/plain; charset="us-ascii"

Hi all,

I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS modes.
I'm willing to share it here.

For those who have such needs, please go to my website
www.cse.psu.edu/~xydong [6] to download the patch and test it. To enable
DRAMSim2, use the se_dramsim2.py script instead of se.py (for FS, you
can create one yourself). The basic idea to enable the DRAMsim2 module
is to use the derived DRAMMemory class instead of the PhysicalMemory
class.

Please let me know if there are bugs.

Thank you!

Best,

Xiangyu Dong



_______________________________________________

gem5-users mailing list

gem5-users@gem5.org [12]

http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users [13]



Links:
------
[1] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[2] mailto:af...@drexel.edu
[3] mailto:riosher...@gmail.com
[4] mailto:gem5-users@gem5.org
[5] http://gmail.com
[6] http://www.cse.psu.edu/%7Exydong
[7] http://m5sim.org/cgi-bin/mailman/private/gem5-users/attachments/20111218/f3fdf5da/attachment.html
[8] mailto:af...@drexel.edu
[9] mailto:riosher...@gmail.com
[10] mailto:sa...@umich.edu
[11] mailto:riosher...@gmail.com
[12] mailto:gem5-users@gem5.org
[13] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[14] http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
[15] http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
[16] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
[17] http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
[18] mailto:sa...@umich.edu
[19] mailto:gem5-users@gem5.org
[20] http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
[21] http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
[22] http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
[23] mailto:gbl...@eecs.umich.edu


