From your command output it's pretty clear that CPU 1 isn't getting directed to the
proper start address for the program, but CPU 0 is.
cpu0.execute: Waking up Fetch (via Execute) by issuing a branch:
(0x400190=>0x400198).(0=>1)
cpu1.execute: Waking up Fetch (via Execute) by issuing a branch:
(0=>0x8).(0
The Minor CPU supports full system mode on ARM. It was developed by ARM
and is used there to run Linux-based full system simulations. It is
primarily tested/used with the classic memory system.
On Tue, Jun 16, 2015 at 8:30 AM, Konstadinos PARASYRIS
wrote:
>
> Hello,
>
> Could someone please in
) out before pushing the SMT ones.
>
> It seems to me our work complements each other.
>
> Best regards,
> Alex
> --
> *From:* gem5-users [gem5-users-boun...@gem5.org] on behalf of Mitch
> Hayenga [mitch.hayenga+g...@gmail.com]
> *Sent:* Friday, Ap
Hi Alexander,
Just saw this thread and thought I'd contribute. Are you focusing
additional SMT support on the various CPU models or just x86-ISA/library
side of things? I'm wondering how much overlap we have.
I've recently been working on extending gem5 SMT support by adding SMT to
the Atomic,
Here's how o3 would work in this case. The relevant code is in
src/cpu/o3/lsq_unit.hh (in the "LSQUnit::read" function) around line 640.
The code in the backend explicitly works on micro-ops, so each load/store
micro-op will get its own LSQ entry. If both the ldrd and strd are
cracked, then not
Hmm, unsure of what was removed. Prefetching across page boundaries should
have always been broken/bad from the perspective of the prefetchers, since
they sit in the memory system, which deals purely in physical addresses in
gem5. Without a TLB/walker interface to the prefetchers, there is no proper way
Have you tried running with the O3CPUAll debug flag? That may shed some
more light on what's happening. Steve's suggestion sounds like a
possibility.
On Sat, Nov 22, 2014 at 12:24 PM, Steve Reinhardt via gem5-users <
gem5-users@gem5.org> wrote:
> I don't recall the details, but there's some issu
Whoops, sorry, I just read the ruby stats at the end and missed the earlier
sim_insts. Sorry for the misread; someone with knowledge of the ruby
stats is needed, I guess.
On Fri, Nov 14, 2014 at 8:27 AM, Mitch Hayenga wrote:
> Haven't used ruby with gem5... But are instruction fetches enti
Haven't used ruby with gem5... But are instruction fetches entire cache
lines (or something larger than a single instruction)? Then 1 load per
cache line of instructions isn't crazy.
On Fri, Nov 14, 2014 at 7:42 AM, Geeta Patil via gem5-users <
gem5-users@gem5.org> wrote:
>
> Hi All,
>
> I got
Hi,
I suspect one of a few things might be behind what you think you are seeing.
1) "I have observed that call and return instructions are predicted as if these
instructions were conditional branches."
The ARMv7 ISA actually does have conditional calls and returns (this is a
consequence of lett
numCycles is only incremented on cycles where the CPU is clocked (see
src/cpu/o3/cpu.cc: line 540).
Two things can lead to this not correlating with the number of sim_ticks.
1) Quiesce instructions ("wait for interrupt" on ARM), cause the CPU to
sleep until an interrupt or some external event occu
In general there are 3 functions that the CPU calls on instructions in
order to execute them. These are all functions within the "StaticInst"
class.
1) initiateAcc
2) completeAcc
3) execute
The first 2 are used for memory operations while the 3rd is what you care
about for your integer example.
Hi, this should have been fixed in this changeset.
http://repo.gem5.org/gem5?cmd=changeset;node=0edd36ea6130
I don't believe this fix is yet in gem5-stable, but it is in the
development branch.
On Wed, Sep 24, 2014 at 7:47 PM, Khaled Mahmoud via gem5-users <
gem5-users@gem5.org> wrote:
> Hi,
>
Hi George,
For the tagged prefetcher, I believe this is a bug in the current
implementation. I hit this a few weeks ago on ARM. For ARM (assuming X86
is the same), hardware page table walk requests are not assigned a thread
ID. When generating prefetches, the tagged prefetcher attempts to tag t
ith switched cpu after
> restore?
>
>
> On Tue, Sep 2, 2014 at 3:33 PM, Mitch Hayenga <
> mitch.hayenga+g...@gmail.com> wrote:
>
>> Yes you can. Generally the preferred way to run is to boot/start a
>> benchmark with the atomic CPU and then drop a checkpoint.
Yes you can. Generally the preferred way to run is to boot/start a
benchmark with the atomic CPU and then drop a checkpoint. You can then
restore from the checkpoint with the "detailed" CPU.
Simple use case:
1) specify gem5 command with "--checkpoint-at-end" and the atomic CPU
2) Once the benchm
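For concreteness, the two runs would look something like this (assuming an
ARM SE-mode build; exact flag spellings depend on your gem5 version, so
treat this as a sketch):

    # Run 1: fast-forward with the atomic CPU, checkpoint at the end
    ./build/ARM/gem5.opt configs/example/se.py --cpu-type=atomic \
        --checkpoint-at-end -c my_benchmark
    # Run 2: restore checkpoint 1 with the detailed CPU
    ./build/ARM/gem5.opt configs/example/se.py --cpu-type=detailed \
        --caches -r 1 -c my_benchmark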
last committed user instruction and
> >> first instruction in apic_timer_interrupt function. This confirms that
> >> the last user instruction sits in commit until timer interrupt happens.
> >> Am I right about this?
> >>
> >> Next step, I think I need to
x86KvmCPU to boot up, then take checkpoints and run from
> checkpoints.
>
> I will report whether this works or not.
>
> Thanks.
>
> --
> Best Regards
> Yan Zi
>
> On 27 Aug 2014, at 15:44, Mitch Hayenga wrote:
>
> > There are probably three main patches th
There are probably three main patches that could help. The fact you
mention the timer interrupt makes me think Andreas is right and these might
solve your issue.
1. http://reviews.gem5.org/r/2363/ - o3 is supposed to stop fetching
instructions immediately once a quiesce instruction is encountere
> Thanks for the response Mitch. It seems like a nice way to fake a
> pipelined fetch.
>
> Amin
>
>
> On Tue, Aug 26, 2014 at 10:54 AM, Mitch Hayenga <
> mitch.hayenga+g...@gmail.com> wrote:
>
>> Yep,
>>
>> I've thought of the need for a fully pip
Yep,
I've thought of the need for a fully pipelined fetch as well. However my
current method is to fake longer instruction cache latencies by leaving the
delay at 1 cycle and making up for it with additional "fetchToDecode"
delay. This makes the front-end latency and branch mispredict penal
Are you sure it's actually ignoring dependencies? Run with
--debug-flags=IntRegs,FloatRegs to verify what vdivs wrote and what vstr
read.
I'd assume it's just a bug in the gem5 generateDisassembly routine printing
out the instruction format. It probably just passed reg 15, meaning
floating poin
Hi,
I'm the one who originally wrote that config file. If you don't need
anything else from those scripts, I'd just use the mainline run scripts
like Andreas said and copy the appropriate config values like O3_ARM_v7a.
Since those scripts were written the branch predictor structure was
re-organi
PS: This issue was fixed about two weeks ago by putting in an assert to
warn when MaxWidth was <= any width in the machine. So it's in the
mainline, just not in gem5-stable.
http://repo.gem5.org/gem5?cmd=changeset;node=790a214be1f4
On Tue, May 13, 2014 at 4:36 PM, Mitch Hayenga wrote:
Gem5 has some hard compile-time limits on how large certain widths can be.
In src/cpu/o3/impl.hh there is a line that sets "MaxWidth = 8". Increase
this to greater than or equal to 16 (or whatever the maximum width in your
machine is).
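The declaration looks roughly like this (a sketch from memory; check
impl.hh for the exact form):

    // src/cpu/o3/impl.hh
    static const int MaxWidth = 8;  // raise to >= the widest stage, e.g. 16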
The issue you are hitting is time buffer entries writing int
functional unit if you can't get one).
>>>
>>> I believe Karu Sankaralingham at Wisc also found this and a few other
>>> issues, they have a related paper at WDDD this year.
>>>
>>> We also found a problem where multiple outstanding loads to the same
>&g
*"Realistically, to me, it seems like those buffers would be distributed
among the function units anyway, not a global resource, so having a global
limit doesn't make a lot of sense. Does anyone else out there agree or
disagree?"*
I believe that's more or less correct. With wbWidth probably mean
Yep, having only a single store in flight is a significant limitation of TSO. There
are things you can do to alleviate it (which gem5 doesn't do).
A CPU could speculatively try to obtain ownership of a cacheline before a
store is fully committed. Thus the store could be retired much more
quickly to the
"--trace-start" and "--trace-file" were renamed to "--debug-start" and
"--debug-file"
Hope that helps.
On Tue, Apr 15, 2014 at 2:40 PM, Kuk-Hwan Kim wrote:
>
> Dear Gem5 community member,
>
> I wish to display pipeline stages from 200cycles to 1000. So, I would like
> to create trace.out by us
You get this when you try to execute a non-memory instruction as a memory
instruction.
You first need to figure out what type of instruction was being executed
and then go about figuring out what isn't done properly. Since you have a core
dump you could view the stack trace most likely to figure out w
Praxal,
I'm pretty sure the other two replies answer your question. It is
completely possible for slight timing changes to change the number of
memory accesses and instructions simulated.
Because the o3 cpu is speculative, slight timing changes can result in
fewer or more speculative memory acce
So,
The first thing you need to do is identify which x86 instruction is causing
this (mnemonic and binary encoding). This looks to be an issue in the ISA
decoder for gem5 either not properly detecting the instruction you are
executing or not fully supporting it.
Basically, you are executing what
g at the rename and iew stages for a given instruction. It
> turns out that this register is being accessed in rename stage but not in
> iew stage. Does this mean that the CPSR is not required for these
> instructions?
>
> Thanks
> V Vanchinathan
>
>
> On Thu, Jan 2, 20
R33 is a "zero register". It is used whenever a zero is required. It is
also often sourced unnecessarily if an instruction requires fewer source
registers.
In gem5 the basis for splitting is solely up to whoever wrote the ISA
decoder. For ARM it's mostly what you would expect: 2 real sources (not
-r 5e8970397ab7" but I have the error : unknown
>> revision '5e8970397ab7' !
>>
>> Cordialement / Best Regards
>>
>> SENNI Sophiane
>> Ph.D. candidate - Microelectronics
>> LIRMM - www.lirmm.fr
>>
>> Le 05/12/2013 14:58, Mitch H
All mercurial commands have a built in help that explains their options
("hg help diff").
For this one you want "hg diff -r <revision>" (e.g. "hg diff -r 5e8970397ab7").
Hope that helps.
On Thu, Dec 5, 2013 at 6:44 AM, senni sophiane wrote:
> Hi all,
>
> I want to apply the command "hg diff" for the gem5 revision 5e8970397ab7.
> I di
By default they don't exist, but they should be fairly simple to add.
You just have to add your own "tag" class in src/mem/cache/tags. So, I'd
just copy the existing LRU files, rename them something different, then
edit them to implement whatever you want (doing search/replace
within the
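The skeleton would look something like this (method signatures are
approximate, copied by analogy from the LRU class; verify against the real
files in src/mem/cache/tags before using):

    class MyTags : public BaseTags
    {
      public:
        // The same interface the cache already calls on LRU:
        BlkType *accessBlock(Addr addr, Cycles &lat, int context_src);
        BlkType *findVictim(Addr addr, PacketList &writebacks);
    };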
The prefetchers for the classic memory system are located under
src/mem/cache/prefetch.
You'll have to add your prefetcher to the listed classes in Prefetcher.py
as well as add corresponding source files. Note: The "GHB" prefetcher
class is effectively a misnomer and doesn't really work. So don
Just type in the gem5 binary without any arguments. It gives you a list of
accepted parameters.
--stats-file=FILE     Sets the output file for statistics [Default: stats.txt]
--outdir=DIR, -d DIR  Set the output directory to DIR [Default: m5out]
I don't use ruby, so I don't know how to renam
Amin/Tony, there is a very big reason why gem5 does this. It's about
modeling what real processors do.
Modern out-of-order processors are very deeply pipelined, and instructions take
multiple cycles to execute from the time they are scheduled. To enable
back-to-back execution of dependent instructions,
Hi, I'm the person who wrote the config scripts you are using. I don't
have access to them right at this moment but if I remember correctly
#1 would properly warm up the caches. It keeps the caches in the system
and just swaps the connection between the atomic and detailed CPU (depending
on i
Hi,
This happens to me whenever I compile with google's tcmalloc. If you have
that, try disabling it. I tend to just remove the tcmalloc package and do
a fresh rebuild whenever I need to use valgrind to debug memory issues.
This is just because by default valgrind doesn't recognize/trap the
all
Using the atomic cpu, with fastmem, it takes about 7 days for most of the
benchmarks to finish on the "train" input set. This was with the ARM ISA
on a bunch of old 2.4 GHz Opterons.
I calculated it out once using other papers' presented instruction counts,
and running the whole reference input s
Hi Ali,
There is actually a minor bug/race condition in the gem5
ListenSocket::listen function (src/base/socket.cc). I think Hao might be
hitting this, I just haven't had time to upload the patch for it to
the mainline. I hit this when launching hundreds of simulations at the
same time (on
Other possible sources off the top of my head:
Prefetchers?
Wrong-path loads that don't contribute to IPC?
Sent from my phone.
On May 9, 2013 5:06 AM, "Mitch Hayenga"
wrote:
> Trying to analyze stats like this is often more trouble than it's worth...
> But anyway,
Trying to analyze stats like this is often more trouble than it's worth...
But anyway, here is one way this could happen, I think: write misses. As
long as the in-order CPU does not stall for the TLB translations, it can still
get memory-level parallelism for its writebacks. And if they missed in t
That number is equal to the # of instructions in the basic block multiplied
by the # of times the basic block was executed. Looking at what you
attached, I could figure out that basic block 462 was actually a 28
instruction loop. In general if you sum up the second numbers across a
line, they sh
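For example (hypothetical counts): if block 462 is a 28-instruction loop
that executed 1,000 times, the reported number would be 28 * 1,000 = 28,000.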
++ to this likely just being an issue of reading the wrong stat. I've
personally diffed every instruction on a small run of libquantum (though on
ARM).
You can always implement a "poor man's checker" to execute two gem5 cpu
models in lock-step, verifying the committed instruction path (assuming
s
; Hi Mitch,
>
> Thanks for reporting. Is there an easy way to reproduce this?
>
> Andreas
>
> From: Mitch Hayenga
> Reply-To: gem5 users mailing list
> Date: Tuesday, 23 April 2013 01:17
> To: gem5 users mailing list
> Subject: [gem5-users] SimpleDDR3 failing on an
Hi all,
I'm running the SimpleDDR3 memory with default parameters and one of my
benchmarks is failing on a panic/sanity check. I was wondering if anyone knew
of any issues with the default DDR3 parameters or if this sanity check
might be overzealous?
Here's how I've configured the memory:
physmem =
You could add your own DPRINTF that accesses the fields of the
StaticInst.
Check out src/cpu/static_inst.hh. Specifically the functions
hasBranchTarget() and the two branchTarget() functions.
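For example, something like this (a sketch; "inst" is assumed to be a
StaticInstPtr, "pc" the current PCState, and the Branch debug flag is just
one reasonable choice):

    if (inst->hasBranchTarget()) {
        // Direct-branch form; the other overload takes a ThreadContext*.
        TheISA::PCState target = inst->branchTarget(pc);
        DPRINTF(Branch, "pc %s -> target %s\n", pc, target);
    }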
On Sat, Apr 20, 2013 at 4:24 PM, Meng Wang wrote:
> Hi, all
> I dumped trace of ARM benchmark
Quick answer: system.cpuname.commit.committedInsts counts nops and
instruction prefetches, system.cpuname.committedInsts doesn't.
These stats are incremented in src/cpu/o3/commit_impl.hh:updateComInstStats
and src/cpu/o3/cpu.cc:instDone. instDone is called by updateComInstStats
after testing for
Hi Meng,
I'm CC'ing the mailing list in case anyone else has interest in running
with the simpoint patch.
This part of the patch was coded by Ali I think. I originally wrote the
profiling bit that generated the bbv file. I use this current patch with
my own custom se.py script. I've linked
Last level cache miss rates can be quite high on SPEC. Effectively the
higher level caches "filter" all of the easily cacheable accesses, so that
the last level cache only sees accesses that tend to miss.
Aamer Jaleel, of Intel, has published miss rates for L1/L2/L3 cache
configurations on the r
The changeToTiming function was removed ~3 weeks ago from the mainline.
http://repo.gem5.org/gem5/rev/1cd02decbfd3
CPUs now define a method that is used to determine their memory mode
(timing or atomic). Check O3CPU.py for an example. If a mode change is
necessary when swapping CPUs, the memor
It looks like the logic is just organized poorly. Yes, it will
unnecessarily stall non-loads if there are no free LSQ entries. It
shouldn't take many changes to fix this (basically changing the later while
loop to track the number of LSQ entries remaining and accounting based upon
number of loads sen
Oops, perhaps posted a bit too quickly... It seems my detailed cpu model
wasn't properly connected to the system, and this was just a very poor
error message.
On Sun, Mar 3, 2013 at 2:42 PM, Mitch Hayenga
wrote:
> Hi all,
>
> I'm trying to automate switching between an atomic
Hi all,
I'm trying to automate switching between an atomic cpu and my own cpu.
This is done via my own python configuration script. With the default
config script (configs/common/se.py), switching works properly. With my
own script, which makes the same calls, it fails because it doesn't find
By using the "timing" cpu, you are effectively using something that is like
an idealized 1-wide, in-order cpu model. So the maximum possible IPC would
be one, and with cache accesses, etc. it should be expected to be much lower
than one.
For relative comparisons, especially papers not looking expli
Not that I know of. For a poor man's version, since it seems you are just
trying to generate various traces from a region of interest... You could
just use the existing m5ops to checkpoint @ the beginning of the region of
interest and exit @ the end. That way you could just run from the
checkpo
They all inherit from the same set of classes in
src/arch/generic/types.hh. You can use any of those constructors or
"set" methods to properly convert. Also look at the corresponding types.hh
file in the ISA-specific folder (ex: src/arch/arm/types.hh).
TheISA::PCState pc_addr(0x12cf4); //
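Filling that example out a bit (a sketch; method names from
src/arch/generic/types.hh):

    TheISA::PCState pc_addr(0x12cf4);  // construct from a raw Addr
    pc_addr.set(0x12cf4);              // ...or reset an existing one
    Addr raw = pc_addr.instAddr();     // and back to a plain address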
99e).(0=>1).
2 loads sent to the memory system in the same cycle, both hit in the L1
cache, but result in different cycle latencies.
On Fri, Feb 15, 2013 at 10:36 AM, Mitch Hayenga <
mitch.hayenga+g...@gmail.com> wrote:
> This is a nicely timed thread. I just hit a related ticking issue
This is a nicely timed thread. I just hit a related ticking issue while
performance validating my core model. Here is an example case:
ld r1, [sp, #0x16] // L1 cache hit
ld r2, [sp, #0x24] // L1 cache hit
My core assumes 2 load ports, so both of these loads issue and hit in the
same cycle. B
Hi,
I'm currently traveling and don't run full system that much myself these
days. So maybe someone else would be a better choice to help you.
That said, I searched the list for your error and it looks like it's the
same problem as discussed in this thread?
http://comments.gmane.org/gmane.comp.em
Hi,
It looks to me like you might just be dumping execution information during
the OS boot (before the benchmark has even started to run). Which, should
be the same regardless of the benchmark (since it wouldn't have started).
Is this the case? Also, be warned dumping execution information for
ons (
http://www.cs.ucr.edu/~tianc/).
On Sat, Jan 26, 2013 at 8:23 PM, Mitch Hayenga wrote:
> "If both answers are t-1, which means the output of any stage only depends
> on some other stages' output at previous cycle, then I can understand why
> time buffer can get ride of th
"If both answers are t-1, which means the output of any stage only depends
on some other stages' output at previous cycle, then I can understand why
time buffer can get rid of the dependencies. However, if a stage requires
a result from another stage at the same cycle, I cannot see how this works.
Nilay,
Ticking pipestages in reverse (and allowing values to propagate in that
order) is a *very* common way to implement processor simulators. I'd almost
call it the standard method. Though gem5 gets around this via the
timebuffer, other simulators do not use a timebuffer/pipe method. For
examp
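The reverse-tick idea in miniature (purely illustrative, not gem5 code):

    struct Stage { void tick() { /* read input latch, write output latch */ } };
    struct Pipeline {
        Stage fetch, decode, execute, commit;
        void tick() {
            // Back-to-front: each stage consumes its input latch before
            // the upstream stage overwrites it, so values written this
            // cycle are seen next cycle without an explicit time buffer.
            commit.tick();
            execute.tick();
            decode.tick();
            fetch.tick();
        }
    };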
You should do
gem5.opt configs/example/se.py --help
It's clearly documented there how to do this.
-i INPUT, --input=INPUT
Read stdin from a file.
--output=OUTPUT Redirect stdout to a file.
--errout=ERROUT Redirect stderr to a file.
So for your case
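Something along these lines (file names are placeholders):

    gem5.opt configs/example/se.py -c my_benchmark \
        --input=stdin.txt --output=stdout.txt --errout=stderr.txt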
The SC (2) failing after (1) should force the program to loop trying to
properly execute a LL/SC pair. Assuming (1) and (2) properly execute, the
value of the lock will be set to taken. This would force your other thread
to continuously loop on its LL until it saw the lock was free.
I think your co
t; **
>
> Any updates Mitch?
>
> Thanks,
>
> Ali
>
>
>
> On 11.10.2012 20:44, Mitch Hayenga wrote:
>
> Hi,
>
> I have a patch that fixes this in classic and ruby. I was waiting for
> another student (Dibakar, he runs a lot more parallel code than I do) to
>
"It releases the lock using normal store"
I think this might be where your confusion is coming from.
This is not true, it does a store conditional not a normal store. The
store conditional only stores if the context id is still set on the
cacheline. This code is in (if ruby) src/mem/ruby/system/
;
>
> On Tue, Oct 23, 2012 at 10:02 AM, Mitch Hayenga <
> mitch.hayenga+g...@gmail.com> wrote:
>
>> Since the gem5 O3 CPU model actually executes instructions @ execute (not
>> fetch/decode), a perfect branch predictor is a bit tricky. Assuming you are
>> running a s
t; inst->setPredTaken(false);
> return false;
> }
>
>
> to
>
> //if (!inst->isControl()) {
> TheISA::advancePC(nextPC, inst->staticInst);
> inst->setPredTarg(nextPC);
> inst->setPredTaken(false);
>
Since the gem5 O3 CPU model actually executes instructions @ execute (not
fetch/decode), a perfect branch predictor is a bit tricky. Assuming you are
running a single-threaded app in SE mode (so you don't have
OS/multi-threaded time variance issues), you could simply run the
application twice. Sav
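In code terms, the two-pass idea might look like this (entirely a sketch,
not existing gem5 code; assumes a deterministic SE-mode run so PCs repeat
exactly):

    #include <unordered_map>
    std::unordered_map<Addr, Addr> perfect;  // committed PC -> next PC

    // Pass 1, at commit: perfect[cur_pc] = next_pc; dump the map at exit.
    // Pass 2, in fetch: override the predictor with the recorded answer.
    auto it = perfect.find(fetch_pc);
    if (it != perfect.end())
        next_fetch_pc = it->second;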
not be implicitly expanded
> to cover the whole block as we've done. So you've convinced me that that's
> not just the most straightforward fix, but probably the right one.
>
> If you get it working, please submit the patch.
>
> Thanks!
>
> Steve
>
>
; Steve
>
>
> On Wed, Sep 26, 2012 at 12:50 PM, Mitch Hayenga <
> mitch.hayenga+g...@gmail.com> wrote:
>
>> Thanks for the reply.
>>
>> Thinking about this... I don't know too much about the O3 store-set
>> predictor, but it would seem that load-l
want to mark the ops as serializing as that slows
> down the cpu quite a bit.
>
>
>
> Thanks,
>
> Ali
>
>
>
> On 26.09.2012 13:14, Mitch Hayenga wrote:
>
> Background:
> I have a non-o3, out of order CPU implemented on gem5. Since I don't have
> a check
Background:
I have a non-o3, out of order CPU implemented on gem5. Since I don't have
a checker implemented yet, I tend to diff committed instructions vs o3.
Yesterday's patches caused a few of these diffs to change because of
load-linked/store-conditional behavior (better prediction on data ops tha
f
>> 75,899,868 flits and the successful reception of 75,899,865 flits. Am I
>> doing something wrong with the simulation? Do I need to set some parameters
>> for the power calculations?
>>
>> Thanks for your time.
>>
>> Thanks,
>> Pavan
>>
>&
I actually did a slight modification of the m5 classic protocol to "fix"
this.
Basically, I allowed the data to remain dirty in the L2 & forwarded a
version that looked clean-exclusive to the L1. The way m5 is structured,
the L2 would already snoop upwards to the L1 if it got a request from below
understand this somewhat more so than the
previous issue, since forcing traffic on another cache is undesirable, but
with the non-inclusive nature of the hierarchy, this original request may
have to go all the way out to memory.
Just doing some sanity checking that this is how things are supposed to