Hello,

I have seen similar issues when running X86 timing and detailed CPUs with the 
Classic memory system, mostly due to X86 atomic (locked) memory accesses not 
being implemented there. The stdout freezes, but instructions are still being 
committed.

If you want to run timing or detailed CPUs with X86, full system, and multiple 
cores, I am afraid you will need to use Ruby, e.g. along the lines of the 
sketch below.
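
Something like this might work (only a sketch; I have not checked it against 
your exact revision, and protocol names and the --ruby option differ between 
gem5 versions):

# Build gem5 with a Ruby protocol that supports x86 FS (MOESI_hammer is one example)
scons build/X86/gem5.opt PROTOCOL=MOESI_hammer
# Run fs.py with the Ruby memory system instead of the Classic one
build/X86/gem5.opt configs/example/fs.py --ruby -n 2 <your usual fs.py options>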

Emilio
________________________________
From: gem5-users [gem5-users-boun...@gem5.org] on behalf of Ivan Stalev via 
gem5-users [gem5-users@gem5.org]
Sent: Friday, June 6, 2014 1:14
To: Joel Hestness
CC: gem5 users mailing list
Subject: Re: [gem5-users] System Hangs

Hi Joel,

Thanks for getting back to me.

I ran it again with the ProtocolTrace flag, and the only output there is:

0: rtc: Real-time clock set to Sun Jan  1 00:00:00 2012

With the Exec flag, I do see spinlock output on and off at the beginning during 
regular execution, so that is normal, as you said. But once the "problem" occurs 
shortly afterward, the Exec output is just the continuous spinlock forever, as I 
posted previously.

The exact gem5 command lines I use are posted in my previous post. The kernel 
and disk image are simply the default ones from the gem5 downloads page: 
http://www.m5sim.org/dist/current/x86/x86-system.tar.bz2

I have attached a zip file containing the following files:

BOOT-config.ini - The config.ini from the first run, i.e. booting in atomic 
mode, creating a checkpoint, and exiting.
BOOT-system.pc.com_1.terminal - The terminal output from the first run
CPT-config.ini - The config.ini when restoring from the checkpoint in detailed 
mode
CPT-system.pc.com_1.terminal - The system output after restoring from the 
checkpoint
run.c - The dummy program started by the run script
run.rcS - The run script
flag-exec-partial.out - The output from the Exec flag, right before the 
"problem" occurs. The infinite spinlock starts at tick 5268700121500

Again, this problem occurs even without checkpointing. I have also tried a few 
different kernels and disk images. I ran the same test with both Alpha and 
ARM64 and it works there, so it appears to be an issue with x86 only.

Thank you,

Ivan



On Tue, Jun 3, 2014 at 7:53 PM, Joel Hestness 
<jthestn...@gmail.com> wrote:
Hi Ivan,
  Sorry for the delay on this.

  I haven't had an opportunity to try to reproduce your problem, though the 
traces you've supplied here help a bit.  Specifically, the stalled LocalApics 
(plural, because there are 2 CPU cores) are fishy, because we'd expect periodic 
interrupts to continue.  However, the last interrupt on CPU 1 appears to get 
cleared, which looks fine.  The CPU spin lock is normal for threads that don't 
have any work to complete, but it's puzzling why they wouldn't be doing 
something.

  The next thing to dig into is what the CPUs were doing last, before they 
entered the spin loop.  For this we may need to trace a bit earlier in time 
using the Exec flag, and since messages/responses may be getting lost in the 
memory hierarchy or devices, we'll also want the ProtocolTrace flag to see what 
is being communicated.  You could try playing around with these as a start, 
e.g. something like the sketch below.
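
Only a sketch (option spellings differ between gem5 revisions: newer ones use 
--debug-flags/--debug-start, older ones --trace-flags/--trace-start; the start 
tick here is a hypothetical value chosen just before the spin loop in the trace 
you posted):

build/X86/gem5.opt -d m5out/test_run \
    --debug-flags=Exec,ProtocolTrace --debug-start=5268900000000 \
    configs/example/fs.py <the same fs.py options you used to restore>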

  I may also have time to try to reproduce this over the next week, so I'm 
hoping you could give me some more information: can you send me your gem5 
command line, config.ini, and system.pc.com_1.terminal output from your 
simulation, and details on the kernel and disk image that you're trying to use?


  Thanks!
  Joel




On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev 
<ids...@psu.edu> wrote:
Hi,

Has anyone been able to reproduce this issue?

Thanks,

Ivan


On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev 
<ids...@psu.edu> wrote:
Hi Joel,

I am using revision 10124. I removed all of my own modifications just to be 
safe.

Running with gem5.opt and restoring from a boot-up checkpoint with 
--debug-flag=Exec, it appears that the CPU is stuck in some sort of infinite 
loop, executing this continuously:

5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0  :   CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1  :   CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe A=0xffffffff80822400
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2  :   CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0  :   JLE_I : rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1  :   JLE_I : limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2  :   JLE_I : wrip   , t1, t2  : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16    :   NOP : IntAlu :
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0  :   CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1  :   CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe A=0xffffffff80822400
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2  :   CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0  :   JLE_I : rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1  :   JLE_I : limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2  :   JLE_I : wrip   , t1, t2  : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16    :   NOP : IntAlu :
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0  :   CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1  :   CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe A=0xffffffff80822400
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2  :   CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0  :   JLE_I : rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1  :   JLE_I : limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2  :   JLE_I : wrip   , t1, t2  : IntAlu :
5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16    :   NOP : IntAlu :

...and so on, repeating indefinitely.

Using --debug-flag=LocalApic, the output does indeed stop shortly after 
restoring from the checkpoint. The last output is:
..
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.

...and no more output from this point on.

I appreciate your help tremendously.

Ivan



On Fri, May 16, 2014 at 11:18 AM, Joel Hestness 
<jthestn...@gmail.com> wrote:
Hi Ivan,
  I believe that the email thread you previously referenced was related to a 
bug that we identified and fixed with changeset 9624 
(http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326).  That bug 
was causing interrupts to be dropped in x86 when running with the O3 CPU.  Are 
you using a version of gem5 after that changeset?  If not, I'd recommend 
updating to a more recent version and trying to replicate this issue again.

  If you are using a more recent version of gem5, first, please let us know 
which changeset and whether you've made any changes.  Then, I'd recommend 
compiling gem5.opt and using the DPRINTF tracing functionality to see if you 
can zero in on what is happening.  Specifically, first try passing the flag 
--debug-flag=Exec to look at what the CPU cores are executing (you may also 
want to pass --trace-start=<tick> with a simulator tick time close to when 
the hang happens).  This trace will include Linux kernel symbols for at least 
some of the lines if executing in the kernel (e.g. handling an interrupt).  If 
you've compiled your benchmark without debugging symbols, it may just show the 
memory addresses of instructions being executed within the application.  My 
guess is that you'll see kernel symbols for at least some of the executed 
instructions, e.g. for interrupt handling.  A sketch of such an invocation 
follows this paragraph.
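
For instance, a sketch (same caveat on option spellings: 
--debug-flags/--debug-start in newer revisions, --trace-flags/--trace-start in 
older ones; <tick> is a placeholder for a time near the hang):

build/X86/gem5.opt -d m5out/test_run \
    --debug-flags=Exec --debug-start=<tick> \
    configs/example/fs.py <the same fs.py options as your run>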

  If it appears that the CPUs are continuing to execute, try running with 
--debug-flag=LocalApic (see the sketch below).  This will print the interrupts 
that each core is receiving; if it stops printing at any point, something has 
gone wrong and we'd have to do some deeper digging.
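
Again just a sketch, with the same caveat about option spellings:

build/X86/gem5.opt -d m5out/test_run --debug-flags=LocalApic \
    configs/example/fs.py <the same fs.py options as your run>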

  Keep us posted on what you find,
  Joel



On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev 
<ids...@psu.edu> wrote:
Hi Joel,

I have tried several different kernels and disk images, including the default 
ones provided on the gem5 website in the x86-system.tar.bz2 download. I run 
with these commands:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
    --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
    -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz \
    --script=rcs_scripts/run.rcS --max-checkpoints=1

My run.rcS script simply contains:

#!/bin/sh
/sbin/m5 resetstats
/sbin/m5 checkpoint
echo 'booted'
/extras/run
/sbin/m5 exit

where "/extras/run" is simply a C program with an infinite loop that prints a 
counter.

I then restore:

build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
    --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
    -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches \
    --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4 \
    --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64

I specified the disk image file in Benchmarks.py. Restoring from the same 
checkpoint and running in atomic mode works fine. I also tried booting the 
system in detailed mode and letting it run for a while, but once it boots, 
there is no more output. So it seems that checkpointing is not the issue. The 
"run" program is just a dummy; the same issue persists when running SPEC 
benchmarks or any other program.

My dummy program is simply:

#include <stdio.h>

int main(void) {
    int count = 0;
    printf("**************************** HEYY \n");
    while (1)
        printf("\n %d \n", count++);
}

Letting it run for a while, the only output is exactly this:

booted
*******

It doesn't even finish printing the first printf.

Another thing to add: in another scenario, I modified the kernel to call an m5 
pseudo instruction on every context switch, and gem5 then prints that a context 
switch occurred. Once again, in atomic mode this worked as expected. However, 
in detailed mode, even gem5's own output (a printf inside gem5 itself) stopped 
along with the system output in the terminal.

Thank you for your help.

Ivan


On Thu, May 15, 2014 at 10:51 PM, Joel Hestness 
<jthestn...@gmail.com> wrote:
Hi Ivan,
  Can you please give more detail on what you're running?  Specifically, can 
you give your command line, and say which kernel and disk image you're using?  
Are you using checkpointing?

  Joel


On Mon, May 12, 2014 at 10:52 AM, Ivan Stalev via gem5-users 
<gem5-users@gem5.org> wrote:
Hello,

I am running X86 in full system mode. When running with just 1 CPU, both atomic 
and detailed mode work fine. However, with more than 1 CPU, atomic works fine, 
but in detailed mode the system appears to hang shortly after boot-up. gem5 
doesn't crash, but the system stops producing any output. Looking at the stats, 
it appears that instructions are still being committed, but the actual 
applications/benchmarks are not making progress. The issue persists with the 
latest version of gem5. I also tried two different kernel versions and several 
different disk images.

I might be experiencing what seems to be the same issue that was found about a 
year ago but appears not to have been fixed: 
https://www.mail-archive.com/gem5-dev@gem5.org/msg08839.html

Can anyone reproduce this, or does anyone know of a solution?

Thank you,

Ivan






--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/




--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/





--
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/

