Re: [gem5-users] System Hangs

Joel Hestness via gem5-users Tue, 03 Jun 2014 16:54:07 -0700

Hi Ivan,
  Sorry for the delay on this.

  I haven't had an opportunity to try to reproduce your problem, though the
traces you've supplied here help a bit.  Specifically, the stalled
LocalApics (plural, because there are 2 CPU cores) are fishy, because we'd
expect periodic interrupts to continue.  However, the last interrupt on CPU 1
appears to get cleared, which looks fine.  Spinning on a lock is normal for
threads that don't have any work to complete, but it's unclear why they
wouldn't have anything to do.
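
  As a rough sanity check on the quoted Exec trace, the loop can be decoded
by hand.  The sketch below uses only values that appear in that trace: the
loaded lock word 0x00000000fffffffe, the instruction pointer
0xffffffff80596897 reported by rdip, and the JLE displacement
0xfffffffffffffff9.  The helper name and variable names are made up for
illustration, and the idea that a non-positive lock word means "lock still
held" is an assumption about this kernel's spin lock, not something the
trace states.

```python
MASK64 = (1 << 64) - 1


def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


# Values copied from the Exec trace quoted later in this thread.
lock_word = 0x00000000FFFFFFFE      # D= of the `ld t1d, DS:[rdi]` micro-op
next_rip = 0xFFFFFFFF80596897       # D= of the `rdip` micro-op
displacement = 0xFFFFFFFFFFFFFFF9   # immediate loaded by the JLE `limm`

lock_signed = to_signed(lock_word, 32)        # the compare uses 32-bit t1d
jle_taken = lock_signed <= 0                  # compared against immediate 0
target = (next_rip + displacement) & MASK64   # value `wrip` writes if taken

# Assumption: the JLE at +21 is 2 bytes long, so rdip reports offset +23.
function_base = next_rip - 23

print(f"lock word as signed 32-bit: {lock_signed}")
print(f"JLE taken: {jle_taken}")
print(f"branch target: {target:#x} "
      f"(_spin_lock_irqsave+{target - function_base})")
```

  Under those assumptions the lock word reads as -2, the branch is taken,
and the target works out to _spin_lock_irqsave+16, which matches the NOP
that the trace shows next.  In the quoted excerpt both cores poll the same
address (A=0xffffffff80822400) and read the same value each time.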


  The next thing to dig into would be to figure out what the CPUs were
doing just before they entered the spin loop.  For this we may need to
trace a bit earlier in time using the Exec flags, and since it is likely
that messages/responses are getting lost in the memory hierarchy or
devices, we'll need to use the ProtocolTrace flag to see what is being
communicated.  You could try playing around with these as a start.
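
  One way to start on the first step, sketched in Python: scan a saved Exec
trace and report, per CPU, the last symbol executed before the core settled
into the spin loop.  This assumes each trace record sits on a single line in
the form "<tick>: <cpu> T0 : @<symbol>+<offset> : ..." as in the excerpt
quoted later in this thread, and that the loop symbol is _spin_lock_irqsave.
The function name and the regular expression are illustrative and are not
part of gem5.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Matches "<tick>: <cpu name> T<n> : @<symbol>" at the start of a record.
# Assumption: symbol names contain no '+', '.', or whitespace.
EXEC_LINE = re.compile(r"^\s*(\d+): (\S+) T\d+ : @([^+\s.]+)")


def last_symbol_before_spin(
    lines: Iterable[str],
    spin_symbol: str = "_spin_lock_irqsave",
) -> Dict[str, Optional[Tuple[int, str]]]:
    """For each CPU whose trace ends in spin_symbol, return the
    (tick, symbol) of the last record outside that symbol, or None if
    the trace never shows the CPU anywhere else."""
    last_other: Dict[str, Tuple[int, str]] = {}
    current: Dict[str, str] = {}
    for line in lines:
        match = EXEC_LINE.match(line)
        if match is None:
            continue  # skip records without a symbol and non-trace output
        tick, cpu, symbol = int(match.group(1)), match.group(2), match.group(3)
        current[cpu] = symbol
        if symbol != spin_symbol:
            last_other[cpu] = (tick, symbol)
    return {
        cpu: last_other.get(cpu)
        for cpu, symbol in current.items()
        if symbol == spin_symbol
    }
```

  Passing an open trace file to last_symbol_before_spin() returns a dict
from CPU name to (tick, symbol).  A None entry means the trace already
begins inside the loop for that CPU, so the trace would need to start at an
earlier tick.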

  I may also have time to try to reproduce this over the next week, so I'm
hoping you could give me some more information: can you send me your gem5
command line, config.ini, and system.pc.com_1.terminal output from your
simulation, and details on the kernel and disk image that you're trying to
use?


  Thanks!
  Joel




On Sat, May 24, 2014 at 7:27 PM, Ivan Stalev <ids...@psu.edu> wrote:

> Hi,
>
> Has anyone been able to reproduce this issue?
>
> Thanks,
>
> Ivan
>
>
> On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <ids...@psu.edu> wrote:
>
>> Hi Joel,
>>
>> I am using revision 10124. I removed all of my own modifications just to
>> be safe.
>>
>> Running with gem5.opt and restoring from a boot-up checkpoint
>> with --debug-flag=Exec, it appears that the CPU is stuck in some sort of
>> infinite loop, executing this continuously:
>>
>> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0  : CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
>> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1  : CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe A=0xffffffff80822400
>> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2  : CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
>> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0  : JLE_I : rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
>> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1  : JLE_I : limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
>> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2  : JLE_I : wrip   , t1, t2  : IntAlu :
>> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16    :   NOP                      : IntAlu :
>> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0  : CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
>> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1  : CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe A=0xffffffff80822400
>> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2  : CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
>> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0  : JLE_I : rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
>> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1  : JLE_I : limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
>> 5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2  : JLE_I : wrip   , t1, t2  : IntAlu :
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16    :   NOP                      : IntAlu :
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0  : CMP_M_I : limm   t2d, 0  : IntAlu :  D=0x0000000000000000
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1  : CMP_M_I : ld   t1d, DS:[rdi] : MemRead :  D=0x00000000fffffffe A=0xffffffff80822400
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2  : CMP_M_I : sub   t0d, t1d, t2d : IntAlu :  D=0x0000000000000000
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0  : JLE_I : rdip   t1, %ctrl153,  : IntAlu :  D=0xffffffff80596897
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1  : JLE_I : limm   t2, 0xfffffffffffffff9 : IntAlu :  D=0xfffffffffffffff9
>> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2  : JLE_I : wrip   , t1, t2  : IntAlu :
>> 5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16    :   NOP                      : IntAlu :
>>
>> ....and so on repetitively without stopping.
>>
>> Using --debug-flag=LocalApic, the output does indeed stop shortly after
>> restoring from the checkpoint. The last output is:
>> ..
>> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
>> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
>> 5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
>> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
>> 5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
>> 5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
>>
>> ...and no more output from this point on.
>>
>> I appreciate your help tremendously.
>>
>> Ivan
>>
>>
>>
>> On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <jthestn...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>   I believe that the email thread you previously referenced was related
>>> to a bug that we identified and fixed with changeset 9624
>>> <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>.  That
>>> bug was causing interrupts to be dropped in x86 when running with the O3
>>> CPU.  Are you using a version of gem5 after that changeset?  If not, I'd
>>> recommend updating to a more recent version and trying to replicate this
>>> issue again.
>>>
>>>   If you are using a more recent version of gem5, first, please let us
>>> know which changeset and whether you've made any changes.  Then, I'd
>>> recommend compiling gem5.opt and using the DPRINTF tracing functionality to
>>> see if you can zero in on what is happening.  Specifically, first try
>>> passing the flag --debug-flag=Exec to look at what the CPU cores are
>>> executing (you may also want to pass --trace-start=<<tick>> with a
>>> simulator tick time close to when the hang happens).  This trace will
>>> include Linux kernel symbols for at least some of the lines if executing in
>>> the kernel (e.g. handling an interrupt).  If you've compiled your benchmark
>>> without debugging symbols, it may just show the memory addresses of
>>> instructions being executed within the application.  I will guess that
>>> you'll see kernel symbols for at least some of the executed instructions
>>> for interrupts.
>>>
>>>   If it appears that the CPUs are continuing to execute, try running
>>> with --debug-flag=LocalApic.  This will print the interrupts that each core
>>> is receiving, and if it stops printing at any point, it means something has
>>> gone wrong and we'd have to do some deeper digging.
>>>
>>>   Keep us posted on what you find,
>>>   Joel
>>>
>>>
>>>
>>> On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <ids...@psu.edu> wrote:
>>>
>>>> Hi Joel,
>>>>
>>>> I have tried several different kernels and disk images, including the
>>>> default ones provided on the GEM5 website in the x86-system.tar.bz2
>>>> download. I run with these commands:
>>>>
>>>> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
>>>> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
>>>> -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz
>>>> --script=rcs_scripts/run.rcS --max-checkpoints=1
>>>>
>>>> My run.rcS script simply contains:
>>>>
>>>> #!/bin/sh
>>>> /sbin/m5 resetstats
>>>> /sbin/m5 checkpoint
>>>> echo 'booted'
>>>> /extras/run
>>>> /sbin/m5 exit
>>>>
>>>> where "/extras/run" is simply a C program with an infinite loop that
>>>> prints a counter.
>>>>
>>>> I then restore:
>>>>
>>>> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py
>>>> --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp
>>>> -r 1 -n 2 --mem-size=4GB --cpu-type=detailed --cpu-clock=2GHz --caches
>>>> --l2cache --num-l2caches=1 --l1d_size=32kB --l1i_size=32kB --l1d_assoc=4
>>>> --l1i_assoc=4 --l2_size=4MB --l2_assoc=8 --cacheline_size=64
>>>>
>>>> I specified the disk image file in Benchmarks.py. Restoring from the
>>>> same checkpoint and running in atomic mode works fine. I also tried booting
>>>> the system in detailed and letting it run for a while, but once it boots,
>>>> there is no more output. So it seems that checkpointing is not the issue.
>>>> The "run" program is just a dummy, and the same issue also persists when
>>>> running SPEC benchmarks or any other program.
>>>>
>>>> My dummy program is simply:
>>>>
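>>>>     #include <stdio.h>
>>>>
>>>>     int main(void) {
>>>>         int count = 0;
>>>>         printf("**************************** HEYY \n");
>>>>         while (1)  /* never exits; prints an increasing counter */
>>>>             printf("\n %d \n", count++);
>>>>     }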
>>>>
>>>> Letting it run for a while, the only output is exactly this:
>>>>
>>>> booted
>>>> *******
>>>>
>>>> It doesn't even finish printing the first printf.
>>>>
>>>> Another thing to add: In another scenario, I modified the kernel to
>>>> call an m5 pseudo instruction on every context switch, so that GEM5 prints
>>>> a message whenever a context switch occurs. Once again, in atomic mode this
>>>> worked as expected. However, in detailed mode, even this output from GEM5
>>>> itself (a printf inside GEM5) stopped along with the system output in the
>>>> terminal.
>>>>
>>>> Thank you for your help.
>>>>
>>>> Ivan
>>>>
>>>>
>>>> On Thu, May 15, 2014 at 10:51 PM, Joel Hestness <jthestn...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ivan,
>>>>>   Can you please give more detail on what you're running?
>>>>> Specifically, can you give your command line, and which kernel and disk
>>>>> image you're using?  Are you using checkpointing?
>>>>>
>>>>>   Joel
>>>>>
>>>>>
>>>>> On Mon, May 12, 2014 at 10:52 AM, Ivan Stalev via gem5-users <
>>>>> gem5-users@gem5.org> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am running X86 in full system mode. When running just 1 CPU, both
>>>>>> atomic and detailed mode work fine. However, with more than 1 CPU, atomic
>>>>>> works fine, but in detailed mode the system appears to hang shortly after
>>>>>> boot-up. GEM5 doesn't crash, but the system stops producing any output.
>>>>>> Looking at the stats, it appears that instructions are still being
>>>>>> committed, but the actual applications/benchmarks are not making
>>>>>> progress. The issue persists with the latest version of GEM5. I also
>>>>>> tried two different kernel versions and several different disk images.
>>>>>>
>>>>>> I might be experiencing what seems to be the same issue that was
>>>>>> found about a year ago but appears to not have been fixed:
>>>>>> https://www.mail-archive.com/gem5-dev@gem5.org/msg08839.html
>>>>>>
>>>>>> Can anyone reproduce this, or does anyone know of a solution?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> gem5-users mailing list
>>>>>> gem5-users@gem5.org
>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>   Joel Hestness
>>>>>   PhD Student, Computer Architecture
>>>>>   Dept. of Computer Science, University of Wisconsin - Madison
>>>>>   http://pages.cs.wisc.edu/~hestness/
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>   Joel Hestness
>>>   PhD Student, Computer Architecture
>>>   Dept. of Computer Science, University of Wisconsin - Madison
>>>   http://pages.cs.wisc.edu/~hestness/
>>>
>>
>>
>


-- 
  Joel Hestness
  PhD Student, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/

