Hi,

Has anyone been able to reproduce this issue?
Thanks,
Ivan

On Sat, May 17, 2014 at 1:50 AM, Ivan Stalev <ids...@psu.edu> wrote:
> Hi Joel,
>
> I am using revision 10124. I removed all of my own modifications just to
> be safe.
>
> Running with gem5.opt and restoring from a boot-up checkpoint with
> --debug-flag=Exec, it appears that the CPU is stuck in some sort of
> infinite loop, executing this continuously:
>
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus0 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012000: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.0 : CMP_M_I : limm t2d, 0 : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.1 : CMP_M_I : ld t1d, DS:[rdi] : MemRead : D=0x00000000fffffffe A=0xffffffff80822400
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+18.2 : CMP_M_I : sub t0d, t1d, t2d : IntAlu : D=0x0000000000000000
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.0 : JLE_I : rdip t1, %ctrl153, : IntAlu : D=0xffffffff80596897
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.1 : JLE_I : limm t2, 0xfffffffffffffff9 : IntAlu : D=0xfffffffffffffff9
> 5268959012500: system.switch_cpus1 T0 : @_spin_lock_irqsave+21.2 : JLE_I : wrip , t1, t2 : IntAlu :
> 5268959013000: system.switch_cpus1 T0 : @_spin_lock_irqsave+16 : NOP : IntAlu :
>
> ...and so on, repeating without stopping.
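>
> (For anyone reading the trace: these microops are the spin-wait body of
> _spin_lock_irqsave -- both cores keep re-loading the lock word at
> 0xffffffff80822400, which holds 0xfffffffe (-2), and branching back
> while it stays <= 0. A minimal C sketch of that shape follows; it is
> for orientation only, not the actual 2.6.22 kernel source.)
>
>     static void spin_wait(volatile int *slock)
>     {
>         /* The CMP_M_I/JLE_I pair in the trace is this test and the
>          * backward branch; the NOP at +16 is the rep;nop (pause)
>          * inside the wait loop. The loop exits only when another
>          * core stores a positive value to the lock word -- which
>          * apparently never happens in the hung run. */
>         while (*slock <= 0)
>             ;
>     }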
>
> Using --debug-flag=LocalApic, the output does indeed stop shortly after
> restoring from the checkpoint. The last output is:
>
> ..
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Generated regular interrupt fault object.
> 5269570990500: system.cpu1.interrupts: Reported pending regular interrupt.
> 5269570990500: system.cpu1.interrupts: Interrupt 239 sent to core.
> 5269571169000: system.cpu1.interrupts: Writing Local APIC register 5 at offset 0xb0 as 0.
>
> ...and no more output from this point on.
>
> I appreciate your help tremendously.
>
> Ivan
>
>
> On Fri, May 16, 2014 at 11:18 AM, Joel Hestness <jthestn...@gmail.com> wrote:
>
>> Hi Ivan,
>>   I believe that the email thread you previously referenced was related
>> to a bug that we identified and fixed with changeset 9624
>> <http://permalink.gmane.org/gmane.comp.emulators.m5.devel/19326>.
>> That bug was causing interrupts to be dropped in x86 when running with
>> the O3 CPU. Are you using a version of gem5 after that changeset? If
>> not, I'd recommend updating to a more recent version and trying to
>> reproduce this issue again.
>>
>> If you are using a more recent version of gem5, first, please let us
>> know which changeset you're on and whether you've made any changes.
>> Then, I'd recommend compiling gem5.opt and using the DPRINTF tracing
>> functionality to zero in on what is happening. Specifically, try
>> passing --debug-flag=Exec to see what the CPU cores are executing (you
>> may also want to pass --trace-start=<<tick>> with a simulator tick time
>> close to when the hang happens). The trace will include Linux kernel
>> symbols for at least some of the lines when execution is in the kernel
>> (e.g. while handling an interrupt). If your benchmark was compiled
>> without debugging symbols, the trace may only show the memory addresses
>> of the instructions executed within the application. My guess is that
>> you'll see kernel symbols for at least some of the instructions
>> executed for interrupts.
>>
>> If it appears that the CPUs are continuing to execute, try running with
>> --debug-flag=LocalApic. This will print the interrupts that each core
>> receives; if it stops printing at any point, something has gone wrong
>> and we'd have to do some deeper digging.
>>
>> Keep us posted on what you find,
>>   Joel
>>
>>
>> On Thu, May 15, 2014 at 11:16 PM, Ivan Stalev <ids...@psu.edu> wrote:
>>
>>> Hi Joel,
>>>
>>> I have tried several different kernels and disk images, including the
>>> default ones provided on the gem5 website in the x86-system.tar.bz2
>>> download. I run with this command:
>>>
>>> build/X86/gem5.fast -d m5out/test_run configs/example/fs.py \
>>>     --kernel=/home/mdl/ids103/full_system_images/binaries/x86_64-vmlinux-2.6.22.9.smp \
>>>     -n 2 --mem-size=4GB --cpu-type=atomic --cpu-clock=2GHz \
>>>     --script=rcs_scripts/run.rcS --max-checkpoints=1
>>>
>>> My run.rcS script simply contains:
>>>
>>> #!/bin/sh
>>> /sbin/m5 resetstats
>>> /sbin/m5 checkpoint
>>> echo 'booted'
>>> /extras/run
>>> /sbin/m5 exit
>>>
>>> where /extras/run is simply a C program with an infinite loop that
>>> prints a counter.
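>>>
>>> (Side note: the same pseudo-ops that the /sbin/m5 utility issues can
>>> also be invoked directly from C by linking against the m5op code in
>>> gem5's util/m5 directory -- handy if you want the checkpoint taken
>>> from inside the benchmark itself. A sketch, assuming the m5op.h
>>> interface; exact names and signatures may differ between revisions.
>>> The two arguments are a delay and a period in ns, with 0 meaning
>>> "now, once".)
>>>
>>>     #include <stdio.h>
>>>     #include "m5op.h"           /* from gem5's util/m5 */
>>>
>>>     int main(void)
>>>     {
>>>         m5_reset_stats(0, 0);   /* like /sbin/m5 resetstats */
>>>         m5_checkpoint(0, 0);    /* like /sbin/m5 checkpoint */
>>>         printf("booted\n");
>>>         /* ... run the workload here ... */
>>>         m5_exit(0);             /* like /sbin/m5 exit */
>>>         return 0;
>>>     }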
>>> The "run" program is just a dummy, and the same issue also persists when >>> running SPEC benchmarks or any other program. >>> >>> My dummy program is simply: >>> >>> int count=0; >>> printf("**************************** HEYY \n"); >>> while(1) >>> printf("\n %d \n", count++); >>> >>> Letting it run for a while, the only output is exactly this: >>> >>> booted >>> ******* >>> >>> It doesn't even finish printing the first printf. >>> >>> Another thing to add: In another scenario, I modified the kernel to call >>> an m5 pseudo instruction on every context switch, and then GEM5 prints that >>> a context switch occurred. Once again, in atomic mode this worked as >>> expected. However, in detailed, even the GEM5 (printf inside GEM5 itself) >>> output stopped along with the system output in the terminal. >>> >>> Thank you for your help. >>> >>> Ivan >>> >>> >>> On Thu, May 15, 2014 at 10:51 PM, Joel Hestness <jthestn...@gmail.com>wrote: >>> >>>> Hi Ivan, >>>> Can you please give more detail on what you're running? >>>> Specifically, can you give your command line, and which kernel, disk image >>>> you're using? Are you using checkpointing? >>>> >>>> Joel >>>> >>>> >>>> On Mon, May 12, 2014 at 10:52 AM, Ivan Stalev via gem5-users < >>>> gem5-users@gem5.org> wrote: >>>> >>>>> Hello, >>>>> >>>>> I am running X86 in full system mode. When running just 1 CPU, both >>>>> atomic and detailed mode work fine. However, with more than 1 CPU, atomic >>>>> works fine, but in detailed mode the system appears to hang shortly after >>>>> boot-up. GEM5 doesn't crash, but the system stops having any output. >>>>> Looking at the stats, it appears that instructions are still being >>>>> committed, but the actual applications/benchmarks are not making progress. >>>>> The issue persists with the latest version of GEM5. I also tried two >>>>> different kernel versions and several different disk images. >>>>> >>>>> I might be experiencing what seems to be the same issue that was found >>>>> about a year ago but appears to not have been fixed: >>>>> https://www.mail-archive.com/gem5-dev@gem5.org/msg08839.html >>>>> >>>>> Can anyone reproduce this or know of a solution? >>>>> >>>>> Thank you, >>>>> >>>>> Ivan >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gem5-users mailing list >>>>> gem5-users@gem5.org >>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users >>>>> >>>> >>>> >>>> >>>> -- >>>> Joel Hestness >>>> PhD Student, Computer Architecture >>>> Dept. of Computer Science, University of Wisconsin - Madison >>>> http://pages.cs.wisc.edu/~hestness/ >>>> >>> >>> >> >> >> -- >> Joel Hestness >> PhD Student, Computer Architecture >> Dept. of Computer Science, University of Wisconsin - Madison >> http://pages.cs.wisc.edu/~hestness/ >> > >