Re: [gem5-users] Help with debugging a kernel panic

Pritha Ghoshal Sat, 23 Jun 2012 17:26:21 -0700

Hi,

I have been running with the Exec debug flag and have the following
incongruities:


1. For the strcmp issue, this is the case:
3514789948500: system.cpu2 T0 : @strcmp    : ldq_u      r3,0(r16)       :
MemRead :  A=0x8982d70
Tracing back r16:
3514789947500: system.cpu2 T0 : @e1000_clean_rx_irq+400    : bis        r31,
r10,r16     : IntAlu :  D=0x0000000008982d70
Tracing back r10:
3514789947500: system.cpu2 T0 : @e1000_clean_rx_irq+384    : ldq        r10,
32(r9)      : MemRead :  D=0x0000000008982d70 A=0xfffffc003c34c2e0

Now tracing back this address I have:
3452373647750: system.cpu2 T0 : @__netdev_alloc_skb+56    : stq
 r9,32(r0)       : MemWrite :  D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0
3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384    : ldq
 r10,32(r9)      : MemRead :  D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0
3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384    : ldq
 r10,32(r9)      : MemRead :  D=0x0000000008982d70 A= 0xfffffc003c34c2e0

So although the expected value was stored to and read back from this address
up to tick 3484029521750, a second read logged at that same tick returns
0x0000000008982d70. Somehow the value at the address changed between those
two reads.
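
A minimal sketch of how this write/read comparison could be scripted over a
full Exec trace. It assumes one record per line in the format quoted above
(wrapped lines re-joined); the function name find_mismatched_reads and the
regular expression are illustrative, not part of gem5. It only sees stores
that appear in the trace, so a write made by a device (DMA) or by a CPU that
is not being traced would show up here as an unexplained change.

```python
import re

# Matches memory records in the Exec trace format quoted in this email.
RECORD = re.compile(
    r"^(?P<tick>\d+): (?P<cpu>\S+) T\d+ : @(?P<sym>\S+)\s*: .*?"
    r": (?P<kind>MemRead|MemWrite) :\s*"
    r"(?:D=(?P<data>0x[0-9a-fA-F]+))?\s*A=\s*(?P<addr>0x[0-9a-fA-F]+)"
)


def find_mismatched_reads(lines, addr):
    """Return (tick, symbol, last_written, observed) for every read of addr
    whose data differs from the most recent traced write to addr."""
    last_written = None
    mismatches = []
    for line in lines:
        m = RECORD.search(line.strip())
        if not m or m["data"] is None or int(m["addr"], 16) != addr:
            continue
        value = int(m["data"], 16)
        if m["kind"] == "MemWrite":
            last_written = value
        elif last_written is not None and value != last_written:
            mismatches.append((int(m["tick"]), m["sym"], last_written, value))
    return mismatches


# The three records from the trace above, with the line wrapping undone.
sample = [
    "3452373647750: system.cpu2 T0 : @__netdev_alloc_skb+56    : stq r9,32(r0)       : MemWrite :  D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0",
    "3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384    : ldq r10,32(r9)      : MemRead :  D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0",
    "3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384    : ldq r10,32(r9)      : MemRead :  D=0x0000000008982d70 A= 0xfffffc003c34c2e0",
]

for tick, sym, expected, observed in find_mismatched_reads(sample, 0xFFFFFC003C34C2E0):
    print(f"{tick}: {sym} read {observed:#x}, last traced write was {expected:#x}")
```

Run over the whole trace file instead of the sample, this would list every
point where the address returns something other than what was last stored.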

2. I have another, somewhat similar oops at a different location. In the
ip_sabotage_in function, the code takes a different path than in the
normal case. In the normal case:
a) This loop stores the value at the address:
3370440886750: system.cpu2 T0 : @loop    : stq        r17,0(r5)       :
MemWrite :  D=0x0000000000000000 A=0xfffffc003cb22948
3370440886750: system.cpu2 T0 : @loop+4    : subq       r3,1,r3         :
IntAlu :  D=0x0000000000000003
3370440886750: system.cpu2 T0 : @loop+8    : addq       r5,8,r5         :
IntAlu :  D=0xfffffc003cb22950
3370440886750: system.cpu2 T0 : @loop+12    : bne        r3,loop         :
IntAlu :
b) This is where the value is read:
3483182609000: system.cpu2 T0 : @ip_sabotage_in    : ldq        r1,136(r17)
    : MemRead :  D=0x0000000000000000 A=0xfffffc003cb22948
3483182609000: system.cpu2 T0 : @ip_sabotage_in+4    : beq
 r1,0xfffffc0000680950 : IntAlu :
3483182623000: system.cpu2 T0 : @ip_sabotage_in+32    : lda
 r0,1(r31)       : IntAlu :  D=0x0000000000000001
3483182623000: system.cpu2 T0 : @ip_sabotage_in+36    : ret        (r26)
        : IntAlu :
So after ip_sabotage_in+4 the code branches to ip_sabotage_in+32.

But in the faulty case, this is what happens:
a) The loop storing the value is the same:
3370447153000: system.cpu2 T0 : @loop    : stq        r17,0(r5)       :
MemWrite :  D=0x0000000000000000 A=0xfffffc003cb22248
3370447153000: system.cpu2 T0 : @loop+4    : subq       r3,1,r3         :
IntAlu :  D=0x0000000000000003
3370447153000: system.cpu2 T0 : @loop+8    : addq       r5,8,r5         :
IntAlu :  D=0xfffffc003cb22250
3370447153000: system.cpu2 T0 : @loop+12    : bne        r3,loop         :
IntAlu :
b) The load reading the value returns wrong data:
3483490680000: system.cpu2 T0 : @ip_sabotage_in    : ldq        r1,136(r17)
    : MemRead :  D=0x0000000000000090 A=0xfffffc003cb22248
3483490680000: system.cpu2 T0 : @ip_sabotage_in+4    : beq
 r1,0xfffffc0000680950 : IntAlu :
3483490705250: system.cpu2 T0 : @ip_sabotage_in+8    : ldl        r1,24(r1)
      : MemRead :  A=0xa8
It falls through to ip_sabotage_in+8 and ends up in an oops because the
memory address cannot be accessed.
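
The faulting address follows directly from the bad load. A small sketch of
the first three instructions of ip_sabotage_in as they appear in the trace,
assuming the usual Alpha semantics (beq branches when the register is zero,
and ldl r1,24(r1) reads from r1 + 24); the function name is made up for
illustration:

```python
def ip_sabotage_in_prefix(loaded_value):
    """Model ip_sabotage_in+0 .. +8 as seen in the trace.

    Returns ("branch", None) when the beq is taken, otherwise
    ("load", effective_address) for the ldl at ip_sabotage_in+8.
    """
    r1 = loaded_value            # ldq r1,136(r17)
    if r1 == 0:                  # beq r1,... : taken when r1 is zero
        return ("branch", None)  # execution continues at ip_sabotage_in+32
    return ("load", r1 + 24)     # ldl r1,24(r1) : address is r1 + 24


normal = ip_sabotage_in_prefix(0x0)    # value read in the normal case
faulty = ip_sabotage_in_prefix(0x90)   # value read in the faulty case

print(normal)
print(faulty[0], hex(faulty[1]))
```

With the value 0x90 the load address comes out as 0x90 + 24 = 0xa8, which
matches the A=0xa8 in the trace, so the oops is a consequence of the wrong
data at 0xfffffc003cb22248 rather than a separate problem.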

Could you suggest how I could debug this further, and how I could trace
this back to its cause? I have loaded a kernel module while running the
experiment, which spawns 3 threads that run on cpu1, cpu2 and cpu3
respectively. Is the error somehow linked to that module's code?
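
For the memory-system check suggested earlier in this thread (an explicit,
cache-block-aligned address check), the block addresses to watch can be
computed as below. This is only a sketch: the 64-byte block size is an
assumption about the simulated configuration, and block_align is a
hypothetical helper name.

```python
def block_align(addr, block_size=64):
    """Return the base address of the cache block containing addr."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    return addr & ~(block_size - 1)


# Addresses taken from the traces above.
WATCHED = {
    "read at e1000_clean_rx_irq+384": 0xFFFFFC003C34C2E0,
    "read at ip_sabotage_in (faulty case)": 0xFFFFFC003CB22248,
}

for label, addr in WATCHED.items():
    print(f"{label}: {addr:#x} is in block {block_align(addr):#x}")
```

Any packet whose block-aligned address equals one of these values touches
the data in question, whichever component sends it.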

Thanks,
Pritha




On Fri, Jun 22, 2012 at 11:18 AM, Pritha Ghoshal <pritha9...@neo.tamu.edu> wrote:

> Hi Ali,
>
> I think the problem is somewhere in the compiler. I removed a printf
> statement from one location and rebuilt the kernel; that particular error
> seemed to go away and the simulation ran for much longer before it
> got killed with "unhandled unaligned exception".
>
> So I was trying to enable CONFIG_DEBUG_SLAB and run it again, but there
> seems to be some problem with this debug mode and the slab code. It
> occurs only when running with more than one core. The following BUG_ON
> stops the simulation:
>
> #if DEBUG
> static void check_irq_off(void)
> {
>     BUG_ON(!irqs_disabled());
> }
>
> Do I have to set up something more when enabling CONFIG_DEBUG_SLAB
> mode?
>
> Thanks,
> Pritha
>
>
> On Wed, Jun 20, 2012 at 9:42 PM, Ali Saidi <sa...@umich.edu> wrote:
>
>> Hi Pritha,
>>
>>
>>
>> I seem to be missing something... so the ldq_u t2, 0(a0) is loading a
>> bogus address. Did the a0 address get stored by the first stq?  If so, what
>> value was stored by the stq? (You can use the debug-flags to figure that
>> out). Was it the right value? There are still a couple of possible issues
>> here: 1) kernel bug 2) compiler bug 3) gem5 bug. You need to trace the
>> source of the value back as far as possible using a combination of what
>> you've done and the exec debug flag. If the value was stored and later read
>> and isn't the same, something has likely gone wrong with gem5.
>> Unfortunately, it's also possible there is a bug with the compiler or
>> kernel.
>>
>>
>>
>> Ali
>>
>>
>>
>> On 20.06.2012 18:01, Pritha Ghoshal wrote:
>>
>> Hi Ali,
>>
>> I have a different panic now (not sure about the old one, that is also
>> there). I had modified e1000_clean_rx_irq function to check for the
>> skb_dev_name and match it with eth0 and process the packet only if it
>> matched, just for a check.. The panic comes in the first line of code in
>> strcmp:
>>  fffffc00004e8ff0:
>> fffffc00004e8ff0:       00 00 70 2c     ldq_u   t2,0(a0)
>> This is because a0($16) holds 000032b25000ada8 which is not a valid
>> address. I tried to trace back when a0 was last loaded:
>>          stq $16,168($30)         # adapter, adapter
>>
>>         stq $18,176($30)         # work_done, work_done
>>
>>         stq $19,184($30)         # work_to_do, work_to_do
>>
>> This is at the beginning of the function e1000_clean_rx_irq :
>> static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
>>                    struct e1000_rx_ring *rx_ring,
>>                     int *work_done, int work_to_do)
>> I am not sure how to fix this.. Is there a problem during compiling, a0
>> should have been loaded but it is not? I followed the instructions in this
>> site to match the assembly code and c code :
>> http://kerneltrap.org/node/3648
>> I added the assembly comments after each line to trace the flow of the
>> code and made sure I went through all the parts of the code till before the
>> strcmp call to check if a0 is loaded.. Do you have any suggestion about
>> what I can do next?
>>
>> Thanks,
>> Pritha
>>
>> On Tue, Jun 19, 2012 at 7:03 PM, Pritha Ghoshal <pritha9...@neo.tamu.edu> wrote:
>>
>>> I was able to use 1 core with the remote gdb.. With the 4 cores though,
>>> even after connecting remote gdb-s to each of the cores, I get the same
>>> output even after a kernel panic:
>>> (gdb) c
>>> Continuing.
>>> Watchdog has expired.  Target detached.
>>> I am not able to get a backtrace on any of the connected gdb-s..
>>> Pritha
>>>
>>> On Tue, Jun 19, 2012 at 2:38 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>
>>>> I think I missed that post, but you might need to connect 4 instances
>>>> of gdb to the four cpus. This doesn't happen with 1, 2 or 3 cores?
>>>>
>>>>
>>>>
>>>> You can go to every cache and add code to the inbound port or dram port
>>>> that has an explicit check on that address in the packet (cache block
>>>> aligned). Every time it sees a read or write you should print out the fact
>>>> that the write happened and at some point hopefully you'll find the bad
>>>> piece of data.
>>>>
>>>>
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>> On 19.06.2012 14:31, Pritha Ghoshal wrote:
>>>>
>>>> Hi Ali,
>>>>
>>>> I am having some trouble using gdb on a 4 core machine (I had
>>>> posted a previous mail to the group about that). I'll try it out once
>>>> more and see.
>>>>
>>>> How could I add the memory checks?
>>>>
>>>> Thanks,
>>>> Pritha
>>>>
>>>> On Tue, Jun 19, 2012 at 2:02 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 19.06.2012 13:06, Pritha Ghoshal wrote:
>>>>>
>>>>>  Hi,
>>>>> I am getting a kernel panic which I am not able to debug. The pc
>>>>> itself is getting polluted.. I have added the trace of the panic at the
>>>>> end of the email.
>>>>> This is a snippet from the object dump of the kernel code.
>>>>> fffffc00005d51e8:       00 00 69 a7     ldq     t12,0(s0)
>>>>> fffffc00005d51ec:       00 40 5b 6b     jsr     ra,(t12),fffffc00005d51f0
>>>>> fffffc00005d51f0:       2a 00 ba 27     ldah    gp,42(ra)
>>>>> The panic is when ra = fffffc00005d51f0. Therefore the jsr should have
>>>>> jumped to the address in t12 which is 0000000002969588. t12 gets loaded
>>>>> from s0 in the previous step. I was unable to trace back the memory
>>>>> address content, is there a way to do it? The last function in the trace
>>>>> is given in the following link:
>>>>>
>>>>> http://lxr.free-electrons.com/source/net/core/neighbour.c?v=2.6.28#L1187
>>>>> Could someone suggest how I go about debugging this kernel panic?
>>>>> Thanks in advance..
>>>>> Thanks,
>>>>> Pritha
>>>>>
>>>>> You'll need to either use the gdb support in gem5 or maybe put some
>>>>> checks in the memory system for that specific address and print as it gets
>>>>> changed.
>>>>> Ali
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> gem5-users mailing list
>>>>> gem5-users@gem5.org
>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users

