Hi,

I have been running with the Exec debug flag and found the following inconsistencies:
1. For the strcmp issue, this is the case:

3514789948500: system.cpu2 T0 : @strcmp : ldq_u r3,0(r16) : MemRead : A=0x8982d70

Tracing back r16:

3514789947500: system.cpu2 T0 : @e1000_clean_rx_irq+400 : bis r31, r10,r16 : IntAlu : D=0x0000000008982d70

Tracing back r10:

3514789947500: system.cpu2 T0 : @e1000_clean_rx_irq+384 : ldq r10, 32(r9) : MemRead : D=0x0000000008982d70 A=0xfffffc003c34c2e0

Now tracing back this address I have:

3452373647750: system.cpu2 T0 : @__netdev_alloc_skb+56 : stq r9,32(r0) : MemWrite : D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0
3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384 : ldq r10,32(r9) : MemRead : D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0
3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384 : ldq r10,32(r9) : MemRead : D=0x0000000008982d70 A= 0xfffffc003c34c2e0

So although a different value was stored to and read from this address up to 3484029521750, somehow between 3484029521750 and 3484029521750 the value at the address changed to 0x0000000008982d70.

2. I have another oops at a different location, somewhat similar. In the ip_sabotage_in function, the code proceeds down a different path than in the normal case.
In the normal case:

a) This loop stores the value at the address:

3370440886750: system.cpu2 T0 : @loop : stq r17,0(r5) : MemWrite : D=0x0000000000000000 A=0xfffffc003cb22948
3370440886750: system.cpu2 T0 : @loop+4 : subq r3,1,r3 : IntAlu : D=0x0000000000000003
3370440886750: system.cpu2 T0 : @loop+8 : addq r5,8,r5 : IntAlu : D=0xfffffc003cb22950
3370440886750: system.cpu2 T0 : @loop+12 : bne r3,loop : IntAlu :

b) This is where the value is read:

3483182609000: system.cpu2 T0 : @ip_sabotage_in : ldq r1,136(r17) : MemRead : D=0x0000000000000000 A=0xfffffc003cb22948
3483182609000: system.cpu2 T0 : @ip_sabotage_in+4 : beq r1,0xfffffc0000680950 : IntAlu :
3483182623000: system.cpu2 T0 : @ip_sabotage_in+32 : lda r0,1(r31) : IntAlu : D=0x0000000000000001
3483182623000: system.cpu2 T0 : @ip_sabotage_in+36 : ret (r26) : IntAlu :

So after ip_sabotage_in+4 the code jumps to ip_sabotage_in+32.

But in the faulty case, this is what happens:

a) The loop storing the value is the same:

3370447153000: system.cpu2 T0 : @loop : stq r17,0(r5) : MemWrite : D=0x0000000000000000 A=0xfffffc003cb22248
3370447153000: system.cpu2 T0 : @loop+4 : subq r3,1,r3 : IntAlu : D=0x0000000000000003
3370447153000: system.cpu2 T0 : @loop+8 : addq r5,8,r5 : IntAlu : D=0xfffffc003cb22250
3370447153000: system.cpu2 T0 : @loop+12 : bne r3,loop : IntAlu :

b) The code reading the value reads wrong data:

3483490680000: system.cpu2 T0 : @ip_sabotage_in : ldq r1,136(r17) : MemRead : D=0x0000000000000090 A=0xfffffc003cb22248
3483490680000: system.cpu2 T0 : @ip_sabotage_in+4 : beq r1,0xfffffc0000680950 : IntAlu :
3483490705250: system.cpu2 T0 : @ip_sabotage_in+8 : ldl r1,24(r1) : MemRead : A=0xa8

It falls through to ip_sabotage_in+8 and ends up in an oops, as the memory address cannot be accessed.

Could you suggest how I could debug this further? And could I trace this back to find the cause?
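The manual tracing above (find every access to one address, then compare each read with the last store) could probably be automated. What follows is only a minimal sketch in Python, assuming the Exec trace lines look exactly like the ones pasted above. The names TRACE_RE, accesses_to, find_mismatches and SAMPLE are hypothetical helpers written for illustration, not part of gem5.

```python
import re

# Assumed line format (taken from the trace excerpts above):
# <tick>: <cpu> T<n> : @<symbol+offset> : <disassembly> : MemRead|MemWrite : [D=0x..] [A=0x..]
TRACE_RE = re.compile(
    r"^(?P<tick>\d+):\s*(?P<cpu>\S+)\s+T\d+\s*:\s*@(?P<pc>\S+)\s*:"
    r".*?:\s*(?P<kind>MemRead|MemWrite)\s*:"
    r"(?:\s*D=(?P<data>0x[0-9a-fA-F]+))?"
    r"(?:\s*A=\s*(?P<addr>0x[0-9a-fA-F]+))?"
)


def accesses_to(lines, address):
    """Yield (tick, cpu, pc, kind, data) for each traced access to `address`."""
    for line in lines:
        m = TRACE_RE.match(line.strip())
        if m is None or m.group("addr") is None:
            continue  # not a memory access, or no address printed
        if int(m.group("addr"), 16) != address:
            continue
        data = m.group("data")
        yield (int(m.group("tick")), m.group("cpu"), m.group("pc"),
               m.group("kind"), None if data is None else int(data, 16))


def find_mismatches(lines, address):
    """Return reads whose data differs from the last traced write to `address`."""
    last_write = None
    mismatches = []
    for tick, cpu, pc, kind, data in accesses_to(lines, address):
        if kind == "MemWrite":
            last_write = data
        elif last_write is not None and data is not None and data != last_write:
            mismatches.append((tick, cpu, pc, data, last_write))
    return mismatches


# The three accesses to 0xfffffc003c34c2e0 quoted in point 1.
SAMPLE = [
    "3452373647750: system.cpu2 T0 : @__netdev_alloc_skb+56 : stq r9,32(r0) : MemWrite : D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0",
    "3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384 : ldq r10,32(r9) : MemRead : D=0xfffffc003f1a1000 A=0xfffffc003c34c2e0",
    "3484029521750: system.cpu2 T0 : @e1000_clean_rx_irq+384 : ldq r10,32(r9) : MemRead : D=0x0000000008982d70 A= 0xfffffc003c34c2e0",
]

if __name__ == "__main__":
    for tick, cpu, pc, got, expected in find_mismatches(SAMPLE, 0xfffffc003c34c2e0):
        print(f"{tick}: {cpu} @{pc} read {got:#x}, last traced write was {expected:#x}")
```

For a real run, the list could be replaced by an open trace file, since the functions only iterate over lines. A script like this sees only what the Exec trace prints, so a write that does not appear there (for example, one not made by a CPU instruction) would show up only indirectly, as a read that disagrees with the last traced store.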
I have loaded a kernel module while running the experiment, which spawns 3 threads that run on cpu1, cpu2 and cpu3 respectively. Is the error somehow linked to this code?

Thanks,
Pritha

On Fri, Jun 22, 2012 at 11:18 AM, Pritha Ghoshal <pritha9...@neo.tamu.edu> wrote:

> Hi Ali,
>
> I think the problem is somewhere in the compiler.. I removed a printf
> statement from a location and rebuilt the kernel, that particular error
> seemed to go off and the simulation ran for a much longer time, before it
> got killed with "unhandled unaligned exception".
>
> So I was trying to enable CONFIG_DEBUG_SLAB and run it again.. But there
> seems to be some problem with this debug mode and the slab code.. This
> comes only when it runs with more than one core.. The following bug stops
> the simulation..
>
> #if DEBUG
> static void check_irq_off(void)
> {
>         BUG_ON(!irqs_disabled());
> }
>
> Do I have to setup something more while enabling the config_debug_slab
> mode?
>
> Thanks,
> Pritha
>
> On Wed, Jun 20, 2012 at 9:42 PM, Ali Saidi <sa...@umich.edu> wrote:
>
>> Hi Pritha,
>>
>> I seem to be missing something... so the ldq_u t2, 0(a0) is loading a
>> bogus address. Did the a0 address get stored by the first stq? If so, what
>> value was stored by the stq? (You can use the debug-flags to figure that
>> out). Was it the right value? There are still a couple of possible issues
>> here: 1) kernel bug 2) compiler bug 3) gem5 bug. You need to trace the
>> source of the value back as far as possible using a combination of what
>> you've done and the exec debug flag. If the value was stored and later read
>> and isn't the same, something has likely gone wrong with gem5.
>> Unfortunately, it's also possible there is a bug with the compiler or
>> kernel.
>>
>> Ali
>>
>> On 20.06.2012 18:01, Pritha Ghoshal wrote:
>>
>> Hi Ali,
>>
>> I have a different panic now(not sure about the old one, that is also
>> there).
>> I had modified e1000_clean_rx_irq function to check for the
>> skb_dev_name and match it with eth0 and process the packet only if it
>> matched, just for a check.. The panic comes in the first line of code in
>> strcmp:
>> fffffc00004e8ff0:
>> fffffc00004e8ff0: 00 00 70 2c ldq_u t2,0(a0)
>> This is because a0($16) holds 000032b25000ada8 which is not a valid
>> address. I tried to trace back when a0 was last loaded:
>> stq $16,168($30) # adapter, adapter
>>
>> stq $18,176($30) # work_done, work_done
>>
>> stq $19,184($30) # work_to_do, work_to_do
>>
>> This is at the beginning of the function e1000_clean_rx_irq :
>> static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
>>                                struct e1000_rx_ring *rx_ring,
>>                                int *work_done, int work_to_do)
>> I am not sure how to fix this.. Is there a problem during compiling, a0
>> should have been loaded but it is not? I followed the instructions in this
>> site to match the assembly code and c code :
>> http://kerneltrap.org/node/3648
>> I added the assembly comments after each line to trace the flow of the
>> code and made sure I went through all the parts of the code till before the
>> strcmp call to check if a0 is loaded.. Do you have any suggestion about
>> what I can do next?
>>
>> Thanks,
>> Pritha
>>
>> On Tue, Jun 19, 2012 at 7:03 PM, Pritha Ghoshal
>> <pritha9...@neo.tamu.edu> wrote:
>>
>>> I was able to use 1 core with the remote gdb.. With the 4 cores though,
>>> even after connecting remote gdb-s to each of the cores, I get the same
>>> output even after a kernel panic:
>>> (gdb) c
>>> Continuing.
>>> Watchdog has expired. Target detached.
>>> I am not able to get a backtrace on any of the connected gdb-s..
>>> Pritha
>>>
>>> On Tue, Jun 19, 2012 at 2:38 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>
>>>> I think i missed that post, but you might need to connect 4 instances
>>>> of gdb to the four cpus. This doesn't happen with 1, 2 or 3 cores?
>>>>
>>>> You can go to every cache and add code to the inbound port or dram port
>>>> that has an explicit check on that address in the packet (cache block
>>>> aligned). Every time it sees a read or write you should print out the fact
>>>> that the write happened and at some point hopefully you'll find the bad
>>>> piece of data.
>>>>
>>>> Ali
>>>>
>>>> On 19.06.2012 14:31, Pritha Ghoshal wrote:
>>>>
>>>> Hi Ali,
>>>>
>>>> I am having some troubles using the gdb on a 4 core machine (I had
>>>> posted a previous mail to the group about that), I'll try it out once more
>>>> and see..
>>>>
>>>> How could I add the memory checks?
>>>>
>>>> Thanks,
>>>> Pritha
>>>>
>>>> On Tue, Jun 19, 2012 at 2:02 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>>
>>>>> On 19.06.2012 13:06, Pritha Ghoshal wrote:
>>>>>
>>>>> Hi,
>>>>> I am getting a kernel panic which I am not able to debug. The pc
>>>>> itself is getting polluted.. I have added the trace of the panic at the
>>>>> end of the email.
>>>>> This is a snippet from the object dump of the kernel code.
>>>>> fffffc00005d51e8: 00 00 69 a7 ldq t12,0(s0)
>>>>> fffffc00005d51ec: 00 40 5b 6b jsr ra,(t12),fffffc00005d51f0
>>>>> fffffc00005d51f0: 2a 00 ba 27 ldah gp,42(ra)
>>>>> The panic is when ra = fffffc00005d51f0. Therefore the jsr should have
>>>>> jumped to the address in t12 which is 0000000002969588. t12 gets loaded
>>>>> from s0 in the previous step. I was unable to trace back the memory
>>>>> address content, is there a way to do it? The last function in the trace
>>>>> is given in the following link:
>>>>>
>>>>> http://lxr.free-electrons.com/source/net/core/neighbour.c?v=2.6.28#L1187
>>>>> Could someone suggest how I go about debugging this kernel panic?
>>>>> Thanks in advance..
>>>>> Thanks,
>>>>> Pritha
>>>>>
>>>>> You'll need to either use the gdb support in gem5 or maybe put some
>>>>> checks in the memory system for that specific address and print as it gets
>>>>> changed.
>>>>> Ali
>>>>>
>>>>> _______________________________________________
>>>>> gem5-users mailing list
>>>>> gem5-users@gem5.org
>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users