Hello all, I could do some more analysis of this issue and I think I have identified what could be wrong here.
During pppoe session creation time frame, as we mentioned earlier, the below function will get invoked from pppoe_update_adj() adj_nbr_midchain_update_rewrite(adj_index_t adj_index,adj_midchain_fixup_t fixup,const void *fixup_data,...) Note the 3rd arg 'fixup_data', here we are passing the current session address as fixup_data When subsequent sessions are added this memory address will get altered as we are resizing the vector pool pem->sessions. Hence the above stored fixup_data (adj->sub_type.midchain.fixup_data) address is no longer valid, and that address could have been already freed. I think that is the reason why we see memory corruption. It would be great if some one with far better knowledge of VPP internals could take a look at this and confirm this. Thanks and Regards, Raj On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun <hongjun...@intel.com> wrote: > > Hi Raj, > > I tried to reproduce your issue on VPP 20.01 as per your steps for some > times, but cannot reproduce it. > > From your description, please set a breakpoint in > vnet_pppoe_add_del_session(). > And to see what happened when you create your second pppoe session with > traffic in first pppoe session. > > Thanks, > Hongjun > > -----Original Message----- > From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj > Sent: Monday, September 30, 2019 11:54 PM > To: vpp-dev <vpp-dev@lists.fd.io> > Subject: Re: [vpp-dev] VPP core dump with PPPoE > > Hello all, > > I did some more debugging to find out when and where exactly the > pppoe_session_t get corrupted. Added couple of log entries as shown below to > log pppoe session id when a session is created as well as when packets from > north traverses to south. I have tried this in VPP 19.08, 19.04 and 19.01 > with same results. > > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923 vnet > [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666 > vnet [21892]: pppoe_fixup:169: New_Packet pppoe01 session id 35666 > vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191 > > The sequence when corruption happens seems to be: > > 1. A new session is created > 2. A packet for the newly created session traverses from north to south 3. > Next session is created - and vpp crashes. > > Digging deeper, I added watch for all newly created sessions using the > following gdb script > > b /root/build-1901/src/vnet/ip/ip4_forward.c:2444 > commands 1 > watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id > bt > continue > end > > gdb, running with this script, bails out with following message: > > Thread 1 "vpp_main" hit Hardware watchpoint 2: -location > ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id > > Old value = 35666 > New value = 4883 > __memset_avx2_unaligned_erms () at > ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203 > 203 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No > such file or directory. > (gdb) > > It is interesting to note that 4883 is 0x1313 > > Back trace shows the path it took to reach here: > > > (gdb) bt > #0 __memset_avx2_unaligned_erms () at > ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203 > #1 0x00007ffff61a4179 in mspace_put (msp=0x7fffb4df7010, > p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294 > #2 0x00007ffff618ea39 in clib_mem_free (p=0x7fffb592d9c8) at > /root/build-1901/src/vppinfra/mem.h:215 > #3 0x00007ffff618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, > length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at > /root/build-1901/src/vppinfra/vec.c:96 > #4 0x00007fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, > length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at > /root/build-1901/src/vppinfra/vec.h:147 > #5 0x00007fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, > sw_if_indexp=0x7fffb67038e8) at > /root/build-1901/src/plugins/pppoe/pppoe.c:335 > #6 0x00007fffb0aaadec in pppoe_add_del_session_command_fn > (vm=0x7ffff68e3400 <vlib_global_main>, input=0x7fffb6703ee0, > cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554 > #7 0x00007ffff6617db0 in vlib_cli_dispatch_sub_commands > (vm=0x7ffff68e3400 <vlib_global_main>, cm=0x7ffff68e3600 > <vlib_global_main+512>, input=0x7fffb6703ee0, parent_command_index=21) at > /root/build-1901/src/vlib/cli.c:644 > > This do not occur if traffic is not initiated, and there is no packet flow > through the system. It would be great if some one who understands this code > to confirm if my analysis is correct and give some pointers to figure out > > 1. Why, when a new session is created, the data of an old session is changed > to 0x1313 2. What debugging steps can I take next to figure out why this > happens. > > > > Thanks and Regards, > > > Raj > > On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io > <rajlistuser=gmail....@lists.fd.io> wrote: > > > > Hello all, > > > > I have done some more tests to pinpoint the exact condition of the > > crash. What I could figure out was that the crash happens when memory > > is being allocated for pppoe_session_t while packets are flowing > > through pppoe interface. > > > > Here is what I did to arrive at this conclusion: > > > > 1. Configure VPP without any default route (to ensure packets do not > > hit north interface from south) 2. Provision 100 PPPoE clients - No > > crash observed 3. Deprovision all 100 PPPoE clients 4. Configure > > default route 5. Provision 100 PPPoE clients again, and start a ping > > to an external IP from each client - No Crash observed 6. Provision 50 > > more PPPoE clients - VPP crashes. > > > > Based on this test, and from what I could understand from the code, my > > guess is that there is some memory corruption happening inside the > > pppoe_session_t when memory is being allocated for it when there is > > packets traversing through PPPoE interface. > > > > Thanks and Regards, > > > > Raj > > > > > > On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io > > <rajlistuser=gmail....@lists.fd.io> wrote: > > > > > > Hello all, > > > > > > I am observing a VPP crash when approximately 20 - 50 PPPoE clients > > > are connecting and traffic is flowing through them. This crash was > > > reproducible every time I tried. > > > > > > I did some debugging and here is what I could find out so far: > > > > > > If I understand correctly, when a incoming packet from north side is > > > being sent to PPPoE interface, pppoe_fixup() is called to update > > > pppoe0->length, and t->encap_if_index. Length and encap_if_index is > > > taken from adj0->sub_type.midchain.fixup_data > > > > > > My observation is that while clients are connecting and traffic is > > > flowing for connected clients, adj0->sub_type.midchain.fixup_data > > > appears to hold incorrect data, at some point in time, during the > > > test. What we have seen is the incorrect data > > > (adj0->sub_type.midchain.fixup_data) is observed for clients which > > > are already provisioned for some time and which had packets flowing > > > through them. > > > > > > I figured this out by using gdb and inspecting > > > adj0->sub_type.midchain.fixup_data, after typecasting it into > > > pppoe_session_t > > > > > > In the structure, I could see that session_id, client_ip and > > > encap_idx are incorrect. I did not check other values in the structure. > > > > > > I also added code to log this fields in pppoe_fixup() and logs too > > > shows incorrect data in the fields. > > > > > > Example logs taken just before crash: > > > > > > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 -> 2:42: a: 1: 0: 2 > > > , type 8864 > > > vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip > > > 0x13131313 encap idx 0x13131313 > > > > > > First log prints out packet headers, to verify that data in packet > > > is as expected and is correct. Second log prints values in > > > pppoe_session data, and it can be seen that the values are obviously > > > incorrect. At this point the packet is sent out through the south > > > interface. Again after some time the TX index values become some > > > thing similar to > > > 1422457436 and VPP core dumps. > > > > > > We have tested the following scenarios: > > > > > > 1. Add PPPoE clients without sending out any traffic: There is no > > > crash observed. > > > 2. Add n number of PPPoE clients, load traffic [No adding or removal > > > or clients while traffic is on, see next scenario]: There is no > > > crash observed 3. Load traffic as soon as each client connects: VPP > > > crash observed. > > > > > > Another observation is that encap_if_index is available in two > > > places inside pppoe_fixup: > > > > > > 1. adj->rewrite_header.sw_if_index > > > 2. t->encap_if_index > > > > > > t->encap_if_index is used for updating TX, and this gets corrupted, > > > while adj->rewrite_header.sw_if_index has the correct index. > > > > > > I can check and get back if you need any additional information. Let > > > me know if a bug report is to be created for this. > > > > > > Environment: > > > > > > vpp# show version verbose > > > Version: v19.08.1-59~ga2aa83ca9-dirty > > > Compiled by: root > > > Compile host: build-02 > > > Compile date: Thu Sep 26 16:44:00 IST 2019 > > > Compile location: /root/build-1908 > > > Compiler: GCC 7.4.0 > > > Current PID: 7802 > > > > > > Operating system: Ubuntu 18.04 amd64 > > > > > > startup.conf and associated exec file is attached. > > > > > > There is a small patch to stock VPP to disable > > > ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also > > > attached output of show show hardware and gdb bt output. I have the > > > core file and its matching VPP debs, and can be shared if needed. > > > > > > In the bt the incorrect value of index can be seen in bt #5: > > > > > > #5 0x00007fba88e9ce0b in vlib_increment_combined_counter > > > (n_bytes=<optimized out>, n_packets=1, index=538976288, > > > thread_index=0, cm=0x7fba481f46a0) at > > > /root/build-1908/src/vlib/counter.h:229 > > > > > > Thanks and Regards, > > > > > > Raj > > > -=-=-=-=-=-=-=-=-=-=-=- > > > Links: You receive all messages sent to this group. > > > > > > View/Reply Online (#14063): > > > https://lists.fd.io/g/vpp-dev/message/14063 > > > Mute This Topic: https://lists.fd.io/mt/34298895/157026 > > > Group Owner: vpp-dev+ow...@lists.fd.io > > > Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub > > > [rajlistu...@gmail.com] > > > -=-=-=-=-=-=-=-=-=-=-=- > > -=-=-=-=-=-=-=-=-=-=-=- > > Links: You receive all messages sent to this group. > > > > View/Reply Online (#14081): > > https://lists.fd.io/g/vpp-dev/message/14081 > > Mute This Topic: https://lists.fd.io/mt/34298895/157026 > > Group Owner: vpp-dev+ow...@lists.fd.io > > Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub > > [rajlistu...@gmail.com] > > -=-=-=-=-=-=-=-=-=-=-=-
-=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#14197): https://lists.fd.io/g/vpp-dev/message/14197 Mute This Topic: https://lists.fd.io/mt/34298895/21656 Group Owner: vpp-dev+ow...@lists.fd.io Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com] -=-=-=-=-=-=-=-=-=-=-=-