Hello all,

I ran some more tests to pinpoint the exact condition that triggers the crash. As far as I can tell, the crash happens when memory is allocated for a pppoe_session_t while packets are flowing through a PPPoE interface.
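A minimal sketch of one mechanism that would fit this pattern, assuming (unconfirmed) that the sessions live in a growable table and that something keeps a raw pointer to an element instead of its index. Everything here is hypothetical: demo_session_t, demo_table_t and the demo_* functions are illustrative stand-ins, not VPP types or APIs. The table deliberately grows by allocating a fresh block and freeing the old one, so the relocation is deterministic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a session record; NOT the real pppoe_session_t. */
typedef struct {
    uint16_t session_id;
    uint32_t client_ip;
    uint32_t encap_if_index;
} demo_session_t;

/* Hypothetical growable table standing in for an element pool. */
typedef struct {
    demo_session_t *elts;
    size_t len;
    size_t cap;
} demo_table_t;

/* Append a session and return its index, or (size_t)-1 on allocation failure.
 * Growth always copies into a new block and frees the old one. */
static size_t demo_table_add(demo_table_t *t, demo_session_t s)
{
    if (t->len == t->cap) {
        size_t new_cap = t->cap ? t->cap * 2 : 4;
        demo_session_t *n = malloc(new_cap * sizeof *n);
        if (n == NULL)
            return (size_t)-1;
        if (t->len)
            memcpy(n, t->elts, t->len * sizeof *n);
        free(t->elts);              /* any pointer into the old block now dangles */
        t->elts = n;
        t->cap = new_cap;
    }
    t->elts[t->len] = s;
    return t->len++;
}

/* Look an element up by index; this stays valid across growth. */
static demo_session_t *demo_table_get(demo_table_t *t, size_t index)
{
    assert(index < t->len);
    return &t->elts[index];
}

/* Returns 1 if growing the table moved element 0 while an index-based
 * lookup still finds the original values, 0 if not, -1 on allocation failure. */
static int demo_growth_moves_element(void)
{
    demo_table_t t = { NULL, 0, 0 };
    demo_session_t s = { 1, 0x0a000001u, 2 };

    size_t first = demo_table_add(&t, s);
    if (first == (size_t)-1)
        return -1;

    /* Record the address as an integer; the old pointer is never dereferenced. */
    uintptr_t before = (uintptr_t)(void *)demo_table_get(&t, first);

    /* Four more sessions: the last one exceeds the initial capacity of 4
     * and forces exactly one relocation. */
    for (uint16_t id = 2; id <= 5; id++) {
        demo_session_t more = { id, 0x0a000000u + id, 2 };
        if (demo_table_add(&t, more) == (size_t)-1) {
            free(t.elts);
            return -1;
        }
    }

    demo_session_t *now = demo_table_get(&t, first);
    int moved = before != (uintptr_t)(void *)now;
    int intact = now->session_id == 1 && now->client_ip == 0x0a000001u &&
                 now->encap_if_index == 2;

    free(t.elts);
    return moved && intact;
}
```

In this sketch a consumer that cached the element address before the growth would be reading freed memory afterwards, while a consumer that cached the index would still see the correct session. Whether anything comparable happens in the PPPoE plugin is exactly what would need to be checked in the code.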
Here is what I did to arrive at this conclusion:

1. Configure VPP without any default route (to ensure that packets from the south side do not reach the north interface)
2. Provision 100 PPPoE clients - No crash observed
3. Deprovision all 100 PPPoE clients
4. Configure the default route
5. Provision 100 PPPoE clients again, and start a ping to an external IP from each client - No crash observed
6. Provision 50 more PPPoE clients - VPP crashes

Based on this test, and from what I could understand from the code, my guess is that the pppoe_session_t data gets corrupted when memory is allocated for new sessions while packets are traversing the PPPoE interface.

Thanks and Regards,

Raj

On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io <rajlistuser=gmail....@lists.fd.io> wrote:
>
> Hello all,
>
> I am observing a VPP crash when approximately 20 - 50 PPPoE clients
> are connecting and traffic is flowing through them. This crash was
> reproducible every time I tried.
>
> I did some debugging and here is what I could find out so far:
>
> If I understand correctly, when an incoming packet from the north side is
> being sent to the PPPoE interface, pppoe_fixup() is called to update
> pppoe0->length, and t->encap_if_index. Length and encap_if_index are
> taken from adj0->sub_type.midchain.fixup_data
>
> My observation is that while clients are connecting and traffic is
> flowing for connected clients, adj0->sub_type.midchain.fixup_data
> appears to hold incorrect data, at some point in time, during the
> test. What we have seen is that the incorrect data
> (adj0->sub_type.midchain.fixup_data) is observed for clients which have
> already been provisioned for some time and which had packets flowing
> through them.
>
> I figured this out by using gdb and inspecting
> adj0->sub_type.midchain.fixup_data, after typecasting it into
> pppoe_session_t
>
> In the structure, I could see that session_id, client_ip and encap_idx
> are incorrect. I did not check other values in the structure.
>
> I also added code to log these fields in pppoe_fixup() and the logs too
> show incorrect data in the fields.
>
> Example logs taken just before crash:
>
> vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 -> 2:42: a: 1: 0: 2 , type 8864
> vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip 0x13131313 encap idx 0x13131313
>
> First log prints out packet headers, to verify that data in packet is
> as expected and is correct. Second log prints values in pppoe_session
> data, and it can be seen that the values are obviously incorrect. At
> this point the packet is sent out through the south interface. Again
> after some time the TX index values become something similar to
> 1422457436 and VPP core dumps.
>
> We have tested the following scenarios:
>
> 1. Add PPPoE clients without sending out any traffic: There is no
> crash observed.
> 2. Add n number of PPPoE clients, load traffic [No adding or removal
> of clients while traffic is on, see next scenario]: There is no crash
> observed.
> 3. Load traffic as soon as each client connects: VPP crash observed.
>
> Another observation is that encap_if_index is available in two places
> inside pppoe_fixup:
>
> 1. adj->rewrite_header.sw_if_index
> 2. t->encap_if_index
>
> t->encap_if_index is used for updating TX, and this gets corrupted,
> while adj->rewrite_header.sw_if_index has the correct index.
>
> I can check and get back if you need any additional information. Let
> me know if a bug report is to be created for this.
>
> Environment:
>
> vpp# show version verbose
> Version: v19.08.1-59~ga2aa83ca9-dirty
> Compiled by: root
> Compile host: build-02
> Compile date: Thu Sep 26 16:44:00 IST 2019
> Compile location: /root/build-1908
> Compiler: GCC 7.4.0
> Current PID: 7802
>
> Operating system: Ubuntu 18.04 amd64
>
> startup.conf and associated exec file is attached.
>
> There is a small patch to stock VPP to disable
> ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached.
> I have also attached the output of show hardware and the gdb bt
> output. I have the core file and its matching VPP debs, and they can
> be shared if needed.
>
> In the bt the incorrect value of index can be seen in bt #5:
>
> #5 0x00007fba88e9ce0b in vlib_increment_combined_counter
> (n_bytes=<optimized out>, n_packets=1, index=538976288,
> thread_index=0, cm=0x7fba481f46a0) at
> /root/build-1908/src/vlib/counter.h:229
>
> Thanks and Regards,
>
> Raj
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#14063): https://lists.fd.io/g/vpp-dev/message/14063
> Mute This Topic: https://lists.fd.io/mt/34298895/157026
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [rajlistu...@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
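A side note on the corrupted values quoted above: several of them are a single byte repeated. Session id 4883 is 0x1313, client_ip and encap idx are 0x13131313, and the index 538976288 in bt #5 is 0x20202020 (0x20 is the ASCII space character). That may hint at the memory being overwritten with a fill pattern or with text, but this is only an observation, not a diagnosis. The helper names below are hypothetical and are not part of VPP; the sketch just shows a check that could be dropped into debug logging to flag such values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when all four bytes of v are identical, e.g. 0x13131313.
 * Note that 0 also matches, so treat a hit as a hint, not proof. */
static bool is_repeated_byte32(uint32_t v)
{
    uint32_t b = v & 0xffu;
    return v == b * 0x01010101u;
}

/* True when both bytes of v are identical, e.g. 0x1313. */
static bool is_repeated_byte16(uint16_t v)
{
    return (v >> 8) == (v & 0xffu);
}

/* Self-check against the values reported in this thread. */
static void check_values_from_thread(void)
{
    assert(is_repeated_byte16(4883));           /* session id, 0x1313 */
    assert(is_repeated_byte32(0x13131313u));    /* client_ip / encap idx */
    assert(is_repeated_byte32(538976288u));     /* index in bt #5, 0x20202020 */
    assert(!is_repeated_byte32(1422457436u));   /* later TX index, 0x54c8fa5c */
}
```

The later TX index 1422457436 does not follow the pattern, so if a fill pattern is involved at all, it does not explain every bad value.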