Hello all,

I have done some more tests to pinpoint the exact condition of the
crash. What I could figure out is that the crash happens when memory
is being allocated for pppoe_session_t while packets are flowing
through the PPPoE interface.

Here is what I did to arrive at this conclusion:

1. Configure VPP without any default route (to ensure packets do not
hit north interface from south)
2. Provision 100 PPPoE clients - No crash observed
3. Deprovision all 100 PPPoE clients
4. Configure default route
5. Provision 100 PPPoE clients again, and start a ping to an external
IP from each client - No Crash observed
6. Provision 50 more PPPoE clients - VPP crashes.

Based on this test, and from what I could understand from the code, my
guess is that some memory corruption happens inside pppoe_session_t
when memory is being allocated for it while packets are traversing
the PPPoE interface.
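
To illustrate the kind of failure I suspect, here is a minimal sketch. It
assumes the sessions live in a growable array that can relocate when it
grows, and that the data path caches a raw pointer to an element. Neither
assumption is confirmed from the VPP code here. All demo_* names are
hypothetical and are not VPP APIs. The sketch shows that a cached index
still resolves to the right session after the array has moved, which a
cached pointer would not.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a session record; not the real pppoe_session_t. */
typedef struct {
    uint16_t session_id;
    uint32_t encap_if_index;
} demo_session_t;

/* Hypothetical growable pool: one contiguous array that moves on growth. */
typedef struct {
    demo_session_t *elts;
    size_t len;
    size_t cap;
    unsigned relocations; /* how many times the array was moved */
} demo_pool_t;

/* Append a session and return its index, or -1 if allocation fails.
   Growth always copies into a fresh block and frees the old one, so any
   raw pointer taken before the growth must be treated as stale. */
static long demo_pool_add(demo_pool_t *p, demo_session_t s)
{
    if (p->len == p->cap) {
        size_t new_cap = p->cap ? p->cap * 2 : 2;
        demo_session_t *n = malloc(new_cap * sizeof(*n));
        if (n == NULL)
            return -1;
        if (p->elts != NULL) {
            memcpy(n, p->elts, p->len * sizeof(*n));
            free(p->elts);
            p->relocations++;
        }
        p->elts = n;
        p->cap = new_cap;
    }
    assert(p->len < p->cap);
    p->elts[p->len] = s;
    return (long)p->len++;
}

/* Returns 1 if the array was relocated at least once while sessions were
   being added and the index saved beforehand still finds the first one. */
static int demo_index_survives_growth(void)
{
    demo_pool_t pool = {0};
    demo_session_t first = { .session_id = 1, .encap_if_index = 7 };
    long idx = demo_pool_add(&pool, first);
    if (idx < 0)
        return 0;

    for (uint16_t i = 2; i <= 50; i++) {
        demo_session_t s = { .session_id = i, .encap_if_index = 7 };
        if (demo_pool_add(&pool, s) < 0) {
            free(pool.elts);
            return 0;
        }
    }

    int ok = pool.relocations > 0 &&
             pool.elts[idx].session_id == 1 &&
             pool.elts[idx].encap_if_index == 7;
    free(pool.elts);
    return ok;
}
```

If something like this is what happens, storing the session index in the
fixup data and looking the session up per packet would avoid the stale
pointer, at the cost of one extra lookup.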

Thanks and Regards,

Raj


On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io
<rajlistuser=gmail....@lists.fd.io> wrote:
>
> Hello all,
>
> I am observing a VPP crash when approximately 20 - 50 PPPoE clients
> are connecting and traffic is flowing through them. This crash was
> reproducible every time I tried.
>
> I did some debugging and here is what I could find out so far:
>
> If I understand correctly, when an incoming packet from the north side
> is being sent to the PPPoE interface, pppoe_fixup() is called to update
> pppoe0->length, and t->encap_if_index. Length and encap_if_index are
> taken from adj0->sub_type.midchain.fixup_data
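
The length update mentioned above can be sketched as follows. This is an
illustration only, assuming an untagged Ethernet frame (14-byte header)
followed by the 6-byte PPPoE header, with the PPPoE LENGTH field counting
everything after the PPPoE header, as RFC 2516 defines it. The demo_*
names are hypothetical and are not taken from the VPP source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_ETH_HDR_LEN = 14,  /* dst MAC + src MAC + ethertype, no VLAN tag */
    DEMO_PPPOE_HDR_LEN = 6  /* ver/type, code, session id, length */
};

/* Compute the PPPoE LENGTH value (host byte order) for a frame of
   frame_len bytes that starts at the Ethernet header.
   Returns 0 on success, -1 if the frame is too short or too long. */
static int demo_pppoe_payload_len(size_t frame_len, uint16_t *out)
{
    size_t hdrs = DEMO_ETH_HDR_LEN + DEMO_PPPOE_HDR_LEN;
    if (out == NULL || frame_len < hdrs || frame_len - hdrs > UINT16_MAX)
        return -1;
    *out = (uint16_t)(frame_len - hdrs);
    return 0;
}

/* Store a 16-bit value in network (big-endian) byte order. */
static void demo_store_be16(uint8_t dst[2], uint16_t v)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(v >> 8);
    dst[1] = (uint8_t)(v & 0xff);
}
```

For example, a 106-byte frame gives a LENGTH of 86: the 2-byte PPP
protocol field plus an 84-byte payload.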
>
> My observation is that while clients are connecting and traffic is
> flowing for connected clients, adj0->sub_type.midchain.fixup_data
> appears to hold incorrect data, at some point in time, during the
> test. What we have seen is the incorrect data
> (adj0->sub_type.midchain.fixup_data) is observed for clients which are
> already provisioned for some time and which had packets flowing
> through them.
>
> I figured this out by using gdb and inspecting
> adj0->sub_type.midchain.fixup_data, after typecasting it into
> pppoe_session_t
>
> In the structure, I could see that session_id, client_ip and encap_idx
> are incorrect. I did not check other values in the structure.
>
> I also added code to log these fields in pppoe_fixup(), and the logs
> too show incorrect data in the fields.
>
> Example logs taken just before crash:
>
> vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2 , type 8864
> vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip
> 0x13131313 encap idx 0x13131313
>
> First log prints out packet headers, to verify that data in packet is
> as expected and is correct. Second log prints values in pppoe_session
> data, and it can be seen that the values are obviously incorrect. At
> this point the packet is sent out through the south interface. Again,
> after some time, the TX index values become something similar to
> 1422457436 and VPP core dumps.
>
> We have tested the following scenarios:
>
> 1. Add PPPoE clients without sending out any traffic: There is no
> crash observed.
> 2. Add n PPPoE clients, load traffic [No adding or removal
> of clients while traffic is on, see next scenario]: There is no crash
> observed
> 3. Load traffic as soon as each client connects: VPP crash observed.
>
> Another observation is that encap_if_index is available in two places
> inside pppoe_fixup:
>
> 1. adj->rewrite_header.sw_if_index
> 2. t->encap_if_index
>
> t->encap_if_index is used for updating TX, and this gets corrupted,
> while adj->rewrite_header.sw_if_index has the correct index.
>
> I can check and get back if you need any additional information. Let
> me know if a bug report is to be created for this.
>
> Environment:
>
> vpp# show version verbose
> Version:          v19.08.1-59~ga2aa83ca9-dirty
> Compiled by:          root
> Compile host:          build-02
> Compile date:          Thu Sep 26 16:44:00 IST 2019
> Compile location:      /root/build-1908
> Compiler:          GCC 7.4.0
> Current PID:          7802
>
> Operating system: Ubuntu 18.04 amd64
>
> startup.conf and associated exec file is attached.
>
> There is a small patch to stock VPP to disable
> ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also
> attached the output of show hardware and the gdb bt output. I have the
> core file and its matching VPP debs, which can be shared if needed.
>
> In the bt, the incorrect index value can be seen in frame #5:
>
> #5  0x00007fba88e9ce0b in vlib_increment_combined_counter
> (n_bytes=<optimized out>, n_packets=1, index=538976288,
> thread_index=0, cm=0x7fba481f46a0) at
> /root/build-1908/src/vlib/counter.h:229
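
Several of the bad values look like one byte repeated: session id 4883
is 0x1313, client_ip and encap idx are 0x13131313, and the index
538976288 in frame #5 is 0x20202020 (0x20 is the ASCII space). The later
TX index 1422457436 is 0x54C8FA5C and does not follow that pattern. A
repeated byte often hints at a fill pattern or overwritten memory, but
that is only a hint, not a diagnosis. The helper below is a hypothetical
sketch for flagging such values while logging.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if every byte of the nbytes-wide value v equals its lowest
   byte, e.g. 0x13131313 for nbytes = 4 or 0x1313 for nbytes = 2. */
static int demo_is_repeated_byte(uint32_t v, unsigned nbytes)
{
    assert(nbytes >= 1 && nbytes <= 4);
    uint8_t b = (uint8_t)(v & 0xff);
    for (unsigned i = 1; i < nbytes; i++) {
        if (((v >> (8 * i)) & 0xff) != b)
            return 0;
    }
    return 1;
}
```
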
>
> Thanks and Regards,
>
> Raj
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#14063): https://lists.fd.io/g/vpp-dev/message/14063
> Mute This Topic: https://lists.fd.io/mt/34298895/157026
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [rajlistu...@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-