Hello all,

I have done some more analysis of this issue, and I think I have
identified what could be wrong here.

During PPPoE session creation, as we mentioned earlier, the function
below gets invoked from pppoe_update_adj():

adj_nbr_midchain_update_rewrite (adj_index_t adj_index,
                                 adj_midchain_fixup_t fixup,
                                 const void *fixup_data, ...)

Note the 3rd argument, 'fixup_data': here we are passing the address of
the current session as fixup_data.

When subsequent sessions are added, this memory address can change,
because we are resizing the pem->sessions vector pool. Hence the stored
fixup_data (adj->sub_type.midchain.fixup_data) address is no longer
valid, and the memory it points to could have been already freed.

I think that is the reason we see the memory corruption. It would be
great if someone with far better knowledge of VPP internals could take a
look at this and confirm it.
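
If this is indeed the root cause, one possible direction would be to
store the session's pool index in fixup_data instead of its address, and
to re-resolve the session inside the fixup. The snippet below is only a
rough, untested sketch against my reading of pppoe.c (argument lists are
abbreviated and the variable names are my assumptions), not a proposed
patch:

/* In pppoe_update_adj(): pass the pool index of session 't', not the
 * element address (trailing arguments abbreviated). */
adj_nbr_midchain_update_rewrite (adj_index, pppoe_fixup,
                                 uword_to_pointer (t - pem->sessions, void *),
                                 ...);

/* In pppoe_fixup(): look the session up again on every packet, so a
 * resized pem->sessions pool can no longer leave a stale pointer. */
pppoe_main_t *pem = &pppoe_main;
u32 session_index = pointer_to_uword (data);
pppoe_session_t *t = pool_elt_at_index (pem->sessions, session_index);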

Thanks and Regards,

Raj

On Fri, Oct 11, 2019 at 12:17 PM Ni, Hongjun <hongjun...@intel.com> wrote:
>
> Hi Raj,
>
> I tried to reproduce your issue on VPP 20.01 as per your steps several 
> times, but could not reproduce it.
>
> Based on your description, please set a breakpoint in 
> vnet_pppoe_add_del_session() and see what happens when you create the 
> second PPPoE session while traffic is flowing on the first one.
>
> Thanks,
> Hongjun
>
> -----Original Message-----
> From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Raj
> Sent: Monday, September 30, 2019 11:54 PM
> To: vpp-dev <vpp-dev@lists.fd.io>
> Subject: Re: [vpp-dev] VPP core dump with PPPoE
>
> Hello all,
>
> I did some more debugging to find out when and where exactly the 
> pppoe_session_t gets corrupted. I added a couple of log entries, as shown 
> below, to log the PPPoE session id when a session is created and when a 
> packet traverses from north to south. I have tried this on VPP 19.08, 19.04 
> and 19.01, with the same results.
>
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 20923
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 35666
> vnet [21892]: pppoe_fixup:169:      New_Packet  pppoe01 session id 35666
> vnet [21892]: pppoe_update_adj:195: New_Session pppoe01 session id 58191
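>
> (For reference, the log entries were added with clib_warning(); a
> simplified sketch of what was added, with the interface-name formatting
> omitted:)
>
> clib_warning ("New_Session session id %d", t->session_id);  /* in pppoe_update_adj() */
> clib_warning ("New_Packet session id %d", t->session_id);   /* in pppoe_fixup() */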
>
> The sequence when corruption happens seems to be:
>
> 1. A new session is created
> 2. A packet for the newly created session traverses from north to south
> 3. The next session is created - and VPP crashes.
>
> Digging deeper, I added a watchpoint on the session_id of each newly 
> created session, using the following gdb script:
>
> # on each hit, set a hardware watchpoint on the session_id that
> # fixup_data currently points to
> b /root/build-1901/src/vnet/ip/ip4_forward.c:2444
> commands 1
>   watch -l ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>   bt
>   continue
> end
>
> gdb, running with this script, bails out with the following message:
>
> Thread 1 "vpp_main" hit Hardware watchpoint 2: -location 
> ((pppoe_session_t*)adj0->sub_type.midchain.fixup_data).session_id
>
> Old value = 35666
> New value = 4883
> __memset_avx2_unaligned_erms () at
> ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> 203     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No
> such file or directory.
> (gdb)
>
> It is interesting to note that 4883 is 0x1313 - the same byte pattern 
> (0x13131313) that shows up in the corrupted client_ip and encap_idx fields 
> in my earlier mail below.
>
> The backtrace shows the path taken to reach here:
>
>
> (gdb) bt
> #0  __memset_avx2_unaligned_erms () at
> ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:203
> #1  0x00007ffff61a4179 in mspace_put (msp=0x7fffb4df7010,
> p_arg=0x7fffb592d9c8) at /root/build-1901/src/vppinfra/dlmalloc.c:4294
> #2  0x00007ffff618ea39 in clib_mem_free (p=0x7fffb592d9c8) at
> /root/build-1901/src/vppinfra/mem.h:215
> #3  0x00007ffff618edd8 in vec_resize_allocate_memory (v=0x7fffb592da00, 
> length_increment=1, data_bytes=312, header_bytes=56, data_align=64) at
> /root/build-1901/src/vppinfra/vec.c:96
> #4  0x00007fffb0aa4a29 in _vec_resize_inline (v=0x7fffb592da00, 
> length_increment=1, data_bytes=256, header_bytes=48, data_align=64) at
> /root/build-1901/src/vppinfra/vec.h:147
> #5  0x00007fffb0aa9ca4 in vnet_pppoe_add_del_session (a=0x7fffb6703950, 
> sw_if_indexp=0x7fffb67038e8) at
> /root/build-1901/src/plugins/pppoe/pppoe.c:335
> #6  0x00007fffb0aaadec in pppoe_add_del_session_command_fn
> (vm=0x7ffff68e3400 <vlib_global_main>, input=0x7fffb6703ee0,
> cmd=0x7fffb65dd73c) at /root/build-1901/src/plugins/pppoe/pppoe.c:554
> #7  0x00007ffff6617db0 in vlib_cli_dispatch_sub_commands
> (vm=0x7ffff68e3400 <vlib_global_main>, cm=0x7ffff68e3600 
> <vlib_global_main+512>, input=0x7fffb6703ee0, parent_command_index=21) at 
> /root/build-1901/src/vlib/cli.c:644
>
> This does not occur if traffic is not initiated and there is no packet flow 
> through the system. It would be great if someone who understands this code 
> could confirm whether my analysis is correct and give some pointers to 
> figure out:
>
> 1. Why, when a new session is created, the data of an old session is 
> changed to 0x1313.
> 2. What debugging steps I can take next to figure out why this happens.
>
>
>
> Thanks and Regards,
>
>
> Raj
>
> On Sat, Sep 28, 2019 at 6:09 PM Raj via Lists.Fd.Io 
> <rajlistuser=gmail....@lists.fd.io> wrote:
> >
> > Hello all,
> >
> > I have done some more tests to pinpoint the exact condition of the
> > crash. What I could figure out was that the crash happens when memory
> > is being allocated for a pppoe_session_t while packets are flowing
> > through a PPPoE interface.
> >
> > Here is what I did to arrive at this conclusion:
> >
> > 1. Configure VPP without any default route (to ensure packets do not
> >    hit the north interface from the south)
> > 2. Provision 100 PPPoE clients - no crash observed
> > 3. Deprovision all 100 PPPoE clients
> > 4. Configure the default route
> > 5. Provision 100 PPPoE clients again, and start a ping to an external
> >    IP from each client - no crash observed
> > 6. Provision 50 more PPPoE clients - VPP crashes.
> >
> > Based on this test, and from what I could understand from the code, my
> > guess is that there is some memory corruption happening inside the
> > pppoe_session_t when memory is being allocated for it while packets are
> > traversing the PPPoE interface.
> >
> > Thanks and Regards,
> >
> > Raj
> >
> >
> > On Thu, Sep 26, 2019 at 7:15 PM Raj via Lists.Fd.Io
> > <rajlistuser=gmail....@lists.fd.io> wrote:
> > >
> > > Hello all,
> > >
> > > I am observing a VPP crash when approximately 20 - 50 PPPoE clients
> > > are connecting and traffic is flowing through them. This crash was
> > > reproducible every time I tried.
> > >
> > > I did some debugging and here is what I could find out so far:
> > >
> > > If I understand correctly, when an incoming packet from the north side
> > > is being sent to a PPPoE interface, pppoe_fixup() is called to update
> > > pppoe0->length and t->encap_if_index. The length and encap_if_index are
> > > taken from adj0->sub_type.midchain.fixup_data.
> > >
> > > My observation is that while clients are connecting and traffic is
> > > flowing for the already-connected clients,
> > > adj0->sub_type.midchain.fixup_data appears to hold incorrect data at
> > > some point during the test. The incorrect data is observed for clients
> > > which have been provisioned for some time and which have had packets
> > > flowing through them.
> > >
> > > I figured this out by using gdb and inspecting
> > > adj0->sub_type.midchain.fixup_data after casting it to a
> > > pppoe_session_t.
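> > >
> > > For example, from a breakpoint where adj0 is in scope:
> > >
> > > (gdb) p *(pppoe_session_t *) adj0->sub_type.midchain.fixup_data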
> > >
> > > In the structure, I could see that session_id, client_ip and
> > > encap_idx are incorrect. I did not check other values in the structure.
> > >
> > > I also added code to log these fields in pppoe_fixup(), and the logs
> > > too show incorrect data in the fields.
> > >
> > > Example logs taken just before crash:
> > >
> > > vnet[12988]: pppoe_fixup:243: 40:7b:1b: 0:12:38 ->  2:42: a: 1: 0: 2 , type 8864
> > > vnet[12988]: pppoe_fixup:271: pppoe session id 4883, client_ip 0x13131313 encap idx 0x13131313
> > >
> > > The first log prints the packet headers, to verify that the data in
> > > the packet is as expected and correct. The second log prints values
> > > from the pppoe_session data, and it can be seen that those values are
> > > obviously incorrect. At this point the packet is still sent out through
> > > the south interface. After some more time the TX index value becomes
> > > something like 1422457436 and VPP core dumps.
> > >
> > > We have tested the following scenarios:
> > >
> > > 1. Add PPPoE clients without sending any traffic: no crash observed.
> > > 2. Add n PPPoE clients, then load traffic [no adding or removal of
> > >    clients while traffic is on; see next scenario]: no crash observed.
> > > 3. Load traffic as soon as each client connects: VPP crash observed.
> > >
> > > Another observation is that encap_if_index is available in two places
> > > inside pppoe_fixup():
> > >
> > > 1. adj->rewrite_header.sw_if_index
> > > 2. t->encap_if_index
> > >
> > > t->encap_if_index is used for updating TX, and this gets corrupted,
> > > while adj->rewrite_header.sw_if_index has the correct index.
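> > >
> > > From a breakpoint in pppoe_fixup() the two can be compared directly
> > > (parameter names 'adj' and 'data' assumed from the fixup signature):
> > >
> > > (gdb) p adj->rewrite_header.sw_if_index
> > > (gdb) p ((pppoe_session_t *) data)->encap_if_index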
> > >
> > > I can check and get back if you need any additional information. Let
> > > me know if a bug report should be created for this.
> > >
> > > Environment:
> > >
> > > vpp# show version verbose
> > > Version:          v19.08.1-59~ga2aa83ca9-dirty
> > > Compiled by:          root
> > > Compile host:          build-02
> > > Compile date:          Thu Sep 26 16:44:00 IST 2019
> > > Compile location:      /root/build-1908
> > > Compiler:          GCC 7.4.0
> > > Current PID:          7802
> > >
> > > Operating system: Ubuntu 18.04 amd64
> > >
> > > startup.conf and associated exec file is attached.
> > >
> > > There is a small patch to stock VPP to disable
> > > ETHERNET_ERROR_L3_MAC_MISMATCH, which is attached. I have also attached
> > > the output of show hardware and the gdb bt output. I have the core file
> > > and its matching VPP debs, which can be shared if needed.
> > >
> > > In the backtrace, the incorrect value of index can be seen in frame #5:
> > >
> > > #5  0x00007fba88e9ce0b in vlib_increment_combined_counter
> > > (n_bytes=<optimized out>, n_packets=1, index=538976288,
> > > thread_index=0, cm=0x7fba481f46a0) at
> > > /root/build-1908/src/vlib/counter.h:229
> > >
> > > Thanks and Regards,
> > >
> > > Raj