Ouch. Attempting to run “make test-debug” resulted in a lot of unrelated 
sadness... I will do some more testing to ensure it’s not a PEBCAC...

--a

> On 28 May 2020, at 21:25, Andrew Yourtchenko via lists.fd.io 
> <ayourtch=gmail....@lists.fd.io> wrote:
> 
> Paul,
> 
> This is an excellent catch, thanks!! I will give it a go in test-debug...
> 
> --a
> 
>>> On 28 May 2020, at 16:15, Paul Vinciguerra <pvi...@vinciconsulting.com> 
>>> wrote:
>>> 
>> 
>> A few weeks back, I became aware of the following issue with the LISP tests:
>> 
>> /vpp/build-root/install-vpp_debug-native/vpp/bin/vpp[63989]: 
>> vnet_lisp_add_del_locator_set:2140: Can't delete a locator that supports a 
>> mapping!
>> /vpp/build-root/install-vpp_debug-native/vpp/bin/vpp[63989]: received signal 
>> SIGSEGV, PC 0xa0a0a00, faulting address 0xa0a0a00
>> 14:32:44,290 Child test runner process unresponsive and core-file exists in 
>> test temporary directory (last test running was `Test case for basic 
>> encapsulation' in `/tmp/vpp-unittest-TestLisp-wtuvdu4q')!
>> 
>> which seems to be triggered by the api trace commands in tearDown in 
>> framework.py.  I see it while running tests in a docker container.
>> 
>> 
>> 
>> On Thu, May 28, 2020 at 4:51 AM Andrew Yourtchenko <ayour...@gmail.com> 
>> wrote:
>>> Hi Elias,
>>> 
>>> Yeah it all does point to something like uninitialized data - I ran 
>>> the tests yesterday on two different machines for a while, apparently 
>>> without the issues...
>>> 
>>> The CI runtime environment is much more dynamic - it’s an ephemeral docker 
>>> container that is orchestrated by Nomad and is destroyed after the job 
>>> is run.
>>> 
>>> Could you push as a separate change the code that reliably gives you the 
>>> error in the LISP unit test in the CI, and let me know the change# ?
>>> 
>>> 
>>>  I will then test some tooling enhancement ideas I had for a while - to 
>>> check within the job whether the core exists, and if it does, to load it 
>>> into gdb and do some scripted processing of it and output the results... 
>>> (Iterate over the call stack and issue stuff like ‘info locals’, ‘info 
>>> regs’, etc).
>>> 
>>> I did some experiments with that approach earlier and it seemed like a 
>>> rather scalable technique for most of the issues, which should also save 
>>> disk space and the developer time ...
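
The scripted core-file processing described above could look roughly like the
sketch below. It is only an illustration, not the actual VPP CI tooling: the
function names are hypothetical, and it assumes core files sit directly in the
test temporary directory with names starting with "core". It relies only on
gdb's standard --batch and -ex options; "bt full" prints the locals of every
frame, which covers the "iterate over the call stack" part.

```python
import glob
import os
import subprocess


def find_core_files(test_dir):
    """Return sorted paths of regular files in test_dir named core*.

    Assumption: core files are written into the test temporary directory
    and their names start with "core".
    """
    pattern = os.path.join(test_dir, "core*")
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


def build_gdb_command(binary, core):
    """Build a non-interactive gdb invocation for one core file.

    --batch makes gdb exit after running the -ex commands.
    """
    return [
        "gdb", "--batch",
        "-ex", "info registers",            # registers of the crashing frame
        "-ex", "thread apply all bt full",  # every frame, with its locals
        binary, core,
    ]


def summarize_core(binary, test_dir):
    """Run gdb on each core file found and return {core_path: gdb_output}."""
    reports = {}
    for core in find_core_files(test_dir):
        result = subprocess.run(
            build_gdb_command(binary, core),
            capture_output=True, text=True, check=False,
        )
        reports[core] = result.stdout + result.stderr
    return reports
```

A job could call summarize_core() with the path to the vpp binary and the
/tmp/vpp-unittest-* directory of the failed test, print the returned text into
the console log, and then delete the core file to save disk space.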
>>> 
>>> --a
>>> 
>>> > On 28 May 2020, at 10:33, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>>> > 
>>> > Hi Andrew,
>>> > 
>>> > In my case it failed several times and appeared to be triggered by
>>> > seemingly harmless code changes, but it seemed like the problem was
>>> > reproducible for a given version of the code. What seemed to matter was
>>> > when I changed things related to local variables inside the
>>> > set_ipfix_exporter_command_fn() function. The test logs said "Core-file 
>>> > exists" which I suppose means that vpp crashed. The testing framework
>>> > repeats the test several times, saying "3 attempt(s) left", then "2
>>> > attempt(s) left" and so on; all those repeated attempts seemed to crash
>>> > in the same way.
>>> > 
>>> > It could be something with uninitialized variables, e.g. something that
>>> > is assumed to be zero but is never explicitly initialized so it can
>>> > work when it happens to be zero but depending on platform and compiler
>>> > details there could be some garbage there causing a problem. Then
>>> > unrelated code changes like adding variables somewhere making things
>>> > end up at slightly different memory locations could make the error come
>>> > and go. This is just guessing of course.
>>> > 
>>> > Is it possible to get login access to the machine where the
>>> > gerrit/jenkins tests are run, to debug it there where the issue can be
>>> > reproduced?
>>> > 
>>> > / Elias
>>> > 
>>> > 
>>> >> On Wed, 2020-05-27 at 19:03 +0200, Andrew 👽 Yourtchenko wrote:
>>> >> Yep, so it looks like we have an issue...
>>> >> 
>>> >> https://gerrit.fd.io/r/c/vpp/+/27305 has the same failures, I am
>>> >> rerunning it now to see how intermittent it is - as well as testing
>>> >> the latest master locally....
>>> >> 
>>> >> --a
>>> >> 
>>> >>> On 27 May 2020, at 18:56, Elias Rudberg <elias.rudb...@bahnhof.net>
>>> >>> wrote:
>>> >>> 
>>> >>> Hi Andrew,
>>> >>> 
>>> >>> Yes, it was Basic LISP test. It looked like this in the
>>> >>> console.log.gz
>>> >>> for vpp-verify-master-ubuntu1804:
>>> >>> 
>>> >>> ==============================================================================
>>> >>> TEST RESULTS:
>>> >>>    Scheduled tests: 1177
>>> >>>     Executed tests: 1176
>>> >>>       Passed tests: 1039
>>> >>>      Skipped tests: 137
>>> >>> Not Executed tests: 1
>>> >>>             Errors: 1
>>> >>> FAILURES AND ERRORS IN TESTS:
>>> >>> Testcase name: Basic LISP test 
>>> >>>     ERROR: Test case for basic encapsulation
>>> >>> [test_lisp.TestLisp.test_lisp_basic_encap]
>>> >>> TESTCASES WHERE NO TESTS WERE SUCCESSFULLY EXECUTED:
>>> >>> Basic LISP test 
>>> >>> ==============================================================================
>>> >>> 
>>> >>> / Elias
>>> >>> 
>>> >>> 
>>> >>> 
>>> >>> On Wed, 2020-05-27 at 18:42 +0200, Andrew 👽 Yourtchenko wrote:
>>> >>>> Basic LISP test - was it the one that was failing for you ?
>>> >>>> That particular test intermittently failed a couple of times for
>>> >>>> me
>>> >>>> as well, on a doc-only change, so we have an unrelated issue.
>>> >>>> I am running it locally to see what is going on.
>>> >>>> --a
>>> 
> 
View/Reply Online (#16564): https://lists.fd.io/g/vpp-dev/message/16564