I'm writing this up here in the hope it will stir some discussion and
suggest a direction. Let's see if we can hammer this out here; if not,
I'll drop it in Jira (or maybe "also").

To set the scene:

- many components of the code base have unit tests
- some of the build combinations run by Jenkins run unit tests
-- there is a specific unit_tests target which runs the tests twice, once
in a secured build, once without (though for the purposes below, without
a pending patch, only one set of results will be considered)
-- some of the other targets (e.g. Windows) run some unit tests
- any place unit tests are run has the possibility of flipping that
entire build target to a failure if the unit tests don't pass.
- any failed Jenkins build target will cause Jenkins to vote -1 on a
Gerrit change; if all targets pass, it votes +1
- an SCons build script which has control over a unit test can add a
dependency so that the test runs, by calling a run_test function.
- an argument to run_test lets valgrind's memory leak detector (memcheck)
be run on the unit test binary (this is done only if the platform
supports valgrind - basically, on Linux); if so, the test is not run
directly, but under the control of valgrind.
- valgrind leaves a report file in a place Jenkins can find it
- Jenkins has a task which looks for those files and, if it finds any,
uses the results as part of its +1/-1 vote.  If there are no memcheck
files, that part does not contribute - "no files" is not a failure. For
the files it does find, a configuration setting counts the various types
of failures, and if any category exceeds its limit, that's a -1 vote.

Over time, more and more unit tests have had the valgrind option turned
on. Recently Phil found a few places where, though "turned on", the
output files were not saved where Jenkins could find them, so their
results were not included in the counts.  Now that we have fixed those in
the master branch, one of the counts is over the threshold, so for the
time being "master is broken".  It isn't really broken, but a patch
submitted to Gerrit against master will fail the unit_tests part of the
Jenkins build, as noted, due to too many "definitely lost" memory leaks,
even if the patch itself is fine and in no way contributes to the leaks.

How to resolve this and move forward? Note that for now I'm avoiding
pushing the missing-memcheck-file fixes to 1.3-rel so as not to disrupt
release preparation activities.

- turn some of the memchecks back off (e.g. revert fixes to filenames,
so they'll go back to not matching the Jenkins pattern and just be
missed). Negative: out of sight, out of mind
- tell Jenkins not to fail so easily on these leak issues, either by
upping the limits, or by not letting memcheck fails add up to a -1 vote
even when a limit is exceeded. Negative: at least you could now find the
fails if you looked for them, but it's still probably "out of mind" - no
failure means no urgency
- fix all the fails
- fix enough fails to get back to a passing state
Negative of last two: takes time and effort, and we're not blessed with
excess resources.
- teach valgrind to ignore more things that aren't really critical.
Negative: takes some work, we still need to evaluate what's real and what
isn't, and it may not be much different from the previous option.

Some, maybe the majority, of the leaks are not that critical. As a major
example, the security unittests have a whole bunch of reported leak
failures as a result of invoking the googletest framework macro
ASSERT_TRUE.  As you might guess, this is an assertion: if the condition
is false, an error is reported and the macro returns out of the current
function.  Thus, any allocations the function may have made (and the acl
code in particular seems prone to calling other functions which allocate
as a side effect) never get freed; the pointers holding those allocations
go out of scope and the memory is simply lost.  Valgrind flags this as a
memory leak, but in my opinion it's not a crucial one. Unit tests are
short-lived: they run, they quit, and we're done. Sure, it's possible to
leak enough to start failing tests, but that's a different class of
problem than the same situation would be in the core stack. Valgrind can
take a suppressions file to exclude certain fails, but even here there's
some effort involved: someone needs to identify a pattern to exclude that
works for valgrind and doesn't also hide real leaks we care about.
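
To make the pattern concrete, here is a minimal googletest sketch of how
an ASSERT_TRUE bail-out turns into a "definitely lost" report, and of the
RAII-style rewrite that avoids it. The names (ParseAcl and so on) are
made up for illustration, not actual IoTivity code:

#include <gtest/gtest.h>
#include <memory>

// Hypothetical stand-ins for the real code under test.
struct Acl { };

static Acl* ParseAcl(const char* json)   // caller owns the returned object
{
    (void)json;
    return new Acl();                    // allocation made as a side effect
}

static bool SomeCheckThatFails()
{
    return false;
}

TEST(AclExample, LeaksWhenAssertFires)
{
    Acl* acl = ParseAcl("{}");

    // On a false condition, ASSERT_TRUE records a fatal failure and
    // returns from the test body right here.  'acl' is never deleted,
    // so valgrind reports the allocation as "definitely lost".
    ASSERT_TRUE(SomeCheckThatFails());

    delete acl;                          // not reached on the failing path
}

TEST(AclExample, NoLeakWhenAssertFires)
{
    // Holding the allocation in a smart pointer means the early return
    // taken by ASSERT_TRUE still frees the memory.
    std::unique_ptr<Acl> acl(ParseAcl("{}"));

    ASSERT_TRUE(SomeCheckThatFails());
}

Rewriting tests in that second style is essentially the "fix the fails"
option above; the suppressions-file route would instead try to match the
call stacks these bail-outs produce, which is exactly the
pattern-identification effort I mentioned.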

There's got to be people out there way more expert at these kinds of
issues than I am.  How do other projects deal with such things? How
should we proceed?

An example if anyone's curious what these look like:

1. a build that has failed: "47 errors" is not over the overall limit,
but more than 20 of those are "definitely lost" errors, which is over
that category's configured limit:
https://build.iotivity.org/ci/job/iotivity-verify-unit_tests/18404/

2. same build, showing the results of all the tests
https://build.iotivity.org/ci/job/iotivity-verify-unit_tests/18404/valgrindResult/

3. same build, drilling down to the security unittest which shows 17
fails in three categories:
https://build.iotivity.org/ci/job/iotivity-verify-unit_tests/18404/valgrindResult/pid=2754/


