I'm writing this up here in the hope it will stir some discussion and suggest some directions. Let's see if we can hammer this out here; if not, I'll drop it in Jira (or maybe "also").
To set the scene:

- Many components of the code base have unit tests.
- Some of the build combinations run by Jenkins run unit tests.
-- There is a specific unit_tests target which runs the tests twice, once in a secured build, once without (though for the purposes below, absent a pending patch, only one set of results will be considered).
-- Some of the other targets (e.g. Windows) run some unit tests.
- Anywhere unit tests are run, failing unit tests can flip that entire build target to a fail.
- Any failed Jenkins build target causes it to vote -1 on a Gerrit change; if all targets pass, it votes +1.
- An SCons build script which has control over a unit test can add a dependency so the test runs, by calling a run_test function.
- An argument to run_test lets a run of valgrind's memory leak detector take place on the unit test binary (this is done only if the platform supports valgrind, basically: on Linux); if so, the test is not run directly, but under control of valgrind.
- Valgrind leaves a report file in a place Jenkins can find it.
- Jenkins has a task which looks for those files, and if it finds any, uses the results as part of its +1/-1 vote. If there are no memcheck files, that part does not contribute; "no files" is not a failure. For the files it does find, there is a configuration setting which counts various types of failures, and if any category exceeds its limit, that's a -1 vote.

Over time, more and more unit tests have had the valgrind option turned on. Recently Phil found a few places where, though "turned on", the output files did not get saved in a way that Jenkins could find them, so their results were not included in the counts. When we fixed those in the master branch, one of the counts went over the threshold, so for the time being "master is broken". It isn't really broken, but a patch submitted to Gerrit against master will fail the unit-tests part of the Jenkins build, as noted, due to too many "definitely lost" memory leaks, even if the patch itself is fine and in no way contributes to the leaks.

How to resolve this and move forward? (Note: we are currently avoiding pushing the missing-memcheck-file fixes to 1.3-rel, so as not to disrupt release preparation activities.)

- Turn some of the memchecks back off (e.g. revert the filename fixes, so the files go back to not matching the Jenkins pattern and are simply missed). Negative: out of sight, out of mind.
- Tell Jenkins not to fail so easily on these leak issues, either by upping the limits, or by not letting memcheck failures add up to a -1 vote even when over a limit. Negative: at least you could find the failures now if you looked for them, but it's still probably "out of mind" - no failure means no urgency.
- Fix all the fails.
- Fix enough fails to get back to a passing state. (A sketch of what one such fix might look like follows below.)
Negative of the last two: takes time and effort, and we're not blessed with excess resources.
- Teach valgrind to ignore more things that aren't really critical. Negative: takes some work, we still need to evaluate what's real and what is not, and it may not end up much different from the previous option.

Some, maybe the majority, of the leaks are not that critical. As a major example, the security unit tests have a whole bunch of reported leak failures as a result of invoking the googletest framework macro ASSERT_TRUE. As you might guess, this is an assertion: if the condition is false, an error is reported and the function is bailed out of. Thus, any allocations the function may have made (and the acl code in particular seems prone to calling other functions which allocate as a side effect) don't have a chance to be freed; pointers holding those allocations go out of scope and the memory is just lost. This is a memory leak flagged by valgrind, but in my opinion it's not a crucial one.
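To make that concrete, here is a minimal, hypothetical googletest case (made-up names, not code from our tree) showing the pattern valgrind flags:

    // Illustrative only. The allocation happens before ASSERT_TRUE; when the
    // assertion fails, the macro records the failure and returns from the test
    // body early, so free() is never reached and valgrind reports the 128
    // bytes as "definitely lost".
    #include <cstdlib>
    #include <gtest/gtest.h>

    static bool SomeCheckThatFails()    // stand-in for the real condition
    {
        return false;
    }

    TEST(LeakDemo, AssertTrueBailsOutEarly)
    {
        char *buf = static_cast<char *>(std::malloc(128));

        ASSERT_TRUE(SomeCheckThatFails());   // fails -> early return from the test body

        std::free(buf);   // never executed; 'buf' goes out of scope, memory is lost
    }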
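On the "fix the fails" options: a fair number of these could probably be fixed fairly mechanically by giving the allocation an owner that cleans up on any exit path. Again just a sketch with made-up names, not a prescription for how our tests should be written:

    // Same hypothetical test, rewritten so the early return from ASSERT_TRUE
    // cannot leak: the unique_ptr owns the malloc'd buffer and calls free()
    // however the test body exits.
    #include <cstdlib>
    #include <memory>
    #include <gtest/gtest.h>

    static bool SomeCheckThatFails() { return false; }

    TEST(LeakDemo, AssertTrueNoLeak)
    {
        std::unique_ptr<char, decltype(&std::free)> buf(
            static_cast<char *>(std::malloc(128)), &std::free);

        ASSERT_TRUE(SomeCheckThatFails());   // still fails and returns early,
                                             // but the buffer is freed on the way out
    }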
Unit tests are short-lived; they run, they quit, and we're done. Sure, it's possible to leak enough to start failing tests, but it's a different class of problem than the same situation would be in the core stack.

Valgrind can take an exception file to exclude certain fails, but even here there's some effort involved: someone needs to identify the pattern to exclude so that it works for valgrind and doesn't also hide real leaks we care about. (A rough sketch of what such an entry could look like is at the end of this message.)

There have got to be people out there way more expert at these kinds of issues than I am. How do other projects deal with such things? How should we proceed?

An example, if anyone's curious what these look like:

1. A build that has failed. "47 errors" is not over the limit, but inside that list are more than 20 "definitely lost" errors, which is over that configured limit:
https://build.iotivity.org/ci/job/iotivity-verify-unit_tests/18404/
2. The same build, showing the results of all the tests:
https://build.iotivity.org/ci/job/iotivity-verify-unit_tests/18404/valgrindResult/
3. The same build, drilling down to the security unittest, which shows 17 fails in three categories:
https://build.iotivity.org/ci/job/iotivity-verify-unit_tests/18404/valgrindResult/pid=2754/
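P.S. On the valgrind "exception file" route mentioned above: valgrind takes a --suppressions=<file> option, and it can emit candidate entries itself when run with --gen-suppressions=all. Below is only a guess at the shape an entry for the ASSERT_TRUE-style leaks might take; the entry name and the fun: frames are placeholders that would have to come from our actual reports, and a wildcard this broad is exactly the kind of thing that risks hiding leaks we do care about:

    # Hypothetical suppression entry; the test-function pattern below is a
    # placeholder, not taken from a real report.
    {
       security_unittest_assert_bailout
       Memcheck:Leak
       match-leak-kinds: definite
       fun:malloc
       ...
       fun:*SecurityResourceTest*
    }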