Package: cfengine2
Version: 2.2.10-5
Severity: normal
Hi,
So, first let me preface this with the understanding that cfengine2 is
very old code and not being maintained anymore, so I'll understand if
you don't accept the patches I'm proposing. However, since it isn't
changing much, cfengine2 is more more or less stable and works as
expected. Also, we have *a lot* cfinputs rules and it would be far more
costly to switch to another system right now than to simply fix this
bug. The fact that Debian still has a package for cfengine2 makes me
think that enough other folks are in this boat that I thought it might
be worth posting a bug on our experiences.
Ok, so the bug:
1) A long series of preceeding actionsequences.
In our case, just a long set of copy rules.
For instance:
control:
actionsequence = ( copy links )
copy:
somehost::
# stuff that takes at least 1 minute
Turns out that the main reason they're slow is due to poor bdb based
lock implementation that causes an fdatasync() for every
GetLock()/ReleaseLock(), which may be in the thousands.
There's also some time spent waiting for TCP buffers to fill before
exchanging small stat info for already up to date files.
2) A colliding lock name for certain actions.
In our case it was for several long symlink rules:
links:
somehost::
/usr/share/wordpress/wp-content/plugins/advanced-custom-fields
-> /home/vhosts-wp/plugins.dev/advanced-custom-fields
/usr/share/wordpress/wp-content/plugins/advanced-custom-fields-pro
-> /home/vhosts-wp/plugins.dev/advanced-custom-fields-pro
...
These all get the following lock name:
_usr_share_wordpress_wp_content_plugins_advanced_c__home_vhosts_wp_plugins_dev_advanced_custom_fields
Turns out that MakeLinks() in do.c, when attempting to compose a unique
lock name for the rule, truncates both the LHS and RHS and combines
them, before handing it off to GetLock() in locks.c.
Well, GetLock() does it's own quasi checksum based lock name
deduplication, so the initial lock name truncation in MakeLinks()
actually prevents GetLock() from doing the right thing.
Turns out that the copy action processes is also affected by this same
lock name truncation issue.
3) Since the copy action (in our example) takes more than 1 minute, the
check to GetLastLock() in GetLock() fails due to comparing the timestamp
of the first completed colliding link action with the start of the full
DoTree() cfagent run, rather than the first time any link action
actually took place.
To address these things (and document them further), I've made a github
repo here:
https://github.com/bpkroth/cfengine2/issues/1 (correctness)
https://github.com/bpkroth/cfengine2/issues/2 (performance optimizations)
Addressing the performance of the locking by batching up fdatasync()
calls to the cfengine_lock_db [1] helps with all of these issues (in our
situation) and reduces the runtime of cfagent by 50% (in our situation),
however it's ultimately just a workaround.
Adding the TCP_NODELAY flag also helps a bit with the stat delays, but
it's a relatively minor optimization compared with the fdatasyncs [2].
Changing the timestamp checking to be based on the action sequence also
mitigates the issue, though the changes were too invasive to be
complete [3].
Simply not truncating the lock name (at least as much) and filling the
available lock name buffer size to allow GetLock() to do its own
checksum uniquifying also solves this particular issue more directly
[4].
[1]
https://github.com/bpkroth/cfengine2/commit/5543c87f5fb102cecbe60b3d2d6b258c9d8f0baa
[2]
https://github.com/bpkroth/cfengine2/commit/a2807337e37ecad20f6cf65a4b3bdc0ac1b29651
[3]
https://github.com/bpkroth/cfengine2/commit/f346e8153bbe325924d751bd3ccdd830b88f2671
[4]
https://github.com/bpkroth/cfengine2/commit/f6658d41c7d0df76088adace1760aa9d9743403f
Anyways, here are two other branches that also turn these patches into
quilt patches for Debian packaging:
https://github.com/bpkroth/cfengine2/tree/debian/wheezy/cfengine2_2.2.10-5+patches
https://github.com/bpkroth/cfengine2/tree/debian/jessie/cfengine2_2.2.10-5+patches
(they're actually the same given that jessie and wheezy have identical
source packages for cfengine2)
We've been running with them widespread for about a week without any issues.
Let me know if you have any questions or comment.
Thanks for your time.
Cheers,
Brian
*** Reporter, please consider answering these questions, where appropriate ***
* What led up to the situation?
* What exactly did you do (or not do) that was effective (or
ineffective)?
* What was the outcome of this action?
* What outcome did you expect instead?
*** End of the template - remove these template lines ***
-- System Information:
Debian Release: 8.5
APT prefers stable
APT policy: (500, 'stable'), (120, 'testing'), (110, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 3.16.0-4-686-pae (SMP w/1 CPU core)
Locale: LANG=en_US.utf8, LC_CTYPE=en_US.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)
Versions of packages cfengine2 depends on:
ii debconf [debconf-2.0] 1.5.56
ii libc6 2.19-18+deb8u4
ii libdb5.3 5.3.28-9
ii libssl1.0.0 1.0.1t-1+deb8u2
cfengine2 recommends no packages.
cfengine2 suggests no packages.
-- Configuration Files:
/etc/default/cfengine2 changed [not included]
-- debconf information excluded