On Sun, May 20, 2018 at 09:58:10PM +0200, Michael Biebl <bi...@debian.org> wrote:
> That seems very strange. The only case where I personally ran into
> journal file corruption is when I had to power cycle the machine.
> But you said that journald ran uninterrupted for 40 days.
> Would it be possible that this is a hardware or file system issue?

I have an update to this, and can now reproduce the problem: systemd is
likely off the hook for the corruption itself. Clearly it shouldn't crash,
but I can reproduce the corruption now, and it's almost certainly a linux
4.14 bug.

As for background: linux 4.4 was the last kernel which worked on our
servers. At some point in 4.6, we started getting frequent OOM kills a few
hours after booting, despite many gigabytes of memory being "available"
(e.g. used as cache). (You might remember me complaining about missing 4.4
compatibility for this reason - we couldn't switch to 4.9.)

The first kernel that kind of worked for us again was 4.14, but only with
this hourly cronjob:

   echo 3 >/proc/sys/vm/drop_caches

Without it, mysql still gets killed once per week or so. This doesn't work
with debian's 4.9 LTS kernel, which is why we use the 4.14 LTS kernel from
the ubuntu mainline ppa.

And the above command causes corruption of the systemd journal. I have
reproduced this multiple times now, by deleting the journal and restarting
journald, followed by waiting for a day, and then doing this:

   # journalctl --verify
   [everything fine at this point]
   # echo 3 >/proc/sys/vm/drop_caches
   # journalctl --verify
   [journal now reporting corruption problems]

We are in the lucky position to have "expected" md5 checksums for
practically all files on the servers this happens on (and debian packages
usually have md5sum files as well), and luckily, neither the fs itself nor
any other file seems corrupted, including some write-heavy mysql databases
and over 53TB of data we verified.

Only one other program also suffers from corruption: rtorrent, which
doesn't run on many servers :) which is why I found out about it only by
accident. There, the same pattern happens: downloading a torrent is fine,
downloading a torrent while dropping the caches frequently causes file
corruption.

I also have cmp -l output from a corrupted file vs. a correct file, and it
seems the corruption manifests itself as streaks of zero bytes (not aligned
to anything obvious, such as 512 or 4k borders) instead of the real data
that should be there.

I will pursue this with linux upstream.

It's possible that systemd (like rtorrent is known to) does something to
increase the chance of corruption, as it luckily only seems to affect those
two programs, but it's unlikely to be a bug in systemd itself (other than
that it probably shouldn't crash), as drop_caches is supposed to be safe.

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schm...@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\
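
For readers who want to check a file of their own for the zero-byte pattern described above, here is a minimal Python sketch. It is not the procedure used in this report (that was plain cmp -l output); the function name zero_streaks and the command-line handling are made up for illustration. It lists runs where the corrupted copy holds zero bytes and the known-good copy does not, together with the start offset modulo 512 and 4096 so alignment can be eyeballed. It reads both files fully into memory, so it only suits files of modest size, and a zero streak that happens to cover a genuine zero byte in the good copy will show up as two shorter runs.

```python
import sys


def zero_streaks(good: bytes, bad: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) for each run where bad is zero but good is not."""
    streaks = []
    start = None  # offset where the current run began, or None if not in a run
    for i, (g, b) in enumerate(zip(good, bad)):
        if b == 0 and g != 0:
            if start is None:
                start = i
        elif start is not None:
            streaks.append((start, i - start))
            start = None
    if start is not None:
        # The run extends to the end of the shorter input.
        streaks.append((start, min(len(good), len(bad)) - start))
    return streaks


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python3 zero_streaks.py GOOD_FILE CORRUPTED_FILE
    with open(sys.argv[1], "rb") as f:
        good_data = f.read()
    with open(sys.argv[2], "rb") as f:
        bad_data = f.read()
    for offset, length in zero_streaks(good_data, bad_data):
        print(offset, length, offset % 512, offset % 4096)
```

Each output line gives the 0-based start offset, the run length, and the offset modulo 512 and modulo 4096; a remainder of 0 in either of the last two columns would hint at sector or page alignment.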