Rolling back to kernel 5.4 has resolved the issue.
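
(For anyone who wants to sanity-check their own clients, below is a rough sketch of a two-client append test along the lines of the symptom described further down in the thread. The script and the path are illustrative placeholders only, not the actual production workload.)

    # append_check.py -- illustrative sketch only; run it at roughly the same
    # time from two different CephFS clients, then compare the line count to
    # the total number of writes issued.
    import os
    import socket
    import time

    path = "/mnt/cephfs/append_test.log"   # placeholder path on the shared fs

    for i in range(100):
        line = f"{socket.gethostname()} {i} {time.time()}\n".encode()
        # O_APPEND should make each write land at the current end of file
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        os.write(fd, line)
        os.close(fd)
        time.sleep(0.01)

    # afterwards: "wc -l /mnt/cephfs/append_test.log" should equal the total
    # number of writes from both clients; fewer lines means appends were lost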
On Tue, Sep 7, 2021 at 3:51 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Nathan,
>
> > Is this the bug you are referring to? https://tracker.ceph.com/issues/37713
>
> Yes, it's one of them. I believe there were more such reports.
>
> > The main prod filesystems are home
> > directories for hundreds of interactive users using clustered
> > machines, ...
>
> That's exactly what we have; the main file system is home for an HPC cluster.
> And yes, you can control write patterns. Just tell your users that what they
> want to do needs to be done differently, or they have to live with garbled
> files. Worked for us like a charm. The point is, most HPC applications use
> something like MPI, and these frameworks have functions for exactly this
> purpose. The parallel libraries in Python also support this in an easy way.
> Our users had to change or add one function call and make the main process do
> the write. Nobody had a problem with that. If your situation is similar, it
> should work as well.
>
> If you need a selling point, it's also about performance. The impact is
> measurable even, as I wrote, for a healthy client. There are actually more
> instances where multi-node performance depends heavily on which node a
> certain operation is performed on. This is not restricted to Ceph (although
> it is a lot more pronounced than in other distributed file systems), so it is
> a useful lesson to learn in any case, I would say.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Nathan Fish <lordci...@gmail.com>
> Sent: 07 September 2021 21:36:17
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Data loss on appends, prod outage
>
> Thank you for your reply. The main prod filesystems are home
> directories for hundreds of interactive users using clustered
> machines, so we cannot really control the write patterns. In the past,
> metadata performance has indeed been the bottleneck, but it was still
> quite fast enough.
>
> Is this the bug you are referring to? https://tracker.ceph.com/issues/37713
>
> On Tue, Sep 7, 2021 at 3:29 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Nathan,
> >
> > Could be a regression. The write-append bug was a known issue for older
> > kernel clients; I can try to find the link. We have one of the affected
> > kernel versions and asked our users to use a single node for all writes
> > to a file.
> >
> > In general, for distributed/parallel file systems, this is expected
> > behaviour. It is explicitly the duty of the developer to coordinate write
> > processes to prevent simultaneous writes to the same file area. In
> > principle, CephFS should do this for you using client caps; however, I
> > personally think it does so at far too high a performance cost.
> >
> > Depending on what you write to, try to use a collector process on a single
> > node that collects all input from the remote nodes and writes a single
> > consistent stream. Basically what rsyslogd does (maybe you can abuse
> > rsyslogd?). This will also greatly outperform any such application running
> > on a non-broken CephFS client, because the coordination of metadata
> > updates and write locks between clients is unreasonably expensive.
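
(A minimal sketch of the single-writer pattern Frank describes above, using
mpi4py. The library choice, the gather call, and the file path are assumptions
for illustration, not the exact change his users made. Each rank computes its
own part, but only rank 0 ever touches the shared file, so only one CephFS
client appends.)

    # single_writer.py -- hedged sketch; assumes mpi4py is installed and the
    # job is launched with mpirun/srun; the path below is a placeholder.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # every rank does its own work
    local_result = f"rank {rank}: some computed value\n"

    # gather the partial results on rank 0 instead of letting each rank
    # append to the shared file itself
    results = comm.gather(local_result, root=0)

    if rank == 0:
        # only one client writes, so there are no cross-client append races
        with open("/mnt/cephfs/results.log", "a") as f:
            f.writelines(results)

Launched with, for example, "mpirun -n 4 python3 single_writer.py", this funnels
all output through a single writer, which is the same idea as the collector
process above.
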
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Nathan Fish <lordci...@gmail.com>
> > Sent: 07 September 2021 21:17:05
> > To: ceph-users
> > Subject: [ceph-users] Data loss on appends, prod outage
> >
> > As of this morning, when two CephFS clients append to the same file in
> > quick succession, one append sometimes overwrites the other. This
> > happens on some clients but not others; we're still trying to track
> > down the pattern, if any. We've failed all production filesystems to
> > prevent further data loss. We added 3 new OSD servers last week; they
> > finished backfilling a few days ago. Servers are Ubuntu 18.04, clients
> > mostly 18.04 and 20.04, with HWE kernels (5.4 and 5.11 respectively).
> > Ceph was upgraded from Nautilus to Octopus months ago. There were no
> > relevant errors or even warnings in "ceph health" before we stopped
> > the filesystems:
> >
> > HEALTH_ERR mons are allowing insecure global_id reclaim; 20 OSD(s)
> > experiencing BlueFS spillover; 6 filesystems are degraded; 6
> > filesystems are offline
> >
> > ceph versions
> > {
> >     "mon": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
> >     },
> >     "mgr": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
> >     },
> >     "osd": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 200
> >     },
> >     "mds": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 48
> >     },
> >     "rgw": {
> >         "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1
> >     },
> >     "overall": {
> >         "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1,
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 254
> >     }
> > }
> >
> > I looked for bugs on the tracker but didn't see anything that seemed
> > like our issue. Any advice would be appreciated.
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io