Rolling back to kernel 5.4 has resolved the issue.

On Tue, Sep 7, 2021 at 3:51 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Nathan,
>
> > Is this the bug you are referring to? https://tracker.ceph.com/issues/37713
>
> Yes, it's one of them. I believe there were more such reports.
>
> > The main prod filesystems are home
> > directories for hundreds of interactive users using clustered
> > machines, ...
>
> That's exactly what we have: the main file system is home for an HPC cluster.
> And yes, you can control write patterns. Just tell your users that what they
> want to do needs to be done differently, or they will have to live with
> garbled files. That worked for us like a charm. The point is that most HPC
> applications use something like MPI, and these frameworks have functions for
> exactly this purpose. The parallel libraries in Python also support this in an
> easy way. Our users had to change or add one function call so that the main
> process does the write. Nobody had a problem with that. If your situation is
> similar, it should work as well.
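>
> A minimal sketch of that single-writer pattern, assuming mpi4py (the library,
> path and variable names below are illustrative, not what our users actually
> wrote):
>
>     # Gather results from all ranks and let rank 0 do the one append, so
>     # only a single CephFS client ever writes to the shared file.
>     from mpi4py import MPI
>
>     comm = MPI.COMM_WORLD
>     rank = comm.Get_rank()
>
>     local_result = f"rank {rank}: done\n"               # whatever each worker produced
>     results = comm.gather(local_result, root=0)         # list on rank 0, None elsewhere
>
>     if rank == 0:
>         # Only the main process touches the file on CephFS (placeholder path).
>         with open("/cephfs/project/results.log", "a") as out:
>             out.writelines(results)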
>
> If you need a selling point, it's good for performance as well. The impact is
> measurable even for a healthy client, as I wrote. There are actually more
> instances where multi-node performance heavily depends on which node one
> performs a certain operation on. This is not restricted to Ceph (although it
> is a lot more pronounced there than with other distributed file systems), so
> it is a useful lesson to learn in any case, I would say.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Nathan Fish <lordci...@gmail.com>
> Sent: 07 September 2021 21:36:17
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Data loss on appends, prod outage
>
> Thank you for your reply. The main prod filesystems are home
> directories for hundreds of interactive users using clustered
> machines, so we cannot really control the write patterns. In the past,
> metadata performance has indeed been the bottleneck, but it was still
> quite fast enough.
>
> Is this the bug you are referring to? https://tracker.ceph.com/issues/37713
>
> On Tue, Sep 7, 2021 at 3:29 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Nathan,
> >
> > This could be a regression. The write-append bug was a known issue with
> > older kernel clients; I can try to find the link. We have one of the
> > affected kernel versions and asked our users to use a single node for all
> > writes to a file.
> >
> > In general, for distributed/parallel file systems this is expected
> > behaviour. It is explicitly the duty of the developer to coordinate write
> > processes and prevent simultaneous writes to the same file region. In
> > principle, CephFS should do this for you using client caps; however, I
> > personally think it does so at way too high a performance cost.
> >
> > Depending on what you write to, try to use a collector process on a single
> > node that collects all input from the remote nodes and writes a single
> > consistent stream. That is basically what rsyslogd does (maybe you can even
> > abuse rsyslogd for this?). This will also greatly outperform the same
> > application appending from multiple nodes, even on a non-broken CephFS
> > client, because coordinating metadata updates and write locks between
> > clients is unreasonably expensive.
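> >
> > A minimal sketch of such a collector, assuming plain TCP and Python's
> > socketserver (host, port and the output path are placeholders):
> >
> >     # One process on a single node owns the output file; remote nodes send
> >     # their lines over TCP instead of appending to the file themselves.
> >     import socketserver
> >     import threading
> >
> >     OUTPUT = "/cephfs/project/collected.log"   # placeholder path
> >     lock = threading.Lock()
> >     out = open(OUTPUT, "a")
> >
> >     class AppendHandler(socketserver.StreamRequestHandler):
> >         def handle(self):
> >             for line in self.rfile:
> >                 with lock:   # serialise writes inside this one process
> >                     out.write(line.decode(errors="replace"))
> >                     out.flush()
> >
> >     if __name__ == "__main__":
> >         with socketserver.ThreadingTCPServer(("", 5140), AppendHandler) as srv:
> >             srv.serve_forever()
> >
> > Remote nodes can then pipe their output to the collector, e.g. with
> > "some_command | nc collector-host 5140", so only the collector's mount ever
> > appends to the file.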
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Nathan Fish <lordci...@gmail.com>
> > Sent: 07 September 2021 21:17:05
> > To: ceph-users
> > Subject: [ceph-users] Data loss on appends, prod outage
> >
> > As of this morning, when two CephFS clients append to the same file in
> > quick succession, one append sometimes overwrites the other. This
> > happens on some clients but not others; we're still trying to track
> > down the pattern, if any.  We've failed all production filesystems to
> > prevent further data loss. We added 3 new OSD servers last week, they
> > finished backfilling a few days ago. Servers are Ubuntu 18.04, clients
> > mostly 18.04 and 20.04, with HWE kernels (5.4 and 5.11 respectively).
> > Ceph was upgraded from nautilus to octopus months ago. There were no
> > relevant errors or even warnings in "ceph health" before we stopped
> > the filesystems:
> >
> > HEALTH_ERR mons are allowing insecure global_id reclaim; 20 OSD(s) experiencing BlueFS spillover; 6 filesystems are degraded; 6 filesystems are offline
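> >
> > To illustrate the access pattern, a rough sketch of what seems to trigger
> > it (path, tag and count are placeholders); each client runs something like
> > this against the same shared file, and afterwards some of the appended
> > lines are missing:
> >
> >     import os, sys
> >
> >     path = "/mnt/cephfs/append-test.log"        # placeholder path
> >     tag = sys.argv[1] if len(sys.argv) > 1 else os.uname().nodename
> >
> >     # O_APPEND writes from different clients should never overwrite each
> >     # other; counting lines afterwards shows whether any were lost.
> >     fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
> >     for i in range(1000):
> >         os.write(fd, f"{tag} {i}\n".encode())
> >     os.close(fd)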
> >
> > ceph versions
> > {
> >     "mon": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
> >     },
> >     "mgr": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
> >     },
> >     "osd": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 200
> >     },
> >     "mds": {
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 48
> >     },
> >     "rgw": {
> >         "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1
> >     },
> >     "overall": {
> >         "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1,
> >         "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 254
> >     }
> > }
> >
> > I looked for bugs on the tracker but didn't see anything that seemed
> > like our issue. Any advice would be appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
