Hi Nathan,

> Is this the bug you are referring to? https://tracker.ceph.com/issues/37713

Yes, it's one of them. I believe there were more such reports.

> The main prod filesystems are home
> directories for hundreds of interactive users using clustered
> machines, ...

That's exactly what we have; the main file system is home for an HPC cluster. 
And yes, you can control write patterns. Just tell your users that what they 
want to do needs to be done differently, or they will have to live with garbled 
files. That worked like a charm for us. The point is that most HPC applications 
use something like MPI, and these frameworks have functions for exactly this 
purpose. The parallel libraries in Python support this easily as well. Our 
users had to change or add one function call to make the main process do the 
write, and nobody had a problem with that. If your situation is similar, it 
should work for you as well.
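
To give an idea of how small that change typically is, here is a minimal 
sketch of the single-writer pattern in Python with mpi4py (assuming that is 
the MPI binding in use; the file name and data are placeholders, and your 
application's gather step will of course look different):

# Minimal sketch (mpi4py assumed): only the main process (rank 0)
# appends to the shared file, so no two CephFS clients ever write
# to the same file concurrently. File name and data are placeholders.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank produces its part of the output locally.
local_result = "result from rank %d\n" % rank

# Collect everything on rank 0 instead of letting every rank append itself.
all_results = comm.gather(local_result, root=0)

if rank == 0:
    # One process, one client, one ordered write.
    with open("output.log", "a") as f:
        f.writelines(all_results)

Run it with mpirun as usual; the point is simply that only rank 0 ever 
appends to the shared file, so two clients never touch the same file region.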

If you need a selling point, there is a performance argument as well. As I 
wrote, the impact is measurable even on a healthy client. There are, in fact, 
more situations where multi-node performance depends heavily on which node a 
certain operation is performed on. This is not restricted to Ceph (although it 
is a lot more pronounced there than with other distributed file systems), so it 
is a useful lesson to learn in any case, I would say.
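
If a gather inside the application is not an option, the collector approach I 
described in my earlier mail (quoted below) can be bolted on from the outside. 
A minimal sketch, assuming line-oriented output sent over TCP; the path and 
port are placeholders, and in practice rsyslogd or similar is usually the more 
robust choice:

# Minimal sketch of a collector: remote nodes send lines over TCP,
# a single process on one node writes the consistent stream.
# Path and port are placeholders.
import socketserver

OUTPUT = "/cephfs/collected.log"  # hypothetical path on the CephFS mount

class LineHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Append every line received from this client to the single
        # output file; only this collector process ever writes to it.
        # Ordering between senders is simply arrival order.
        with open(OUTPUT, "a") as out:
            for line in self.rfile:
                out.write(line.decode(errors="replace"))
                out.flush()

if __name__ == "__main__":
    # Compute nodes send their output to this node over TCP instead of
    # appending to the file on their own CephFS mount.
    with socketserver.ThreadingTCPServer(("0.0.0.0", 5140), LineHandler) as srv:
        srv.serve_forever()

Since all appends then go through one CephFS client, the broken multi-client 
append path is never exercised, and you avoid the cross-client CAP traffic on 
top.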

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Nathan Fish <lordci...@gmail.com>
Sent: 07 September 2021 21:36:17
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] Data loss on appends, prod outage

Thank you for your reply. The main prod filesystems are home
directories for hundreds of interactive users using clustered
machines, so we cannot really control the write patterns. In the past,
metadata performance has indeed been the bottleneck, but it was still
quite fast enough.

Is this the bug you are referring to? https://tracker.ceph.com/issues/37713

On Tue, Sep 7, 2021 at 3:29 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Nathan,
>
> could be a regression. The write append bug was a known issue for older 
> kernel clients. I can try to find the link. We have one of the affected 
> kernel versions and asked our users to use a single node for all writes to a 
> file.
>
> In general, for distributed/parallel file systems, this is expected 
> behaviour. It is explicitly the duty of the developer to coordinate write 
> processes to prevent simultaneous writes to the same file area. In principle, 
> ceph fs should do this for you using client CAPs, but in my opinion at way 
> too high a performance cost.
>
> Depending on what you write to, try to use a collector process on a single 
> node that collects all input from the remote nodes and writes a single 
> consistent stream. Basically what rsyslogd does (maybe you can abuse 
> rsyslogd?). This will also greatly outperform any such application running on 
> a non-broken ceph-fs client, because the coordination of meta data updates 
> and write locks between clients is unreasonably expensive.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Nathan Fish <lordci...@gmail.com>
> Sent: 07 September 2021 21:17:05
> To: ceph-users
> Subject: [ceph-users] Data loss on appends, prod outage
>
> As of this morning, when two CephFS clients append to the same file in
> quick succession, one append sometimes overwrites the other. This
> happens on some clients but not others; we're still trying to track
> down the pattern, if any.  We've failed all production filesystems to
> prevent further data loss. We added 3 new OSD servers last week, they
> finished backfilling a few days ago. Servers are Ubuntu 18.04, clients
> mostly 18.04 and 20.04, with HWE kernels (5.4 and 5.11 respectively).
> Ceph was upgraded from nautilus to octopus months ago. There were no
> relevant errors or even warnings in "ceph health" before we stopped
> the filesystems:
>
> HEALTH_ERR mons are allowing insecure global_id reclaim; 20 OSD(s)
> experiencing BlueFS spillover; 6 filesystems are degraded; 6
> filesystems are offline
>
> ceph versions
> {
>     "mon": {
>         "ceph version 15.2.14
> (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 15.2.14
> (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
>     },
>     "osd": {
>         "ceph version 15.2.14
> (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 200
>     },
>     "mds": {
>         "ceph version 15.2.14
> (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 48
>     },
>     "rgw": {
>         "ceph version 15.2.13
> (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1
>     },
>     "overall": {
>         "ceph version 15.2.13
> (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1,
>         "ceph version 15.2.14
> (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 254
>     }
> }
>
> I looked for bugs on the tracker but didn't see anything that seemed
> like our issue. Any advice would be appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
