[ceph-users] Re: osd is immidietly down and uses CPU full.
On Mon, 3 Feb 2020 at 08:25, Wido den Hollander wrote:
> > The crash happens, when the osd wants to read from pipe when processing
> > heartbeat. To me it sounds like a networking issue.
>
> It could also be that this OSD is so busy internally with other stuff
> that it doesn't respond to heartbeats and then commits suicide.
>
> Combined with the comment that VMs can't read their data it could very
> well be that the OSD is super busy.
>
> Maybe try a compact of the LevelDB database.

I think I am with Wido on this one: if you get one or a few PGs so full of metadata or weird stuff that handling them takes longer than the suicide timeout, then it will behave like this. At startup the OSD tries to complete whatever operation was in its queue (scrubs, recovery, something similar), gets stuck doing that instead of answering heartbeats or finishing operations requested by other OSDs or clients, and gets ejected from the cluster.

If it is anything like what we see on our Jewel cluster, you can move these PGs around (with impact to clients), but you can't "fix" them without deeper changes such as moving from leveldb to rocksdb (if filestore), splitting PGs, or sharding buckets if it is RGW metadata that causes these huge indexes to end up on a single OSD. You need to figure out what the root cause is and aim to fix that part.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
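For reference, a compaction of the OSD's key/value store can be triggered roughly as follows; osd.NN is a placeholder and the path assumes a default filestore layout, so adjust both for the actual deployment:

    # online, while the OSD is running
    ceph tell osd.NN compact

    # offline, with the OSD stopped, for a filestore/leveldb OSD
    ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-NN/current/omap compact

If the cleanup work itself takes longer than the suicide timeout, temporarily raising osd_op_thread_suicide_timeout on that OSD may let it get through startup.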
[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts
Dear Konstantin and Patrick, thanks!

I started migrating a 2-pool layout ceph fs (rep meta, EC default data) to a 3-pool layout (rep meta, rep default data, EC data set at "/") and use sub-directory mounts for data migration. So far, everything is as it should be.

Maybe some background info for everyone who is reading this. The reason for migrating is the modified best practices for cephfs; compare these two:

https://docs.ceph.com/docs/mimic/cephfs/createfs/#creating-pools
https://docs.ceph.com/docs/master/cephfs/createfs/#creating-pools

The 3-pool layout was never mentioned in the RH ceph-course I took, nor by any of the ceph consultants we hired before deploying ceph. However, it seems really important to know about it. For a metadata + data pool layout, since some meta-data is written to the default data pool, an EC default data pool seems a bad idea most of the time. I see a lot of size-0 objects that only store rados meta data:

POOLS:
    NAME           ID     USED        %USED     MAX AVAIL     OBJECTS
    con-fs2-meta1  12      256 MiB     0.02       1.1 TiB       410910
    con-fs2-meta2  13          0 B     0          355 TiB      5217644
    con-fs2-data   14       50 TiB     5.53       852 TiB     17943209

con-fs2-meta is the default data pool. This is probably the worst workload for an EC pool. On our file system I have regularly seen "one MDS reports slow meta-data IOs" and was always wondering where this comes from. I have the meta-data pool on SSDs and this warning simply didn't make any sense. Now I know.

Having a small replicated default pool resolves not only this issue, it also speeds up file create/delete and hard-link operations dramatically. I guess anything that modifies an inode. I never tested these operations in my benchmarks, but they are important. Compiling and installing packages etc., anything with a heavy create/modify/delete workload, will profit, as will cluster health.

Fortunately, I had an opportunity to migrate the ceph fs. For anyone who starts new, I would recommend having the 3-pool layout right from the beginning. Never use an EC pool as the default data pool. I would even make this statement a bit stronger in the ceph documentation, from

  If erasure-coded pools are planned for the file system, it is usually better to use a replicated pool for the default data pool ...

to, for example,

  If erasure-coded pools are planned for the file system, it is strongly recommended to use a replicated pool for the default data pool ...

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Patrick Donnelly
Sent: 02 February 2020 12:41
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] ceph fs dir-layouts and sub-directory mounts

On Wed, Jan 29, 2020 at 3:04 AM Frank Schilder wrote:
> I would like to (in this order)
>
> - set the data pool for the root "/" of a ceph-fs to a custom value, say "P" (not the initial data pool used in fs new)
> - create a sub-directory of "/", for example "/a"
> - mount the sub-directory "/a" with a client key with access restricted to "/a"
>
> The client will not be able to see the dir layout attribute set at "/", its not mounted.

The client gets the file layout information when the file is created (i.e. the RPC response from the MDS). It doesn't have _any_ access to "/". It can't even stat "/".

> Will the data of this client still go to the pool "P", that is, does "/a" inherit the dir layout transparently to the client when following the steps above?

Yes.

--
Patrick Donnelly, Ph.D.
He / Him / His Senior Software Engineer Red Hat Sunnyvale, CA GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
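For anyone wanting to reproduce the setup discussed above, a minimal sketch of the steps could look like this (pool name P, directory /a and client name a are the placeholders from the thread; the mount point, monitor address and secret file are assumptions):

    # make the EC pool available to the fs and set it as the layout for "/"
    ceph fs add_data_pool cephfs P
    setfattr -n ceph.dir.layout.pool -v P /mnt/cephfs     # on a client with "/" mounted

    # create the sub-directory and a client key restricted to it
    mkdir /mnt/cephfs/a
    ceph fs authorize cephfs client.a /a rw

    # on the restricted client, mount only the sub-directory
    mount -t ceph mon1:6789:/a /mnt/a -o name=a,secretfile=/etc/ceph/a.secret

As Patrick explains, files created under /a by that client still land in pool P, even though the client can never see or stat "/".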
[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts
errata: con-fs2-meta2 is the default data pool. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 03 February 2020 10:08 To: Patrick Donnelly; Konstantin Shalygin Cc: ceph-users Subject: Re: [ceph-users] ceph fs dir-layouts and sub-directory mounts Dear Konstantin and Patrick, thanks! I started migrating a 2-pool layout ceph fs (rep meta, EC default data) to a 3-pool layout (rep meta, rep default data, EC data set at "/") and use sub-directory mounts for data migration. So far, everything as it should. Maybe some background info for everyone who is reading this. The reason for migrating is the modified best practices for cephfs, compare these two: https://docs.ceph.com/docs/mimic/cephfs/createfs/#creating-pools https://docs.ceph.com/docs/master/cephfs/createfs/#creating-pools The 3-pool layout was never mentioned in the RH ceph-course I took, nor by any of the ceph consultants we hired before deploying ceph. However, it seems really important to know about it. For a meta data + data pool layout, since some meta-data is written to the default data pool, an EC default data pool seems a bad idea most of the time. I see a lot of size-0 objects that only store rados meta data: POOLS: NAME ID USED%USED MAX AVAIL OBJECTS con-fs2-meta112 256 MiB 0.02 1.1 TiB 410910 con-fs2-meta213 0 B 0 355 TiB 5217644 con-fs2-data 14 50 TiB 5.53 852 TiB 17943209 con-fs2-meta2 is the default data pool. This is probably the worst workload for an EC pool. On our file system I have regularly seen "one MDS reports slow meta-data IOs" and was always wondering where this comes from. I have the meta-data pool on SSDs and this warning simply didn't make any sense. Now I know. Having a small replicated default pool resolves not only this issue, it also speeds up file create/delete and hard-link operations dramatically. I guess, anything that modifies an inode. I never tested these operations in my benchmarks, but they are important. Compiling and installing packages etc., anything with heavy create/modify/delete workload will profit as well as cluster health. Fortunately, I had an opportunity to migrate the ceph fs. For anyone who starts new, I would recommend to have the 3-pool layout right from the beginning. Never use an EC pool as the default data pool. I would even make this statement a bit stronger in the ceph documentation: If erasure-coded pools are planned for the file system, it is usually better to use a replicated pool for the default data pool ... to, for example, If erasure-coded pools are planned for the file system, it is strongly recommended to use a replicated pool for the default data pool ... Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: kernel client osdc ops stuck and mds slow reqs
On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov wrote: > > On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster wrote: > > > > Hi Ilya, > > > > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov wrote: > > > > > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster > > > wrote: > > > > > > > > Hi all, > > > > > > > > We are quite regularly (a couple times per week) seeing: > > > > > > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs > > > > report slow requests > > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability > > > > release > > > > mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond > > > > to capability release client_id: 52919162 > > > > MDS_SLOW_REQUEST 1 MDSs report slow requests > > > > mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs > > > > > > > > Which is being caused by osdc ops stuck in a kernel client, e.g.: > > > > > > > > 10:57:18 root hpc-be028 /root > > > > → cat > > > > /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc > > > > REQUESTS 9 homeless 0 > > > > 46559317osd2433.ee6ffcdb3.cdb[243,501,92]/243 > > > > [243,501,92]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057 > > > > 0x4000141read > > > > 46559322osd2433.ee6ffcdb3.cdb[243,501,92]/243 > > > > [243,501,92]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057 > > > > 0x4000141read > > > > 46559323osd2433.969cc5733.573[243,330,226]/243 > > > > [243,330,226]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > 0x4000141read > > > > 46559341osd2433.969cc5733.573[243,330,226]/243 > > > > [243,330,226]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > 0x4000141read > > > > 46559342osd2433.969cc5733.573[243,330,226]/243 > > > > [243,330,226]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > 0x4000141read > > > > 46559345osd2433.969cc5733.573[243,330,226]/243 > > > > [243,330,226]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > 0x4000141read > > > > 46559621osd2433.6313e8ef3.8ef[243,330,521]/243 > > > > [243,330,521]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.007a > > > > 0x4000141read > > > > 46559629osd2433.b280c8523.852[243,113,539]/243 > > > > [243,113,539]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.007f > > > > 0x4000141read > > > > 46559928osd2433.1ee7bab43.ab4[243,332,94]/243 > > > > [243,332,94]/243e678697 > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.073f > > > > 0x4000241write > > > > LINGER REQUESTS > > > > BACKOFFS > > > > > > > > > > > > We can unblock those requests by doing `ceph osd down osd.243` (or > > > > restarting osd.243). > > > > > > > > This is ceph v14.2.6 and the client kernel is el7 > > > > 3.10.0-957.27.2.el7.x86_64. > > > > > > > > Are there a better way to debug this? > > > > > > Hi Dan, > > > > > > I assume that these ops don't show up as slow requests on the OSD side? > > > How long did you see it stuck for before intervening? > > > > That's correct -- the osd had no active ops (ceph daemon ops). > > > > The late release slow req was stuck for 4129s before we intervened. > > > > > Do you happen to have "debug ms = 1" logs from osd243? > > > > Nope, but I can try to get it afterwards next time. 
(Though you need > > it at the moment the ops get stuck, not only from the moment we notice > > the stuck ops, right?) > > Yes, starting before the moment the ops get stuck and ending after you > kick the OSD. > > > > > > Do you have PG autoscaler enabled? Any PG splits and/or merges at the > > > time? > > > > Not on the cephfs_(meta)data pools (though on the 30th I increased > > those pool sizes from 2 to 3). And also on the 30th I did some PG > > merging on an unrelated test pool. > > And anyway we have seen this type of lockup in the past, without those > > pool changes (also with mimic MDS until we upgraded to nautilus). > > The MDS is out of question here. This issue is between the kernel > client and the OSD. > > > > > Looking back further in the client's kernel log we see a page alloc > > failure on the 30th: > > > > Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page > > allocation failure: order:5, mode:0x104050 > > Jan 30 16:16:35 hpc-be028.cern.ch kernel: CPU: 1 PID: 78445 Comm: > > kworker/1:36 Kdump: loaded Tainted: P > > Jan 30 16:16:35 hpc-be028.cern.ch kernel: Workqueue: ceph-msgr > > ceph_con_workfn [libceph] > > Can you share the stack trace? That's a 128k allocati
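For anyone wanting to capture the log Ilya is asking for, something along these lines should work on a Nautilus cluster (osd.243 is simply the OSD from the example above):

    # enable message-level debugging on that one OSD (the log grows quickly)
    ceph config set osd.243 debug_ms 1

    # ... wait until stuck osdc ops show up on a client ...

    # kick the OSD so the client resends, then collect /var/log/ceph/ceph-osd.243.log
    ceph osd down 243

    # switch debugging off again
    ceph config rm osd.243 debug_ms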
[ceph-users] Re: data loss on full file system?
On Sun, Feb 2, 2020 at 9:35 PM Håkan T Johansson wrote: > > > Changing cp (or whatever standard tool is used) to call fsync() before > each close() is not an option for a user. Also, doing that would lead to > terrible performance generally. Just tested - a recursive copy of a 70k > files linux source tree went from 15 s to 6 minutes on a local filesystem > I have at hand. Don't do it for every file: cp foo bar; sync > > Best regards, > Håkan > > > > > > > > > Paul > > > > -- > > Paul Emmerich > > > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > > > croit GmbH > > Freseniusstr. 31h > > 81247 München > > www.croit.io > > Tel: +49 89 1896585 90 > > > > On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson > > wrote: > >> > >> > >> Hi, > >> > >> for test purposes, I have set up two 100 GB OSDs, one > >> taking a data pool and the other metadata pool for cephfs. > >> > >> Am running 14.2.6-1-gffd69200ad-1 with packages from > >> https://mirror.croit.io/debian-nautilus > >> > >> Am then running a program that creates a lot of 1 MiB files by calling > >>fopen() > >>fwrite() > >>fclose() > >> for each of them. Error codes are checked. > >> > >> This works successfully for ~100 GB of data, and then strangely also > >> succeeds > >> for many more 100 GB of data... ?? > >> > >> All written files have size 1 MiB with 'ls', and thus should contain the > >> data > >> written. However, on inspection, the files written after the first ~100 > >> GiB, > >> are full of just 0s. (hexdump -C) > >> > >> > >> To further test this, I use the standard tool 'cp' to copy a few > >> random-content > >> files into the full cephfs filessystem. cp reports no complaints, and > >> after > >> the copy operations, content is seen with hexdump -C. However, after > >> forcing > >> the data out of cache on the client by reading other earlier created files, > >> hexdump -C show all-0 content for the files copied with 'cp'. Data that > >> was > >> there is suddenly gone...? > >> > >> > >> I am new to ceph. Is there an option I have missed to avoid this > >> behaviour? > >> (I could not find one in > >> https://docs.ceph.com/docs/master/man/8/mount.ceph/ ) > >> > >> Is this behaviour related to > >> https://docs.ceph.com/docs/mimic/cephfs/full/ > >> ? > >> > >> (That page states 'sometime after a write call has already returned 0'. > >> But if > >> write returns 0, then no data has been written, so the user program would > >> not > >> assume any kind of success.) > >> > >> Best regards, > >> > >> Håkan > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
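To illustrate the "cp then sync" suggestion: with a reasonably recent coreutils (8.24 or later), sync can also be pointed at a specific file, in which case it calls fsync() on it and returns a non-zero exit status if the flush fails, so errors remain detectable without patching cp (paths here are only examples):

    # single file: copy, then flush and check the result
    cp foo /mnt/cephfs/bar && sync /mnt/cephfs/bar || echo "copy or flush failed" >&2

    # whole tree: copy everything, then flush once at the end instead of per file
    cp -r src /mnt/cephfs/dst && sync -f /mnt/cephfs/dst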
[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts
On Mon, Feb 3, 2020 at 1:09 AM Frank Schilder wrote: > Fortunately, I had an opportunity to migrate the ceph fs. For anyone who > starts new, I would recommend to have the 3-pool layout right from the > beginning. Never use an EC pool as the default data pool. I would even make > this statement a bit stronger in the ceph documentation: > > If erasure-coded pools are planned for the file system, it is usually > better to use a replicated pool for the default data pool ... > > to, for example, > > If erasure-coded pools are planned for the file system, it is strongly > recommended to use a replicated pool for the default data pool ... We're going even further by having the monitors warn you if you try to do this: https://tracker.ceph.com/issues/42450 Backports to Nautilus and Mimic are already in flight. -- Patrick Donnelly, Ph.D. He / Him / His Senior Software Engineer Red Hat Sunnyvale, CA GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts
Thumbs up for that! Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Patrick Donnelly Sent: 03 February 2020 11:18 To: Frank Schilder Cc: Konstantin Shalygin; ceph-users Subject: Re: [ceph-users] ceph fs dir-layouts and sub-directory mounts On Mon, Feb 3, 2020 at 1:09 AM Frank Schilder wrote: > Fortunately, I had an opportunity to migrate the ceph fs. For anyone who > starts new, I would recommend to have the 3-pool layout right from the > beginning. Never use an EC pool as the default data pool. I would even make > this statement a bit stronger in the ceph documentation: > > If erasure-coded pools are planned for the file system, it is usually > better to use a replicated pool for the default data pool ... > > to, for example, > > If erasure-coded pools are planned for the file system, it is strongly > recommended to use a replicated pool for the default data pool ... We're going even further by having the monitors warn you if you try to do this: https://tracker.ceph.com/issues/42450 Backports to Nautilus and Mimic are already in flight. -- Patrick Donnelly, Ph.D. He / Him / His Senior Software Engineer Red Hat Sunnyvale, CA GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: kernel client osdc ops stuck and mds slow reqs
On Mon, Feb 3, 2020 at 10:38 AM Dan van der Ster wrote: > > On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov wrote: > > > > On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster > > wrote: > > > > > > Hi Ilya, > > > > > > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov wrote: > > > > > > > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster > > > > wrote: > > > > > > > > > > Hi all, > > > > > > > > > > We are quite regularly (a couple times per week) seeing: > > > > > > > > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs > > > > > report slow requests > > > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability > > > > > release > > > > > mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond > > > > > to capability release client_id: 52919162 > > > > > MDS_SLOW_REQUEST 1 MDSs report slow requests > > > > > mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs > > > > > > > > > > Which is being caused by osdc ops stuck in a kernel client, e.g.: > > > > > > > > > > 10:57:18 root hpc-be028 /root > > > > > → cat > > > > > /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc > > > > > REQUESTS 9 homeless 0 > > > > > 46559317osd2433.ee6ffcdb3.cdb[243,501,92]/243 > > > > > [243,501,92]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057 > > > > > 0x4000141read > > > > > 46559322osd2433.ee6ffcdb3.cdb[243,501,92]/243 > > > > > [243,501,92]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057 > > > > > 0x4000141read > > > > > 46559323osd2433.969cc5733.573[243,330,226]/243 > > > > > [243,330,226]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > 0x4000141read > > > > > 46559341osd2433.969cc5733.573[243,330,226]/243 > > > > > [243,330,226]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > 0x4000141read > > > > > 46559342osd2433.969cc5733.573[243,330,226]/243 > > > > > [243,330,226]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > 0x4000141read > > > > > 46559345osd2433.969cc5733.573[243,330,226]/243 > > > > > [243,330,226]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > 0x4000141read > > > > > 46559621osd2433.6313e8ef3.8ef[243,330,521]/243 > > > > > [243,330,521]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.007a > > > > > 0x4000141read > > > > > 46559629osd2433.b280c8523.852[243,113,539]/243 > > > > > [243,113,539]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.007f > > > > > 0x4000141read > > > > > 46559928osd2433.1ee7bab43.ab4[243,332,94]/243 > > > > > [243,332,94]/243e678697 > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.073f > > > > > 0x4000241write > > > > > LINGER REQUESTS > > > > > BACKOFFS > > > > > > > > > > > > > > > We can unblock those requests by doing `ceph osd down osd.243` (or > > > > > restarting osd.243). > > > > > > > > > > This is ceph v14.2.6 and the client kernel is el7 > > > > > 3.10.0-957.27.2.el7.x86_64. > > > > > > > > > > Are there a better way to debug this? > > > > > > > > Hi Dan, > > > > > > > > I assume that these ops don't show up as slow requests on the OSD side? > > > > How long did you see it stuck for before intervening? > > > > > > That's correct -- the osd had no active ops (ceph daemon ops). 
> > > > > > The late release slow req was stuck for 4129s before we intervened. > > > > > > > Do you happen to have "debug ms = 1" logs from osd243? > > > > > > Nope, but I can try to get it afterwards next time. (Though you need > > > it at the moment the ops get stuck, not only from the moment we notice > > > the stuck ops, right?) > > > > Yes, starting before the moment the ops get stuck and ending after you > > kick the OSD. > > > > > > > > > Do you have PG autoscaler enabled? Any PG splits and/or merges at the > > > > time? > > > > > > Not on the cephfs_(meta)data pools (though on the 30th I increased > > > those pool sizes from 2 to 3). And also on the 30th I did some PG > > > merging on an unrelated test pool. > > > And anyway we have seen this type of lockup in the past, without those > > > pool changes (also with mimic MDS until we upgraded to nautilus). > > > > The MDS is out of question here. This issue is between the kernel > > client and the OSD. > > > > > > > > Looking back further in the client's kernel log we see a page alloc > > > failure on the 30th: > > > > > > Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page > > > allocation failu
[ceph-users] Re: kernel client osdc ops stuck and mds slow reqs
On Mon, Feb 3, 2020 at 11:50 AM Ilya Dryomov wrote: > > On Mon, Feb 3, 2020 at 10:38 AM Dan van der Ster wrote: > > > > On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov wrote: > > > > > > On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster > > > wrote: > > > > > > > > Hi Ilya, > > > > > > > > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov > > > > wrote: > > > > > > > > > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster > > > > > wrote: > > > > > > > > > > > > Hi all, > > > > > > > > > > > > We are quite regularly (a couple times per week) seeing: > > > > > > > > > > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 > > > > > > MDSs > > > > > > report slow requests > > > > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability > > > > > > release > > > > > > mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to > > > > > > respond > > > > > > to capability release client_id: 52919162 > > > > > > MDS_SLOW_REQUEST 1 MDSs report slow requests > > > > > > mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs > > > > > > > > > > > > Which is being caused by osdc ops stuck in a kernel client, e.g.: > > > > > > > > > > > > 10:57:18 root hpc-be028 /root > > > > > > → cat > > > > > > /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc > > > > > > REQUESTS 9 homeless 0 > > > > > > 46559317osd2433.ee6ffcdb3.cdb[243,501,92]/243 > > > > > > [243,501,92]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057 > > > > > > 0x4000141read > > > > > > 46559322osd2433.ee6ffcdb3.cdb[243,501,92]/243 > > > > > > [243,501,92]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057 > > > > > > 0x4000141read > > > > > > 46559323osd2433.969cc5733.573[243,330,226]/243 > > > > > > [243,330,226]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > > 0x4000141read > > > > > > 46559341osd2433.969cc5733.573[243,330,226]/243 > > > > > > [243,330,226]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > > 0x4000141read > > > > > > 46559342osd2433.969cc5733.573[243,330,226]/243 > > > > > > [243,330,226]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > > 0x4000141read > > > > > > 46559345osd2433.969cc5733.573[243,330,226]/243 > > > > > > [243,330,226]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056 > > > > > > 0x4000141read > > > > > > 46559621osd2433.6313e8ef3.8ef[243,330,521]/243 > > > > > > [243,330,521]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.007a > > > > > > 0x4000141read > > > > > > 46559629osd2433.b280c8523.852[243,113,539]/243 > > > > > > [243,113,539]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.007f > > > > > > 0x4000141read > > > > > > 46559928osd2433.1ee7bab43.ab4[243,332,94]/243 > > > > > > [243,332,94]/243e678697 > > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.073f > > > > > > 0x4000241write > > > > > > LINGER REQUESTS > > > > > > BACKOFFS > > > > > > > > > > > > > > > > > > We can unblock those requests by doing `ceph osd down osd.243` (or > > > > > > restarting osd.243). > > > > > > > > > > > > This is ceph v14.2.6 and the client kernel is el7 > > > > > > 3.10.0-957.27.2.el7.x86_64. > > > > > > > > > > > > Are there a better way to debug this? 
> > > > > > > > > > Hi Dan, > > > > > > > > > > I assume that these ops don't show up as slow requests on the OSD > > > > > side? > > > > > How long did you see it stuck for before intervening? > > > > > > > > That's correct -- the osd had no active ops (ceph daemon ops). > > > > > > > > The late release slow req was stuck for 4129s before we intervened. > > > > > > > > > Do you happen to have "debug ms = 1" logs from osd243? > > > > > > > > Nope, but I can try to get it afterwards next time. (Though you need > > > > it at the moment the ops get stuck, not only from the moment we notice > > > > the stuck ops, right?) > > > > > > Yes, starting before the moment the ops get stuck and ending after you > > > kick the OSD. > > > > > > > > > > > > Do you have PG autoscaler enabled? Any PG splits and/or merges at > > > > > the time? > > > > > > > > Not on the cephfs_(meta)data pools (though on the 30th I increased > > > > those pool sizes from 2 to 3). And also on the 30th I did some PG > > > > merging on an unrelated test pool. > > > > And anyway we have seen this type of lockup in the past, without those > > > > pool changes (also with mimic MDS until we upgraded to
[ceph-users] cpu and memory for OSD server
We have 18 SATA disks (each 2TB) on a physical server, each disk with an OSD deployed. I am not sure how much CPU and memory should be provisioned for this server. Does each OSD require a physical CPU? And how do we calculate memory usage? Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
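There is no hard rule, but a commonly used rule of thumb is roughly one CPU core (or two hardware threads) per HDD OSD, and memory of about osd_memory_target per BlueStore OSD plus OS overhead, so for 18 OSDs something in the 80-96 GB range with the default 4 GiB target. On releases that have it, the target can be tuned down for a RAM-constrained host (values below are only examples; older Luminous BlueStore uses the bluestore_cache_size options instead):

    # per-OSD BlueStore memory target (default 4 GiB on recent releases)
    ceph config set osd osd_memory_target 4294967296

    # e.g. shrink to 2 GiB per OSD on a RAM-constrained host
    ceph config set osd osd_memory_target 2147483648

Note that OSDs can temporarily exceed the target during recovery/backfill.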
[ceph-users] v12.2.13 Luminous released
We're happy to announce 13th bug fix release of the Luminous v12.2.x long term stable release series. We recommend that all users upgrade to this release. Many thanks to all the contributors, in particular Yuri & Nathan, in getting this release out of the door. This shall be the last release of the Luminous series. For a detailed release notes, please check out the official blog entry at https://ceph.io/releases/v12-2-13-luminous-released/ Notable Changes --- * Ceph now packages python bindings for python3.6 instead of python3.4, because EPEL7 recently switched from python3.4 to python3.6 as the native python3. see the announcement[1] for more details on the background of this change. * We now have telemetry support via a ceph-mgr module. The telemetry module is absolutely on an opt-in basis, and is meant to collect generic cluster information and push it to a central endpoint. By default, we're pushing it to a project endpoint at https://telemetry.ceph.com/report, but this is customizable using by setting the 'url' config option with:: ceph telemetry config-set url '' You will have to opt-in on sharing your information with:: ceph telemetry on You can view exactly what information will be reported first with:: ceph telemetry show Should you opt-in, your information will be licensed under the Community Data License Agreement - Sharing - Version 1.0, which you can read at https://cdla.io/sharing-1-0/ The telemetry module reports information about CephFS file systems, including: - how many MDS daemons (in total and per file system) - which features are (or have been) enabled - how many data pools - approximate file system age (year + month of creation) - how much metadata is being cached per file system As well as: - whether IPv4 or IPv6 addresses are used for the monitors - whether RADOS cache tiering is enabled (and which mode) - whether pools are replicated or erasure coded, and which erasure code profile plugin and parameters are in use - how many RGW daemons, zones, and zonegroups are present; which RGW frontends are in use - aggregate stats about the CRUSH map, like which algorithms are used, how big buckets are, how many rules are defined, and what tunables are in use * A health warning is now generated if the average osd heartbeat ping time exceeds a configurable threshold for any of the intervals computed. The OSD computes 1 minute, 5 minute and 15 minute intervals with average, minimum and maximum values. New configuration option `mon_warn_on_slow_ping_ratio` specifies a percentage of `osd_heartbeat_grace` to determine the threshold. A value of zero disables the warning. New configuration option `mon_warn_on_slow_ping_time` specified in milliseconds over-rides the computed value, causes a warning when OSD heartbeat pings take longer than the specified amount. New admin command `ceph daemon mgr.# dump_osd_network [threshold]` command will list all connections with a ping time longer than the specified threshold or value determined by the config options, for the average for any of the 3 intervals. New admin command `ceph daemon osd.# dump_osd_network [threshold]` will do the same but only including heartbeats initiated by the specified OSD. * The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap balancing has been removed. Instead use the mgr balancer config `upmap_max_deviation` which now is an integer number of PGs of deviation from the target PGs per OSD. This can be set with a command like `ceph config set mgr mgr/balancer/upmap_max_deviation 2`. 
The default `upmap_max_deviation` is 1. There are situations where crush rules would not allow a pool to ever have completely balanced PGs. For example, if crush requires 1 replica on each of 3 racks, but there are fewer OSDs in 1 of the racks. In those cases, the configuration value can be increased. Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://download.ceph.com/tarballs/ceph-12.2.13.tar.gz * For packages, see http://docs.ceph.com/docs/master/install/get-packages/ * Release git sha1: 584a20eb0237c657dc0567da126be145106aa47e [1]: https://lists.fedoraproject.org/archives/list/epel-annou...@lists.fedoraproject.org/message/EGUMKAIMPK2UD5VSHXM53BH2MBDGDWMO/ -- Abhishek Lekshmanan SUSE Software Solutions Germany GmbH GF: Felix Imendörffer HRB 21284 (AG Nürnberg) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
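As a concrete example of the new heartbeat-ping commands described above (daemon names are placeholders; the threshold argument is in milliseconds):

    # from the node running the active mgr: all OSD heartbeat paths slower than 1000 ms
    ceph daemon mgr.$(hostname -s) dump_osd_network 1000

    # only heartbeats initiated by osd.0
    ceph daemon osd.0 dump_osd_network 1000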
[ceph-users] v14.2.7 Nautilus released
This is the seventh update to the Ceph Nautilus release series. This is a hotfix release primarily fixing a couple of security issues. We recommend that all users upgrade to this release. Notable Changes --- * CVE-2020-1699: Fixed a path traversal flaw in Ceph dashboard that could allow for potential information disclosure (Ernesto Puerta) * CVE-2020-1700: Fixed a flaw in RGW beast frontend that could lead to denial of service from an unauthenticated client (Or Friedmann) Blog Link: https://ceph.io/releases/v14-2-7-nautilus-released/ Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://download.ceph.com/tarballs/ceph-14.2.7.tar.gz * For packages, see http://docs.ceph.com/docs/master/install/get-packages/ * Release git sha1: 3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8 -- Abhishek Lekshmanan SUSE Software Solutions Germany GmbH ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Problem with OSD - stuck in CPU loop after rbd snapshot mount
Hi all,

I have a small cluster and yesterday I tried to mount an older RBD snapshot to recover data. (I have approx. 230 daily snapshots of one RBD image on my small ceph.) After I did the mount and an ls operation, the cluster was stuck and I noticed that 2 of my OSDs ate CPU and grew in memory usage (more than 4x more memory than usual). The cluster was unusable after that, and stopping the rbd mount operation, restarting the OSDs etc. didn't help.

Strangely, it affects only 2 of my OSDs - one is on NVMe and has cache PGs mapped, the other one is a rotational HDD. IOPS are normal, only CPU and memory usage are unusual.

I noticed hundreds of thousands of slow requests (on an idle cluster), so I started to debug OSD operations. You can find the OSD 8 log below. OSD 9 contains the NVMe cached PGs and there is nothing unusual in its standard log. In the middle of the log I raised the debug level to 10. It looks to me like it is trying to do one operation with one snapshotted RBD object in PG 2.b (rbd_data.9e87fa74b0dc51.0001d1a9:7e3) in a loop.

Is there any way to stop that? What is causing the problem? Can I somehow flush the OSD's "journals" or "queues" to stop the current operation and make the cluster usable again? Is there any way I can access the old snapshot data?

The problem started on Luminous 12.2.12 and an upgrade to 12.2.13 didn't help.

Thank you
Jan Pekar

2020-02-03 15:09:22.470277 7f51f6af7700 0 log_channel(cluster) log [WRN] : 30966 slow requests, 5 included below; oldest blocked for > 37.505482 secs
2020-02-03 15:09:22.470336 7f51f6af7700 0 log_channel(cluster) log [WRN] : slow request 37.503404 seconds old, received at 2020-02-03 15:08:44.956422: osd_op(osd.9.59923:2670991 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470360 7f51f6af7700 0 log_channel(cluster) log [WRN] : slow request 37.500695 seconds old, received at 2020-02-03 15:08:44.959132: osd_op(osd.9.59923:2671023 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470366 7f51f6af7700 0 log_channel(cluster) log [WRN] : slow request 37.498013 seconds old, received at 2020-02-03 15:08:44.961814: osd_op(osd.9.59923:2671055 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470371 7f51f6af7700 0 log_channel(cluster) log [WRN] : slow request 37.494836 seconds old, received at 2020-02-03 15:08:44.964990: osd_op(osd.9.59923:2671087 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470376 7f51f6af7700 0 log_channel(cluster) log [WRN] : slow request 37.491950 seconds old, received at 2020-02-03 15:08:44.967877: osd_op(osd.9.59923:2671119 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
...
2020-02-03 15:09:35.537259 7f51ddac5700 0 osd.8 59939 do_command r=0 2020-02-03 15:09:35.588753 7f51ddac5700 0 osd.8 59939 do_command r=0 Set debug_osd to 10/10 2020-02-03 15:09:35.588814 7f51ddac5700 0 log_channel(cluster) log [INF] : Set debug_osd to 10/10 2020-02-03 15:09:35.597667 7f51f0945700 2 osd.8 59939 ms_handle_reset con 0x5647736ac800 session 0x564785519e00 2020-02-03 15:09:35.599056 7f51e0acb700 10 osd.8 pg_epoch: 59939 pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 crt=59908'4218227 lcod 0'0 mlcod 0'0 active+clean] dropping ondisk_read_lock 2020-02-03 15:09:35.599209 7f51e0acb700 10 osd.8 59939 dequeue_op 0x5647873d5800 finish 2020-02-03 15:09:35.599250 7f51e0acb700 10 osd.8 59939 dequeue_op 0x5647873d59c0 prio 63 cost 29 latency 22.224501 osd_op(osd.9.59923:2779946 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) v8 pg pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 crt=59908'4218227 lcod 0'0 mlcod 0'0 active+clean] 2020-02-03 15:09:35.599293 7f51e0acb700 10 osd.8 pg_epoch: 59939 pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 crt=59908'4218227 lcod 0'0 mlcod 0'0 active+clean] _handle_message: 0x5647873d59c0 2020-02-03 15:09:35.599347 7f51e0acb700 10 osd.8 pg_epoch: 59939 pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 cr
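To see exactly which operations that OSD is grinding on, the admin socket op dumps are often more convenient than raising debug_osd (osd.8 matches the log excerpt above):

    ceph daemon osd.8 dump_ops_in_flight    # ops currently being processed
    ceph daemon osd.8 dump_blocked_ops      # ops blocked longer than the complaint time
    ceph daemon osd.8 dump_historic_ops     # recently finished slow ops with per-step timings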
[ceph-users] recovery_unfound
Dear All, Due to a mistake in my "rolling restart" script, one of our ceph clusters now has a number of unfound objects: There is an 8+2 erasure encoded data pool, 3x replicated metadata pool, all data is stored as cephfs. root@ceph7 ceph-archive]# ceph health HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage: 14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects degraded (0.000%), 14 pgs degraded "ceph health detail" gives me a handle on which pgs are affected. e.g: pg 5.f2f has 2 unfound objects pg 5.5c9 has 2 unfound objects pg 5.4c1 has 1 unfound objects and so on... plus more entries of this type: pg 5.6d is active+recovery_unfound+degraded, acting [295,104,57,442,240,338,219,33,150,382], 1 unfound pg 5.3fa is active+recovery_unfound+degraded, acting [343,147,21,131,315,63,214,365,264,437], 2 unfound pg 5.41d is active+recovery_unfound+degraded, acting [20,104,190,377,52,141,418,358,240,289], 1 unfound Digging deeper into one of the bad pg, we see the oid for the two unfound objects: root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound { "num_missing": 4, "num_unfound": 2, "objects": [ { "oid": { "oid": "1000ba25e49.0207", "key": "", "snapid": -2, "hash": 854007599, "max": 0, "pool": 5, "namespace": "" }, "need": "22541'3088478", "have": "0'0", "flags": "none", "locations": [ "189(8)", "263(9)" ] }, { "oid": { "oid": "1000bb25a5b.0091", "key": "", "snapid": -2, "hash": 3637976879, "max": 0, "pool": 5, "namespace": "" }, "need": "22541'3088476", "have": "0'0", "flags": "none", "locations": [ "189(8)", "263(9)" ] } ], "more": false } While it would be nice to recover the data, this cluster is only used for storing backups. As all OSD are up and running, presumably the data blocks are permanently lost? If it's hard / impossible to recover the data, presumably we should now consider using "ceph pg 5.f2f mark_unfound_lost delete" on each affected pg? Finally, can we use the oid to identify the affected files? best regards, Jake -- Jake Grimmett MRC Laboratory of Molecular Biology Francis Crick Avenue, Cambridge CB2 0QH, UK. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
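On the last question: for CephFS data objects the part of the object name before the dot is the file's inode number in hex, so the affected files can usually be located on a mounted file system with something like this (the mount point is an assumption):

    # 1000ba25e49 is the hex inode number from the unfound object's oid
    find /cephfs-mount -inum $(printf '%d' 0x1000ba25e49)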
[ceph-users] ceph positions
Hi all, I really hope this isn't seen as spam. I am looking to find a position where I can focus on Linux storage/Ceph. If anyone is currently looking please let me know. Linkedin profile frankritchie. Thanks, Frank ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] cephf_metadata: Large omap object found
Hello, I have this message on my new ceph cluster in Nautilus. I have a cephfs with a copy of ~100TB in progress. > /var/log/ceph/artemis.log:2020-02-03 16:22:49.970437 osd.66 (osd.66) 1137 : > cluster [WRN] Large omap object found. Object: > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 > Size (bytes): 6691941 > /var/log/ceph/artemis-osd.66.log:2020-02-03 16:22:49.966 7fe77af62700 0 > log_channel(cluster) log [WRN] : Large omap object found. Object: > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 > Size (bytes): 6691941 I found this thread about a similar issue in the archives of the list https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JUFYDCQ2AHFA23NFJQY743ELJHG2N5DI/ But I'm not sure what I can do in my situation, can I increase osd_deep_scrub_large_omap_object_key_threshold or it's a bad idea? Thanks for your help. Here some useful (I guess) information: > Filesystem Size Used Avail Use% Mounted on > 10.90.37.4,10.90.37.6,10.90.37.8:/ 329T 32T 297T 10% /artemis > artemis@icitsrv5:~$ ceph -s > cluster: > id: 815ea021-7839-4a63-9dc1-14f8c5feecc6 > health: HEALTH_WARN > 1 large omap objects > > services: > mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 2w) > mgr: iccluster021(active, since 7h), standbys: iccluster009, iccluster023 > mds: cephfs:5 5 up:active > osd: 120 osds: 120 up (since 5d), 120 in (since 5d) > rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, > iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, > iccluster021.rgw0, iccluster023.rgw0) > > data: > pools: 10 pools, 2161 pgs > objects: 72.02M objects, 125 TiB > usage: 188 TiB used, 475 TiB / 662 TiB avail > pgs: 2157 active+clean > 4active+clean+scrubbing+deep > > io: > client: 31 KiB/s rd, 803 KiB/s wr, 31 op/s rd, 184 op/s wr > artemis@icitsrv5:~$ ceph health detail > HEALTH_WARN 1 large omap objects > LARGE_OMAP_OBJECTS 1 large omap objects > 1 large objects found in pool 'cephfs_metadata' > Search the cluster log for 'Large omap object found' for more details. 
> artemis@icitsrv5:~$ ceph fs status > cephfs - 3 clients > == > +--++--+---+---+---+ > | Rank | State | MDS |Activity | dns | inos | > +--++--+---+---+---+ > | 0 | active | iccluster015 | Reqs:0 /s | 251k | 251k | > | 1 | active | iccluster001 | Reqs:3 /s | 20.2k | 19.1k | > | 2 | active | iccluster017 | Reqs:1 /s | 116k | 112k | > | 3 | active | iccluster019 | Reqs:0 /s | 263k | 263k | > | 4 | active | iccluster013 | Reqs: 123 /s | 16.3k | 16.3k | > +--++--+---+---+---+ > +-+--+---+---+ > | Pool | type | used | avail | > +-+--+---+---+ > | cephfs_metadata | metadata | 13.9G | 135T | > | cephfs_data | data | 51.3T | 296T | > +-+--+---+---+ > +-+ > | Standby MDS | > +-+ > +-+ > MDS version: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) > nautilus (stable) > root@iccluster019:~# ceph --cluster artemis daemon osd.13 config show | grep > large_omap > "osd_deep_scrub_large_omap_object_key_threshold": "20", > "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824", > artemis@icitsrv5:~$ rados -p cephfs_metadata listxattr mds3_openfiles.0 > artemis@icitsrv5:~$ rados -p cephfs_metadata getomapheader mds3_openfiles.0 > header (42 bytes) : > 13 00 00 00 63 65 70 68 20 66 73 20 76 6f 6c 75 |ceph fs volu| > 0010 6d 65 20 76 30 31 31 01 01 0d 00 00 00 14 63 00 |me v011...c.| > 0020 00 00 00 00 00 01 00 00 00 00|..| > 002a Best regards, -- Yoann Moulin EPFL IC-IT ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephf_metadata: Large omap object found
The warning threshold recently changed, I'd just increase it in this particular case. It just means you have lots of open files. I think there's some work going on to split the openfiles object into multiple, so that problem will be fixed. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Mon, Feb 3, 2020 at 5:39 PM Yoann Moulin wrote: > > Hello, > > I have this message on my new ceph cluster in Nautilus. I have a cephfs with > a copy of ~100TB in progress. > > > /var/log/ceph/artemis.log:2020-02-03 16:22:49.970437 osd.66 (osd.66) 1137 : > > cluster [WRN] Large omap object found. Object: > > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 > > Size (bytes): 6691941 > > > /var/log/ceph/artemis-osd.66.log:2020-02-03 16:22:49.966 7fe77af62700 0 > > log_channel(cluster) log [WRN] : Large omap object found. Object: > > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 > > Size (bytes): 6691941 > > I found this thread about a similar issue in the archives of the list > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JUFYDCQ2AHFA23NFJQY743ELJHG2N5DI/ > > But I'm not sure what I can do in my situation, can I increase > osd_deep_scrub_large_omap_object_key_threshold or it's a bad idea? > > Thanks for your help. > > Here some useful (I guess) information: > > > Filesystem Size Used Avail Use% Mounted on > > 10.90.37.4,10.90.37.6,10.90.37.8:/ 329T 32T 297T 10% /artemis > > > artemis@icitsrv5:~$ ceph -s > > cluster: > > id: 815ea021-7839-4a63-9dc1-14f8c5feecc6 > > health: HEALTH_WARN > > 1 large omap objects > > > > services: > > mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 2w) > > mgr: iccluster021(active, since 7h), standbys: iccluster009, > > iccluster023 > > mds: cephfs:5 5 up:active > > osd: 120 osds: 120 up (since 5d), 120 in (since 5d) > > rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, > > iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, > > iccluster021.rgw0, iccluster023.rgw0) > > > > data: > > pools: 10 pools, 2161 pgs > > objects: 72.02M objects, 125 TiB > > usage: 188 TiB used, 475 TiB / 662 TiB avail > > pgs: 2157 active+clean > > 4active+clean+scrubbing+deep > > > > io: > > client: 31 KiB/s rd, 803 KiB/s wr, 31 op/s rd, 184 op/s wr > > > artemis@icitsrv5:~$ ceph health detail > > HEALTH_WARN 1 large omap objects > > LARGE_OMAP_OBJECTS 1 large omap objects > > 1 large objects found in pool 'cephfs_metadata' > > Search the cluster log for 'Large omap object found' for more details. 
> > > > artemis@icitsrv5:~$ ceph fs status > > cephfs - 3 clients > > == > > +--++--+---+---+---+ > > | Rank | State | MDS |Activity | dns | inos | > > +--++--+---+---+---+ > > | 0 | active | iccluster015 | Reqs:0 /s | 251k | 251k | > > | 1 | active | iccluster001 | Reqs:3 /s | 20.2k | 19.1k | > > | 2 | active | iccluster017 | Reqs:1 /s | 116k | 112k | > > | 3 | active | iccluster019 | Reqs:0 /s | 263k | 263k | > > | 4 | active | iccluster013 | Reqs: 123 /s | 16.3k | 16.3k | > > +--++--+---+---+---+ > > +-+--+---+---+ > > | Pool | type | used | avail | > > +-+--+---+---+ > > | cephfs_metadata | metadata | 13.9G | 135T | > > | cephfs_data | data | 51.3T | 296T | > > +-+--+---+---+ > > +-+ > > | Standby MDS | > > +-+ > > +-+ > > MDS version: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) > > nautilus (stable) > > root@iccluster019:~# ceph --cluster artemis daemon osd.13 config show | > > grep large_omap > > "osd_deep_scrub_large_omap_object_key_threshold": "20", > > "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824", > > > artemis@icitsrv5:~$ rados -p cephfs_metadata listxattr mds3_openfiles.0 > > artemis@icitsrv5:~$ rados -p cephfs_metadata getomapheader mds3_openfiles.0 > > header (42 bytes) : > > 13 00 00 00 63 65 70 68 20 66 73 20 76 6f 6c 75 |ceph fs > > volu| > > 0010 6d 65 20 76 30 31 31 01 01 0d 00 00 00 14 63 00 |me > > v011...c.| > > 0020 00 00 00 00 00 01 00 00 00 00|..| > > 002a > > Best regards, > > -- > Yoann Moulin > EPFL IC-IT > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list
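If you do decide to raise the threshold rather than wait for the openfiles-object split, on Nautilus that would look roughly like this (the new value is only an example):

    # per-object omap key count above which the warning fires (recent default: 200000)
    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 400000

    # re-deep-scrub the PG that reported the object so the warning clears
    ceph pg deep-scrub 8.2a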
[ceph-users] Re: recovery_unfound
This might be related to recent problems with OSDs not being queried for unfound objects properly in some cases (which I think was fixed in master?) Anyways: run ceph pg query on the affected PGs, check for "might have unfound" and try restarting the OSDs mentioned there. Probably also sufficient to just run "ceph osd down" on the primaries on the affected PGs to get them to re-check. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Mon, Feb 3, 2020 at 4:27 PM Jake Grimmett wrote: > > Dear All, > > Due to a mistake in my "rolling restart" script, one of our ceph > clusters now has a number of unfound objects: > > There is an 8+2 erasure encoded data pool, 3x replicated metadata pool, > all data is stored as cephfs. > > root@ceph7 ceph-archive]# ceph health > HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage: > 14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects > degraded (0.000%), 14 pgs degraded > > "ceph health detail" gives me a handle on which pgs are affected. > e.g: > pg 5.f2f has 2 unfound objects > pg 5.5c9 has 2 unfound objects > pg 5.4c1 has 1 unfound objects > and so on... > > plus more entries of this type: > pg 5.6d is active+recovery_unfound+degraded, acting > [295,104,57,442,240,338,219,33,150,382], 1 unfound > pg 5.3fa is active+recovery_unfound+degraded, acting > [343,147,21,131,315,63,214,365,264,437], 2 unfound > pg 5.41d is active+recovery_unfound+degraded, acting > [20,104,190,377,52,141,418,358,240,289], 1 unfound > > Digging deeper into one of the bad pg, we see the oid for the two > unfound objects: > > root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound > { > "num_missing": 4, > "num_unfound": 2, > "objects": [ > { > "oid": { > "oid": "1000ba25e49.0207", > "key": "", > "snapid": -2, > "hash": 854007599, > "max": 0, > "pool": 5, > "namespace": "" > }, > "need": "22541'3088478", > "have": "0'0", > "flags": "none", > "locations": [ > "189(8)", > "263(9)" > ] > }, > { > "oid": { > "oid": "1000bb25a5b.0091", > "key": "", > "snapid": -2, > "hash": 3637976879, > "max": 0, > "pool": 5, > "namespace": "" > }, > "need": "22541'3088476", > "have": "0'0", > "flags": "none", > "locations": [ > "189(8)", > "263(9)" > ] > } > ], > "more": false > } > > > While it would be nice to recover the data, this cluster is only used > for storing backups. > > As all OSD are up and running, presumably the data blocks are > permanently lost? > > If it's hard / impossible to recover the data, presumably we should now > consider using "ceph pg 5.f2f mark_unfound_lost delete" on each > affected pg? > > Finally, can we use the oid to identify the affected files? > > best regards, > > Jake > > -- > Jake Grimmett > MRC Laboratory of Molecular Biology > Francis Crick Avenue, > Cambridge CB2 0QH, UK. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
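Concretely, that check and kick could look like this (pg 5.f2f and osd 189 are taken from Jake's output; any OSD listed under "might_have_unfound", or the PG's primary, is a candidate):

    # see which OSDs the PG still thinks might hold the missing objects
    ceph pg 5.f2f query | grep -A 20 might_have_unfound

    # mark one of them down so the PG re-peers and re-queries it
    ceph osd down 189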
[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.6
Does anyone have access to libibverbs-debuginfo-22.1-3.el7.x86_64 and librdmacm-debuginfo-22.1-3.el7.x86_64? I cannot find them in any repo list out there and the gdbpmp.py requires them. Thanks, Joe ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
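In case it helps: on a stock CentOS 7 box the usual way to pull matching debuginfo packages is debuginfo-install from yum-utils, which resolves the correct -debuginfo package names and versions from the CentOS debuginfo repo (on RHEL the corresponding debug repos have to be enabled first):

    yum install -y yum-utils
    debuginfo-install -y libibverbs librdmacm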
[ceph-users] Re: Ubuntu 18.04.4 Ceph 12.2.12
So now that 12.2.13 has been released, now I will have a mixed environment if I use Ubuntu 18.04 repo 12.2.12 I also found there is a docker container https://hub.docker.com/r/ceph/daemon I could potentially just use the container to run the version I need. Wondering if anyone has done this in production? Managing the ubuntu repos for ceph has not been easy to say the least :( Found this ticket but looks dead https://tracker.ceph.com/issues/24326 ‐‐‐ Original Message ‐‐‐ On Friday, January 24, 2020 1:12 PM, Anthony D'Atri wrote: > I applied those packages for the same reason on a staging cluster and so far > so good. > >> On Jan 24, 2020, at 9:15 AM, Atherion wrote: > >> >> Hi Ceph Community. >> We currently have a luminous cluster running and some machines still on >> Ubuntu 14.04 >> We are looking to upgrade these machines to 18.04 but the only upgrade path >> for luminous with the ceph repo is through 16.04. >> It is doable to get to Mimic but then we have to upgrade all those machines >> to 16.04 but then we have to upgrade again to 18.04 when we get to Mimic, it >> is becoming a huge time sink. >> >> I did notice in the Ubuntu repos they have added 12.2.12 in 18.04.4 release. >> Is this a reliable build we can use? >> https://ubuntu.pkgs.org/18.04/ubuntu-proposed-main-amd64/ceph_12.2.12-0ubuntu0.18.04.4_amd64.deb.html >> If so then we can go straight to 18.04.4 and not waste so much time. >> >> Best >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
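One low-tech way to keep hosts from drifting into a mixed environment while the repo situation is sorted out is to hold the ceph packages at the version you have qualified; this is plain apt, not Ceph-specific, and the package list may need adjusting to what is actually installed:

    apt-mark hold ceph ceph-base ceph-common ceph-mon ceph-mgr ceph-osd ceph-mds
    apt-mark showhold     # verify
    # later, when ready to upgrade:
    apt-mark unhold ceph ceph-base ceph-common ceph-mon ceph-mgr ceph-osd ceph-mds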
[ceph-users] Understanding Bluestore performance characteristics
Hi,

We have a production cluster of 27 OSD's across 5 servers (all SSD's running bluestore), and have started to notice a possible performance issue. In order to isolate the problem, we built a single server with a single OSD, and ran a few FIO tests. The results are puzzling, not that we were expecting good performance on a single OSD. In short, with a sequential write test, we are seeing huge numbers of reads hitting the actual SSD.

Key FIO parameters are:

[global]
pool=benchmarks
rbdname=disk-1
direct=1
numjobs=2
iodepth=1
blocksize=4k
group_reporting=1
[writer]
readwrite=write

iostat results are:

Device:   rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme0n1     0.00   105.00  4896.00   294.00  312080.00   1696.00    120.92     17.25   3.35     3.55     0.02   0.02  12.60

There are nearly ~5000 reads/second (~300 MB/sec), compared with only ~300 writes (~1.5MB/sec), when we are doing a sequential write test? The system is otherwise idle, with no other workload.

Running the same fio test with only 1 thread (numjobs=1) still shows a high number of reads (110).

Device:   rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme0n1     0.00  1281.00   110.00  1463.00     440.00  12624.00     16.61      0.03   0.02     0.05     0.02   0.02   3.40

Can anyone kindly offer any comments on why we are seeing this behaviour? I can understand if there's the occasional read here and there if RocksDB/WAL entries need to be read from disk during the sequential write test, but this seems significantly high and unusual.

FIO results (numjobs=2)

writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
...
fio-3.7
Starting 2 processes
Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb 3 22:46:16 2020
  write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
    slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
    clat (msec): min=2, max=210, avg=58.32, stdev=70.54
     lat (msec): min=2, max=210, avg=58.35, stdev=70.54
    clat percentiles (msec):
     |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[3],
     | 30.00th=[3], 40.00th=[3], 50.00th=[54], 60.00th=[62],
     | 70.00th=[65], 80.00th=[174], 90.00th=[188], 95.00th=[194],
     | 99.00th=[201], 99.50th=[203], 99.90th=[209], 99.95th=[209],
     | 99.99th=[211]
   bw (  KiB/s): min=24, max=144, per=49.69%, avg=68.08, stdev=38.22, samples=239
   iops        : min=6, max=36, avg=16.97, stdev=9.55, samples=239
  lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
  cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
  IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
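One way to narrow down where those reads come from on the OSD side is to sample the BlueStore and RocksDB perf counters around the fio run; reads during a pure-write workload typically originate from RocksDB compaction and onode/metadata cache misses (osd.0 is a placeholder for the single test OSD):

    ceph daemon osd.0 perf reset all         # zero the counters
    # ... run the fio job ...
    ceph daemon osd.0 perf dump rocksdb      # compaction read/write volumes
    ceph daemon osd.0 perf dump bluestore    # onode cache hits/misses, deferred writes, etc.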
[ceph-users] Re: ceph positions
Hello Frank, we are always looking for Ceph/Linux consultants. -- Martin Verges Managing director Hint: Secure one of the last slots in the upcoming 4-day Ceph Intensive Training at https://croit.io/training/4-days-ceph-in-depth-training. Mobile: +49 174 9335695 E-Mail: martin.ver...@croit.io Chat: https://t.me/MartinVerges croit GmbH, Freseniusstr. 31h, 81247 Munich CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht Munich HRB 231263 Web: https://croit.io YouTube: https://goo.gl/PGE1Bx Am Mo., 3. Feb. 2020 um 17:26 Uhr schrieb Frank R : > Hi all, > > I really hope this isn't seen as spam. I am looking to find a position > where I can focus on Linux storage/Ceph. If anyone is currently > looking please let me know. Linkedin profile frankritchie. > > Thanks, > Frank > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io