[ceph-users] Re: osd is immidietly down and uses CPU full.

2020-02-03 Thread Janne Johansson
Den mån 3 feb. 2020 kl 08:25 skrev Wido den Hollander :

> > The crash happens when the osd wants to read from the pipe while
> > processing a heartbeat. To me it sounds like a networking issue.
>
> It could also be that this OSD is so busy internally with other stuff
> that it doesn't respond to heartbeats and then commits suicide.
>
> Combined with the comment that VMs can't read their data it could very
> well be that the OSD is super busy.
>
> Maybe try compacting the LevelDB database.
>

I think I am with Wido on this one: if you get one or a few PGs so full of
metadata or weird stuff that handling them takes longer than suicide_timeout,
then it will behave like this.
At startup the OSD tries to complete whatever operation was queued (scrubs,
recovery, something like that), gets stuck doing that instead of answering
heartbeats or finishing operations requested by other OSDs or clients, and
gets ejected from the cluster. If it is anything like what we see on our
jewel cluster, you can move these PGs around (with impact to clients), but
you can't "fix" them without deep changes such as moving from leveldb to
rocksdb (if filestore), splitting PGs, and sharding buckets if it is RGW
metadata that causes these huge indexes to end up on a single OSD.

You sort of need to figure out what the root cause is and aim to fix that
part.
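
If you want to try the compaction route, a rough sketch (OSD id and path are
just examples, adjust to your setup and ceph version):

  ceph daemon osd.12 compact   # online, if your ceph version has this admin socket command
  # or, with the OSD stopped, for a filestore/leveldb omap:
  ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-12/current/omap compact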

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts

2020-02-03 Thread Frank Schilder
Dear Konstantin and Patrick,

thanks!

I started migrating a 2-pool layout ceph fs (rep meta, EC default data) to a 
3-pool layout (rep meta, rep default data, EC data set at "/") and am using 
sub-directory mounts for the data migration. So far, everything is as it should be.

Some background info for everyone reading this: the reason for migrating is 
the updated best practice for cephfs; compare these two:

https://docs.ceph.com/docs/mimic/cephfs/createfs/#creating-pools
https://docs.ceph.com/docs/master/cephfs/createfs/#creating-pools

The 3-pool layout was never mentioned in the RH ceph-course I took, nor by any 
of the ceph consultants we hired before deploying ceph. However, it seems 
really important to know about it.

For a metadata + data pool layout, since some metadata is written to the 
default data pool, an EC default data pool seems a bad idea most of the time. I 
see a lot of size-0 objects that only store rados metadata:

POOLS:
    NAME            ID  USED     %USED  MAX AVAIL  OBJECTS
    con-fs2-meta1   12  256 MiB   0.02    1.1 TiB    410910
    con-fs2-meta2   13      0 B      0    355 TiB   5217644
    con-fs2-data    14   50 TiB   5.53    852 TiB  17943209

con-fs2-meta is the default data pool. This is probably the worst workload for 
an EC pool.

On our file system I have regularly seen "one MDS reports slow meta-data IOs" 
and was always wondering where this comes from. I have the meta-data pool on 
SSDs and this warning simply didn't make any sense. Now I know.

Having a small replicated default pool not only resolves this issue, it also 
speeds up file create/delete and hard-link operations dramatically; I guess 
anything that modifies an inode benefits. I never tested these operations in my 
benchmarks, but they are important. Compiling and installing packages, and 
anything else with a heavy create/modify/delete workload, will profit, as will 
cluster health.

Fortunately, I had an opportunity to migrate the ceph fs. For anyone starting 
fresh, I would recommend having the 3-pool layout right from the beginning. 
Never use an EC pool as the default data pool. I would even make this statement 
a bit stronger in the ceph documentation:

If erasure-coded pools are planned for the file system, it is usually 
better to use a replicated pool for the default data pool ...

to, for example,

If erasure-coded pools are planned for the file system, it is strongly 
recommended to use a replicated pool for the default data pool ...
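
For reference, setting up the 3-pool layout looks roughly like this (pool and
fs names are just examples, not copied from my shell history):

  ceph osd pool create fs-meta 64 64 replicated
  ceph osd pool create fs-default-data 64 64 replicated
  ceph osd pool create fs-ec-data 256 256 erasure my-ec-profile
  ceph osd pool set fs-ec-data allow_ec_overwrites true
  ceph fs new myfs fs-meta fs-default-data
  ceph fs add_data_pool myfs fs-ec-data
  # after mounting the fs root once with an admin client:
  setfattr -n ceph.dir.layout.pool -v fs-ec-data /mnt/myfs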

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Patrick Donnelly 
Sent: 02 February 2020 12:41
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] ceph fs dir-layouts and sub-directory mounts

On Wed, Jan 29, 2020 at 3:04 AM Frank Schilder  wrote:
>
> I would like to (in this order)
>
> - set the data pool for the root "/" of a ceph-fs to a custom value, say "P" 
> (not the initial data pool used in fs new)
> - create a sub-directory of "/", for example "/a"
> - mount the sub-directory "/a" with a client key with access restricted to 
> "/a"
>
> The client will not be able to see the dir layout attribute set at "/", since 
> it's not mounted.

The client gets the file layout information when the file is created
(i.e. the RPC response from the MDS). It doesn't have _any_ access to
"/". It can't even stat "/".

> Will the data of this client still go to the pool "P", that is, does "/a" 
> inherit the dir layout transparently to the client when following the steps 
> above?

Yes.
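
You can verify this from the restricted client, e.g. (path is an example):

  getfattr -n ceph.file.layout.pool /mnt/a/somefile   # should report "P"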

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts

2020-02-03 Thread Frank Schilder
errata: con-fs2-meta2 is the default data pool.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 03 February 2020 10:08
To: Patrick Donnelly; Konstantin Shalygin
Cc: ceph-users
Subject: Re: [ceph-users] ceph fs dir-layouts and sub-directory mounts

Dear Konstantin and Patrick,

thanks!

I started migrating a 2-pool layout ceph fs (rep meta, EC default data) to a 
3-pool layout (rep meta, rep default data, EC data set at "/") and am using 
sub-directory mounts for the data migration. So far, everything is as it should be.

Some background info for everyone reading this: the reason for migrating is 
the updated best practice for cephfs; compare these two:

https://docs.ceph.com/docs/mimic/cephfs/createfs/#creating-pools
https://docs.ceph.com/docs/master/cephfs/createfs/#creating-pools

The 3-pool layout was never mentioned in the RH ceph-course I took, nor by any 
of the ceph consultants we hired before deploying ceph. However, it seems 
really important to know about it.

For a metadata + data pool layout, since some metadata is written to the 
default data pool, an EC default data pool seems a bad idea most of the time. I 
see a lot of size-0 objects that only store rados metadata:

POOLS:
    NAME            ID  USED     %USED  MAX AVAIL  OBJECTS
    con-fs2-meta1   12  256 MiB   0.02    1.1 TiB    410910
    con-fs2-meta2   13      0 B      0    355 TiB   5217644
    con-fs2-data    14   50 TiB   5.53    852 TiB  17943209

con-fs2-meta2 is the default data pool. This is probably the worst workload for 
an EC pool.

On our file system I have regularly seen "one MDS reports slow meta-data IOs" 
and was always wondering where this comes from. I have the meta-data pool on 
SSDs and this warning simply didn't make any sense. Now I know.

Having a small replicated default pool not only resolves this issue, it also 
speeds up file create/delete and hard-link operations dramatically; I guess 
anything that modifies an inode benefits. I never tested these operations in my 
benchmarks, but they are important. Compiling and installing packages, and 
anything else with a heavy create/modify/delete workload, will profit, as will 
cluster health.

Fortunately, I had an opportunity to migrate the ceph fs. For anyone starting 
fresh, I would recommend having the 3-pool layout right from the beginning. 
Never use an EC pool as the default data pool. I would even make this statement 
a bit stronger in the ceph documentation:

If erasure-coded pools are planned for the file system, it is usually 
better to use a replicated pool for the default data pool ...

to, for example,

If erasure-coded pools are planned for the file system, it is strongly 
recommended to use a replicated pool for the default data pool ...

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: kernel client osdc ops stuck and mds slow reqs

2020-02-03 Thread Dan van der Ster
On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov  wrote:
>
> On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster  wrote:
> >
> > Hi Ilya,
> >
> > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov  wrote:
> > >
> > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We are quite regularly (a couple times per week) seeing:
> > > >
> > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
> > > > report slow requests
> > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability 
> > > > release
> > > > mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond
> > > > to capability release client_id: 52919162
> > > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > > > mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
> > > >
> > > > Which is being caused by osdc ops stuck in a kernel client, e.g.:
> > > >
> > > > 10:57:18 root hpc-be028 /root
> > > > → cat 
> > > > /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> > > > REQUESTS 9 homeless 0
> > > > 46559317osd2433.ee6ffcdb3.cdb[243,501,92]/243
> > > > [243,501,92]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057
> > > >  0x4000141read
> > > > 46559322osd2433.ee6ffcdb3.cdb[243,501,92]/243
> > > > [243,501,92]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057
> > > >  0x4000141read
> > > > 46559323osd2433.969cc5733.573[243,330,226]/243
> > > > [243,330,226]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > >  0x4000141read
> > > > 46559341osd2433.969cc5733.573[243,330,226]/243
> > > > [243,330,226]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > >  0x4000141read
> > > > 46559342osd2433.969cc5733.573[243,330,226]/243
> > > > [243,330,226]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > >  0x4000141read
> > > > 46559345osd2433.969cc5733.573[243,330,226]/243
> > > > [243,330,226]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > >  0x4000141read
> > > > 46559621osd2433.6313e8ef3.8ef[243,330,521]/243
> > > > [243,330,521]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.007a
> > > >  0x4000141read
> > > > 46559629osd2433.b280c8523.852[243,113,539]/243
> > > > [243,113,539]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.007f
> > > >  0x4000141read
> > > > 46559928osd2433.1ee7bab43.ab4[243,332,94]/243
> > > > [243,332,94]/243e678697
> > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.073f
> > > >  0x4000241write
> > > > LINGER REQUESTS
> > > > BACKOFFS
> > > >
> > > >
> > > > We can unblock those requests by doing `ceph osd down osd.243` (or
> > > > restarting osd.243).
> > > >
> > > > This is ceph v14.2.6 and the client kernel is el7 
> > > > 3.10.0-957.27.2.el7.x86_64.
> > > >
> > > > Is there a better way to debug this?
> > >
> > > Hi Dan,
> > >
> > > I assume that these ops don't show up as slow requests on the OSD side?
> > > How long did you see it stuck for before intervening?
> >
> > That's correct -- the osd had no active ops (ceph daemon ops).
> >
> > The late release slow req was stuck for 4129s before we intervened.
> >
> > > Do you happen to have "debug ms = 1" logs from osd243?
> >
> > Nope, but I can try to get it afterwards next time. (Though you need
> > it at the moment the ops get stuck, not only from the moment we notice
> > the stuck ops, right?)
>
> Yes, starting before the moment the ops get stuck and ending after you
> kick the OSD.
>
> >
> > > Do you have PG autoscaler enabled?  Any PG splits and/or merges at the 
> > > time?
> >
> > Not on the cephfs_(meta)data pools (though on the 30th I increased
> > those pool sizes from 2 to 3). And also on the 30th I did some PG
> > merging on an unrelated test pool.
> > And anyway we have seen this type of lockup in the past, without those
> > pool changes (also with mimic MDS until we upgraded to nautilus).
>
> The MDS is not the problem here.  This issue is between the kernel
> client and the OSD.
>
> >
> > Looking back further in the client's kernel log we see a page alloc
> > failure on the 30th:
> >
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page
> > allocation failure: order:5, mode:0x104050
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: CPU: 1 PID: 78445 Comm:
> > kworker/1:36 Kdump: loaded Tainted: P
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: Workqueue: ceph-msgr
> > ceph_con_workfn [libceph]
>
> Can you share the stack trace?  That's a 128k allocati

[ceph-users] Re: data loss on full file system?

2020-02-03 Thread Paul Emmerich
On Sun, Feb 2, 2020 at 9:35 PM Håkan T Johansson  wrote:
>

>
> Changing cp (or whatever standard tool is used) to call fsync() before
> each close() is not an option for a user.  Also, doing that would lead to
> terrible performance generally.  Just tested - a recursive copy of a 70k
> files linux source tree went from 15 s to 6 minutes on a local filesystem
> I have at hand.

Don't do it for every file:  cp foo bar; sync

>
> Best regards,
> Håkan
>
>
>
> >
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson  
> > wrote:
> >>
> >>
> >> Hi,
> >>
> >> for test purposes, I have set up two 100 GB OSDs, one
> >> taking a data pool and the other metadata pool for cephfs.
> >>
> >> Am running 14.2.6-1-gffd69200ad-1 with packages from
> >> https://mirror.croit.io/debian-nautilus
> >>
> >> Am then running a program that creates a lot of 1 MiB files by calling
> >>fopen()
> >>fwrite()
> >>fclose()
> >> for each of them.  Error codes are checked.
> >>
> >> This works successfully for ~100 GB of data, and then strangely also 
> >> succeeds
> >> for many more 100 GB of data...  ??
> >>
> >> All written files have size 1 MiB with 'ls', and thus should contain the 
> >> data
> >> written.  However, on inspection, the files written after the first ~100 
> >> GiB,
> >> are full of just 0s.  (hexdump -C)
> >>
> >>
> >> To further test this, I use the standard tool 'cp' to copy a few 
> >> random-content
> >> files into the full cephfs filessystem.  cp reports no complaints, and 
> >> after
> >> the copy operations, content is seen with hexdump -C.  However, after 
> >> forcing
> >> the data out of cache on the client by reading other earlier created files,
> >> hexdump -C show all-0 content for the files copied with 'cp'.  Data that 
> >> was
> >> there is suddenly gone...?
> >>
> >>
> >> I am new to ceph.  Is there an option I have missed to avoid this 
> >> behaviour?
> >> (I could not find one in
> >> https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
> >>
> >> Is this behaviour related to
> >> https://docs.ceph.com/docs/mimic/cephfs/full/
> >> ?
> >>
> >> (That page states 'sometime after a write call has already returned 0'. 
> >> But if
> >> write returns 0, then no data has been written, so the user program would 
> >> not
> >> assume any kind of success.)
> >>
> >> Best regards,
> >>
> >> Håkan
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts

2020-02-03 Thread Patrick Donnelly
On Mon, Feb 3, 2020 at 1:09 AM Frank Schilder  wrote:
> Fortunately, I had an opportunity to migrate the ceph fs. For anyone starting 
> fresh, I would recommend having the 3-pool layout right from the 
> beginning. Never use an EC pool as the default data pool. I would even make 
> this statement a bit stronger in the ceph documentation:
>
> If erasure-coded pools are planned for the file system, it is usually 
> better to use a replicated pool for the default data pool ...
>
> to, for example,
>
> If erasure-coded pools are planned for the file system, it is strongly 
> recommended to use a replicated pool for the default data pool ...

We're going even further by having the monitors warn you if you try to
do this: https://tracker.ceph.com/issues/42450

Backports to Nautilus and Mimic are already in flight.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs dir-layouts and sub-directory mounts

2020-02-03 Thread Frank Schilder
Thumbs up for that!

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Patrick Donnelly 
Sent: 03 February 2020 11:18
To: Frank Schilder
Cc: Konstantin Shalygin; ceph-users
Subject: Re: [ceph-users] ceph fs dir-layouts and sub-directory mounts

On Mon, Feb 3, 2020 at 1:09 AM Frank Schilder  wrote:
> Fortunately, I had an opportunity to migrate the ceph fs. For anyone starting 
> fresh, I would recommend having the 3-pool layout right from the 
> beginning. Never use an EC pool as the default data pool. I would even make 
> this statement a bit stronger in the ceph documentation:
>
> If erasure-coded pools are planned for the file system, it is usually 
> better to use a replicated pool for the default data pool ...
>
> to, for example,
>
> If erasure-coded pools are planned for the file system, it is strongly 
> recommended to use a replicated pool for the default data pool ...

We're going even further by having the monitors warn you if you try to
do this: https://tracker.ceph.com/issues/42450

Backports to Nautilus and Mimic are already in flight.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: kernel client osdc ops stuck and mds slow reqs

2020-02-03 Thread Ilya Dryomov
On Mon, Feb 3, 2020 at 10:38 AM Dan van der Ster  wrote:
>
> On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov  wrote:
> >
> > On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster  
> > wrote:
> > >
> > > Hi Ilya,
> > >
> > > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov  wrote:
> > > >
> > > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster  
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > We are quite regularly (a couple times per week) seeing:
> > > > >
> > > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
> > > > > report slow requests
> > > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability 
> > > > > release
> > > > > mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond
> > > > > to capability release client_id: 52919162
> > > > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > > > > mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
> > > > >
> > > > > Which is being caused by osdc ops stuck in a kernel client, e.g.:
> > > > >
> > > > > 10:57:18 root hpc-be028 /root
> > > > > → cat 
> > > > > /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> > > > > REQUESTS 9 homeless 0
> > > > > 46559317osd2433.ee6ffcdb3.cdb[243,501,92]/243
> > > > > [243,501,92]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057
> > > > >  0x4000141read
> > > > > 46559322osd2433.ee6ffcdb3.cdb[243,501,92]/243
> > > > > [243,501,92]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057
> > > > >  0x4000141read
> > > > > 46559323osd2433.969cc5733.573[243,330,226]/243
> > > > > [243,330,226]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > >  0x4000141read
> > > > > 46559341osd2433.969cc5733.573[243,330,226]/243
> > > > > [243,330,226]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > >  0x4000141read
> > > > > 46559342osd2433.969cc5733.573[243,330,226]/243
> > > > > [243,330,226]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > >  0x4000141read
> > > > > 46559345osd2433.969cc5733.573[243,330,226]/243
> > > > > [243,330,226]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > >  0x4000141read
> > > > > 46559621osd2433.6313e8ef3.8ef[243,330,521]/243
> > > > > [243,330,521]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.007a
> > > > >  0x4000141read
> > > > > 46559629osd2433.b280c8523.852[243,113,539]/243
> > > > > [243,113,539]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.007f
> > > > >  0x4000141read
> > > > > 46559928osd2433.1ee7bab43.ab4[243,332,94]/243
> > > > > [243,332,94]/243e678697
> > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.073f
> > > > >  0x4000241write
> > > > > LINGER REQUESTS
> > > > > BACKOFFS
> > > > >
> > > > >
> > > > > We can unblock those requests by doing `ceph osd down osd.243` (or
> > > > > restarting osd.243).
> > > > >
> > > > > This is ceph v14.2.6 and the client kernel is el7 
> > > > > 3.10.0-957.27.2.el7.x86_64.
> > > > >
> > > > > Is there a better way to debug this?
> > > >
> > > > Hi Dan,
> > > >
> > > > I assume that these ops don't show up as slow requests on the OSD side?
> > > > How long did you see it stuck for before intervening?
> > >
> > > That's correct -- the osd had no active ops (ceph daemon ops).
> > >
> > > The late release slow req was stuck for 4129s before we intervened.
> > >
> > > > Do you happen to have "debug ms = 1" logs from osd243?
> > >
> > > Nope, but I can try to get it afterwards next time. (Though you need
> > > it at the moment the ops get stuck, not only from the moment we notice
> > > the stuck ops, right?)
> >
> > Yes, starting before the moment the ops get stuck and ending after you
> > kick the OSD.
> >
> > >
> > > > Do you have PG autoscaler enabled?  Any PG splits and/or merges at the 
> > > > time?
> > >
> > > Not on the cephfs_(meta)data pools (though on the 30th I increased
> > > those pool sizes from 2 to 3). And also on the 30th I did some PG
> > > merging on an unrelated test pool.
> > > And anyway we have seen this type of lockup in the past, without those
> > > pool changes (also with mimic MDS until we upgraded to nautilus).
> >
> > The MDS is not the problem here.  This issue is between the kernel
> > client and the OSD.
> >
> > >
> > > Looking back further in the client's kernel log we see a page alloc
> > > failure on the 30th:
> > >
> > > Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page
> > > allocation failu

[ceph-users] Re: kernel client osdc ops stuck and mds slow reqs

2020-02-03 Thread Dan van der Ster
On Mon, Feb 3, 2020 at 11:50 AM Ilya Dryomov  wrote:
>
> On Mon, Feb 3, 2020 at 10:38 AM Dan van der Ster  wrote:
> >
> > On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov  wrote:
> > >
> > > On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi Ilya,
> > > >
> > > > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov  
> > > > wrote:
> > > > >
> > > > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster 
> > > > >  wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are quite regularly (a couple times per week) seeing:
> > > > > >
> > > > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 
> > > > > > MDSs
> > > > > > report slow requests
> > > > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability 
> > > > > > release
> > > > > > mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to 
> > > > > > respond
> > > > > > to capability release client_id: 52919162
> > > > > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > > > > > mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
> > > > > >
> > > > > > Which is being caused by osdc ops stuck in a kernel client, e.g.:
> > > > > >
> > > > > > 10:57:18 root hpc-be028 /root
> > > > > > → cat 
> > > > > > /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> > > > > > REQUESTS 9 homeless 0
> > > > > > 46559317osd2433.ee6ffcdb3.cdb[243,501,92]/243
> > > > > > [243,501,92]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057
> > > > > >  0x4000141read
> > > > > > 46559322osd2433.ee6ffcdb3.cdb[243,501,92]/243
> > > > > > [243,501,92]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.0057
> > > > > >  0x4000141read
> > > > > > 46559323osd2433.969cc5733.573[243,330,226]/243
> > > > > > [243,330,226]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > > >  0x4000141read
> > > > > > 46559341osd2433.969cc5733.573[243,330,226]/243
> > > > > > [243,330,226]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > > >  0x4000141read
> > > > > > 46559342osd2433.969cc5733.573[243,330,226]/243
> > > > > > [243,330,226]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > > >  0x4000141read
> > > > > > 46559345osd2433.969cc5733.573[243,330,226]/243
> > > > > > [243,330,226]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.0056
> > > > > >  0x4000141read
> > > > > > 46559621osd2433.6313e8ef3.8ef[243,330,521]/243
> > > > > > [243,330,521]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.007a
> > > > > >  0x4000141read
> > > > > > 46559629osd2433.b280c8523.852[243,113,539]/243
> > > > > > [243,113,539]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.007f
> > > > > >  0x4000141read
> > > > > > 46559928osd2433.1ee7bab43.ab4[243,332,94]/243
> > > > > > [243,332,94]/243e678697
> > > > > > fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.073f
> > > > > >  0x4000241write
> > > > > > LINGER REQUESTS
> > > > > > BACKOFFS
> > > > > >
> > > > > >
> > > > > > We can unblock those requests by doing `ceph osd down osd.243` (or
> > > > > > restarting osd.243).
> > > > > >
> > > > > > This is ceph v14.2.6 and the client kernel is el7 
> > > > > > 3.10.0-957.27.2.el7.x86_64.
> > > > > >
> > > > > > Is there a better way to debug this?
> > > > >
> > > > > Hi Dan,
> > > > >
> > > > > I assume that these ops don't show up as slow requests on the OSD 
> > > > > side?
> > > > > How long did you see it stuck for before intervening?
> > > >
> > > > That's correct -- the osd had no active ops (ceph daemon ops).
> > > >
> > > > The late release slow req was stuck for 4129s before we intervened.
> > > >
> > > > > Do you happen to have "debug ms = 1" logs from osd243?
> > > >
> > > > Nope, but I can try to get it afterwards next time. (Though you need
> > > > it at the moment the ops get stuck, not only from the moment we notice
> > > > the stuck ops, right?)
> > >
> > > Yes, starting before the moment the ops get stuck and ending after you
> > > kick the OSD.
> > >
> > > >
> > > > > Do you have PG autoscaler enabled?  Any PG splits and/or merges at 
> > > > > the time?
> > > >
> > > > Not on the cephfs_(meta)data pools (though on the 30th I increased
> > > > those pool sizes from 2 to 3). And also on the 30th I did some PG
> > > > merging on an unrelated test pool.
> > > > And anyway we have seen this type of lockup in the past, without those
> > > > pool changes (also with mimic MDS until we upgraded to

[ceph-users] cpu and memory for OSD server

2020-02-03 Thread Wyatt Chun
We have 18 SATA disks (2 TB each) in a physical server, each disk with an
OSD deployed.
I am not sure how much CPU and memory should be provisioned for this
server.
Does each OSD require a physical CPU, and how do I calculate memory usage?
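
(My rough understanding so far, please correct me if wrong: with BlueStore the
memory side is mostly governed by osd_memory_target, which defaults to 4 GiB
per OSD, e.g.

  ceph config set osd osd_memory_target 4294967296

so 18 OSDs would want on the order of 72 GiB of RAM plus OS headroom.)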

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] v12.2.13 Luminous released

2020-02-03 Thread Abhishek Lekshmanan

We're happy to announce the 13th bug fix release of the Luminous v12.2.x
long term stable release series. We recommend that all users upgrade to
this release. Many thanks to all the contributors, in particular Yuri &
Nathan, for getting this release out the door. This will be the last
release of the Luminous series.

For the detailed release notes, please check out the official blog entry
at https://ceph.io/releases/v12-2-13-luminous-released/

Notable Changes
---

* Ceph now packages python bindings for python3.6 instead of python3.4,
  because EPEL7 recently switched from python3.4 to python3.6 as the
  native python3. see the announcement[1] for more details on the
  background of this change.

* We now have telemetry support via a ceph-mgr module. The telemetry module is
  absolutely on an opt-in basis, and is meant to collect generic cluster
  information and push it to a central endpoint. By default, we're pushing it
  to a project endpoint at https://telemetry.ceph.com/report, but this is
  customizable by setting the 'url' config option with::

ceph telemetry config-set url ''

  You will have to opt-in on sharing your information with::

ceph telemetry on

  You can view exactly what information will be reported first with::

ceph telemetry show

  Should you opt-in, your information will be licensed under the
  Community Data License Agreement - Sharing - Version 1.0, which you can
  read at https://cdla.io/sharing-1-0/

  The telemetry module reports information about CephFS file systems,
  including:

- how many MDS daemons (in total and per file system)
- which features are (or have been) enabled
- how many data pools
- approximate file system age (year + month of creation)
- how much metadata is being cached per file system

  As well as:

- whether IPv4 or IPv6 addresses are used for the monitors
- whether RADOS cache tiering is enabled (and which mode)
- whether pools are replicated or erasure coded, and
  which erasure code profile plugin and parameters are in use
- how many RGW daemons, zones, and zonegroups are present; which RGW 
frontends are in use
- aggregate stats about the CRUSH map, like which algorithms are used, how
  big buckets are, how many rules are defined, and what tunables are in use

* A health warning is now generated if the average osd heartbeat ping
  time exceeds a configurable threshold for any of the intervals
  computed. The OSD computes 1 minute, 5 minute and 15 minute intervals
  with average, minimum and maximum values. New configuration option
  `mon_warn_on_slow_ping_ratio` specifies a percentage of
  `osd_heartbeat_grace` to determine the threshold. A value of zero
  disables the warning. New configuration option
  `mon_warn_on_slow_ping_time`, specified in milliseconds, overrides the
  computed value and causes a warning when OSD heartbeat pings take longer
  than the specified amount. The new admin command `ceph daemon mgr.#
  dump_osd_network [threshold]` will list all connections with a
  ping time longer than the specified threshold or the value determined by
  the config options, for the average of any of the 3 intervals. The new
  admin command `ceph daemon osd.# dump_osd_network [threshold]` will do
  the same but only include heartbeats initiated by the specified OSD.
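
  For example (daemon names and the 100 ms threshold are placeholders)::

ceph daemon mgr.x dump_osd_network 100
ceph daemon osd.5 dump_osd_network 100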

* The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap
  balancing has been removed. Instead use the mgr balancer config
  `upmap_max_deviation` which now is an integer number of PGs of
  deviation from the target PGs per OSD. This can be set with a command
  like `ceph config set mgr mgr/balancer/upmap_max_deviation 2`. The
  default `upmap_max_deviation` is 1. There are situations where crush
  rules would not allow a pool to ever have completely balanced PGs. For
  example, if crush requires 1 replica on each of 3 racks, but there are
  fewer OSDs in 1 of the racks. In those cases, the configuration value
  can be increased.


Getting Ceph


* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-12.2.13.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 584a20eb0237c657dc0567da126be145106aa47e

[1]: 
https://lists.fedoraproject.org/archives/list/epel-annou...@lists.fedoraproject.org/message/EGUMKAIMPK2UD5VSHXM53BH2MBDGDWMO/

--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer HRB 21284 (AG Nürnberg)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] v14.2.7 Nautilus released

2020-02-03 Thread Abhishek Lekshmanan
This is the seventh update to the Ceph Nautilus release series. This is
a hotfix release primarily fixing a couple of security issues. We
recommend that all users upgrade to this release.

Notable Changes
---

* CVE-2020-1699: Fixed a path traversal flaw in Ceph dashboard that
  could allow for potential information disclosure (Ernesto Puerta)
* CVE-2020-1700: Fixed a flaw in RGW beast frontend that could lead to
  denial of service from an unauthenticated client (Or Friedmann)

Blog Link: https://ceph.io/releases/v14-2-7-nautilus-released/

Getting Ceph


* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.7.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8

-- 
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Problem with OSD - stuck in CPU loop after rbd snapshot mount

2020-02-03 Thread Jan Pekař - Imatic

Hi all,

I have a small cluster, and yesterday I tried to mount an older RBD snapshot to recover data. (I have approx. 230 daily snapshots of one RBD image 
on my small ceph.)


After I did the mount and an ls operation, the cluster was stuck, and I noticed that 2 of my OSDs ate CPU and grew in memory usage (more than 4x 
the usual memory). The cluster was unusable after that, and stopping the rbd mount operation, restarting the OSDs etc. didn't help.


The strange thing is that it affects only 2 of my OSDs: one is on NVMe and has cache PGs mapped, the other is a rotational HDD. IOPS are normal, 
only CPU and memory usage are unusual.


I noticed hundreds of thousands of slow requests (on an idle cluster), so I started to debug OSD operations. You can find the OSD 8 log below. OSD 9 
contains the nvme-cached PGs and there is nothing unusual in its standard log. In the middle of the log I raised the debug level to 10.


It looks to me like it is trying to do one operation with one snapshotted RBD object in PG 2.b 
(rbd_data.9e87fa74b0dc51.0001d1a9:7e3) in a loop.


Is there any way to stop that? What is causing the problem? Can I somehow flush the OSD's "journals" or "queues" to stop the current operation 
and make the cluster usable again? Is there any way I can access the old snapshot data?


The problem started on Luminous 12.2.12, and upgrading to 12.2.13 didn't help.

Thank you
Jan Pekar

2020-02-03 15:09:22.470277 7f51f6af7700  0 log_channel(cluster) log [WRN] : 30966 slow requests, 5 included below; oldest blocked for > 
37.505482 secs
2020-02-03 15:09:22.470336 7f51f6af7700  0 log_channel(cluster) log [WRN] : slow request 37.503404 seconds old, received at 2020-02-03 
15:08:44.956422: osd_op(osd.9.59923:2670991 2.b 2.162d8f8b (undecoded) 
ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470360 7f51f6af7700  0 log_channel(cluster) log [WRN] : slow request 37.500695 seconds old, received at 2020-02-03 
15:08:44.959132: osd_op(osd.9.59923:2671023 2.b 2.162d8f8b (undecoded) 
ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470366 7f51f6af7700  0 log_channel(cluster) log [WRN] : slow request 37.498013 seconds old, received at 2020-02-03 
15:08:44.961814: osd_op(osd.9.59923:2671055 2.b 2.162d8f8b (undecoded) 
ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470371 7f51f6af7700  0 log_channel(cluster) log [WRN] : slow request 37.494836 seconds old, received at 2020-02-03 
15:08:44.964990: osd_op(osd.9.59923:2671087 2.b 2.162d8f8b (undecoded) 
ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg
2020-02-03 15:09:22.470376 7f51f6af7700  0 log_channel(cluster) log [WRN] : slow request 37.491950 seconds old, received at 2020-02-03 
15:08:44.967877: osd_op(osd.9.59923:2671119 2.b 2.162d8f8b (undecoded) 
ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e59939) currently queued_for_pg

...
2020-02-03 15:09:35.537259 7f51ddac5700  0 osd.8 59939 do_command r=0
2020-02-03 15:09:35.588753 7f51ddac5700  0 osd.8 59939 do_command r=0 Set 
debug_osd to 10/10
2020-02-03 15:09:35.588814 7f51ddac5700  0 log_channel(cluster) log [INF] : Set 
debug_osd to 10/10
2020-02-03 15:09:35.597667 7f51f0945700  2 osd.8 59939 ms_handle_reset con 
0x5647736ac800 session 0x564785519e00
2020-02-03 15:09:35.599056 7f51e0acb700 10 osd.8 pg_epoch: 59939 pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] 
local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 crt=59908'4218227 
lcod 0'0 mlcod 0'0 active+clean]  dropping ondisk_read_lock

2020-02-03 15:09:35.599209 7f51e0acb700 10 osd.8 59939 dequeue_op 
0x5647873d5800 finish
2020-02-03 15:09:35.599250 7f51e0acb700 10 osd.8 59939 dequeue_op 0x5647873d59c0 prio 63 cost 29 latency 22.224501 
osd_op(osd.9.59923:2779946 2.b 2.162d8f8b (undecoded) ondisk+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected 
e59939) v8 pg pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 
59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 crt=59908'4218227 lcod 0'0 mlcod 0'0 active+clean]
2020-02-03 15:09:35.599293 7f51e0acb700 10 osd.8 pg_epoch: 59939 pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] 
local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 crt=59908'4218227 
lcod 0'0 mlcod 0'0 active+clean] _handle_message: 0x5647873d59c0
2020-02-03 15:09:35.599347 7f51e0acb700 10 osd.8 pg_epoch: 59939 pg[2.b( v 59908'4218227 (58731'4216723,59908'4218227] 
local-lis/les=59932/59934 n=4693 ec=1/1 lis/c 59932/59932 les/c/f 59934/59934/38237 59932/59932/59932) [8,0] r=0 lpr=59932 cr

[ceph-users] recovery_unfound

2020-02-03 Thread Jake Grimmett
Dear All,

Due to a mistake in my "rolling restart" script, one of our ceph
clusters now has a number of unfound objects:

There is an 8+2 erasure-coded data pool and a 3x replicated metadata pool;
all data is stored as cephfs.

root@ceph7 ceph-archive]# ceph health
HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage:
14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects
degraded (0.000%), 14 pgs degraded

"ceph health detail" gives me a handle on which pgs are affected.
e.g:
pg 5.f2f has 2 unfound objects
pg 5.5c9 has 2 unfound objects
pg 5.4c1 has 1 unfound objects
and so on...

plus more entries of this type:
  pg 5.6d is active+recovery_unfound+degraded, acting
[295,104,57,442,240,338,219,33,150,382], 1 unfound
pg 5.3fa is active+recovery_unfound+degraded, acting
[343,147,21,131,315,63,214,365,264,437], 2 unfound
pg 5.41d is active+recovery_unfound+degraded, acting
[20,104,190,377,52,141,418,358,240,289], 1 unfound

Digging deeper into one of the bad pg, we see the oid for the two
unfound objects:

root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound
{
"num_missing": 4,
"num_unfound": 2,
"objects": [
{
"oid": {
"oid": "1000ba25e49.0207",
"key": "",
"snapid": -2,
"hash": 854007599,
"max": 0,
"pool": 5,
"namespace": ""
},
"need": "22541'3088478",
"have": "0'0",
"flags": "none",
"locations": [
"189(8)",
"263(9)"
]
},
{
"oid": {
"oid": "1000bb25a5b.0091",
"key": "",
"snapid": -2,
"hash": 3637976879,
"max": 0,
"pool": 5,
"namespace": ""
},
"need": "22541'3088476",
"have": "0'0",
"flags": "none",
"locations": [
"189(8)",
"263(9)"
]
}
],
"more": false
}


While it would be nice to recover the data, this cluster is only used
for storing backups.

As all OSDs are up and running, presumably the data blocks are
permanently lost?

If it's hard / impossible to recover the data, presumably we should now
consider using "ceph pg 5.f2f  mark_unfound_lost delete" on each
affected pg?

Finally, can we use the oid to identify the affected files?
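
(My untested assumption: the hex prefix of the oid is the file's inode number,
so something like

  find /cephfs-mountpoint -inum $((16#1000ba25e49))

might locate the first affected file, but please correct me if that is wrong.)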

best regards,

Jake

-- 
Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph positions

2020-02-03 Thread Frank R
Hi all,

I really hope this isn't seen as spam. I am looking to find a position
where I can focus on Linux storage/Ceph. If anyone is currently
looking please let me know. Linkedin profile frankritchie.

Thanks,
Frank
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephf_metadata: Large omap object found

2020-02-03 Thread Yoann Moulin
Hello,

I am seeing this message on my new ceph cluster running Nautilus. I have a cephfs 
with a copy of ~100 TB in progress.

> /var/log/ceph/artemis.log:2020-02-03 16:22:49.970437 osd.66 (osd.66) 1137 : 
> cluster [WRN] Large omap object found. Object: 
> 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 
> Size (bytes): 6691941

> /var/log/ceph/artemis-osd.66.log:2020-02-03 16:22:49.966 7fe77af62700  0 
> log_channel(cluster) log [WRN] : Large omap object found. Object: 
> 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 
> Size (bytes): 6691941

I found this thread about a similar issue in the archives of the list
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JUFYDCQ2AHFA23NFJQY743ELJHG2N5DI/

But I'm not sure what I can do in my situation: can I increase 
osd_deep_scrub_large_omap_object_key_threshold, or is that a bad idea?

Thanks for your help.

Here some useful (I guess) information:

> Filesystem  Size  Used Avail Use% Mounted on
> 10.90.37.4,10.90.37.6,10.90.37.8:/  329T   32T  297T  10% /artemis

> artemis@icitsrv5:~$ ceph -s
>   cluster:
> id: 815ea021-7839-4a63-9dc1-14f8c5feecc6
> health: HEALTH_WARN
> 1 large omap objects
>  
>   services:
> mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 2w)
> mgr: iccluster021(active, since 7h), standbys: iccluster009, iccluster023
> mds: cephfs:5 5 up:active
> osd: 120 osds: 120 up (since 5d), 120 in (since 5d)
> rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, 
> iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, 
> iccluster021.rgw0, iccluster023.rgw0)
>  
>   data:
> pools:   10 pools, 2161 pgs
> objects: 72.02M objects, 125 TiB
> usage:   188 TiB used, 475 TiB / 662 TiB avail
> pgs: 2157 active+clean
>  4active+clean+scrubbing+deep
>  
>   io:
> client:   31 KiB/s rd, 803 KiB/s wr, 31 op/s rd, 184 op/s wr

> artemis@icitsrv5:~$ ceph health detail
> HEALTH_WARN 1 large omap objects
> LARGE_OMAP_OBJECTS 1 large omap objects
> 1 large objects found in pool 'cephfs_metadata'
> Search the cluster log for 'Large omap object found' for more details.


> artemis@icitsrv5:~$ ceph fs status
> cephfs - 3 clients
> ==
> +--++--+---+---+---+
> | Rank | State  | MDS  |Activity   |  dns  |  inos |
> +--++--+---+---+---+
> |  0   | active | iccluster015 | Reqs:0 /s |  251k |  251k |
> |  1   | active | iccluster001 | Reqs:3 /s | 20.2k | 19.1k |
> |  2   | active | iccluster017 | Reqs:1 /s |  116k |  112k |
> |  3   | active | iccluster019 | Reqs:0 /s |  263k |  263k |
> |  4   | active | iccluster013 | Reqs:  123 /s | 16.3k | 16.3k |
> +--++--+---+---+---+
> +-+--+---+---+
> |   Pool  |   type   |  used | avail |
> +-+--+---+---+
> | cephfs_metadata | metadata | 13.9G |  135T |
> |   cephfs_data   |   data   | 51.3T |  296T |
> +-+--+---+---+
> +-+
> | Standby MDS |
> +-+
> +-+
> MDS version: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) 
> nautilus (stable)
> root@iccluster019:~# ceph --cluster artemis daemon osd.13 config show | grep 
> large_omap
> "osd_deep_scrub_large_omap_object_key_threshold": "20",
> "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",

> artemis@icitsrv5:~$ rados -p cephfs_metadata listxattr mds3_openfiles.0
> artemis@icitsrv5:~$ rados -p cephfs_metadata getomapheader mds3_openfiles.0
> header (42 bytes) :
>   13 00 00 00 63 65 70 68  20 66 73 20 76 6f 6c 75  |ceph fs volu|
> 0010  6d 65 20 76 30 31 31 01  01 0d 00 00 00 14 63 00  |me v011...c.|
> 0020  00 00 00 00 00 01 00 00  00 00|..|
> 002a

Best regards,

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephf_metadata: Large omap object found

2020-02-03 Thread Paul Emmerich
The warning threshold changed recently; I'd just increase it in this
particular case. It just means you have lots of open files.

I think there's some work going on to split the openfiles object into
multiple objects, so that problem will be fixed.
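
A rough example of raising it cluster-wide on Nautilus (the value is just an
example, pick something above the reported key count of 206548):

  ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000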


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 3, 2020 at 5:39 PM Yoann Moulin  wrote:
>
> Hello,
>
> I have this message on my new ceph cluster in Nautilus. I have a cephfs with 
> a copy of ~100TB in progress.
>
> > /var/log/ceph/artemis.log:2020-02-03 16:22:49.970437 osd.66 (osd.66) 1137 : 
> > cluster [WRN] Large omap object found. Object: 
> > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 
> > Size (bytes): 6691941
>
> > /var/log/ceph/artemis-osd.66.log:2020-02-03 16:22:49.966 7fe77af62700  0 
> > log_channel(cluster) log [WRN] : Large omap object found. Object: 
> > 8:579bf162:::mds3_openfiles.0:head PG: 8.468fd9ea (8.2a) Key count: 206548 
> > Size (bytes): 6691941
>
> I found this thread about a similar issue in the archives of the list
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JUFYDCQ2AHFA23NFJQY743ELJHG2N5DI/
>
> But I'm not sure what I can do in my situation: can I increase 
> osd_deep_scrub_large_omap_object_key_threshold, or is that a bad idea?
>
> Thanks for your help.
>
> Here some useful (I guess) information:
>
> > Filesystem  Size  Used Avail Use% Mounted on
> > 10.90.37.4,10.90.37.6,10.90.37.8:/  329T   32T  297T  10% /artemis
>
> > artemis@icitsrv5:~$ ceph -s
> >   cluster:
> > id: 815ea021-7839-4a63-9dc1-14f8c5feecc6
> > health: HEALTH_WARN
> > 1 large omap objects
> >
> >   services:
> > mon: 3 daemons, quorum iccluster003,iccluster005,iccluster007 (age 2w)
> > mgr: iccluster021(active, since 7h), standbys: iccluster009, 
> > iccluster023
> > mds: cephfs:5 5 up:active
> > osd: 120 osds: 120 up (since 5d), 120 in (since 5d)
> > rgw: 8 daemons active (iccluster003.rgw0, iccluster005.rgw0, 
> > iccluster007.rgw0, iccluster013.rgw0, iccluster015.rgw0, iccluster019.rgw0, 
> > iccluster021.rgw0, iccluster023.rgw0)
> >
> >   data:
> > pools:   10 pools, 2161 pgs
> > objects: 72.02M objects, 125 TiB
> > usage:   188 TiB used, 475 TiB / 662 TiB avail
> > pgs: 2157 active+clean
> >  4active+clean+scrubbing+deep
> >
> >   io:
> > client:   31 KiB/s rd, 803 KiB/s wr, 31 op/s rd, 184 op/s wr
>
> > artemis@icitsrv5:~$ ceph health detail
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool 'cephfs_metadata'
> > Search the cluster log for 'Large omap object found' for more details.
>
>
> > artemis@icitsrv5:~$ ceph fs status
> > cephfs - 3 clients
> > ==
> > +--++--+---+---+---+
> > | Rank | State  | MDS  |Activity   |  dns  |  inos |
> > +--++--+---+---+---+
> > |  0   | active | iccluster015 | Reqs:0 /s |  251k |  251k |
> > |  1   | active | iccluster001 | Reqs:3 /s | 20.2k | 19.1k |
> > |  2   | active | iccluster017 | Reqs:1 /s |  116k |  112k |
> > |  3   | active | iccluster019 | Reqs:0 /s |  263k |  263k |
> > |  4   | active | iccluster013 | Reqs:  123 /s | 16.3k | 16.3k |
> > +--++--+---+---+---+
> > +-+--+---+---+
> > |   Pool  |   type   |  used | avail |
> > +-+--+---+---+
> > | cephfs_metadata | metadata | 13.9G |  135T |
> > |   cephfs_data   |   data   | 51.3T |  296T |
> > +-+--+---+---+
> > +-+
> > | Standby MDS |
> > +-+
> > +-+
> > MDS version: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) 
> > nautilus (stable)
> > root@iccluster019:~# ceph --cluster artemis daemon osd.13 config show | 
> > grep large_omap
> > "osd_deep_scrub_large_omap_object_key_threshold": "20",
> > "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
>
> > artemis@icitsrv5:~$ rados -p cephfs_metadata listxattr mds3_openfiles.0
> > artemis@icitsrv5:~$ rados -p cephfs_metadata getomapheader mds3_openfiles.0
> > header (42 bytes) :
> >   13 00 00 00 63 65 70 68  20 66 73 20 76 6f 6c 75  |ceph fs 
> > volu|
> > 0010  6d 65 20 76 30 31 31 01  01 0d 00 00 00 14 63 00  |me 
> > v011...c.|
> > 0020  00 00 00 00 00 01 00 00  00 00|..|
> > 002a
>
> Best regards,
>
> --
> Yoann Moulin
> EPFL IC-IT
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list

[ceph-users] Re: recovery_unfound

2020-02-03 Thread Paul Emmerich
This might be related to recent problems with OSDs not being queried
for unfound objects properly in some cases (which I think was fixed in
master?)

Anyways: run "ceph pg <pgid> query" on the affected PGs, check for "might
have unfound" and try restarting the OSDs mentioned there. Probably
also sufficient to just run "ceph osd down" on the primaries of the
affected PGs to get them to re-check.
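
Roughly (pg and osd ids taken from your mail, so double-check them):

  ceph pg 5.f2f query | grep -A5 might_have_unfound
  ceph osd down 295    # primary of pg 5.6d according to "ceph health detail"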


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 3, 2020 at 4:27 PM Jake Grimmett  wrote:
>
> Dear All,
>
> Due to a mistake in my "rolling restart" script, one of our ceph
> clusters now has a number of unfound objects:
>
> There is an 8+2 erasure encoded data pool, 3x replicated metadata pool,
> all data is stored as cephfs.
>
> root@ceph7 ceph-archive]# ceph health
> HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage:
> 14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects
> degraded (0.000%), 14 pgs degraded
>
> "ceph health detail" gives me a handle on which pgs are affected.
> e.g:
> pg 5.f2f has 2 unfound objects
> pg 5.5c9 has 2 unfound objects
> pg 5.4c1 has 1 unfound objects
> and so on...
>
> plus more entries of this type:
>   pg 5.6d is active+recovery_unfound+degraded, acting
> [295,104,57,442,240,338,219,33,150,382], 1 unfound
> pg 5.3fa is active+recovery_unfound+degraded, acting
> [343,147,21,131,315,63,214,365,264,437], 2 unfound
> pg 5.41d is active+recovery_unfound+degraded, acting
> [20,104,190,377,52,141,418,358,240,289], 1 unfound
>
> Digging deeper into one of the bad pg, we see the oid for the two
> unfound objects:
>
> root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound
> {
> "num_missing": 4,
> "num_unfound": 2,
> "objects": [
> {
> "oid": {
> "oid": "1000ba25e49.0207",
> "key": "",
> "snapid": -2,
> "hash": 854007599,
> "max": 0,
> "pool": 5,
> "namespace": ""
> },
> "need": "22541'3088478",
> "have": "0'0",
> "flags": "none",
> "locations": [
> "189(8)",
> "263(9)"
> ]
> },
> {
> "oid": {
> "oid": "1000bb25a5b.0091",
> "key": "",
> "snapid": -2,
> "hash": 3637976879,
> "max": 0,
> "pool": 5,
> "namespace": ""
> },
> "need": "22541'3088476",
> "have": "0'0",
> "flags": "none",
> "locations": [
> "189(8)",
> "263(9)"
> ]
> }
> ],
> "more": false
> }
>
>
> While it would be nice to recover the data, this cluster is only used
> for storing backups.
>
> As all OSD are up and running, presumably the data blocks are
> permanently lost?
>
> If it's hard / impossible to recover the data, presumably we should now
> consider using "ceph pg 5.f2f  mark_unfound_lost delete" on each
> affected pg?
>
> Finally, can we use the oid to identify the affected files?
>
> best regards,
>
> Jake
>
> --
> Jake Grimmett
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.6

2020-02-03 Thread jbardgett
Does anyone have access to libibverbs-debuginfo-22.1-3.el7.x86_64 and 
librdmacm-debuginfo-22.1-3.el7.x86_64?  I cannot find them in any repo list out 
there and the gdbpmp.py requires them.

Thanks,
Joe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ubuntu 18.04.4 Ceph 12.2.12

2020-02-03 Thread Atherion
So now that 12.2.13 has been released, I will have a mixed environment if I 
use the Ubuntu 18.04 repo's 12.2.12.

I also found there is a docker container (https://hub.docker.com/r/ceph/daemon); I 
could potentially just use the container to run the version I need. Wondering 
if anyone has done this in production?

Managing the Ubuntu repos for ceph has not been easy, to say the least :(
I found this ticket, but it looks dead: https://tracker.ceph.com/issues/24326
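
For what it's worth, my current plan for keeping the version consistent across
hosts is apt pinning, roughly like this (untested):

  # /etc/apt/preferences.d/ceph
  Package: ceph*
  Pin: version 12.2.12*
  Pin-Priority: 1001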

‐‐‐ Original Message ‐‐‐
On Friday, January 24, 2020 1:12 PM, Anthony D'Atri  wrote:

> I applied those packages for the same reason on a staging cluster and so far 
> so good.
>
>> On Jan 24, 2020, at 9:15 AM, Atherion  wrote:
>
>> 
>> Hi Ceph Community.
>> We currently have a luminous cluster running and some machines still on 
>> Ubuntu 14.04
>> We are looking to upgrade these machines to 18.04 but the only upgrade path 
>> for luminous with the ceph repo is through 16.04.
>> It is doable to get to Mimic, but then we have to upgrade all those machines 
>> to 16.04 and then upgrade again to 18.04 once we are on Mimic; it is becoming 
>> a huge time sink.
>>
>> I did notice that the Ubuntu repos have added 12.2.12 in the 18.04.4 release. 
>> Is this a reliable build we can use?
>> https://ubuntu.pkgs.org/18.04/ubuntu-proposed-main-amd64/ceph_12.2.12-0ubuntu0.18.04.4_amd64.deb.html
>> If so then we can go straight to 18.04.4 and not waste so much time.
>>
>> Best
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Understanding Bluestore performance characteristics

2020-02-03 Thread Bradley Kite
Hi,

We have a production cluster of 27 OSDs across 5 servers (all SSDs
running bluestore), and have started to notice a possible performance issue.

In order to isolate the problem, we built a single server with a single
OSD, and ran a few FIO tests. The results are puzzling, not that we were
expecting good performance on a single OSD.

In short, with a sequential write test, we are seeing huge numbers of reads
hitting the actual SSD.

Key FIO parameters are:

[global]
pool=benchmarks
rbdname=disk-1
direct=1
numjobs=2
iodepth=1
blocksize=4k
group_reporting=1
[writer]
readwrite=write

iostat results are:
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
nvme0n1   0.00   105.00 4896.00  294.00 312080.00  1696.00   120.92
   17.253.353.550.02   0.02  12.60

There are nearly 5000 reads/second (~300 MB/s), compared with only ~300
writes (~1.5 MB/s), even though we are running a sequential write test. The
system is otherwise idle, with no other workload.

Running the same fio test with only 1 thread (numjobs=1) still shows a high
number of reads (110).

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
nvme0n1   0.00  1281.00  110.00 1463.00   440.00 12624.0016.61
0.030.020.050.02   0.02   3.40

Can anyone kindly offer any comments on why we are seeing this behaviour?

I can understand the occasional read here and there if RocksDB/WAL entries
need to be read from disk during the sequential write test, but this seems
significantly high and unusual.
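
If it helps, I can also pull the BlueFS/RocksDB counters from the OSD to see
whether the reads are compaction traffic, e.g. (osd id is just the one on the
test box):

  ceph daemon osd.0 perf dump bluefs
  ceph daemon osd.0 perf dump rocksdb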

FIO results (numjobs=2)
writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=rbd, iodepth=1
...
fio-3.7
Starting 2 processes
Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta
01m:00s]
writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
  write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
clat (msec): min=2, max=210, avg=58.32, stdev=70.54
 lat (msec): min=2, max=210, avg=58.35, stdev=70.54
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[3],
 | 30.00th=[3], 40.00th=[3], 50.00th=[   54], 60.00th=[   62],
 | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
 | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
 | 99.99th=[  211]
   bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22,
samples=239
   iops: min=6, max=   36, avg=16.97, stdev= 9.55, samples=239
  lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
  cpu  : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
  IO depths: 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s),
io=8228KiB (8425kB), run=60038-60038msec
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph positions

2020-02-03 Thread Martin Verges
Hello Frank,

we are always looking for Ceph/Linux consultants.

--
Martin Verges
Managing director

Hint: Secure one of the last slots in the upcoming 4-day Ceph Intensive
Training at https://croit.io/training/4-days-ceph-in-depth-training.

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


Am Mo., 3. Feb. 2020 um 17:26 Uhr schrieb Frank R :

> Hi all,
>
> I really hope this isn't seen as spam. I am looking to find a position
> where I can focus on Linux storage/Ceph. If anyone is currently
> looking please let me know. Linkedin profile frankritchie.
>
> Thanks,
> Frank
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io