Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-15 Thread Stefan Priebe - Profihost AG
Hello list,

I also tested the current upstream/luminous branch and it happens there as
well. A clean install works fine; it only happens on upgraded bluestore OSDs.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:
> While trying to upgrade a cluster from 12.2.8 to 12.2.10 I'm experiencing
> issues with bluestore OSDs - so I canceled the upgrade and all bluestore
> OSDs are stopped now.
> 
> After starting a bluestore OSD I'm seeing a lot of slow requests caused
> by very high read rates.
> 
> 
> Device:  rrqm/s   wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda       45,00   187,00  767,00   39,00 482040,00  8660,00  1217,62    58,16   74,60   73,85   89,23   1,24 100,00
> 
> It reads permanently at 500 MB/s from the disk and can't service client
> requests. The overall client read rate is at only 10.9 MiB/s.
> 
> I can't reproduce this with 12.2.8. Is this a known bug / regression?
> 
> Greets,
> Stefan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-osd processes restart during Luminous -> Mimic upgrade on CentOS 7

2019-01-15 Thread Wido den Hollander
Hi,

I'm in the middle of upgrading a 12.2.8 cluster to 13.2.4 and I've
noticed that during the Yum/RPM upgrade the OSDs are being restarted.

Jan 15 11:24:25 x yum[2348259]: Updated: 2:ceph-base-13.2.4-0.el7.x86_64
Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
start/stop all ceph*@.service instances at once.
Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
start/stop all ceph-osd@.service instances at once.
Jan 15 11:24:47 x systemd[1]: Stopping Ceph object storage daemon
osd.267...


Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
osd.143.
Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
osd.1156.
Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
start/stop all ceph-osd@.service instances at once.
Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
start/stop all ceph*@.service instances at once.
Jan 15 11:24:54 x yum[2348259]: Updated:
2:ceph-selinux-13.2.4-0.el7.x86_64
Jan 15 11:24:59 x yum[2348259]: Updated: 2:ceph-osd-13.2.4-0.el7.x86_64

In /etc/sysconfig/ceph there is CEPH_AUTO_RESTART_ON_UPGRADE=no

So this makes me wonder: what causes the OSDs to be restarted after the
package upgrade, given that we are not allowing this restart?

Checking ceph.spec.in in both the Luminous and Mimic branches, I can't
find a good reason why this is happening, because it checks for
'CEPH_AUTO_RESTART_ON_UPGRADE', which isn't set to 'yes'.

In addition, ceph.spec.in never restarts 'ceph.target' which is being
restarted.

Could it be that the SELinux upgrade initiates the restart of these daemons?

CentOS Linux release 7.6.1810 (Core)
Luminous 12.2.8
Mimic 13.2.4

Has anybody seen this before?

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd processes restart during Luminous -> Mimic upgrade on CentOS 7

2019-01-15 Thread Dan van der Ster
Hi Wido,

`rpm -q --scripts ceph-selinux` will tell you why.

It was the same from 12.2.8 to 12.2.10: http://tracker.ceph.com/issues/21672

And the problem is worse than you described, because the daemons are
even restarted before all the package files have been updated.

Our procedure on these upgrades is systemctl stop ceph.target; yum
update; systemctl start ceph.target (or ceph-volume lvm activate
--all).
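
Per host that boils down to something like the following (just a sketch --
the noout step is optional and the unit/package names assume a CentOS 7
RPM install):

    ceph osd set noout              # optional: avoid rebalancing while the host is down
    systemctl stop ceph.target      # stop all Ceph daemons on this host first
    yum update                      # any scriptlet-triggered restart now hits already-stopped daemons
    systemctl start ceph.target     # start the upgraded daemons
    # or, for OSDs deployed with ceph-volume:
    # ceph-volume lvm activate --all
    ceph osd unset noout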

Cheers, Dan

On Tue, Jan 15, 2019 at 11:33 AM Wido den Hollander  wrote:
>
> Hi,
>
> I'm in the middle of upgrading a 12.2.8 cluster to 13.2.4 and I've
> noticed that during the Yum/RPM upgrade the OSDs are being restarted.
>
> Jan 15 11:24:25 x yum[2348259]: Updated: 2:ceph-base-13.2.4-0.el7.x86_64
> Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
> start/stop all ceph*@.service instances at once.
> Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
> start/stop all ceph-osd@.service instances at once.
> Jan 15 11:24:47 x systemd[1]: Stopping Ceph object storage daemon
> osd.267...
> 
> 
> Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
> osd.143.
> Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
> osd.1156.
> Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
> start/stop all ceph-osd@.service instances at once.
> Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
> start/stop all ceph*@.service instances at once.
> Jan 15 11:24:54 x yum[2348259]: Updated:
> 2:ceph-selinux-13.2.4-0.el7.x86_64
> Jan 15 11:24:59 x yum[2348259]: Updated: 2:ceph-osd-13.2.4-0.el7.x86_64
>
> In /etc/sysconfig/ceph there is CEPH_AUTO_RESTART_ON_UPGRADE=no
>
> So this makes me wonder, what causes the OSDs to be restarted after the
> package upgrade as we are not allowing this restart.
>
> Checking ceph.spec.in in both the Luminous and Mimic branch I can't
> find a good reason why this is happening because it checks for
> 'CEPH_AUTO_RESTART_ON_UPGRADE' which isn't set to 'yes'.
>
> In addition, ceph.spec.in never restarts 'ceph.target' which is being
> restarted.
>
> Could it be that the SELinux upgrade initiates the restart of these daemons?
>
> CentOS Linux release 7.6.1810 (Core)
> Luminous 12.2.8
> Mimic 13.2.4
>
> Has anybody seen this before?
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd processes restart during Luminous -> Mimic upgrade on CentOS 7

2019-01-15 Thread Wido den Hollander



On 1/15/19 11:39 AM, Dan van der Ster wrote:
> Hi Wido,
> 
> `rpm -q --scripts ceph-selinux` will tell you why.
> 
> It was the same from 12.2.8 to 12.2.10: http://tracker.ceph.com/issues/21672
> 

Thanks for pointing it out!

> And the problem is worse than you described, because the daemons are
> even restarted before all the package files have been updated.
> 
> Our procedure on these upgrades is systemctl stop ceph.target; yum
> update; systemctl start ceph.target (or ceph-volume lvm activate
> --all).
> 

Yes, I was thinking about it as well. I'm just not used to having
daemons suddenly restart.

SELinux is set to Permissive mode anyway, so why restart the daemons
while we are running in Permissive mode?

I'll update the ticket with some feedback about this as this is not what
I (and it seems other users) expect.

Wido

> Cheers, Dan
> 
> On Tue, Jan 15, 2019 at 11:33 AM Wido den Hollander  wrote:
>>
>> Hi,
>>
>> I'm in the middle of upgrading a 12.2.8 cluster to 13.2.4 and I've
>> noticed that during the Yum/RPM upgrade the OSDs are being restarted.
>>
>> Jan 15 11:24:25 x yum[2348259]: Updated: 2:ceph-base-13.2.4-0.el7.x86_64
>> Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
>> start/stop all ceph*@.service instances at once.
>> Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
>> start/stop all ceph-osd@.service instances at once.
>> Jan 15 11:24:47 x systemd[1]: Stopping Ceph object storage daemon
>> osd.267...
>> 
>> 
>> Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
>> osd.143.
>> Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
>> osd.1156.
>> Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
>> start/stop all ceph-osd@.service instances at once.
>> Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
>> start/stop all ceph*@.service instances at once.
>> Jan 15 11:24:54 x yum[2348259]: Updated:
>> 2:ceph-selinux-13.2.4-0.el7.x86_64
>> Jan 15 11:24:59 x yum[2348259]: Updated: 2:ceph-osd-13.2.4-0.el7.x86_64
>>
>> In /etc/sysconfig/ceph there is CEPH_AUTO_RESTART_ON_UPGRADE=no
>>
>> So this makes me wonder, what causes the OSDs to be restarted after the
>> package upgrade as we are not allowing this restart.
>>
>> Checking ceph.spec.in in both the Luminous and Mimic branch I can't
>> find a good reason why this is happening because it checks for
>> 'CEPH_AUTO_RESTART_ON_UPGRADE' which isn't set to 'yes'.
>>
>> In addition, ceph.spec.in never restarts 'ceph.target' which is being
>> restarted.
>>
>> Could it be that the SELinux upgrade initiates the restart of these daemons?
>>
>> CentOS Linux release 7.6.1810 (Core)
>> Luminous 12.2.8
>> Mimic 13.2.4
>>
>> Has anybody seen this before?
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] samsung sm863 vs cephfs rep.1 pool performance

2019-01-15 Thread Marc Roos

Is this result to be expected from CephFS, when comparing to a native
SSD speed test?

 
                  Cephfs ssd rep. 3         Cephfs ssd rep. 1         Samsung MZK7KM480 480GB
                  lat    iops    kB/s       lat    iops    kB/s       lat    iops    kB/s
4k r ran.         2.78   1781    7297       0.54   1809    7412       0.09   10.2k   41600
4k w ran.         1.42   700     2871       0.8    1238    5071       0.05   17.9k   73200

                  lat    iops    MB/s       lat    iops    MB/s       lat    iops    MB/s
4k r seq.         0.29   3314    13.6       0.29   3325    13.6       0.05   18k     77.6
4k w seq.         0.04   889     3.64       0.56   1761    7.21       0.05   18.3k   75.1
1024k r ran.      4.3    231     243        4.27   233     245        2.06   482     506
1024k w ran.      0.08   132     139        4.34   229     241        2.16   460     483
1024k r seq.      4.23   235     247        4.21   236     248        1.98   502     527
1024k w seq.      6.99   142     150        4.34   229     241        2.13   466     489


(4 nodes, CentOS7, luminous)
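
(For reference, a single column of the table -- e.g. 4k random write at
queue depth 1 -- can be reproduced with a fio job roughly like the one
below. This is only a sketch: the file path, size and runtime are
placeholders, not the exact job behind the numbers above.)

    fio --name=4k-randwrite --filename=/mnt/cephfs/fio.test --size=4G \
        --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
        --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting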



  






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-15 Thread Marc Roos
 
I upgraded this weekend from 12.2.8 to 12.2.10 without such issues 
(the OSDs are idle).




-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag] 
Sent: 15 January 2019 10:26
To: ceph-users@lists.ceph.com
Cc: n.fahldi...@profihost.ag
Subject: Re: [ceph-users] slow requests and high i/o / read rate on 
bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

i also tested current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:
> while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm 
experience
> issues with bluestore osds - so i canceled the upgrade and all 
bluestore
> osds are stopped now.
> 
> After starting a bluestore osd i'm seeing a lot of slow requests 
caused
> by very high read rates.
> 
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda  45,00   187,00  767,00   39,00 482040,00  8660,00
> 1217,6258,16   74,60   73,85   89,23   1,24 100,00
> 
> it reads permanently with 500MB/s from the disk and can't service 
client
> requests. Overall client read rate is at 10.9MiB/s rd
> 
> I can't reproduce this with 12.2.8. Is this a known bug / regression?
> 
> Greets,
> Stefan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best practice creating pools / rbd images

2019-01-15 Thread Thomas
Hi,
 
My use case for Ceph is to serve as central backup storage.
This means I will back up multiple databases to the Ceph storage cluster.
 
This is my question:
What is the best practice for creating pools & images?
Should I create multiple pools, i.e. one pool per database?
Or should I create a single pool "backup" and multiple RBD images, i.e.
one RBD image per database?
Or should I create a single pool "backup" and a single RBD image "db"?
 
This is the security demand that should be considered:
DB-owner A can only modify the files that belong to A; other files
(owned by B, C or D) are accessible for A.
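
(To make the first option concrete, here is a sketch with made-up pool and
client names -- one pool and one RBD image per database owner, each owner
getting a cephx key that is restricted to its own pool:)

    # one pool and one image per database owner
    ceph osd pool create backup-dba 64
    ceph osd pool application enable backup-dba rbd
    rbd create backup-dba/db --size 1024G

    # a cephx key that can only access that pool
    ceph auth get-or-create client.backup-dba \
        mon 'profile rbd' \
        osd 'profile rbd pool=backup-dba'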
 
 
THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds0: Metadata damage detected

2019-01-15 Thread Yan, Zheng
On Tue, Jan 15, 2019 at 3:51 PM Sergei Shvarts  wrote:
>
> Hello ceph users!
>
> A couple of days ago I've got a ceph health error - mds0: Metadata damage 
> detected.
> Overall ceph cluster is fine: all pgs are clean, all osds are up and in, no 
> big problems.
> Looks like there is not much information regarding this class of issues, so 
> I'm writing this message and hope somebody can help me.
>
> here is the damage itself
> ceph tell mds.0 damage ls
> 2019-01-15 07:47:04.651317 7f48c9813700  0 client.312845186 ms_handle_reset 
> on 192.168.0.5:6801/1186631878
> 2019-01-15 07:47:04.656991 7f48ca014700  0 client.312845189 ms_handle_reset 
> on 192.168.0.5:6801/1186631878
> [{"damage_type":"dir_frag","id":3472877204,"ino":1100954978087,"frag":"*","path":"\/public\/video\/3h\/3hG6X7\/screen-msmall"}]
>

Looks like object 1005607c727. in the cephfs metadata pool is
corrupted. Please run the following commands and send the mds.0 log to us:

ceph tell mds.0 injectargs '--debug_mds 10'
ceph tell mds.0 damage rm 3472877204
ls /public/video/3h/3hG6X7/screen-msmall
ceph tell mds.0 injectargs '--debug_mds 0'

Regards
Yan, Zheng

> Best regards,
> Sergei Shvarts
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-15 Thread Andras Pataki
An update on our cephfs kernel client troubles.  After doing some 
heavier testing with a newer kernel 4.19.13, it seems like it also gets 
into a bad state when it can't connect to monitors (all back end 
processes are on 12.2.8):


Jan 15 08:49:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
established

Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session 
lost, hunting for new mon
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
established

Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session 
lost, hunting for new mon
Jan 15 08:49:06 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
established

Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 session 
lost, hunting for new mon

... repeating forever ...

# uname -r
4.19.13

and on the mon node (10.128.150.10) at log level 20, I see that it is 
building/encoding a lot of maps (10.128.36.18 is the client in question):
2019-01-15 08:49:01.355017 7fffee40c700 10 mon.cephmon00@0(leader) e40 
_ms_dispatch new session 0x62dc6c00 MonSession(client.36800361 
10.128.36.18:0/2911716500 is open , features 0x27018fb86aa42ada (jewel)) 
features 0x27018fb86aa42ada
2019-01-15 08:49:01.355021 7fffee40c700 20 mon.cephmon00@0(leader) e40  
caps
2019-01-15 08:49:01.355026 7fffee40c700 10 mon.cephmon00@0(leader).auth 
v58457 preprocess_query auth(proto 0 34 bytes epoch 0) from 
client.36800361 10.128.36.18:0/2911716500


2019-01-15 08:49:01.355817 7fffee40c700 10 mon.cephmon00@0(leader).osd 
e1254390 check_osdmap_sub 0x65373340 next 1254102 (onetime)
2019-01-15 08:49:01.355819 7fffee40c700  5 mon.cephmon00@0(leader).osd 
e1254390 send_incremental [1254102..1254390] to client.36800361 
10.128.36.18:0/2911716500
2019-01-15 08:49:01.355821 7fffee40c700 10 mon.cephmon00@0(leader).osd 
e1254390 build_incremental [1254102..1254141] with features 27018fb86aa42ada
2019-01-15 08:49:01.364859 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254141 with features 504412504116439552
2019-01-15 08:49:01.372131 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254141 1237271 bytes
2019-01-15 08:49:01.372180 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254140 with features 504412504116439552
2019-01-15 08:49:01.372187 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254140 260 bytes
2019-01-15 08:49:01.380981 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254139 with features 504412504116439552
2019-01-15 08:49:01.387983 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254139 1237351 bytes
2019-01-15 08:49:01.388043 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254138 with features 504412504116439552
2019-01-15 08:49:01.388049 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254138 232 bytes
2019-01-15 08:49:01.396781 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254137 with features 504412504116439552

 ... a lot more of similar messages
2019-01-15 08:49:04.210936 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 reencode_incremental_map 1254382 with features 504412504116439552
2019-01-15 08:49:04.211032 7fffee40c700 20 mon.cephmon00@0(leader).osd 
e1254390 build_incremental    inc 1254382 232 bytes
2019-01-15 08:49:04.211066 7fffee40c700 10 mon.cephmon00@0(leader) e40 
ms_handle_reset 0x6450f800 10.128.36.18:0/2911716500
2019-01-15 08:49:04.211070 7fffee40c700 10 mon.cephmon00@0(leader) e40 
reset/close on session client.36800361 10.128.36.18:0/2911716500
2019-01-15 08:49:04.211073 7fffee40c700 10 mon.cephmon00@0(leader) e40 
remove_session 0x62dc6c00 client.36800361 10.128.36.18:0/2911716500 
features 0x27018fb86aa42ada


It looks like the client disconnects (either from waiting too long, or due 
to some protocol error?).  Any hints on why so many maps need to be 
re-encoded (to jewel), or how to improve this behavior, would be much 
appreciated.  We would really be interested in using the kernel client 
instead of fuse, but this seems to be a stumbling block.


Thanks,

Andras


On 1/3/19 6:49 AM, Andras Pataki wrote:
I wonder if anyone could offer any insight on the issue below, 
regarding the CentOS 7.6 kernel cephfs client connecting to a Luminous 
cluster.  I have since tried a much newer 4.19.13 kernel, which did 
not show the same issue (but 

Re: [ceph-users] rocksdb mon stores growing until restart

2019-01-15 Thread Dan van der Ster
On Wed, Sep 19, 2018 at 7:01 PM Bryan Stillwell  wrote:
>
> > On 08/30/2018 11:00 AM, Joao Eduardo Luis wrote:
> > > On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> > > Hi,
> > > Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> > > eventually triggering the 'mon is using a lot of disk space' warning?
> > > Since upgrading to luminous, we've seen this happen at least twice.
> > > Each time, we restart all the mons and then stores slowly trim down to
> > > <500MB. We have 'mon compact on start = true', but it's not the
> > > compaction that's shrinking the rockdb's -- the space used seems to
> > > decrease over a few minutes only after *all* mons have been restarted.
> > > This reminds me of a hammer-era issue where references to trimmed maps
> > > were leaking -- I can't find that bug at the moment, though.
> >
> > Next time this happens, mind listing the store contents and check if you
> > are holding way too many osdmaps? You shouldn't be holding more osdmaps
> > than the default IF the cluster is healthy and all the pgs are clean.
> >
> > I've chased a bug pertaining this last year, even got a patch, but then
> > was unable to reproduce it. Didn't pursue merging the patch any longer
> > (I think I may still have an open PR for it though), simply because it
> > was no longer clear if it was needed.
>
> I just had this happen to me while using ceph-gentle-split on a 12.2.5
> cluster with 1,370 OSDs.  Unfortunately, I restarted the mon nodes which
> fixed the problem before finding this thread.  I'm only halfway done
> with the split, so I'll see if the problem resurfaces again.
>

I think I've understood what's causing this -- it's related to the
issue we've seen where osdmaps are not being trimmed on OSDs.
It seems that once oldest_map and newest_map are within 500 of each
other, they are never trimmed again until the mons are restarted.

I updated this tracker: http://tracker.ceph.com/issues/37875
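
A quick way to see how wide the osdmap range held by the mons currently is
(a sketch; assumes jq is installed):

    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

On a healthy cluster the difference should stay near the trim window
(roughly 500-750 epochs); a much larger gap that never shrinks is the
symptom described above.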

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-15 Thread Stefan Priebe - Profihost AG


On 15.01.19 at 12:45, Marc Roos wrote:
>  
> I upgraded this weekend from 12.2.8 to 12.2.10 without such issues 
> (osd's are idle)


It turns out this was a kernel bug. Updating to a newer kernel has
solved this issue.

Greets,
Stefan


> -Original Message-
> From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag] 
> Sent: 15 January 2019 10:26
> To: ceph-users@lists.ceph.com
> Cc: n.fahldi...@profihost.ag
> Subject: Re: [ceph-users] slow requests and high i/o / read rate on 
> bluestore osds after upgrade 12.2.8 -> 12.2.10
> 
> Hello list,
> 
> i also tested current upstream/luminous branch and it happens as well. A
> clean install works fine. It only happens on upgraded bluestore osds.
> 
> Greets,
> Stefan
> 
On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:
>> while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm 
> experience
>> issues with bluestore osds - so i canceled the upgrade and all 
> bluestore
>> osds are stopped now.
>>
>> After starting a bluestore osd i'm seeing a lot of slow requests 
> caused
>> by very high read rates.
>>
>>
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda  45,00   187,00  767,00   39,00 482040,00  8660,00
>> 1217,6258,16   74,60   73,85   89,23   1,24 100,00
>>
>> it reads permanently with 500MB/s from the disk and can't service 
> client
>> requests. Overall client read rate is at 10.9MiB/s rd
>>
>> I can't reproduce this with 12.2.8. Is this a known bug / regression?
>>
>> Greets,
>> Stefan
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests and high i/o / read rate on bluestore osds after upgrade 12.2.8 -> 12.2.10

2019-01-15 Thread Mark Nelson


On 1/15/19 9:02 AM, Stefan Priebe - Profihost AG wrote:

Am 15.01.19 um 12:45 schrieb Marc Roos:
  
I upgraded this weekend from 12.2.8 to 12.2.10 without such issues

(osd's are idle)


it turns out this was a kernel bug. Updating to a newer kernel - has
solved this issue.

Greets,
Stefan



Hi Stefan, can you tell me what kernel you were on and what hardware was 
involved?  I want to make sure that it's recorded for the community in 
case others run into the same issue.



Thanks,

Mark






-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
Sent: 15 January 2019 10:26
To: ceph-users@lists.ceph.com
Cc: n.fahldi...@profihost.ag
Subject: Re: [ceph-users] slow requests and high i/o / read rate on
bluestore osds after upgrade 12.2.8 -> 12.2.10

Hello list,

i also tested current upstream/luminous branch and it happens as well. A
clean install works fine. It only happens on upgraded bluestore osds.

Greets,
Stefan

On 14.01.19 at 20:35, Stefan Priebe - Profihost AG wrote:

while trying to upgrade a cluster from 12.2.8 to 12.2.10 i'm

experience

issues with bluestore osds - so i canceled the upgrade and all

bluestore

osds are stopped now.

After starting a bluestore osd i'm seeing a lot of slow requests

caused

by very high read rates.


Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda  45,00   187,00  767,00   39,00 482040,00  8660,00
1217,6258,16   74,60   73,85   89,23   1,24 100,00

it reads permanently with 500MB/s from the disk and can't service

client

requests. Overall client read rate is at 10.9MiB/s rd

I can't reproduce this with 12.2.8. Is this a known bug / regression?

Greets,
Stefan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recommendations for sharing a file system to a heterogeneous client network?

2019-01-15 Thread Ketil Froyn
Hi,

I'm pretty new to Ceph - pardon the newbie question. I've done a bit
of reading and searching, but I haven't seen an answer to this yet.

Is anyone using ceph to power a filesystem shared among a network of
Linux, Windows and Mac clients? How have you set it up? Is there a
mature Windows driver for CephFS? If not, are you using Samba/CIFS on
top of CephFS, or Samba/CIFS on top of a large RBD volume? Or
something else entirely? I'm looking for something scalable that
supports AD integration and ACL access control.

Are there any recommendations for this? Any pointers would be appreciated

Regards, Ketil
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommendations for sharing a file system to a heterogeneous client network?

2019-01-15 Thread Robert Sander
Hi Ketil,

use Samba/CIFS with multiple gateway machines clustered with CTDB.
CephFS can be mounted with Posix ACL support.
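
A minimal sketch of the gateway side (share name, cephx user and paths are
examples only; the CTDB clustering configuration itself is left out here):

    # mount CephFS with POSIX ACL support on the gateway
    mount -t ceph mon1:6789:/ /cephfs -o name=samba,secretfile=/etc/ceph/samba.secret,acl

    # smb.conf share on top of the CephFS mount
    [projects]
        path = /cephfs/projects
        read only = no
        vfs objects = acl_xattr
        map acl inherit = yes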

Slides from my last Ceph day talk are available here:
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-unlimited-fileserver-with-samba-ctdb-and-cephfs

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH_FSAL Nfs-ganesha

2019-01-15 Thread Patrick Donnelly
On Mon, Jan 14, 2019 at 7:11 AM Daniel Gryniewicz  wrote:
>
> Hi.  Welcome to the community.
>
> On 01/14/2019 07:56 AM, David C wrote:
> > Hi All
> >
> > I've been playing around with the nfs-ganesha 2.7 exporting a cephfs
> > filesystem, it seems to be working pretty well so far. A few questions:
> >
> > 1) The docs say " For each NFS-Ganesha export, FSAL_CEPH uses a
> > libcephfs client,..." [1]. For arguments sake, if I have ten top level
> > dirs in my Cephfs namespace, is there any value in creating a separate
> > export for each directory? Will that potentially give me better
> > performance than a single export of the entire namespace?
>
> I don't believe there are any advantages from the Ceph side.  From the
> Ganesha side, you configure permissions, client ACLs, squashing, and so
> on on a per-export basis, so you'll need different exports if you need
> different settings for each top level directory.  If they can all use
> the same settings, one export is probably better.

There may be performance impact (good or bad) with having separate
exports for CephFS. Each export instantiates a separate instance of
the CephFS client which has its own bookkeeping and set of
capabilities issued by the MDS. Also, each client instance has a
separate big lock (potentially a big deal for performance). If the
data for each export is disjoint (no hard links or shared inodes) and
the NFS server is expected to have a lot of load, breaking out the
exports can have a positive impact on performance. If there are hard
links, then the clients associated with the exports will potentially
fight over capabilities, which will add to request latency.
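
For illustration, two disjoint top-level directories exported separately
would look roughly like this (paths, IDs and the cephx user are
placeholders; each EXPORT block gets its own libcephfs client instance):

    EXPORT {
        Export_Id = 1;
        Path = /dir1;
        Pseudo = /dir1;
        Access_Type = RW;
        FSAL { Name = CEPH; User_Id = "ganesha"; }
    }

    EXPORT {
        Export_Id = 2;
        Path = /dir2;
        Pseudo = /dir2;
        Access_Type = RW;
        FSAL { Name = CEPH; User_Id = "ganesha"; }
    }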

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommendations for sharing a file system to a heterogeneous client network?

2019-01-15 Thread Ketil Froyn
Robert,

Thanks, this is really interesting. Do you also have any details on how a
solution like this performs? I've been reading a thread about samba/cephfs
performance, and the stats aren't great - especially when creating/deleting
many files - but being a rookie, I'm not 100% clear on the hardware
differences being benchmarked in the mentioned test.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026841.html

Regards, Ketil

On Tue, Jan 15, 2019, 16:38 Robert Sander wrote:
> Hi Ketil,
>
> use Samba/CIFS with multiple gateway machines clustered with CTDB.
> CephFS can be mounted with Posix ACL support.
>
> Slides from my last Ceph day talk are available here:
>
> https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-unlimited-fileserver-with-samba-ctdb-and-cephfs
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 93818 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore device’s device selector for Samsung NVMe

2019-01-15 Thread Vitaliy Filippov

Try lspci -vs <PCI address of the NVMe device> and look for

`Capabilities: [148] Device Serial Number 00-02-c9-03-00-4f-68-7e`

in the output

--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommendations for sharing a file system to a heterogeneous client network?

2019-01-15 Thread Maged Mokhtar


Hi Ketil,

I have not tested creation/deletion, but the read/write performance was 
much better than in the link you posted. Using a CTDB setup based on 
Robert's presentation, we were getting 800 MB/s write throughput at 
queue depth 1 and 2.2 GB/s at queue depth 32 from a single CTDB/Samba 
gateway. For the QD=32 test we used two Windows clients against the same 
gateway (to avoid limitations on the Windows side). Tests were done 
using the Microsoft diskspd tool at 4M blocks with caching off. The 
gateway had 2x40G NICs: one for the Windows network and the other for 
the CephFS client, each doing 20 Gbps (50% utilization); the CPU was 24 
cores running at 85% utilization, taken by the smbd process. We used 
Ubuntu 16.04 CTDB/Samba with a SUSE SLE15 kernel for the kernel client. 
Ceph was Luminous 12.2.7.
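
(For reference, a diskspd run of that shape -- 4M blocks, QD=32, caching
off -- looks roughly like the line below; the duration, file size, thread
count and path are illustrative rather than the exact job we used:)

    diskspd.exe -b4M -o32 -t1 -w100 -Sh -d60 -c64G X:\testfile.dat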


Maged


On 15/01/2019 22:04, Ketil Froyn wrote:

Robert,

Thanks, this is really interesting. Do you also have any details on 
how a solution like this performs? I've been reading a thread about 
samba/cephfs performance, and the stats aren't great - especially when 
creating/deleting many files - but being a rookie, I'm not 100% clear 
on the hardware differences being benchmarked in the mentioned test.


http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026841.html

Regards, Ketil

On Tue, Jan 15, 2019, 16:38 Robert Sander wrote:


Hi Ketil,

use Samba/CIFS with multiple gateway machines clustered with CTDB.
CephFS can be mounted with Posix ACL support.

Slides from my last Ceph day talk are available here:

https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-unlimited-fileserver-with-samba-ctdb-and-cephfs

Regards
-- 
Robert Sander

Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Maged Mokhtar
CEO PetaSAN
4 Emad El Deen Kamel
Cairo 11371, Egypt
www.petasan.org
+201006979931
skype: maged.mokhtar

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client instability

2019-01-15 Thread Kjetil Joergensen
Hi,

you could try reducing "osd map message max"; some of the code paths that end
up as -EIO (kernel: libceph: mon1 *** io error) are hit when a message exceeds
include/linux/ceph/libceph.h:CEPH_MSG_MAX_{FRONT,MIDDLE,DATA}_LEN.

This "worked for us" - YMMV.

-KJ

On Tue, Jan 15, 2019 at 6:14 AM Andras Pataki 
wrote:

> An update on our cephfs kernel client troubles.  After doing some heavier
> testing with a newer kernel 4.19.13, it seems like it also gets into a bad
> state when it can't connect to monitors (all back end processes are on
> 12.2.8):
>
> Jan 15 08:49:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> established
> Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
> Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> lost, hunting for new mon
> Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> Jan 15 08:49:06 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> ... repeating forever ...
>
> # uname -r
> 4.19.13
>
> and on the mon node (10.128.150.10) at log level 20, I see that it is
> building/encoding a lot of maps (10.128.36.18 is the client in question):
> 2019-01-15 08:49:01.355017 7fffee40c700 10 mon.cephmon00@0(leader) e40
> _ms_dispatch new session 0x62dc6c00 MonSession(client.36800361
> 10.128.36.18:0/2911716500 is open , features 0x27018fb86aa42ada (jewel))
> features 0x27018fb86aa42ada
> 2019-01-15 08:49:01.355021 7fffee40c700 20 mon.cephmon00@0(leader) e40
> caps
> 2019-01-15 08:49:01.355026 7fffee40c700 10 mon.cephmon00@0(leader).auth
> v58457 preprocess_query auth(proto 0 34 bytes epoch 0) from client.36800361
> 10.128.36.18:0/2911716500
> 
> 2019-01-15 08:49:01.355817 7fffee40c700 10 mon.cephmon00@0(leader).osd
> e1254390 check_osdmap_sub 0x65373340 next 1254102 (onetime)
> 2019-01-15 08:49:01.355819 7fffee40c700  5 mon.cephmon00@0(leader).osd
> e1254390 send_incremental [1254102..1254390] to client.36800361
> 10.128.36.18:0/2911716500
> 2019-01-15 08:49:01.355821 7fffee40c700 10 mon.cephmon00@0(leader).osd
> e1254390 build_incremental [1254102..1254141] with features 27018fb86aa42ada
> 2019-01-15 08:49:01.364859 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254141 with features 504412504116439552
> 2019-01-15 08:49:01.372131 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incrementalinc 1254141 1237271 bytes
> 2019-01-15 08:49:01.372180 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254140 with features 504412504116439552
> 2019-01-15 08:49:01.372187 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incrementalinc 1254140 260 bytes
> 2019-01-15 08:49:01.380981 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254139 with features 504412504116439552
> 2019-01-15 08:49:01.387983 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incrementalinc 1254139 1237351 bytes
> 2019-01-15 08:49:01.388043 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254138 with features 504412504116439552
> 2019-01-15 08:49:01.388049 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incrementalinc 1254138 232 bytes
> 2019-01-15 08:49:01.396781 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254137 with features 504412504116439552
>  ... a lot more of similar messages
> 2019-01-15 08:49:04.210936 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254382 with features 504412504116439552
> 2019-01-15 08:49:04.211032 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incrementalinc 1254382 232 bytes
> 2019-01-15 08:49:04.211066 7fffee40c700 10 mon.cephmon00@0(leader) e40
> ms_handle_reset 0x6450f800 10.128.36.18:0/2911716500
> 2019-01-15 08:49:04.211070 7fffee40c700 10 mon.cephmon00@0(leader) e40
> reset/close on session client.36800361 10.128.36.18:0/2911716500
> 2019-01-15 08:49:04.211073 7fffee40c700 10 mon.cephmon00@0(leader) e40
> remove_session 0x62dc6c00 client.36800361 10.128.36.18:0/2911716500
> features 0x27018fb86aa42ada
>
> looks like the client disconnects (either for waiting too long, or for
> some protocol error?).  Any hints on why so many maps need to be reencoded
> (to jewel), or how to improve this behavior would be much appreciated.  We

[ceph-users] Why does "df" on a cephfs not report same free space as "rados df" ?

2019-01-15 Thread David Young
Hi folks,

My ceph cluster is used exclusively for cephfs, as follows:

---
root@node1:~# grep ceph /etc/fstab
node2:6789:/ /ceph ceph 
auto,_netdev,name=admin,secretfile=/root/ceph.admin.secret
root@node1:~#
---

"rados df" shows me the following:

---
root@node1:~# rados df
POOL_NAME        USED     OBJECTS   CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD       WR_OPS     WR
cephfs_metadata  197 MiB  49066     0       98132      0                   0        0         9934744    55 GiB   57244243   232 GiB
media            196 TiB  51768595  0       258842975  0                   1        203534    477915206  509 TiB  165167618  292 TiB

total_objects    51817661
total_used       266 TiB
total_avail      135 TiB
total_space      400 TiB
root@node1:~#
---

But "df" on the mounted cephfs volume shows me:

---
root@node1:~# df -h /ceph
Filesystem  Size  Used Avail Use% Mounted on
10.20.30.22:6789:/  207T  196T   11T  95% /ceph
root@node1:~#
---

And ceph -s shows me:

---
  data:
pools:   2 pools, 1028 pgs
objects: 51.82 M objects, 196 TiB
usage:   266 TiB used, 135 TiB / 400 TiB avail
---

"media" is an EC pool with size of 5 (4+1), so I can expect 1TB of data to 
consume 1.25TB raw space.
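
As a quick sanity check of that 1.25x expectation against the numbers above
(ignoring the tiny replicated metadata pool):

    196 TiB data x 5/4           ≈ 245 TiB expected raw usage
    266 TiB total_used - 245 TiB ≈  21 TiB not covered by the simple 4+1 estimate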

My question is, why does "df" show me I have 11TB free, when "rados df" shows 
me I have 135TB (raw) available?

Thanks!
D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Call For Papers coordination pad

2019-01-15 Thread Kai Wagner
Hi all,

just a friendly reminder to use this pad for CfP coordination.

Right now it seems like I'm the only one who submitted something to
Cephalocon and I can't believe that ;-)

https://pad.ceph.com/p/cfp-coordination

Thanks,

Kai

On 5/31/18 1:17 AM, Gregory Farnum wrote:
> Short version: https://pad.ceph.com/p/cfp-coordination is a space for
> you to share talks you've submitted to conferences, if you want to let
> other Ceph community members know what to look for and avoid
> duplicating topics.
>
> Longer version: I and a teammate almost duplicated a talk topic (for
> the upcoming https://mountpoint.io — check it out!) and realized there
> was no established way for us to coordinate this. Other people have
> pointed out similar problems in the past. So, by the power vested in
> me by the power of doing things and having Sage say "that's a good
> idea", I created https://pad.ceph.com/p/cfp-coordination. Use that
> space to coordinate. I've provided a template for conferences around
> talk ideas and actual submissions, but please feel free to jot down
> other notes around those, add new conferences you know about (even if
> you aren't submitting a talk yourself), and generally use that
> etherpad as a community resource.
>
> I'll try to keep it up-to-date as conferences age out, but obviously
> it's only helpful if people actually put stuff there. So go forth and
> write, dear community! :)
> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com