Re: [ceph-users] limited disk slots - should I run OS on SD card ?

2018-08-15 Thread Vladimir Prokofev
I'm running a small Ceph cluster of 9 OSD nodes, with the systems hosted on USB
sticks for exactly the same reason - not enough disk slots. It has worked fine for
almost 2 years now.

2018-08-15 1:13 GMT+03:00 Paul Emmerich :

> I've seen the OS running on SATA DOMs and cheap USB sticks.
> It works well for some time, and then it just falls apart.
>
> Paul
>
> 2018-08-14 9:12 GMT+02:00 Burkhard Linke
> :
> > Hi,
> >
> >
> > AFAIK SD cards (and SATA DOMs) do not have any kind of wear-leveling
> > support. Even if the crappy write endurance of these storage systems would
> > be enough to operate a server for several years on average, you will always
> > have some hot spots with higher than usual write activity. This is the case
> > for filesystem journals (xfs, ext4, almost all modern filesystems). Been
> > there, done that, had two storage systems failing due to SD wear.
> >
> >
> > The only sane setup for SD cards and DOMs is a flash-aware filesystem such
> > as f2fs. Unfortunately most Linux distributions do not support these in
> > their standard installers.
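
(A minimal sketch, for anyone who does go the f2fs route anyway: it assumes
f2fs-tools is installed and that the SD card shows up as /dev/mmcblk0; the
device names and mount point are illustrative, not from this thread.)

    mkfs.f2fs -l cephos /dev/mmcblk0p2        # hypothetical OS data partition
    mount -t f2fs -o noatime /dev/mmcblk0p2 /var/lib/ceph
    echo '/dev/mmcblk0p2 /var/lib/ceph f2fs defaults,noatime 0 0' >> /etc/fstab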
> >
> >
> > Short answer: no, do not use SD cards.
> >
> >
> > Regards,
> >
> > Burkhard
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I run OS on SD card ?

2018-08-15 Thread Wido den Hollander



On 08/14/2018 09:12 AM, Burkhard Linke wrote:
> Hi,
> 
> 
> AFAIK SD cards (and SATA DOMs) do not have any kind of wear-leveling
> support. Even if the crappy write endurance of these storage systems
> would be enough to operate a server for several years on average, you
> will always have some hot spots with higher than usual write activity.
> This is the case for filesystem journals (xfs, ext4, almost all modern
> filesystems). Been there, done that, had two storage systems failing due
> to SD wear
> 

I've been running the OS on the SuperMicro 64 and 128GB SATA-DOMs for a
while now and they work fine.

I disable Ceph's OSD logging though for performance reasons, but it also
saves writes.
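
(For reference, a minimal sketch of what turning down OSD logging can look
like in ceph.conf; the exact subsystems and levels here are an assumption,
not something spelled out in this thread.)

    [osd]
    debug osd = 0/0
    debug bluestore = 0/0
    debug rocksdb = 0/0
    debug ms = 0/0
    # optionally drop the file log entirely:
    # log file = /dev/null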

They work just fine.

Wido

> 
> The only sane setup for SD cards and DOMs is a flash-aware filesystem
> like f2fs. Unfortunately most Linux distributions do not support these
> in their standard installers.
> 
> 
> Short answer: no, do not use SD cards.
> 
> 
> Regards,
> 
> Burkhard
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I run OS on SD card ?

2018-08-15 Thread Janne Johansson
Den ons 15 aug. 2018 kl 10:04 skrev Wido den Hollander :

> > This is the case for filesystem journals (xfs, ext4, almost all modern
> > filesystems). Been there, done that, had two storage systems failing due
> > to SD wear
> >
>
> I've been running OS on the SuperMicro 64 and 128GB SATA-DOMs for a
> while now and work fine.
>
> I disable Ceph's OSD logging though for performance reasons, but it also
> saves writes.
>
> They work just fine.
>

We had OS on DOMs and ETOOMANY of them failed for us to be comfortable with
them, so we moved away from that.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Enable daemonperf - no stats selected by filters

2018-08-15 Thread Marc Roos
 

This is working again after I upgraded CentOS 7. Isn't the idea of the 
12.2.x releases that they are 'somewhat' compatible with the running OS?
Add some rpm dependency, or even better, make sure it just works with the 
'older' rpms.

[@c01 ~]# ceph daemonperf mds.a

---mds --mds_cache--- ---mds_log -mds_mem- --mds_server-- mds_ -objecter-- purg
req  rlat fwd  inos caps exi  imi |stry recy recd|subm evts segs|ino  dn  |hcr  hcs  hsr |sess|actv rd   wr   rdwr|purg|
  0    0    0  4.6M 2.5k   0    0 | 76k   0    0 |  0  104k 129 |2.0M 4.6M|  0    0    0 | 11 |  0    0    0    0 |  0
  0    0    0  4.6M 2.5k   0    0 | 76k   0    0 |  0  104k 129 |2.0M 4.6M|  0    0    0 | 11 |  0    0    0    0 |  0
  0    0    0  4.6M 2.5k   0    0 | 76k   0    0 |  0  104k 129 |2.0M 4.6M|  0    0    0 | 11 |  0    0    0    0 |  0



-Original Message-
From: Marc Roos 
Sent: Tuesday, 31 July 2018 9:24
To: jspray
Cc: ceph-users
Subject: Re: [ceph-users] Enable daemonperf - no stats selected by 
filters

 
Luminous 12.2.7

[@c01 ~]# rpm -qa | grep ceph-
ceph-mon-12.2.7-0.el7.x86_64
ceph-selinux-12.2.7-0.el7.x86_64
ceph-osd-12.2.7-0.el7.x86_64
ceph-mgr-12.2.7-0.el7.x86_64
ceph-12.2.7-0.el7.x86_64
ceph-common-12.2.7-0.el7.x86_64
ceph-mds-12.2.7-0.el7.x86_64
ceph-radosgw-12.2.7-0.el7.x86_64
ceph-base-12.2.7-0.el7.x86_64

-Original Message-
From: John Spray [mailto:jsp...@redhat.com]
Sent: Tuesday, 31 July 2018 0:35
To: Marc Roos
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Enable daemonperf - no stats selected by 
filters

On Mon, Jul 30, 2018 at 10:27 PM Marc Roos 
wrote:
>
>
> Do you need to enable the option daemonperf?

This looks strange, it's supposed to have sensible defaults -- what 
version are you on?

John

> [@c01 ~]# ceph daemonperf mds.a
> Traceback (most recent call last):
>   File "/usr/bin/ceph", line 1122, in 
> retval = main()
>   File "/usr/bin/ceph", line 822, in main
> done, ret = maybe_daemon_command(parsed_args, childargs)
>   File "/usr/bin/ceph", line 686, in maybe_daemon_command
> return True, daemonperf(childargs, sockpath)
>   File "/usr/bin/ceph", line 776, in daemonperf
> watcher.run(interval, count)
>   File "/usr/lib/python2.7/site-packages/ceph_daemon.py", line 362, in

> run
> self._load_schema()
>   File "/usr/lib/python2.7/site-packages/ceph_daemon.py", line 350, in

> _load_schema
> raise RuntimeError("no stats selected by filters")
> RuntimeError: no stats selected by filters
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help needed for debugging slow_requests

2018-08-15 Thread Konstantin Shalygin

Now here's the thing:

Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm 
getting slow requests that
cause blocked IO inside the VMs that are running on the cluster (but not 
necessarily on the host
with the OSD causing the slow request).

If I boot back into 4.13 then Ceph runs smoothly again.



This is PTI, I think.  Try to add "noibrs noibpb nopti nospectre_v2" to 
kernel cmdline and reboot.
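
(On a Debian-based Proxmox host that would look roughly like the following;
a sketch, and note that these flags turn off the Meltdown/Spectre
mitigations:)

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet noibrs noibpb nopti nospectre_v2"

    update-grub
    reboot
    cat /proc/cmdline    # verify the flags are active after boot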




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Clock skew

2018-08-15 Thread Dominque Roux
Hi all,

We have recently been facing clock skew warnings from time to time.
This means that sometimes everything is fine, but hours later the warning
appears again.

NTPD is running and configured with the same pool.

Has someone else already had the same issue and could perhaps help us
to fix this?

Thanks a lot!

Dominique
-- 

Your Swiss, Open Source and IPv6 Virtual Machine. Now on
www.datacenterlight.ch



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Segmentation fault in Ceph-mon

2018-08-15 Thread Arif A.
Dear all,

I am facing a problem deploying ceph-mon (segmentation fault). I am
deploying Ceph on a single-board Raspberry Pi 3 with Hyperiot Debian 8.0 Jessie
as the OS. I downloaded the Ceph packages from the following repository:
deb http://mirrordirector.raspbian.org/raspbian/ testing main contrib
non-free rpi


ii  ceph 10.2.5-7.2+rpi1 armhf
  distributed storage and file system
ii  ceph-base10.2.5-7.2+rpi1 armhf
  common ceph daemon libraries and management tools
ii  ceph-common  10.2.5-7.2+rpi1 armhf
  common utilities to mount and interact with a ceph storage cluster
ii  ceph-deploy  2.0.0+dfsg1-1   all
  Ceph cluster deployment and configuration over ssh
ii  ceph-mds 10.2.5-7.2+rpi1 armhf
  metadata server for the ceph distributed file system
ii  ceph-mon 10.2.5-7.2+rpi1 armhf
  monitor server for the ceph storage system
ii  ceph-osd 10.2.5-7.2+rpi1 armhf
  OSD server for the ceph storage system
ii  libcephfs1   10.2.5-7.2+rpi1 armhf
  Ceph distributed file system client library
ii  python-cephfs10.2.5-7.2+rpi1 armhf
  Python libraries for the Ceph libcephfs library


The packages installed successfully. The problem is that when I deploy the
ceph-mon service on the node, it raises a segmentation fault. I think that
because of this the admin socket was not created successfully.
I don't know if somebody has faced a similar problem. Could you please
help me fix this bug?

The log file of the mon and ceph-deploy is given below:

mon.log
Aug 14 16:11:22 rpi3-4 systemd[1]: Started Ceph cluster key creator task.
Aug 14 16:11:22 rpi3-4 systemd[1]: Started Ceph cluster monitor daemon.
Aug 14 16:11:23 rpi3-4 ceph-mon[31611]: *** Caught signal (Segmentation
fault) **
Aug 14 16:11:23 rpi3-4 ceph-mon[31611]:  in thread 7561cb80
thread_name:admin_socket
Aug 14 16:11:23 rpi3-4 ceph-create-keys[31609]: admin_socket: exception
getting command descriptions: exception: [Errno 104] Connection reset by
peer
Aug 14 16:11:23 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Main process
exited, code=killed, status=11/SEGV
Aug 14 16:11:23 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Failed with
result 'signal'.
Aug 14 16:11:23 rpi3-4 ceph-create-keys[31609]:
INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
Aug 14 16:11:23 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Service
RestartSec=100ms expired, scheduling restart.
Aug 14 16:11:23 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Scheduled
restart job, restart counter is at 1.
Aug 14 16:11:23 rpi3-4 systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 14 16:11:23 rpi3-4 systemd[1]: Started Ceph cluster monitor daemon.
Aug 14 16:11:25 rpi3-4 ceph-mon[31624]: *** Caught signal (Segmentation
fault) **
Aug 14 16:11:25 rpi3-4 ceph-mon[31624]:  in thread 75620b80
thread_name:admin_socket
Aug 14 16:11:25 rpi3-4 ceph-create-keys[31609]: admin_socket: exception
getting command descriptions: exception: [Errno 104] Connection reset by
peer
Aug 14 16:11:25 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Main process
exited, code=killed, status=11/SEGV
Aug 14 16:11:25 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Failed with
result 'signal'.
Aug 14 16:11:25 rpi3-4 ceph-create-keys[31609]:
INFO:ceph-create-keys:ceph-mon admin socket not ready yet.
Aug 14 16:11:25 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Service
RestartSec=100ms expired, scheduling restart.
Aug 14 16:11:25 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Scheduled
restart job, restart counter is at 2.
Aug 14 16:11:25 rpi3-4 systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 14 16:11:25 rpi3-4 systemd[1]: Started Ceph cluster monitor daemon.
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]: *** Caught signal (Segmentation
fault) **
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]:  in thread 75673b80
thread_name:admin_socket
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]:  ceph version 10.2.5
(c461ee19ecbc0c5c330aca20f7392c9a00730367)
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]:  1: (()+0x4b1348) [0x5502d348]
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]:  2: (__default_sa_restorer()+0)
[0x768d7bc0]
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]:  3: (AdminSocket::do_accept()+0x28)
[0x55149154]
Aug 14 16:11:25 rpi3-4 ceph-mon[31640]:  4: (AdminSocket::entry()+0x22c)
[0x5514b458]
Aug 14 16:11:26 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Main process
exited, code=killed, status=11/SEGV
Aug 14 16:11:26 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Failed with
result 'signal'.
Aug 14 16:11:26 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Service
RestartSec=100ms expired, scheduling restart.
Aug 14 16:11:26 rpi3-4 systemd[1]: ceph-mon@rpi3-4.service: Scheduled
restart job, restart counter is at 3.
Aug 14 16:11:26 rpi3-

Re: [ceph-users] Ceph logging into graylog

2018-08-15 Thread Roman Steinhart
Hi,

thanks for your reply.
May I ask which type of input do you use in graylog?
"GELF UDP" or another one?
And which version of graylog/ceph do you use?

Thanks,
Roman

On Aug 9 2018, at 7:47 pm, Rudenko Aleksandr  wrote:
>
> Hi,
>
> All our settings for this:
>
> mon cluster log to graylog = true
> mon cluster log to graylog host = {graylog-server-hostname}
>
>
>
>
> > On 9 Aug 2018, at 19:33, Roman Steinhart  > (mailto:ro...@aternos.org)> wrote:
> > Hi all,
> > I'm trying to set up ceph logging into graylog.
> > For that I've set the following options in ceph.conf:
> > log_to_graylog = true
> > err_to_graylog = true
> > log_to_graylog_host = graylog.service.consul
> > log_to_graylog_port = 12201
> > mon_cluster_log_to_graylog = true
> > mon_cluster_log_to_graylog_host = graylog.service.consul
> > mon_cluster_log_to_graylog_port = 12201
> > clog_to_graylog = true
> > clog_to_graylog_host = graylog.service.consul
> > clog_to_graylog_port = 12201
> >
> > According to the graylog server.log file it looks like ceph accepted these 
> > config options and sends log messages to graylog. However, graylog is not 
> > able to process these messages because of this error: 
> > https://paste.steinh.art/jezerobevu.apache
> > It says: "has empty mandatory "host" field."
> > How can I advice ceph to fill this host field?
> > Or is it because of a version incompatibility between ceph and graylog?
> >
> > We're using ceph 12.2.7 and graylog 2.4.6+ceaa7e4
> > Maybe someone of you was already able to get graylog working and is able to 
> > help me with this problem?
> > Kind regards,
> > Roman
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph upgrade Jewel to Luminous

2018-08-15 Thread Jaime Ibar

Hi Tom,

thanks for the info.

That's what I thought, but I asked just in case, as breaking the entire
cluster would be very bad news.

Thanks again.

Jaime


On 14/08/18 20:18, Thomas White wrote:


Hi Jaime,

Upgrading directly should not be a problem. It is usually recommended 
to go to the latest minor release before upgrading major versions, but 
my own migration from 10.2.10 to 12.2.5 went seamlessly and I can't 
see any technical limitation which would hinder or prevent this 
process.
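
(For what it's worth, the usual order of operations for such an upgrade,
as a rough sketch rather than a recipe from this thread: mons first, then
OSDs, then MDS/RGW.)

    ceph osd set noout                      # avoid rebalancing during restarts
    # upgrade packages and restart monitors one by one, then the OSD hosts
    systemctl restart ceph-mon.target
    systemctl restart ceph-osd.target
    ceph versions                           # confirm everything reports 12.2.x
    ceph osd require-osd-release luminous   # once all OSDs run Luminous
    ceph osd unset noout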


Kind Regards,

Tom

*From:*ceph-users  *On Behalf Of 
*Jaime Ibar

*Sent:* 14 August 2018 10:00
*To:* ceph-users@lists.ceph.com
*Subject:* [ceph-users] Ceph upgrade Jewel to Luminous

Hi all,

we're running Ceph Jewel 10.2.10 in our cluster and we plan to upgrade 
to the latest Luminous release (12.2.7). Jewel 10.2.11 was released one 
month ago and our plan was to upgrade to that release first and then 
upgrade to Luminous, but as someone reported OSD crashes after upgrading 
to Jewel 10.2.11, we wonder if it would be possible to skip that Jewel 
release and upgrade directly to Luminous 12.2.7.

Thanks

Jaime


Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie  | 
ja...@tchpc.tcd.ie 

Tel: +353-1-896-3725



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I run OS on SD card ?

2018-08-15 Thread Steven Vacaroaia
Thank you all

Since all the concerns were about reliability, I am assuming the performance impact
of having the OS running on an SD card is minimal / negligible.

In other words, an OSD server is not writing/reading from the Linux OS
partitions too much (especially with logs at a minimum),
so its performance is not dependent on what type of disk the OS resides on.

If I am wrong, please let me know :-))

Thanks


On Wed, 15 Aug 2018 at 04:13, Janne Johansson  wrote:

>
> Den ons 15 aug. 2018 kl 10:04 skrev Wido den Hollander :
>
>> > This is the case for filesystem journals (xfs, ext4, almost all modern
>> > filesystems). Been there, done that, had two storage systems failing due
>> > to SD wear
>> >
>>
>> I've been running OS on the SuperMicro 64 and 128GB SATA-DOMs for a
>> while now and work fine.
>>
>> I disable Ceph's OSD logging though for performance reasons, but it also
>> saves writes.
>>
>> They work just fine.
>>
>
> We had OS on DOMs and ETOOMANY of them failed for us to be comfortable with
> them, so we moved away from that.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] upgraded centos7 (not collectd nor ceph) now json failed error

2018-08-15 Thread Marc Roos



I upgraded centos7, not ceph nor collectd. Ceph was already 12.2.7 and 
collectd was already 5.8.0-2 (and collectd-ceph-5.8.0-2)

Now I have this error:

Aug 14 22:43:34 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:43:34 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:43:34 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:43:34 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:43:34 c01 collectd[285425]: ceph plugin: 
cconn_handle_event(name=mds.a,i=9,st=4): error 1
Aug 14 22:43:34 c01 collectd[285425]: ceph plugin: 
cconn_handle_event(name=mds.a,i=9,st=4): error 1
Aug 14 22:43:44 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:43:44 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:43:44 c01 collectd[285425]: ceph plugin: 
cconn_handle_event(name=mds.a,i=9,st=4): error 1
Aug 14 22:43:44 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:43:44 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:43:44 c01 collectd[285425]: ceph plugin: 
cconn_handle_event(name=mds.a,i=9,st=4): error 1
Aug 14 22:43:54 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:43:54 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:43:54 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:43:54 c01 collectd[285425]: ceph plugin: 
cconn_handle_event(name=mds.a,i=10,st=4): error 1
Aug 14 22:43:54 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:43:54 c01 collectd[285425]: ceph plugin: 
cconn_handle_event(name=mds.a,i=10,st=4): error 1
Aug 14 22:44:04 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initialized.
Aug 14 22:44:04 c01 collectd[285425]: ceph plugin: JSON handler failed 
with status -1.
Aug 14 22:44:04 c01 collectd[285425]: ceph plugin: ds 
FinisherPurgeQueue.queueLen was not properly initial

Anyone having the same? What dependency can cause this? It is annoying 
and filling up log files.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I run OS on SD card ?

2018-08-15 Thread Götz Reinicke
Hi,

> Am 15.08.2018 um 15:11 schrieb Steven Vacaroaia :
> 
> Thank you all 
> 
> Since all concerns were about reliability I am assuming  performance impact 
> of having OS running on SD card is minimal / negligible 

Some time ago we had some Cisco blades booting VMware ESXi from SD cards and 
had no issues for months … till after an update the blade was rebooted and the SD 
failed … and then another one on another server … From my POV at that time the 
"server" SDs were not close to as reliable as SSDs or rotating disks. My 
experiences are from some years ago.

> 
> In other words, an OSD server is not writing/reading from Linux OS partitions 
> too much ( especially with logs at minimum )
> so its performance is not dependent on what type of disk  OS resides  on 

Regarding performance: what kind of SDs are supported? You can get some "SDXC 
| UHS-II | U3 | Class 10 | V90" cards which can handle up to 260 MBytes/sec, like 
the "Angelbird Matchpack EVA1". OK, they are Panasonic 4K camera certified (and we 
currently use them to record 4K video).

https://www.angelbird.com/prod/match-pack-for-panasonic-eva1-1836/

My2cents . Götz




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs fuse versus kernel performance

2018-08-15 Thread Chad William Seys

Hi all,
  Anyone know of benchmarks of cephfs through fuse versus kernel?

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clock skew

2018-08-15 Thread Brent Kennedy
For clock skew, I set up NTPD on one of the monitors with a public time server 
to pull from.  Then I set up NTPD on all the other servers with them pulling time 
only from the local monitor server.  Restart the time service on each server until 
they get relatively close.  If you have a time server already in place, 
that would work as well.  Make sure to eliminate the backup time server entry 
as well.
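
(A minimal ntp.conf sketch of that layout; the hostnames are placeholders,
not from this thread:)

    # on the monitor acting as the local time source:
    server 0.pool.ntp.org iburst

    # on every other node, point only at that monitor:
    server mon1.example.local iburst prefer
    # and remove/comment any other "server" or "pool" lines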

If this is already in place, then what is usually needed is a restart of the 
monitor service on the monitor complaining of clock skew.  If any monitors are 
virtualized, make sure the time is not syncing from the host server to the VM, as 
this could be causing the skew as well.

-Brent

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Dominque Roux
Sent: Wednesday, August 15, 2018 5:38 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Clock skew

Hi all,

We recently facing clock skews from time to time.
This means that sometimes everything is fine but hours later the warning 
appears again.

NTPD is running and configured with the same pool.

Did someone else already had the same issue and could probably help us to fix 
this?

Thanks a lot!

Dominique
-- 

Your Swiss, Open Source and IPv6 Virtual Machine. Now on www.datacenterlight.ch


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
Hi list people. I was asking a few of these questions in IRC, too, but
figured maybe a wider audience could see something that I'm missing.

I'm running a four-node cluster with cephfs and the kernel-mode driver as
the primary access method. Each node has 72 * 10TB OSDs, for a total of 288
OSDs. Each system has about 256GB of memory. These systems are dedicated to
the ceph service and run no other workloads. The cluster is configured with
every machine participating as MON, MGR, and MDS. Data is stored in replica
2 mode. The data are files of varying lengths, but most <5MB. Files are
named with their SHA256 hash, and are divided into subdirectories based on
the first few octets (example: files/a1/a13f/a13f25). The current set
of files occupies about 100TB (200TB accounting for replication).

Early this week, we started seeing some network issues that were causing
OSDs to become unavailable for short periods of time. It was long enough to
get logged by syslog, but not long enough to trigger a persistent warning
or error state in ceph status. Conditions continued to degrade until we
encountered two of the four nodes falling off of the network, and OSDs
tried to start migrating en masse. After the network stabilized a short
while later, the OSDs were all shown as online and OK, and ceph seemed to
recover cleanly, and stopped trying to migrate data. In the process of
trying to get the network stable, though, the two nodes that had fallen off
the network had to be rebooted.

When all four nodes were back online and talking to each other, I noticed
that MDS was in "up: rejoin", and after a period of time, it would eat all
of the available memory and swap on whatever system was primary. It would
eventually either get killed-off by the system due to memory usage, or it
got so slow that the monitors would drop it and pick another MDS as
primary. This cycle would repeat.

I added more swap to one system (160GB of swap total), and brought down the
MDS service on the other three nodes, forcing the rejoin operations to
occur on the node with added swap. I also turned up debugging to see what
it was actually doing. This was then allowed to run for about 14 hours
overnight. When I arrived this morning, the system was still up, but
severely lagged. Nearly all swap had been used, and the system had
difficulty responding to commands. Out of options, I killed the process,
and then watched as it tried to shut down cleanly. I was hoping to preserve
as much of the work that it did as possible. I restarted it, and it seemed
to do more in replay, and then reentered the rejoin, which is still running
and giving no hints of finishing anytime soon.

The rejoin traffic I'm seeing in the MDS log looks like this:

2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.ino(0x10108aa)
verify_diri_backtrace
2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x10108aa)
_fetched header 274 bytes 2323 keys for [dir 0x10108aa
/files-by-sha256/1c/1cc4/ [2,head] auth v=0 cv=0/0 ap=1+0+0
state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1
0x561738166a00]
2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x10108aa)
_fetched version 59738838
2018-08-15 11:39:21.726 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1
0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:21.727 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1
0x56171c6a3400) have_past_parents_open [1,head]
2018-08-15 11:39:21.898 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon
up:rejoin seq 377 rtt 1.400594
2018-08-15 11:39:24.564 7f9c752a5700 10 mds.beacon.ta-g17 _send up:rejoin
seq 378
2018-08-15 11:39:25.503 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon
up:rejoin seq 378 rtt 0.907796
2018-08-15 11:39:26.565 7f9c7229f700 10 mds.0.cache.dir(0x10108aa)
auth_unpin by 0x561738166a00 on [dir 0x10108aa
/files-by-sha256/1c/1cc4/ [2,head] auth v=59738838 cv=59738838/59738838
state=1073741825|complete f(v0 m2018-08-14 07:52:06.764154 2323=2323+0)
n(v0 rc2018-08-14 07:52:06.764154 b3161079403 2323=2323+0) hs=2323+0,ss=0+0
| child=1 waiter=1 authpin=0 0x561738166a00] count now 0 + 0
2018-08-15 11:39:26.706 7f9c73aa2700  7 mds.0.13676 mds has 1 queued
contexts
2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676 0x5617cd27a790
2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676  finish 0x5617cd27a790
2018-08-15 11:39:26.723 7f9c7229f700 10 MDSIOContextBase::complete:
21C_IO_Dir_OMAP_Fetched
2018-08-15 11:39:26.723 7f9c7229f700 10 mds.0.cache.ino(0x10020f7)
verify_diri_backtrace
2018-08-15 11:39:26.738 7f9c7229f700 10 mds.0.cache.dir(0x10020f7)
_fetched header 274 bytes 1899 keys for [dir 0x10020f7
/files-by-sha256/a7/a723/ [2,head] auth v=0 cv=0/0 ap=1+0+0
state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1
0x5617351cbc00]
2018-08-15 11:39:26.792 7f9c7229f700 10 mds.0.cache.dir(0x10020f7)
_fetched version 59752211
2018-08-15 11:39:26.792 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1
0x5617

Re: [ceph-users] BlueStore wal vs. db size

2018-08-15 Thread Robert Stanford
 Thank you Wido.  I don't want to make any assumptions so let me verify,
that's 10GB of DB per 1TB storage on that OSD alone, right?  So if I have 4
OSDs sharing the same SSD journal, each 1TB, there are 4 10 GB DB
partitions for each?

On Wed, Aug 15, 2018 at 1:59 AM, Wido den Hollander  wrote:

>
>
> On 08/15/2018 04:17 AM, Robert Stanford wrote:
> > I am keeping the wal and db for a ceph cluster on an SSD.  I am using
> > the masif_bluestore_block_db_size / masif_bluestore_block_wal_size
> > parameters in ceph.conf to specify how big they should be.  Should these
> > values be the same, or should one be much larger than the other?
> >
>
> This has been answered multiple times on this mailinglist in the last
> months, a bit of searching would have helped.
>
> Nevertheless, 1GB for the WAL is sufficient and then allocate about 10GB
> of DB per TB of storage. That should be enough in most use cases.
>
> Now, if you can spare more DB space, do so!
>
> Wido
>
> >  R
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore wal vs. db size

2018-08-15 Thread Wido den Hollander


On 08/15/2018 05:57 PM, Robert Stanford wrote:
> 
>  Thank you Wido.  I don't want to make any assumptions so let me verify,
> that's 10GB of DB per 1TB storage on that OSD alone, right?  So if I
> have 4 OSDs sharing the same SSD journal, each 1TB, there are 4 10 GB DB
> partitions for each?
> 

Yes, that is correct.

Each OSD needs 10GB of DB per 1TB of storage. So size your SSD according to
your storage needs.
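
(As a sketch, that rule of thumb works out to roughly a 40GB DB partition for
a 4TB spinner; the device names below are illustrative, not from this thread.)

    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1    # ~40GB partition for a 4TB data disk
    # the WAL lives on the DB device unless --block.wal points elsewhere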

However, it depends on the workload if you need to offload WAL+DB to a
SSD. What is the workload?

Wido

> On Wed, Aug 15, 2018 at 1:59 AM, Wido den Hollander  > wrote:
> 
> 
> 
> On 08/15/2018 04:17 AM, Robert Stanford wrote:
> > I am keeping the wal and db for a ceph cluster on an SSD.  I am using
> > the masif_bluestore_block_db_size / masif_bluestore_block_wal_size
> > parameters in ceph.conf to specify how big they should be.  Should these
> > values be the same, or should one be much larger than the other?
> > 
> 
> This has been answered multiple times on this mailinglist in the last
> months, a bit of searching would have helped.
> 
> Nevertheless, 1GB for the WAL is sufficient and then allocate about 10GB
> of DB per TB of storage. That should be enough in most use cases.
> 
> Now, if you can spare more DB space, do so!
> 
> Wido
> 
> >  R
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> >
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore wal vs. db size

2018-08-15 Thread Robert Stanford
 The workload is relatively high read/write of objects through radosgw.
Gbps+ in both directions.  The OSDs are spinning disks, the journals (up
until now filestore) are on SSDs.  Four OSDs / journal disk.

On Wed, Aug 15, 2018 at 10:58 AM, Wido den Hollander  wrote:

>
>
> On 08/15/2018 05:57 PM, Robert Stanford wrote:
> >
> >  Thank you Wido.  I don't want to make any assumptions so let me verify,
> > that's 10GB of DB per 1TB storage on that OSD alone, right?  So if I
> > have 4 OSDs sharing the same SSD journal, each 1TB, there are 4 10 GB DB
> > partitions for each?
> >
>
> Yes, that is correct.
>
> Each OSD needs 10GB/1TB of storage of DB. So size your SSD according to
> your storage needs.
>
> However, it depends on the workload if you need to offload WAL+DB to a
> SSD. What is the workload?
>
> Wido
>
> > On Wed, Aug 15, 2018 at 1:59 AM, Wido den Hollander  > > wrote:
> >
> >
> >
> > On 08/15/2018 04:17 AM, Robert Stanford wrote:
> > > I am keeping the wal and db for a ceph cluster on an SSD.  I am
> using
> > > the masif_bluestore_block_db_size / masif_bluestore_block_wal_size
> > > parameters in ceph.conf to specify how big they should be.  Should
> these
> > > values be the same, or should one be much larger than the other?
> > >
> >
> > This has been answered multiple times on this mailinglist in the last
> > months, a bit of searching would have helped.
> >
> > Nevertheless, 1GB for the WAL is sufficient and then allocate about
> 10GB
> > of DB per TB of storage. That should be enough in most use cases.
> >
> > Now, if you can spare more DB space, do so!
> >
> > Wido
> >
> > >  R
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> > >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore wal vs. db size

2018-08-15 Thread Wido den Hollander


On 08/15/2018 06:15 PM, Robert Stanford wrote:
> 
>  The workload is relatively high read/write of objects through radosgw. 
> Gbps+ in both directions.  The OSDs are spinning disks, the journals (up
> until now filestore) are on SSDs.  Four OSDs / journal disk.
> 

RGW isn't always a heavy enough workload for this. It depends on your
choice. I've deployed many RGW-only workloads without WAL+DB and it
works fine.

RBD is a perfect use-case which needs very low (<10ms) write latency and
that's not always the case with RGW.

Just having the WAL on a SSD device can also help.

Keep in mind that the 'journal' doesn't apply anymore with BlueStore.
That was a FileStore thing.

Wido

> On Wed, Aug 15, 2018 at 10:58 AM, Wido den Hollander  > wrote:
> 
> 
> 
> On 08/15/2018 05:57 PM, Robert Stanford wrote:
> > 
> >  Thank you Wido.  I don't want to make any assumptions so let me verify,
> > that's 10GB of DB per 1TB storage on that OSD alone, right?  So if I
> > have 4 OSDs sharing the same SSD journal, each 1TB, there are 4 10 GB DB
> > partitions for each?
> > 
> 
> Yes, that is correct.
> 
> Each OSD needs 10GB/1TB of storage of DB. So size your SSD according to
> your storage needs.
> 
> However, it depends on the workload if you need to offload WAL+DB to a
> SSD. What is the workload?
> 
> Wido
> 
> > On Wed, Aug 15, 2018 at 1:59 AM, Wido den Hollander  
> > >> wrote:
> > 
> > 
> > 
> >     On 08/15/2018 04:17 AM, Robert Stanford wrote:
> >     > I am keeping the wal and db for a ceph cluster on an SSD.  I am 
> using
> >     > the masif_bluestore_block_db_size / masif_bluestore_block_wal_size
> >     > parameters in ceph.conf to specify how big they should be.  
> Should these
> >     > values be the same, or should one be much larger than the other?
> >     > 
> > 
> >     This has been answered multiple times on this mailinglist in the 
> last
> >     months, a bit of searching would have helped.
> > 
> >     Nevertheless, 1GB for the WAL is sufficient and then allocate about 
> 10GB
> >     of DB per TB of storage. That should be enough in most use cases.
> > 
> >     Now, if you can spare more DB space, do so!
> > 
> >     Wido
> > 
> >     >  R
> >     >
> >     >
> >     > ___
> >     > ceph-users mailing list
> >     > ceph-users@lists.ceph.com 
> >
> >     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> >      >
> >     >
> >
> >
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FreeBSD rc.d script: sta.rt not found

2018-08-15 Thread Norman Gray



Greetings.

I'm having difficulty starting up the ceph monitor on FreeBSD.  The
rc.d/ceph script appears to be doing something ... odd.

I'm following the instructions on
.
I've configured a monitor called mon.pochhammer

When I try to start the service with

# service ceph start

I get an error

/usr/local/bin/init-ceph: sta.rt not found
(/usr/local/etc/ceph/ceph.conf defines mon.pochhammer, /var/lib/ceph
defines )

This appears to be because, in ceph_common.sh's get_name_list(), $orig
is 'start' and allconf ends up as ' mon.pochhammer'.  In that function,
the value of $orig is then worked through word-by-word, whereupon
'start' is split into 'sta' and 'rt', which fails to match a test a few
lines later.

Calling 'service ceph start' results in /usr/local/bin/ceph-init being
called with arguments 'start start', and calling 'service ceph start
start mon.pochhammer' (as the above instructions recommend) results in
'run_rc_command start start start mon.pochhammer'.  Is the ceph-init
script perhaps missing a 'shift' at some point before the sourcing of
ceph_common.sh?

Incidentally, that's a rather unexpected call to the rc.d script -- I
would have expected just 'service ceph start' as above.  The latter call
does seem to extract the correct mon.pochhammer monitor name from the
correct config file, even if the presence of the word 'start' does then
confuse it.

This is FreeBSD 11.2, and ceph-conf version 12.2.7, built from the
FreeBSD ports tree.

Best wishes,

Norman


--
Norman Gray  :  http://www.astro.gla.ac.uk/users/norman/it/
SUPA School of Physics and Astronomy, University of Glasgow, UK
Charity number SC004401

[University of Glasgow: The Times Scottish University of the Year 2018]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clock skew

2018-08-15 Thread Sean Crosby
Hi Dominique,

The clock skew warning shows up when your NTP daemon is not synced.

You can see the sync in the output of ntpq -p

This is a synced NTP

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.unimelb.edu 210.9.192.50     2 u   24   64   17    0.496   -6.421   0.181
*ntp2.unimelb.ed 202.6.131.118    2 u   26   64   17    0.613  -11.998   0.250
 ntp41.frosteri. .INIT.          16 u    -   64    0    0.000    0.000   0.000
 dns01.ntl02.pri .INIT.          16 u    -   64    0    0.000    0.000   0.000
 cosima.470n.act .INIT.          16 u    -   64    0    0.000    0.000   0.000
 x.ns.gin.ntt.ne .INIT.          16 u    -   64    0    0.000    0.000   0.000

The *'s show that there is a sync with a NTP server. When you start or
restart ntp, it takes a while for a sync to occur

Here's immediately after restarting the ntp daemon

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.unimelb.edu 210.9.192.50     2 u    -   64    1    0.496   -6.421   0.000
 ntp2.unimelb.ed 202.6.131.118    2 u    -   64    1    0.474  -11.678   0.000
 ntp41.frosteri. .INIT.          16 u    -   64    0    0.000    0.000   0.000
 dns01.ntl02.pri .INIT.          16 u    -   64    0    0.000    0.000   0.000
 cosima.470n.act .INIT.          16 u    -   64    0    0.000    0.000   0.000
 x.ns.gin.ntt.ne .INIT.          16 u    -   64    0    0.000    0.000   0.000

Make sure that nothing is regularly restarting ntpd. For us, we had puppet
and dhcp regularly fight over the contents of ntp.conf, and it caused a
restart of ntpd.
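
(A quick way to check for that on a systemd box; a sketch, adjust the unit
name, ntp vs. ntpd, to your distro:)

    journalctl -u ntpd --since "2 days ago" | grep -iE 'start|exit'
    stat -c '%y %n' /etc/ntp.conf    # has something rewritten the config?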

Sean


On Wed, 15 Aug 2018 at 19:37, Dominque Roux 
wrote:

> Hi all,
>
> We recently facing clock skews from time to time.
> This means that sometimes everything is fine but hours later the warning
> appears again.
>
> NTPD is running and configured with the same pool.
>
> Did someone else already had the same issue and could probably help us
> to fix this?
>
> Thanks a lot!
>
> Dominique
> --
>
> Your Swiss, Open Source and IPv6 Virtual Machine. Now on
> www.datacenterlight.ch
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD rc.d script: sta.rt not found

2018-08-15 Thread Willem Jan Withagen

On 15/08/2018 19:46, Norman Gray wrote:


Greetings.

I'm having difficulty starting up the ceph monitor on FreeBSD.  The
rc.d/ceph script appears to be doing something ... odd.

I'm following the instructions on
.
I've configured a monitor called mon.pochhammer

When I try to start the service with

     # service ceph start

I get an error

     /usr/local/bin/init-ceph: sta.rt not found
(/usr/local/etc/ceph/ceph.conf defines mon.pochhammer, /var/lib/ceph
defines )

This appears to be because, in ceph_common.sh's get_name_list(), $orig
is 'start' and allconf ends up as ' mon.pochhammer'.  In that function,
the value of $orig is then worked through word-by-word, whereupon
'start' is split into 'sta' and 'rt', which fails to match a test a few
lines later.

Calling 'service ceph start' results in /usr/local/bin/ceph-init being
called with arguments 'start start', and calling 'service ceph start
start mon.pochhammer' (as the above instructions recommend) results in
'run_rc_command start start start mon.pochhammer'.  Is the ceph-init
script perhaps missing a 'shift' at some point before the sourcing of
ceph_common.sh?

Incidentally, that's a rather unexpected call to the rc.d script -- I
would have expected just 'service ceph start' as above.  The latter call
does seem to extract the correct mon.pochhammer monitor name from the
correct config file, even if the presence of the word 'start' does then
confuse it.

This is FreeBSD 11.2, and ceph-conf version 12.2.7, built from the
FreeBSD ports tree.


This is an error in the /usr/local/etc/rc.d/ceph file.

The last line should look like:
run_rc_command "$1"

The double set of commands is confusing init-ceph.

Init-ceph or rc.d/ceph should be rewritten, but I just have not yet 
gotten to that. Also because in the near future ceph-disk goes away and the 
config starts looking different/less important. And I have not yet 
decided how to fit the parts together.


--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Wed, Aug 15, 2018 at 11:44 PM Jonathan Woytek  wrote:
>
> Hi list people. I was asking a few of these questions in IRC, too, but 
> figured maybe a wider audience could see something that I'm missing.
>
> I'm running a four-node cluster with cephfs and the kernel-mode driver as the 
> primary access method. Each node has 72 * 10TB OSDs, for a total of 288 OSDs. 
> Each system has about 256GB of memory. These systems are dedicated to the 
> ceph service and run no other workloads. The cluster is configured with every 
> machine participating as MON, MGR, and MDS. Data is stored in replica 2 mode. 
> The data are files of varying lengths, but most <5MB. Files are named with 
> their SHA256 hash, and are divided into subdirectories based on the first few 
> octets (example: files/a1/a13f/a13f25). The current set of files occupies 
> about 100TB (200TB accounting for replication).
>
> Early this week, we started seeing some network issues that were causing OSDs 
> to become unavailable for short periods of time. It was long enough to get 
> logged by syslog, but not long enough to trigger a persistent warning or 
> error state in ceph status. Conditions continued to degrade until we 
> encountered two of the four nodes falling off of the network, and OSDs tried 
> to start migrating en masse. After the network stabilized a short while 
> later, the OSDs were all shown as online and OK, and ceph seemed to recover 
> cleanly, and stopped trying to migrate data. In the process of trying to get 
> the network stable, though, the two nodes that had fallen off the network had 
> to be rebooted.
>
> When all four nodes were back online and talking to each other, I noticed 
> that MDS was in "up: rejoin", and after a period of time, it would eat all of 
> the available memory and swap on whatever system was primary. It would 
> eventually either get killed-off by the system due to memory usage, or it got 
> so slow that the monitors would drop it and pick another MDS as primary. This 
> cycle would repeat.
>
> I added more swap to one system (160GB of swap total), and brought down the 
> MDS service on the other three nodes, forcing the rejoin operations to occur 
> on the node with added swap. I also turned up debugging to see what it was 
> actually doing. This was then allowed to run for about 14 hours overnight. 
> When I arrived this morning, the system was still up, but severly lagged. 
> Nearly all swap had been used, and the system had difficulty responding to 
> commands. Out of options, I killed the process, and then watched as it tried 
> to shut down cleanly. I was hoping to preserve as much of the work that it 
> did as possible. I restarted it, and it seemed to do more in replay, and then 
> reentered the rejoin, which is still running and giving no hints of finishing 
> anytime soon.
>
> The rejoin traffic I'm seeing in the MDS log looks like this:
>
> 2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.ino(0x10108aa) 
> verify_diri_backtrace
> 2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x10108aa) 
> _fetched header 274 bytes 2323 keys for [dir 0x10108aa 
> /files-by-sha256/1c/1cc4/ [2,head] auth v=0 cv=0/0 ap=1+0+0 
> state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 
> 0x561738166a00]
> 2018-08-15 11:39:21.726 7f9c7229f700 10 mds.0.cache.dir(0x10108aa) 
> _fetched version 59738838
> 2018-08-15 11:39:21.726 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 
> 0x56171c6a3400) have_past_parents_open [1,head]
> 2018-08-15 11:39:21.727 7f9c7229f700 10  mds.0.cache.snaprealm(0x1 seq 1 
> 0x56171c6a3400) have_past_parents_open [1,head]
> 2018-08-15 11:39:21.898 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon 
> up:rejoin seq 377 rtt 1.400594
> 2018-08-15 11:39:24.564 7f9c752a5700 10 mds.beacon.ta-g17 _send up:rejoin seq 
> 378
> 2018-08-15 11:39:25.503 7f9c78534700 10 mds.beacon.ta-g17 handle_mds_beacon 
> up:rejoin seq 378 rtt 0.907796
> 2018-08-15 11:39:26.565 7f9c7229f700 10 mds.0.cache.dir(0x10108aa) 
> auth_unpin by 0x561738166a00 on [dir 0x10108aa /files-by-sha256/1c/1cc4/ 
> [2,head] auth v=59738838 cv=59738838/59738838 state=1073741825|complete f(v0 
> m2018-08-14 07:52:06.764154 2323=2323+0) n(v0 rc2018-08-14 07:52:06.764154 
> b3161079403 2323=2323+0) hs=2323+0,ss=0+0 | child=1 waiter=1 authpin=0 
> 0x561738166a00] count now 0 + 0
> 2018-08-15 11:39:26.706 7f9c73aa2700  7 mds.0.13676 mds has 1 queued contexts
> 2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676 0x5617cd27a790
> 2018-08-15 11:39:26.706 7f9c73aa2700 10 mds.0.13676  finish 0x5617cd27a790
> 2018-08-15 11:39:26.723 7f9c7229f700 10 MDSIOContextBase::complete: 
> 21C_IO_Dir_OMAP_Fetched
> 2018-08-15 11:39:26.723 7f9c7229f700 10 mds.0.cache.ino(0x10020f7) 
> verify_diri_backtrace
> 2018-08-15 11:39:26.738 7f9c7229f700 10 mds.0.cache.dir(0x10020f7) 
> _fetched header 274 bytes 1899 keys for [dir 0x10020f7 
> /files-by-sha256/a7/a723/ [2,head] auth v=0 cv=0/0 ap=1

Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng  wrote:
> How many clients reconnected when the mds restarted?  The issue is likely
> because reconnected clients held too many inodes; the mds was opening
> these inodes in the rejoin state.  Try starting the mds with the option
> mds_wipe_sessions = true. The option makes the mds ignore old clients
> during recovery.  You need to unset the option and remount the clients
> after the mds becomes 'active'.
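
(For reference, a sketch of how that option is typically set and later
removed; the section placement is an assumption, it was not specified in the
suggestion above.)

    # ceph.conf on the MDS node, while recovering:
    [mds]
    mds wipe sessions = true

    # after the MDS reaches 'active': remove the line, restart the MDS,
    # and remount the cephfs clients
    systemctl restart ceph-mds.target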


Thank you for the suggestion! I set that in the global section of
ceph.conf on the node where I am starting ceph-mds. After setting it
and starting ceph-mds, I'm not seeing markedly different behavior.
After flying through replay and then flying through a bunch of the
messages posted earlier, it begins to eat up memory again and slows
down, still outputting the log messages as in the original post.
Looking in the ceph-mds...log, I'm not seeing any reference to 'wipe',
so I'm not sure if it is being honored. Am I putting that in the right
place?

jonathan
-- 
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
Actually, I missed it--I do see the wipe start, wipe done in the log.
However, it is still doing verify_diri_backtrace, as described
previously.

jonathan

On Wed, Aug 15, 2018 at 10:42 PM, Jonathan Woytek  wrote:
> On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng  wrote:
>> How many client reconnected when mds restarts?  The issue is likely
>> because reconnected clients held two many inodes, mds was opening
>> these inodes in rejoin state.  Try  starting mds with option
>> mds_wipe_sessions = true. The option makes mds ignore old clients
>> during recovery.  You need to unset the option and  remount clients
>> after mds become 'active'
>
>
> Thank you for the suggestion! I set that in the global section of
> ceph.conf on the node where I am starting ceph-mds. After setting it
> and starting ceph-mds, I'm not seeing markedly different behavior.
> After flying through replay and then flying through a bunch of the
> messages posted earlier, it begins to eat up memory again and slows
> down, still outputting the log messages as in the original post.
> Looking in the ceph-mds...log, I'm not seeing any reference to 'wipe',
> so I'm not sure if it is being honored. Am I putting that in the right
> place?
>
> jonathan
> --
> Jonathan Woytek
> http://www.dryrose.com
> KB3HOZ
> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC



-- 
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek  wrote:
>
> Actually, I missed it--I do see the wipe start, wipe done in the log.
> However, it is still doing verify_diri_backtrace, as described
> previously.
>

which version of mds do you use?

> jonathan
>
> On Wed, Aug 15, 2018 at 10:42 PM, Jonathan Woytek  wrote:
> > On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng  wrote:
> >> How many client reconnected when mds restarts?  The issue is likely
> >> because reconnected clients held two many inodes, mds was opening
> >> these inodes in rejoin state.  Try  starting mds with option
> >> mds_wipe_sessions = true. The option makes mds ignore old clients
> >> during recovery.  You need to unset the option and  remount clients
> >> after mds become 'active'
> >
> >
> > Thank you for the suggestion! I set that in the global section of
> > ceph.conf on the node where I am starting ceph-mds. After setting it
> > and starting ceph-mds, I'm not seeing markedly different behavior.
> > After flying through replay and then flying through a bunch of the
> > messages posted earlier, it begins to eat up memory again and slows
> > down, still outputting the log messages as in the original post.
> > Looking in the ceph-mds...log, I'm not seeing any reference to 'wipe',
> > so I'm not sure if it is being honored. Am I putting that in the right
> > place?
> >
> > jonathan
> > --
> > Jonathan Woytek
> > http://www.dryrose.com
> > KB3HOZ
> > PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
>
>
>
> --
> Jonathan Woytek
> http://www.dryrose.com
> KB3HOZ
> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)


On Wed, Aug 15, 2018 at 10:51 PM, Yan, Zheng  wrote:
> On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek  wrote:
>>
>> Actually, I missed it--I do see the wipe start, wipe done in the log.
>> However, it is still doing verify_diri_backtrace, as described
>> previously.
>>
>
> which version of mds do you use?
>
>> jonathan
>>
>> On Wed, Aug 15, 2018 at 10:42 PM, Jonathan Woytek  wrote:
>> > On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng  wrote:
>> >> How many client reconnected when mds restarts?  The issue is likely
>> >> because reconnected clients held two many inodes, mds was opening
>> >> these inodes in rejoin state.  Try  starting mds with option
>> >> mds_wipe_sessions = true. The option makes mds ignore old clients
>> >> during recovery.  You need to unset the option and  remount clients
>> >> after mds become 'active'
>> >
>> >
>> > Thank you for the suggestion! I set that in the global section of
>> > ceph.conf on the node where I am starting ceph-mds. After setting it
>> > and starting ceph-mds, I'm not seeing markedly different behavior.
>> > After flying through replay and then flying through a bunch of the
>> > messages posted earlier, it begins to eat up memory again and slows
>> > down, still outputting the log messages as in the original post.
>> > Looking in the ceph-mds...log, I'm not seeing any reference to 'wipe',
>> > so I'm not sure if it is being honored. Am I putting that in the right
>> > place?
>> >
>> > jonathan
>> > --
>> > Jonathan Woytek
>> > http://www.dryrose.com
>> > KB3HOZ
>> > PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
>>
>>
>>
>> --
>> Jonathan Woytek
>> http://www.dryrose.com
>> KB3HOZ
>> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC



-- 
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Yan, Zheng
On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek  wrote:
>
> ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
>
>

Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have
multiple active MDSs) from the metadata pool of your filesystem. The records
in these files are open-file hints. It's safe to delete them.
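
(Concretely, something along these lines; the pool name here is an
assumption, check yours with 'ceph fs ls':)

    ceph fs ls                                   # find the metadata pool name
    rados -p cephfs_metadata ls | grep openfiles # list the hint objects
    rados -p cephfs_metadata rm mds0_openfiles.0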

> On Wed, Aug 15, 2018 at 10:51 PM, Yan, Zheng  wrote:
> > On Thu, Aug 16, 2018 at 10:50 AM Jonathan Woytek  wrote:
> >>
> >> Actually, I missed it--I do see the wipe start, wipe done in the log.
> >> However, it is still doing verify_diri_backtrace, as described
> >> previously.
> >>
> >
> > which version of mds do you use?
> >
> >> jonathan
> >>
> >> On Wed, Aug 15, 2018 at 10:42 PM, Jonathan Woytek  
> >> wrote:
> >> > On Wed, Aug 15, 2018 at 9:40 PM, Yan, Zheng  wrote:
> >> >> How many client reconnected when mds restarts?  The issue is likely
> >> >> because reconnected clients held two many inodes, mds was opening
> >> >> these inodes in rejoin state.  Try  starting mds with option
> >> >> mds_wipe_sessions = true. The option makes mds ignore old clients
> >> >> during recovery.  You need to unset the option and  remount clients
> >> >> after mds become 'active'
> >> >
> >> >
> >> > Thank you for the suggestion! I set that in the global section of
> >> > ceph.conf on the node where I am starting ceph-mds. After setting it
> >> > and starting ceph-mds, I'm not seeing markedly different behavior.
> >> > After flying through replay and then flying through a bunch of the
> >> > messages posted earlier, it begins to eat up memory again and slows
> >> > down, still outputting the log messages as in the original post.
> >> > Looking in the ceph-mds...log, I'm not seeing any reference to 'wipe',
> >> > so I'm not sure if it is being honored. Am I putting that in the right
> >> > place?
> >> >
> >> > jonathan
> >> > --
> >> > Jonathan Woytek
> >> > http://www.dryrose.com
> >> > KB3HOZ
> >> > PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
> >>
> >>
> >>
> >> --
> >> Jonathan Woytek
> >> http://www.dryrose.com
> >> KB3HOZ
> >> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
>
>
>
> --
> Jonathan Woytek
> http://www.dryrose.com
> KB3HOZ
> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

2018-08-15 Thread Jonathan Woytek
On Wed, Aug 15, 2018 at 11:02 PM Yan, Zheng  wrote:

> On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek 
> wrote:
> >
> > ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic
> (stable)
> >
> >
>
> Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have
> multiple active mds)  from metadata pool of your filesystem. Records
> in these files are open files hints. It's safe to delete them.


I will try that in the morning. I had to bail for the night here (UTC-4).
Thank you!

Jonathan

> --
Sent from my Commodore64
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD journal feature

2018-08-15 Thread Glen Baars
Is there any workaround that you can think of to correctly enable journaling on 
locked images?
Kind regards,
Glen Baars

From: ceph-users  On Behalf Of Glen Baars
Sent: Tuesday, 14 August 2018 9:36 PM
To: dilla...@redhat.com
Cc: ceph-users 
Subject: Re: [ceph-users] RBD journal feature

Hello Jason,

Thanks for your help. Here is the output you asked for also.

https://pastebin.com/dKH6mpwk
Kind regards,
Glen Baars

From: Jason Dillaman mailto:jdill...@redhat.com>>
Sent: Tuesday, 14 August 2018 9:33 PM
To: Glen Baars mailto:g...@onsitecomputers.com.au>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] RBD journal feature

On Tue, Aug 14, 2018 at 9:31 AM Glen Baars 
mailto:g...@onsitecomputers.com.au>> wrote:
Hello Jason,

I have now narrowed it down.

If the image has an exclusive lock – the journal doesn’t go on the correct pool.

OK, that makes sense. If you have an active client on the image holding the 
lock, the request to enable journaling is sent over to that client but it's 
missing all the journal options. I'll open a tracker ticket to fix the issue.

Thanks.

Kind regards,
Glen Baars

From: Jason Dillaman mailto:jdill...@redhat.com>>
Sent: Tuesday, 14 August 2018 9:29 PM
To: Glen Baars mailto:g...@onsitecomputers.com.au>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] RBD journal feature


On Tue, Aug 14, 2018 at 9:19 AM Glen Baars 
mailto:g...@onsitecomputers.com.au>> wrote:
Hello Jason,

I have tried with and without ‘rbd journal pool = rbd’ in ceph.conf. It 
doesn’t seem to make a difference.

It should be SSDPOOL, but regardless, I am at a loss as to why it's not working 
for you. You can try appending "--debug-rbd=20" to the end of the "rbd feature 
enable" command and provide the generated logs in a pastebin link.

Also, here is the output:

rbd image-meta list RBD-HDD/2ef34a96-27e0-4ae7-9888-fd33c38f657a
There are 0 metadata on this image.
Kind regards,
Glen Baars

From: Jason Dillaman mailto:jdill...@redhat.com>>
Sent: Tuesday, 14 August 2018 9:00 PM
To: Glen Baars mailto:g...@onsitecomputers.com.au>>
Cc: dillaman mailto:dilla...@redhat.com>>; ceph-users 
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] RBD journal feature

I tried w/ a rbd CLI from 12.2.7 and I still don't have an issue enabling 
journaling on a different pool:

$ rbd info rbd/foo
rbd image 'foo':
   size 1024 MB in 256 objects
   order 22 (4096 kB objects)
   block_name_prefix: rbd_data.101e6b8b4567
   format: 2
   features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten
   flags:
   create_timestamp: Tue Aug 14 08:51:19 2018
$ rbd feature enable rbd/foo journaling --journal-pool rbd_ssd
$ rbd journal info --pool rbd --image foo
rbd journal '101e6b8b4567':
   header_oid: journal.101e6b8b4567
   object_oid_prefix: journal_data.1.101e6b8b4567.
   order: 24 (16384 kB objects)
   splay_width: 4
   object_pool: rbd_ssd

Can you please run "rbd image-meta list " to see if you are 
overwriting any configuration settings? Do you have any client configuration 
overrides in your "/etc/ceph/ceph.conf"?

On Tue, Aug 14, 2018 at 8:25 AM Glen Baars 
mailto:g...@onsitecomputers.com.au>> wrote:
Hello Jason,

I will also complete testing of a few combinations tomorrow to try and isolate 
the issue now that we can get it to work with a new image.

The cluster started out at 12.2.3 bluestore so there shouldn’t be any old 
issues from previous versions.
Kind regards,
Glen Baars

From: Jason Dillaman mailto:jdill...@redhat.com>>
Sent: Tuesday, 14 August 2018 7:43 PM
To: Glen Baars mailto:g...@onsitecomputers.com.au>>
Cc: dillaman mailto:dilla...@redhat.com>>; ceph-users 
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] RBD journal feature

On Tue, Aug 14, 2018 at 4:08 AM Glen Baars 
mailto:g...@onsitecomputers.com.au>> wrote:
Hello Jason,

I can confirm that your tests work on our cluster with a newly created image.

We still can’t get the current images to use a different object pool. Do you 
think that maybe another feature is incompatible with this feature? Below is a 
log of the issue.

I wouldn't think so. I used master branch for my testing but I'll try 12.2.7 
just in case it's an issue that's only in the luminous release.

:~# rbd info RBD_HDD/2ef34a96-27e0-4ae7-9888-fd33c38f657a
rbd image '2ef34a96-27e0-4ae7-9888-fd33c38f657a':
size 51200 MB in 12800 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.37c8974b0dc51
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
create_timestamp: Sat May  5 11:39:07 2018

:~# rbd journal info --pool RBD_HDD --image 2ef34a96-27e0-4ae7-9888-fd33c38f657a
rbd: journaling is not enabled for image 2ef34a96-27e0-4ae7-9888-fd33c38f657a

:~# rbd feature enable