Re: [ceph-users] Enable daemonperf - no stats selected by filters

2018-07-31 Thread Marc Roos
 
Luminous 12.2.7

[@c01 ~]# rpm -qa | grep ceph-
ceph-mon-12.2.7-0.el7.x86_64
ceph-selinux-12.2.7-0.el7.x86_64
ceph-osd-12.2.7-0.el7.x86_64
ceph-mgr-12.2.7-0.el7.x86_64
ceph-12.2.7-0.el7.x86_64
ceph-common-12.2.7-0.el7.x86_64
ceph-mds-12.2.7-0.el7.x86_64
ceph-radosgw-12.2.7-0.el7.x86_64
ceph-base-12.2.7-0.el7.x86_64

-Original Message-
From: John Spray [mailto:jsp...@redhat.com] 
Sent: Tuesday, 31 July 2018 0:35
To: Marc Roos
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Enable daemonperf - no stats selected by 
filters

On Mon, Jul 30, 2018 at 10:27 PM Marc Roos  
wrote:
>
>
> Do you need to enable the option daemonperf?

This looks strange, it's supposed to have sensible defaults -- what 
version are you on?

John

> [@c01 ~]# ceph daemonperf mds.a
> Traceback (most recent call last):
>   File "/usr/bin/ceph", line 1122, in 
> retval = main()
>   File "/usr/bin/ceph", line 822, in main
> done, ret = maybe_daemon_command(parsed_args, childargs)
>   File "/usr/bin/ceph", line 686, in maybe_daemon_command
> return True, daemonperf(childargs, sockpath)
>   File "/usr/bin/ceph", line 776, in daemonperf
> watcher.run(interval, count)
>   File "/usr/lib/python2.7/site-packages/ceph_daemon.py", line 362, in 

> run
> self._load_schema()
>   File "/usr/lib/python2.7/site-packages/ceph_daemon.py", line 350, in 

> _load_schema
> raise RuntimeError("no stats selected by filters")
> RuntimeError: no stats selected by filters
>
>
>


Re: [ceph-users] Self shutdown of 1 whole system: Oops, it did it again (not yet anymore)

2018-07-31 Thread Nicolas Huillard
Hi all,

The latest hint I received (thanks!) was to replace the failing hardware.
Before that, I updated the BIOS, which included a CPU microcode fix for
Meltdown/Spectre and probably other things. Last time I had checked, the
vendor didn't have that fix yet.
Since this update, no CATERR has happened... This Intel microcode + vendor
BIOS update may have mitigated the problem, and postpones the hardware
replacement...

Le mardi 24 juillet 2018 à 12:18 +0200, Nicolas Huillard a écrit :
> Hi all,
> 
> The same server did it again with the same CATERR exactly 3 days
> after
> rebooting (+/- 30 seconds).
> If it weren't for the exact +3 days, I would think it's a random
> event.
> But exactly 3 days after reboot does not seem random.
> 
> Nothing I added got me more information (mcelog, pstore, BMC video
> record, etc.)...
> 
> Thanks in advance for any hint ;-)
> 
> Le samedi 21 juillet 2018 à 10:31 +0200, Nicolas Huillard a écrit :
> > Hi all,
> > 
> > One of my server silently shutdown last night, with no explanation
> > whatsoever in any logs. According to the existing logs, the
> > shutdown
> > (without reboot) happened between 03:58:20.061452 (last timestamp
> > from
> > /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (new MON
> > election called, for which oxygene didn't answer).
> > 
> > Is there any way in which Ceph could silently shut down a server?
> > Can SMART self-test influence scrubbing or compaction?
> > 
> > The only thing I have is that smartd stated a long self-test on
> > both
> > OSD spinning drives on that host:
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > starting
> > scheduled Long Self-Test.
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-
> > test in progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-
> > test in progress, 90% remaining
> > Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT],
> > previous
> > self-test completed without error
> > 
> > ...and smartctl now says that the self-tests didn't finish (on both
> > drives) :
> > # 1  Extended offline    Interrupted (host reset)      00%     10636   -
> > 
> > MON logs on oxygene talk about RocksDB compaction a few minutes
> > before the shutdown, and a deep-scrub finished earlier:
> > /var/log/ceph/ceph-osd.6.log
> > 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log
> > [DBG] : 6.1d deep-scrub starts
> > 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log
> > [DBG] : 6.1d deep-scrub ok
> > 2018-07-21 03:43:36.720707 7fd178082700  0 --
> > 172.22.0.16:6801/478362
> > > > 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> > 
> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=1).handle_connect_msg: challenging authorizer
> > 
> > /var/log/ceph/ceph-mgr.oxygene.log
> > 2018-07-21 03:58:16.060137 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:18.060733 7fbcd300  1 mgr send_beacon standby
> > 2018-07-21 03:58:20.061452 7fbcd300  1 mgr send_beacon standby
> > 
> > /var/log/ceph/ceph-mon.oxygene.log
> > 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> > Time 2018/07/21-03:52:27.702302) [/build/ceph-
> > 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default]
> > Manual compaction from level-0 to level-1 from 'mgrstat .. '
> > 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746]
> > Compacting 1@0 + 1@1 files to L1, score -1.00
> > 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction
> > start summary: Base version 1745 Base level 0, inputs:
> > [149507(602KB)], [149505(13MB)]
> > 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> > {"time_micros": 1532137947702334, "job": 1746, "event":
> > "compaction_started", "files_L0": [149507], "files_L1": [149505],
> > "score": -1, "input_data_size": 14916379}
> > 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746]
> > Generated table #149508: 4904 keys, 14808953 bytes
> > 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> > {"time_micros": 1532137947785565, "cf_name": "default", "job":
> > 1746,
> > "event": "table_file_creation", "file_number": 149508, "file_size":
> > 14808953, "table_properties": {"data
> > 2018-07-21 03:52:27.785627 7f25b5406700  4 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1746]
> > Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
> > 2018-07-21 03:52:27.785656 7f25b5406700  3 rocksdb: [/build/ceph-
> > 12.2.7/src/rocksdb/db/version_set.

[ceph-users] Mimic Telegraf plugin on Luminous

2018-07-31 Thread Denny Fuchs

hi,

I'm trying to get the Telegraf plugin from Mimic running on Luminous (Debian 
Stretch). I copied the files from Git into 
/usr/lib/ceph/mgr/telegraf, enabled the plugin and get:



2018-07-31 09:25:46.501858 7f496cfc9700 -1 log_channel(cluster) log 
[ERR] : Unhandled exception from module 'telegraf' while running on 
mgr.fc-r02-ceph-osd-02: (2, 'No such file or directory')

2018-07-31 09:25:46.501872 7f496cfc9700 -1 telegraf.serve:
2018-07-31 09:25:46.501873 7f496cfc9700 -1 Traceback (most recent call 
last):

  File "/usr/lib/ceph/mgr/telegraf/module.py", line 310, in serve
self.send_to_telegraf()
  File "/usr/lib/ceph/mgr/telegraf/module.py", line 255, in 
send_to_telegraf

with sock as s:
  File "/usr/lib/ceph/mgr/telegraf/basesocket.py", line 41, in __enter__
self.connect()
  File "/usr/lib/ceph/mgr/telegraf/basesocket.py", line 29, in connect
return self.sock.connect(self.address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: (2, 'No such file or directory')


I don't know which file the plugin wants or needs ...

Any suggestions?

cu denny


[ceph-users] Whole cluster flapping

2018-07-31 Thread CUZA Frédéric
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool 
that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD); 
we have SSDs for the journals.

After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed 
(root=default,room=,host=) (8 reporters from different host after 
54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 
5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs 
degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 
172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 
5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 138553/5846235 objects degraded

Re: [ceph-users] Mimic Telegraf plugin on Luminous

2018-07-31 Thread Wido den Hollander


On 07/31/2018 09:38 AM, Denny Fuchs wrote:
> hi,
> 
> I try to get the Telegraf plugin from Mimic on Luminous running (Debian
> Stretch). I copied the files from the Git into
> /usr/lib/ceph/mgr/telegraf; enabled the plugin and get:
> 
> 
> 2018-07-31 09:25:46.501858 7f496cfc9700 -1 log_channel(cluster) log
> [ERR] : Unhandled exception from module 'telegraf' while running on
> mgr.fc-r02-ceph-osd-02: (2, 'No such file or directory')
> 2018-07-31 09:25:46.501872 7f496cfc9700 -1 telegraf.serve:
> 2018-07-31 09:25:46.501873 7f496cfc9700 -1 Traceback (most recent call
> last):
>   File "/usr/lib/ceph/mgr/telegraf/module.py", line 310, in serve
>     self.send_to_telegraf()
>   File "/usr/lib/ceph/mgr/telegraf/module.py", line 255, in
> send_to_telegraf
>     with sock as s:
>   File "/usr/lib/ceph/mgr/telegraf/basesocket.py", line 41, in __enter__
>     self.connect()
>   File "/usr/lib/ceph/mgr/telegraf/basesocket.py", line 29, in connect
>     return self.sock.connect(self.address)
>   File "/usr/lib/python2.7/socket.py", line 228, in meth
>     return getattr(self._sock,name)(*args)
> error: (2, 'No such file or directory')
> 
> 
> I don't know, which file the plugin wants, or need ...
> 

You will need to set the address. I recommend that you have Telegraf
listen on UDP localhost and then set the address:

$ ceph telegraf config-set address 
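
For instance, with Telegraf's socket_listener input listening on UDP
localhost, it could look roughly like this (a sketch; the port 8094 and the
exact telegraf.conf stanza are illustrative):

# /etc/telegraf/telegraf.d/ceph-mgr.conf
[[inputs.socket_listener]]
  service_address = "udp://127.0.0.1:8094"
  data_format = "influx"

$ ceph telegraf config-set address udp://localhost:8094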

Wido

> any suggestions ?
> 
> cu denny


Re: [ceph-users] Mimic Telegraf plugin on Luminous

2018-07-31 Thread Denny Fuchs

hi,

found the issue:

the problem was a wrong syntax on the Telegraf side, which meant it did not 
create the socket, and because there was no socket I got the "no such file ..." 
error. I later tried udp/tcp ... but I hadn't read carefully that Telegraf 
itself creates the necessary input :-D


Now it works :-)

cu denny


Re: [ceph-users] Mgr cephx caps to run `ceph fs status`?

2018-07-31 Thread John Spray
On Tue, Jul 31, 2018 at 3:36 AM Linh Vu  wrote:
>
> Hi all,
>
>
> I want a non-admin client to be able to run `ceph fs status`, either via the 
> ceph CLI or a python script. Adding `mgr "allow *"` to this client's cephx 
> caps works, but I'd like to be more specific if possible. I can't find the 
> complete list of mgr cephx caps anywhere, so if you could point me in the 
> right direction, that'd be great!

Both mgr and mon caps have an "allow command" syntax that lets you
restrict users to specific named commands (and even specific
arguments).  Internally, the mgr and the mon use the same code to
interpret capabilities.

I just went looking for the documentation for those mon caps and it
appears not to exist!

Anyway, in your case it's something like this:

mgr "allow command \"fs status\""

I don't think I've ever tested this on a mgr daemon, so let us know
how you get on.
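
Applied to an existing client it would look something like the line below (a
sketch; the client name is a placeholder, and note that "ceph auth caps"
replaces the whole cap set, so restate any mon/mds/osd caps the client
already has):

$ ceph auth caps client.monitoring mon 'allow r' mgr 'allow command "fs status"'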

John



>
> Cheers,
>
> Linh
>


Re: [ceph-users] Intermittent client reconnect delay following node fail

2018-07-31 Thread John Spray
On Tue, Jul 31, 2018 at 12:33 AM William Lawton
 wrote:
>
> Hi.
>
>
>
> We have recently setup our first ceph cluster (4 nodes) but our node failure 
> tests have revealed an intermittent problem. When we take down a node (i.e. 
> by powering it off) most of the time all clients reconnect to the cluster 
> within milliseconds, but occasionally it can take them 30 seconds or more. 
> All clients are Centos7 instances and have the ceph cluster mount point 
> configured in /etc/fstab as follows:

The first thing I'd do is make sure you've got recent client code --
there are backports in RHEL but I'm unclear on how much of that (if
any) makes it into centos.  You may find it simpler to just install a
recent 4.x kernel from ELRepo.  Even if you don't want to use that in
production, it would be useful to try and isolate any CephFS client
issues you're encountering.

John

>
>
>
> 10.18.49.35:6789,10.18.49.204:6789,10.18.49.101:6789,10.18.49.183:6789:/ 
> /mnt/ceph ceph name=admin,secretfile=/etc/ceph_key,noatime,_netdev0   
> 2
>
>
>
> On rare occasions, using the ls command, we can see that a failover has left 
> a client’s /mnt/ceph directory with the following state: “???  ? ?
> ?   ?? ceph”. When this occurs, we think that the client has 
> failed to connect within 45 seconds (the mds_reconnect_timeout period) so the 
> client has been evicted. We can reproduce this circumstance by reducing the 
> mds reconnect timeout down to 1 second.
>
>
>
> We’d like to know why our clients sometimes struggle to reconnect after a 
> cluster node failure and how to prevent this i.e. how can we ensure that all 
> clients consistently reconnect to the cluster quickly following a node 
> failure.
>
>
>
> We are using the default configuration options.
>
>
>
> Ceph Status:
>
>
>
>   cluster:
>
> id: ea2d9095-3deb-4482-bf6c-23229c594da4
>
> health: HEALTH_OK
>
>
>
>   services:
>
> mon: 4 daemons, quorum dub-ceph-01,dub-ceph-03,dub-ceph-04,dub-ceph-02
>
> mgr: dub-ceph-02(active), standbys: dub-ceph-04.ott.local, dub-ceph-01, 
> dub-ceph-03
>
> mds: cephfs-1/1/1 up  {0=dub-ceph-03=up:active}, 3 up:standby
>
> osd: 4 osds: 4 up, 4 in
>
>
>
>   data:
>
> pools:   2 pools, 200 pgs
>
> objects: 2.36 k objects, 8.9 GiB
>
> usage:   31 GiB used, 1.9 TiB / 2.0 TiB avail
>
> pgs: 200 active+clean
>
>
>
> Thanks
>
> William Lawton
>
>
>


[ceph-users] Write operation to cephFS mount hangs

2018-07-31 Thread Bödefeld Sabine
Hello,

 

we have a Ceph Cluster 10.2.10 on VMs with Ubuntu 16.04 using Xen as the
hypervisor. We use CephFS and the clients use ceph-fuse to access the files.

Some of the ceph-fuse clients hang on write operations to the cephFS. On
copying a file to the cephFS, the file is created but it's empty and the
write operation hangs forever. Ceph-fuse version is 10.2.9.

In the logfile of the mds there are no error messages. Also, ceph health
returns HEALTH_OK.

ceph daemon mds.eco61 session ls reports no problems (if I interpret
correctly):

   {
        "id": 64396,
        "num_leases": 2,
        "num_caps": 32,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 1,
        "reconnecting": false,
        "inst": "client.64396 192.168.1.179:0\/980852091",
        "client_metadata": {
            "ceph_sha1": "2ee413f77150c0f375ff6f10edd6c8f9c7d060d0",
            "ceph_version": "ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)",
            "entity_id": "admin",
            "hostname": "eco79",
            "mount_point": "\/mnt\/cephfs",
            "root": "\/"
        }
    },

 

Does anyone have an idea where the problem lies? Any help would be greatly
appreciated.

Thanks very much,

Kind regards

Sabine





Re: [ceph-users] Write operation to cephFS mount hangs

2018-07-31 Thread Eugen Block

Hi,


Some of the ceph-fuse clients hang on write operations to the cephFS.


Do all the clients use the same credentials for authentication? Have  
you tried to mount the filesystem with the same credentials as your  
VMs do and then tried to create files? Has it worked before or is this  
a new approach?
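
If it helps to rule out the credentials, a manual mount with an explicit
identity and keyring looks roughly like this (a sketch; the id, keyring path,
monitor host and mount point are illustrative):

$ ceph-fuse --id admin -k /etc/ceph/ceph.client.admin.keyring -m <mon-host>:6789 /mnt/cephfs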

Anything in the syslog/dmesg of the VMs?

Regards,
Eugen


Quoting Bödefeld Sabine:


Hello,



we have a Ceph Cluster 10.2.10 on VMs with Ubuntu 16.04 using Xen as the
hypervisor. We use CephFS and the clients use ceph-fuse to access the files.

Some of the ceph-fuse clients hang on write operations to the cephFS. On
copying a file to the cephFS, the file is created but it's empty and the
write operation hangs forever. Ceph-fuse version is 10.2.9.

In the logfile of the mds there are no error messages. Also, ceph health
returns HEALTH_OK.

ceph daemon mds.eco61 session ls reports no problems (if I interpret
correctly):

   {

"id": 64396,

"num_leases": 2,

"num_caps": 32,

"state": "open",

"replay_requests": 0,

"completed_requests": 1,

"reconnecting": false,

"inst": "client.64396 192.168.1.179:0\/980852091",

"client_metadata": {

"ceph_sha1": "2ee413f77150c0f375ff6f10edd6c8f9c7d060d0",

"ceph_version": "ceph version 10.2.9
(2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)",

"entity_id": "admin",

"hostname": "eco79",

"mount_point": "\/mnt\/cephfs",

"root": "\/"

}

},



Does anyone have an idea where the problem lies? Any help would be greatly
appreciated.

Thanks very much,

Kind regards

Sabine






Re: [ceph-users] Whole cluster flapping

2018-07-31 Thread Webert de Souza Lima
The pool deletion might have triggered a lot of IO operations on the disks,
and the OSD processes might be too busy to respond to heartbeats, so the mons
mark them as down due to no response.
Also check the OSD logs to see if they are actually crashing and
restarting, and the disk IO usage (e.g. iostat).
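
A quick way to check that could look roughly like this on one of the OSD
nodes (a sketch; the log path assumes a default package install):

$ grep -iE 'heartbeat|suicide|abort' /var/log/ceph/ceph-osd.*.log | tail
$ iostat -x 5          # disk utilisation while the deletion is running
$ ceph osd perf        # per-OSD commit/apply latency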

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric  wrote:

> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update:
> 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
> 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update:
> 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
> 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update:
> 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed
> (root=default,room=,host=) (8 reporters from different host after
> 54.650576 >= grace 54.300663)
>
> 2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update:
> 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs
> degraded, 201 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update:
> 78 slow reques

[ceph-users] CephFS Snapshots in Mimic

2018-07-31 Thread Kenneth Waegeman

Hi all,

I updated an existing Luminous cluster to Mimic 13.2.1. All daemons were 
updated, so I did ceph osd require-osd-release mimic, so everything 
seems up to date.


I want to try the snapshots in Mimic, since this should be stable, so I ran:

[root@osd2801 alleee]# ceph fs set cephfs allow_new_snaps true
enabled new snapshots

Now, when I try to create a snapshot, it is not working:

[root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas
[root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas/.snap
mkdir: cannot create directory ‘/mnt/bla/alleee/aaas/.snap’: File exists

I tried this using ceph-fuse and the kernel client, but always get the 
same response.


Should I enable something else to get snapshots working ?


Thank you!


Kenneth



Re: [ceph-users] Write operation to cephFS mount hangs

2018-07-31 Thread Bödefeld Sabine
Hello Eugen

 

Yes, all of the clients use the same credentials for authentication. I’ve
mounted the cephFS on about 10 VMs and it works on only about 4 of them.

We have used this setup before but on Ubuntu 14.04 with ceph 0.94.1,
ceph-deploy 1.5.35 and ceph-fuse 0.80.11. 

 

In dmesg there is only:

[0.099826] fuse init (API version 7.23)

[0.099895] Key type big_key registered

[0.099926] Allocating IMA MOK and blacklist keyrings.

 

In the syslog there are no errors either, there is just the same message
(fuse init) and the message about starting the client (etc/init.d/fuse ).

 

Kind regards

Sabine

 





Re: [ceph-users] CephFS Snapshots in Mimic

2018-07-31 Thread David Disseldorp
Hi Kenneth,

On Tue, 31 Jul 2018 16:44:36 +0200, Kenneth Waegeman wrote:

> Hi all,
> 
> I updated an existing Luminous cluster to Mimic 13.2.1. All daemons were 
> updated, so I did ceph osd require-osd-release mimic, so everything 
> seems up to date.
> 
> I want to try the snapshots in Mimic, since this should be stable, so i ran:
> 
> [root@osd2801 alleee]# ceph fs set cephfs allow_new_snaps true
> enabled new snapshots
> 
> Now, when I try to create a snapshot, it is not working:
> 
> [root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas
> [root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas/.snap
> mkdir: cannot create directory ‘/mnt/bla/alleee/aaas/.snap’: File exists
> 
> I tried this using ceph-fuse and the kernel client, but always get the 
> same response.
> 
> Should I enable something else to get snapshots working ?

It looks as though you're just missing the snapshot name in the mkdir
request:
rapido1:/# cd /mnt/cephfs
rapido1:/mnt/cephfs# echo "before snap" >> data
rapido1:/mnt/cephfs# mkdir -p .snap/mysnapshot
rapido1:/mnt/cephfs# echo "after snap" >> data
rapido1:/mnt/cephfs# cat .snap/mysnapshot/data
before snap
rapido1:/mnt/cephfs# cat data
before snap
after snap

Cheers, David


Re: [ceph-users] CephFS Snapshots in Mimic

2018-07-31 Thread John Spray
On Tue, Jul 31, 2018 at 3:45 PM Kenneth Waegeman
 wrote:
>
> Hi all,
>
> I updated an existing Luminous cluster to Mimic 13.2.1. All daemons were
> updated, so I did ceph osd require-osd-release mimic, so everything
> seems up to date.
>
> I want to try the snapshots in Mimic, since this should be stable, so i ran:
>
> [root@osd2801 alleee]# ceph fs set cephfs allow_new_snaps true
> enabled new snapshots
>
> Now, when I try to create a snapshot, it is not working:
>
> [root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas
> [root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas/.snap
> mkdir: cannot create directory ‘/mnt/bla/alleee/aaas/.snap’: File exists
>
> I tried this using ceph-fuse and the kernel client, but always get the
> same response.

The .snap directory always exists.  To create a snapshot, you create a
subdirectory of .snap with a name of your choice.
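
With the path from your example and an arbitrary snapshot name, that is
roughly (the name "mysnap" below is just an illustration):

$ mkdir /mnt/bla/alleee/aaas/.snap/mysnap    # create the snapshot
$ ls /mnt/bla/alleee/aaas/.snap              # list snapshots
$ rmdir /mnt/bla/alleee/aaas/.snap/mysnap    # remove it again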

John

> Should I enable something else to get snapshots working ?
>
>
> Thank you!
>
>
> Kenneth
>


[ceph-users] RBD mirroring replicated and erasure coded pools

2018-07-31 Thread Ilja Slepnev
Hi,

Is it possible to establish RBD mirroring between replicated and erasure
coded pools?
I'm trying to set up replication as described at
http://docs.ceph.com/docs/master/rbd/rbd-mirroring/ without success.
Ceph 12.2.5 Luminous

root@local:~# rbd --cluster local mirror pool enable rbd-2 pool
2018-07-31 17:35:57.350506 7fa0833af0c0 -1 librbd::api::Mirror: mode_set:
failed to allocate mirroring uuid: (95) Operation not supported

No problem with replicated pool:
root@local:~# rbd --cluster local mirror pool enable rbd pool
root@local:~#

Pool configuration:
root@local:~# ceph --cluster local osd pool ls detail
pool 13 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 2219 flags hashpspool
stripe_width 0 compression_mode none application rbd
pool 15 'rbd-2' erasure size 6 min_size 5 crush_rule 4 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 2220 flags hashpspool,ec_overwrites
stripe_width 16384 application rbd

BR,
--
Ilja Slepnev


Re: [ceph-users] CephFS Snapshots in Mimic

2018-07-31 Thread Kenneth Waegeman

Thanks David and John,

That sounds logical now. When I read "To make a snapshot on 
directory “/1/2/3/”, the client invokes “mkdir” on “/1/2/3/.snap” 
directory" (http://docs.ceph.com/docs/master/dev/cephfs-snapshots/), it 
didn't occur to me that I should create a subdirectory right away.


Thanks, it works now!

K


On 31/07/18 17:06, John Spray wrote:

On Tue, Jul 31, 2018 at 3:45 PM Kenneth Waegeman
 wrote:

Hi all,

I updated an existing Luminous cluster to Mimic 13.2.1. All daemons were
updated, so I did ceph osd require-osd-release mimic, so everything
seems up to date.

I want to try the snapshots in Mimic, since this should be stable, so i ran:

[root@osd2801 alleee]# ceph fs set cephfs allow_new_snaps true
enabled new snapshots

Now, when I try to create a snapshot, it is not working:

[root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas
[root@osd2801 ~]# mkdir /mnt/bla/alleee/aaas/.snap
mkdir: cannot create directory ‘/mnt/bla/alleee/aaas/.snap’: File exists

I tried this using ceph-fuse and the kernel client, but always get the
same response.

The .snap directory always exists.  To create a snapshot, you create
subdirectory of .snap with a name of your choice.

John


Should I enable something else to get snapshots working ?


Thank you!


Kenneth



Re: [ceph-users] CephFS Snapshots in Mimic

2018-07-31 Thread Ken Dreyer
On Tue, Jul 31, 2018 at 9:23 AM, Kenneth Waegeman
 wrote:
> Thanks David and John,
>
> That sounds logical now. When I did read "To make a snapshot on directory
> “/1/2/3/”, the client invokes “mkdir” on “/1/2/3/.snap” directory
> (http://docs.ceph.com/docs/master/dev/cephfs-snapshots/)" it didn't come to
> mind I should create subdirectory immediately.

That does sound unclear to me as well. Here's a proposed docs change:
https://github.com/ceph/ceph/pull/23353

- Ken


[ceph-users] OMAP warning ( again )

2018-07-31 Thread Brent Kennedy
Upgraded from 12.2.5 to 12.2.6, got a "1 large omap objects" warning
message, then upgraded to 12.2.7 and the message went away.  I just added
four OSDs to balance out the cluster ( we had some servers with fewer drives
in them; jbod config ) and now the "1 large omap objects" warning message is
back.  I did some googlefoo to try to figure out what it means and then how
to correct it, but the how to correct it is a bit vague.

 

We use rados gateways for all storage, so everything is in the .rgw.buckets
pool, which I gather from research is why we are getting the warning message
( there are millions of objects in there ).

 

Is there an if/then process to clearing this error message?

 

Regards,

-Brent

 

Existing Clusters:

Test: Luminous 12.2.7 with 3 osd servers, 1 mon/man, 1 gateway ( all virtual
)

US Production: Firefly with 4 osd servers, 3 mons, 3 gateways behind haproxy
LB

UK Production: Luminous 12.2.7 with 8 osd servers, 3 mons/man, 3 gateways
behind haproxy LB

 

 

 



[ceph-users] Hiring: Ceph community manager

2018-07-31 Thread Rich Bowen

Hi, folks,

The Open Source and Standards (OSAS) group at Red Hat is hiring a Ceph 
Community Manager. If you're interested, check out the job listing here:


https://us-redhat.icims.com/jobs/64407/ceph-community-manager/job

If you'd like to talk to someone about what's involved in being a 
community manager, (in addition to reading the job posting itself) you 
can contact me at any time at any of the below contact methods.


Red Hat is an awesome place to work, and OSAS is the best part of the 
entire company (although I could be a little biased).


--
Rich Bowen - rbo...@redhat.com
@CentOSProject // @rbowen
859 351 9166


Re: [ceph-users] Whole cluster flapping

2018-07-31 Thread Brent Kennedy
I have had this happen during large data movements.  It stopped happening after
I went to 10Gb though (from 1Gb).  What I had done was inject a setting (
and adjust the configs ) to give more time before an OSD was marked down.

 

osd heartbeat grace = 200

mon osd down out interval = 900

 

For injecting runtime values/settings( under runtime changes ):

http://docs.ceph.com/docs/luminous/rados/configuration/ceph-conf/ 
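
For reference, injecting those two values at runtime could look roughly like
this (a sketch using the values above; keep the same settings in ceph.conf so
they survive a restart):

$ ceph tell osd.* injectargs '--osd-heartbeat-grace 200'
$ ceph tell mon.* injectargs '--mon-osd-down-out-interval 900'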

 

Probably should check the logs before doing anything to ensure the OSDs or
host is not failing.  

 

-Brent

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
CUZA Frédéric
Sent: Tuesday, July 31, 2018 5:06 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Whole cluster flapping

 

Hi Everyone,

 

I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
pool that we had (120 TB).

Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
OSD), we have SDD for journal.

 

After I deleted the large pool my cluster started to flapping on all OSDs.

Osds are marked down and then marked up as follow :

 

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
172.29.228.72:6800/95783 boot

2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
degraded, 317 pgs undersized (PG_DEGRADED)

2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)

2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
172.29.228.72:6803/95830 boot

2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
osds down (OSD_DOWN)

2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
degraded, 223 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
172.29.228.246:6812/3144542 boot

2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
osds down (OSD_DOWN)

2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
degraded, 220 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
degraded, 197 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs
degraded, 197 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed
(root=default,room=,host=) (8 reporters from different host after
54.650576 >= grace 54.300663)

2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5
osds down (OSD_DOWN)

2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update:
5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs
degraded, 201 pgs unde

Re: [ceph-users] Force cephfs delayed deletion

2018-07-31 Thread Kamble, Nitin A
Hi John,

I am running ceph Luminous 12.2.1 release on the storage nodes with v4.4.114 
kernel on the cephfs clients.

3 client nodes are running 3 instances of a test program.
The test program is doing this repeatedly in a loop:

  *   sequentially write a 256GB file on cephfs
  *   delete the file

‘ceph df’ shows that after the delete the space is not getting freed from cephfs, 
and the cephfs space utilization (number of objects, space used and % 
utilization) keeps growing continuously.

I double checked, and no process is holding an open handle to the closed files.

When the test program is stopped, the writing workload stops and then the 
cephfs space utilization starts going down as expected.

Looks like the cephfs write load is not giving enough opportunity to actually 
perform the delete file operations from clients. It is a consistent behavior, 
and easy to reproduce.

I tried playing with these advanced MDS config parameters:

  *   mds_max_purge_files
  *   mds_max_purge_ops
  *   mds_max_purge_ops_per_pg
  *   mds_purge_queue_busy_flush_period

But it is not helping with the workload.
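
For reference, checking and bumping these at runtime on the active MDS looks
roughly like this (a sketch; <name> is the MDS name, and the values are purely
illustrative, not recommendations):

$ ceph daemon mds.<name> config show | grep purge
$ ceph tell mds.<name> injectargs '--mds_max_purge_ops 16384 --mds_max_purge_files 256'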

Is this a known issue? And is there a workaround to give more priority to the 
objects purging operations?

Thanks in advance,
Nitin

From: ceph-users  on behalf of Alexander 
Ryabov 
Date: Thursday, July 19, 2018 at 8:09 AM
To: John Spray 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Force cephfs delayed deletion


>Also, since I see this is a log directory, check that you don't have some 
>processes that are holding their log files open even after they're unlinked.

Thank you very much - that was the case.

lsof /mnt/logs | grep deleted



After dealing with these, space was reclaimed in about 2-3min.






From: John Spray 
Sent: Thursday, July 19, 2018 17:24
To: Alexander Ryabov
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Force cephfs delayed deletion

On Thu, Jul 19, 2018 at 1:58 PM Alexander Ryabov wrote:

Hello,

I see that free space is not released after files are removed on CephFS.

I'm using Luminous with replica=3 without any snapshots etc and with default 
settings.



From client side:
$ du -sh /mnt/logs/
4.1G /mnt/logs/
$ df -h /mnt/logs/
Filesystem   Size  Used Avail Use% Mounted on
h1,h2:/logs  125G   87G   39G  70% /mnt/logs

These stats are after a couple of large files were removed in the /mnt/logs dir, but 
that only dropped the Used space a little.

Check what version of the client you're using -- some older clients had bugs 
that would hold references to deleted files and prevent them from being purged. 
 If you find that the space starts getting freed when you unmount the client, 
this is likely to be because of a client bug.

Also, since I see this is a log directory, check that you don't have some 
processes that are holding their log files open even after they're unlinked.

John



Doing 'sync' command also changes nothing.

From server side:
# ceph  df
GLOBAL:
SIZE AVAIL  RAW USED %RAW USED
124G 39226M   88723M 69.34
POOLS:
NAMEID USED   %USED MAX AVAIL OBJECTS
cephfs_data 1  28804M 76.80 8703M7256
cephfs_metadata 2236M  2.65 8703M 101


Why there are such a large difference between 'du' and 'USED'?

I've found that it could be due to 'delayed delete' 
http://docs.ceph.com/docs/luminous/dev/delayed-delete/

And previously it seems it could be tuned by adjusting the "mds max purge files" 
and "mds max purge ops"

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013679.html

But there is no more of such options in 
http://docs.ceph.com/docs/luminous/cephfs/mds-config-ref/



So the question is - how to purge deleted data and reclaim free space?

Thank you.



Re: [ceph-users] Mgr cephx caps to run `ceph fs status`?

2018-07-31 Thread Linh Vu
Thanks John, that works! It also works with multiple commands, e.g. I granted my 
user access to both `ceph fs status` and `ceph status`:


mgr 'allow command "fs status", allow command "status"'


From: John Spray 
Sent: Tuesday, 31 July 2018 8:12:00 PM
To: Linh Vu
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mgr cephx caps to run `ceph fs status`?

On Tue, Jul 31, 2018 at 3:36 AM Linh Vu  wrote:
>
> Hi all,
>
>
> I want a non-admin client to be able to run `ceph fs status`, either via the 
> ceph CLI or a python script. Adding `mgr "allow *"` to this client's cephx 
> caps works, but I'd like to be more specific if possible. I can't find the 
> complete list of mgr cephx caps anywhere, so if you could point me in the 
> right direction, that'd be great!

Both mgr and mon caps have an "allow command" syntax that lets you
restrict users to specific named commands (and even specific
arguments). Internally, the mgr and the mon use the same code to
interpret capabilities.

I just went looking for the documentation for those mon caps and it
appears not to exist!

Anyway, in your case it's something like this:

mgr "allow command \"fs status\""

I don't think I've ever tested this on a mgr daemon, so let us know
how you get on.

John



>
> Cheers,
>
> Linh
>


Re: [ceph-users] OMAP warning ( again )

2018-07-31 Thread Brad Hubbard
Search the cluster log for 'Large omap object found' for more details.
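
For example, on a monitor host (a sketch; the log path assumes a default
install, and resharding as the fix is an assumption based on the .rgw.buckets
pool you mention, not something verified here):

$ grep 'Large omap object' /var/log/ceph/ceph.log
# the message names the pool/PG, object and key count; for an RGW bucket index
# object, resharding that bucket (radosgw-admin bucket reshard ...) is the
# usual way to bring the omap size back under the warning threshold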

On Wed, Aug 1, 2018 at 3:50 AM, Brent Kennedy  wrote:
> Upgraded from 12.2.5 to 12.2.6, got a “1 large omap objects” warning
> message, then upgraded to 12.2.7 and the message went away.  I just added
> four OSDs to balance out the cluster ( we had some servers with fewer drives
> in them; jbod config ) and now the “1 large omap objects” warning message is
> back.  I did some googlefoo to try to figure out what it means and then how
> to correct it, but the how to correct it is a bit vague.
>
>
>
> We use rados gateways for all storage, so everything is in the .rgw.buckets
> pool, which I gather from research is why we are getting the warning message
> ( there are millions of objects in there ).
>
>
>
> Is there an if/then process to clearing this error message?
>
>
>
> Regards,
>
> -Brent
>
>
>
> Existing Clusters:
>
> Test: Luminous 12.2.7 with 3 osd servers, 1 mon/man, 1 gateway ( all virtual
> )
>
> US Production: Firefly with 4 osd servers, 3 mons, 3 gateways behind haproxy
> LB
>
> UK Production: Luminous 12.2.7 with 8 osd servers, 3 mons/man, 3 gateways
> behind haproxy LB
>
>
>
>
>
>
>
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Force cephfs delayed deletion

2018-07-31 Thread Yan, Zheng
On Wed, Aug 1, 2018 at 6:43 AM Kamble, Nitin A wrote:

> Hi John,
>
>
>
> I am running ceph Luminous 12.2.1 release on the storage nodes with
> v4.4.114 kernel on the cephfs clients.
>
>
>
> 3 client nodes are running 3 instances of a test program.
>
> The test program is doing this repeatedly in a loop:
>
>- sequentially write a 256GB file on cephfs
>- delete the file
>
Do the clients write to the same file? I mean the same file name in a
directory.



>
>
> ‘ceph df’ shows that after delete the space is not getting freed from
> cephfs and and cephfs space utilization (number of objects, space used and
>  % utilization) keeps growing up continuously.
>
>
>
> I double checked, and no process is holding an open handle to the closed
> files.
>
>
>
> When the test program is stopped, the writing workload stops and then the
> cephfs space utilization starts going down as expected.
>
>
>
> Looks like the cephfs write load is not giving enough opportunity to
> actually perform the delete file operations from clients. It is a
> consistent behavior, and easy to reproduce.
>
>
>
> I tried playing with these advanced MDS config parameters:
>
>- mds_max_purge_files
>- mds_max_purge_ops
>- mds_max_purge_ops_per_pg
>- mds_purge_queue_busy_flush_period
>
>
>
> But it is not helping with the workload.
>
>
>
> Is this a known issue? And is there a workaround to give more priority to
> the objects purging operations?
>
>
>
> Thanks in advance,
>
> Nitin
>
>
>
> *From: *ceph-users  on behalf of
> Alexander Ryabov 
> *Date: *Thursday, July 19, 2018 at 8:09 AM
> *To: *John Spray 
> *Cc: *"ceph-users@lists.ceph.com" 
> *Subject: *Re: [ceph-users] Force cephfs delayed deletion
>
>
>
> >Also, since I see this is a log directory, check that you don't have
> some processes that are holding their log files open even after they're
> unlinked.
>
> Thank you very much - that was the case.
>
> lsof /mnt/logs | grep deleted
>
>
>
> After dealing with these, space was reclaimed in about 2-3min.
>
>
>
>
> --
>
> *From:* John Spray 
> *Sent:* Thursday, July 19, 2018 17:24
> *To:* Alexander Ryabov
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Force cephfs delayed deletion
>
>
>
> On Thu, Jul 19, 2018 at 1:58 PM Alexander Ryabov 
> wrote:
>
> Hello,
>
> I see that free space is not released after files are removed on CephFS.
>
> I'm using Luminous with replica=3 without any snapshots etc and with
> default settings.
>
>
>
> From client side:
>
> $ du -sh /mnt/logs/
>
> 4.1G /mnt/logs/
>
> $ df -h /mnt/logs/
>
> Filesystem   Size  Used Avail Use% Mounted on
>
> h1,h2:/logs  125G   87G   39G  70% /mnt/logs
>
>
>
> These stats are after couple of large files were removed in /mnt/logs dir,
> but that only dropped Useв space a little.
>
>
>
> Check what version of the client you're using -- some older clients had
> bugs that would hold references to deleted files and prevent them from
> being purged.  If you find that the space starts getting freed when you
> unmount the client, this is likely to be because of a client bug.
>
>
>
> Also, since I see this is a log directory, check that you don't have some
> processes that are holding their log files open even after they're unlinked.
>
>
>
> John
>
>
>
>
>
>
>
> Doing 'sync' command also changes nothing.
>
>
>
> From server side:
>
> # ceph  df
>
> GLOBAL:
>
> SIZE AVAIL  RAW USED %RAW USED
>
> 124G 39226M   88723M 69.34
>
> POOLS:
>
> NAMEID USED   %USED MAX AVAIL OBJECTS
>
> cephfs_data 1  28804M 76.80 8703M7256
>
> cephfs_metadata 2236M  2.65 8703M 101
>
>
>
> Why there are such a large difference between 'du' and 'USED'?
>
> I've found that it could be due to 'delayed delete'
> http://docs.ceph.com/docs/luminous/dev/delayed-delete/
> 
>
> And previously it seems could be tuned by adjusting the "mds max purge
> files" and "mds max purge ops"
>
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013679.html
> 
>
> But there is no more of such options in
> http://docs.ceph.com/docs/luminous/cephfs/mds-config-ref/
> 

Re: [ceph-users] Write operation to cephFS mount hangs

2018-07-31 Thread Gregory Farnum
On Tue, Jul 31, 2018 at 7:46 PM Bödefeld Sabine wrote:

> Hello,
>
>
>
> we have a Ceph Cluster 10.2.10 on VMs with Ubuntu 16.04 using Xen as the
> hypervisor. We use CephFS and the clients use ceph-fuse to access the files.
>
> Some of the ceph-fuse clients hang on write operations to the cephFS. On
> copying a file to the cephFS, the file is created but it's empty and the
> write operation hangs forever. Ceph-fuse version is 10.2.9.
>

Sounds like the client has the MDS permissions required to update the
CephFS metadata hierarchy, but lacks permission to write to the RADOS pools
which actually store the file data. What permissions do the clients have?
Have you checked with "ceph auth list" or similar to make sure they all
have the same CephX capabilities?
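
For comparison, a working CephFS client key typically carries caps along these
lines (a sketch; the client id and the data pool name cephfs_data are
illustrative -- check yours against the pools your filesystem actually uses):

$ ceph auth get client.<id>
    caps mds = "allow rw"
    caps mon = "allow r"
    caps osd = "allow rw pool=cephfs_data"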
-Greg


> In the logfile of the mds there are no error messages. Also, ceph health
> returns HEALTH_OK.
>
> ceph daemon mds.eco61 session ls reports no problems (if I interpret
> correctly):
>
>{
>
> "id": 64396,
>
> "num_leases": 2,
>
> "num_caps": 32,
>
> "state": "open",
>
> "replay_requests": 0,
>
> "completed_requests": 1,
>
> "reconnecting": false,
>
> "inst": "client.64396 192.168.1.179:0\/980852091",
>
> "client_metadata": {
>
> "ceph_sha1": "2ee413f77150c0f375ff6f10edd6c8f9c7d060d0",
>
> "ceph_version": "ceph version 10.2.9
> (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)",
>
> "entity_id": "admin",
>
> "hostname": "eco79",
>
> "mount_point": "\/mnt\/cephfs",
>
> "root": "\/"
>
> }
>
> },
>
>
>
> Does anyone have an idea where the problem lies? Any help would be greatly
> appreciated.
>
> Thanks very much,
>
> Kind regards
>
> Sabine