Re: [ceph-users] Cephfs NFS failover
On 20.12.2017 18:45, nigel davies wrote:
> Hay all
>
> Can any one advise on how it can do this.

You can use CTDB for that and run an active/active NFS cluster:
https://wiki.samba.org/index.php/Setting_up_CTDB_for_Clustered_NFS

The cluster filesystem can be a CephFS. This also works with Samba, i.e. you get an unlimited fileserver.

Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43  Fax: 030 / 405051-19
Mandatory disclosures per §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Managing Director: Peer Heinlein -- Registered office: Berlin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Proper way of removing osds
Hi,

This is how I remove an OSD from the cluster:

- Take it out:
  ceph osd out osdid
  Wait for the rebalancing to finish.

- Mark it down:
  ceph osd down osdid

- Then purge it:
  ceph osd purge osdid --yes-i-really-mean-it

While purging I can see there is another rebalancing occurring.

Is this the correct way to remove OSDs, or am I doing something wrong?

Karun
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Cephfs limits
Hey all, is it possible to set CephFS to have a space limit?
E.g. I would like to set my CephFS to have a limit of 20TB
and my S3 storage to have 4TB, for example.

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Proper way of removing osds
> Is this the correct way to remove OSDs, or am I doing something wrong?

The generic way for maintenance (e.g. a disk replacement) is to rebalance by changing the OSD's crush weight:

ceph osd crush reweight osdid 0

The cluster then migrates the data off this OSD. When the cluster is HEALTH_OK again you can safely remove the OSD:

ceph osd out osd_id
systemctl stop ceph-osd@osd_id
ceph osd crush remove osd_id
ceph auth del osd_id
ceph osd rm osd_id

k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Slow backfilling with bluestore, ssd and metadata pools
Hi, we are in the process of migrating our hosts to bluestore. Each host has 12 HDDs (6TB / 4TB) and two Intel P3700 NVME SSDs with 375 GB capacity. The new bluestore OSDs are created by ceph-volume: ceph-volume lvm create --bluestore --block.db /dev/nvmeXn1pY --data /dev/sdX1 6 OSDs share a SSD with 30GB partitions for rocksdb; the remaining space is used as additional ssd based osd without specifying additional partitions. Backfilling from the other nodes works fine for the hdd based OSDs, but is _really_ slow for the ssd based ones. With filestore moving our cephfs metadata pool around was a matter of 10 minutes (350MB, 8 million objects, 1024 PGs). With bluestore remapped a part of the pool (about 400PGs, those affected by adding a new pair of ssd based OSDs) did not finish over night OSD config section from ceph.conf: [osd] osd_scrub_sleep = 0.05 osd_journal_size = 10240 osd_scrub_chunk_min = 1 osd_scrub_chunk_max = 1 max_pg_per_osd_hard_ratio = 4.0 osd_max_pg_per_osd_hard_ratio = 4.0 bluestore_cache_size_hdd = 5368709120 mon_max_pg_per_osd = 400 Backfilling runs with max-backfills set to 20 during day and 50 during night. Some numbers (ceph pg dump for the most advanced backfilling cephfs metadata PG, ten seconds difference): ceph pg dump | grep backfilling | grep -v undersized | sort -k4 -n -r | tail -n 1 && sleep 10 && echo && ceph pg dump | grep backfilling | grep -v undersized | sort -k4 -n -r | tail -n 1 dumped all 8.101 7581 0 0 4549 0 4194304 2488 2488 active+remapped+backfilling 2017-12-21 09:03:30.429605 543240'1012998 543248:1923733 [78,34,49] 78 [78,34,19] 78 522371'1009118 2017-12-18 16:11:29.755231 522371'1009118 2017-12-18 16:11:29.755231 dumped all 8.101 7580 0 0 4542 0 0 2489 2489 active+remapped+backfilling 2017-12-21 09:03:30.429605 543248'1012999 543250:1923755 [78,34,49] 78 [78,34,19] 78 522371'1009118 2017-12-18 16:11:29.755231 522371'1009118 2017-12-18 16:11:29.755231 Seven objects in 10 seconds does not sound sane to me, given that only key-value has to be transferred. Any hints how to tune this? Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Proper way of removing osds
On 21/12/17 10:21, Konstantin Shalygin wrote:
>> Is this the correct way to remove OSDs, or am I doing something wrong?
> The generic way for maintenance (e.g. a disk replacement) is to rebalance by changing the OSD's crush weight:
>
> ceph osd crush reweight osdid 0
>
> The cluster then migrates the data off this OSD.
>
> When HEALTH_OK you can safely remove this OSD:
>
> ceph osd out osd_id
> systemctl stop ceph-osd@osd_id
> ceph osd crush remove osd_id
> ceph auth del osd_id
> ceph osd rm osd_id
>
> k

Basically this. When you mark an OSD "out" it stops receiving data and its PGs will be remapped, but it is still part of the crushmap and influences the weights of buckets - so when you do the final purge your weights shift and another rebalance occurs. Weighting the OSD to 0 first ensures you don't incur any extra data movement when you finally purge it.

Rich
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Proper way of removing osds
Hi,

On 12/21/2017 11:03 AM, Karun Josy wrote:
> Hi,
>
> This is how I remove an OSD from the cluster:
>
> * Take it out:
>   ceph osd out osdid
>   Wait for the rebalancing to finish.
>
> * Mark it down:
>   ceph osd down osdid
>
> * Then purge it:
>   ceph osd purge osdid --yes-i-really-mean-it
>
> While purging I can see there is another rebalancing occurring.
>
> Is this the correct way to remove OSDs, or am I doing something wrong?

The procedure is correct, but not optimal. The first rebalancing is due to the osd being down; the second rebalancing is due to the fact that removing the osd changes the crush weight of the host and thus the base of the overall data distribution.

If you want to skip this, you can set the crush weight of the to-be-removed osd to 0.0, wait for the rebalancing to finish, and stop and remove the osd afterwards. You can also lower the weight in smaller steps to reduce the backfill impact if necessary; see the sketch below for one way to do that.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
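For reference - not part of Burkhard's reply - a minimal shell sketch of the "smaller steps" idea, assuming the OSD still has a non-zero crush weight and that the example intermediate weights and 60-second poll interval suit your cluster:

# Hypothetical example: step the crush weight down gradually instead of
# jumping straight to 0, letting backfill settle between steps.
OSD=osd.42                            # example OSD name
for W in 1.2 0.8 0.4 0.0; do          # example intermediate weights
    ceph osd crush reweight ${OSD} ${W}
    # wait until the backfill triggered by this step has finished
    until ceph health | grep -q HEALTH_OK; do sleep 60; done
done

Once the weight is 0.0 and the cluster is healthy again, the out / stop / crush remove / auth del / rm sequence from Konstantin's mail applies unchanged.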
Re: [ceph-users] Slow backfilling with bluestore, ssd and metadata pools
On 21/12/17 10:28, Burkhard Linke wrote: > OSD config section from ceph.conf: > > [osd] > osd_scrub_sleep = 0.05 > osd_journal_size = 10240 > osd_scrub_chunk_min = 1 > osd_scrub_chunk_max = 1 > max_pg_per_osd_hard_ratio = 4.0 > osd_max_pg_per_osd_hard_ratio = 4.0 > bluestore_cache_size_hdd = 5368709120 > mon_max_pg_per_osd = 400 Consider also playing with the following OSD parameters: osd_recovery_max_active osd_recovery_sleep osd_recovery_sleep_hdd osd_recovery_sleep_hybrid osd_recovery_sleep_ssd In my anecdotal experience, the forced wait between requests (controlled by the recovery_sleep parameters) was causing significant slowdown in recovery speed in my cluster, though even at the default values it wasn't making things go nearly as slowly as your cluster - it sounds like something else is probably wrong. Rich signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
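As a side note - not from Richard's mail - these values can be inspected and changed at runtime; the parameter names are the Luminous ones and the example values are only illustrative:

# Check what a running OSD currently uses (run on the OSD host).
ceph daemon osd.78 config show | grep osd_recovery_sleep

# Reduce the forced sleep between recovery ops on all OSDs.
ceph tell osd.* injectargs '--osd_recovery_sleep_hybrid 0.0'

# And/or allow more concurrent recovery ops per OSD.
ceph tell osd.* injectargs '--osd_recovery_max_active 5'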
Re: [ceph-users] Not timing out watcher
On Wed, Dec 20, 2017 at 6:56 PM, Jason Dillaman wrote: > ... looks like this watch "timeout" was introduced in the kraken > release [1] so if you don't see this issue with a Jewel cluster, I > suspect that's the cause. > > [1] https://github.com/ceph/ceph/pull/11378 Strictly speaking that's a backwards incompatible change, because zeroes have never been and aren't enforced -- clients are free to fill the remaining bits of ceph_osd_op with whatever values. That said, the kernel client has always been zeroing the front portion of the message before encoding, so even though the timeout field hasn't been carried into ceph_osd_op definition in the kernel, it's always 0 (for "use osd_client_watch_timeout for this watch"). Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Not timing out watcher
On Wed, Dec 20, 2017 at 6:20 PM, Serguei Bezverkhi (sbezverk) wrote: > It took 30 minutes for the Watcher to time out after ungraceful restart. Is > there a way limit it to something a bit more reasonable? Like 1-3 minutes? > > On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)" > wrote: > > Ok, here is what I found out. If I gracefully kill a pod then watcher > gets properly cleared, but if it is done ungracefully, without “rbd unmap” > then even after a node reboot Watcher stays up for a long time, it has been > more than 20 minutes and it is still active (no any kubernetes services are > running). Hi Serguei, Can you try taking k8s out of the equation -- set up a fresh VM with the same kernel, do "rbd map" in it and kill it? Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] POOL_NEARFULL
> Update your ceph.conf file

This also does not help. I have created a ticket: http://tracker.ceph.com/issues/22520
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
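For context - not part of the original exchange - on Luminous the full/nearfull ratios live in the OSDMap rather than in ceph.conf, so a change along these lines (ratio values are examples only) is what actually takes effect:

# Adjust the ratios directly in the OSDMap.
ceph osd set-nearfull-ratio 0.85
ceph osd set-full-ratio 0.95

# Verify what the cluster currently uses.
ceph osd dump | grep ratio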
Re: [ceph-users] Slow backfilling with bluestore, ssd and metadata pools
Hi, On 12/21/2017 11:43 AM, Richard Hesketh wrote: On 21/12/17 10:28, Burkhard Linke wrote: OSD config section from ceph.conf: [osd] osd_scrub_sleep = 0.05 osd_journal_size = 10240 osd_scrub_chunk_min = 1 osd_scrub_chunk_max = 1 max_pg_per_osd_hard_ratio = 4.0 osd_max_pg_per_osd_hard_ratio = 4.0 bluestore_cache_size_hdd = 5368709120 mon_max_pg_per_osd = 400 Consider also playing with the following OSD parameters: osd_recovery_max_active osd_recovery_sleep osd_recovery_sleep_hdd osd_recovery_sleep_hybrid osd_recovery_sleep_ssd In my anecdotal experience, the forced wait between requests (controlled by the recovery_sleep parameters) was causing significant slowdown in recovery speed in my cluster, though even at the default values it wasn't making things go nearly as slowly as your cluster - it sounds like something else is probably wrong. Thanks for the hint. I've been thinking about recovery_sleep, too. But the default for ssd osds is set to 0.0: # ceph daemon osd.93 config show | grep recovery "osd_allow_recovery_below_min_size": "true", "osd_debug_skip_full_check_in_recovery": "false", "osd_force_recovery_pg_log_entries_factor": "1.30", "osd_min_recovery_priority": "0", "osd_recovery_cost": "20971520", "osd_recovery_delay_start": "0.00", "osd_recovery_forget_lost_objects": "false", "osd_recovery_max_active": "3", "osd_recovery_max_chunk": "8388608", "osd_recovery_max_omap_entries_per_chunk": "64000", "osd_recovery_max_single_start": "1", "osd_recovery_op_priority": "3", "osd_recovery_op_warn_multiple": "16", "osd_recovery_priority": "5", "osd_recovery_retry_interval": "30.00", "osd_recovery_sleep": "0.00", "osd_recovery_sleep_hdd": "0.10", "osd_recovery_sleep_hybrid": "0.025000", "osd_recovery_sleep_ssd": "0.00", "osd_recovery_thread_suicide_timeout": "300", "osd_recovery_thread_timeout": "30", "osd_scrub_during_recovery": "false", osd 93 is one of the ssd osd I've just recreated using bluestore about 3 hours ago. All recovery related values are at their defaults. Since the first mail one hour ago the PG made some progress: 8.101 7580 0 0 2777 0 0 2496 2496 active+remapped+backfilling 2017-12-21 09:03:30.429605 543455'1013006 543518:1927782 [78,34,49] 78 [78,34,19] 78 522371'1009118 2017-12-18 16:11:29.755231 522371'1009118 2017-12-18 16:11:29.755231 So roughly 2000 objects on this PG have been copied to a new ssd based OSD (78,34,19 -> 78,34,49 -> one new copy). Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs mds millions of caps
I have upgraded the kernel on a client node (one that has close-to-zero traffic) used for tests. { "reconnecting" : false, "id" : 1620266, "num_leases" : 0, "inst" : "client.1620266 10.0.0.111:0/3921220890", "state" : "open", "completed_requests" : 0, "num_caps" : 1402490, "client_metadata" : { "kernel_version" : "4.4.0-104-generic", "hostname" : "suppressed", "entity_id" : "admin" }, "replay_requests" : 0 }, still 1.4M caps used. is upgrading the client kernel enough ? Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima < webert.b...@gmail.com> wrote: > So, > > On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng wrote: > >> >> 300k are ready quite a lot. opening them requires long time. does you >> mail server really open so many files? > > > Yes, probably. It's a commercial solution. A few thousand domains, dozens > of thousands of users and god knows how any mailboxes. > From the daemonperf you can see the write workload is high, so yes, too > much files opening (dovecot mdbox stores multiple e-mails per file, split > into many files). > > I checked 4.4 kernel, it includes the code that trim cache when mds >> recovers. > > > Ok, all nodes are running 4.4.0-75-generic. The fix might have been > included in a newer version. > I'll upgrade it asap. > > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > *Belo Horizonte - Brasil* > *IRC NICK - WebertRLZ* > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi, Since many ceph clusters use intel ssds and admins do recommend them, they are probably very good drives. My own experiences however are not so good with them. (About 70% of our intel drives ran into the 8mb bug at my previous job, 5xx and DC35xx series both, latest firmware at that time, <<10% cell usage, ~1 year use). For the future I would recommend that you use different series / vendors for each of the failure domains, this way you can minimize the chance of "correlated failures". There is a lecture about this here from Lars at Suse: https://www.youtube.com/watch?v=fgRWVZXxRN8 Regards, Denes. On 12/21/2017 02:48 AM, David Herselman wrote: Hi Christian, Thanks for taking the time, I haven't been contacted by anyone yet but managed to get the down placement groups cleared by exporting 7.4s0 and 7.fs0 and then marking them as complete on the surviving OSDs: kvm5c: ceph-objectstore-tool --op export --pgid 7.4s0 --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export; ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --pgid 7.4s0; kvm5f: ceph-objectstore-tool --op export --pgid 7.fs0 --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export; ceph-objectstore-tool --op mark-complete --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --pgid 7.fs0; This would presumably simply punch holes in the RBD images but at least we can copy them out of that pool and hope that Intel can somehow unlock the drives for us to then export/import objects. To answer your questions though, we have 6 near identical Intel Wildcat Pass 1U servers and have Proxmox loaded on them. 
Proxmox uses a Debian 9 base with the Ubuntu kernel, for which they apply cherry picked kernel patches (eg Intel NIC driver updates, vhost perf regression and mem-leak fixes, etc): kvm5a: Intel R1208WTTGSR System (serial: BQWS55091014) Intel S2600WTTR Motherboard (serial: BQWL54950385, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT disabled) 24 x Micron 8GB DDR4 2133MHz (24 x 18ASF1G72PZ-2G1B1) Intel AXX10GBNIA I/O Module kvm5b: Intel R1208WTTGS System (serial: BQWS53890178) Intel S2600WTT Motherboard (serial: BQWL52550359, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module kvm5c: Intel R1208WT2GS System (serial: BQWS50490279) Intel S2600WT2 Motherboard (serial: BQWL44650203, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v3 2.6GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module kvm5d: Intel R1208WTTGSR System (serial: BQWS62291318) Intel S2600WTTR Motherboard (serial: BQWL61855187, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module kvm5e: Intel R1208WTTGSR System (serial: BQWS64290162) Intel S2600WTTR Motherboard (serial: BQWL63953066, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module kvm5f: Intel R1208WTTGSR System (serial: BQWS71790632) Intel S2600WTTR Motherboard (serial: BQWL71050622, BIOS ID: SE5C610.86B.01.01.0021.032120170601) 2 x Intel Xeon E5-2640v4 2.4GHz (HT enabled) 4 x Micron 64GB DDR4 2400MHz LR-DIMM (4 x 72ASS8G72LZ-2G3B2) Intel AXX10GBNIA I/O Module Summary: * 5b has an Intel S2600WTT, 5c has an Intel S2600WT2, all others have S2600WTTR Motherboards * 5a has ECC Registered Dual Rank DDR DIMMs, all others have ECC LoadReduced-DIMMs * 5c has an Intel X540-AT2 10 GbE adapter as the on-board NICs are only 1 GbE Each system has identical discs: * 2 x 480 GB Intel SSD DC S3610 (SSDSC2BX480G4) - partitioned as software RAID1 OS volume and Ceph FileStore journals (spinners) * 4 x 2 TB Seagate discs (ST2000NX0243) - Ceph FileStore OSDs (journals in S3610 partitions) * 2 x 1.9 TB Intel SSD DC S4600 (SSDSC2KG019T7) - Ceph BlueStore OSDs (problematic) Additional information: * All drives are directly attached to the on-board AHCI SATA controllers, via the standard 2.5 inch drive chassis hot-swap bays. * We added 12 x 1.9 TB SSD DC S4600 drives last week Thursday, 2 in each system's slots 7 & 8 * Systems have been ope
Re: [ceph-users] ceph status doesnt show available and used disk space after upgrade
accidently removed mailing list email ++ceph-users Thanks a lot JC for looking into this issue. I am really out of ideas. ceph.conf on mgr node which is also monitor node. [global] fsid = 06c5c906-fc43-499f-8a6f-6c8e21807acf mon_initial_members = node-16 node-30 node-31 mon_host = 172.16.1.9 172.16.1.3 172.16.1.11 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true log_to_syslog_level = info log_to_syslog = True osd_pool_default_size = 2 osd_pool_default_min_size = 1 osd_pool_default_pg_num = 64 public_network = 172.16.1.0/24 log_to_syslog_facility = LOG_LOCAL0 osd_journal_size = 2048 auth_supported = cephx osd_pool_default_pgp_num = 64 osd_mkfs_type = xfs cluster_network = 172.16.1.0/24 osd_recovery_max_active = 1 osd_max_backfills = 1 mon allow pool delete = true [client] rbd_cache_writethrough_until_flush = True rbd_cache = True [client.radosgw.gateway] rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator keyring = /etc/ceph/keyring.radosgw.gateway rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1 rgw_socket_path = /tmp/radosgw.sock rgw_keystone_revocation_interval = 100 rgw_keystone_url = http://192.168.1.3:35357 rgw_keystone_admin_token = jaJSmlTNxgsFp1ttq5SuAT1R rgw_init_timeout = 36 host = controller3 rgw_dns_name = *.sapiennetworks.com rgw_print_continue = True rgw_keystone_token_cache_size = 10 rgw_data = /var/lib/ceph/radosgw user = www-data ceph auth list osd.100 key: AQAtZjpaVZOFBxAAwl0yFLdUOidLzPFjv+HnjA== caps: [mgr] allow profile osd caps: [mon] allow profile osd caps: [osd] allow * osd.101 key: AQA4ZjpaS4wwGBAABwgoXQRc1J8sav4MUkWceQ== caps: [mgr] allow profile osd caps: [mon] allow profile osd caps: [osd] allow * osd.102 key: AQBDZjpaBS2tEBAAtFiPKBzh8JGi8Nh3PtAGCg== caps: [mgr] allow profile osd caps: [mon] allow profile osd caps: [osd] allow * client.admin key: AQD0yXFYflnYFxAAEz/2XLHO/6RiRXQ5HXRAnw== caps: [mds] allow * caps: [mgr] allow * caps: [mon] allow * caps: [osd] allow * client.backups key: AQC0y3FY4YQNNhAAs5fludq0yvtp/JJt7RT4HA== caps: [mgr] allow r caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=backups, allow rwx pool=volumes client.bootstrap-mds key: AQD5yXFYyIxiFxAAyoqLPnxxqWmUr+zz7S+qVQ== caps: [mgr] allow r caps: [mon] allow profile bootstrap-mds client.bootstrap-mgr key: AQBmOTpaXqHQDhAAyDXoxlPmG9QovfmmUd8gIg== caps: [mon] allow profile bootstrap-mgr client.bootstrap-osd key: AQD0yXFYuGkSIhAAelSb3TCPuXRFoFJTBh7Vdg== caps: [mgr] allow r caps: [mon] allow profile bootstrap-osd client.bootstrap-rbd key: AQBnOTpafDS/IRAAnKzuI9AYEF81/6mDVv0QgQ== caps: [mon] allow profile bootstrap-rbd client.bootstrap-rgw key: AQD3yXFYxt1mLRAArxOgRvWmmzT9pmsqTLpXKw== caps: [mgr] allow r caps: [mon] allow profile bootstrap-rgw client.compute key: AQCbynFYRcNWOBAAPzdAKfP21GvGz1VoHBimGQ== caps: [mgr] allow r caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images, allow rwx pool=compute client.images key: AQCyy3FYSMtlJRAAbJ8/U/R82NXvWBC5LmkPGw== caps: [mgr] allow r caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=images client.radosgw.gateway key: AQA3ynFYAYMSAxAApvfe/booa9KhigpKpLpUOA== caps: [mgr] allow r caps: [mon] allow rw caps: [osd] allow rwx client.volumes key: AQCzy3FYa3paKBAA9BlYpQ1PTeR770ghVv1jKQ== caps: [mgr] allow r caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow 
rx pool=images mgr.controller2 key: AQAmVTpaA+9vBhAApD3rMs//Qri+SawjUF4U4Q== caps: [mds] allow * caps: [mgr] allow * caps: [mon] allow * caps: [osd] allow * mgr.controller3 key: AQByfDparprIEBAAj7Pxdr/87/v0kmJV49aKpQ== caps: [mds] allow * caps: [mgr] allow * caps: [mon] allow * caps: [osd] allow * Regards, Kevin On Thu, Dec 21, 2017 at 8:10 AM, kevin parrikar wrote: > Thanks JC, > I tried > ceph auth caps client.admin osd 'allow *' mds 'allow *' mon 'allow *' mgr > 'allow *' > > but still status is same,also mgr.log is being flooded with below errors. > > 2017-12-21 02:39:10.622834 7fb40a22b700 0 Cannot get stat of OSD 140 > 2017-12-21 02:39:10.622835 7fb40a22b700 0 Cannot get stat of OSD 141 > Not sure whats wrong in my setup > > Regards, > Kevin > > > On Thu, Dec 21, 2017 at 2:37 AM, Jean-Charles Lopez > wrote: > >> Hi, >> >> make sure client.admin user has an MGR cap using ceph auth list. At some >> point there was a
Re: [ceph-users] Cephfs limits
On Thu, Dec 21, 2017 at 6:18 PM, nigel davies wrote:
> Hey all, is it possible to set CephFS to have a space limit?
> E.g. I would like to set my CephFS to have a limit of 20TB
> and my S3 storage to have 4TB, for example.

You can set a pool quota on the cephfs data pools.

> thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
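A sketch of the pool quota commands - not from Zheng's reply, and the pool names cephfs_data and default.rgw.buckets.data are only assumptions, adjust them to your setup:

# Limit the CephFS data pool to 20 TB (value in bytes).
ceph osd pool set-quota cephfs_data max_bytes $((20 * 1024**4))

# Limit the S3 (RGW) bucket data pool to 4 TB.
ceph osd pool set-quota default.rgw.buckets.data max_bytes $((4 * 1024**4))

# Check the quotas.
ceph osd pool get-quota cephfs_data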
Re: [ceph-users] Added two OSDs, 10% of pgs went inactive
Caspar, I found Nick Fisk's post yesterday http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023223.html and set osd_max_pg_per_osd_hard_ratio = 4 in my ceph.conf on the OSDs and restarted the 10TB OSDs. The PGs went back active and recovery is complete now. My setup is similar to his in that there's a large difference in OSD size, most are 1.8TB, but about 10% of them are 10TB. The difference is I had a functional Luminous cluster, until increased the number 10TB OSDs from 6 to 8. I'm still not sure why that caused *more* PGs per OSD with the same pools. Thanks! Daniel On Wed, Dec 20, 2017 at 10:23 AM, Caspar Smit wrote: > Hi Daniel, > > I've had the same problem with creating a new 12.2.2 cluster where i > couldn't get some pgs out of the "activating+remapped" status after i > switched some OSD's from one chassis to another (there was no data on it > yet). > > I tried restarting OSD's to no avail. > > Couldn't find anything about the stuck in "activating+remapped" state so > in the end i threw away the pool and started over. > > Could this be a bug in 12.2.2 ? > > Kind regards, > Caspar > > 2017-12-20 15:48 GMT+01:00 Daniel K : > >> Just an update. >> >> Recovery completed but the PGS are still inactive. >> >> Still having a hard time understanding why adding OSDs caused this. I'm >> on 12.2.2 >> >> user@admin:~$ ceph -s >> cluster: >> id: a3672c60-3051-440c-bd83-8aff7835ce53 >> health: HEALTH_WARN >> Reduced data availability: 307 pgs inactive >> Degraded data redundancy: 307 pgs unclean >> >> services: >> mon: 5 daemons, quorum stor585r2u8a,stor585r2u12a,sto >> r585r2u16a,stor585r2u20a,stor585r2u24a >> mgr: stor585r2u8a(active) >> osd: 88 osds: 87 up, 87 in; 133 remapped pgs >> >> data: >> pools: 12 pools, 3016 pgs >> objects: 387k objects, 1546 GB >> usage: 3313 GB used, 186 TB / 189 TB avail >> pgs: 10.179% pgs not active >> 2709 active+clean >> 174 activating >> 133 activating+remapped >> >> io: >> client: 8436 kB/s rd, 935 kB/s wr, 140 op/s rd, 64 op/s wr >> >> >> On Tue, Dec 19, 2017 at 8:57 PM, Daniel K wrote: >> >>> I'm trying to understand why adding OSDs would cause pgs to go inactive. >>> >>> This cluster has 88 OSDs, and had 6 OSD with device class "hdd_10TB_7.2k" >>> >>> I added two more OSDs, set the device class to "hdd_10TB_7.2k" and 10% >>> of pgs went inactive. >>> >>> I have an EC pool on these OSDs with the profile: >>> user@admin:~$ ceph osd erasure-code-profile get ISA_10TB_7.2k_4.2 >>> crush-device-class=hdd_10TB_7.2k >>> crush-failure-domain=host >>> crush-root=default >>> k=4 >>> m=2 >>> plugin=isa >>> technique=reed_sol_van. 
>>> >>> some outputs of ceph health detail and ceph osd df >>> user@admin:~$ ceph osd df |grep 10TB >>> 76 hdd_10TB_7.2k 9.09509 1.0 9313G 349G 8963G 3.76 2.20 488 >>> 20 hdd_10TB_7.2k 9.09509 1.0 9313G 345G 8967G 3.71 2.17 489 >>> 28 hdd_10TB_7.2k 9.09509 1.0 9313G 344G 8968G 3.70 2.17 484 >>> 36 hdd_10TB_7.2k 9.09509 1.0 9313G 345G 8967G 3.71 2.17 484 >>> 87 hdd_10TB_7.2k 9.09560 1.0 9313G 8936M 9305G 0.09 0.05 311 >>> 86 hdd_10TB_7.2k 9.09560 1.0 9313G 8793M 9305G 0.09 0.05 304 >>> 6 hdd_10TB_7.2k 9.09509 1.0 9313G 344G 8969G 3.70 2.16 471 >>> 68 hdd_10TB_7.2k 9.09509 1.0 9313G 344G 8969G 3.70 2.17 480 >>> user@admin:~$ ceph health detail|grep inactive >>> HEALTH_WARN 68287/1928007 objects misplaced (3.542%); Reduced data >>> availability: 307 pgs inactive; Degraded data redundancy: 341 pgs unclean >>> PG_AVAILABILITY Reduced data availability: 307 pgs inactive >>> pg 24.60 is stuck inactive for 1947.792377, current state >>> activating+remapped, last acting [36,20,76,6,68,28] >>> pg 24.63 is stuck inactive for 1946.571425, current state >>> activating+remapped, last acting [28,76,6,20,68,36] >>> pg 24.71 is stuck inactive for 1947.625988, current state >>> activating+remapped, last acting [6,68,20,36,28,76] >>> pg 24.73 is stuck inactive for 1947.705250, current state >>> activating+remapped, last acting [36,6,20,76,68,28] >>> pg 24.74 is stuck inactive for 1947.828063, current state >>> activating+remapped, last acting [68,36,28,20,6,76] >>> pg 24.75 is stuck inactive for 1947.475644, current state >>> activating+remapped, last acting [6,28,76,36,20,68] >>> pg 24.76 is stuck inactive for 1947.712046, current state >>> activating+remapped, last acting [20,76,6,28,68,36] >>> pg 24.78 is stuck inactive for 1946.576304, current state >>> activating+remapped, last acting [76,20,68,36,6,28] >>> pg 24.7a is stuck inactive for 1947.820932, current state >>> activating+remapped, last acting [36,20,28,68,6,76] >>> pg 24.7b is stuck inactive for 1947.858305, current state >>> activating+remapped, last acting [68,6,20,28,76,36] >>> pg 24.7c is stuck inactive for 1947.753917, current state >>> activating+remapped, las
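Restated as a snippet - not part of the quoted mails - the fix Daniel describes goes into the [osd] section on the OSD hosts and needs the affected OSDs restarted to take effect; the osd id below is just one of the 10TB OSDs from the output above:

# ceph.conf on the OSD hosts
[osd]
osd_max_pg_per_osd_hard_ratio = 4

# then restart the large OSDs so the new ratio is picked up, e.g.:
systemctl restart ceph-osd@76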
[ceph-users] How to use vfs_ceph
Hello folks, is anybody using the vfs_ceph module for exporting cephfs as samba shares? We are running ceph jewel with cephx enabled. Manpage of vfs_ceph only references the option ceph:config_file. How do I need to configure my share (or maybe ceph.conf)? log.smbd: '/' does not exist or permission denied when connecting to [vfs] Error was Transport endpoint is not connected I have a user ctdb with keyring file /etc/ceph/ceph.client.ctdb.keyring with permissions: caps: [mds] allow rw caps: [mon] allow r caps: [osd] allow rwx pool=cephfs_metadata,allow rwx pool=cephfs_data I can mount cephfs with cephf-fuse using the id ctdb and its keyfile. My share definition is: [vfs] comment = vfs path = / read only = No vfs objects = acl_xattr ceph ceph:user_id = ctdb ceph:config_file = /etc/ceph/ceph.conf Any advice is appreciated. Regards Felix -- Forschungszentrum Jülich GmbH 52425 Jülich Sitz der Gesellschaft: Jülich Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498 Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt, Prof. Dr. Sebastian M. Schmidt smime.p7s Description: S/MIME Cryptographic Signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time
Thanks for your information, but I don't think that is my case. My cluster doesn't have any SSDs.

2017-12-21
lin.yunfan

From: Denes Dolhay
Sent: 2017-12-18 06:41
Subject: Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time
To: "ceph-users"
Cc:

Hi,
This is just a tip, I do not know if this actually applies to you, but some SSDs decrease their write throughput on purpose so they do not wear out the cells before the warranty period is over.
Denes.

On 12/17/2017 06:45 PM, shadow_lin wrote:
Hi All,
I am testing luminous 12.2.2 and found a strange behaviour of my cluster. I was testing my cluster throughput by using fio on a mounted rbd with the following fio parameters:

fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest

Everything was fine at the beginning, but after about 10 hrs of testing I found the performance dropped noticeably. Throughput dropped from 300-450MBps to 250-350MBps and osd latency increased from 300ms to 400ms. I also noted the heap stats showed the osds started reclaiming the page heap freelist much more frequently, but the rss memory of the osds was increasing.

Below are links to grafana graphs of my cluster:
cluster metrics: https://pasteboard.co/GYEOgV1.jpg
osd mem metrics: https://pasteboard.co/GYEP74M.png

In the graphs the performance dropped after 10:00. I am investigating what happened but haven't found any clue yet. If you know anything about how to solve this problem or where I should look, please let me know.
Thanks.
2017-12-18
lin.yunfan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] MDS behind on trimming
Hi,

We have two MDS servers. One active, one active-standby. While doing a parallel rsync of 10 threads with loads of files, dirs, and subdirs we get the following HEALTH_WARN:

ceph health detail
HEALTH_WARN 2 MDSs behind on trimming
MDS_TRIM 2 MDSs behind on trimming
    mdsmds2(mds.0): Behind on trimming (124/30)max_segments: 30, num_segments: 124
    mdsmds1(mds.0): Behind on trimming (118/30)max_segments: 30, num_segments: 118

To be clear: the amount of segments behind on trimming fluctuates. It sometimes does get smaller, and is relatively stable around ~ 130.

The load on the MDS is low, load on the OSDs is low (CPU/RAM/IO). All flash, cephfs_metadata co-located on the same OSDs. Using the cephfs kernel client (4.13.0-19-generic) with Ceph 12.2.2 (client as well as cluster runs Ceph 12.2.2). In older threads I found several possible explanations for getting this warning:

1) When the number of segments exceeds that setting, the MDS starts writing back metadata so that it can remove (trim) the oldest segments. If this process is too slow, or a software bug is preventing trimming, then this health message appears.

2) The OSDs cannot keep up with the load.

3) The cephfs kernel client is misbehaving / has a bug.

I definitely don't think nr 2) is the reason. I doubt it's a Ceph MDS bug 1) or client bug 3). Might this be conservative default settings? I.e. not trying to trim fast / soon enough. John wonders in thread [1] if the default journal length should be longer. Yan [2] recommends bumping "mds_log_max_expiring" to a large value (200).

What would you suggest at this point? I'm thinking about the following changes:

mds log max segments = 200
mds log max expiring = 200

Thanks,

Stefan

[1]: https://www.spinics.net/lists/ceph-users/msg39387.html
[2]: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011138.html

--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
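As a sketch - not part of Stefan's mail - the active values and the journal segment count can be checked on the MDS host while the rsync runs; the mds name is taken from the health output above, and if memory serves the "seg" counter in the mds_log section is the num_segments value from the warning:

# Current settings on the running MDS.
ceph daemon mds.mds1 config show | grep -E 'mds_log_max_(segments|expiring)'

# Journal segment / event counters.
ceph daemon mds.mds1 perf dump mds_log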
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
Hi, I assume this can only be a physical manufacturing flaw or a firmware bug? Do Intel publish advisories on recalled equipment? Should others be concerned about using Intel DC S4600 SSD drives? Could this be an electrical issue on the Hot Swap Backplane or BMC firmware issue? Either way, all pure Intel... The hole is only 1.3 GB (4 MB x 339 objects) but perfectly striped through images, file systems are subsequently severely damaged. Is it possible to get Ceph to read in partial data shards? It would provide between 25-75% more yield... Is there anything wrong with how we've proceeded thus far? Would be nice to reference examples of using ceph-objectstore-tool but documentation is virtually non-existent. We used another SSD drive to simulate bringing all the SSDs back online. We carved up the drive to provide equal partitions to essentially simulate the original SSDs: # Partition a drive to provide 12 x 150GB partitions, eg: sdd 8:48 0 1.8T 0 disk |-sdd18:49 0 140G 0 part |-sdd28:50 0 140G 0 part |-sdd38:51 0 140G 0 part |-sdd48:52 0 140G 0 part |-sdd58:53 0 140G 0 part |-sdd68:54 0 140G 0 part |-sdd78:55 0 140G 0 part |-sdd88:56 0 140G 0 part |-sdd98:57 0 140G 0 part |-sdd10 8:58 0 140G 0 part |-sdd11 8:59 0 140G 0 part +-sdd12 8:60 0 140G 0 part Pre-requisites: ceph osd set noout; apt-get install uuid-runtime; for ID in `seq 24 35`; do UUID=`uuidgen`; OSD_SECRET=`ceph-authtool --gen-print-key`; DEVICE='/dev/sdd'$[$ID-23]; # 24-23 = /dev/sdd1, 35-23 = /dev/sdd12 echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | ceph osd new $UUID $ID -i - -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring; mkdir /var/lib/ceph/osd/ceph-$ID; mkfs.xfs $DEVICE; mount $DEVICE /var/lib/ceph/osd/ceph-$ID; ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-$ID/keyring --name osd.$ID --add-key $OSD_SECRET; ceph-osd -i $ID --mkfs --osd-uuid $UUID; chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID; systemctl enable ceph-osd@$ID; systemctl start ceph-osd@$ID; done Once up we imported previous exports of empty head files in to 'real' OSDs: kvm5b: systemctl stop ceph-osd@8; ceph-objectstore-tool --op import --pgid 7.4s0 --data-path /var/lib/ceph/osd/ceph-8 --journal-path /var/lib/ceph/osd/ceph-8/journal --file /var/lib/vz/template/ssd_recovery/osd8_7.4s0.export; chown ceph:ceph -R /var/lib/ceph/osd/ceph-8; systemctl start ceph-osd@8; kvm5f: systemctl stop ceph-osd@23; ceph-objectstore-tool --op import --pgid 7.fs0 --data-path /var/lib/ceph/osd/ceph-23 --journal-path /var/lib/ceph/osd/ceph-23/journal --file /var/lib/vz/template/ssd_recovery/osd23_7.fs0.export; chown ceph:ceph -R /var/lib/ceph/osd/ceph-23; systemctl start ceph-osd@23; Bulk import previously exported objects: cd /var/lib/vz/template/ssd_recovery; for FILE in `ls -1A osd*_*.export | grep -Pv '^osd(8|23)_'`; do OSD=`echo $FILE | perl -pe 's/^osd(\d+).*/\1/'`; PGID=`echo $FILE | perl -pe 's/^osd\d+_(.*?).export/\1/g'`; echo -e "systemctl stop ceph-osd@$OSD\t ceph-objectstore-tool --op import --pgid $PGID --data-path /var/lib/ceph/osd/ceph-$OSD --journal-path /var/lib/ceph/osd/ceph-$OSD/journal --file /var/lib/vz/template/ssd_recovery/osd"$OSD"_$PGID.export"; done | sort Sample output (this will wrap): systemctl stop ceph-osd@27 ceph-objectstore-tool --op import --pgid 7.4s3 --data-path /var/lib/ceph/osd/ceph-27 --journal-path /var/lib/ceph/osd/ceph-27/journal --file /var/lib/vz/template/ssd_recovery/osd27_7.4s3.export systemctl stop ceph-osd@27 ceph-objectstore-tool --op import --pgid 7.fs5 --data-path /var/lib/ceph/osd/ceph-27 --journal-path 
/var/lib/ceph/osd/ceph-27/journal --file /var/lib/vz/template/ssd_recovery/osd27_7.fs5.export systemctl stop ceph-osd@30 ceph-objectstore-tool --op import --pgid 7.fs4 --data-path /var/lib/ceph/osd/ceph-30 --journal-path /var/lib/ceph/osd/ceph-30/journal --file /var/lib/vz/template/ssd_recovery/osd30_7.fs4.export systemctl stop ceph-osd@31 ceph-objectstore-tool --op import --pgid 7.4s2 --data-path /var/lib/ceph/osd/ceph-31 --journal-path /var/lib/ceph/osd/ceph-31/journal --file /var/lib/vz/template/ssd_recovery/osd31_7.4s2.export systemctl stop ceph-osd@32 ceph-objectstore-tool --op import --pgid 7.4s4 --data-path /var/lib/ceph/osd/ceph-32 --journal-path /var/lib/ceph/osd/ceph-32/journal --file /var/lib/vz/template/ssd_recovery/osd32_7.4s4.export systemctl stop ceph-osd@32 ceph-objectstore-tool --op import --pgid 7.fs2 --data-path /var/lib/ceph/osd/ceph-32 --journal-path /var/lib/ceph/osd/ceph-32/journal --file /var/lib/vz/template/ssd_recovery/osd32_7.fs2.export systemctl stop ceph-osd@34 ceph-objectstore-tool --op import --pgid 7.4s5 --data
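A small sanity-check sketch - not part of David's procedure - for verifying an OSD after such an import, with paths and ids taken from the examples above:

# With the OSD still stopped, list the PGs the offline store now contains.
ceph-objectstore-tool --op list-pgs \
    --data-path /var/lib/ceph/osd/ceph-27 \
    --journal-path /var/lib/ceph/osd/ceph-27/journal

# After starting the OSD again, query the imported PG from the cluster side.
ceph pg 7.4s3 query | head -n 20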
Re: [ceph-users] MDS behind on trimming
Hi, We've used double the defaults for around 6 months now and haven't had any behind on trimming errors in that time. mds log max segments = 60 mds log max expiring = 40 Should be simple to try. -- dan On Thu, Dec 21, 2017 at 2:32 PM, Stefan Kooman wrote: > Hi, > > We have two MDS servers. One active, one active-standby. While doing a > parallel rsync of 10 threads with loads of files, dirs, subdirs we get > the following HEALTH_WARN: > > ceph health detail > HEALTH_WARN 2 MDSs behind on trimming > MDS_TRIM 2 MDSs behind on trimming > mdsmds2(mds.0): Behind on trimming (124/30)max_segments: 30, > num_segments: 124 > mdsmds1(mds.0): Behind on trimming (118/30)max_segments: 30, > num_segments: 118 > > To be clear: the amount of segments behind on trimming fluctuates. It > sometimes does get smaller, and is relatively stable around ~ 130. > > The load on the MDS is low, load on OSDs is low (both CPU/RAM/IO). All > flash, cephfs_metadata co-located on the same OSDs. Using cephfs kernel > client (4.13.0-19-generic) with Ceph 12.2.2 (cllient as well as cluster > runs Ceph 12.2.2). In older threads I found several possible > explanations for getting this warning: > > 1) When the number of segments exceeds that setting, the MDS starts > writing back metadata so that it can remove (trim) the oldest > segments. If this process is too slow, or a software bug is preventing > trimming, then this health message appears. > > 2) The OSDs cannot keep up with the load > > 3) cephfs kernel client mis behaving / bug > > I definitely don't think nr 2) is the reason. I doubt it's a Ceph MDS 1) > or client bug 3). Might this be conservative default settings? I.e. not > trying to trim fast / soon enough. John wonders in thread [1] if the > default journal length should be longer. Yan [2] recommends bumping > "mds_log_max_expiring" to a large value (200). > > What would you suggest at this point? I'm thinking about the following > changes: > > mds log max segments = 200 > mds log max expiring = 200 > > Thanks, > > Stefan > > [1]: https://www.spinics.net/lists/ceph-users/msg39387.html > [2]: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011138.html > > -- > | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDS behind on trimming
On Thu, Dec 21, 2017 at 9:32 PM, Stefan Kooman wrote: > Hi, > > We have two MDS servers. One active, one active-standby. While doing a > parallel rsync of 10 threads with loads of files, dirs, subdirs we get > the following HEALTH_WARN: > > ceph health detail > HEALTH_WARN 2 MDSs behind on trimming > MDS_TRIM 2 MDSs behind on trimming > mdsmds2(mds.0): Behind on trimming (124/30)max_segments: 30, > num_segments: 124 > mdsmds1(mds.0): Behind on trimming (118/30)max_segments: 30, > num_segments: 118 > > To be clear: the amount of segments behind on trimming fluctuates. It > sometimes does get smaller, and is relatively stable around ~ 130. > > The load on the MDS is low, load on OSDs is low (both CPU/RAM/IO). All > flash, cephfs_metadata co-located on the same OSDs. Using cephfs kernel > client (4.13.0-19-generic) with Ceph 12.2.2 (cllient as well as cluster > runs Ceph 12.2.2). In older threads I found several possible > explanations for getting this warning: > > 1) When the number of segments exceeds that setting, the MDS starts > writing back metadata so that it can remove (trim) the oldest > segments. If this process is too slow, or a software bug is preventing > trimming, then this health message appears. > > 2) The OSDs cannot keep up with the load > > 3) cephfs kernel client mis behaving / bug > > I definitely don't think nr 2) is the reason. I doubt it's a Ceph MDS 1) > or client bug 3). Might this be conservative default settings? I.e. not > trying to trim fast / soon enough. John wonders in thread [1] if the > default journal length should be longer. Yan [2] recommends bumping > "mds_log_max_expiring" to a large value (200). > > What would you suggest at this point? I'm thinking about the following > changes: > > mds log max segments = 200 > mds log max expiring = 200 > Yes, these change should help. you can also try https://github.com/ceph/ceph/pull/18783 > Thanks, > > Stefan > > [1]: https://www.spinics.net/lists/ceph-users/msg39387.html > [2]: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-July/011138.html > > -- > | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs mds millions of caps
On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima wrote: > I have upgraded the kernel on a client node (one that has close-to-zero > traffic) used for tests. > >{ > "reconnecting" : false, > "id" : 1620266, > "num_leases" : 0, > "inst" : "client.1620266 10.0.0.111:0/3921220890", > "state" : "open", > "completed_requests" : 0, > "num_caps" : 1402490, > "client_metadata" : { > "kernel_version" : "4.4.0-104-generic", > "hostname" : "suppressed", > "entity_id" : "admin" > }, > "replay_requests" : 0 >}, > > still 1.4M caps used. > > is upgrading the client kernel enough ? > See http://tracker.ceph.com/issues/22446. We haven't implemented that feature. "echo 3 >/proc/sys/vm/drop_caches" should drop most caps. > > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > Belo Horizonte - Brasil > IRC NICK - WebertRLZ > > On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima > wrote: >> >> So, >> >> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng wrote: >>> >>> >>> 300k are ready quite a lot. opening them requires long time. does you >>> mail server really open so many files? >> >> >> Yes, probably. It's a commercial solution. A few thousand domains, dozens >> of thousands of users and god knows how any mailboxes. >> From the daemonperf you can see the write workload is high, so yes, too >> much files opening (dovecot mdbox stores multiple e-mails per file, split >> into many files). >> >>> I checked 4.4 kernel, it includes the code that trim cache when mds >>> recovers. >> >> >> Ok, all nodes are running 4.4.0-75-generic. The fix might have been >> included in a newer version. >> I'll upgrade it asap. >> >> >> Regards, >> >> Webert Lima >> DevOps Engineer at MAV Tecnologia >> Belo Horizonte - Brasil >> IRC NICK - WebertRLZ > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
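A sketch of checking the effect - not from Zheng's reply; the mds name is an example and the session listing mirrors the dump shown earlier in the thread:

# On the cephfs client: flush and drop the dentry/inode caches, which
# releases most of the caps the kernel client holds.
sync
echo 3 > /proc/sys/vm/drop_caches

# On the MDS host afterwards: check the per-client cap counts again.
ceph daemon mds.a session ls | grep num_caps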
Re: [ceph-users] Not timing out watcher
Hi Ilya, Here you go, no k8s services running this time: sbezverk@kube-4:~$ sudo rbd map raw-volume --pool kubernetes --id admin -m 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== /dev/rbd0 sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== Watchers: watcher=192.168.80.235:0/3465920438 client.65327 cookie=1 sbezverk@kube-4:~$ sudo rbd info raw-volume --pool kubernetes --id admin -m 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== rbd image 'raw-volume': size 10240 MB in 2560 objects order 22 (4096 kB objects) block_name_prefix: rb.0.fafa.625558ec format: 1 sbezverk@kube-4:~$ sudo reboot sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== Watchers: none It seems when the image was mapped manually, this issue is not reproducible. K8s does not just map the image, it also creates loopback device which is linked to /dev/rbd0. Maybe this somehow reminds rbd client to re-activate a watcher on reboot. I will try to mimic exact steps k8s follows manually to see what exactly forces an active watcher after reboot. Thank you Serguei On 2017-12-21, 5:49 AM, "Ilya Dryomov" wrote: On Wed, Dec 20, 2017 at 6:20 PM, Serguei Bezverkhi (sbezverk) wrote: > It took 30 minutes for the Watcher to time out after ungraceful restart. Is there a way limit it to something a bit more reasonable? Like 1-3 minutes? > > On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)" wrote: > > Ok, here is what I found out. If I gracefully kill a pod then watcher gets properly cleared, but if it is done ungracefully, without “rbd unmap” then even after a node reboot Watcher stays up for a long time, it has been more than 20 minutes and it is still active (no any kubernetes services are running). Hi Serguei, Can you try taking k8s out of the equation -- set up a fresh VM with the same kernel, do "rbd map" in it and kill it? Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Gateway timeout
I have noticed over the years ( been using ceph since 2013 ) that when an OSD attached to a single physical drive ( JBOD setup ) that is failing, that at times this will cause rados gateways to go offline. I have two clusters running ( one on firefly and one on hammer, both scheduled for upgrades next year ) and it happens on both when a drive is not marked out but has many blocked ops requests. The drive is physically still functioning but is most likely failing, just not failed yet. The issue is that the gateways will just stop responding to all requests. Both of our clusters have 3 rados gateways behind a haproxy load balancer, so we know immediately when they drop. This will occur continually until we out the failing OSD ( normally we restart the gateways or the services on them first, then move to out the drive ). Wonder if anyone runs into this, a quick search revealed one hit with no actual resolution. Also wondering if there is some way I could prevent the gateways from falling over due to the unresponsive OSD? I did setup a test Jewel install in our dev and semi-recreate the problem by shutting down all the OSDs. This resulted in the gateway going down completely as well. I imagine taking the OSDs offline like that wouldn't be expected though. It would be nice if the gateway would just throw a message back, like service unavailable. I suppose haproxy is doing this for it though. Regards, Brent ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
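Since the gateways already sit behind haproxy, one mitigation is a more aggressive health check so a hung radosgw is pulled from rotation quickly. This is only a hedged sketch, not from Brent's post; backend name, addresses, port and timings are all examples:

backend radosgw
    mode http
    option httpchk HEAD /
    default-server inter 3s fall 2 rise 2
    server rgw1 10.0.0.11:7480 check
    server rgw2 10.0.0.12:7480 check
    server rgw3 10.0.0.13:7480 check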
[ceph-users] ceph-volume lvm deactivate/destroy/zap
Hi,

For someone who is not an lvm expert, does anyone have a recipe for destroying a ceph-volume lvm osd? (I have a failed disk which I want to deactivate / wipe before physically removing from the host, and the tooling for this doesn't exist yet: http://tracker.ceph.com/issues/22287)

> ceph-volume lvm zap /dev/sdu   # does not work
Zapping: /dev/sdu
Running command: sudo wipefs --all /dev/sdu
 stderr: wipefs: error: /dev/sdu: probing initialization failed: Device or resource busy
-->  RuntimeError: command returned non-zero exit status: 1

This is the drive I want to remove:

= osd.240 ==
  [block]       /dev/ceph-/osd-block-f1455f38-b94b-4501-86df-6d6c96727d02
      type          block
      osd id        240
      cluster fsid  xxx
      cluster name  ceph
      osd fsid      f1455f38-b94b-4501-86df-6d6c96727d02
      block uuid    N4fpLc-O3y0-hvfN-oRpD-y6kH-znfl-4EaVLi
      block device  /dev/ceph-/osd-block-f1455f38-b94b-4501-86df-6d6c96727d02

How does one tear that down so it can be zapped?

Best Regards,
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] [luminous 12.2.2] Cluster write performance degradation problem (possibly tcmalloc related)
My testing cluster is an all-hdd cluster with 12 osds (10T hdd each). I monitor luminous 12.2.2 write performance and osd memory usage with grafana graphs for statistics logging.

The test is done by using fio on a mounted rbd with the following fio parameters:

fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest

I found there is a noticeable performance degradation over time.

Graph of write throughput and iops: https://pasteboard.co/GZflpTO.png
Graph of osd memory usage (2 of 12 osds, the pattern is identical): https://pasteboard.co/GZfmfzo.png
Graph of osd perf: https://pasteboard.co/GZfmZNx.png

There are some interesting findings from the graphs. After 18:00 the write throughput suddenly dropped and the osd latency increased. TCMalloc started reclaiming the page heap freelist much more frequently. All of this happened very fast and every osd showed the identical pattern.

I have done this kind of test several times with different bluestore cache settings and found that with more cache the performance degradation happens later. I don't know if this is a bug or whether I can fix it by modifying some of the config of my cluster. Any advice or direction to look into is appreciated.

Thanks

2017-12-21
lin.yunfan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
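One way to look closer at the tcmalloc angle - not from the original mail, and the osd id is an example - is to compare heap statistics before and after forcing a release, and see whether latency follows:

# Show tcmalloc heap statistics for one OSD.
ceph tell osd.3 heap stats

# Ask tcmalloc to return free pages to the OS, then re-check stats/latency.
ceph tell osd.3 heap release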
Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap
Quoting Dan van der Ster (d...@vanderster.com):
> Hi,
>
> For someone who is not an lvm expert, does anyone have a recipe for
> destroying a ceph-volume lvm osd?
> (I have a failed disk which I want to deactivate / wipe before
> physically removing from the host, and the tooling for this doesn't
> exist yet http://tracker.ceph.com/issues/22287)
>
> > ceph-volume lvm zap /dev/sdu   # does not work
> Zapping: /dev/sdu
> Running command: sudo wipefs --all /dev/sdu
>  stderr: wipefs: error: /dev/sdu: probing initialization failed:
> Device or resource busy
>
> How does one tear that down so it can be zapped?

wipefs -fa /dev/the/device
dd if=/dev/zero of=/dev/the/device bs=1M count=1

^^ I have successfully re-created ceph-volume lvm bluestore OSDs with the above method (assuming you have done the ceph osd purge osd.$ID part already and brought down the OSD process itself).

Gr. Stefan

--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Not timing out watcher
On Thu, Dec 21, 2017 at 3:04 PM, Serguei Bezverkhi (sbezverk) wrote: > Hi Ilya, > > Here you go, no k8s services running this time: > > sbezverk@kube-4:~$ sudo rbd map raw-volume --pool kubernetes --id admin -m > 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== > /dev/rbd0 > sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m > 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== > Watchers: > watcher=192.168.80.235:0/3465920438 client.65327 cookie=1 > sbezverk@kube-4:~$ sudo rbd info raw-volume --pool kubernetes --id admin -m > 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== > rbd image 'raw-volume': > size 10240 MB in 2560 objects > order 22 (4096 kB objects) > block_name_prefix: rb.0.fafa.625558ec > format: 1 > sbezverk@kube-4:~$ sudo reboot > > sbezverk@kube-4:~$ sudo rbd status raw-volume --pool kubernetes --id admin -m > 192.168.80.233 --key=AQCeHO1ZILPPDRAA7zw3d76bplkvTwzoosybvA== > Watchers: none > > It seems when the image was mapped manually, this issue is not reproducible. > > K8s does not just map the image, it also creates loopback device which is > linked to /dev/rbd0. Maybe this somehow reminds rbd client to re-activate a > watcher on reboot. I will try to mimic exact steps k8s follows manually to > see what exactly forces an active watcher after reboot. To confirm, I'd also make sure that nothing runs "rbd unmap" on all images (or some subset of images) during shutdown in the manual case. Either do a hard reboot or rename /usr/bin/rbd to something else before running reboot. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDS behind on trimming
Quoting Dan van der Ster (d...@vanderster.com): > Hi, > > We've used double the defaults for around 6 months now and haven't had any > behind on trimming errors in that time. > >mds log max segments = 60 >mds log max expiring = 40 > > Should be simple to try. Yup, and works like a charm: ceph tell mds.* injectargs '--mds_log_max_segments=60' ceph tell mds.* injectargs '--mds_log_max_expiring=40' Although you see this logged: (not observed, change may require restart), these settings do get applied almost instantly ... and the trim lag was gone within 30 seconds after that. Thanks, Stefan -- | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap
On Thu, Dec 21, 2017 at 3:59 PM, Stefan Kooman wrote: > Quoting Dan van der Ster (d...@vanderster.com): >> Hi, >> >> For someone who is not an lvm expert, does anyone have a recipe for >> destroying a ceph-volume lvm osd? >> (I have a failed disk which I want to deactivate / wipe before >> physically removing from the host, and the tooling for this doesn't >> exist yet http://tracker.ceph.com/issues/22287) >> >> > ceph-volume lvm zap /dev/sdu # does not work >> Zapping: /dev/sdu >> Running command: sudo wipefs --all /dev/sdu >> stderr: wipefs: error: /dev/sdu: probing initialization failed: >> Device or resource busy >> >> How does one tear that down so it can be zapped? > > wipefs -fa /dev/the/device > dd if=/dev/zero of=/dev/the/device bs=1M count=1 Thanks Stefan. But isn't there also some vgremove or lvremove magic that needs to bring down these /dev/dm-... devices I have? -- dan > > ^^ I have succesfully re-created ceph-volume lvm bluestore OSDs with > above method (assuming you have done the ceph osd purge osd.$ID part > already and brought down the OSD process itself). > > Gr. Stefan > > -- > | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
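A hedged sketch of the LVM teardown that usually precedes the wipefs/dd step - not an authoritative answer from the thread - assuming the volume group on /dev/sdu is dedicated to this one OSD; the vg/lv names come from the lvs output and are placeholders here:

# Find the LV/VG that ceph-volume created on the failed disk.
lvs -o lv_name,vg_name,devices | grep sdu

# Remove the logical volume, its volume group and the PV label.
lvremove -f <vg_name>/<lv_name>
vgremove <vg_name>
pvremove /dev/sdu

# After that the raw device can be zapped / reused.
ceph-volume lvm zap /dev/sdu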
Re: [ceph-users] cephfs mds millions of caps
Hello Zheng, Thanks for opening that issue on the bug tracker. Also thanks for that tip. Caps dropped from 1.6M to 600k for that client. Is it safe to run in a cronjob? Let's say, once or twice a day during production? Thanks! Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ* On Thu, Dec 21, 2017 at 11:55 AM, Yan, Zheng wrote: > On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima > wrote: > > I have upgraded the kernel on a client node (one that has close-to-zero > > traffic) used for tests. > > > >{ > > "reconnecting" : false, > > "id" : 1620266, > > "num_leases" : 0, > > "inst" : "client.1620266 10.0.0.111:0/3921220890", > > "state" : "open", > > "completed_requests" : 0, > > "num_caps" : 1402490, > > "client_metadata" : { > > "kernel_version" : "4.4.0-104-generic", > > "hostname" : "suppressed", > > "entity_id" : "admin" > > }, > > "replay_requests" : 0 > >}, > > > > still 1.4M caps used. > > > > is upgrading the client kernel enough ? > > > > See http://tracker.ceph.com/issues/22446. We haven't implemented that > feature. "echo 3 >/proc/sys/vm/drop_caches" should drop most caps. > > > > > > > Regards, > > > > Webert Lima > > DevOps Engineer at MAV Tecnologia > > Belo Horizonte - Brasil > > IRC NICK - WebertRLZ > > > > On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima > > wrote: > >> > >> So, > >> > >> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng wrote: > >>> > >>> > >>> 300k are ready quite a lot. opening them requires long time. does you > >>> mail server really open so many files? > >> > >> > >> Yes, probably. It's a commercial solution. A few thousand domains, > dozens > >> of thousands of users and god knows how any mailboxes. > >> From the daemonperf you can see the write workload is high, so yes, too > >> much files opening (dovecot mdbox stores multiple e-mails per file, > split > >> into many files). > >> > >>> I checked 4.4 kernel, it includes the code that trim cache when mds > >>> recovers. > >> > >> > >> Ok, all nodes are running 4.4.0-75-generic. The fix might have been > >> included in a newer version. > >> I'll upgrade it asap. > >> > >> > >> Regards, > >> > >> Webert Lima > >> DevOps Engineer at MAV Tecnologia > >> Belo Horizonte - Brasil > >> IRC NICK - WebertRLZ > > > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?
If i were in your shoes, i would grab a failed disc which DOES NOT contain the data you need, an oscilloscope, and start experimenting on it ... try to find debug testpoints on the panel etc. At the same time i would contact the factory or a data recovery company with a good reputation, and experience with ssds. If i would have to bet, i would bet on a mfg defect because i cannot see any other mention of your problem on the net. On December 21, 2017 2:38:52 PM GMT+01:00, David Herselman wrote: >Hi, > >I assume this can only be a physical manufacturing flaw or a firmware >bug? Do Intel publish advisories on recalled equipment? Should others >be concerned about using Intel DC S4600 SSD drives? Could this be an >electrical issue on the Hot Swap Backplane or BMC firmware issue? >Either way, all pure Intel... > >The hole is only 1.3 GB (4 MB x 339 objects) but perfectly striped >through images, file systems are subsequently severely damaged. > >Is it possible to get Ceph to read in partial data shards? It would >provide between 25-75% more yield... > > >Is there anything wrong with how we've proceeded thus far? Would be >nice to reference examples of using ceph-objectstore-tool but >documentation is virtually non-existent. > >We used another SSD drive to simulate bringing all the SSDs back >online. We carved up the drive to provide equal partitions to >essentially simulate the original SSDs: > # Partition a drive to provide 12 x 150GB partitions, eg: >sdd 8:48 0 1.8T 0 disk >|-sdd18:49 0 140G 0 part >|-sdd28:50 0 140G 0 part >|-sdd38:51 0 140G 0 part >|-sdd48:52 0 140G 0 part >|-sdd58:53 0 140G 0 part >|-sdd68:54 0 140G 0 part >|-sdd78:55 0 140G 0 part >|-sdd88:56 0 140G 0 part >|-sdd98:57 0 140G 0 part >|-sdd10 8:58 0 140G 0 part >|-sdd11 8:59 0 140G 0 part >+-sdd12 8:60 0 140G 0 part > > > Pre-requisites: >ceph osd set noout; >apt-get install uuid-runtime; > > > for ID in `seq 24 35`; do >UUID=`uuidgen`; >OSD_SECRET=`ceph-authtool --gen-print-key`; >DEVICE='/dev/sdd'$[$ID-23];# 24-23 = /dev/sdd1, 35-23 = /dev/sdd12 >echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | ceph osd new $UUID $ID -i >- -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring; >mkdir /var/lib/ceph/osd/ceph-$ID; >mkfs.xfs $DEVICE; >mount $DEVICE /var/lib/ceph/osd/ceph-$ID; >ceph-authtool --create-keyring /var/lib/ceph/osd/ceph-$ID/keyring >--name osd.$ID --add-key $OSD_SECRET; >ceph-osd -i $ID --mkfs --osd-uuid $UUID; >chown -R ceph:ceph /var/lib/ceph/osd/ceph-$ID; >systemctl enable ceph-osd@$ID; >systemctl start ceph-osd@$ID; > done > > >Once up we imported previous exports of empty head files in to 'real' >OSDs: > kvm5b: >systemctl stop ceph-osd@8; >ceph-objectstore-tool --op import --pgid 7.4s0 --data-path >/var/lib/ceph/osd/ceph-8 --journal-path >/var/lib/ceph/osd/ceph-8/journal --file >/var/lib/vz/template/ssd_recovery/osd8_7.4s0.export; >chown ceph:ceph -R /var/lib/ceph/osd/ceph-8; >systemctl start ceph-osd@8; > kvm5f: >systemctl stop ceph-osd@23; >ceph-objectstore-tool --op import --pgid 7.fs0 --data-path >/var/lib/ceph/osd/ceph-23 --journal-path >/var/lib/ceph/osd/ceph-23/journal --file >/var/lib/vz/template/ssd_recovery/osd23_7.fs0.export; >chown ceph:ceph -R /var/lib/ceph/osd/ceph-23; >systemctl start ceph-osd@23; > > >Bulk import previously exported objects: >cd /var/lib/vz/template/ssd_recovery; >for FILE in `ls -1A osd*_*.export | grep -Pv '^osd(8|23)_'`; do > OSD=`echo $FILE | perl -pe 's/^osd(\d+).*/\1/'`; > PGID=`echo $FILE | perl -pe 's/^osd\d+_(.*?).export/\1/g'`; >echo -e "systemctl stop 
ceph-osd@$OSD\t ceph-objectstore-tool --op >import --pgid $PGID --data-path /var/lib/ceph/osd/ceph-$OSD >--journal-path /var/lib/ceph/osd/ceph-$OSD/journal --file >/var/lib/vz/template/ssd_recovery/osd"$OSD"_$PGID.export"; >done | sort > >Sample output (this will wrap): >systemctl stop ceph-osd@27 ceph-objectstore-tool --op import >--pgid 7.4s3 --data-path /var/lib/ceph/osd/ceph-27 --journal-path >/var/lib/ceph/osd/ceph-27/journal --file >/var/lib/vz/template/ssd_recovery/osd27_7.4s3.export >systemctl stop ceph-osd@27 ceph-objectstore-tool --op import >--pgid 7.fs5 --data-path /var/lib/ceph/osd/ceph-27 --journal-path >/var/lib/ceph/osd/ceph-27/journal --file >/var/lib/vz/template/ssd_recovery/osd27_7.fs5.export >systemctl stop ceph-osd@30 ceph-objectstore-tool --op import >--pgid 7.fs4 --data-path /var/lib/ceph/osd/ceph-30 --journal-path >/var/lib/ceph/osd/ceph-30/journal --file >/var/lib/vz/template/ssd_recovery/osd30_7.fs4.export >systemctl stop ceph-osd@31 ceph-objectstore-tool --op import >--pgid 7.4s2 --data-path /var/lib/ceph/osd/ceph-31 --journal-path >/var/lib/ceph/osd/ceph-31/journal --file >/var/lib/vz/template/ssd_reco
Re: [ceph-users] Cache tier unexpected behavior: promote on lock
Thanks for the answers! As it leads to a decrease of caching efficiency, i've opened an issue: http://tracker.ceph.com/issues/22528 15.12.2017, 23:03, "Gregory Farnum" : > On Thu, Dec 14, 2017 at 9:11 AM, Захаров Алексей > wrote: >> Hi, Gregory, >> Thank you for your answer! >> >> Is there a way to not promote on "locking", when not using EC pools? >> Is it possible to make this configurable? >> >> We don't use EC pool. So, for us this meachanism is overhead. It only adds >> more load on both pools and network. > > Unfortunately I don't think there's an easy way to avoid it that > exists right now. The caching is generally not set up well for > handling these kinds of things, but it's possible the logic to proxy > class operations onto replicated pools might not be *too* > objectionable > -Greg > >> 14.12.2017, 01:16, "Gregory Farnum" : >> >> Voluntary “locking” in RADOS is an “object class” operation. These are not >> part of the core API and cannot run on EC pools, so any operation using them >> will cause an immediate promotion. >> On Wed, Dec 13, 2017 at 4:02 AM Захаров Алексей >> wrote: >> >> Hello, >> >> I've found that when client gets lock on object then ceph ignores any >> promotion settings and promotes this object immedeatly. >> >> Is it a bug or a feature? >> Is it configurable? >> >> Hope for any help! >> >> Ceph version: 10.2.10 and 12.2.2 >> We use libradosstriper-based clients. >> >> Cache pool settings: >> size: 3 >> min_size: 2 >> crash_replay_interval: 0 >> pg_num: 2048 >> pgp_num: 2048 >> crush_ruleset: 0 >> hashpspool: true >> nodelete: false >> nopgchange: false >> nosizechange: false >> write_fadvise_dontneed: false >> noscrub: true >> nodeep-scrub: false >> hit_set_type: bloom >> hit_set_period: 60 >> hit_set_count: 30 >> hit_set_fpp: 0.05 >> use_gmt_hitset: 1 >> auid: 0 >> target_max_objects: 0 >> target_max_bytes: 18819770744832 >> cache_target_dirty_ratio: 0.4 >> cache_target_dirty_high_ratio: 0.6 >> cache_target_full_ratio: 0.8 >> cache_min_flush_age: 60 >> cache_min_evict_age: 180 >> min_read_recency_for_promote: 15 >> min_write_recency_for_promote: 15 >> fast_read: 0 >> hit_set_grade_decay_rate: 50 >> hit_set_search_last_n: 30 >> >> To get lock via cli (to test behavior) we use: >> # rados -p poolname lock get --lock-tag weird_ceph_locks --lock-cookie >> `uuid` objectname striper.lock >> Right after that object could be found in caching pool. >> >> -- >> Regards, >> Aleksei Zakharov >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> -- >> Regards, >> Aleksei Zakharov -- Regards, Aleksei Zakharov ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
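To illustrate the point, here is a minimal plain-librados sketch (not libradosstriper) of the same test as the CLI example quoted above; the pool and object names are placeholders. Taking the advisory lock is an object-class operation, which is why the object is promoted into the cache tier even though no data is read or written:

    #!/usr/bin/env python
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('poolname')          # base pool fronted by the cache tier

    # advisory lock = object-class operation, triggers immediate promotion
    ioctx.lock_exclusive('objectname', 'weird_ceph_locks', 'cookie-1', desc='promotion test')
    # ... at this point the object shows up in the caching pool ...
    ioctx.unlock('objectname', 'weird_ceph_locks', 'cookie-1')

    ioctx.close()
    cluster.shutdown()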
Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap
Quoting Dan van der Ster (d...@vanderster.com): > Thanks Stefan. But isn't there also some vgremove or lvremove magic > that needs to bring down these /dev/dm-... devices I have? Ah, you want to clean up properly before that. Sure: lvremove -f <volume_group>/<logical_volume>; vgremove <volume_group>; pvremove /dev/ceph-device (should wipe labels) So ideally there should be a ceph-volume lvm destroy / zap option that takes care of this: 1) Properly remove LV/VG/PV as shown above 2) wipefs to get rid of LVM signatures 3) dd zeroes to get rid of signatures that might still be there Gr. Stefan -- | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
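Putting the whole teardown together, a minimal sketch (the OSD id and device name are assumptions, and the OSD is assumed to have already been purged from the cluster):

    OSD_ID=12                      # assumed OSD id
    DEV=/dev/sdu                   # assumed data device

    systemctl stop ceph-osd@"${OSD_ID}"
    umount /var/lib/ceph/osd/ceph-"${OSD_ID}" 2>/dev/null || true

    # find the volume group ceph-volume created on this device, then remove LV/VG/PV
    VG=$(pvs --noheadings -o vg_name "${DEV}" | awk '{print $1}')
    lvremove -f "${VG}"            # removes every LV in the VG (drops the /dev/dm-* nodes)
    vgremove "${VG}"
    pvremove "${DEV}"              # wipes the LVM label

    wipefs -fa "${DEV}"            # clear any remaining signatures
    dd if=/dev/zero of="${DEV}" bs=1M count=1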
Re: [ceph-users] How to use vfs_ceph
At a glance it looks OK, I've not tested this in a while. Silly question but does your Samba package definitely ship with the Ceph vfs? Caught me out in the past. Have you tried exporting a sub dir? Maybe 777 it, although that shouldn't make a difference. On 21 Dec 2017 13:16, "Felix Stolte" wrote: > Hello folks, > > is anybody using the vfs_ceph module for exporting cephfs as samba shares? > We are running ceph jewel with cephx enabled. Manpage of vfs_ceph only > references the option ceph:config_file. How do I need to configure my share > (or maybe ceph.conf)? > > log.smbd: '/' does not exist or permission denied when connecting to > [vfs] Error was Transport endpoint is not connected > > I have a user ctdb with keyring file /etc/ceph/ceph.client.ctdb.keyring > with permissions: > > caps: [mds] allow rw > caps: [mon] allow r > caps: [osd] allow rwx pool=cephfs_metadata, allow rwx pool=cephfs_data > > I can mount cephfs with ceph-fuse using the id ctdb and its keyfile. > > My share definition is: > > [vfs] > comment = vfs > path = / > read only = No > vfs objects = acl_xattr ceph > ceph:user_id = ctdb > ceph:config_file = /etc/ceph/ceph.conf > > > Any advice is appreciated. > > Regards Felix > > -- > Forschungszentrum Jülich GmbH > 52425 Jülich > Sitz der Gesellschaft: Jülich > Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498 > Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher > Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), > Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt, > Prof. Dr. Sebastian M. Schmidt > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
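For the sub-directory test suggested above, a minimal variation of the share might look like this (the share name and directory are assumptions; the directory must already exist inside CephFS, since with vfs_ceph the path is interpreted inside the Ceph filesystem rather than on the local one):

    [vfs-test]
        path = /shares
        read only = No
        vfs objects = acl_xattr ceph
        ceph:user_id = ctdb
        ceph:config_file = /etc/ceph/ceph.conf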
[ceph-users] Permissions for mon status command
Hi, I'm writing a small Python script using librados to display cluster health, the same info that ceph health detail shows. It works fine, but I'd rather not use the admin keyring for something like this. However, I have no clue what kind of caps I should or can set; I was kind of hoping that mon 'allow r' would do it, but that didn't work, and I'm unable to find any documentation that covers this. Any pointers would be appreciated. Thanks, Andreas ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Permissions for mon status command
Hi Andreas, I believe is not a problem of caps, I have tested using the same cap on mon and I have the same problem, still looking into. [client.python] key = AQDORjxaYHG9JxAA0qiZC0Rmf3qulsO3P/bZgw== caps mon = "allow r" # ceph -n client.python --keyring ceph.client.python.keyring health HEALTH_OK but if I run the python script that contains a connect command to the cluster. # python health.py Traceback (most recent call last): File "health.py", line 13, in r.connect() File "/usr/lib/python2.7/dist-packages/rados.py", line 429, in connect raise make_ex(ret, "error connecting to the cluster") rados.Error: error connecting to the cluster: errno EINVAL ** PYTHON SCRIPT #!/usr/bin/env python import rados import json def get_cluster_health(r): cmd = {"prefix":"status", "format":"json"} ret, buf, errs = r.mon_command(json.dumps(cmd), b'', timeout=5) result = json.loads(buf) return result['health']['overall_status'] r = rados.Rados(conffile = '/etc/ceph/ceph.conf', conf = dict (keyring = '/etc/ceph/ceph.client.python.keyring')) r.connect() print("{0}".format(get_cluster_health(r))) if r is not None: r.shutdown() * On Thu, Dec 21, 2017 at 4:15 PM, Andreas Calminder < andreas.calmin...@klarna.com> wrote: > Hi, > I'm writing a small python script using librados to display cluster > health, same info as ceph health detail show, it works fine but I rather > not use the admin keyring for something like this. However I have no clue > what kind of caps I should or can set, I was kind of hoping that mon allow > r would do it, but that didn't work, and I'm unable to find any > documentation that covers this. Any pointers would be appreciated. > > Thanks, > Andreas > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- ATTE. Alvaro Soto Escobar -- Great people talk about ideas, average people talk about things, small people talk ... about other people. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph not reclaiming space or overhead?
I will start by saying I am very new to Ceph and am trying to teach myself the ins and outs. While doing this I have been creating and destroying pools as I experiment on some test hardware. Something I noticed is that when a pool is deleted, the space is not always freed 100%. This is true even after days of idle time. Right now, with 7 OSDs and a few empty pools, I have 70GBs of raw space used. Now, I am not sure if this is normal, but I did migrate my OSDs to bluestore and have been adding OSDs, so maybe some space is just overhead for each OSD? I lost one of my disks and the usage dropped to 70GBs. Though when I had that failure I got some REALLY odd results from ceph -s. Note the number of data objects (242 total) vs. the number of degraded objects (101 of 726):
--
root@MediaServer:~# ceph -s
  cluster:
    id:     26c81563-ee27-4967-a950-afffb795f29e
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 osds down
            Degraded data redundancy: 101/726 objects degraded (13.912%), 92 pgs unclean, 92 pgs degraded, 92 pgs undersized

  services:
    mon: 2 daemons, quorum TheMonolith,MediaServer
    mgr: MediaServer.domain(active), standbys: TheMonolith.domain
    mds: MediaStoreFS-1/1/1 up {0=MediaMDS=up:reconnect(laggy or crashed)}
    osd: 8 osds: 7 up, 8 in
    rgw: 2 daemons active

  data:
    pools:   8 pools, 176 pgs
    objects: 242 objects, 3568 bytes
    usage:   80463 MB used, 10633 GB / 10712 GB avail
    pgs:     101/726 objects degraded (13.912%)
             92 active+undersized+degraded
             84 active+clean
--
After reweighting the failed OSD out:
--
root@MediaServer:/var/log/ceph# ceph -s
  cluster:
    id:     26c81563-ee27-4967-a950-afffb795f29e
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available

  services:
    mon: 2 daemons, quorum TheMonolith,MediaServer
    mgr: MediaServer.domain(active), standbys: TheMonolith.domain
    mds: MediaStoreFS-1/1/1 up {0=MediaMDS=up:reconnect(laggy or crashed)}
    osd: 8 osds: 7 up, 7 in
    rgw: 2 daemons active

  data:
    pools:   8 pools, 176 pgs
    objects: 242 objects, 3568 bytes
    usage:   71189 MB used, 8779 GB / 8849 GB avail
    pgs:     176 active+clean
--
My pools:
--
root@MediaServer:/var/log/ceph# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    8849G     8779G     71189M       0.79
POOLS:
    NAME                          ID     USED     %USED     MAX AVAIL     OBJECTS
    .rgw.root                     6      1322     0         3316G         3
    default.rgw.control           7      0        0         3316G         11
    default.rgw.meta              8      0        0         3316G         0
    default.rgw.log               9      0        0         3316G         207
    MediaStorePool                19     0        0         5970G         0
    MediaStorePool-Meta           20     2246     0         3316G         21
    MediaStorePool-WriteCache     21     0        0         3316G         0
    rbd                           22     0        0         4975G         0
--
Am I looking at some sort of a file system leak, or is this normal? Also, before I deleted (or rather broke) my last pool, I marked OSDs in and out and tracked the space. The data pool was erasure-coded with 4 data and 1 parity chunks, and all data had been cleared from the cache pool:

    Obj     Used    Total Size    Data    Expected Usage    Difference    Notes
            639     10712         417     521.25            -117.75       8 OSDs
    337k    636     10246         417     521.25            -114.75       7 OSDs (complete removal, osd 0, 500GB)
    337k    629     10712         417     521.25            -107.75       8 OSDs (Wiped and re-added as osd.51002)
    337k    631     9780          417     521.25            -109.75       7 OSDs (out, crush removed, osd 5, 1TB)
    337k    649     10712         417     521.25            -127.75       8 OSDs (crush add, osd in)
    337k    643     9780          417     521.25            -121.75       7 OSDs (out, osd 5, 1TB)
    337k    625     9780          417     521.25            -103.75       7 OSDs (crush reweight 0, osd 5, 1TB)

There was enough difference between the in and out of OSDs that I kinda think something is up. Even with the 80GBs removed from the difference when I have no data at all, that still left me with upwards of 40GBs of unaccounted-for usage...
Debian 9 \ Kernel: 4.4.0-104-generic ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable) Thanks for your input! It's appreciated! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Permissions for mon status command
You aren't specifying your cluster user, only the keyring. So the connection command is still trying to use the default client.admin instead of client.python. Here's the connect line I use in my scripts. rados.Rados(conffile='/etc/ceph/ceph.conf', conf=dict(keyring = ' /etc/ceph/ceph.client.python.keyring'), name='client.python') On Thu, Dec 21, 2017 at 6:55 PM Alvaro Soto wrote: > Hi Andreas, > I believe is not a problem of caps, I have tested using the same cap on > mon and I have the same problem, still looking into. > > [client.python] > > key = AQDORjxaYHG9JxAA0qiZC0Rmf3qulsO3P/bZgw== > > caps mon = "allow r" > > > > # ceph -n client.python --keyring ceph.client.python.keyring health > > HEALTH_OK > > > but if I run the python script that contains a connect command to the > cluster. > > > # python health.py > > Traceback (most recent call last): > > File "health.py", line 13, in > > r.connect() > > File "/usr/lib/python2.7/dist-packages/rados.py", line 429, in connect > > raise make_ex(ret, "error connecting to the cluster") > > rados.Error: error connecting to the cluster: errno EINVAL > > > ** PYTHON SCRIPT > > #!/usr/bin/env python > > > import rados > > import json > > > def get_cluster_health(r): > > cmd = {"prefix":"status", "format":"json"} > > ret, buf, errs = r.mon_command(json.dumps(cmd), b'', timeout=5) > > result = json.loads(buf) > > return result['health']['overall_status'] > > > r = rados.Rados(conffile = '/etc/ceph/ceph.conf', conf = dict (keyring = > '/etc/ceph/ceph.client.python.keyring')) > > r.connect() > > > print("{0}".format(get_cluster_health(r))) > > > if r is not None: > > r.shutdown() > > * > > > > On Thu, Dec 21, 2017 at 4:15 PM, Andreas Calminder < > andreas.calmin...@klarna.com> wrote: > >> Hi, >> I'm writing a small python script using librados to display cluster >> health, same info as ceph health detail show, it works fine but I rather >> not use the admin keyring for something like this. However I have no clue >> what kind of caps I should or can set, I was kind of hoping that mon allow >> r would do it, but that didn't work, and I'm unable to find any >> documentation that covers this. Any pointers would be appreciated. >> >> Thanks, >> Andreas >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> > > > -- > > ATTE. Alvaro Soto Escobar > > -- > Great people talk about ideas, > average people talk about things, > small people talk ... about other people. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
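Putting the two together, a corrected version of the script above might look like this (keyring path and client name as used in this thread; name= is the key difference, since without it librados authenticates as client.admin, which is the problem Dave points out):

    #!/usr/bin/env python
    import json
    import rados

    def get_cluster_health(cluster):
        cmd = {"prefix": "status", "format": "json"}
        ret, buf, errs = cluster.mon_command(json.dumps(cmd), b'', timeout=5)
        return json.loads(buf)['health']['overall_status']

    # name= tells librados to authenticate as client.python instead of client.admin
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          conf=dict(keyring='/etc/ceph/ceph.client.python.keyring'),
                          name='client.python')
    cluster.connect()
    print("{0}".format(get_cluster_health(cluster)))
    cluster.shutdown()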
[ceph-users] Ceph as an Alternative to HDFS for Hadoop
Hi List I'm researching the possibility of using ceph as a drop-in replacement for hdfs for applications using spark and hadoop. I note that the jewel documentation states that it requires hadoop 1.1.x, which seems a little dated and would be of concern for people: http://docs.ceph.com/docs/jewel/cephfs/hadoop/ What about the 2.x series? Also, are there any benchmark comparisons between hdfs and ceph, specifically around performance of apps benefiting from data locality? Many thanks in advance for any feedback! Regards, Traiano ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph as an Alternative to HDFS for Hadoop
>Also, are there any benchmark comparisons between hdfs and ceph specifically >around performance of apps benefiting from data locality ? There will be no data locality in ceph, because all the data is accessed through network. On Fri, Dec 22, 2017 at 4:52 AM, Traiano Welcome wrote: > Hi List > > I'm researching the possibility os using ceph as a drop in replacement for > hdfs for applications using spark and hadoop. > > I note that the jewel documentation states that it requires hadoop 1.1.x, > which seems a little dated and would be of concern for peopel: > > http://docs.ceph.com/docs/jewel/cephfs/hadoop/ > > What about the 2.x series? > > Also, are there any benchmark comparisons between hdfs and ceph specifically > around performance of apps benefiting from data locality ? > > Many thanks in advance for any feedback! > > Regards, > Traiano > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
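For completeness, the jewel documentation linked above boils down to pointing Hadoop at CephFS through core-site.xml, roughly along these lines (property names as given in those docs; the monitor address is a placeholder, and as noted above data locality still does not apply since everything goes over the network):

    <property>
      <name>fs.default.name</name>
      <value>ceph://mon-host:6789/</value>
    </property>
    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>
    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>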
Re: [ceph-users] cephfs mds millions of caps
On Thu, Dec 21, 2017 at 11:46 PM, Webert de Souza Lima wrote: > Hello Zheng, > > Thanks for opening that issue on the bug tracker. > > Also thanks for that tip. Caps dropped from 1.6M to 600k for that client. idle client shouldn't hold so many caps. > Is it safe to run in a cronjob? Let's say, once or twice a day during > production? > yes. For now, it's better to run "echo 3 >/proc/sys/vm/drop_caches" after cronjob finishes > Thanks! > > > Regards, > > Webert Lima > DevOps Engineer at MAV Tecnologia > Belo Horizonte - Brasil > IRC NICK - WebertRLZ > > On Thu, Dec 21, 2017 at 11:55 AM, Yan, Zheng wrote: >> >> On Thu, Dec 21, 2017 at 7:33 PM, Webert de Souza Lima >> wrote: >> > I have upgraded the kernel on a client node (one that has close-to-zero >> > traffic) used for tests. >> > >> >{ >> > "reconnecting" : false, >> > "id" : 1620266, >> > "num_leases" : 0, >> > "inst" : "client.1620266 10.0.0.111:0/3921220890", >> > "state" : "open", >> > "completed_requests" : 0, >> > "num_caps" : 1402490, >> > "client_metadata" : { >> > "kernel_version" : "4.4.0-104-generic", >> > "hostname" : "suppressed", >> > "entity_id" : "admin" >> > }, >> > "replay_requests" : 0 >> >}, >> > >> > still 1.4M caps used. >> > >> > is upgrading the client kernel enough ? >> > >> >> See http://tracker.ceph.com/issues/22446. We haven't implemented that >> feature. "echo 3 >/proc/sys/vm/drop_caches" should drop most caps. >> >> > >> > >> > Regards, >> > >> > Webert Lima >> > DevOps Engineer at MAV Tecnologia >> > Belo Horizonte - Brasil >> > IRC NICK - WebertRLZ >> > >> > On Fri, Dec 15, 2017 at 11:16 AM, Webert de Souza Lima >> > wrote: >> >> >> >> So, >> >> >> >> On Fri, Dec 15, 2017 at 10:58 AM, Yan, Zheng wrote: >> >>> >> >>> >> >>> 300k are ready quite a lot. opening them requires long time. does you >> >>> mail server really open so many files? >> >> >> >> >> >> Yes, probably. It's a commercial solution. A few thousand domains, >> >> dozens >> >> of thousands of users and god knows how any mailboxes. >> >> From the daemonperf you can see the write workload is high, so yes, too >> >> much files opening (dovecot mdbox stores multiple e-mails per file, >> >> split >> >> into many files). >> >> >> >>> I checked 4.4 kernel, it includes the code that trim cache when mds >> >>> recovers. >> >> >> >> >> >> Ok, all nodes are running 4.4.0-75-generic. The fix might have been >> >> included in a newer version. >> >> I'll upgrade it asap. >> >> >> >> >> >> Regards, >> >> >> >> Webert Lima >> >> DevOps Engineer at MAV Tecnologia >> >> Belo Horizonte - Brasil >> >> IRC NICK - WebertRLZ >> > >> > >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
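A sketch of what running that drop via cron could look like on the client (the file name and schedule are assumptions), syncing first so dirty data is flushed before the caches, and with them the unused caps, are dropped:

    # /etc/cron.d/cephfs-drop-caches (hypothetical)
    30 5,17 * * * root /bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches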
Re: [ceph-users] Cephfs NFS failover
Thanks all on this one The ctdb worked amazing, just need to tweak the settings on it. So the failover happens a tad faster. But all in all it works. Thanks for all your help On 21 Dec 2017 9:08 am, "Robert Sander" wrote: > On 20.12.2017 18:45, nigel davies wrote: > > Hay all > > > > Can any one advise on how it can do this. > > You can use ctdb for that and run an active/active NFS cluster: > > https://wiki.samba.org/index.php/Setting_up_CTDB_for_Clustered_NFS > > The cluster filesystem can be a CephFS. This also works with Samba, i.e. > you get an unlimited fileserver. > > Regards > -- > Robert Sander > Heinlein Support GmbH > Schwedter Str. 8/9b, 10119 Berlin > > http://www.heinlein-support.de > > Tel: 030 / 405051-43 > Fax: 030 / 405051-19 > > Zwangsangaben lt. §35a GmbHG: > HRB 93818 B / Amtsgericht Berlin-Charlottenburg, > Geschäftsführer: Peer Heinlein -- Sitz: Berlin > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs limis
Right, ok, I'll take a look. Can you do that after the pool / cephfs has been set up? On 21 Dec 2017 12:25 pm, "Yan, Zheng" wrote: > On Thu, Dec 21, 2017 at 6:18 PM, nigel davies wrote: > > Hay all is it possable to set cephfs to have an sapce limit > > eg i like to set my cephfs to have an limit of 20TB > > and my s3 storage to have 4TB for example > > > > you can set pool quota on cephfs data pools > > > thanks > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
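Pool quotas can be applied to an existing pool at any time, so this also works after CephFS has already been set up. A sketch with placeholder pool names (your CephFS data pool and RGW bucket-data pool may be called something else):

    ceph osd pool set-quota cephfs_data max_bytes $((20 * 1024 ** 4))               # 20 TiB for CephFS
    ceph osd pool set-quota default.rgw.buckets.data max_bytes $((4 * 1024 ** 4))   # 4 TiB for S3/RGW
    ceph osd pool get-quota cephfs_data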
[ceph-users] MDS locations
Hay all Is it ok to set up the MDS on the same servers that host the OSDs, or should they be on different servers? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com