[ceph-users] Re: ceph fs crashes on simple fio test
Dear all,

I found a partial solution to the problem and I also repeated a bit of testing, see below.

# Client-sided solution, works for single-client IO

The hard solution is to mount cephfs with the option "sync". This will translate any IO to direct IO and successfully throttle clients no matter how they perform IO. This will even work in multi-client set-ups.

A somewhat less restrictive option is to set low values for vm.dirty_[background_]bytes to allow some buffered IO for small bursts. I tried with

vm.dirty_background_bytes = 524288
vm.dirty_bytes = 1048576

and less restrictive

vm.dirty_background_bytes = 2097152
vm.dirty_bytes = 67108864

(without the sync mount option) and it seems to have the desired effect. It is possible to obtain good throughput for large IO sizes while limiting small-IO-size IOPS to a healthy level. Of course, this does not address destructive multi-client IO patterns, which must be addressed on the server side.

# Test observations

Today I repeated a shorter test to avoid crashing the cluster badly. We are in production and I don't have a test cluster. Therefore, if anyone could try this on a test cluster and check whether the observations can be confirmed, that would be great. Here is a one-line command:

fio -name=rand-write -directory=/mnt/cephfs/home/frans/fio -filename_format=tmp/fio-\$jobname-\$jobnum-\$filenum -rw=randwrite -bs=4K -numjobs=4 -time_based=1 -runtime=5 -filesize=100G -ioengine=sync -direct=0 -iodepth=1

Adjust runtime and numjobs to increasingly higher values to increase stress. In my original tests I observed OSD outages already with numjobs=4 and runtime=30. Note that these occur several minutes after the fio command completes.

Here are today's observations with "osd_op_queue=wpq" and "osd_op_queue_cut_off=high" and a 5 sec run time:

- High IOPS (>4kops) on the data pool come in two waves.
- The first wave does not cause slow ops.
- There is a phase of low activity.
- A second wave starts and now slow metadata ops are reported by the MDS. Health level becomes warn.
- The cluster crunches through the metadata ops for a minute or so and then settles. This is quite a long time considering a 5 secs burst.
- OSDs did not go out, but this could be due to not running the test long enough.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
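For anyone who wants to try the client-side throttling described above, a minimal sketch follows. The monitor host name, secret file path and the exact byte values are only examples and need to be adapted to your environment.

# restrictive dirty-page limits, applied at runtime
sysctl -w vm.dirty_background_bytes=2097152
sysctl -w vm.dirty_bytes=67108864

# make them persistent across reboots
cat > /etc/sysctl.d/90-cephfs-throttle.conf <<'EOF'
vm.dirty_background_bytes = 2097152
vm.dirty_bytes = 67108864
EOF

# the harder alternative: force synchronous IO via the mount option
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,sync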
[ceph-users] Re: Cannot start virtual machines KVM / LXC
Hi,

I cannot get rid of
    pgs unknown
because there were 3 disks that couldn't be started. Therefore I destroyed the relevant OSD and re-created it for the relevant disks. Then I added the 3 OSDs to crushmap.

Regards
Thomas

Am 20.09.2019 um 08:19 schrieb Ashley Merrick:
> Your need to fix this first.
>
>     pgs: 0.056% pgs unknown
>          0.553% pgs not active
>
> The back filling will cause slow I/O, but having pgs unknown and not
> active will cause I/O blocking which your seeing with the VM booting.
>
> Seems you have 4 OSD's down, if you get them back online you should be
> able to get all the PG's online.
>
>
> On Fri, 20 Sep 2019 14:14:01 +0800 *Thomas <74cmo...@gmail.com>* wrote
>
> Hi,
>
> here I describe 1 of the 2 major issues I'm currently facing in my 8
> node ceph cluster (2x MDS, 6x ODS).
>
> The issue is that I cannot start any virtual machine KVM or container
> LXC; the boot process just hangs after a few seconds.
> All these KVMs and LXCs have in common that their virtual disks reside
> in the same pool: hdd
>
> This pool hdd is relatively small compared to the largest pool: hdb_backup
>
> root@ld3955:~# rados df
> POOL_NAME               USED   OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
> backup                   0 B         0       0          0                   0        0         0          0      0 B          0      0 B         0 B          0 B
> hdb_backup           589 TiB  51262212       0  153786636                   0        0    124895   12266095  4.3 TiB  247132863  463 TiB         0 B          0 B
> hdd                  3.2 TiB    281884    6568     845652                   0        0      1658  275277357   16 TiB  208213922   10 TiB         0 B          0 B
> pve_cephfs_data      955 GiB     91832       0     275496                   0        0      3038       2103 1021 MiB     102170  318 GiB         0 B          0 B
> pve_cephfs_metadata  486 MiB        62       0        186                   0        0         7        860  1.4 GiB      12393  166 MiB         0 B          0 B
>
> total_objects    51635990
> total_used       597 TiB
> total_avail      522 TiB
> total_space      1.1 PiB
>
> This is the current health status of the ceph cluster:
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 MDSs report slow metadata IOs
>             1 backfillfull osd(s)
>             87 nearfull osd(s)
>             1 pool(s) backfillfull
>             Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
>             Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
>             Degraded data redundancy (low space): 322 pgs backfill_toofull
>             1 subtrees have overcommitted pool target_size_bytes
>             1 subtrees have overcommitted pool target_size_ratio
>             1 pools have too many placement groups
>             21 slow requests are blocked > 32 sec
>
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
>     mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
>     mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
>     osd: 360 osds: 356 up, 356 in; 382 remapped pgs
>
>   data:
>     pools:   5 pools, 8868 pgs
>     objects: 51.64M objects, 197 TiB
>     usage:   597 TiB used, 522 TiB / 1.1 PiB avail
>     pgs:     0.056% pgs unknown
>              0.553% pgs not active
>              129598/154907946 objects degraded (0.084%)
>              229/154907946 objects misplaced (1.427%)
>              8458 active+clean
>              298  active+remapped+backfill_toofull
>              29   remapped+peering
>              24   active+undersized+degraded+remapped+backfill_toofull
>              22   active+remapped+backfill_wait
>              17   peering
>              5    unknown
>              5    active+recovery_wait+undersized+degraded+remapped
>              3    active+undersized+degraded+remapped+backfill_wait
>              2    activating+remapped
>              1    active+clean+remapped
>              1    stale+peering
>              1    active+remapped+backfilling
>              1    active+recovering+undersized+remapped
>              1    active+recovery_wait+degraded
>
>   io:
>     client: 9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
>
> I believe the cluster is busy with rebalancing pool hdb_backup.
> I set the balance mode upmap recently after the 589TB data was written.
> root@ld39
[ceph-users] How to reduce or control memory usage during recovery?
Hi,

I am using Ceph Mimic in a small test setup with the below configuration.

OS: Ubuntu 18.04
1 node running (mon, mds, mgr): 4-core CPU, 4 GB RAM, 1 Gb LAN
3 nodes each having 2 OSDs (2 TB disks): 2-core CPU, 4 GB RAM, 1 Gb LAN
1 node acting as CephFS client: 2-core CPU, 4 GB RAM, 1 Gb LAN

I configured cephfs_metadata_pool (3 replicas) and cephfs_data_pool as erasure 2+1.

When running a script that creates many folders, Ceph started throwing errors about late IO due to the high metadata workload. Once the folder creation completed, PGs became degraded. I am waiting for the PGs to finish recovery, but my OSDs keep crashing due to OOM and restarting after some time.

Now my question: I can wait for recovery to complete, but how do I stop the OOM kills and OSD crashes? Basically, I want to know how to control memory usage during recovery and make it stable. I have also set very low PG counts (8 for the metadata pool and 16 for the data pool). I have already set "mon osd memory target" to 1 GB and I have changed max-backfill from 1 to 8.

Attached is the message from "kern.log" from one of the nodes; a snippet of the error message is included in this mail.

-error msg snippet --

-bash: fork: Cannot allocate memory
Sep 18 19:01:57 test-node1 kernel: [341246.765644] msgr-worker-0 invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Sep 18 19:02:00 test-node1 kernel: [341246.765645] msgr-worker-0 cpuset=/ mems_allowed=0
Sep 18 19:02:00 test-node1 kernel: [341246.765650] CPU: 1 PID: 1737 Comm: msgr-worker-0 Not tainted 4.15.0-45-generic #48-Ubuntu
Sep 18 19:02:02 test-node1 kernel: [341246.765833] Out of memory: Kill process 1727 (ceph-osd) score 489 or sacrifice child
Sep 18 19:02:03 test-node1 kernel: [341246.765919] Killed process 1727 (ceph-osd) total-vm:3483844kB, anon-rss:1992708kB, file-rss:0kB, shmem-rss:0kB
Sep 18 19:02:03 test-node1 kernel: [341246.899395] oom_reaper: reaped process 1727 (ceph-osd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Sep 18 22:09:57 test-node1 kernel: [352529.433155] perf: interrupt took too long (4965 > 4938), lowering kernel.perf_event_max_sample_rate to 40250

regards
Amudhan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
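Not a definitive fix, but for reference the knobs usually involved look roughly like this. The values are only examples; on 4 GB nodes running two OSDs each, even 1 GiB per OSD plus recovery overhead can still be too much.

# per-OSD memory target used by the cache autotuner, example: 1 GiB
ceph config set osd osd_memory_target 1073741824

# slow down recovery/backfill to reduce peak memory and load, example values
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1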
[ceph-users] handle_connect_reply_2 connect got BADAUTHORIZER when running ceph pg query
Hi, ceph health status reports unknown objects. All objects reside on same osd.9 When I execute ceph pg query I get this (endless) output: 2019-09-20 14:47:35.922 7f937144f700 0 --1- 10.97.206.91:0/2060489821 >> v1:10.97.206.93:7054/15812 conn(0x7f935407c120 0x7f935407b120 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER ^CTraceback (most recent call last): File "/usr/bin/ceph", line 1263, in retval = main() File "/usr/bin/ceph", line 1179, in main prefix='get_command_descriptions') File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1459, in json_command inbuf, timeout, verbose) File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1329, in send_command_retry return send_command(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1381, in send_command cluster.pg_command, pgid, cmd, inbuf, timeout=timeout) File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1311, in run_in_thread t.join(timeout=timeout) File "/usr/lib/python2.7/threading.py", line 951, in join self.__block.wait(delay) File "/usr/lib/python2.7/threading.py", line 359, in wait _sleep(delay) KeyboardInterrupt 2019-09-20 14:47:35.950 7f937144f700 0 --1- 10.97.206.91:0/2060489821 >> v1:10.97.206.93:7054/15812 conn(0x7f935407f4b0 0x7f935407b920 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER 2019-09-20 14:47:35.950 7f937144f700 0 --1- 10.97.206.91:0/2060489821 >> v1:10.97.206.93:7054/15812 conn(0x7f935407c120 0x7f935407b120 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER 2019-09-20 14:47:35.950 7f937144f700 0 --1- 10.97.206.91:0/2060489821 >> v1:10.97.206.93:7054/15812 conn(0x7f935407f4b0 0x7f935407b920 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER How can I fix this issue with pg / osd.9? THX ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Nautilus dashboard: MDS performance graph doesn't refresh
Hi all, I regularly check the MDS performance graphs in the dashboard, especially the requests per second is interesting in my case. Since our upgrade to Nautilus the values in the activity column are still refreshed every 5 seconds (I believe), but the graphs are not refreshed since that upgrade anymore. I couldn't find anything in the tracker or the mailing list, can anyone comment on this? Thank you & best regards, Eugen ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW backup to tape
The question was posed, "What if we want to backup our RGW data to tape?" Anyone doing this? Any suggestions? We could probably just catch any PUT requests and queue them to be written to tape. Our dataset is so large, that traditional backup solutions don't seem feasible (GFS), so probably a single copy (or two copies on different tapes at the same time) when the object is created. Bonus points for being near-line. Thanks, Robert LeBlanc Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW backup to tape
Probably easiest if you get a tape library that supports S3. You might even have some luck with radosgw's cloud sync module (but I wouldn't count on it, Octopus should improve things, though) Just intercepting PUT requests isn't that easy because of multi-part stuff and load balancing. I.e., if you upload a large file you should be sending it in chunks and each chunk should go to a different server, that makes any "simple" solutions pretty messy. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Fri, Sep 20, 2019 at 8:01 PM Robert LeBlanc wrote: > > The question was posed, "What if we want to backup our RGW data to > tape?" Anyone doing this? Any suggestions? We could probably just > catch any PUT requests and queue them to be written to tape. Our > dataset is so large, that traditional backup solutions don't seem > feasible (GFS), so probably a single copy (or two copies on different > tapes at the same time) when the object is created. > > Bonus points for being near-line. > > Thanks, > Robert LeBlanc > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: HEALTH_WARN due to large omap object wont clear even after trim
Still trying to solve this one. Here is the corresponding log entry when the large omap object was found: ceph-osd.1284.log.2.gz:2019-09-18 11:43:39.237 7fcd68f96700 0 log_channel(cluster) log [WRN] : Large omap object found. Object: 26:86e4c833:::usage.22:head Key count: 2009548 Size (bytes): 369641376 I have since trimmed the entire usage log and disabled it entirely. You can see from the output below that there's nothing in these usage log objects. for i in `rados -p .usage ls`; do echo $i; rados -p .usage listomapkeys $i | wc -l; done usage.29 0 usage.12 0 usage.1 0 usage.26 0 usage.20 0 usage.24 0 usage.16 0 usage.15 0 usage.3 0 usage.19 0 usage.23 0 usage.5 0 usage.11 0 usage.7 0 usage.30 0 usage.18 0 usage.21 0 usage.27 0 usage.13 0 usage.22 0 usage.25 0 . 4 usage.10 0 usage.8 0 usage.9 0 usage.28 0 usage.2 0 usage.4 0 usage.6 0 usage.31 0 usage.17 0 root@infra:~# rados -p .usage listomapkeys usage.22 root@infra:~# On Thu, Sep 19, 2019 at 12:54 PM Charles Alva wrote: > > Could you please share how you trimmed the usage log? > > Kind regards, > > Charles Alva > Sent from Gmail Mobile > > > On Thu, Sep 19, 2019 at 11:46 PM shubjero wrote: >> >> Hey all, >> >> Yesterday our cluster went in to HEALTH_WARN due to 1 large omap >> object in the .usage pool (I've posted about this in the past). Last >> time we resolved the issue by trimming the usage log below the alert >> threshold but this time it seems like the alert wont clear even after >> trimming and (this time) disabling the usage log entirely. >> >> ceph health detail >> HEALTH_WARN 1 large omap objects >> LARGE_OMAP_OBJECTS 1 large omap objects >> 1 large objects found in pool '.usage' >> Search the cluster log for 'Large omap object found' for more details. >> >> I've bounced ceph-mon, ceph-mgr, radosgw and even issued osd scrub on >> the two osd's that hold pg's for the .usage pool but the alert wont >> clear. >> >> It's been over 24 hours since I trimmed the usage log. >> >> Any suggestions? >> >> Jared Baker >> Cloud Architect, OICR >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
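For reference, since the question of how the usage log was trimmed is quoted above but not answered explicitly: these are not the poster's exact commands, but trimming and disabling the RGW usage log generally looks roughly like the following. The dates are placeholders.

# trim usage log entries in a date range (add --uid=<user> to limit to one user)
radosgw-admin usage trim --start-date=2019-01-01 --end-date=2019-09-19

# disable the usage log entirely in ceph.conf for the rgw instances, then restart radosgw
rgw_enable_usage_log = false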
[ceph-users] Re: HEALTH_WARN due to large omap object wont clear even after trim
Hi Jared, My understanding is that these 'large omap object' warnings are only issued or cleared during scrub, so I'd expect them to go away the next time the usage objects get scrubbed. On 9/20/19 2:31 PM, shubjero wrote: Still trying to solve this one. Here is the corresponding log entry when the large omap object was found: ceph-osd.1284.log.2.gz:2019-09-18 11:43:39.237 7fcd68f96700 0 log_channel(cluster) log [WRN] : Large omap object found. Object: 26:86e4c833:::usage.22:head Key count: 2009548 Size (bytes): 369641376 I have since trimmed the entire usage log and disabled it entirely. You can see from the output below that there's nothing in these usage log objects. for i in `rados -p .usage ls`; do echo $i; rados -p .usage listomapkeys $i | wc -l; done usage.29 0 usage.12 0 usage.1 0 usage.26 0 usage.20 0 usage.24 0 usage.16 0 usage.15 0 usage.3 0 usage.19 0 usage.23 0 usage.5 0 usage.11 0 usage.7 0 usage.30 0 usage.18 0 usage.21 0 usage.27 0 usage.13 0 usage.22 0 usage.25 0 . 4 usage.10 0 usage.8 0 usage.9 0 usage.28 0 usage.2 0 usage.4 0 usage.6 0 usage.31 0 usage.17 0 root@infra:~# rados -p .usage listomapkeys usage.22 root@infra:~# On Thu, Sep 19, 2019 at 12:54 PM Charles Alva wrote: Could you please share how you trimmed the usage log? Kind regards, Charles Alva Sent from Gmail Mobile On Thu, Sep 19, 2019 at 11:46 PM shubjero wrote: Hey all, Yesterday our cluster went in to HEALTH_WARN due to 1 large omap object in the .usage pool (I've posted about this in the past). Last time we resolved the issue by trimming the usage log below the alert threshold but this time it seems like the alert wont clear even after trimming and (this time) disabling the usage log entirely. ceph health detail HEALTH_WARN 1 large omap objects LARGE_OMAP_OBJECTS 1 large omap objects 1 large objects found in pool '.usage' Search the cluster log for 'Large omap object found' for more details. I've bounced ceph-mon, ceph-mgr, radosgw and even issued osd scrub on the two osd's that hold pg's for the .usage pool but the alert wont clear. It's been over 24 hours since I trimmed the usage log. Any suggestions? Jared Baker Cloud Architect, OICR ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Doubt about ceph-iscsi and Vmware
Hi,

I'm testing Ceph with VMware, using the Ceph-iscsi gateway. I'm reading the documentation* and have doubts about some points:

- If I understood correctly, in general terms, each VMFS datastore in VMware will map to one RBD image (consequently, one RBD image may hold many VMware disks). Is that correct?

- The documentation says: "gwcli requires a pool with the name rbd, so it can store metadata like the iSCSI configuration". Part 4 of "Configuration" then says: "Add a RBD image with the name disk_1 in the pool rbd". Is the "rbd" pool here just an example, so that I could use any pool to store images, or must the pool be "rbd"? In short: gwcli requires the "rbd" pool for metadata, but can I use any pool for images, or do I have to use the "rbd" pool for both images and metadata?

- How much memory does ceph-iscsi use? What is a good amount of RAM?

Regards
Gesiel

* https://docs.ceph.com/docs/master/rbd/iscsi-target-cli/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Doubt about ceph-iscsi and Vmware
On Fri, Sep 20, 2019 at 8:55 PM Gesiel Galvão Bernardes wrote: > > Hi, > I'm testing Ceph with Vmware, using Ceph-iscsi gateway. I reading > documentation* and have doubts some points: > > - If I understanded, in general terms, for each VMFS datastore in VMware will > match the an RBD image. (consequently in an RBD image I will possible have > many VMWare disks). Its correct? yes > - In documentation is this: "gwcli requires a pool with the name rbd, so it > can store metadata like the iSCSI configuration". In part 4 of > "Configuration", have: "Add a RBD image with the name disk_1 in the pool > rbd". In this part, the use of "rbd" pool is a example and I could use any > pool for storage of image, or the pool should be "rbd"? that's just an example, yes > Resuming: gwcli require "rbd" pool for metadata and I could use any pool for > image, or i will use just "rbd pool" for storage image and metadata? you can even store the metadata elsewhere in newer versions, see options in the config file > - How much memory ceph-iscsi use? Which is a good number of RAM? since it can't cache anything: virtually nothing > > Regards > Gesiel > > * https://docs.ceph.com/docs/master/rbd/iscsi-target-cli/ > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Doubt about ceph-iscsi and Vmware
Hello Gesiel, Some iscsi settings are stored in an object, this object is stored in the rbd pool. Hnece the rbd pool is required. Your LUN's are mapped to {pool}/{rbdimage}. You should treat these as you treat pools and rbd images in general. In smallish deployments I try to keep it simple and make 1 pool for each deviceclass and make LUN's as big as possible, while still allowing for setting 1 LUN in maintenance mode in the datastore cluster within vSphere, in case we need to re-format as part of vmfs upgrade. Remember to set PSP and recovery timeout properly. From the LUN level and up, you just treat the storage as any other iscsi storage connected to vSphere. the iGW consumes RAM pr. LUN export... I can't remember the default settings but we are talking about single-digit Gb qith tens of LUN exported, so it's fairly lightweight. /Heðin On frí, 2019-09-20 at 15:52 -0300, Gesiel Galvão Bernardes wrote: > Hi, > I'm testing Ceph with Vmware, using Ceph-iscsi gateway. I reading > documentation* and have doubts some points: > > - If I understanded, in general terms, for each VMFS datastore in > VMware will match the an RBD image. (consequently in an RBD image I > will possible have many VMWare disks). Its correct? > > - In documentation is this: "gwcli requires a pool with the name rbd, > so it can store metadata like the iSCSI configuration". In part 4 of > "Configuration", have: "Add a RBD image with the name disk_1 in the > pool rbd". In this part, the use of "rbd" pool is a example and I > could use any pool for storage of image, or the pool should be "rbd"? > Resuming: gwcli require "rbd" pool for metadata and I could use any > pool for image, or i will use just "rbd pool" for storage image and > metadata? > > - How much memory ceph-iscsi use? Which is a good number of RAM? > > Regards > Gesiel > > * https://docs.ceph.com/docs/master/rbd/iscsi-target-cli/ > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io signature.asc Description: This is a digitally signed message part ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to set timeout on Rados gateway request
On 9/19/19 11:52 PM, Hanyu Liu wrote: Hi, We are looking for a way to set timeout on requests to rados gateway. If a request takes too long time, just kill it. 1. Is there a command that can set the timeout? there isn't, no 2. This parameter looks interesting. Can I know what the "open threads" means? |rgw op thread timeout| Description:The timeout in seconds for open threads. Type: Integer Default:600 /(from https://docs.ceph.com/docs/nautilus/radosgw/config-ref/)/ this thread timeout option is left over from frontends that used ceph's internal WorkQueue/ThreadPool infrastructure. i believe the timeout option just caused the WorkQueue to print a warning when a request took longer to complete. the associated 'rgw op thread suicide timeout' went a step further and actually killed the radosgw process. however, these options don't apply to either of the currently supported frontends as they each have their own threading model Thanks, Hanyu ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cannot start virtual machines KVM / LXC
On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmo...@gmail.com> wrote: > > Hi, > > I cannot get rid of > pgs unknown > because there were 3 disks that couldn't be started. > Therefore I destroyed the relevant OSD and re-created it for the > relevant disks. and you had it configured to run with replica 3? Well, I guess the down PGs where located on these three disks that you wiped. Do you still have the disks? Use ceph-objectstore-tool to export the affected PGs manually and inject them into another OSD. Paul > Then I added the 3 OSDs to crushmap. > > Regards > Thomas > > Am 20.09.2019 um 08:19 schrieb Ashley Merrick: > > Your need to fix this first. > > > > pgs: 0.056% pgs unknown > > 0.553% pgs not active > > > > The back filling will cause slow I/O, but having pgs unknown and not > > active will cause I/O blocking which your seeing with the VM booting. > > > > Seems you have 4 OSD's down, if you get them back online you should be > > able to get all the PG's online. > > > > > > On Fri, 20 Sep 2019 14:14:01 +0800 *Thomas <74cmo...@gmail.com>* > > wrote > > > > Hi, > > > > here I describe 1 of the 2 major issues I'm currently facing in my 8 > > node ceph cluster (2x MDS, 6x ODS). > > > > The issue is that I cannot start any virtual machine KVM or container > > LXC; the boot process just hangs after a few seconds. > > All these KVMs and LXCs have in common that their virtual disks > > reside > > in the same pool: hdd > > > > This pool hdd is relatively small compared to the largest pool: > > hdb_backup > > root@ld3955:~# rados df > > POOL_NAME USED OBJECTS CLONESCOPIES > > MISSING_ON_PRIMARY > > UNFOUND DEGRADEDRD_OPS RDWR_OPS WR USED COMPR > > UNDER COMPR > > backup 0 B0 0 > > 0 > > 0 00 0 0 B 0 0 B0 > > B 0 B > > hdb_backup 589 TiB 51262212 0 > > 153786636 > > 0 0 124895 12266095 4.3 TiB 247132863 463 TiB0 > > B 0 B > > hdd 3.2 TiB 281884 6568 > > 845652 > > 0 0 1658 275277357 16 TiB 208213922 10 TiB0 > > B 0 B > > pve_cephfs_data 955 GiB91832 0 > > 275496 > > 0 0 3038 2103 1021 MiB102170 318 GiB0 > > B 0 B > > pve_cephfs_metadata 486 MiB 62 0 > > 186 > > 0 07 860 1.4 GiB 12393 166 MiB0 > > B 0 B > > > > total_objects51635990 > > total_used 597 TiB > > total_avail 522 TiB > > total_space 1.1 PiB > > > > This is the current health status of the ceph cluster: > > cluster: > > id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae > > health: HEALTH_ERR > > 1 filesystem is degraded > > 1 MDSs report slow metadata IOs > > 1 backfillfull osd(s) > > 87 nearfull osd(s) > > 1 pool(s) backfillfull > > Reduced data availability: 54 pgs inactive, 47 pgs > > peering, > > 1 pg stale > > Degraded data redundancy: 129598/154907946 objects > > degraded > > (0.084%), 33 pgs degraded, 33 pgs undersized > > Degraded data redundancy (low space): 322 pgs > > backfill_toofull > > 1 subtrees have overcommitted pool target_size_bytes > > 1 subtrees have overcommitted pool target_size_ratio > > 1 pools have too many placement groups > > 21 slow requests are blocked > 32 sec > > > > services: > > mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h) > > mgr: ld5507(active, since 16h), standbys: ld5506, ld5505 > > mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby > > osd: 360 osds: 356 up, 356 in; 382 remapped pgs > > > > data: > > pools: 5 pools, 8868 pgs > > objects: 51.64M objects, 197 TiB > > usage: 597 TiB used, 522 TiB / 1.1 PiB avail > > pgs: 0.056% pgs unknown > > 0.553% pgs not active > > 129598/154907946 objects degraded (0.084%) > > 229/154907946 objects misplaced (1.427%) > > 8458 active+clean > > 298 
active+remapped+backfill_toofull > > 29 remapped+peering > > 24 > > active+undersized+degraded+remapped+backfill_toofull > > 22 active+remapped+backfill_wait > > 17 peering > > 5unknown > > 5active+recovery_wait+undersized+degraded+remapped > > 3active+undersized+degraded+remapped+backfill_wait > > 2activating+remapped > > 1
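As a rough sketch of the export/import workflow Paul mentions above: the OSD ids and the PG id 1.2f are placeholders, and both the source and the destination OSD must be stopped while ceph-objectstore-tool runs.

systemctl stop ceph-osd@9
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid 1.2f --op export --file /tmp/pg.1.2f.export

# on the OSD that should receive the PG (also stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --op import --file /tmp/pg.1.2f.export
systemctl start ceph-osd@123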
[ceph-users] Re: HEALTH_WARN due to large omap object wont clear even after trim
Thanks Casey. I will issue a scrub for the pg that contains this object to speed things along. Will report back when that's done. On Fri, Sep 20, 2019 at 2:50 PM Casey Bodley wrote: > > Hi Jared, > > My understanding is that these 'large omap object' warnings are only > issued or cleared during scrub, so I'd expect them to go away the next > time the usage objects get scrubbed. > > On 9/20/19 2:31 PM, shubjero wrote: > > Still trying to solve this one. > > > > Here is the corresponding log entry when the large omap object was found: > > > > ceph-osd.1284.log.2.gz:2019-09-18 11:43:39.237 7fcd68f96700 0 > > log_channel(cluster) log [WRN] : Large omap object found. Object: > > 26:86e4c833:::usage.22:head Key count: 2009548 Size (bytes): 369641376 > > > > I have since trimmed the entire usage log and disabled it entirely. > > You can see from the output below that there's nothing in these usage > > log objects. > > > > for i in `rados -p .usage ls`; do echo $i; rados -p .usage > > listomapkeys $i | wc -l; done > > usage.29 > > 0 > > usage.12 > > 0 > > usage.1 > > 0 > > usage.26 > > 0 > > usage.20 > > 0 > > usage.24 > > 0 > > usage.16 > > 0 > > usage.15 > > 0 > > usage.3 > > 0 > > usage.19 > > 0 > > usage.23 > > 0 > > usage.5 > > 0 > > usage.11 > > 0 > > usage.7 > > 0 > > usage.30 > > 0 > > usage.18 > > 0 > > usage.21 > > 0 > > usage.27 > > 0 > > usage.13 > > 0 > > usage.22 > > 0 > > usage.25 > > 0 > > . > > 4 > > usage.10 > > 0 > > usage.8 > > 0 > > usage.9 > > 0 > > usage.28 > > 0 > > usage.2 > > 0 > > usage.4 > > 0 > > usage.6 > > 0 > > usage.31 > > 0 > > usage.17 > > 0 > > > > > > root@infra:~# rados -p .usage listomapkeys usage.22 > > root@infra:~# > > > > > > On Thu, Sep 19, 2019 at 12:54 PM Charles Alva wrote: > >> Could you please share how you trimmed the usage log? > >> > >> Kind regards, > >> > >> Charles Alva > >> Sent from Gmail Mobile > >> > >> > >> On Thu, Sep 19, 2019 at 11:46 PM shubjero wrote: > >>> Hey all, > >>> > >>> Yesterday our cluster went in to HEALTH_WARN due to 1 large omap > >>> object in the .usage pool (I've posted about this in the past). Last > >>> time we resolved the issue by trimming the usage log below the alert > >>> threshold but this time it seems like the alert wont clear even after > >>> trimming and (this time) disabling the usage log entirely. > >>> > >>> ceph health detail > >>> HEALTH_WARN 1 large omap objects > >>> LARGE_OMAP_OBJECTS 1 large omap objects > >>> 1 large objects found in pool '.usage' > >>> Search the cluster log for 'Large omap object found' for more > >>> details. > >>> > >>> I've bounced ceph-mon, ceph-mgr, radosgw and even issued osd scrub on > >>> the two osd's that hold pg's for the .usage pool but the alert wont > >>> clear. > >>> > >>> It's been over 24 hours since I trimmed the usage log. > >>> > >>> Any suggestions? > >>> > >>> Jared Baker > >>> Cloud Architect, OICR > >>> ___ > >>> ceph-users mailing list -- ceph-users@ceph.io > >>> To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
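A rough sketch of re-triggering that check instead of waiting for the next scheduled scrub (the usage.22 object name is taken from the log entry earlier in this thread; the PG id will differ per cluster):

# find the placement group holding the flagged object
ceph osd map .usage usage.22

# deep-scrub that PG so the large-omap warning is re-evaluated
ceph pg deep-scrub <pgid-from-the-output-above>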
[ceph-users] Re: cephfs and selinux
Thank you for the response, but of course I'd tried this before asking. It has no effect: SELinux still prevents sshd from opening authorized_keys.

I suppose there is something wrong with the file contexts on my cephfs. For instance, 'ls -Z' shows just a '?' as the context, and chcon fails with an "Operation not supported" message. Where should I look for the error?

> You can set up a custom SELinux module to enable access. We use the following snippet to allow sshd to access authorized keys in home directories on CephFS:
>
> module local-ceph-ssh-auth 1.0;
>
> require {
>     type cephfs_t;
>     type sshd_t;
>     class file { read getattr open };
> }
>
> #= sshd_t ==
> allow sshd_t cephfs_t:file { read getattr open };
>
> Compiling and persistently installing such a module is covered by various documentation, such as:
> https://wiki.centos.org/HowTos/SELinux#head-aa437f65e1c7873cddbafd9e9a73bbf9d102c072
> (7.1. Manually Customizing Policy Modules). Also covered there is using audit2allow to create your own module from SELinux audit logs.
>
> thanks,
> Ben
>
> On Tue, Sep 17, 2019 at 9:22 AM Andrey Suharev wrote:
>> Hi all,
>>
>> I would like to have my home dir at cephfs and to keep selinux enabled at the same time.
>> The trouble is selinux prevents sshd to access ~/.ssh/authorized_keys file.
>> Any ideas how to fix it?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
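For completeness, a minimal sketch of building and installing such a module, assuming the quoted snippet is saved as local-ceph-ssh-auth.te (package names and paths vary per distribution):

checkmodule -M -m -o local-ceph-ssh-auth.mod local-ceph-ssh-auth.te
semodule_package -o local-ceph-ssh-auth.pp -m local-ceph-ssh-auth.mod
semodule -i local-ceph-ssh-auth.pp

# or generate a module directly from the denials in the audit log
grep sshd /var/log/audit/audit.log | audit2allow -M local-ceph-ssh-auth
semodule -i local-ceph-ssh-auth.pp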
[ceph-users] Re: HEALTH_WARN due to large omap object wont clear even after trim
The deep scrub of the pg updated the cluster that the large omap was gone. HEALTH_OK ! On Fri., Sep. 20, 2019, 2:31 p.m. shubjero, wrote: > Still trying to solve this one. > > Here is the corresponding log entry when the large omap object was found: > > ceph-osd.1284.log.2.gz:2019-09-18 11:43:39.237 7fcd68f96700 0 > log_channel(cluster) log [WRN] : Large omap object found. Object: > 26:86e4c833:::usage.22:head Key count: 2009548 Size (bytes): 369641376 > > I have since trimmed the entire usage log and disabled it entirely. > You can see from the output below that there's nothing in these usage > log objects. > > for i in `rados -p .usage ls`; do echo $i; rados -p .usage > listomapkeys $i | wc -l; done > usage.29 > 0 > usage.12 > 0 > usage.1 > 0 > usage.26 > 0 > usage.20 > 0 > usage.24 > 0 > usage.16 > 0 > usage.15 > 0 > usage.3 > 0 > usage.19 > 0 > usage.23 > 0 > usage.5 > 0 > usage.11 > 0 > usage.7 > 0 > usage.30 > 0 > usage.18 > 0 > usage.21 > 0 > usage.27 > 0 > usage.13 > 0 > usage.22 > 0 > usage.25 > 0 > . > 4 > usage.10 > 0 > usage.8 > 0 > usage.9 > 0 > usage.28 > 0 > usage.2 > 0 > usage.4 > 0 > usage.6 > 0 > usage.31 > 0 > usage.17 > 0 > > > root@infra:~# rados -p .usage listomapkeys usage.22 > root@infra:~# > > > On Thu, Sep 19, 2019 at 12:54 PM Charles Alva > wrote: > > > > Could you please share how you trimmed the usage log? > > > > Kind regards, > > > > Charles Alva > > Sent from Gmail Mobile > > > > > > On Thu, Sep 19, 2019 at 11:46 PM shubjero wrote: > >> > >> Hey all, > >> > >> Yesterday our cluster went in to HEALTH_WARN due to 1 large omap > >> object in the .usage pool (I've posted about this in the past). Last > >> time we resolved the issue by trimming the usage log below the alert > >> threshold but this time it seems like the alert wont clear even after > >> trimming and (this time) disabling the usage log entirely. > >> > >> ceph health detail > >> HEALTH_WARN 1 large omap objects > >> LARGE_OMAP_OBJECTS 1 large omap objects > >> 1 large objects found in pool '.usage' > >> Search the cluster log for 'Large omap object found' for more > details. > >> > >> I've bounced ceph-mon, ceph-mgr, radosgw and even issued osd scrub on > >> the two osd's that hold pg's for the .usage pool but the alert wont > >> clear. > >> > >> It's been over 24 hours since I trimmed the usage log. > >> > >> Any suggestions? > >> > >> Jared Baker > >> Cloud Architect, OICR > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW backup to tape
On Fri, Sep 20, 2019 at 11:10 AM Paul Emmerich wrote: > > Probably easiest if you get a tape library that supports S3. You might > even have some luck with radosgw's cloud sync module (but I wouldn't > count on it, Octopus should improve things, though) > > Just intercepting PUT requests isn't that easy because of multi-part > stuff and load balancing. I.e., if you upload a large file you should > be sending it in chunks and each chunk should go to a different > server, that makes any "simple" solutions pretty messy. I wasn't aware of any library being S3 aware, usually it's been part of the backup software. Do you have any suggestions for multi PB libraries that have the S3 feature? The idea with the PUT was not to intercept them in the path, but to basically have RGW log access to LogStash, then a job would run to find all the objects that were PUT within a time frame, then read the objects off the cluster and write them to tape. Maybe that's not as easy as I'm thinking either. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
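To make the idea above concrete, a very rough sketch of such a job, assuming the PUT log has already been reduced to one bucket/key per line. The file names, staging directory and the choice of s3cmd and tar are all illustrative, and this ignores the multipart and load-balancing issues Paul raised.

# objects-put-since.txt: one "bucket/key" per line extracted from the RGW access logs
while read obj; do
    mkdir -p "/var/spool/tape-staging/$(dirname "$obj")"
    s3cmd get "s3://$obj" "/var/spool/tape-staging/$obj" --skip-existing
done < objects-put-since.txt

# write the staged objects to the tape drive (repeat for a second tape copy)
tar -cf /dev/nst0 -C /var/spool/tape-staging .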
[ceph-users] Re: RGW backup to tape
Robert, There're a storage company that integrate TAPES as OSD for deep-cold ceph. But the code is not opensource Regards -Mensaje original- De: Robert LeBlanc Enviado el: viernes, 20 de septiembre de 2019 23:28 Para: Paul Emmerich CC: ceph-users Asunto: [ceph-users] Re: RGW backup to tape On Fri, Sep 20, 2019 at 11:10 AM Paul Emmerich wrote: > > Probably easiest if you get a tape library that supports S3. You might > even have some luck with radosgw's cloud sync module (but I wouldn't > count on it, Octopus should improve things, though) > > Just intercepting PUT requests isn't that easy because of multi-part > stuff and load balancing. I.e., if you upload a large file you should > be sending it in chunks and each chunk should go to a different > server, that makes any "simple" solutions pretty messy. I wasn't aware of any library being S3 aware, usually it's been part of the backup software. Do you have any suggestions for multi PB libraries that have the S3 feature? The idea with the PUT was not to intercept them in the path, but to basically have RGW log access to LogStash, then a job would run to find all the objects that were PUT within a time frame, then read the objects off the cluster and write them to tape. Maybe that's not as easy as I'm thinking either. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Doubt about ceph-iscsi and Vmware
On 09/20/2019 01:52 PM, Gesiel Galvão Bernardes wrote: > Hi, > I'm testing Ceph with Vmware, using Ceph-iscsi gateway. I reading > documentation* and have doubts some points: > > - If I understanded, in general terms, for each VMFS datastore in VMware > will match the an RBD image. (consequently in an RBD image I will > possible have many VMWare disks). Its correct? > > - In documentation is this: "gwcli requires a pool with the name rbd, so > it can store metadata like the iSCSI configuration". In part 4 of > "Configuration", have: "Add a RBD image with the name disk_1 in the pool > rbd". In this part, the use of "rbd" pool is a example and I could use > any pool for storage of image, or the pool should be "rbd"? > Resuming: gwcli require "rbd" pool for metadata and I could use any pool > for image, or i will use just "rbd pool" for storage image and metadata? > > - How much memory ceph-iscsi use? Which is a good number of RAM? > The major memory use is: 1. In RHEL 7.5 kernels and older we allocate max_data_area_mb of kernel memory per device. The default value for that is 8. You can use gwcli to configure it. It is allocated when the device is created. In newer kernels, there is pool of memory and each device can use up to max_data_area_mb worth of it. The per device default is the same and you can change it with gwcli. The total pool limit is 2 GB. There is a sysfs file: /sys/module/target_core_user/parameters/global_max_data_area_mb that can be used to change it. 2. Each device uses about 20 MB of memory in userspace. This is not configurable. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
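To illustrate the kernel-side limit Mike mentions: the global pool size is a module parameter of target_core_user. The 1024 value below is only an example, and on older kernels without the global pool only the per-device max_data_area_mb applies.

# current global limit of the target_core_user data area pool (newer kernels)
cat /sys/module/target_core_user/parameters/global_max_data_area_mb

# example: set it at module load time via modprobe options
echo "options target_core_user global_max_data_area_mb=1024" > /etc/modprobe.d/target_core_user.conf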
[ceph-users] V/v [Ceph] problem with delete object in large bucket
Hi Ceph team,

Can you explain to me how Ceph object deletion works? I have a bucket with over 100M objects (file size ~ 50 KB). When I delete objects to free space, the deletion speed is very slow (about 30-33 objects/s). I want to tune the performance of the cluster, but I do not clearly understand how Ceph deletes objects.

Thank you very much
-
Br,
Dương Tuấn Dũng
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io