Re: [ceph-users] RGW: Reshard index of non-master zones in multi-site
On Tue, 5 Feb 2019 at 10:04, Iain Buclaw wrote:
>
> On Tue, 5 Feb 2019 at 09:46, Iain Buclaw wrote:
> >
> > Hi,
> >
> > Following the update of one secondary site from 12.2.8 to 12.2.11, the following warning has come up.
> >
> > HEALTH_WARN 1 large omap objects
> > LARGE_OMAP_OBJECTS 1 large omap objects
> > 1 large objects found in pool '.rgw.buckets.index'
> > Search the cluster log for 'Large omap object found' for more details.
> >
> > [...]
> >
> Is this the reason why resharding hasn't propagated?
> >
>
> Furthermore, in fact it looks like the index is broken on the secondaries.
>
> On the master:
>
> # radosgw-admin bi get --bucket=mybucket --object=myobject
> {
>     "type": "plain",
>     "idx": "myobject",
>     "entry": {
>         "name": "myobject",
>         "instance": "",
>         "ver": {
>             "pool": 28,
>             "epoch": 8848
>         },
>         "locator": "",
>         "exists": "true",
>         "meta": {
>             "category": 1,
>             "size": 9200,
>             "mtime": "2018-03-27 21:12:56.612172Z",
>             "etag": "c365c324cda944d2c3b687c0785be735",
>             "owner": "mybucket",
>             "owner_display_name": "Bucket User",
>             "content_type": "application/octet-stream",
>             "accounted_size": 9194,
>             "user_data": ""
>         },
>         "tag": "0ef1a91a-4aee-427e-bdf8-30589abb2d3e.36603989.137292",
>         "flags": 0,
>         "pending_map": [],
>         "versioned_epoch": 0
>     }
> }
>
>
> On the secondaries:
>
> # radosgw-admin bi get --bucket=mybucket --object=myobject
> ERROR: bi_get(): (2) No such file or directory
>
> How does one go about rectifying this mess?
> A random blog in a language I don't understand seems to allude to using radosgw-admin bi put to restore backed up indexes, but not under what circumstances you would use such a command. https://cloud.tencent.com/developer/article/1032854

Would this be safe to run on secondaries?

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
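A minimal sketch of how the oversized index shard can usually be located and the shard count compared between zones; the bucket name follows the placeholder used above, the index pool is the one named in the warning, and the instance-id placeholder is whatever the metadata commands return:

# on a mon host, the cluster log names the offending index object
grep 'Large omap object found' /var/log/ceph/ceph.log

# bucket id and per-zone shard count live in the bucket instance metadata
radosgw-admin metadata get bucket:mybucket
radosgw-admin metadata get bucket.instance:mybucket:<instance-id>

# count omap keys per index shard object to find the big one
rados -p .rgw.buckets.index ls | grep <instance-id> | while read obj; do
    echo -n "$obj "; rados -p .rgw.buckets.index listomapkeys "$obj" | wc -l
done

Running the same on master and secondary makes it obvious whether the two zones agree on the number of shards.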
Re: [ceph-users] Multicast communication compuverde
Multicast traffic from storage has a point in things like the old Windows provisioning software Ghost, where you could netboot a room full of computers, have them listen to a mcast stream of the same data/image and all apply it at the same time, and perhaps re-sync potentially missing stuff at the end, which would be far less data overall than having each client ask the server(s) for the same image over and over.

In the case of ceph, I would say it is much less probable that many clients would ask for exactly the same data in the same order, so it would just mean all clients hear all traffic (or at least more traffic than they asked for) and need to skip past a lot of it.

On Tue, 5 Feb 2019 at 22:07, Marc Roos wrote:
>
> I am still testing with ceph mostly, so my apologies for bringing up
> something totally useless. But I just had a chat about compuverde
> storage. They seem to implement multicast in a scale out solution.
>
> I was wondering if there is any experience here with compuverde and how
> it compared to ceph. And maybe this multicast approach could be
> interesting to use with ceph?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
May the most significant bit of your life be positive.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multicast communication compuverde
Yes indeed, but for osd's writing the replication or erasure objects you get sort of parallel processing, no?

Multicast traffic from storage has a point in things like the old Windows provisioning software Ghost, where you could netboot a room full of computers, have them listen to a mcast stream of the same data/image and all apply it at the same time, and perhaps re-sync potentially missing stuff at the end, which would be far less data overall than having each client ask the server(s) for the same image over and over.

In the case of ceph, I would say it is much less probable that many clients would ask for exactly the same data in the same order, so it would just mean all clients hear all traffic (or at least more traffic than they asked for) and need to skip past a lot of it.

On Tue, 5 Feb 2019 at 22:07, Marc Roos wrote:

I am still testing with ceph mostly, so my apologies for bringing up something totally useless. But I just had a chat about compuverde storage. They seem to implement multicast in a scale out solution.

I was wondering if there is any experience here with compuverde and how it compared to ceph. And maybe this multicast approach could be interesting to use with ceph?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
May the most significant bit of your life be positive.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multicast communication compuverde
Hi, we have a compuverde cluster, and AFAIK it uses multicast for node discovery, not for data distribution. If you need more information, feel free to contact me either by email or via IRC (-> Be-El). Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multicast communication compuverde
For EC coded stuff, at 10+4 with 13 others needing data apart from the primary, they are specifically NOT getting the same data, they are getting either 1/10th of the pieces, or one of the 4 different checksums, so it would be nasty to send full data to all OSDs expecting a 14th of the data.

On Wed, 6 Feb 2019 at 10:14, Marc Roos wrote:
>
> Yes indeed, but for osd's writing the replication or erasure objects you
> get sort of parallel processing, no?
>
> Multicast traffic from storage has a point in things like the old
> Windows provisioning software Ghost, where you could netboot a room full
> of computers, have them listen to a mcast stream of the same data/image
> and all apply it at the same time, and perhaps re-sync potentially
> missing stuff at the end, which would be far less data overall than
> having each client ask the server(s) for the same image over and over.
> In the case of ceph, I would say it is much less probable that many
> clients would ask for exactly the same data in the same order, so it would
> just mean all clients hear all traffic (or at least more traffic than
> they asked for) and need to skip past a lot of it.
>
> On Tue, 5 Feb 2019 at 22:07, Marc Roos wrote:
> >
> > I am still testing with ceph mostly, so my apologies for bringing up
> > something totally useless. But I just had a chat about compuverde
> > storage. They seem to implement multicast in a scale out solution.
> >
> > I was wondering if there is any experience here with compuverde and how
> > it compared to ceph. And maybe this multicast approach could be
> > interesting to use with ceph?
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> May the most significant bit of your life be positive.

--
May the most significant bit of your life be positive.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph mon_data_size_warn limits for large cluster
Hello - Are there any limits for mon_data_size for a cluster with 2PB (with 2000+ OSDs)?

Currently it is set to 15G. What is the logic behind this? Can we increase it when we get the mon_data_size_warn messages?

I am getting the mon_data_size_warn message even though there is ample free space on the disk (around 300G of free disk).

Earlier thread on the same discussion: https://www.spinics.net/lists/ceph-users/msg42456.html

Thanks
Swami

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multicast communication compuverde
On 06/02/2019 11:14, Marc Roos wrote:

Yes indeed, but for osd's writing the replication or erasure objects you get sort of parallel processing, no?

Multicast traffic from storage has a point in things like the old Windows provisioning software Ghost, where you could netboot a room full of computers, have them listen to a mcast stream of the same data/image and all apply it at the same time, and perhaps re-sync potentially missing stuff at the end, which would be far less data overall than having each client ask the server(s) for the same image over and over.

In the case of ceph, I would say it is much less probable that many clients would ask for exactly the same data in the same order, so it would just mean all clients hear all traffic (or at least more traffic than they asked for) and need to skip past a lot of it.

On Tue, 5 Feb 2019 at 22:07, Marc Roos wrote:

I am still testing with ceph mostly, so my apologies for bringing up something totally useless. But I just had a chat about compuverde storage. They seem to implement multicast in a scale out solution.

I was wondering if there is any experience here with compuverde and how it compared to ceph. And maybe this multicast approach could be interesting to use with ceph?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

It could be used for sending cluster maps or other configuration in a push model; I believe corosync uses this by default. For use in sending actual data during write ops, a primary osd can send to its replicas; they do not have to process all traffic but can listen on a specific group address associated with that pg, which could be an increment from a defined base multicast address. Some additional erasure codes and acknowledgment messages need to be added to account for errors/dropped packets. I doubt it will give an appreciable boost given most pools use 3 replicas in total; additionally there could be issues getting multicast working correctly, like setting up IGMP, so all in all it could be a hassle.

/Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] krbd and image striping
Hi, I have been doing some testing with striped rbd images and have a question about the calculation of the optimal_io_size and minimum_io_size parameters. My test image was created using a 4M object size, stripe unit 64k and stripe count 16. In the kernel rbd_init_disk() code: unsigned int objset_bytes = rbd_dev->layout.object_size * rbd_dev->layout.stripe_count; blk_queue_io_min(q, objset_bytes); blk_queue_io_opt(q, objset_bytes); Which resulted in 64M minimal / optimal io sizes. If I understand the meaning correctly then even for a small write there is going to be at least 64M data written? My use case is a ceph cluster (13.2.4) hosting rbd images for VMs running on Xen. The rbd volumes are mapped to dom0 and then passed through to the guest using standard blkback/blkfront drivers. I am doing a bit of testing with different stripe unit sizes but keeping object size * count = 4M. Does anyone have any experience finding optimal rbd parameters for this scenario? Thanks, James Zynstra is a private limited company registered in England and Wales (registered number 07864369). Our registered office and Headquarters are at The Innovation Centre, Broad Quay, Bath, BA1 1UD. This email, its contents and any attachments are confidential. If you have received this message in error please delete it from your system and advise the sender immediately. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
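For reference, a quick way to reproduce and inspect such a layout; the pool/image names below are only examples, and the rbd device number depends on what else is mapped:

$ rbd create --size 10G --object-size 4M --stripe-unit 64K --stripe-count 16 rbd/stripetest
$ rbd info rbd/stripetest
$ sudo rbd map rbd/stripetest

# the hints the krbd driver exports for the mapped device
$ cat /sys/block/rbd0/queue/minimum_io_size
$ cat /sys/block/rbd0/queue/optimal_io_size

With the parameters above, both sysfs values come out as the object set size (object size * stripe count), which is the 64M being asked about.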
[ceph-users] ceph dashboard cert documentation bug?
I was trying to set my mimic dashboard cert using the instructions from http://docs.ceph.com/docs/mimic/mgr/dashboard/ and I'm pretty sure the lines $ ceph config-key set mgr mgr/dashboard/crt -i dashboard.crt $ ceph config-key set mgr mgr/dashboard/key -i dashboard.key should be $ ceph config-key set mgr/dashboard/crt -i dashboard.crt $ ceph config-key set mgr/dashboard/key -i dashboard.key Can anyone confirm? -- Junk ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
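One way to check which of the two spellings actually landed in the config-key store, and to make the module re-read the certificate; behaviour may differ between mimic point releases, so treat this as a sketch:

$ ceph config-key dump | grep dashboard
$ ceph mgr module disable dashboard
$ ceph mgr module enable dashboard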
Re: [ceph-users] krbd and image striping
On Wed, Feb 6, 2019 at 11:09 AM James Dingwall wrote: > > Hi, > > I have been doing some testing with striped rbd images and have a > question about the calculation of the optimal_io_size and > minimum_io_size parameters. My test image was created using a 4M object > size, stripe unit 64k and stripe count 16. > > In the kernel rbd_init_disk() code: > > unsigned int objset_bytes = > rbd_dev->layout.object_size * rbd_dev->layout.stripe_count; > > blk_queue_io_min(q, objset_bytes); > blk_queue_io_opt(q, objset_bytes); > > Which resulted in 64M minimal / optimal io sizes. If I understand the > meaning correctly then even for a small write there is going to be at > least 64M data written? No, these are just hints. The exported values are pretty stupid even in the default case and more so in the custom striping case and should be changed. It's certainly not the case that any write is going to be turned into io_min or io_opt sized write. > > My use case is a ceph cluster (13.2.4) hosting rbd images for VMs > running on Xen. The rbd volumes are mapped to dom0 and then passed > through to the guest using standard blkback/blkfront drivers. > > I am doing a bit of testing with different stripe unit sizes but keeping > object size * count = 4M. Does anyone have any experience finding > optimal rbd parameters for this scenario? I'd recommend focusing on the client side performance numbers for the expected workload(s), not io_min/io_opt or object size * count target. su = 64k and sc = 16 means that a 1M request will need responses from up to 16 OSDs at once, which is probably not what you want unless you have a small sequential write workload (where a custom striping layout can prove very useful). Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
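As a sketch of the kind of client-side test meant here, fio can be run directly against the mapped device; the device name and sizes are examples, and note that this overwrites the image's data:

$ fio --name=seqwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=write --bs=1M --iodepth=16 --runtime=60 --time_based --group_reporting

Comparing runs with different --bs values (64k, 1M, 4M) against images created with different --stripe-unit/--stripe-count settings should show whether a custom striping layout actually helps the intended workload.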
Re: [ceph-users] ceph mon_data_size_warn limits for large cluster
Hi Swami The limit is somewhat arbitrary, based on cluster sizes we had seen when we picked it. In your case it should be perfectly safe to increase it. sage On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote: > Hello - Are the any limits for mon_data_size for cluster with 2PB > (with 2000+ OSDs)? > > Currently it set as 15G. What is logic behind this? Can we increase > when we get the mon_data_size_warn messages? > > I am getting the mon_data_size_warn message even though there a ample > of free space on the disk (around 300G free disk) > > Earlier thread on the same discusion: > https://www.spinics.net/lists/ceph-users/msg42456.html > > Thanks > Swami > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
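For completeness, a sketch of raising the threshold; the value shown (30 GiB) is only an example. It can go into ceph.conf on the mons, or be injected at runtime, which may or may not take effect without a mon restart depending on the release:

[mon]
    mon data size warn = 32212254720

$ ceph tell mon.* injectargs '--mon_data_size_warn=32212254720'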
Re: [ceph-users] ceph mon_data_size_warn limits for large cluster
Hi, With HEALTH_OK a mon data dir should be under 2GB for even such a large cluster. During backfilling scenarios, the mons keep old maps and grow quite quickly. So if you have balancing, pg splitting, etc. ongoing for awhile, the mon stores will eventually trigger that 15GB alarm. But the intended behavior is that once the PGs are all active+clean, the old maps should be trimmed and the disk space freed. However, several people have noted that (at least in luminous releases) the old maps are not trimmed until after HEALTH_OK *and* all mons are restarted. This ticket seems related: http://tracker.ceph.com/issues/37875 (Over here we're restarting mons every ~2-3 weeks, resulting in the mon stores dropping from >15GB to ~700MB each time). -- Dan On Wed, Feb 6, 2019 at 1:26 PM Sage Weil wrote: > > Hi Swami > > The limit is somewhat arbitrary, based on cluster sizes we had seen when > we picked it. In your case it should be perfectly safe to increase it. > > sage > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote: > > > Hello - Are the any limits for mon_data_size for cluster with 2PB > > (with 2000+ OSDs)? > > > > Currently it set as 15G. What is logic behind this? Can we increase > > when we get the mon_data_size_warn messages? > > > > I am getting the mon_data_size_warn message even though there a ample > > of free space on the disk (around 300G free disk) > > > > Earlier thread on the same discusion: > > https://www.spinics.net/lists/ceph-users/msg42456.html > > > > Thanks > > Swami > > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
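A minimal sketch for checking the store size and asking a mon to compact, assuming the default data path and a mon id equal to the short hostname:

$ du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
$ ceph tell mon.$(hostname -s) compact

Setting mon_compact_on_start = true in the [mon] section makes the compaction happen automatically at each restart, which matches the restart-based shrinking described above.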
Re: [ceph-users] Need help with upmap feature on luminous
Note that there are some improved upmap balancer heuristics in development here: https://github.com/ceph/ceph/pull/26187 -- dan On Tue, Feb 5, 2019 at 10:18 PM Kári Bertilsson wrote: > > Hello > > I previously enabled upmap and used automatic balancing with "ceph balancer > on". I got very good results and OSD's ended up with perfectly distributed > pg's. > > Now after adding several new OSD's, auto balancing does not seem to be > working anymore. OSD's have 30-50% usage where previously all had almost the > same %. > > I turned off auto balancer and tried manually running a plan > > # ceph balancer reset > # ceph balancer optimize myplan > # ceph balancer show myplan > ceph osd pg-upmap-items 41.1 106 125 95 121 84 34 36 99 72 126 > ceph osd pg-upmap-items 41.5 12 121 65 3 122 52 5 126 > ceph osd pg-upmap-items 41.b 117 99 65 125 > ceph osd pg-upmap-items 41.c 49 121 81 131 > ceph osd pg-upmap-items 41.e 61 82 73 52 122 46 84 118 > ceph osd pg-upmap-items 41.f 71 127 15 121 56 82 > ceph osd pg-upmap-items 41.12 81 92 > ceph osd pg-upmap-items 41.17 35 127 71 44 > ceph osd pg-upmap-items 41.19 81 131 21 119 18 52 > ceph osd pg-upmap-items 41.25 18 52 37 125 40 3 41 34 71 127 4 128 > > > After running this plan there's no difference and still huge inbalance on the > OSD's. Creating a new plan give the same plan again. > > # ceph balancer eval > current cluster score 0.015162 (lower is better) > > Balancer eval shows quite low number, so it seems to think the pg > distribution is already optimized ? > > Since i'm not getting this working again. I looked into the offline > optimization at http://docs.ceph.com/docs/mimic/rados/operations/upmap/ > > I have 2 pools. > Replicated pool using 3 OSD's with "10k" device class. > And remaining OSD's have "hdd" device class. > > The resulting out.txt creates a much larger plan, but would map alot of PG's > to the "10k" OSD's (where they should not be). And i can't seem to find any > way to exclude these 3 OSD's. > > Any ideas how to proceed ? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
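Regarding excluding the three "10k" OSDs from the offline optimizer: osdmaptool can be told to generate upmaps for a single pool, so pointing it at the hdd-backed pool alone should leave the replicated 10k pool untouched. A sketch, with the pool name as a placeholder:

$ ceph osd getmap -o om
$ osdmaptool om --upmap out.txt --upmap-pool <hdd-pool-name> --upmap-max 100
$ source out.txt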
[ceph-users] Orchestration weekly meeting location change
Hey all, The Orchestration weekly team meeting on Mondays at 16:00 UTC has a new meeting location. The blue jeans url has changed so we can start recording the meetings. Please see instructions below. The event also has updated information: To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink To join from a Red Hat Deskphone or Softphone, dial: 84336. Connecting directly from a room system? 1.) Dial: 199.48.152.152 or bjn.vc 2.) Enter Meeting ID: 908675367 Just want to dial in on your phone? 1.) Dial one of the following numbers: 408-915-6466 (US) See all numbers: https://www.redhat.com/en/conference-numbers 2.) Enter Meeting ID: 908675367 3.) Press # Want to test your video connection? https://bluejeans.com/111 -- Mike Perez (thingee) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Using Cephfs Snapshots in Luminous
On Monday, 12 November 2018 at 15:31 +0100, Marc Roos wrote:
> > > is anybody using cephfs with snapshots on luminous? Cephfs
> > > snapshots are declared stable in mimic, but I'd like to know
> > > about the risks using them on luminous. Do I risk a complete
> > > cephfs failure or just some not working snapshots? It is one
> > > namespace, one fs, one data and one metadata pool.
> >
> > For luminous, snapshot in single mds setup basically works.
> > But snapshot is complete broken in multiple setup.
> >
>
> Single active mds not? And hardlinks are not supported with
> snapshots?

What's the final feeling on snapshots?
* Luminous 12.2.10 on Debian stretch
* ceph-fuse clients
* 1 active MDS, some standbys
* single FS, single namespace, no hardlinks
* will probably create nested snapshots, i.e. /1/.snaps/first and /1/2/3/.snaps/nested
* will use the facility through VirtFS from within VMs, where ceph-fuse runs on the host server

What's the risk of using that experimental feature (as said in [1])?
* losing snapshots?
* losing the main/last contents?
* losing some directory trees, or the entire filesystem?
* other?

TIA,

[1] http://docs.ceph.com/docs/luminous/cephfs/experimental-features/#snapshots

--
Nicolas Huillard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
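For reference, the snapshot mechanics being discussed are plain directory operations under the snapshot directory (default name '.snap'; the fs name and mount point below are examples). On luminous the feature also has to be enabled explicitly:

$ ceph fs set cephfs allow_new_snaps true    # may require --yes-i-really-mean-it on luminous

$ mkdir /mnt/cephfs/1/.snap/first            # snapshot of /1
$ mkdir /mnt/cephfs/1/2/3/.snap/nested       # nested snapshot further down the tree
$ ls /mnt/cephfs/1/.snap
$ rmdir /mnt/cephfs/1/.snap/first            # remove the snapshot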
Re: [ceph-users] Proxmox 4.4, Ceph hammer, OSD cache link...
I come back here.

> I've recently added a host to my ceph cluster, using proxmox 'helpers'
> to add OSD, eg:
>
> pveceph createosd /dev/sdb -journal_dev /dev/sda5
>
> and now i've:
>
> root@blackpanther:~# ls -la /var/lib/ceph/osd/ceph-12
> totale 60
> drwxr-xr-x   3 root root   199 nov 21 17:02 .
> drwxr-xr-x   6 root root  4096 nov 21 23:08 ..
> -rw-r--r--   1 root root   903 nov 21 17:02 activate.monmap
> -rw-r--r--   1 root root     3 nov 21 17:02 active
> -rw-r--r--   1 root root    37 nov 21 17:02 ceph_fsid
> drwxr-xr-x 432 root root 12288 dic  1 18:21 current
> -rw-r--r--   1 root root    37 nov 21 17:02 fsid
> lrwxrwxrwx   1 root root     9 nov 21 17:02 journal -> /dev/sda5
> -rw-------   1 root root    57 nov 21 17:02 keyring
> -rw-r--r--   1 root root    21 nov 21 17:02 magic
> -rw-r--r--   1 root root     6 nov 21 17:02 ready
> -rw-r--r--   1 root root     4 nov 21 17:02 store_version
> -rw-r--r--   1 root root    53 nov 21 17:02 superblock
> -rw-r--r--   1 root root     0 nov 21 17:02 sysvinit
> -rw-r--r--   1 root root     3 nov 21 17:02 whoami
>
> and all works as expected, only i supposed to find as a journal not the
> device (/dev/sda5) but the uuid (/dev/disk/by-uuid/).
>
> But it seems that the cache partition does not have an UUID associated:
>
> root@blackpanther:~# ls -la /dev/disk/by-uuid/ | grep sda5
> root@blackpanther:~# blkid /dev/sda5
> /dev/sda5: PARTUUID="a222c6bf-05"
>
> I'm a bit ''puzzled'' because if i've to add a disk ''before'' sda, all
> device names will change with, i suppose, unexpected results.
>
> Am I missing something? Thanks.

I was forced to change some journals, using some partition (MBR); i've stopped the osd, flushed the old journal, changed the symlink and then did a 'journal format':

root@deadpool:/var/lib/ceph/osd/ceph-6# ls -la
totale 64
drwxr-xr-x   3 root root   199 feb  6 17:45 .
drwxr-xr-x   6 root root  4096 dic 14  2016 ..
-rw-r--r--   1 root root   751 dic 14  2016 activate.monmap
-rw-r--r--   1 root root     3 dic 14  2016 active
-rw-r--r--   1 root root    37 dic 14  2016 ceph_fsid
drwxr-xr-x 378 root root 20480 feb  6 17:12 current
-rw-r--r--   1 root root    37 dic 14  2016 fsid
lrwxrwxrwx   1 root root     9 feb  6 17:45 journal -> /dev/sda5
-rw-------   1 root root    56 dic 14  2016 keyring
-rw-r--r--   1 root root    21 dic 14  2016 magic
-rw-r--r--   1 root root     6 dic 14  2016 ready
-rw-r--r--   1 root root     4 dic 14  2016 store_version
-rw-r--r--   1 root root    53 dic 14  2016 superblock
-rw-r--r--   1 root root     0 feb  6 17:10 sysvinit
-rw-r--r--   1 root root     2 dic 14  2016 whoami

root@deadpool:/var/lib/ceph/osd/ceph-6# ceph-osd -i 6 --mkjournal
2019-02-06 17:45:35.030359 7ff679c24880 -1 journal check: ondisk fsid ---- doesn't match expected 70357923-3227-4d57-980f-92b8c853dc76, invalid (someone else's?) journal
2019-02-06 17:45:35.038522 7ff679c24880 -1 created new journal /var/lib/ceph/osd/ceph-6/journal for object store /var/lib/ceph/osd/ceph-6

Clearly i've changed the journal partition by hand (eg, a direct link), so i'm expecting that the link is 'direct to partition'; but, as the warning about the fsid shows, there's still no 'id' associated to that partition (eg, no link in /dev/disk/by-*/).

If i rerun the 'mkjournal':

root@deadpool:/var/lib/ceph/osd/ceph-6# ceph-osd -i 6 --mkjournal
2019-02-06 17:45:37.621855 7f3391377880 -1 created new journal /var/lib/ceph/osd/ceph-6/journal for object store /var/lib/ceph/osd/ceph-6

So it seems that the journal partition effectively gets 'tagged' in some way. But i'm still confused... does using an ID link for journal partitions work only for GPT partitioning?

Thanks.

--
dott.
Marco Gaiarin GNUPG Key ID: 240A3D66 Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/ Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN) marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797 Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA! http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000 (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
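For what it's worth, the /dev/disk/by-partuuid/ links that ceph-disk normally points journals at come from GPT partition GUIDs; a hand-made journal on a GPT partition can be re-linked to a stable name roughly like this (partition number, OSD id and the GUID are placeholders, and the OSD should be stopped first):

# sgdisk --info=5 /dev/sda            # shows "Partition unique GUID" on GPT disks
# ls -l /dev/disk/by-partuuid/
# ln -sf /dev/disk/by-partuuid/<guid> /var/lib/ceph/osd/ceph-6/journal

MBR partitions only get the short blkid-style PARTUUID (like the a222c6bf-05 above), which older udev does not expose under /dev/disk/by-partuuid/, which would explain the missing link.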
Re: [ceph-users] backfill_toofull after adding new OSDs
Let's try to restrict discussion to the original thread "backfill_toofull while OSDs are not full" and get a tracker opened up for this issue. On Sat, Feb 2, 2019 at 11:52 AM Fyodor Ustinov wrote: > > Hi! > > Right now, after adding OSD: > > # ceph health detail > HEALTH_ERR 74197563/199392333 objects misplaced (37.212%); Degraded data > redundancy (low space): 1 pg backfill_toofull > OBJECT_MISPLACED 74197563/199392333 objects misplaced (37.212%) > PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull > pg 6.eb is active+remapped+backfill_wait+backfill_toofull, acting > [21,0,47] > > # ceph pg ls-by-pool iscsi backfill_toofull > PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES LOG STATE > STATE_STAMPVERSION REPORTED UP > ACTING SCRUB_STAMPDEEP_SCRUB_STAMP > 6.eb 6450 1290 0 1645654016 3067 > active+remapped+backfill_wait+backfill_toofull 2019-02-02 00:20:32.975300 > 7208'6567 9790:16214 [5,1,21]p5 [21,0,47]p21 2019-01-18 04:13:54.280495 > 2019-01-18 04:13:54.280495 > > All OSD have less 40% USE. > > ID CLASS WEIGHT REWEIGHT SIZEUSE AVAIL %USE VAR PGS > 0 hdd 9.56149 1.0 9.6 TiB 3.2 TiB 6.3 TiB 33.64 1.31 313 > 1 hdd 9.56149 1.0 9.6 TiB 3.3 TiB 6.3 TiB 34.13 1.33 295 > 5 hdd 9.56149 1.0 9.6 TiB 756 GiB 8.8 TiB 7.72 0.30 103 > 47 hdd 9.32390 1.0 9.3 TiB 3.1 TiB 6.2 TiB 33.75 1.31 306 > > (all other OSD also have less 40%) > > ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable) > > Maybe the developers will pay attention to the letter and say something? > > - Original Message - > From: "Fyodor Ustinov" > To: "Caspar Smit" > Cc: "Jan Kasprzak" , "ceph-users" > Sent: Thursday, 31 January, 2019 16:50:24 > Subject: Re: [ceph-users] backfill_toofull after adding new OSDs > > Hi! > > I saw the same several times when I added a new osd to the cluster. One-two > pg in "backfill_toofull" state. > > In all versions of mimic. > > - Original Message - > From: "Caspar Smit" > To: "Jan Kasprzak" > Cc: "ceph-users" > Sent: Thursday, 31 January, 2019 15:43:07 > Subject: Re: [ceph-users] backfill_toofull after adding new OSDs > > Hi Jan, > > You might be hitting the same issue as Wido here: > > [ https://www.spinics.net/lists/ceph-users/msg50603.html | > https://www.spinics.net/lists/ceph-users/msg50603.html ] > > Kind regards, > Caspar > > Op do 31 jan. 2019 om 14:36 schreef Jan Kasprzak < [ mailto:k...@fi.muni.cz | > k...@fi.muni.cz ] >: > > > Hello, ceph users, > > I see the following HEALTH_ERR during cluster rebalance: > > Degraded data redundancy (low space): 8 pgs backfill_toofull > > Detailed description: > I have upgraded my cluster to mimic and added 16 new bluestore OSDs > on 4 hosts. The hosts are in a separate region in my crush map, and crush > rules prevented data to be moved on the new OSDs. Now I want to move > all data to the new OSDs (and possibly decomission the old filestore OSDs). > I have created the following rule: > > # ceph osd crush rule create-replicated on-newhosts newhostsroot host > > after this, I am slowly moving the pools one-by-one to this new rule: > > # ceph osd pool set test-hdd-pool crush_rule on-newhosts > > When I do this, I get the above error. This is misleading, because > ceph osd df does not suggest the OSDs are getting full (the most full > OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR > disappears. Why am I getting this error? > > # ceph -s > cluster: > id: ...my UUID... 
> health: HEALTH_ERR > 1271/3803223 objects misplaced (0.033%) > Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs > degraded, 67 pgs undersized > Degraded data redundancy (low space): 8 pgs backfill_toofull > > services: > mon: 3 daemons, quorum mon1,mon2,mon3 > mgr: mon2(active), standbys: mon1, mon3 > osd: 80 osds: 80 up, 80 in; 90 remapped pgs > rgw: 1 daemon active > > data: > pools: 13 pools, 5056 pgs > objects: 1.27 M objects, 4.8 TiB > usage: 15 TiB used, 208 TiB / 224 TiB avail > pgs: 40124/3803223 objects degraded (1.055%) > 1271/3803223 objects misplaced (0.033%) > 4963 active+clean > 41 active+recovery_wait+undersized+degraded+remapped > 21 active+recovery_wait+undersized+degraded > 17 active+remapped+backfill_wait > 5 active+remapped+backfill_wait+backfill_toofull > 3 active+remapped+backfill_toofull > 2 active+recovering+undersized+remapped > 2 active+recovering+undersized+degraded+remapped > 1 active+clean+remapped > 1 active+recovering+undersized+degraded > > io: > client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr > recovery: 2.0 MiB/s, 92 objects/s > > Thanks for any hint, > > -Yenya > > -- > | Jan "Yenya" Kasprzak http://fi.muni.cz/ | fi.muni.cz ] - work | > [ http://yenya.net/ | yenya.net ] - private}> | > | [ http://www.fi.muni.cz/~kas/ | http://www.fi.muni.cz/~kas/ ] GPG: > 4096R/A45477D5 |
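For anyone debugging the same thing: the thresholds behind backfill_toofull are cluster-wide per-OSD ratios, not pool settings, and can be checked (and, cautiously, raised) like this; the value shown is only an example and should be reverted once backfill finishes:

$ ceph osd dump | grep ratio
$ ceph osd set-backfillfull-ratio 0.92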
[ceph-users] CephFS overwrite/truncate performance hit
I'm seeing some interesting performance issues with file overwriting on CephFS. Creating lots of files is fast: for i in $(seq 1 1000); do echo $i; echo test > a.$i done Deleting lots of files is fast: rm a.* As is creating them again. However, repeatedly creating the same file over and over again is slow: for i in $(seq 1 1000); do echo $i; echo test > a done And it's still slow if the file is created with a new name and then moved over: for i in $(seq 1 1000); do echo $i; echo test > a.$i; mv a.$i a done While appending to a single file is really fast: for i in $(seq 1 1000); do echo $i; echo test >> a done As is repeatedly writing to offset 0: for i in $(seq 1 1000); do echo $i; echo $RANDOM | dd of=a bs=128 conv=notrunc done But truncating the file first slows it back down again: for i in $(seq 1 1000); do echo $i; truncate -s 0 a; echo test >> a done All of these things are reasonably fast on a local FS, of course. I'm using the kernel client (4.18) with Ceph 13.2.4, and the relevant CephFS data and metadata pools are rep-3 on HDDs. It seems to me that any operation that *reduces* a file's size for any given filename, or replaces it with another inode, has a large overhead. I have an application that stores some flag data in a file, using the usual open/write/close/rename dance to atomically overwrite it, and this operation is currently the bottleneck (while doing a bunch of other processing on files on CephFS). I'm considering changing it to use a xattr to store the data instead, which seems like it should be atomic and performs a lot better: for i in $(seq 1 1000); do echo $i; setfattr -n user.foo -v "test$RANDOM" a done Alternatively, is there a more CephFS-friendly atomic overwrite pattern than the usual open/write/close/rename? Can it e.g. guarantee that a write at offset 0 of less than the page size is atomic? I could easily make the writes equal-sized and thus avoid truncations and remove the rename dance, if I can guarantee they're atomic. Is there any documentation on what write operations incur significant overhead on CephFS like this, and why? This particular issue isn't mentioned in http://docs.ceph.com/docs/master/cephfs/app-best-practices/ (which seems like it mostly deals with reads, not writes). -- Hector Martin (hec...@marcansoft.com) Public Key: https://mrcn.st/pub ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
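A sketch of the fixed-size in-place overwrite pattern mentioned above; the file name and record size are examples, and whether such a small overwrite is atomic for concurrent CephFS readers is exactly the open question, so this only avoids the truncate/rename cost rather than proving atomicity:

# pad the payload to a constant record size and overwrite in place,
# avoiding both truncate and rename
printf '%-4096s' "state=$RANDOM" | dd of=flagfile bs=4096 count=1 conv=notrunc 2>/dev/null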
[ceph-users] rados block on SSD - performance - how to tune and get insight?
Hi List We are in the process of moving to the next usecase for our ceph cluster (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and that works fine. We're currently on luminous / bluestore, if upgrading is deemed to change what we're seeing then please let us know. We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. Connected through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to deadline, nomerges = 1, rotational = 0. Each disk "should" give approximately 36K IOPS random write and the double random read. Pool is setup with a 3x replicaiton. We would like a "scaleout" setup of well performing SSD block devices - potentially to host databases and things like that. I ready through this nice document [0], I know the HW are radically different from mine, but I still think I'm in the very low end of what 6 x S4510 should be capable of doing. Since it is IOPS i care about I have lowered block size to 4096 -- 4M blocksize nicely saturates the NIC's in both directions. $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup hints = 1 Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects Object prefix: benchmark_data_torsk2_11207 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 0 0 0 0 0 0 - 0 1 16 5857 5841 22.8155 22.8164 0.00238437 0.00273434 2 15 11768 11753 22.9533 23.0938 0.0028559 0.00271944 3 16 17264 17248 22.4564 21.4648 0.0024 0.00278101 4 16 22857 22841 22.3037 21.84770.002716 0.00280023 5 16 28462 28446 22.2213 21.8945 0.002201860.002811 6 16 34216 34200 22.2635 22.4766 0.00234315 0.00280552 7 16 39616 39600 22.0962 21.0938 0.00290661 0.00282718 8 16 45510 45494 22.2118 23.0234 0.0033541 0.00281253 9 16 50995 50979 22.1243 21.4258 0.00267282 0.00282371 10 16 56745 56729 22.1577 22.4609 0.00252583 0.0028193 Total time run: 10.002668 Total writes made: 56745 Write size: 4096 Object size:4096 Bandwidth (MB/sec): 22.1601 Stddev Bandwidth: 0.712297 Max bandwidth (MB/sec): 23.0938 Min bandwidth (MB/sec): 21.0938 Average IOPS: 5672 Stddev IOPS:182 Max IOPS: 5912 Min IOPS: 5400 Average Latency(s): 0.00281953 Stddev Latency(s): 0.00190771 Max latency(s): 0.0834767 Min latency(s): 0.00120945 Min latency is fine -- but Max latency of 83ms ? Average IOPS @ 5672 ? $ sudo rados bench -p scbench 10 rand hints = 1 sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 0 0 0 0 0 0 - 0 1 15 23329 23314 91.0537 91.0703 0.000349856 0.000679074 2 16 48555 48539 94.7884 98.5352 0.000499159 0.000652067 3 16 76193 76177 99.1747 107.961 0.000443877 0.000622775 4 15103923103908 101.459 108.324 0.000678589 0.000609182 5 15132720132705 103.663 112.488 0.000741734 0.000595998 6 15161811161796 105.323 113.637 0.000333166 0.000586323 7 15190196190181 106.115 110.879 0.000612227 0.000582014 8 15221155221140 107.966 120.934 0.000471219 0.000571944 9 16251143251127 108.984 117.137 0.000267528 0.000566659 Total time run: 10.000640 Total reads made: 282097 Read size:4096 Object size: 4096 Bandwidth (MB/sec): 110.187 Average IOPS: 28207 Stddev IOPS: 2357 Max IOPS: 30959 Min IOPS: 23314 Average Latency(s): 0.000560402 Max latency(s): 0.109804 Min latency(s): 0.000212671 This is also quite far from expected. I have 12GB of memory on the OSD daemon for caching on each host - close to idle cluster - thus 50GB+ for caching with a working set of < 6GB .. this should - in this case not really be bound by the underlying SSD. 
But if it were: IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off? No measureable service time in iostat when running tests, thus I have come to the conclusion that it has to be either client side, the network path, or the OSD-daemon that deliveres the increasing latency / decreased IOPS. Is there any suggestions on how to get more insigths in that? Has anyone replicated close to the number Micron are reporting on NVMe? Thanks a log. [0] https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en ___ ceph-users mailing list ceph-user
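One way to push more parallelism at the cluster than a single default rados bench client manages, as a rough sketch (pool, image and device names are examples): raise the rados bench concurrency with -t, or map an image and drive it with fio from one or more clients in parallel.

$ sudo rados bench -p scbench -b 4096 -t 64 10 write --no-cleanup

$ rbd create --size 50G scbench-rbd/fiotest
$ sudo rbd map scbench-rbd/fiotest
$ fio --name=randwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=8 --runtime=60 --time_based \
      --group_reporting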
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
This seems right. You are doing a single benchmark from a single client. Your limiting factor will be the network latency. For most networks this is between 0.2 and 0.3ms. if you're trying to test the potential of your cluster, you'll need multiple workers and clients. On Thu, Feb 7, 2019, 2:17 AM Hi List > > We are in the process of moving to the next usecase for our ceph cluster > (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and > that works fine. > > We're currently on luminous / bluestore, if upgrading is deemed to > change what we're seeing then please let us know. > > We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. Connected > through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to > deadline, nomerges = 1, rotational = 0. > > Each disk "should" give approximately 36K IOPS random write and the double > random read. > > Pool is setup with a 3x replicaiton. We would like a "scaleout" setup of > well performing SSD block devices - potentially to host databases and > things like that. I ready through this nice document [0], I know the > HW are radically different from mine, but I still think I'm in the > very low end of what 6 x S4510 should be capable of doing. > > Since it is IOPS i care about I have lowered block size to 4096 -- 4M > blocksize nicely saturates the NIC's in both directions. > > > $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup > hints = 1 > Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for > up to 10 seconds or 0 objects > Object prefix: benchmark_data_torsk2_11207 > sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg > lat(s) > 0 0 0 0 0 0 - > 0 > 1 16 5857 5841 22.8155 22.8164 0.00238437 > 0.00273434 > 2 15 11768 11753 22.9533 23.0938 0.0028559 > 0.00271944 > 3 16 17264 17248 22.4564 21.4648 0.0024 > 0.00278101 > 4 16 22857 22841 22.3037 21.84770.002716 > 0.00280023 > 5 16 28462 28446 22.2213 21.8945 0.00220186 > 0.002811 > 6 16 34216 34200 22.2635 22.4766 0.00234315 > 0.00280552 > 7 16 39616 39600 22.0962 21.0938 0.00290661 > 0.00282718 > 8 16 45510 45494 22.2118 23.0234 0.0033541 > 0.00281253 > 9 16 50995 50979 22.1243 21.4258 0.00267282 > 0.00282371 >10 16 56745 56729 22.1577 22.4609 0.00252583 > 0.0028193 > Total time run: 10.002668 > Total writes made: 56745 > Write size: 4096 > Object size:4096 > Bandwidth (MB/sec): 22.1601 > Stddev Bandwidth: 0.712297 > Max bandwidth (MB/sec): 23.0938 > Min bandwidth (MB/sec): 21.0938 > Average IOPS: 5672 > Stddev IOPS:182 > Max IOPS: 5912 > Min IOPS: 5400 > Average Latency(s): 0.00281953 > Stddev Latency(s): 0.00190771 > Max latency(s): 0.0834767 > Min latency(s): 0.00120945 > > Min latency is fine -- but Max latency of 83ms ? > Average IOPS @ 5672 ? 
> > $ sudo rados bench -p scbench 10 rand > hints = 1 > sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg > lat(s) > 0 0 0 0 0 0 - > 0 > 1 15 23329 23314 91.0537 91.0703 0.000349856 > 0.000679074 > 2 16 48555 48539 94.7884 98.5352 0.000499159 > 0.000652067 > 3 16 76193 76177 99.1747 107.961 0.000443877 > 0.000622775 > 4 15103923103908 101.459 108.324 0.000678589 > 0.000609182 > 5 15132720132705 103.663 112.488 0.000741734 > 0.000595998 > 6 15161811161796 105.323 113.637 0.000333166 > 0.000586323 > 7 15190196190181 106.115 110.879 0.000612227 > 0.000582014 > 8 15221155221140 107.966 120.934 0.000471219 > 0.000571944 > 9 16251143251127 108.984 117.137 0.000267528 > 0.000566659 > Total time run: 10.000640 > Total reads made: 282097 > Read size:4096 > Object size: 4096 > Bandwidth (MB/sec): 110.187 > Average IOPS: 28207 > Stddev IOPS: 2357 > Max IOPS: 30959 > Min IOPS: 23314 > Average Latency(s): 0.000560402 > Max latency(s): 0.109804 > Min latency(s): 0.000212671 > > This is also quite far from expected. I have 12GB of memory on the OSD > daemon for caching on each host - close to idle cluster - thus 50GB+ for > caching with a working set of < 6GB .. this should - in this case > not really be bound by the underlying SSD. But if it were: > > IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off? > > No measureable service time in iostat when running tests, thus I have > come to the conclusion that it has to be either client side,
Re: [ceph-users] rados block on SSD - performance - how to tune and get insight?
Hello, On Thu, 7 Feb 2019 08:17:20 +0100 jes...@krogh.cc wrote: > Hi List > > We are in the process of moving to the next usecase for our ceph cluster > (Bulk, cheap, slow, erasurecoded, cephfs) storage was the first - and > that works fine. > > We're currently on luminous / bluestore, if upgrading is deemed to > change what we're seeing then please let us know. > > We have 6 OSD hosts, each with a S4510 of 1TB with 1 SSD in each. Connected > through a H700 MegaRaid Perc BBWC, EachDiskRaid0 - and scheduler set to > deadline, nomerges = 1, rotational = 0. > I'd make sure that the endurance of these SSDs is in line with your expected usage. > Each disk "should" give approximately 36K IOPS random write and the double > random read. > Only locally, latency is your enemy. Tell us more about your network. > Pool is setup with a 3x replicaiton. We would like a "scaleout" setup of > well performing SSD block devices - potentially to host databases and > things like that. I ready through this nice document [0], I know the > HW are radically different from mine, but I still think I'm in the > very low end of what 6 x S4510 should be capable of doing. > > Since it is IOPS i care about I have lowered block size to 4096 -- 4M > blocksize nicely saturates the NIC's in both directions. > > rados bench is not the sharpest tool in the shed for this. As it needs to allocate stuff to begin with, amongst other things. And before you go "fio with RBD engine", that had major issues in my experience, too. Your best and most realistic results will come from doing the testing inside a VM (I presume from your use case) or a mounted RBD block device. And then using fio, of course. > $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup > hints = 1 > Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for > up to 10 seconds or 0 objects > Object prefix: benchmark_data_torsk2_11207 > sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) > 0 0 0 0 0 0 - 0 > 1 16 5857 5841 22.8155 22.8164 0.00238437 0.00273434 > 2 15 11768 11753 22.9533 23.0938 0.0028559 0.00271944 > 3 16 17264 17248 22.4564 21.4648 0.0024 0.00278101 > 4 16 22857 22841 22.3037 21.84770.002716 0.00280023 > 5 16 28462 28446 22.2213 21.8945 0.002201860.002811 > 6 16 34216 34200 22.2635 22.4766 0.00234315 0.00280552 > 7 16 39616 39600 22.0962 21.0938 0.00290661 0.00282718 > 8 16 45510 45494 22.2118 23.0234 0.0033541 0.00281253 > 9 16 50995 50979 22.1243 21.4258 0.00267282 0.00282371 >10 16 56745 56729 22.1577 22.4609 0.00252583 0.0028193 > Total time run: 10.002668 > Total writes made: 56745 > Write size: 4096 > Object size:4096 > Bandwidth (MB/sec): 22.1601 > Stddev Bandwidth: 0.712297 > Max bandwidth (MB/sec): 23.0938 > Min bandwidth (MB/sec): 21.0938 > Average IOPS: 5672 > Stddev IOPS:182 > Max IOPS: 5912 > Min IOPS: 5400 > Average Latency(s): 0.00281953 > Stddev Latency(s): 0.00190771 > Max latency(s): 0.0834767 > Min latency(s): 0.00120945 > > Min latency is fine -- but Max latency of 83ms ? Outliers during setup are to be expected and ignored > Average IOPS @ 5672 ? > Plenty of good reasons to come up with that number, yes. 
> $ sudo rados bench -p scbench 10 rand > hints = 1 > sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) > 0 0 0 0 0 0 - 0 > 1 15 23329 23314 91.0537 91.0703 0.000349856 0.000679074 > 2 16 48555 48539 94.7884 98.5352 0.000499159 0.000652067 > 3 16 76193 76177 99.1747 107.961 0.000443877 0.000622775 > 4 15103923103908 101.459 108.324 0.000678589 0.000609182 > 5 15132720132705 103.663 112.488 0.000741734 0.000595998 > 6 15161811161796 105.323 113.637 0.000333166 0.000586323 > 7 15190196190181 106.115 110.879 0.000612227 0.000582014 > 8 15221155221140 107.966 120.934 0.000471219 0.000571944 > 9 16251143251127 108.984 117.137 0.000267528 0.000566659 > Total time run: 10.000640 > Total reads made: 282097 > Read size:4096 > Object size: 4096 > Bandwidth (MB/sec): 110.187 > Average IOPS: 28207 > Stddev IOPS: 2357 > Max IOPS: 30959 > Min IOPS: 23314 > Average Latency(s): 0.000560402 > Max latency(s): 0.109804 > Min latency(s): 0.000212671 > > This is also quite far from expected. I have 12GB of memory on the