Re: [ceph-users] clock skew
Hi John, list,

On 1-4-2017 16:18, John Petrini wrote:
> Just ntp.

Just to follow up on this: we have not yet experienced a clock skew since we started using chrony. It has only been three days, I know, but still... Perhaps you should try it too, and report whether it (seems to) work better for you as well. But again, it has only been three days, so it could be that I am cheering too early. MJ
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
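For anyone who wants to try the same, a minimal sketch of the switch to chrony and of how to verify it, assuming a Red Hat-style layout (/etc/chrony.conf) and a placeholder NTP server name; adjust for your distribution:

# install and enable chrony instead of ntpd (package/service names vary per distro)
yum install -y chrony
systemctl disable --now ntpd
systemctl enable --now chronyd

# /etc/chrony.conf (minimal example)
server ntp.example.com iburst
makestep 1.0 3
driftfile /var/lib/chrony/drift

# verify synchronisation on each mon host, then check what the cluster itself thinks
chronyc tracking
chronyc sources -v
ceph health detail | grep -i "clock skew"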
[ceph-users] Why is cls_log_add logging so much?
On a busy cluster, I'm seeing a couple of OSDs logging millions of lines like this:

2017-04-04 06:35:18.240136 7f40ff873700 0 cls/log/cls_log.cc:129: storing entry at 1_1491287718.237118_57657708.1
2017-04-04 06:35:18.244453 7f4102078700 0 cls/log/cls_log.cc:129: storing entry at 1_1491287718.241622_57657709.1
2017-04-04 06:35:18.296092 7f40ff873700 0 cls/log/cls_log.cc:129: storing entry at 1_1491287718.296308_57657710.1

1. Can someone explain what these messages mean? It seems strange to me that only a few OSDs generate them.
2. Why are they being generated at debug level 0, meaning that they cannot be filtered? This should not happen for a non-error message that can be generated at least 50 times per second.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
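Not an answer to why these are emitted at level 0, but for reference, this is how per-subsystem OSD logging is normally tuned at runtime; the osd.3 id is just a placeholder, and the assumption that these entries fall under the objclass subsystem is mine. Note that a message logged at level 0 is printed regardless of how low the threshold is set, which is exactly the problem described above:

# show the current setting on one OSD
ceph daemon osd.3 config get debug_objclass

# lower the log/memory levels at runtime (no effect on level-0 messages)
ceph tell osd.3 injectargs '--debug-objclass 0/0'

# or persistently in ceph.conf
[osd]
    debug objclass = 0/0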
Re: [ceph-users] Troubleshooting incomplete PG's
Hello Sage and Brad, Many thanks for the information.

> incomplete PGs can be extracted from the drive if the bad sector(s) don't
> happen to affect those pgs. The ceph-objectstore-tool --op export command
> can be used for this (extract it from the affected drive and add it to
> some other osd).

==
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.fs1 --op export --file /tmp/test
Exporting 1.fs1
Export successful
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op import --file /tmp/test
Importing pgid 1.fs1
Import successful
==

I will try this the next time the issue recurs. I also need your suggestion on fixing the unfound errors that happened in another environment: v11.2.0, bluestore, EC 4+1.

=== 1 active+degraded, 8191 active+clean; 29494 GB data, 39323 GB used, 1180 TB / 1218 TB avail; 2/66917305 objects degraded (0.000%); 1/13383461 unfound (0.000%) ===
=== pg 1.93f is active+degraded, acting [206,99,11,290,169], 1 unfound ===

What we tried:
# Restarted all the OSDs associated with that PG (206,99,11,290,169); all OSDs are up and running
# ceph pg repair 1.93f
# ceph pg deep-scrub 1.93f
and, as a last resort:
# ceph pg 1.93f mark_unfound_lost delete { data loss }

I need your views on how to clear the unfound issues without data loss. Thanks, Jayaram

On Mon, Apr 3, 2017 at 6:50 PM, Sage Weil wrote:
> On Fri, 31 Mar 2017, nokia ceph wrote:
> > Hello Brad,
> > Many thanks for the info :)
> >
> > ENV: Kraken - bluestore - EC 4+1 - 5 node cluster - RHEL7
> >
> > What is the status of the down+out osd? Only one osd, osd.6, is down and out from the cluster.
> > What role did/does it play? Most importantly, is it osd.6? Yes, due to an underlying I/O error issue we removed this device from the cluster.
>
> Is the device completely destroyed or is it only returning errors
> when reading certain data? It is likely that some (or all) of the
> incomplete PGs can be extracted from the drive if the bad sector(s) don't
> happen to affect those pgs. The ceph-objectstore-tool --op export command
> can be used for this (extract it from the affected drive and add it to
> some other osd).
>
> > I put this parameter "osd_find_best_info_ignore_history_les = true" in
> > ceph.conf, and found those 22 PG's were changed to "down+remapped". Now all
> > are reverted to "remapped+incomplete" state.
>
> This is usually not a great idea unless you're out of options, by the way!
> > # ceph pg stat 2> /dev/null
> > v2731828: 4096 pgs: 1 incomplete, 21 remapped+incomplete, 4074 active+clean; 268 TB data, 371 TB used, 267 TB / 638 TB avail
> >
> > ## ceph -s
> > 2017-03-30 19:02:14.350242 7f8b0415f700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
> > 2017-03-30 19:02:14.366545 7f8b0415f700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
> >     cluster bd8adcd0-c36d-4367-9efe-f48f5ab5f108
> >      health HEALTH_ERR
> >             22 pgs are stuck inactive for more than 300 seconds
> >             22 pgs incomplete
> >             22 pgs stuck inactive
> >             22 pgs stuck unclean
> >      monmap e2: 5 mons at {au-adelaide=10.50.21.24:6789/0,au-brisbane=10.50.21.22:6789/0,au-canberra=10.50.21.23:6789/0,au-melbourne=10.50.21.21:6789/0,au-sydney=10.50.21.20:6789/0}
> >             election epoch 180, quorum 0,1,2,3,4 au-sydney,au-melbourne,au-brisbane,au-canberra,au-adelaide
> >         mgr active: au-adelaide
> >      osdmap e6506: 117 osds: 117 up, 117 in; 21 remapped pgs
> >             flags sortbitwise,require_jewel_osds,require_kraken_osds
> >       pgmap v2731828: 4096 pgs, 1 pools, 268 TB data, 197 Mobjects
> >             371 TB used, 267 TB / 638 TB avail
> >                 4074 active+clean
> >                   21 remapped+incomplete
> >                    1 incomplete
> >
> > ## ceph osd dump 2>/dev/null | grep cdvr
> > pool 1 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 456 flags hashpspool,nodeep-scrub stripe_width 65536
> >
> > Inspecting the affected PG 1.e4b:
> >
> > # ceph pg dump 2> /dev/null | grep 1.e4b
> > 1.e4b 50832 0 0 0 0 73013340821 10006 10006 remapped+incomplete 2017-03-30 14:14:26.297098 3844'161662 6506:325748 [113,66,15,73,103] 113 [NONE,NONE,NONE,73,NONE] 73 1643'139486 2017-03-21 04:56:16.683953 0'0 2017-02-21 10:33:50.012922
> >
> > When I trigger the command below:
> >
> > # ceph pg force_create_pg 1.e4b
> > pg 1.e4b now creating, ok
> >
> > it goes to the creating state, with no change after that. Can you explain why this PG shows null values after triggering "force_create_pg"?
> >
> > # ceph pg dump 2> /dev/null | grep 1.e4b
> > 1.e4b 0 0 0 0 0 0 0 0 creating 2017-03-30 19:07:00.982178
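On clearing the unfound object without immediately resorting to delete, a rough sketch of the usual investigation steps; the PG id is the one from the report above, the rest is generic and worth double-checking against your release:

# which objects are unfound, and which OSDs the PG still wants to query
ceph pg 1.93f list_missing
ceph pg 1.93f query | grep -A 20 might_have_unfound

# if an OSD listed there is down but its data is intact, exporting/importing the PG shard
# with ceph-objectstore-tool (as shown above) may let the cluster find the object again.

# only once every possible source has been ruled out:
#  - 'revert' rolls back to a previous version, but is not available for erasure-coded pools
#  - on an EC 4+1 pool, 'delete' is generally the only option and it loses the object
ceph pg 1.93f mark_unfound_lost revert
ceph pg 1.93f mark_unfound_lost delete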
[ceph-users] Librbd logging
Hello cephers, I have a situation where from time to time a write operation to the ceph storage hangs for 3-5 seconds. For testing we have a simple loop like:

while sleep 1; do date >> logfile; done &

With this we can see that, rarely, there are 3 seconds or more between consecutive outputs of date. Initially we suspected the deep scrub and we tuned its parameters, so right now I'm confident that the reason is something other than deep scrubbing.

I would like to know if any of you have encountered a similar situation, and what the solution for it was. I suspect the network between the compute nodes and the storage, but I need to prove this. I am thinking of enabling client-side logging for librbd, but I see there are many subsystems where logging can be enabled. Can anyone tell me which subsystems I should log, and at which level, to be able to see whether the network is causing the write issues? We're using ceph 0.94.10.

Thank you, Laszlo
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
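For client-side librbd logging specifically, a minimal sketch of what could go into ceph.conf on the compute nodes; the log path is an example, and the qemu/libvirt user must be able to write to both the log and the socket path or nothing will appear:

[client]
    log file = /var/log/ceph/client.$name.$pid.log
    admin socket = /var/run/ceph/$cluster-$name.$pid.asok
    debug rbd = 20
    debug rados = 10
    debug objecter = 10
    debug ms = 1

"debug ms = 1" records message send/receive timestamps, which is usually enough to tell whether a slow write is sitting in the network or waiting on the OSD side.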
Re: [ceph-users] Get/set/list rbd image using python librbd
Ok, thanks Jason for submitting the feature request ticket. But I am afraid I can't contribute right now, for lack of C/C++ :D

On Mon, Apr 3, 2017 at 9:09 PM, Jason Dillaman wrote:
> We try to keep the C/C++ and Python APIs in-sync, but it looks like
> these functions were missed and are not currently available via the
> Python API. I created a tracker ticket [1] for the issue. If you are
> interested, feel free to contribute a pull request for the missing
> APIs.
>
> [1] http://tracker.ceph.com/issues/19451
>
> On Sun, Apr 2, 2017 at 8:17 PM, Sayid Munawar wrote:
> > Hi,
> >
> > Using the rbd command line, we can set / get / list the image-meta of an rbd image
> > as described in the man page.
> >
> > # rbd image-meta list mypool/myimage
> >
> > How can we do the same using the python librbd bindings? I can't find it in the
> > documentation.
> >
> > with rados.Rados(conffile='my_ceph.conf') as cluster:
> >     with cluster.open_ioctx('mypool') as ioctx:
> >         with rbd.Image(ioctx, 'myimage') as image:
> >             image._some_method_to_set_metadata() ???
> >
> > Thank you
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Jason
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
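Until the bindings grow these calls, the metadata can still be managed from the command line (or via subprocess from Python); a sketch using the existing rbd CLI, with pool, image and key names as placeholders:

rbd image-meta set mypool/myimage conf_rbd_cache false
rbd image-meta get mypool/myimage conf_rbd_cache
rbd image-meta list mypool/myimage
rbd image-meta remove mypool/myimage conf_rbd_cache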
Re: [ceph-users] Librbd logging
Couple options: 1) you can enable LTTng-UST tracing [1][2] against your VM for an extremely light-weight way to track IO latencies. 2) you can enable "debug rbd = 20" and grep through the logs for matching "AioCompletion.*(set_request_count|finalize)" log entries 3) use the asok file during one of these events to dump the objecter requests [1] http://docs.ceph.com/docs/jewel/rbd/rbd-replay/ [2] http://tracker.ceph.com/issues/14629 On Tue, Apr 4, 2017 at 7:36 AM, Laszlo Budai wrote: > Hello cephers, > > I have a situation where from time to time the write operation to the seph > storage hangs for 3-5 seconds. For testing we have a simple line like: > while sleep 1; date >> logfile; done & > > with this we can see that rarely there are 3 seconds or more differences > between the consecutive outputs of date. > Initially we have suspected the deep scrub and we have tuned its parameters, > so right now I'm confident that the reason is something different than the > deep scrubbing. > > I would like to know if any of you has encountered a similar situation, and > what was the solution for it. > I am suspecting the network between the compute nodes and the storage, but I > need to prove this. I am thinking on enabling client side logging for > librbd, but I see there are many subsystems where the logging can be > enabled. Can anyone tell me which subsystem should I log, and at which level > to be able to see whether the network is causing write issues? > We're using ceph 0.94.10. > > Thank you, > Laszlo > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
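For option 3, a sketch of what that can look like in practice; the socket path depends on the "admin socket" setting in the [client] section, and the pid in the file name (12345 here) is made up:

# find the client's admin socket on the compute node
ls /var/run/ceph/*.asok

# dump in-flight requests while a write is hanging
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests

# perf counters can also show where latency is accumulating
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok perf dump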
[ceph-users] ceph pg inconsistencies - omap data lost
Hi all, We have a weird issue with a few inconsistent PGs. We are running ceph 11.2 on RHEL7. As an example inconsistent PG we have: # rados -p volumes list-inconsistent-obj 4.19 {"epoch":83986,"inconsistents":[{"object":{"name":"rbd_header.08f7fa43a49c7f","nspace":"","locator":"","snap":"head","version":28785242},"errors":[],"union_shard_errors":["omap_digest_mismatch_oi"],"selected_object_info":"4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd od alloc_hint [0 0 0])","shards":[{"osd":10,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":20,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"},{"osd":29,"errors":["omap_digest_mismatch_oi"],"size":0,"omap_digest":"0x62b5dcb6","data_digest":"0x"}]}]} If I try to repair this PG, I get the following in the OSD logs: 2017-04-04 14:31:37.825833 7f2d7f802700 -1 log_channel(cluster) log [ERR] : 4.19 shard 10: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd od alloc_hint [0 0 0]) 2017-04-04 14:31:37.825863 7f2d7f802700 -1 log_channel(cluster) log [ERR] : 4.19 shard 20: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd od alloc_hint [0 0 0]) 2017-04-04 14:31:37.825870 7f2d7f802700 -1 log_channel(cluster) log [ERR] : 4.19 shard 29: soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head omap_digest 0x62b5dcb6 != omap_digest 0x from auth oi 4:9843f136:::rbd_header.08f7fa43a49c7f:head(82935'28785242 client.118028302.0:3057684 dirty|data_digest|omap_digest s 0 uv 28785242 dd od alloc_hint [0 0 0]) 2017-04-04 14:31:37.825877 7f2d7f802700 -1 log_channel(cluster) log [ERR] : 4.19 soid 4:9843f136:::rbd_header.08f7fa43a49c7f:head: failed to pick suitable auth object 2017-04-04 14:32:37.926980 7f2d7cffd700 -1 log_channel(cluster) log [ERR] : 4.19 deep-scrub 3 errors If I list the omapvalues, they are null # rados -p volumes listomapvals rbd_header.08f7fa43a49c7f |wc -l 0 If I list the extended attributes on the filesystem of each OSD that hosts this file, they are indeed empty (all 3 OSDs are the same, but just listing one for brevity) getfattr /var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\\uheader.08f7fa43a49c7f__head_6C8FC219__4 getfattr: Removing leading '/' from absolute path names # file: var/lib/ceph/osd/ceph-29/current/4.19_head/DIR_9/DIR_1/DIR_2/rbd\134uheader.08f7fa43a49c7f__head_6C8FC219__4 user.ceph._ user.ceph._@1 user.ceph._lock.rbd_lock user.ceph.snapset user.cephos.spill_out Is there anything I can do to recover from this situation? -- Kind regards, Ben Morrice __ Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670 EPFL / BBP Biotech Campus Chemin des Mines 9 1202 Geneva Switzerland ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
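One approach that has been suggested for omap_digest_mismatch_oi on rbd_header objects is to touch the object's omap so that the object info digest gets rewritten, then deep-scrub and repair again. This is only a sketch of that idea, not a verified fix: the key name is a throwaway placeholder, and it is prudent to keep a copy of the object first:

# back up the object data and its omap
rados -p volumes get rbd_header.08f7fa43a49c7f /tmp/rbd_header.08f7fa43a49c7f.bak
rados -p volumes listomapvals rbd_header.08f7fa43a49c7f > /tmp/rbd_header.08f7fa43a49c7f.omap

# write and remove a dummy omap key to force the object info to be refreshed
rados -p volumes setomapval rbd_header.08f7fa43a49c7f scrub_workaround_key dummy
rados -p volumes rmomapkey rbd_header.08f7fa43a49c7f scrub_workaround_key

# then re-check
ceph pg deep-scrub 4.19
ceph pg repair 4.19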
Re: [ceph-users] Client's read affinity
Jason, I haven't heard much about this feature. Will the localization have effect if the crush location configuration is set in the [osd] section, or does it need to apply globally for clients as well? On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman wrote: > Assuming you are asking about RBD-back VMs, it is not possible to > localize the all reads to the VM image. You can, however, enable > localization of the parent image since that is a read-only data set. > To enable that feature, set "rbd localize parent reads = true" and > populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf. > > On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario > wrote: > > any experiences ? > > > > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario > > wrote: > >> Guys hi. > >> I have a Jewel Cluster divided into two racks which is configured on > >> the crush map. > >> I have clients (openstack compute nodes) that are closer from one rack > >> than to another. > >> > >> I would love to (if is possible) to specify in some way the clients to > >> read first from the nodes on a specific rack then try the other one if > >> is not possible. > >> > >> Is that doable ? can somebody explain me how to do it ? > >> best. > >> > >> -- > >> Alejandrito > > > > > > > > -- > > Alejandro Comisario > > CTO | NUBELIU > > E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857 > > _ > > www.nubeliu.com > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Jason > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Brian Andrus | Cloud Systems Engineer | DreamHost brian.and...@dreamhost.com | www.dreamhost.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FreeBSD port net/ceph-devel released
On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen wrote: > On 1-4-2017 21:59, Wido den Hollander wrote: >> >>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen : >>> >>> >>> On 31-3-2017 17:32, Wido den Hollander wrote: Hi Willem Jan, > Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen > : > > > Hi, > > I'm pleased to announce that my efforts to port to FreeBSD have > resulted in a ceph-devel port commit in the ports tree. > > https://www.freshports.org/net/ceph-devel/ > Awesome work! I don't touch FreeBSD that much, but I can imagine that people want this. Out of curiosity, does this run on ZFS under FreeBSD? Or what Filesystem would you use behind FileStore with this? Or does BlueStore work? >>> >>> Since I'm a huge ZFS fan, that is what I run it on. >> >> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting! > > Right, ZIL is magic, and more or equal to the journal now used with OSDs > for exactly the same reason. Sad thing is that a write is now 3* > journaled: 1* by Ceph, and 2* by ZFS. Which means that the used > bandwidth to the SSDs is double of what it could be. > > Had some discussion about this, but disabling the Ceph journal is not > just setting an option. Although I would like to test performance of an > OSD with just the ZFS journal. But I expect that the OSD journal is > rather firmly integrated. Disabling the OSD journal will never be viable. The journal is also necessary for transactions and batch updates which cannot be done atomically in FileStore. This is great work Willem. I'm especially looking forward to seeing BlueStore performance on a ZVol. -- Patrick Donnelly ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FreeBSD port net/ceph-devel released
On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly wrote: > On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen wrote: >> On 1-4-2017 21:59, Wido den Hollander wrote: >>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen : On 31-3-2017 17:32, Wido den Hollander wrote: > Hi Willem Jan, > >> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen >> : >> >> >> Hi, >> >> I'm pleased to announce that my efforts to port to FreeBSD have >> resulted in a ceph-devel port commit in the ports tree. >> >> https://www.freshports.org/net/ceph-devel/ >> > > Awesome work! I don't touch FreeBSD that much, but I can imagine that > people want this. > > Out of curiosity, does this run on ZFS under FreeBSD? Or what > Filesystem would you use behind FileStore with this? Or does > BlueStore work? Since I'm a huge ZFS fan, that is what I run it on. >>> >>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting! >> >> Right, ZIL is magic, and more or equal to the journal now used with OSDs >> for exactly the same reason. Sad thing is that a write is now 3* >> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used >> bandwidth to the SSDs is double of what it could be. >> >> Had some discussion about this, but disabling the Ceph journal is not >> just setting an option. Although I would like to test performance of an >> OSD with just the ZFS journal. But I expect that the OSD journal is >> rather firmly integrated. > > Disabling the OSD journal will never be viable. The journal is also > necessary for transactions and batch updates which cannot be done > atomically in FileStore. > > This is great work Willem. I'm especially looking forward to seeing > BlueStore performance on a ZVol. > > -- > Patrick Donnelly > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FreeBSD port net/ceph-devel released
[ Sorry for the empty email there. :o ] On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly wrote: > On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen wrote: >> On 1-4-2017 21:59, Wido den Hollander wrote: >>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen : On 31-3-2017 17:32, Wido den Hollander wrote: > Hi Willem Jan, > >> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen >> : >> >> >> Hi, >> >> I'm pleased to announce that my efforts to port to FreeBSD have >> resulted in a ceph-devel port commit in the ports tree. >> >> https://www.freshports.org/net/ceph-devel/ >> > > Awesome work! I don't touch FreeBSD that much, but I can imagine that > people want this. > > Out of curiosity, does this run on ZFS under FreeBSD? Or what > Filesystem would you use behind FileStore with this? Or does > BlueStore work? Since I'm a huge ZFS fan, that is what I run it on. >>> >>> Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting! >> >> Right, ZIL is magic, and more or equal to the journal now used with OSDs >> for exactly the same reason. Sad thing is that a write is now 3* >> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used >> bandwidth to the SSDs is double of what it could be. >> >> Had some discussion about this, but disabling the Ceph journal is not >> just setting an option. Although I would like to test performance of an >> OSD with just the ZFS journal. But I expect that the OSD journal is >> rather firmly integrated. > > Disabling the OSD journal will never be viable. The journal is also > necessary for transactions and batch updates which cannot be done > atomically in FileStore. To expand on Patrick's statement: You shouldn't get confused by the presence of options to disable journaling. They exist but only work on btrfs-backed FileStores and are *not* performant. You could do the same on zfs, but in order to provide the guarantees of the RADOS protocol, when in that mode the OSD just holds replies on all operations until it knows they've been persisted to disk and snapshotted, then sends back a commit. You can probably imagine the horrible IO patterns and bursty application throughput that result. o_0 -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FreeBSD port net/ceph-devel released
On 4-4-2017 21:05, Gregory Farnum wrote: > [ Sorry for the empty email there. :o ] > > On Tue, Apr 4, 2017 at 12:28 PM, Patrick Donnelly wrote: >> On Sat, Apr 1, 2017 at 4:58 PM, Willem Jan Withagen wrote: >>> On 1-4-2017 21:59, Wido den Hollander wrote: > Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen : > > > On 31-3-2017 17:32, Wido den Hollander wrote: >> Hi Willem Jan, >> >>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen >>> : >>> >>> >>> Hi, >>> >>> I'm pleased to announce that my efforts to port to FreeBSD have >>> resulted in a ceph-devel port commit in the ports tree. >>> >>> https://www.freshports.org/net/ceph-devel/ >>> >> >> Awesome work! I don't touch FreeBSD that much, but I can imagine that >> people want this. >> >> Out of curiosity, does this run on ZFS under FreeBSD? Or what >> Filesystem would you use behind FileStore with this? Or does >> BlueStore work? > > Since I'm a huge ZFS fan, that is what I run it on. Cool! The ZIL, ARC and L2ARC can actually make that very fast. Interesting! >>> >>> Right, ZIL is magic, and more or equal to the journal now used with OSDs >>> for exactly the same reason. Sad thing is that a write is now 3* >>> journaled: 1* by Ceph, and 2* by ZFS. Which means that the used >>> bandwidth to the SSDs is double of what it could be. >>> >>> Had some discussion about this, but disabling the Ceph journal is not >>> just setting an option. Although I would like to test performance of an >>> OSD with just the ZFS journal. But I expect that the OSD journal is >>> rather firmly integrated. >> >> Disabling the OSD journal will never be viable. The journal is also >> necessary for transactions and batch updates which cannot be done >> atomically in FileStore. > > To expand on Patrick's statement: You shouldn't get confused by the > presence of options to disable journaling. They exist but only work on > btrfs-backed FileStores and are *not* performant. You could do the > same on zfs, but in order to provide the guarantees of the RADOS > protocol, when in that mode the OSD just holds replies on all > operations until it knows they've been persisted to disk and > snapshotted, then sends back a commit. You can probably imagine the > horrible IO patterns and bursty application throughput that result. When I talked about this with Sage in CERN, I got the same answer. So this is at least consistent. ;-) And I have to admit that I do not understand the intricate details of this part of Ceph. So at the moment I'm looking at it from a more global view What, i guess, needs to be done, is to get ride of at least one of the SSD writes. Which is possible by mounting the journal disk as a separate VDEV (2 SSDs in mirror) and get the max speed out of this. Problem with this all is that the number of SSDs sort of blows up, and very likely there is a lot of waste because the journals need not be very large. And yes the other way would be to do BlueStore on ZVOL, where the underlying VDEVs are carefully crafted. But first we need to get AIO working. And I have not (yet) looked at that at all... First objective was to get a port of any sorts, which I did last week. Second is to take Luminous and make a "stable" port which is less of a moving target. Only then AIO is on the radar --WjW ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
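For the "separate journal vdev" idea, a rough FreeBSD sketch of what such a layout could look like; the device names (da2/da3/da4), dataset names and journal size are made up and this is untested:

# data pool for the OSD on a spinning disk
zpool create osd0 da2
zfs create -o mountpoint=/var/lib/ceph/osd/ceph-0 osd0/data

# small mirrored SSD pool dedicated to FileStore journals
zpool create journals mirror da3 da4
zfs create -o mountpoint=/var/lib/ceph/osd/journals journals/osd0

# in ceph.conf, point the OSD journal at the SSD-backed dataset
[osd.0]
    osd journal = /var/lib/ceph/osd/journals/osd0/journal
    osd journal size = 10240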
Re: [ceph-users] Client's read affinity
AFAIK, the OSDs should discover their location in the CRUSH map automatically -- therefore, this "crush location" config override would be used for librbd client configuration ("i.e. [client] section") to describe their location in the CRUSH map relative to racks, hosts, etc. On Tue, Apr 4, 2017 at 3:12 PM, Brian Andrus wrote: > Jason, I haven't heard much about this feature. > > Will the localization have effect if the crush location configuration is set > in the [osd] section, or does it need to apply globally for clients as well? > > On Fri, Mar 31, 2017 at 6:38 AM, Jason Dillaman wrote: >> >> Assuming you are asking about RBD-back VMs, it is not possible to >> localize the all reads to the VM image. You can, however, enable >> localization of the parent image since that is a read-only data set. >> To enable that feature, set "rbd localize parent reads = true" and >> populate the "crush location = host=X rack=Y etc=Z" in your ceph.conf. >> >> On Fri, Mar 31, 2017 at 9:00 AM, Alejandro Comisario >> wrote: >> > any experiences ? >> > >> > On Wed, Mar 29, 2017 at 2:02 PM, Alejandro Comisario >> > wrote: >> >> Guys hi. >> >> I have a Jewel Cluster divided into two racks which is configured on >> >> the crush map. >> >> I have clients (openstack compute nodes) that are closer from one rack >> >> than to another. >> >> >> >> I would love to (if is possible) to specify in some way the clients to >> >> read first from the nodes on a specific rack then try the other one if >> >> is not possible. >> >> >> >> Is that doable ? can somebody explain me how to do it ? >> >> best. >> >> >> >> -- >> >> Alejandrito >> > >> > >> > >> > -- >> > Alejandro Comisario >> > CTO | NUBELIU >> > E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857 >> > _ >> > www.nubeliu.com >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> >> -- >> Jason >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > -- > Brian Andrus | Cloud Systems Engineer | DreamHost > brian.and...@dreamhost.com | www.dreamhost.com -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
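Putting Jason's two settings together, a sketch of the client side of ceph.conf on a compute node; the host and rack names are placeholders and must match the buckets in your CRUSH map:

[client]
    rbd localize parent reads = true
    crush location = host=compute-01 rack=rack1 root=default

The OSDs keep using their own crush location (set automatically or in the [osd] section); this client-side entry only tells librbd where the client sits so that parent reads can prefer nearby copies.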
Re: [ceph-users] Apply for an official mirror at CN
Wido, ping? On Sat, Apr 1, 2017 at 8:40 PM, SJ Zhu wrote: > On Sat, Apr 1, 2017 at 8:10 PM, Wido den Hollander wrote: >> Great! Very good to hear. We can CNAME cn.ceph.com to that location? > > > Yes, please CNAME to mirrors.ustc.edu.cn, and I will set vhost in our > nginx for the > ceph directory. > > Thanks > > -- > Regards, > Shengjing Zhu -- Regards, Shengjing Zhu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
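For reference, the two pieces being discussed would look roughly like this; the local path of the mirror tree on the USTC side is a guess:

; DNS zone entry on the ceph.com side
cn.ceph.com.    IN  CNAME   mirrors.ustc.edu.cn.

# nginx vhost on the mirror side
server {
    listen 80;
    server_name cn.ceph.com;
    root /srv/mirrors/ceph;    # assumed local path of the ceph mirror tree
    autoindex on;
}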