Re: [ceph-users] v0.67.9 Dumpling released
Hi Sage, all,

On 21 May 2014, at 22:02, Sage Weil wrote:

> * osd: allow snap trim throttling with simple delay (#6278, Sage Weil)

Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = … ?

Cheers, Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Storage
Hi All,

I have a ceph storage cluster with four nodes. I have created block storage using cinder in OpenStack, with ceph as its storage backend. I can see that a volume has been created in one of the ceph pools, but how do I find out which OSDs and PGs the volume is stored on?

Thanks, Kumar

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Storage
Hi,

Am 04.06.2014 14:51, schrieb yalla.gnan.ku...@accenture.com:

> Hi All,
> I have a ceph storage cluster with four nodes. I have created block storage using cinder in openstack and ceph as its storage backend.
> So, I see a volume is created in ceph in one of the pools. But how to get information like on which OSD, PG, the volume is created in ?

Check rbd ls and rbd info <pool>/<image> to get the block_name_prefix, then rados ls -p <pool> to see the objects used. Normally, ceph stripes rbd images across many objects on different osds, so the volume is not stored on only one osd or in one pg.

-- Mit freundlichen Grüßen, Florian Wiessner Smart Weblications GmbH Martinsberger Str. 1 D-95119 Naila fon.: +49 9282 9638 200 fax.: +49 9282 9638 205 24/7: +49 900 144 000 00 - 0,99 EUR/Min* http://www.smart-weblications.de -- Sitz der Gesellschaft: Naila Geschäftsführer: Florian Wiessner HRB-Nr.: HRB 3840 Amtsgericht Hof *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
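As a minimal sketch of the workflow Florian describes (the pool name "volumes", the volume name, and the block_name_prefix value are only examples, not taken from the poster's cluster), one object of the image can be mapped to its PG and acting OSDs like this:

    rbd info volumes/volume-1234                             # note the block_name_prefix, e.g. rbd_data.2ae8944a
    rados -p volumes ls | grep rbd_data.2ae8944a             # list the RADOS objects backing this image
    ceph osd map volumes rbd_data.2ae8944a.0000000000000000  # show the PG and the OSDs holding this one object

Repeating the last command for each object shows how the volume is spread over many PGs and OSDs rather than living in a single place.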
Re: [ceph-users] OSD server alternatives to choose
Le 04/06/2014 03:23, Christian Balzer a écrit : > On Tue, 03 Jun 2014 18:52:00 +0200 Cedric Lemarchand wrote: >> Le 03/06/2014 12:14, Christian Balzer a écrit : >>> A simple way to make 1) and 2) cheaper is to use AMD CPUs, they will do >>> just fine at half the price with these loads. >>> If you're that tight on budget, 64GB RAM will do fine, too. >> I am interested about this specific thought, could you elaborate how did >> you determine if such hardware (CPU and RAM) will handle well cases >> where the cluster goes in rebalancing mode when a node or some OSD goes >> down ? > Well, firstly we both read: > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf I was not aware of this doc, it enlightens lots of questions I was having about CPU/RAM consideration. Thanks for your very exhaustive explanations ;-) Cheers > > And looking at those values a single Opteron 4386 would be more > than sufficient for both 1) and 2). > I'm saying and suggesting a single CPU here to keep things all in one NUMA > node. > AFAIK (I haven't used anything Intel for years) some Intel boards require > both CPUs in place to use all available interfaces (PCIe buses), so the > above advice is only for AMD. > As for RAM, it would be totally overspec'ed with 64GB, but a huge > pagecache is an immense help for reads and RAM is fairly cheap these days, > so the more you can afford, the better. > > Secondly experience. > The above document is pretty much on spot when comes to CPU suggestions in > combination with OSDs backed by a single HDD (SSD journal or not). > I think it is overly optimistic when it comes to purely SSD based storage > nodes or something like my HW RAID backed OSD. > Remember, when using the 4k fio I could get Ceph to use about 2 cores > per OSD and then stall on whatever locking contention or other things that > are going on inside it before actually exhausting all available CPU > resources. > OSDs (journal and backing storage) as well as the network were nowhere > near getting exhausted. > > Compared to that fio run a cluster rebalancing is a breeze, at least when > it comes to CPU resources needed. > It comes in a much more CEPH friendly IO block size and thus exhausts > either network or disk bandwidth first. > >> Because, as Robert stated (and I totally agree with that!), designing a >> cluster is about the expected performances in optimal conditions, and >> expected recovery time and nodes loads in non optimal conditions >> (typically rebalancing), and I found this last point hard to consider >> and anticipate. >> > This is why one builds test clusters and then builds production HW > clusters with the expectation that it will be twice as bad as anticipated > from what you saw on the test cluster. ^o^ > >> As a quick exercise (without taking in consideration FS size overhead >> ect ...), based on config "1.NG" from Christian (ratio SSD/HDD of 1:3, >> thus 9x4TB HDD/nodes, 24 nodes) and replication ratio of 2 : > I would never use a replication of 2 unless I were VERY confident in my > backing storage devices (either high end and well monitored SSDs or RAIDs). > >> - each nodes : ~36TB RAW /~18TB NET >> - the whole cluster, 864TB RAW / ~432TB NET >> >> If a node goes down, ~36TB have to be re balanced between the 23 >> existing, so ~1,6TB have to be read and write on each nodes. I think >> this is the expected workload of the cluster in rebalancing mode. >> >> So 2 questions : >> >> * did my maths are good until now ? > Math is hard, lets go shopping. 
^o^ > But yes, given your parameters that looks correct. >> * where will be the main bottleneck with such configuration and workload >> (CPU/IO/RAM/NET) ? how calculate it ? >> > See above. > In the configurations suggested by Benjamin disk IO will be the > bottleneck, as the network bandwidth is higher than write capacity of the > SSDs and HDDs. CPU and RAM will not be an issue. > > The other thing to consider are the backfilling and/or recovery settings > in CEPH, these will of course influence how much of an impact a node > failure (and potential recovery of it) will have. > Depending on those settings and the cluster load (as in client side) at > the time of failure the most optimistic number for full recovery of > redundancy I can come up with is about an hour, in reality it is probably > going to be substantially longer. > And during that time any further disk failure (with over 200 in the > cluster a pretty decent probability) can result in irrecoverable data loss. > > Christian >> -- >> Cédric >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Cédric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Storage
Hello,

How can I check a ceph client session from the client side? For example, when mounting iSCSI or NFS you can check it (for NFS just with mount, for iSCSI with iscsiadm -m session), but how can I do that with ceph? And is there more detailed documentation about openstack and ceph than http://ceph.com/docs/master/rbd/rbd-openstack/?

On 2014.06.04. 16:29, Smart Weblications GmbH - Florian Wiessner wrote:

> Hi,
>
> Am 04.06.2014 14:51, schrieb yalla.gnan.ku...@accenture.com:
>> Hi All,
>> I have a ceph storage cluster with four nodes. I have created block storage using cinder in openstack and ceph as its storage backend.
>> So, I see a volume is created in ceph in one of the pools. But how to get information like on which OSD, PG, the volume is created in ?
>
> Check rbd ls, rbd info <pool>/<image> to get block_name_prefix.
> rados ls -p <pool> to see the objects used.
> Normally, ceph stripes rbd images across different objects on different osds, so the volume is not created in only one osd or one pg.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.9 Dumpling released
On Wed, 4 Jun 2014, Dan Van Der Ster wrote:

> Hi Sage, all,
>
> On 21 May 2014, at 22:02, Sage Weil wrote:
>
> > * osd: allow snap trim throttling with simple delay (#6278, Sage Weil)
>
> Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = … ?

This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work.

In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :)

You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'.

sage

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
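As a concrete sketch of the workflow Sage describes (osd.0 and the 10ms value are only examples, not recommendations for any particular cluster):

    # apply to all running OSDs without a restart
    ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01
    # confirm a daemon picked it up via its admin socket
    ceph daemon osd.0 config show | grep snap_trim_sleep
    # then delete a snapshot and watch request latency / slow request warnings
    ceph -w

If the latency impact of snapshot deletion is acceptable, the value can be lowered again; if not, raised further, per the tuning advice above.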
Re: [ceph-users] Experiences with Ceph at the June'14 issue of USENIX ; login:
Hello Ian,

Thanks for your interest.

On Mon, Jun 02, 2014 at 06:37:48PM -0400, Ian Colle wrote:
> Thanks, Filippos! Very interesting reading.
>
> Are you comfortable enough yet to remove the RAID-1 from your architecture and get all that space back?

Actually, we are not ready to do that yet. There are three major things to consider.

First, to be able to get rid of the RAID-1 setup, we need to increase the replication level to at least 3x. So the space gain is not that great to begin with.

Second, this operation can take about a month for our scale according to our calculations and previous experience. During this period of increased I/O we might get peaks of performance degradation. Plus, we currently do not have the necessary hardware available to increase the replication level before we get rid of the RAID setup.

Third, we have a few disk failures per month. The RAID-1 setup has allowed us to seamlessly replace them without any hiccup or even a clue to the end user that something went wrong. Surely we can rely on RADOS to avoid any data loss, but if we currently rely on RADOS for recovery there might be some (minor) performance degradation, especially for the VM I/O traffic.

Kind Regards,
-- Filippos

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RGW: Multi Part upload and resulting objects
Hi,

During a multipart upload you can't upload parts smaller than 5M, yet radosgw also slices objects into slices of 4M. Having those two differ is a bit unfortunate, because if you slice your files at the minimum chunk size you end up with a main file of 4M and a shadow file of 1M for each part ...

Would it make sense to either allow multipart uploads of 4M, or to raise the slice size to something more than 4M (4M or 8M if you want a power of 2)?

Cheers, Sylvain

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.9 Dumpling released
On 04 Jun 2014, at 16:06, Sage Weil wrote:

> On Wed, 4 Jun 2014, Dan Van Der Ster wrote:
>> Hi Sage, all,
>>
>> On 21 May 2014, at 22:02, Sage Weil wrote:
>>
>>> * osd: allow snap trim throttling with simple delay (#6278, Sage Weil)
>>
>> Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = … ?
>
> This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work.

Thanks Sage. Is this delay applied per object being removed or at some higher granularity?

And BTW, I was also curious why you’ve only added a throttle to the snap trim ops. Are object/rbd/pg/pool deletions somehow less disruptive to client IOs?

Cheers, Dan

> In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :)
>
> You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'.
>
> sage

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.9 Dumpling released
On Wed, 4 Jun 2014, Andrey Korolyov wrote: > On 06/04/2014 06:06 PM, Sage Weil wrote: > > On Wed, 4 Jun 2014, Dan Van Der Ster wrote: > >> Hi Sage, all, > >> > >> On 21 May 2014, at 22:02, Sage Weil wrote: > >> > >>> * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) > >> > >> Do you have some advice about how to use the snap trim throttle? I saw > >> osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to > >> follow the original ticket, since it started out as a question about > >> deep scrub contending with client IOs, but then at some point you > >> renamed the ticket to throttling snap trim. What exactly does snap trim > >> do in the context of RBD client? And can you suggest a good starting > >> point for osd_snap_trim_sleep = ? ? > > > > This is a coarse hack to make the snap trimming slow down and let client > > IO run by simply sleeping between work. I would start with something > > smallish (.01 = 10ms) after deleting some snapshots and see what effect it > > has on request latency. Unfortunately it's not a very intuitive knob to > > adjust, but it is an interim solution until we figure out how to better > > prioritize this (and other) background work. > > > > In short, if you do see a performance degradation after removing snaps, > > adjust this up or down and see how it changes that. If you don't see a > > degradation, then you're lucky and don't need to do anything. :) > > > > You can adjust this on running OSDs with something like 'ceph daemon > > osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* > > injectargs -- --osd-snap-trim-sleep .01'. > > > > sage > > > > Hi, > > we had the same mechanism for almost a half of year and it working nice > except cases when multiple background snap deletions are hitting their > ends - latencies may spike not regarding very large sleep gap for snap > operations. Do you have any thoughts on reducing this particular impact? This isn't ringing any bells. If this is somethign you can reproduce with osd logging enabled we should be able to tell what is causing the spike, though... sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.9 Dumpling released
On Wed, 4 Jun 2014, Dan Van Der Ster wrote:

> On 04 Jun 2014, at 16:06, Sage Weil wrote:
>
> > On Wed, 4 Jun 2014, Dan Van Der Ster wrote:
> >> Hi Sage, all,
> >>
> >> On 21 May 2014, at 22:02, Sage Weil wrote:
> >>
> >>> * osd: allow snap trim throttling with simple delay (#6278, Sage Weil)
> >>
> >> Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = … ?
> >
> > This is a coarse hack to make the snap trimming slow down and let client IO run by simply sleeping between work. I would start with something smallish (.01 = 10ms) after deleting some snapshots and see what effect it has on request latency. Unfortunately it's not a very intuitive knob to adjust, but it is an interim solution until we figure out how to better prioritize this (and other) background work.
>
> Thanks Sage. Is this delay applied per object being removed or at some higher granularity?

Per object.

> And BTW, I was also curious why you've only added a throttle to the snap trim ops. Are object/rbd/pg/pool deletions somehow less disruptive to client IOs?

Other deletions are client IOs. Snap deletions are one of the few operations that are driven by the OSD and thus need their own throttling. FWIW, I think the plan going forward is to create ops for these internally so that they go through the same queues and prioritization as client requests.

sage

> Cheers, Dan
>
> > In short, if you do see a performance degradation after removing snaps, adjust this up or down and see how it changes that. If you don't see a degradation, then you're lucky and don't need to do anything. :)
> >
> > You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'.
> >
> > sage

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.9 Dumpling released
On Wed, 4 Jun 2014, Andrey Korolyov wrote: > On 06/04/2014 07:22 PM, Sage Weil wrote: > > On Wed, 4 Jun 2014, Andrey Korolyov wrote: > >> On 06/04/2014 06:06 PM, Sage Weil wrote: > >>> On Wed, 4 Jun 2014, Dan Van Der Ster wrote: > Hi Sage, all, > > On 21 May 2014, at 22:02, Sage Weil wrote: > > > * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) > > Do you have some advice about how to use the snap trim throttle? I saw > osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to > follow the original ticket, since it started out as a question about > deep scrub contending with client IOs, but then at some point you > renamed the ticket to throttling snap trim. What exactly does snap trim > do in the context of RBD client? And can you suggest a good starting > point for osd_snap_trim_sleep = ? ? > >>> > >>> This is a coarse hack to make the snap trimming slow down and let client > >>> IO run by simply sleeping between work. I would start with something > >>> smallish (.01 = 10ms) after deleting some snapshots and see what effect > >>> it > >>> has on request latency. Unfortunately it's not a very intuitive knob to > >>> adjust, but it is an interim solution until we figure out how to better > >>> prioritize this (and other) background work. > >>> > >>> In short, if you do see a performance degradation after removing snaps, > >>> adjust this up or down and see how it changes that. If you don't see a > >>> degradation, then you're lucky and don't need to do anything. :) > >>> > >>> You can adjust this on running OSDs with something like 'ceph daemon > >>> osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* > >>> injectargs -- --osd-snap-trim-sleep .01'. > >>> > >>> sage > >>> > >> > >> Hi, > >> > >> we had the same mechanism for almost a half of year and it working nice > >> except cases when multiple background snap deletions are hitting their > >> ends - latencies may spike not regarding very large sleep gap for snap > >> operations. Do you have any thoughts on reducing this particular impact? > > > > This isn't ringing any bells. If this is somethign you can reproduce with > > osd logging enabled we should be able to tell what is causing the spike, > > though... > > > > sage > > > > Ok, would 10 be enough there? On 20, all timings most likely to be > distorted by logging operations even for tmpfs. Yeah, debug osd = 20 and debug ms = 1 should be sufficient. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
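A small sketch of capturing such a spike with the suggested log levels (the revert values and the log path are assumptions; they are the usual defaults, not taken from Andrey's setup):

    # raise logging on all OSDs (or only the one showing the spike)
    ceph tell osd.* injectargs -- --debug-osd 20 --debug-ms 1
    # ... reproduce the snapshot deletions and the latency spike, then turn it back down
    ceph tell osd.* injectargs -- --debug-osd 0/5 --debug-ms 0/5
    # the captured logs normally land in /var/log/ceph/ceph-osd.<id>.log on each OSD host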
Re: [ceph-users] v0.67.9 Dumpling released
On Wed, 4 Jun 2014, Dan Van Der Ster wrote:

> On 04 Jun 2014, at 16:06, Sage Weil wrote:
>
> > You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'.
>
> Thanks, trying that now.
>
> I noticed that using = 0.01 in ceph.conf it gets parsed as 0, whereas .01 is parsed correctly. Known bug?

Nope! Do you mind filing a ticket at tracker.ceph.com? Thanks!

sage

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
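Until that parsing issue is tracked down, a workaround sketch for ceph.conf based purely on Dan's observation above (only the leading-dot form appears to take effect):

    [osd]
        ;osd snap trim sleep = 0.01   ; reported to be parsed as 0
        osd snap trim sleep = .01     ; reported to be parsed correctly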
Re: [ceph-users] v0.67.9 Dumpling released
On 06/04/2014 06:06 PM, Sage Weil wrote: > On Wed, 4 Jun 2014, Dan Van Der Ster wrote: >> Hi Sage, all, >> >> On 21 May 2014, at 22:02, Sage Weil wrote: >> >>> * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) >> >> Do you have some advice about how to use the snap trim throttle? I saw >> osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to >> follow the original ticket, since it started out as a question about >> deep scrub contending with client IOs, but then at some point you >> renamed the ticket to throttling snap trim. What exactly does snap trim >> do in the context of RBD client? And can you suggest a good starting >> point for osd_snap_trim_sleep = ? ? > > This is a coarse hack to make the snap trimming slow down and let client > IO run by simply sleeping between work. I would start with something > smallish (.01 = 10ms) after deleting some snapshots and see what effect it > has on request latency. Unfortunately it's not a very intuitive knob to > adjust, but it is an interim solution until we figure out how to better > prioritize this (and other) background work. > > In short, if you do see a performance degradation after removing snaps, > adjust this up or down and see how it changes that. If you don't see a > degradation, then you're lucky and don't need to do anything. :) > > You can adjust this on running OSDs with something like 'ceph daemon > osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* > injectargs -- --osd-snap-trim-sleep .01'. > > sage > Hi, we had the same mechanism for almost a half of year and it working nice except cases when multiple background snap deletions are hitting their ends - latencies may spike not regarding very large sleep gap for snap operations. Do you have any thoughts on reducing this particular impact? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW: Multi Part upload and resulting objects
On Wed, Jun 4, 2014 at 7:58 AM, Sylvain Munaut wrote:

> Hi,
>
> During a multi part upload you can't upload parts smaller than 5M, and radosgw also slices object in slices of 4M. Having those two being different is a bit unfortunate because if you slice your files in the minimum chunk size you end up with a main file of 4M and a shadowfile of 1M for each part ...
>
> Would it make sense to allow either multipart upload of 4M, or to rise the slice size to something more than 4M (4M or 8M if you want power of 2) ?

Huh. We took the 5MB limit from S3, but it definitely is unfortunate in combination with our 4MB chunking. You can change the default slice size using a config option, though. I believe you want to change rgw_obj_stripe_size (default: 4 << 20). There might be some other considerations around the initial 512KB "head" objects, though... Yehuda?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
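For anyone who wants to experiment with the option Greg names, a hedged ceph.conf sketch (the section name [client.radosgw.gateway] is just a typical rgw instance name, and 8M is only an example chosen to line up with 8M multipart parts):

    [client.radosgw.gateway]
        rgw obj stripe size = 8388608    ; 8 MB instead of the default 4 << 20 (4 MB)

Presumably this only affects newly written objects, so any check of how parts and shadow objects line up should be done on fresh uploads.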
Re: [ceph-users] v0.67.9 Dumpling released
On 06/04/2014 07:22 PM, Sage Weil wrote: > On Wed, 4 Jun 2014, Andrey Korolyov wrote: >> On 06/04/2014 06:06 PM, Sage Weil wrote: >>> On Wed, 4 Jun 2014, Dan Van Der Ster wrote: Hi Sage, all, On 21 May 2014, at 22:02, Sage Weil wrote: > * osd: allow snap trim throttling with simple delay (#6278, Sage Weil) Do you have some advice about how to use the snap trim throttle? I saw osd_snap_trim_sleep, which is still 0 by default. But I didn't manage to follow the original ticket, since it started out as a question about deep scrub contending with client IOs, but then at some point you renamed the ticket to throttling snap trim. What exactly does snap trim do in the context of RBD client? And can you suggest a good starting point for osd_snap_trim_sleep = ? ? >>> >>> This is a coarse hack to make the snap trimming slow down and let client >>> IO run by simply sleeping between work. I would start with something >>> smallish (.01 = 10ms) after deleting some snapshots and see what effect it >>> has on request latency. Unfortunately it's not a very intuitive knob to >>> adjust, but it is an interim solution until we figure out how to better >>> prioritize this (and other) background work. >>> >>> In short, if you do see a performance degradation after removing snaps, >>> adjust this up or down and see how it changes that. If you don't see a >>> degradation, then you're lucky and don't need to do anything. :) >>> >>> You can adjust this on running OSDs with something like 'ceph daemon >>> osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* >>> injectargs -- --osd-snap-trim-sleep .01'. >>> >>> sage >>> >> >> Hi, >> >> we had the same mechanism for almost a half of year and it working nice >> except cases when multiple background snap deletions are hitting their >> ends - latencies may spike not regarding very large sleep gap for snap >> operations. Do you have any thoughts on reducing this particular impact? > > This isn't ringing any bells. If this is somethign you can reproduce with > osd logging enabled we should be able to tell what is causing the spike, > though... > > sage > Ok, would 10 be enough there? On 20, all timings most likely to be distorted by logging operations even for tmpfs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW: Multi Part upload and resulting objects
On Wed, Jun 4, 2014 at 8:49 AM, Gregory Farnum wrote: > On Wed, Jun 4, 2014 at 7:58 AM, Sylvain Munaut > wrote: >> Hi, >> >> >> During a multi part upload you can't upload parts smaller than 5M, and >> radosgw also slices object in slices of 4M. Having those two being >> different is a bit unfortunate because if you slice your files in the >> minimum chunk size you end up with a main file of 4M and a shadowfile >> of 1M for each part ... >> >> >> Would it make sense to allow either multipart upload of 4M, or to rise >> the slice size to something more than 4M (4M or 8M if you want power >> of 2) ? > > Huh. We took the 5MB limit from S3, but it definitely is unfortunate > in combination with our 4MB chunking. You can change the default slice > size using a config option, though. I believe you want to change > rgw_obj_stripe_size (default: 4 << 20). There might be some other > considerations around the initial 512KB "head" objects, > though...Yehuda? The head object size is unrelated to the stripe size and changing the stripe size wouldn't affect it. For large uploads the head size is negligible, so I don't see it as any concern. Yehuda ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.67.9 Dumpling released
On 04 Jun 2014, at 16:06, Sage Weil wrote:

> You can adjust this on running OSDs with something like 'ceph daemon osd.NN config set osd_snap_trim_sleep .01' or with 'ceph tell osd.* injectargs -- --osd-snap-trim-sleep .01'.

Thanks, trying that now.

I noticed that using = 0.01 in ceph.conf gets parsed as 0, whereas .01 is parsed correctly. Known bug?

Cheers, Dan

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados benchmark is fast, but dd result on guest vm still slow?
Hello, On Wed, 4 Jun 2014 22:36:00 +0800 Indra Pramana wrote: > Hi Christian, > > Good day to you, and thank you for your reply. > > Just now I managed to identify 3 more OSDs which were slow and needed to > be trimmed. Here is a longer (1 minute) result of rados bench after the > trimming: > This is the second time I see you mentioning needing to trim OSDs. Does that mean your actual storage is on SDDs? If only your journals are on SSDs (and nothing else) a trim (how do you trim them?) should have no effect at all. > http://pastebin.com/YFTbLyHA > You should run atop on all of your storage nodes and watch all OSD disks when your cluster stalls. If you have too many nodes/OSDs to watch them all at the same time, use the logging functionality of atop (probably with a lower interval than the standard 10 seconds) and review things after a bench run. I have a hard time believing that your entire cluster just stopped processing things for 10 seconds there, but I bet an OSD or node stalled. > > Total time run: 69.441936 > Total writes made: 3773 > Write size: 4096000 > Bandwidth (MB/sec): 212.239 > > Stddev Bandwidth: 247.672 > Max bandwidth (MB/sec): 921.875 > Min bandwidth (MB/sec): 0 > Average Latency:0.58602 > Stddev Latency: 2.39341 > Max latency:32.1121 > Min latency:0.04847 > > > When I run this for 60 seconds, I noted some slow requests message when I > monitor using ceph -w, near the end of the 60-second period. > > I have verified that all OSDs have I/O speed of > 220 MB/s after I > trimmed the remaining slow ones just now. I noted that some SSDs are > having 250 MB/s of I/O speed when I take it out of cluster, but then > drop to 150 MB/s -ish after I put back into the cluster. > Also having to trim SSDs to regain performance suggests that you probably aren't using Intel DC ones. Some (most really)) SSDs are known to have massive delays (latencies) when having to do a garbage collection or other internal functions. > Could it be due to the latency? You mentioned that average latency of 0.5 > is pretty horrible. How can I find what contributes to the latency and > how to fix the problem? Really at loss now. :( > The latency is a combination of all delays, in your case I'm sure it is storage related. Christian > Looking forward to your reply, thank you. > > Cheers. > > > > > On Mon, Jun 2, 2014 at 4:56 PM, Christian Balzer wrote: > > > > > Hello, > > > > On Mon, 2 Jun 2014 16:15:22 +0800 Indra Pramana wrote: > > > > > Dear all, > > > > > > I have managed to identify some slow OSDs and journals and have since > > > replaced them. RADOS benchmark of the whole cluster is now fast, much > > > improved from last time, showing the cluster can go up to 700+ MB/s. 
> > > > > > = > > > Maintaining 16 concurrent writes of 4194304 bytes for up to 10 > > > seconds or 0 objects > > > Object prefix: benchmark_data_hv-kvm-01_6931 > > >sec Cur ops started finished avg MB/s cur MB/s last lat > > > avg lat 0 0 0 0 0 0 > > > - 0 1 16 214 198 791.387 792 > > > 0.260687 0.074689 2 16 275 259 517.721 244 > > > 0.079697 0.0861397 3 16 317 301 401.174 > > > 168 0.209022 0.115348 4 16 317 301 > > > 300.902 0 - 0.115348 5 16 356 340 > > > 271.92478 0.040032 0.172452 6 16 389 373 > > > 248.604 132 0.038983 0.221213 7 16 411 395 > > > 225.66288 0.048462 0.211686 8 16 441 425 > > > 212.454 120 0.048722 0.237671 9 16 474 458 > > > 203.513 132 0.041285 0.226825 10 16 504 > > > 488 195.161 120 0.041899 0.224044 11 16 > > > 505 489 177.784 4 0.622238 0.224858 12 > > > 16 505 489162.97 0 - 0.224858 Total > > > time run: 12.142654 Total writes made: 505 > > > Write size: 4194304 > > > Bandwidth (MB/sec): 166.356 > > > > > > Stddev Bandwidth: 208.41 > > > Max bandwidth (MB/sec): 792 > > > Min bandwidth (MB/sec): 0 > > > Average Latency:0.384178 > > > Stddev Latency: 1.10504 > > > Max latency:9.64224 > > > Min latency:0.031679 > > > = > > > > > This might be better than the last result, but it still shows the same > > massive variance in latency and a pretty horrible average latency. > > > > Also you want to run this test for a lot longer, looking at the > > bandwidth progression it seems to drop over time. > > I'd expect the sustained bandwidth over a minute or so be below > > 100MB/s. > > > > > > > However, dd test result on guest VM is still slow. > > > > > > = > > > root@test1# dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync > > > oflag=direct > > > 256+0 records in > > > 256+0 recor
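A small sketch of the atop logging Christian suggests for catching a stalling OSD or node (the file name and the 5-second interval are only examples):

    atop -w /var/log/atop_osd.raw 5   # on each storage node: record a sample every 5 seconds
    # ... run the rados bench from the client, then replay the recording:
    atop -r /var/log/atop_osd.raw     # step forward through samples with 't' and watch the OSD disks for saturation

The point is to correlate the moments when rados bench reports 0 MB/s with whichever disk or node shows up as busy in the recorded samples.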
Re: [ceph-users] rados benchmark is fast, but dd result on guest vm still slow?
Hello, On Wed, 4 Jun 2014 23:46:33 +0800 Indra Pramana wrote: > Hi Christian, > > In addition to my previous email, I realised that if I use dd with 4M > block size, I can get higher speed. > > root@Ubuntu-12043-64bit:/data# dd bs=4M count=128 if=/dev/zero of=test4 > conv=fdatasync oflag=direct > 128+0 records in > 128+0 records out > 536870912 bytes (537 MB) copied, 5.68378 s, 94.5 MB/s > > compared to: > > root@Ubuntu-12043-64bit:/data# dd bs=1M count=512 if=/dev/zero of=test8 > conv=fdatasync oflag=direct > 512+0 records in > 512+0 records out > 536870912 bytes (537 MB) copied, 8.91133 s, 60.2 MB/s > That's what I told you. An even bigger impact than saw here. > But still, the difference is still very big. With 4M block size, I can > get 400 MB/s average I/O speed (max 1,000 MB/s) using rados bench, but > only 90 MB/s average using dd on guest VM. I am wondering if there are > any "throttling" settings which prevent the guest VM to get the full I/O > speed the Ceph cluster provides. > Not really, no. However despite the identical block size now, you are still using 2 different tools and thus comparing apples to oranges. rados bench by default starts 16 threads, doesn't have to deal with any inefficiencies of the VM layers and neither with a filesystem. The dd on the other hand is in the VM, writes to a filesystem and most of all is single threaded. If I run a dd I get about half the speed of rados bench, running 2 in parallel on different VMs gets things to 80%, etc. > With regards to the VM user space, kernel space that you mentioned, can > you elaborate more on what do you mean by that? We are using CloudStack > and KVM hypervisor, using libvirt to connect to Ceph RBD. > So probably userspace RBD, I don't really know Cloudstack though. What I was suggesting is mapping and then mounting a (new) RBD image to a host (kernelspace), formatting it with the same FS type as your VM and then run the dd on it. Not a perfect match due to kernel versus user space, but a lot closer than bench versus dd. Christian > Looking forward to your reply, thank you. > > Cheers. > > > > On Wed, Jun 4, 2014 at 10:36 PM, Indra Pramana wrote: > > > Hi Christian, > > > > Good day to you, and thank you for your reply. > > > > Just now I managed to identify 3 more OSDs which were slow and needed > > to be trimmed. Here is a longer (1 minute) result of rados bench after > > the trimming: > > > > http://pastebin.com/YFTbLyHA > > > > > > Total time run: 69.441936 > > Total writes made: 3773 > > Write size: 4096000 > > Bandwidth (MB/sec): 212.239 > > > > Stddev Bandwidth: 247.672 > > Max bandwidth (MB/sec): 921.875 > > Min bandwidth (MB/sec): 0 > > Average Latency:0.58602 > > Stddev Latency: 2.39341 > > Max latency:32.1121 > > Min latency:0.04847 > > > > > > When I run this for 60 seconds, I noted some slow requests message > > when I monitor using ceph -w, near the end of the 60-second period. > > > > I have verified that all OSDs have I/O speed of > 220 MB/s after I > > trimmed the remaining slow ones just now. I noted that some SSDs are > > having 250 MB/s of I/O speed when I take it out of cluster, but then > > drop to 150 MB/s -ish after I put back into the cluster. > > > > Could it be due to the latency? You mentioned that average latency of > > 0.5 is pretty horrible. How can I find what contributes to the latency > > and how to fix the problem? Really at loss now. :( > > > > Looking forward to your reply, thank you. > > > > Cheers. 
> > > > > > > > > > On Mon, Jun 2, 2014 at 4:56 PM, Christian Balzer wrote: > > > >> > >> Hello, > >> > >> On Mon, 2 Jun 2014 16:15:22 +0800 Indra Pramana wrote: > >> > >> > Dear all, > >> > > >> > I have managed to identify some slow OSDs and journals and have > >> > since replaced them. RADOS benchmark of the whole cluster is now > >> > fast, much improved from last time, showing the cluster can go up > >> > to 700+ MB/s. > >> > > >> > = > >> > Maintaining 16 concurrent writes of 4194304 bytes for up to 10 > >> > seconds or 0 objects > >> > Object prefix: benchmark_data_hv-kvm-01_6931 > >> >sec Cur ops started finished avg MB/s cur MB/s last lat > >> > avg lat 0 0 0 0 0 0 - > >> 0 > >> > 1 16 214 198 791.387 792 0.260687 > >> > 0.074689 2 16 275 259 517.721 244 0.079697 > >> > 0.0861397 3 16 317 301 401.174 168 > >> > 0.209022 0.115348 4 16 317 301 300.902 > >> > 0 - 0.115348 5 16 356 340 271.924 > >> > 78 0.040032 0.172452 6 16 389 373 248.604 > >> > 132 0.038983 0.221213 7 16 411 395 > >> > 225.66288 0.048462 0.211686 8 16 441 > >> > 425 212.454 120 0.048722 0.237671 9 16 > >> > 474 458 203.513 132 0.041285 0.226
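For the kernel-space comparison Christian suggests (mapping a fresh RBD image on a host and running the same dd against it), a rough sketch; the pool name "rbd", the image name and size, and the ext4 filesystem are assumptions to adjust to match the VM's setup:

    rbd create -p rbd --size 10240 dd-test        # 10 GB throwaway test image
    rbd map rbd/dd-test                           # shows up as e.g. /dev/rbd0
    mkfs.ext4 /dev/rbd0 && mount /dev/rbd0 /mnt
    dd bs=4M count=128 if=/dev/zero of=/mnt/test conv=fdatasync oflag=direct
    umount /mnt && rbd unmap /dev/rbd0 && rbd rm rbd/dd-test

Comparing this result with the in-VM dd helps separate the cost of the virtualization layers from what the cluster itself can deliver to a single client thread.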
[ceph-users] osd down/out problem
Hi, some of the osds in my env continues to try to connect to monitors/ceph nodes, but get connection refused and down/out. It even worse when I try to initialize 100+ osds (800G HDD for each osd), most of the osds would run into the same problem to connect to monitor. I checked the monitor status, it looks good, there are no monitors down, I also disabled iptalbes and selinux, set " max open files = 131072" in ceph.conf. Could you let me know what else I should do to fix the problem? BTW, for now I have 3 monitors in ceph cluster, and all of them are in good status. Osd log - -4633> 2014-06-03 10:37:55.359873 7fa894c2c7a0 10 monclient(hunting): -4633> auth_supported 2 method cephx -4632> 2014-06-03 10:37:55.360055 7fa894c2c7a0 2 auth: KeyRing::load: loaded key file /etc/ceph/keyring.osd.0 -4631> 2014-06-03 10:37:55.360607 7fa894c2c7a0 5 asok(0x2660230) register_command objecter_requests hook 0x2610190 -4630> 2014-06-03 10:37:55.360620 7fa87f4fa700 5 osd.0 0 heartbeat: osd_stat(33016 kB used, 837 GB avail, 837 GB total, peers []/[] op hist []) -4629> 2014-06-03 10:37:55.360679 7fa894c2c7a0 10 monclient(hunting): renew_subs -4628> 2014-06-03 10:37:55.360694 7fa894c2c7a0 10 monclient(hunting): _reopen_session rank -1 name -4627> 2014-06-03 10:37:55.360779 7fa894c2c7a0 10 monclient(hunting): picked mon.0 con 0x269dc20 addr 192.168.50.11:6789/0 -4626> 2014-06-03 10:37:55.360804 7fa894c2c7a0 10 monclient(hunting): _send_mon_message to mon.0 at 192.168.50.11:6789/0 -4625> 2014-06-03 10:37:55.360814 7fa894c2c7a0 1 -- 192.168.50.11:6800/7283 --> 192.168.50.11:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x2668900 con 0x269dc20 -4624> 2014-06- 03 10:37:55.360835 7fa894c2c7a0 10 monclient(hunting): renew_subs -4623> 2014-06-03 10:37:55.360904 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).connect error 192.168.50.11:6789/0, (111) Connection refused -4622> 2014-06-03 10:37:55.360980 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).fault (111) Connection refused -4621> 2014-06-03 10:37:55.361007 7fa87d4f6700 0 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).fault -4620> 2014-06-03 10:37:55.361072 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).connect error 192.168.50.11:6789/0, (111) Connection refused -4619> 2014-06-03 10:37:55.361101 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pg s=0 cs=0 l=1 c=0x269dc20).fault (111) Connection refused -4618> 2014-06-03 10:37:55.561290 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).connect error 192.168.50.11:6789/0, (111) Connection refused -4617> 2014-06-03 10:37:55.561384 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).fault (111) Connection refused -4616> 2014-06-03 10:37:55.961583 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).connect error 192.168.50.11:6789/0, (111) Connection refused -4615> 2014-06-03 10:37:55.961641 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).fault (111) Connection refused -4614> 2014-06-03 10:37:56.761838 
7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).connect error 192.168.50.11:6789/0, (111) Connection refused -4613> 2014-06-03 10:37:56.761904 7fa87d4f6700 2 -- 192.168.50.11:6800/7283 >> 192.168.50.11:6789/0 pipe(0x27b8000 sd=25 :0 s=1 pgs=0 cs=0 l=1 c=0x269dc20).fault (111) Connection refused .. -3482> 2014-06-03 10:40:37.377272 7fa882d01700 10 monclient(hunting): -3482> tick -3481> 2014-06-03 10:40:37.377286 7fa882d01700 1 monclient(hunting): continuing hunt -3480> 2014-06-03 10:40:37.377288 7fa882d01700 10 monclient(hunting): _reopen_session rank -1 name -3479> 2014-06-03 10:40:37.377294 7fa882d01700 1 -- 192.168.50.11:6800/7283 mark_down 0x269dc20 -- 0x27b8780 -3478> 2014-06-03 10:40:37.377376 7fa882d01700 10 monclient(hunting): picked mon.2 con 0x269f380 addr 192.168.50.13:6789/0 -3477> 2014-06-03 10:40:37.377401 7fa882d01700 10 monclient(hunting): _send_mon_message to mon.2 at 192.168.50.13:6789/0 -3476> 2014-06-03 10:40:37.377405 7fa882d01700 1 -- 192.168.50.11:6800/7283 --> 192.168.50.13:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x266a880 con 0x269f380 -3475> 2014-06-03 10:40:37.377415 7fa882d01700 10 monclient(hunting): renew_subs -3474> 2014-06-03 10:40:37.377387 7fa87c3f3700 2 -- 192.168.50.11:6800/7
Re: [ceph-users] Experiences with Ceph at the June'14 issue of USENIX ; login:
Hello Filippos, On Wed, 4 Jun 2014 17:22:35 +0300 Filippos Giannakos wrote: > Hello Ian, > > Thanks for your interest. > > On Mon, Jun 02, 2014 at 06:37:48PM -0400, Ian Colle wrote: > > Thanks, Filippos! Very interesting reading. > > > > Are you comfortable enough yet to remove the RAID-1 from your > > architecture and get all that space back? > > Actually, we are not ready to do that yet. There are three major things > to consider. > > First, to be able to get rid of the RAID-1 setup, we need to increase the > replication level to at least 3x. So the space gain is not that great to > begin with. > > Second, this operation can take about a month for our scale according to > our calculations and previous experience. During this period of > increased I/O we might get peaks of performance degradation. Plus, we > currently do not have the necessary hardware available to increase the > replication level before we get rid of the RAID setup. > > Third, we have a few disk failures per month. The RAID-1 setup has > allowed us to seamlessly replace them without any hiccup or even a clue > to the end user that something went wrong. Surely we can rely on RADOS > to avoid any data loss, but if we currently rely on RADOS for recovery > there might be some (minor) performance degradation, especially for the > VM I/O traffic. > That. And in addition you probably never had to do all that song and dance of removing a failed OSD and bringing up a replacement. ^o^ One of the reasons I choose RAIDs as OSDs, especially since the Ceph cluster in question is not local. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com