Re: [ceph-users] Problem with radosgw and some file name characters
Yehuda, sorry for the delay in replying, I was away for a week or so. The problem happens regardless of the client; I've tried a few.

You are right, I've got a load balancer and a reverse proxy in front of the radosgw service. My setup is as follows:

Internet <---> Load Balancer <--> [Apache_Proxy_Server_1 | Apache_Proxy_Server_2] <--> [Radosgw_Server_1 | Radosgw_Server_2]

So, the client hits the Load Balancer first, which redirects traffic either to proxy server 1 or 2. The proxy servers connect either to radosgw server 1 or server 2. I've noticed that if I go directly (Internet <--> Radosgw_Server_1) I do not have any issues with special characters.

Any idea what I am missing? Perhaps something needs changing on the proxy server?

Cheers
Andrei

----- Original Message -----
From: "Yehuda Sadeh"
To: "Andrei Mikhailovsky"
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, 21 May, 2014 4:24:51 PM
Subject: Re: [ceph-users] Problem with radosgw and some file name characters

On Tue, May 20, 2014 at 4:13 AM, Andrei Mikhailovsky wrote:
> Anyone have any idea how to fix the problem with getting 403 when trying to
> upload files with non-standard characters? I am sure I am not the only one
> with these requirements.

It might be the specific client that you're using and the way it signs the requests. Can you try a different S3 client, see whether it works or not? Are you by any chance going through some kind of a load balancer that rewrites the urls?

Yehuda

>
> From: "Andrei Mikhailovsky"
> To: "Yehuda Sadeh"
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, 19 May, 2014 12:38:29 PM
> Subject: Re: [ceph-users] Problem with radosgw and some file name characters
>
> Yehuda,
>
> Never mind my last post, I've found the issue with the rule that you've
> suggested. My fastcgi script is called differently, so that's why I was
> getting the 404.
>
> I've tried your rewrite rule and I am still having the same issues. The same
> characters are failing with the rule you've suggested.
>
> Any idea how to fix the issue?
>
> Cheers
> Andrei
>
> From: "Andrei Mikhailovsky"
> To: "Yehuda Sadeh"
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, 19 May, 2014 9:30:03 AM
> Subject: Re: [ceph-users] Problem with radosgw and some file name characters
>
> Yehuda,
>
> I've tried the rewrite rule that you've suggested, but it is not working for
> me. I get 404 when trying to access the service.
>
> RewriteRule ^/(.*) /s3gw.3.fcgi?%{QUERY_STRING}
> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>
> Any idea what is wrong with this rule?
>
> Cheers
> Andrei
>
> From: "Yehuda Sadeh"
> To: "Andrei Mikhailovsky"
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 16 May, 2014 5:44:52 PM
> Subject: Re: [ceph-users] Problem with radosgw and some file name characters
>
> Was talking about this. There is a different and simpler rule that we
> use nowadays, for some reason it's not well documented:
>
> RewriteRule ^/(.*) /s3gw.3.fcgi?%{QUERY_STRING}
> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>
> I still need to see a more verbose log to make a better educated guess.
>
> Yehuda
>
> On Thu, May 15, 2014 at 3:01 PM, Andrei Mikhailovsky
> wrote:
>>
>> Yehuda,
>>
>> what do you mean by the rewrite rule? Is this for Apache? I've used the
>> ceph documentation to create it. My rule is:
>>
>> RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*)
>> /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING}
>> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>>
>> Or are you talking about something else?
>> >> Cheers >> >> Andrei >> >> From: "Yehuda Sadeh" >> To: "Andrei Mikhailovsky" >> Cc: ceph-users@lists.ceph.com >> Sent: Thursday, 15 May, 2014 4:05:06 PM >> Subject: Re: [ceph-users] Problem with radosgw and some file name >> characters >> >> >> Your rewrite rule might be off a bit. Can you provide log with 'debug rgw >> = >> 20'? >> >> Yehuda >> >> On Thu, May 15, 2014 at 8:02 AM, Andrei Mikhailovsky >> wrote: >>> Hello guys, >>> >>> >>> I am trying to figure out what is the problem here. >>> >>> >>> Currently running U
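A reverse-proxy vhost along the following lines illustrates the Apache settings that usually matter for this kind of 403-on-special-characters problem; the host names and ports are placeholders, not taken from the thread. The idea is that the proxy must pass the request URI and the Authorization header through unmodified, otherwise the signature computed by radosgw no longer matches the one the S3 client calculated:

    <VirtualHost *:80>
        ServerName s3.example.com

        # Do not decode/re-encode the path on its way through the proxy,
        # otherwise escaped characters in object names break the S3 signature.
        AllowEncodedSlashes NoDecode
        ProxyRequests Off
        ProxyPreserveHost On

        # 'nocanon' keeps mod_proxy from canonicalising the URL before
        # forwarding it to the radosgw host.
        ProxyPass / http://radosgw1.example.com:80/ nocanon
        ProxyPassReverse / http://radosgw1.example.com:80/
    </VirtualHost>

This is only a sketch of one possible proxy configuration; whether the load balancer in front of the proxies rewrites URLs would still need to be checked separately.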
[ceph-users] release date for 0.80.2
Hi guys,

Was wondering if 0.80.2 is coming any time soon? I am planning an upgrade from Emperor and was wondering if I should wait for 0.80.2 to come out if the release date is pretty soon. Otherwise, I will go for 0.80.1.

Cheers
Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] release date for 0.80.2
Just thought to save some time ))) - Original Message - From: "Wido den Hollander" To: ceph-users@lists.ceph.com Sent: Thursday, 3 July, 2014 12:11:07 PM Subject: Re: [ceph-users] release date for 0.80.2 On 07/03/2014 10:27 AM, Andrei Mikhailovsky wrote: > Hi guys, > > Was wondering if 0.80.2 is coming any time soon? I am planning na > upgrade from Emperor and was wondering if I should wait for 0.80.2 to > come out if the release date is pretty soon. Otherwise, I will go for > the 0.80.1. > Why bother? Upgrading from 0.80.1 to .2 is not that much work. Or is there a specific bug in 0.80.1 which you don't want to run into? > Cheers > Andrei > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] nginx (tengine) and radosgw
Hi David, Do you mind sharing the howto/documentation with examples of configs, etc.? I am tempted to give it a go and replace the Apache reverse proxy that I am currently using. cheers Andrei - Original Message - From: "David Moreau Simard" To: ceph-users@lists.ceph.com Sent: Sunday, 22 June, 2014 2:37:00 AM Subject: Re: [ceph-users] nginx (tengine) and radosgw Hi, I just wanted to chime in and say that I didn’t notice any problems swapping nginx out in favor of tengine. tengine is used as a load balancer that also handles SSL termination. I found that disabling body buffering saves a lot on upload times as well. I took the time to do a post about it and linked this thread: http://dmsimard.com/2014/06/21/a-use-case-of-tengine-a-drop-in-replacement-and-fork-of-nginx/ - David On May 29, 2014, at 12:20 PM, Michael Lukzak < mis...@vp.pl > wrote: Re[2]: [ceph-users] nginx (tengine) and radosgw Hi, Ups, so I don't read carefully a doc... I will try this solution. Thanks! Michael From the docs, you need this setting in ceph.conf (if you're using nginx/tengine): rgw print continue = false This will fix the 100-continue issues. On 5/29/2014 5:56 AM, Michael Lukzak wrote: Re[2]: [ceph-users] nginx (tengine) and radosgw Hi, I'm also use tengine, works fine with SSL (I have a Wildcard). But I have other issue with HTTP 100-Continue. Clients like boto or Cyberduck hangs if they can't make HTTP 100-Continue. IP_REMOVED - - [29/May/2014:11:27:53 +] "PUT /temp/1b6f6a11d7aa188f06f8255fdf0345b4 HTTP/1.1" 100 0 "-" "Boto/2.27.0 Python/2.7.6 Linux/3.13.0-24-generic" Do You have also problem with that? I used for testing oryginal nginx and also have a problem with 100-Continue. Only Apache 2.x works fine. BR, Michael I haven't tried SSL yet. We currently don't have a wildcard certificate for this, so it hasn't been a concern (and our current use case, all the files are public anyway). On 5/20/2014 4:26 PM, Andrei Mikhailovsky wrote: That looks very interesting indeed. I've tried to use nginx, but from what I recall it had some ssl related issues. Have you tried to make the ssl work so that nginx acts as an ssl proxy in front of the radosgw? Cheers Andrei From: "Brian Rak" To: ceph-users@lists.ceph.com Sent: Tuesday, 20 May, 2014 9:11:58 PM Subject: [ceph-users] nginx (tengine) and radosgw I've just finished converting from nginx/radosgw to tengine/radosgw, and it's fixed all the weird issues I was seeing (uploads failing, random clock skew errors, timeouts). The problem with nginx and radosgw is that nginx insists on buffering all the uploads to disk. This causes a significant performance hit, and prevents larger uploads from working. Supposedly, there is going to be an option in nginx to disable this, but it hasn't been released yet (nor do I see anything on the nginx devel list about it). tengine ( http://tengine.taobao.org/ ) is an nginx fork that implements unbuffered uploads to fastcgi. It's basically a drop in replacement for nginx. 
My configuration looks like this:

server {
    listen 80;
    server_name *.rados.test rados.test;
    client_max_body_size 10g;

    # This is the important option that tengine has, but nginx does not
    fastcgi_request_buffering off;

    location / {
        fastcgi_pass_header Authorization;
        fastcgi_pass_request_headers on;
        if ($request_method = PUT ) {
            rewrite ^ /PUT$request_uri;
        }
        include fastcgi_params;
        fastcgi_pass unix:/path/to/ceph.radosgw.fastcgi.sock;
    }

    location /PUT/ {
        internal;
        fastcgi_pass_header Authorization;
        fastcgi_pass_request_headers on;
        include fastcgi_params;
        fastcgi_param CONTENT_LENGTH $content_length;
        fastcgi_pass unix:/path/to/ceph.radosgw.fastcgi.sock;
    }
}

If anyone else is looking to run radosgw without having to run apache, I would recommend you look into tengine :)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
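As mentioned earlier in the thread, radosgw behind nginx/tengine also needs the 100-continue behaviour disabled in ceph.conf. A minimal client section might look like the following; the instance name and socket path are illustrative and must match the fastcgi_pass socket used in the server block:

    [client.radosgw.gateway]
        rgw socket path = /path/to/ceph.radosgw.fastcgi.sock
        rgw print continue = false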
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Andrija,

I've got at least two more stories of a similar nature, one from a friend running a ceph cluster and one from me. Both of our clusters are pretty small. My cluster has only two osd servers with 8 osds each and 3 mons, with an ssd journal per 4 osds. My friend has a cluster of 3 mons and 3 osd servers with 4 osds each, and an ssd per 4 osds as well. Both clusters are connected with 40gbit/s IP over Infiniband links.

We had the same issue while upgrading to firefly. However, we did not add any new disks; we just ran the "ceph osd crush tunables optimal" command right after the upgrade. Both of our clusters were "down" as far as the virtual machines are concerned: all vms crashed because of the lack of IO. It was a bit problematic, taking into account that ceph is typically so great at staying alive during failures and upgrades. So, there seems to be a problem with the upgrade. I wish the devs had added a big note in red letters saying that running this command will likely affect your cluster performance and most likely all your vms will die, so please shut down your vms if you do not want to have data loss.

I've changed the default values to reduce the load during recovery and also to tune a few things performance wise. My settings were:

osd recovery max chunk = 8388608
osd recovery op priority = 2
osd max backfills = 1
osd recovery max active = 1
osd recovery threads = 1
osd disk threads = 2
filestore max sync interval = 10
filestore op threads = 20
filestore_flusher = false

However, this didn't help much. I noticed that shortly after running the tunables command the iowait on my guest vms quickly jumped to 50%, and to 99% a minute later. This happened on all vms at once. During the recovery phase I ran the "rbd -p ls -l" command several times and it took between 20-40 minutes to complete; it typically takes less than 2 seconds when the cluster is not in recovery mode. My mate's cluster had the same tunables apart from the last three, and he saw exactly the same behaviour.

One other thing I've noticed is that somewhere in the docs I've read that running the tunables optimal command should move no more than 10% of your data. However, in both of our cases the status was just over 30% degraded and it took a good part of 9 hours to complete the data reshuffling.

Any comments from the ceph team or other ceph gurus on:

1. What have we done wrong in our upgrade process?
2. What options should we have used to keep our vms alive?

Cheers
Andrei

----- Original Message -----
From: "Andrija Panic"
To: ceph-users@lists.ceph.com
Sent: Sunday, 13 July, 2014 9:54:17 PM
Subject: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

Hi,

after the ceph upgrade (0.72.2 to 0.80.3) I issued "ceph osd crush tunables optimal", and after only a few minutes I added 2 more OSDs to the CEPH cluster... So these 2 changes were more or less done at the same time - rebalancing because of tunables optimal, and rebalancing because of adding the new OSDs...

Result - all VMs living on CEPH storage went mad, effectively no disk access, blocked so to speak. Since this rebalancing took 5h-6h, I had a bunch of VMs down for that long...

Did I do wrong by causing "2 rebalancings" to happen at the same time? Is this behaviour normal, to cause great load on all VMs because they are unable to access CEPH storage effectively?

Thanks for any input...
-- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
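For reference, the throttling values Andrei lists above normally live in the [osd] section of ceph.conf, and the usual companion step before planned rebalancing is to stop OSDs from being marked out. A sketch only -- the values are the ones quoted in the thread, not a recommendation:

    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery threads = 1
        osd recovery op priority = 2

    # before planned maintenance or tunables changes:
    ceph osd set noout
    # ... and once the cluster is healthy again:
    ceph osd unset noout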
Re: [ceph-users] Firefly Upgrade
Quenten,

It has been noted before and I've seen a thread on the mailing list about it. In the long term, I've not noticed a great increase in ram. By that I mean that initially, right after doing the upgrade from emperor to firefly and restarting the osd servers, I did notice about 20-25% more ram usage. However, it has been about a week since I did the upgrade and according to my graphs the memory usage has decreased and is now at about the same level as it was before the upgrade. It does sound strange, but that's the case with me.

Here is my current usage on one of my osd servers:

ps aux |grep ceph-osd
root 23081 5.9 1.3 1889508 334908 ? Ssl Jul12 132:28 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
root 23083 7.7 1.3 2024624 344664 ? Ssl Jul12 171:37 /usr/bin/ceph-osd --cluster=ceph -i 6 -f
root 23152 4.4 1.4 1857568 348068 ? Ssl Jul12 99:04 /usr/bin/ceph-osd --cluster=ceph -i 3 -f
root 23222 4.4 1.0 1807564 254108 ? Ssl Jul12 98:01 /usr/bin/ceph-osd --cluster=ceph -i 8 -f
root 23295 4.5 1.1 1774380 272012 ? Ssl Jul12 100:10 /usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 23369 3.8 1.0 1689284 257152 ? Ssl Jul12 84:09 /usr/bin/ceph-osd --cluster=ceph -i 2 -f
root 23434 7.0 1.2 1963112 299424 ? Ssl Jul12 156:09 /usr/bin/ceph-osd --cluster=ceph -i 7 -f
root 23513 6.2 1.1 1885832 283804 ? Ssl Jul12 137:45 /usr/bin/ceph-osd --cluster=ceph -i 1 -f
root 23545 6.0 1.0 1819448 258408 ? Ssl Jul12 134:23 /usr/bin/ceph-osd --cluster=ceph -i 5 -f

Cheers
Andrei

----- Original Message -----
From: "Quenten Grasso"
To: "ceph-users"
Sent: Monday, 14 July, 2014 11:37:07 AM
Subject: [ceph-users] Firefly Upgrade

Hi All,

Just a quick question for the list, has anyone seen a significant increase in ram usage since firefly? I upgraded from 0.72.2 to 80.3 and now all of my Ceph servers are using about double the ram they used to. The only other significant change to our setup was an upgrade to Kernel 3.13.0-30-generic #55~precise1-Ubuntu SMP.

Any ideas?

Regards,
Quenten

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
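For anyone wanting to compare memory use the same way before and after an upgrade, summing the resident set size of all ceph-osd processes on a node is a quick check (a rough sketch, output in megabytes):

    ps -C ceph-osd -o rss= | awk '{ sum += $1 } END { printf "%.0f MB\n", sum/1024 }'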
[ceph-users] Ceph with Multipath ISCSI
Hello guys, I was wondering if there has been any progress on getting multipath iscsi play nicely with ceph? I've followed the how to and created a single path iscsi over ceph rbd with XenServer. However, it would be nice to have a built in failover using iscsi multipath to another ceph mon or osd server. Cheers Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
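For context, one common way to build such a gateway at the time of this thread is to map the image with the kernel RBD client on a gateway host and export it over iSCSI with tgt. The target and image names below are made up, and keeping two active gateways consistent for multipath failover is exactly the hard part being asked about here, so treat this as a single-gateway sketch only:

    # on a gateway host (pool/image names are examples)
    rbd map rbd/iscsi-vol1
    tgtadm --lld iscsi --mode target --op new --tid 1 \
           --targetname iqn.2014-07.com.example:rbd.iscsi-vol1
    tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
           --backing-store /dev/rbd/rbd/iscsi-vol1
    tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL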
Re: [ceph-users] Working ISCSI target guide
Drew,

I would not use iSCSI with KVM. Instead, I would use the built-in RBD support. However, you would use something like nfs/iscsi if you were to connect other hypervisors to a ceph backend. Having failover capabilities is important here ))

Andrei

--
Andrei Mikhailovsky
Director
Arhont Information Security

Web: http://www.arhont.com http://www.wi-foo.com
Tel: +44 (0)870 4431337
Fax: +44 (0)208 429 3111
PGP: Key ID - 0x2B3438DE
PGP: Server - keyserver.pgp.com

DISCLAIMER

The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy.

----- Original Message -----
From: "Drew Weaver"
To: "ceph-users@lists.ceph.com"
Sent: Tuesday, 15 July, 2014 2:18:53 PM
Subject: Re: [ceph-users] Working ISCSI target guide

One other question: if you are going to be using Ceph as a storage system for KVM virtual machines, does it even matter if you use ISCSI or not? Meaning that if you are just going to use LVM and have several hypervisors sharing that same VG, then using ISCSI isn’t really a requirement unless you are using a Hypervisor like ESXi which only works with ISCSI/NFS, correct?

Thanks,
-Drew

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Drew Weaver
Sent: Tuesday, July 15, 2014 9:03 AM
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] Working ISCSI target guide

Does anyone have a guide or reproducible method of getting multipath ISCSI working in front of ceph? Even if it just means having two front-end ISCSI targets, each with access to the same underlying Ceph volume? This seems like a super popular topic.

Thanks,
-Drew

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Quenten, We've got two monitors sitting on the osd servers and one on a different server. Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Quenten Grasso" To: "Andrija Panic" , "Sage Weil" Cc: ceph-users@lists.ceph.com Sent: Wednesday, 16 July, 2014 1:20:19 PM Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time Hi Sage, Andrija & List I have seen the tuneables issue on our cluster when I upgraded to firefly. I ended up going back to legacy settings after about an hour as my cluster is of 55 3TB OSD’s over 5 nodes and it decided it needed to move around 32% of our data, which after an hour all of our vm’s were frozen and I had to revert the change back to legacy settings and wait about the same time again until our cluster had recovered and reboot our vms. (wasn’t really expecting that one from the patch notes) Also our CPU usage went through the roof as well on our nodes, do you per chance have your metadata servers co-located on your osd nodes as we do? I’ve been thinking about trying to move these to dedicated nodes as it may resolve our issues. Regards, Quenten From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrija Panic Sent: Tuesday, 15 July 2014 8:38 PM To: Sage Weil Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time Hi Sage, since this problem is tunables-related, do we need to expect same behavior or not when we do regular data rebalancing caused by adding new/removing OSD? I guess not, but would like your confirmation. I'm already on optimal tunables, but I'm afraid to test this by i.e. shuting down 1 OSD. Thanks, Andrija On 14 July 2014 18:18, Sage Weil < sw...@redhat.com > wrote: I've added some additional notes/warnings to the upgrade and release notes: https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451 If there is somewhere else where you think a warning flag would be useful, let me know! Generally speaking, we want to be able to cope with huge data rebalances without interrupting service. It's an ongoing process of improving the recovery vs client prioritization, though, and removing sources of overhead related to rebalancing... and it's clearly not perfect yet. :/ sage On Sun, 13 Jul 2014, Andrija Panic wrote: > Hi, > after seting ceph upgrade (0.72.2 to 0.80.3) I have issued "ceph osd crush > tunables optimal" and after only few minutes I have added 2 more OSDs to the > CEPH cluster... > > So these 2 changes were more or a less done at the same time - rebalancing > because of tunables optimal, and rebalancing because of adding new OSD... > > Result - all VMs living on CEPH storage have gone mad, no disk access > efectively, blocked so to speak. 
> > Since this rebalancing took 5h-6h, I had bunch of VMs down for that long... > > Did I do wrong by causing "2 rebalancing" to happen at the same time ? > Is this behaviour normal, to cause great load on all VMs because they are > unable to access CEPH storage efectively ? > > Thanks for any input... > -- > > Andrija Pani? > > -- Andrija Panić -- http://admintweets.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Sage, would it help if you add a cache pool to your cluster? Let's say if you add a few TBs of ssds acting as a cache pool to your cluster, would this help with retaining IO to the guest vms during data recovery or reshuffling? Over the past year and a half that we've been using ceph we had a positive experience for the majority of time. The only downtime we had for our vms was when ceph is doing recovery. It seems that regardless of the tuning options we've used, our vms are still unable to get IO, they get to 98-99% iowait and freeze. This has happened on dumpling, emperor and now firefly releases. Because of this I've set noout flag on my cluster and have to keep an eye on the osds for manual intervention, which is far from ideal case (((. Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Sage Weil" To: "Gregory Farnum" Cc: ceph-users@lists.ceph.com Sent: Thursday, 17 July, 2014 1:06:52 AM Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time On Wed, 16 Jul 2014, Gregory Farnum wrote: > On Wed, Jul 16, 2014 at 4:45 PM, Craig Lewis > wrote: > > One of the things I've learned is that many small changes to the cluster > > are > > better than one large change. Adding 20% more OSDs? Don't add them all at > > once, trickle them in over time. Increasing pg_num & pgp_num from 128 to > > 1024? Go in steps, not one leap. > > > > I try to avoid operations that will touch more than 20% of the disks > > simultaneously. When I had journals on HDD, I tried to avoid going over 10% > > of the disks. > > > > > > Is there a way to execute `ceph osd crush tunables optimal` in a way that > > takes smaller steps? > > Unfortunately not; the crush tunables are changes to the core > placement algorithms at work. Well, there is one way, but it is only somewhat effective. If you decompile the crush maps for bobtail vs firefly the actual difference is tunable chooseleaf_vary_r 1 and this is written such that a value of 1 is the optimal 'new' way, 0 is the legacy old way, but values > 1 are less-painful steps between the two (though mostly closer to the firefly value of 1). So, you could set tunable chooseleaf_vary_r 4 wait for it to settle, and then do tunable chooseleaf_vary_r 3 ...and so forth down to 1. I did some limited testing of the data movement involved and noted it here: https://github.com/ceph/ceph/commit/37f840b499da1d39f74bfb057cf2b92ef4e84dc6 In my test case, going from 0 to 4 was about 1/10th as bad as going straight from 0 to 1, but the final step from 2 to 1 is still about 1/2 as bad. I'm not sure if that means it's not worth the trouble of not just jumping straight to the firefly tunables, or whether it means legacy users should just set (and leave) this at 2 or 3 or 4 and get almost all the benefit without the rebalance pain. 
sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
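For anyone wanting to try the stepped approach Sage describes above, the chooseleaf_vary_r value can be edited by decompiling and recompiling the crush map; the file names here are arbitrary:

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt and set, for example:  tunable chooseleaf_vary_r 4
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    # wait for the cluster to settle, then repeat with 3, 2 and finally 1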
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Comments inline - Original Message - From: "Sage Weil" To: "Quenten Grasso" Cc: ceph-users@lists.ceph.com Sent: Thursday, 17 July, 2014 4:44:45 PM Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time On Thu, 17 Jul 2014, Quenten Grasso wrote: > Hi Sage & List > > I understand this is probably a hard question to answer. > > I mentioned previously our cluster is co-located MON?s on OSD servers, which > are R515?s w/ 1 x AMD 6 Core processor & 11 3TB OSD?s w/ dual 10GBE. > > When our cluster is doing these busy operations and IO has stopped as in my > case, I mentioned earlier running/setting tuneable to optimal or heavy > recovery > > operations is there a way to ensure our IO doesn?t get completely > blocked/stopped/frozen in our vms? > > Could it be as simple as putting all 3 of our mon servers on baremetal > w/ssd?s? (I recall reading somewhere that a mon disk was doing several > thousand IOPS during a recovery operation) > > I assume putting just one on baremetal won?t help because our mon?s will only > ever be as fast as our slowest mon server? I don't think this is related to where the mons are (most likely). The big question for me is whether IO is getting completely blocked, or just slowed enough that the VMs are all timing out. AM: I was looking at the cluster status while the rebalancing was taking place and I was seeing very little client IO reported by ceph -s output. The numbers were around 20-100 whereas our typical IO for the cluster is around 1000. Having said that, this was not enough as _all_ of our vms become unresponsive and didn't recover after rebalancing finished. What slow request messages did you see during the rebalance? AM: As I was experimenting with different options while trying to gain some client IO back i've noticed that when I am limiting the options to 1 per osd ( osd max backfills = 1, osd recovery max active = 1, osd recovery threads = 1), I did not have any slow or blocked requests at all. Increasing these values did produce some blocked requests occasionally, but they were being quickly cleared. What were the op latencies? AM: In general, the latencies were around 5-10 higher compared to the normal cluster ops. The second column of the "ceph osd perf" was around 50s where as it is typically between 3-10. It did occasionally jump to some crazy numbers like 2000-3000 on several osds, but only for 5-10 seconds. It's possible there is a bug here, but it's also possible the cluster is just operating close enough to capacity that the additional rebalancing work pushes it into a place where it can't keep up and the IO latencies are too high. AM: My cluster in particular is under-utilised for the majority of time. I do not typically see osds more than 20-30% utilised and our ssd journals are usually less than 10% utilised. Or that we just have more work to do prioritizing requests.. but it's hard to say without more info. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
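The per-OSD limits discussed above can also be changed at runtime, without editing ceph.conf or restarting daemons, which is handy while watching how the cluster reacts during a rebalance (a sketch; the values are the ones from the thread):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'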
[ceph-users] feature set mismatch after upgrade from Emperor to Firefly
Hello guys,

I have noticed the following message/error after upgrading to firefly. Does anyone know what needs doing to correct it?

Thanks
Andrei

[ 25.911055] libceph: mon1 192.168.168.201:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
[ 25.911698] libceph: mon1 192.168.168.201:6789 socket error on read
[ 35.913049] libceph: mon2 192.168.168.13:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
[ 35.913694] libceph: mon2 192.168.168.13:6789 socket error on read
[ 45.909466] libceph: mon0 192.168.168.200:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
[ 45.910104] libceph: mon0 192.168.168.200:6789 socket error on read

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] feature set mismatch after upgrade from Emperor to Firefly
Thanks guys, I am trying the 3.15 kernel to see how it works.

Andrei

--
Andrei Mikhailovsky
Director
Arhont Information Security

Web: http://www.arhont.com http://www.wi-foo.com
Tel: +44 (0)870 4431337
Fax: +44 (0)208 429 3111
PGP: Key ID - 0x2B3438DE
PGP: Server - keyserver.pgp.com

DISCLAIMER

The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy.

----- Original Message -----
From: "Ilya Dryomov"
To: "Irek Fasikhov"
Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com
Sent: Sunday, 20 July, 2014 7:55:49 PM
Subject: Re: [ceph-users] feature set mismatch after upgrade from Emperor to Firefly

On Sun, Jul 20, 2014 at 10:29 PM, Irek Fasikhov wrote:
> Hi, Andrei.
>
> ceph osd getcrushmap -o /tmp/crush
> crushtool -i /tmp/crush --set-chooseleaf_vary_r 0 -o /tmp/crush.new
> ceph osd setcrushmap -i /tmp/crush.new
>
> Or
>
> update to kernel 3.15.
>
> 2014-07-20 20:19 GMT+04:00 Andrei Mikhailovsky :
>>
>> Hello guys,
>>
>> I have noticed the following message/error after upgrading to firefly.
>> Does anyone know what needs doing to correct it?
>>
>> Thanks
>> Andrei
>>
>> [ 25.911055] libceph: mon1 192.168.168.201:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
>> [ 25.911698] libceph: mon1 192.168.168.201:6789 socket error on read
>> [ 35.913049] libceph: mon2 192.168.168.13:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
>> [ 35.913694] libceph: mon2 192.168.168.13:6789 socket error on read
>> [ 45.909466] libceph: mon0 192.168.168.200:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
>> [ 45.910104] libceph: mon0 192.168.168.200:6789 socket error on read

Your kernel is missing the TUNABLES2 and TUNABLES3 feature bits. For the latter, do what Irek said. To deal with the former the easiest thing is to upgrade to 3.9 or later, but if that's not acceptable to you, try

ceph osd crush tunables legacy

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
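To confirm what a cluster is actually running after changes like the above (including chooseleaf_vary_r), the current crush tunables can be printed with:

    ceph osd crush show-tunables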
Re: [ceph-users] Ceph and Infiniband
Ricardo, Thought to share my testing results. I've been using IPoIB with ceph for quite some time now. I've got QDR osd/mon/client servers to serve rbd images to kvm hypervisor. I've done some performance testing using both rados and guest vm benchmarks while running the last three stable versions of ceph. My conclusion was that ceph itself needs to mature and/or be optimised in order to utilise the capabilities of the infiniband link. In my experience, I was not able to reach the limits of the network speeds reported to me by the network performance monitoring tools. I was struggling to push data throughput beyond 1.5GB/s while using between 2 and 64 concurrent tests. This was the case when the benchmark data was using the same data over and over again and the data was cached on the osd servers and was coming directly from server's ram without any access to the osds themselves. My ipoib network performance tests were showing on average 2.5-3GB/s with peaks reaching 3.3GB/s over ipoib. It would be nice to see how ceph is performing over rdma ))). Having said this, perhaps my test gear is somewhat limited or my ceph optimisation was not done correctly. I had 2 osd servers with 8 osds each and three clients running guest vms and rados benchmarks. None of the benchmarks were able to fully utilise the server resources. my osd servers were running on about 50% utilisation during the tests. So, I had to conclude that unless you are running a large cluster with some specific data sets that utilise multithreading you will probably not need to have an infiniband link. A single thread performance for the cold data will be limited to about 1/2 of the speed of a single osd device. So, if your osds are running 150MB/s do not expect to have a single thread faster than 70-80MB/s. On the other hand, if you utilise high performance gear, like cache cards capable of achieving speeds of over gigabytes per second, perhaps infiniband link might be of use. Not sure if the ceph-osd process is capable of "spitting" out this amount of data though. You might be having a CPU bottleneck. Andrei - Original Message - From: "Sage Weil" To: "Riccardo Murri" Cc: ceph-users@lists.ceph.com Sent: Tuesday, 22 July, 2014 9:42:56 PM Subject: Re: [ceph-users] Ceph and Infiniband On Tue, 22 Jul 2014, Riccardo Murri wrote: > Hello, > > a few questions on Ceph's current support for Infiniband > > (A) Can Ceph use Infiniband's native protocol stack, or must it use > IP-over-IB? Google finds a couple of entries in the Ceph wiki related > to native IB support (see [1], [2]), but none of them seems finished > and there is no timeline. > > [1]: > https://wiki.ceph.com/Planning/Blueprints/Emperor/msgr%3A_implement_infiniband_support_via_rsockets > > [2]: http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger This is work in progress. We hope to get basic support into the tree in the next couple of months. > (B) Can we connect to the same Ceph cluster from Infiniband *and* > Ethernet? Some clients do only have Ethernet and will not be > upgraded, some others would have QDR Infiniband -- we would like both > sets to access the same storage cluster. This is further out. Very early refactoring to make this work in wip-addr. > (C) I found this old thread about Ceph's performance on 10GbE and > Infiniband: are the issues reported there still current? > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6816 No idea! :) sage > > > Thanks for any hint! 
>
> Riccardo
>
> --
> Riccardo Murri
> http://www.s3it.uzh.ch/about/team/
>
> S3IT: Services and Support for Science IT
> University of Zurich
> Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
> Tel: +41 44 635 4222
> Fax: +41 44 635 6888
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
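For anyone repeating this kind of test, the multi-threaded throughput numbers Andrei mentions are the sort of thing rados bench produces; a typical run looks roughly like this (pool name, duration and thread count are just examples):

    rados bench -p testpool 60 write -t 32 --no-cleanup
    rados bench -p testpool 60 seq -t 32
    rados -p testpool cleanup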
[ceph-users] Using Crucial MX100 for journals or cache pool
Hello guys, Was wondering if anyone has tried using the Crucial MX100 ssds either for osd journals or cache pool? It seems like a good cost effective alternative to the more expensive drives and read/write performance is very good as well. Thanks -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Using Crucial MX100 for journals or cache pool
Thanks for your comments. Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Christian Balzer" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Friday, 1 August, 2014 10:41:09 AM Subject: Re: [ceph-users] Using Crucial MX100 for journals or cache pool On Fri, 1 Aug 2014 09:38:34 +0100 (BST) Andrei Mikhailovsky wrote: > Hello guys, > > Was wondering if anyone has tried using the Crucial MX100 ssds either > for osd journals or cache pool? It seems like a good cost effective > alternative to the more expensive drives and read/write performance is > very good as well. > If you're going purely by price, a 128GB MX100 doesn't have much of an advantage over a 120GB Intel DC S3500. While the endurance is given as 72TB for all MX100 models, it increases with size for the Intel ones, 275TB for the 480GB model. So a while a 512GB MX100 is cheaper, it compares very poorly to a 480GB DC S3500 when it comes to TBW/$. And of course when it comes to the _consistent_ performance of the DC S3700 SSDs, nothing more needs to be said than the articles David referred to. That's what makes them so well suited for journal operations. If you're looking for a low cost cache pool, keep in mind that the warranty of the Crucial SSDs is just 3 years. If you stick to that time frame, that's about 65GB writes per day. This might be enough, put it is probably a lot harder to predict write loads for a cache pool unlike with a journal. >From my understanding doing something like a backup of your actual pool would write everything to the cache pool in that process. Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
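The ~65 GB/day figure mentioned above follows directly from spreading the rated endurance over the warranty period, roughly:

    72 TB total writes / (3 years x 365 days) = 72000 GB / 1095 days ~ 66 GB of writes per day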
[ceph-users] cache pools on hypervisor servers
Hello guys,

I was hoping to get some answers on how ceph would behave when I install SSDs at the hypervisor level and use them as a cache pool.

Let's say I've got 10 kvm hypervisors and I install one 512GB ssd in each server. I then create a cache pool for my storage cluster using these ssds. My questions are:

1. How would the network IO flow when I am performing reads and writes on the virtual machines? Would writes get stored on the hypervisor's ssd disk right away, or would the writes be directed to the osd servers first and then redirected back to the cache pool on the hypervisor's ssd? Similarly, would reads go to the osd servers and then be redirected to the cache pool on the hypervisors?

2. Would the majority of network traffic shift to the cache pool level and stay at the hypervisor level rather than the hypervisor / osd server level?

Many thanks

Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Anyone have an idea of how it works?

Thanks

----- Original Message -----
From: "Andrei Mikhailovsky"
To: ceph-users@lists.ceph.com
Sent: Monday, 4 August, 2014 10:10:03 AM
Subject: [ceph-users] cache pools on hypervisor servers

Hello guys,

I was hoping to get some answers on how ceph would behave when I install SSDs at the hypervisor level and use them as a cache pool. Let's say I've got 10 kvm hypervisors and I install one 512GB ssd in each server. I then create a cache pool for my storage cluster using these ssds. My questions are:

1. How would the network IO flow when I am performing reads and writes on the virtual machines? Would writes get stored on the hypervisor's ssd disk right away, or would the writes be directed to the osd servers first and then redirected back to the cache pool on the hypervisor's ssd? Similarly, would reads go to the osd servers and then be redirected to the cache pool on the hypervisors?

2. Would the majority of network traffic shift to the cache pool level and stay at the hypervisor level rather than the hypervisor / osd server level?

Many thanks

Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Robert, thanks for your reply, please see my comments inline - Original Message - > From: "Robert van Leeuwen" > To: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com > Sent: Wednesday, 13 August, 2014 6:57:57 AM > Subject: RE: cache pools on hypervisor servers > > I was hoping to get some answers on how would ceph behaive when I install > > SSDs on the hypervisor level and use them as cache pool. > > Let's say I've got 10 kvm hypervisors and I install one 512GB ssd on each > > server. > >I then create a cache pool for my storage cluster using these ssds. My > >questions are: > > > >1. How would the network IO flow when I am performing read and writes on the > >virtual machines? Would writes get stored on the hypervisor's ssd disk > >right away or would the rights be directed to the osd servers >first and > >then redirected back to the cache pool on the hypervisor's ssd? Similarly, > >would reads go to the osd servers and then redirected to the cache pool on > >the hypervisors? > You would need to make an OSD of your hypervisors. > Data would be "striped" across all hypervisors in the cache pool. > So you would shift traffic from: > hypervisors -> dedicated ceph OSD pool > to > hypervisors -> hypervisors running a OSD with SSD > Note that the local OSD also has to to do OSD replication traffic so you are > increasing the network load on the hypervisors by quite a bit. Personally I am not worried too much about the hypervisor - hypervisor traffic as I am using a dedicated infiniband network for storage. It is not used for the guest to guest or the internet traffic or anything else. I would like to decrease or at least smooth out the traffic peaks between the hypervisors and the SAS/SATA osd storage servers. I guess the ssd cache pool would enable me to do that as the eviction rate should be more structured compared to the random io writes that guest vms generate. > > Would majority of network traffic shift to the cache pool level and stay at > > the hypervisors level rather than hypervisor / osd server level? > I guess it depends on your access patterns and how much data needs to be > migrated back and forth to the regular storage. > I'm very interested in the effect of caching pools in combination with > running VMs on them so I'd be happy to hear what you find ;) I will give it a try and share back the results when we get the ssd kit. > As a side note: Running OSDs on hypervisors would not be my preferred choice > since hypervisor load might impact Ceph performance. Do you think it is not a good idea even if you have a lot of cores on the hypervisors? Like 24 or 32 per host server? According to my monitoring, our osd servers are not that stressed and generally have over 50% of free cpu power. Having said this, ssd osds will generate more io and throughput compared to the sas/sata osds, so the cpu load might be higher. Not really sure here. > I guess you can end up with pretty weird/unwanted results when your > hypervisors get above a certain load threshold. > I would certainly test a lot with high loads before putting it in > production... Definitely! > Cheers, > Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Hi guys, Could someone from the ceph team please comment on running osd cache pool on the hypervisors? Is this a good idea, or will it create a lot of performance issues? Anyone in the ceph community that has done this? Any results to share? Many thanks Andrei - Original Message - > From: "Robert van Leeuwen" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Thursday, 14 August, 2014 9:31:24 AM > Subject: RE: cache pools on hypervisor servers > > Personally I am not worried too much about the hypervisor - hypervisor > > traffic as I am using a dedicated infiniband network for storage. > > It is not used for the guest to guest or the internet traffic or anything > > else. I would like to decrease or at least smooth out the traffic peaks > > between the hypervisors and the SAS/SATA osd storage servers. > > I guess the ssd cache pool would enable me to do that as the eviction rate > > should be more structured compared to the random io writes that guest vms > > generate. > Sounds reasonable > >>I'm very interested in the effect of caching pools in combination with > >>running VMs on them so I'd be happy to hear what you find ;) > > I will give it a try and share back the results when we get the ssd kit. > Excellent, looking forward to it. > >> As a side note: Running OSDs on hypervisors would not be my preferred > >> choice since hypervisor load might impact Ceph performance. > > Do you think it is not a good idea even if you have a lot of cores on the > > hypervisors? > > Like 24 or 32 per host server? > > According to my monitoring, our osd servers are not that stressed and > > generally have over 50% of free cpu power. > The number of cores do not really matter if they are all busy ;) > I honestly do not know how Ceph behaves when it is CPU starved but I guess it > might not be pretty. > Since your whole environment will be crumbling down if your storage becomes > unavailable it is not a risk I would take lightly. > Cheers, > Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Thanks a lot for your input. I will proceed with putting the cache pool on the storage layer instead. Andrei - Original Message - > From: "Sage Weil" > To: "Andrei Mikhailovsky" > Cc: "Robert van Leeuwen" , > ceph-users@lists.ceph.com > Sent: Thursday, 14 August, 2014 6:33:25 PM > Subject: Re: [ceph-users] cache pools on hypervisor servers > > On Thu, 14 Aug 2014, Andrei Mikhailovsky wrote: > > Hi guys, > > > > Could someone from the ceph team please comment on running osd cache pool > > on > > the hypervisors? Is this a good idea, or will it create a lot of > > performance > > issues? > > It doesn't sound like an especially good idea. In general you want the > cache pool to be significantly faster than the base pool (think PCI > attached flash). And there won't be any particular affinity to the host > where the VM consuming the sotrage happens to be, so I don't think there > is a reason to put the flash in the hypervisor nodes unless there simply > isn't anywhere else to put them. > > Probably what you're after is a client-side write-thru cache? There is > some ongoing work to build this into qemu and possibly librbd, but nothing > is ready yet that I know of. > > sage > > > > > > Anyone in the ceph community that has done this? Any results to share? > > > > Many thanks > > > > Andrei > > > > > > From: "Robert van Leeuwen" > > To: "Andrei Mikhailovsky" > > Cc: ceph-users@lists.ceph.com > > Sent: Thursday, 14 August, 2014 9:31:24 AM > > Subject: RE: cache pools on hypervisor servers > > > > > Personally I am not worried too much about the hypervisor - > > hypervisor traffic as I am using a dedicated infiniband network for > > storage. > > > It is not used for the guest to guest or the internet traffic or > > anything else. I would like to decrease or at least smooth out the > > traffic peaks between the hypervisors and the SAS/SATA osd storage > > servers. > > > I guess the ssd cache pool would enable me to do that as the > > eviction rate should be more structured compared to the random io > > writes that guest vms generate. Sounds reasonable > > > > >>I'm very interested in the effect of caching pools in combination > > with running VMs on them so I'd be happy to hear what you find ;) > > > I will give it a try and share back the results when we get the ssd > > kit. > > Excellent, looking forward to it. > > > > > > >> As a side note: Running OSDs on hypervisors would not be my > > preferred choice since hypervisor load might impact Ceph performance. > > > Do you think it is not a good idea even if you have a lot of cores > > on the hypervisors? > > > Like 24 or 32 per host server? > > > According to my monitoring, our osd servers are not that stressed > > and generally have over 50% of free cpu power. > > > > The number of cores do not really matter if they are all busy ;) > > I honestly do not know how Ceph behaves when it is CPU starved but I > > guess it might not be pretty. > > Since your whole environment will be crumbling down if your storage > > becomes unavailable it is not a risk I would take lightly. > > > > Cheers, > > Robert van Leeuwen > > > > > > > > > > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
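For completeness, attaching an SSD pool as a writeback cache tier on the storage side (the approach settled on above) is done along these lines in firefly; the pool names are examples and the thresholds need sizing for the actual working set:

    ceph osd tier add rbd ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay rbd ssd-cache
    ceph osd pool set ssd-cache hit_set_type bloom
    ceph osd pool set ssd-cache target_max_bytes 500000000000   # example: ~500 GB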
Re: [ceph-users] Serious performance problems with small file writes
Hugo,

I would look at setting up a cache pool made of 4-6 ssds to start with. So, if you have 6 osd servers, stick at least 1 ssd disk in each server for the cache pool. It should greatly reduce the osds' stress of writing a large number of small files. Your cluster should become more responsive and the end users' experience should also improve.

I am planning on doing so in the near future, and according to my friend's experience, introducing a cache pool greatly increased the overall performance of his cluster and removed the performance issues he was having during scrubbing/deep-scrubbing/recovery activities. The size of your working data set should determine the size of the cache pool, but in general it will create a nice speedy buffer between your clients and those terribly slow spindles.

Andrei

----- Original Message -----
From: "Hugo Mills"
To: "Dan Van Der Ster"
Cc: "Ceph Users List"
Sent: Wednesday, 20 August, 2014 4:54:28 PM
Subject: Re: [ceph-users] Serious performance problems with small file writes

Hi, Dan,

Some questions below I can't answer immediately, but I'll spend tomorrow morning irritating people by triggering these events (I think I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files in it) and giving you more details. For the ones I can answer right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
> Do you get slow requests during the slowness incidents?

Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; oldest blocked for > 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 active+clean; 1
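When chasing slow requests like the ones above, the admin socket on the OSD reporting them can show where each operation is spending its time; run on the host carrying osd.5, for example:

    ceph daemon osd.5 dump_ops_in_flight
    ceph daemon osd.5 dump_historic_ops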
[ceph-users] pool with cache pool and rbd export
Hello guys, I am planning to perform regular rbd pool off-site backups with rbd export and export-diff. I've got a small ceph firefly cluster with an active writeback cache pool made of a couple of osds. I've got the following question which I hope the ceph community could answer: Will these rbd export or import operations affect the active hot data in the cache pool, thus evicting the real hot data used by the clients from the cache pool? Or does the process of rbd export/import affect only the backing osds and not touch the cache pool? Many thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
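For reference, a minimal sketch of the export/export-diff backup cycle under discussion. The pool, image and snapshot names are invented for illustration, and it assumes the destination cluster is reachable over ssh and already has an empty image of the right size to receive the first import-diff:

# initial full copy (a diff from the beginning of time); the destination image
# must already exist with the same size, e.g. created with 'rbd create --size ...'
rbd snap create rbd/vm-disk-1@base
rbd export-diff rbd/vm-disk-1@base - | ssh backuphost rbd import-diff - rbd/vm-disk-1

# later incremental runs only move the blocks changed since the previous snapshot
rbd snap create rbd/vm-disk-1@daily-1
rbd export-diff --from-snap base rbd/vm-disk-1@daily-1 - | \
    ssh backuphost rbd import-diff - rbd/vm-disk-1

Each incremental run only transfers the extents that changed between the two snapshots, which is what makes the off-site copy cheap after the first pass.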
Re: [ceph-users] pool with cache pool and rbd export
So it looks like using rbd export / import will negatively effect the client performance, which is unfortunate. Is this really the case? Any plans on changing this behavior in future versions of ceph? Cheers Andrei - Original Message - From: "Robert LeBlanc" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 8:21:08 PM Subject: Re: [ceph-users] pool with cache pool and rbd export My understanding is that all reads are copied to the cache pool. This would indicate that cache will be evicted. I don't know to what extent this will affect the hot cache because we have not used a cache pool yet. I'm currently looking into bcache fronting the disks to provide caching there. Robert LeBlanc On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > wrote: Hello guys, I am planning to perform regular rbd pool off-site backup with rbd export and export-diff. I've got a small ceph firefly cluster with an active writeback cache pool made of couple of osds. I've got the following question which I hope the ceph community could answer: Will this rbd export or import operations affect the active hot data in the cache pool, thus evicting from the cache pool the real hot data used by the clients. Or does the process of rbd export/import effect only the osds and does not touch the cache pool? Many thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pool with cache pool and rbd export
Does that also mean that scrubbing and deep-scrubbing also squishes data out of the cache pool? Could someone from the ceph community confirm this? Thanks - Original Message - From: "Robert LeBlanc" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 8:21:08 PM Subject: Re: [ceph-users] pool with cache pool and rbd export My understanding is that all reads are copied to the cache pool. This would indicate that cache will be evicted. I don't know to what extent this will affect the hot cache because we have not used a cache pool yet. I'm currently looking into bcache fronting the disks to provide caching there. Robert LeBlanc On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > wrote: Hello guys, I am planning to perform regular rbd pool off-site backup with rbd export and export-diff. I've got a small ceph firefly cluster with an active writeback cache pool made of couple of osds. I've got the following question which I hope the ceph community could answer: Will this rbd export or import operations affect the active hot data in the cache pool, thus evicting from the cache pool the real hot data used by the clients. Or does the process of rbd export/import effect only the osds and does not touch the cache pool? Many thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pool with cache pool and rbd export
Sage, I guess this will be a problem with the cache pool when you do the export for the first time. However, after the first export is done, the diff data will be read and copied across and looking at the cache pool, I would say the diff data will be there anyway as the changes are going to be considered as hot data by the pool as it has been recently changed. So, I do not expect the delta exports squeeze too much data from the cache pool. That is if I got the understanding of how cache pools work. Andrei - Original Message - From: "Sage Weil" To: "Andrei Mikhailovsky" Cc: "Robert LeBlanc" , ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 10:34:24 PM Subject: Re: [ceph-users] pool with cache pool and rbd export On Fri, 22 Aug 2014, Andrei Mikhailovsky wrote: > So it looks like using rbd export / import will negatively effect the > client performance, which is unfortunate. Is this really the case? Any > plans on changing this behavior in future versions of ceph? There will always be some impact from import/export as you are incurring IO load on the system. But yes, in the cache case, this isn't very nice. In master we've added the ability to avoid promoting objects on individual reads unless we've seen some previous activity. This isn't backported to firefly yet (although that is likely). Even so, someone needs to do a bit of testing to verify that the rbd export pattern incurs only a single read on the objects and will avoid a promotion in the general case. sage > > Cheers > > Andrei > > > - Original Message - > From: "Robert LeBlanc" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 22 August, 2014 8:21:08 PM > Subject: Re: [ceph-users] pool with cache pool and rbd export > > > My understanding is that all reads are copied to the cache pool. This would > indicate that cache will be evicted. I don't know to what extent this will > affect the hot cache because we have not used a cache pool yet. I'm currently > looking into bcache fronting the disks to provide caching there. > > > Robert LeBlanc > > > > On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > > Hello guys, > > I am planning to perform regular rbd pool off-site backup with rbd export and > export-diff. I've got a small ceph firefly cluster with an active writeback > cache pool made of couple of osds. I've got the following question which I > hope the ceph community could answer: > > Will this rbd export or import operations affect the active hot data in the > cache pool, thus evicting from the cache pool the real hot data used by the > clients. Or does the process of rbd export/import effect only the osds and > does not touch the cache pool? > > Many thanks > > Andrei > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pool with cache pool and rbd export
Sage, would it not be more effective to separate the data as internal and external in a sense. So all maintenance related activities will be classed as internal (like scrubbing, deep-scrubbing, import and export data, etc) and will not effect the cache and all other activities (really client io) will go through cache? Andrei - Original Message - From: "Sage Weil" To: "Andrei Mikhailovsky" Cc: "Robert LeBlanc" , ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 10:34:24 PM Subject: Re: [ceph-users] pool with cache pool and rbd export On Fri, 22 Aug 2014, Andrei Mikhailovsky wrote: > So it looks like using rbd export / import will negatively effect the > client performance, which is unfortunate. Is this really the case? Any > plans on changing this behavior in future versions of ceph? There will always be some impact from import/export as you are incurring IO load on the system. But yes, in the cache case, this isn't very nice. In master we've added the ability to avoid promoting objects on individual reads unless we've seen some previous activity. This isn't backported to firefly yet (although that is likely). Even so, someone needs to do a bit of testing to verify that the rbd export pattern incurs only a single read on the objects and will avoid a promotion in the general case. sage > > Cheers > > Andrei > > > - Original Message - > From: "Robert LeBlanc" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 22 August, 2014 8:21:08 PM > Subject: Re: [ceph-users] pool with cache pool and rbd export > > > My understanding is that all reads are copied to the cache pool. This would > indicate that cache will be evicted. I don't know to what extent this will > affect the hot cache because we have not used a cache pool yet. I'm currently > looking into bcache fronting the disks to provide caching there. > > > Robert LeBlanc > > > > On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > > Hello guys, > > I am planning to perform regular rbd pool off-site backup with rbd export and > export-diff. I've got a small ceph firefly cluster with an active writeback > cache pool made of couple of osds. I've got the following question which I > hope the ceph community could answer: > > Will this rbd export or import operations affect the active hot data in the > cache pool, thus evicting from the cache pool the real hot data used by the > clients. Or does the process of rbd export/import effect only the osds and > does not touch the cache pool? > > Many thanks > > Andrei > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph rbd image checksums
Hello guys, I am planning to do rbd images off-site backup with rbd export-diff and I was wondering if ceph has checksumming functionality so that I can compare source and destination files for consistency? If so, how do I retrieve the checksum values from the ceph cluster? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd clones and export / export-diff functions
Hello guys, Is it possible to export an rbd image while preserving the clone structure? So, if I've got a single base rbd image and 10 vm images that were cloned from it, would rbd export preserve this structure on the destination pool, or would it waste space and create 10 independent vm rbd images without using the clone? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
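As far as I know, rbd export reads through the clone and writes out a flat, standalone image, so the parent/child relationship is not carried across; if the layout matters, it has to be rebuilt by hand on the destination. A rough sketch of that idea, with invented names and assuming format 2 images:

# ship the parent image once, then recreate the snapshot/clone structure remotely
rbd export rbd/golden-image - | ssh desthost rbd import --image-format 2 - rbd/golden-image
ssh desthost rbd snap create rbd/golden-image@template
ssh desthost rbd snap protect rbd/golden-image@template
ssh desthost rbd clone rbd/golden-image@template rbd/vm-01
ssh desthost rbd clone rbd/golden-image@template rbd/vm-02

Whether the per-vm deltas can then be moved across efficiently with export-diff is a separate question.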
Re: [ceph-users] ceph rbd image checksums
Do rbd export and export-diff, and likewise import and import-diff, guarantee the consistency of the data? So that if the image is "damaged" during the transfer, would this be flagged by the other side? Or would it simply leave the broken image on the destination cluster? Cheers - Original Message - From: "Wido den Hollander" To: ceph-users@lists.ceph.com Sent: Monday, 25 August, 2014 10:31:14 AM Subject: Re: [ceph-users] ceph rbd image checksums On 08/24/2014 08:27 PM, Andrei Mikhailovsky wrote: > Hello guys, > > I am planning to do rbd images off-site backup with rbd export-diff and I was > wondering if ceph has checksumming functionality so that I can compare source > and destination files for consistency? If so, how do I retrieve the checksum > values from the ceph cluster? > That would be rather difficult. There is no checksum, but to have a valid checksum you would need to verify the whole RBD image. That means reading the whole image and calculating the checksum based on that. > Thanks > > Andrei > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
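In the spirit of Wido's answer, one way to get a comparable checksum without writing the image to a file is to stream the export through a hashing tool on both clusters. A sketch with invented names, assuming the same snapshot exists on both sides; note this reads the whole image, so it is slow and will generate cache-tier promotions on a tiered pool:

# on the source cluster
rbd export rbd/vm-disk-1@backup-snap - | md5sum

# on the destination cluster
rbd export rbd/vm-disk-1@backup-snap - | md5sum

# matching sums give reasonable confidence that the transfer was consistent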
Re: [ceph-users] Ceph monitor load, low performance
Off the top of my head, it is recommended to use 3 mons in production. Also, for 22 osds your number of PGs looks a bit low; you should look at that. "The performance of the cluster is poor" - this is too vague. What is your current performance, what benchmarks have you tried, what is your data workload and, most importantly, how is your cluster set up: what disks, ssds, network, ram, etc.? Please provide more information so that people can help you. Andrei - Original Message - From: "Mateusz Skała" To: ceph-users@lists.ceph.com Sent: Monday, 25 August, 2014 2:39:16 PM Subject: [ceph-users] Ceph monitor load, low performance Hello, we have deployed a ceph cluster with 4 monitors and 22 osd's. We are using only rbd's. All VM's on KVM have the monitors specified in the same order. One of the monitors (the first on the list in the vm disk specification - ceph35) has more load than the others and the performance of the cluster is poor. How can we fix this problem? Here is 'ceph -s' output: cluster a9d17295-UUID-1cad7724e97f health HEALTH_OK monmap e4: 4 mons at {ceph15=IP.15:6789/0,ceph25=IP.25:6789/0,ceph30=IP.30:6789/0,ceph35=IP.35:6789/0}, election epoch 5750, quorum 0,1,2,3 ceph15,ceph25,ceph30,ceph35 osdmap e7376: 22 osds: 22 up, 22 in pgmap v3387277: 3072 pgs, 3 pools, 2306 GB data, 587 kobjects 6997 GB used, 12270 GB / 19267 GB avail 3071 active+clean 1 active+clean+scrubbing client io 14849 B/s rd, 2887 kB/s wr, 1044 op/s Thanks for help, -- Best Regards Mateusz ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Two osds are spaming dmesg every 900 seconds
Hello I am seeing this message every 900 seconds on the osd servers. My dmesg output is all filled with: [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state OPEN) [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state OPEN) Looking at the ceph-osd logs I see the following at the same time: 2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 >> 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket is 192.168.168.200:54457/0) This happens only on two osds and the rest of osds seem fine. Does anyone know why am I seeing this and how to correct it? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd export poor performance
Hi, I am running a few tests for exporting volumes with rbd export and noticing very poor performance. It takes almost 3 hours to export 100GB volume. Servers are pretty idle during the export. The performance of the cluster itself is way faster. How can I increase the speed of rbd export? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Two osds are spaming dmesg every 900 seconds
Thanks! i thought it's something serious. Andrei - Original Message - From: "Gregory Farnum" To: "Andrei Mikhailovsky" Cc: "ceph-users" Sent: Tuesday, 26 August, 2014 9:00:06 PM Subject: Re: [ceph-users] Two osds are spaming dmesg every 900 seconds This is being output by one of the kernel clients, and it's just saying that the connections to those two OSDs have died from inactivity. Either the other OSD connections are used a lot more, or aren't used at all. In any case, it's not a problem; just a noisy notification. There's not much you can do about it; sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky wrote: > Hello > > I am seeing this message every 900 seconds on the osd servers. My dmesg > output is all filled with: > > [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state > OPEN) > [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state > OPEN) > > > Looking at the ceph-osd logs I see the following at the same time: > > 2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 >> > 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 > c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket > is 192.168.168.200:54457/0) > > > This happens only on two osds and the rest of osds seem fine. Does anyone > know why am I seeing this and how to correct it? > > Thanks > > Andrei > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Cache pool - step by step guide
Hello guys, I was wondering if someone could point me in the direction of a step-by-step guide on setting up a cache pool. I've seen http://ceph.com/docs/firefly/dev/cache-pool/. However, it makes no mention of the first steps that one needs to take. For instance, I've got my ssd disks plugged into the osd servers. What do I do next? How do I create the initial pool made just of these ssds and promote it to cache pool status? How do I choose the right cache pool sizing, number of PGs, etc.? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache pool - step by step guide
Vlad, thanks for the information. I will review it shortly. I do have SSDs and SAS (not sata) disks in the same box. But I guess there shouldn't be much difference between SAS and SATA. At the moment I am running firefly. I've seen some comments that the master branch has a great deal of improvements introduce to accommodate high IO of the SSDs. Does that apply to the improvements of the cache tier as well? Cheers Andrei - Original Message - From: "Vladislav Gorbunov" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Thursday, 4 September, 2014 1:52:05 AM Subject: Re: [ceph-users] Cache pool - step by step guide You mix sata and ssd disks within the same server? Read this: http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ When you have different pools for sata and ssd configure cache-pool: ceph osd tier add satapool ssdpool ceph osd tier cache-mode ssdpool writeback ceph osd pool set ssdpool hit_set_type bloom ceph osd pool set ssdpool hit_set_count 1 In this example 80-85% of the cache pool is equal to 280GB ceph osd pool set ssdpool target_max_bytes $((280*1024*1024*1024)) ceph osd tier set-overlay satapool ssdpool ceph osd pool set ssdpool hit_set_period 300 ceph osd pool set ssdpool cache_min_flush_age 300 # 10 minutes ceph osd pool set ssdpool cache_min_evict_age 1800 # 30 minutes ceph osd pool set ssdpool cache_target_dirty_ratio .4 ceph osd pool set ssdpool cache_target_full_ratio .8 Remember, that the current stable ceph 0.80.5 cache pool osds crashing when data is evicting to underlying storage pool. See http://tracker.ceph.com/issues/8982 2014-09-04 10:55 GMT+12:00 Andrei Mikhailovsky < and...@arhont.com > : Hello guys, I was wondering if someone could point me in the right direction of a step by step guide on setting up a cache pool. I've seen the http://ceph.com/docs/firefly/dev/cache-pool/ . However, it has no mentioning of the first steps that one need to take. For instance, I've got my ssd disks plugged into the osd servers. What do I do next? How do i create the initial pool made just of these ssds and promote it to the cache pool status. How do i choose the right cache pool sizing, number of PGs, etc. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
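To fill in the "first steps" the original question asked about, here is a rough sketch of one way to give the SSDs their own CRUSH root, rule and pool before running the tier commands above. Bucket, rule and pool names are invented, the osd ids and weights must match the real SSD osds, and Sebastien Han's post linked above walks through the same idea in more detail:

# create a separate CRUSH hierarchy for the ssd osds
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket ceph-node1-ssd host
ceph osd crush add-bucket ceph-node2-ssd host
ceph osd crush move ceph-node1-ssd root=ssd
ceph osd crush move ceph-node2-ssd root=ssd

# place the ssd osds (here assumed to be osd.16 and osd.17) under the new hosts
ceph osd crush set osd.16 0.5 root=ssd host=ceph-node1-ssd
ceph osd crush set osd.17 0.5 root=ssd host=ceph-node2-ssd

# a rule that only selects osds from the ssd root, and a pool that uses it
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool create ssdpool 512 512
ceph osd pool set ssdpool crush_ruleset <ssd-rule-id>   # id from 'ceph osd crush rule dump'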
[ceph-users] Cache pool and using btrfs for ssd osds
Hello guys, I was wondering if there is a benefit in using the journal-less btrfs file system on the cache pool osds? Would it speed up writes to the cache tier? Is btrfs with ceph getting close to production level? Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph and TRIM on SSD disks
Hello guys, I was wondering if it is a good idea to enable TRIM (the discard mount option) on the ssd disks which are used for either the cache pool or the osd journals? For performance, is it better to enable it or to run fstrim from cron every once in a while? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
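A small sketch of the fstrim-from-cron option mentioned above, assuming the SSDs carry mounted filesystems under the usual /var/lib/ceph/osd/ceph-N paths (a raw journal partition has nothing to fstrim, and whether continuous discard or a periodic trim behaves better tends to depend on the SSD model):

# run by hand first to see how much gets trimmed
fstrim -v /var/lib/ceph/osd/ceph-16

# root crontab entry on each osd host: weekly trim of the ssd-backed filesystems
0 3 * * 0  /sbin/fstrim /var/lib/ceph/osd/ceph-16 && /sbin/fstrim /var/lib/ceph/osd/ceph-17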
Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?
Keith, You should consider doing regular rbd volume snapshots and keep them for N amount of hours/days/months depending on your needs. Cheers Andrei - Original Message - From: "Keith Phua" To: ceph-users@lists.ceph.com Cc: y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg Sent: Wednesday, 10 September, 2014 3:22:53 AM Subject: [ceph-users] Best practices on Filesystem recovery on RBD block volume? Dear ceph-users, Recently we had an encounter of a XFS filesystem corruption on a NAS box. After repairing the filesystem, we discover the files were gone. This trigger some questions with regards to filesystem on RBD block which I hope the community can enlighten me. 1. If a local filesystem on a rbd block is corrupted, is it fair to say that regardless of how many replicated copies we specified for the pool, unless the filesystem is properly repaired and recovered, we may not get our data back? 2. If the above statement is true, does it mean that severe filesystem corruption on a RBD block constitute a single point of failure, since filesystems corruption can happened when the RBD client is not properly shutdown or due to a kernel bug? 3. Other than existing best practices for a filesystem recovery, does ceph have any other best practices for filesystem on RBD which we can adopt for data recovery? Thanks in advance. Regards, Keith ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
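A minimal sketch of the snapshot rotation being suggested here, purely illustrative (the image name and retention count are made up, and an orchestration layer would normally do this for you, as mentioned further down the thread):

#!/bin/sh
# snapshot one rbd image with a timestamp and keep only the newest 7 auto snapshots
IMG=rbd/nas-volume
KEEP=7
rbd snap create "$IMG@auto-$(date +%Y%m%d%H%M)"
rbd snap ls "$IMG" | awk '/auto-/ {print $2}' | sort | head -n -"$KEEP" | \
    while read snap; do rbd snap rm "$IMG@$snap"; done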
Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?
Keith, I think the hypervisor / infrastructure orchestration layer should be able to handle proper snapshotting with io freezing. For instance, we use CloudStack and you can set up automatic snapshots and snapshot retention policies. Cheers Andrei - Original Message - From: "Ilya Dryomov" To: "Keith Phua" Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com, y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg Sent: Wednesday, 10 September, 2014 11:51:04 AM Subject: Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume? On Wed, Sep 10, 2014 at 2:45 PM, Keith Phua wrote: > Hi Andrei, > > Thanks for the suggestion. > > But a rbd volume snapshots may only work if the filesystem is in a > consistent state, which mean no IO during snapshotting. With cronjob > snapshotting, usually we have no control over client doing any IOs. Having xfs_freeze -f /mnt xfs_freeze -u /mnt Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
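For completeness, the freeze/snapshot/thaw sequence Ilya is pointing at, sketched for a filesystem mounted directly on an rbd device (paths and names are illustrative; fsfreeze does the same job for non-XFS filesystems, and a hypervisor-driven snapshot achieves the equivalent through the guest agent):

# quiesce the filesystem, snapshot the underlying rbd image, then thaw
xfs_freeze -f /mnt/nas
rbd snap create rbd/nas-volume@consistent-$(date +%Y%m%d)
xfs_freeze -u /mnt/nas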
[ceph-users] Cache Pool writing too much on ssds, poor performance?
Hello guys, I am experimenting with a cache pool and running some tests to see how adding the cache pool improves the overall performance of our small cluster. While testing I've noticed that the cache pool seems to be writing far too much to the cache pool ssds. I am not sure what the issue is here; perhaps someone could help me understand what is going on. My test cluster is: 2 x OSD servers (each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds for journals, 2 ssds for the cache pool, 40gbit/s infiniband network capable of 25gbit/s over ipoib). The cache pool is set to 500GB with a replica count of 2. 4 x host servers (128GB ram, 24 cores, 40gbit/s infiniband network capable of 12gbit/s over ipoib). So, my test is: simple runs of the following command: "dd if=/dev/vda of=/dev/null bs=4M count=2000 iflag=direct". I start this command concurrently on 10 virtual machines which are running on the 4 host servers. The aim is to monitor the use of the cache pool when reading the same data over and over again. Running the above command for the first time does what I was expecting. The osds are doing a lot of reads, the cache pool does a lot of writes (around 250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are poor. The output of "ceph -w" shows consistent performance over time. Running the above for the second and subsequent times produces IO patterns which I was not expecting at all. The hdd osds are not doing much (this part I expected), but the cache pool still does a lot of writes and very few reads! The dd results have improved just a little, but not much. The output of "ceph -w" shows performance breaks over time. For instance, I get a peak of throughput in the first couple of seconds (data is probably coming from the osd servers' ram at a high rate). After the peak throughput has finished, the ceph reads proceed in the following way: 2-3 seconds of activity followed by 2 seconds of inactivity, and it keeps doing that throughout the length of the test. So, to put the numbers in perspective, when running the tests over and over again I would get around 2000-3000MB/s for the first two seconds, followed by 0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 seconds, followed by 0MB/s for 2 seconds, followed by 150-250MB/s over 2-3 seconds, followed by 0MB/s over 2 seconds, and the pattern repeats until the test is done. I ran the dd command about 15-20 times and observed the same behaviour. The cache pool does mainly writes (around 200MB/s per ssd) when the guest vms are reading the same data over and over again. There is very little read IO (around 20-40MB/s). Why am I not getting high read IO? I had expected the 80GB of data that is being read by the vms over and over again to be firmly recognised as the hot data, kept in the cache pool and read from it when the guest vms request the data. Instead, I mainly get writes on the cache pool ssds and I am not really sure where these writes are coming from, as my hdd osds are pretty idle. From the overall tests so far, introducing the cache pool has drastically slowed down my cluster (by as much as 50-60%). Thanks for any help Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?
Hi, I have created the cache tier using the following commands: 95 ceph osd pool create cache-pool-ssd 2048 2048 ; ceph osd pool set cache-pool-ssd crush_ruleset 4 124 ceph osd pool set cache-pool-ssd size 2 126 ceph osd pool set cache-pool-ssd min_size 1 130 ceph osd tier add Primary-ubuntu-1 cache-pool-ssd 131 ceph osd tier cache-mode cache-pool-ssd writeback 132 ceph osd tier set-overlay Primary-ubuntu-1 cache-pool-ssd 135 ceph osd pool set cache-pool-ssd hit_set_type bloom 136 ceph osd pool set cache-pool-ssd hit_set_count 1 137 ceph osd pool set cache-pool-ssd hit_set_period 3600 138 ceph osd pool set cache-pool-ssd target_max_bytes 5000 143 ceph osd pool set cache-pool-ssd cache_target_full_ratio 0.8 SInce the initial install i've increased the target_max_bytes to 800GB. The rest of the settings are left as default. Did I miss something that might explain the behaviour that i am experiencing? Cheers Andrei - Original Message - From: "Xiaoxi Chen" To: "Andrei Mikhailovsky" , "ceph-users" Sent: Thursday, 11 September, 2014 2:00:31 AM Subject: RE: Cache Pool writing too much on ssds, poor performance? Could you show your cache tiering configuration? Especially this three parameters. ceph osd pool set hot-storage cache_target_dirty_ratio 0.4 ceph osd pool set hot-storage cache_target_full_ratio 0.8 ceph osd pool set {cachepool} target_max_bytes {#bytes} From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrei Mikhailovsky Sent: Wednesday, September 10, 2014 8:51 PM To: ceph-users Subject: [ceph-users] Cache Pool writing too much on ssds, poor performance? Hello guys, I am experimeting with cache pool and running some tests to see how adding the cache pool improves the overall performance of our small cluster. While doing testing I've noticed that it seems that the cache pool is writing too much on the cache pool ssds. Not sure what the issue here, perhaps someone could help me understand what is going on. My test cluster is: 2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of 25gbit/s over ipoib). Cache pool is set to 500GB with replica of 2. 4 x host servers (128GB ram, 24 core, 40gbit/s infiniband network capable of 12gbit/s over ipoib) So, my test is: Simple tests using the following command: "dd if=/dev/vda of=/dev/null bs=4M count=2000 iflag=direct". I am concurrently starting this command on 10 virtual machines which are running on 4 host servers. The aim is to monitor the use of cache pool when reading the same data over and over again. Running the above command for the first time does what I was expecting. The osds are doing a lot of reads, the cache pool does a lot of writes (around 250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are poor. The results of the "ceph -w" shows consistent performance across the time. Running the above for the second and consequent times produces IO patterns which I was not expecting at all. The hdd osds are not doing much (this part I expected), the cache pool still does a lot of writes and very little reads! The dd results have improved just a little, but not much. The results of the "ceph -w" shows performance breaks over time. For instance, I have a peak of throughput in the first couple of seconds (data is probably coming from the osd server's ram at high rate). 
After the peak throughput has finished, the ceph reads are done in the following way: 2-3 seconds of activity followed by 2 seconds if inactivity) and it keeps doing that throughout the length of the test. So, to put the numbers in perspective, when running tests over and over again I would get around 2000 - 3000MB/s for the first two seconds, followed by 0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 seconds, followed by 0MB/s for 2 seconds, followed 150-250MB/s over 2-3 seconds, followed by 0MB/s over 2 secods, and the pattern repeats until the test is done. I kept running the dd command for about 15-20 times and observed the same behariour. The cache pool does mainly writes (around 200MB/s per ssd) when guest vms are reading the same data over and over again. There is very little read IO (around 20-40MB/s). Why am I not getting high read IO? I have expected the 80GB of data that is being read from the vms over and over again to be firmly recognised as the hot data and kept in the cache pool and read from it when guest vms request the data. Instead, I mainly get writes on the cache pool ssds and I am not really sure where these writes are coming from as my hdd osds are being pretty idle. >From the overall tests so far, introducing the cache pool has drastically >slowed down my cluster (by as much as 50-60%). Thanks for any help
Re: [ceph-users] Rebalancing slow I/O.
Irek, have you changed the ceph.conf file to adjust the recovery priority? Options like these might help with balancing repair/rebuild IO against client IO: osd_recovery_max_chunk = 8388608 osd_recovery_op_priority = 2 osd_max_backfills = 1 osd_recovery_max_active = 1 osd_recovery_threads = 1 Andrei - Original Message - From: "Irek Fasikhov" To: ceph-users@lists.ceph.com Sent: Thursday, 11 September, 2014 1:07:06 PM Subject: [ceph-users] Rebalancing slow I/O. Hi, All. DELL R720X8, 96 OSDs, Network 2x10Gbit LACP. When one of the nodes crashes, I get very slow I/O operations on virtual machines. The cluster map is the default. [ceph@ceph08 ~]$ ceph osd tree # id weight type name up/down reweight -1 262.1 root defaults -2 32.76 host ceph01 0 2.73 osd.0 up 1 ... 11 2.73 osd.11 up 1 -3 32.76 host ceph02 13 2.73 osd.13 up 1 .. 12 2.73 osd.12 up 1 -4 32.76 host ceph03 24 2.73 osd.24 up 1 35 2.73 osd.35 up 1 -5 32.76 host ceph04 37 2.73 osd.37 up 1 . 47 2.73 osd.47 up 1 -6 32.76 host ceph05 48 2.73 osd.48 up 1 ... 59 2.73 osd.59 up 1 -7 32.76 host ceph06 60 2.73 osd.60 down 0 ... 71 2.73 osd.71 down 0 -8 32.76 host ceph07 72 2.73 osd.72 up 1 83 2.73 osd.83 up 1 -9 32.76 host ceph08 84 2.73 osd.84 up 1 95 2.73 osd.95 up 1 If I change the cluster map to the following: root---| | |-rack1 | | | host ceph01 | host ceph02 | host ceph03 | host ceph04 | |---rack2 | host ceph05 host ceph06 host ceph07 host ceph08 What will the behaviour of the cluster be when one node fails? And how much will it affect the performance? Thank you -- Best regards, Фасихов Ирек Нургаязович Mob.: +79229045757 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
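If editing ceph.conf and restarting the osds is inconvenient, the same throttles can usually be pushed into running osds with injectargs; a sketch using the values listed above, applied cluster-wide:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'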
Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?
Mark, Thanks for a very detailed email. Really apreciate your help on this. I now have a bit more understanding on how it works and understand why I am getting so much write on the cache ssds. I am however, trouble to understand why the cache pool is not keeping the data and flushing it? I've got the pool size of about 7x as large as the current benchmark set (80gb data set vs 500GB pool size), so the benchmark data should fit nicely many times over. I understand if there is a small percentage of the data is a cache miss, but from what it looks like it is missing a considerable amount. Is there a way to check the stats of the cache pool, including hit/miss information and other data? Yes, I am using firefly 0.80.5. Thanks Andrei - Original Message - From: "Mark Nelson" To: ceph-users@lists.ceph.com Sent: Thursday, 11 September, 2014 3:02:40 PM Subject: Re: [ceph-users] Cache Pool writing too much on ssds, poor performance? Something that is very important to keep in mind with the way that the cache tier implementation currently works in Ceph is that cache misses are very expensive. It's really important that your workload have a really big hot/cold data skew otherwise it's not going to work well at all. In your case, you are doing sequential reads which is terrible for this because for each pass you are never re-reading the same blocks, and by the time you get to the end of the test and restart it, the first blocks (apparently) have already been flushed. If you increased the size of the cache tier, you might be able to fit the whole thing in cache which should help dramatically, but that's not easy to guarantee outside of benchmarks. I'm guessing you are using firefly right? To improve this behaviour, Sage implemented a new policy in the recent development releases not to promote reads right away. Instead, we wait to promote until there are several reads of the same object within a certain time period. That should dramatically help in this case. You really don't want big sequential reads being promoted into cache since cache promotion is expensive and the disks are really good at doing that kind of thing anyway. On the flip side, 4MB read misses are bad, but the situation is even worse with 4K misses. Imagine for instance that you are going to do a 4K read from a default 4MB RBD block and that object is not in cache. In the implementation we have in firefly, the whole 4MB object will be promoted to the cache which will in most cases require a transfer of that object over the network to the primary OSD for the associated PG in the cache pool. Now depending on the replication policy, that primary cache pool OSD will fan out and write (by default) 2 extra copies of the data to the other OSDs in the PG, so 3 total. Now assuming your cache tier is on SSDs with co-located journals, each one of those writes will actually be 2 writes, one to the journal, and one to the data store. To recap: *Any* read miss regardless if it's 4KB or 4MB means at least 1 4MB object promotion, times 3 replicas (ie 12MB over the network) times 2 for the journal writes. So 24MB of data written to the cache tier disks, no matter if it's 4KB or 4MB. Imagine you have 200 IOPS worth of 4KB read cache misses. That's roughly 4.8GB/s of writes into the cache tier. If you are seldomly re-reading the same blocks, performance will be absolutely terrible. On the other hand, if you have lots of small random reads from the same set of 4MB objects, the cache tier really can help. 
How much it helps vs just doing the reads from page cache is debatable though. There's some band between page cache and disk where the cache tier fits in, but getting everything right is going to be tricky. The optimal situation imho is that the cache tier only promote objects that have a lot of small random reads hitting them, and be very conservative about promotions, especially for new writes. I don't know whether or not cache promotion might pollute page cache in strange ways, but that's something we also may need to be careful about. For more info, see the following thread: http://www.spinics.net/lists/ceph-devel/msg20189.html Mark On 09/10/2014 07:51 AM, Andrei Mikhailovsky wrote: > Hello guys, > > I am experimeting with cache pool and running some tests to see how > adding the cache pool improves the overall performance of our small cluster. > > While doing testing I've noticed that it seems that the cache pool is > writing too much on the cache pool ssds. Not sure what the issue here, > perhaps someone could help me understand what is going on. > > My test cluster is: > 2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds > journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of > 25gbit/s over ipoib).
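Spelling out the arithmetic behind the "roughly 4.8GB/s" figure in Mark's 4KB read-miss example:

per 4KB read miss:  4 MB object promoted x 3 replicas  = 12 MB moved over the network
journal + data:     12 MB x 2 writes per replica       = 24 MB written to the cache tier SSDs
at 200 misses/s:    24 MB x 200                        = 4800 MB/s, i.e. ~4.8GB/s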
[ceph-users] writing to rbd mapped device produces hang tasks
Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of= bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic . Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0003 0 3505 2 0x Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] 8803a10a7c48 0002 88007963b288 8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 00014540 00014540 880469f63260 880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] 8803a10a7c58 8803a10a7d88 88034d142100 880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444938] [] kjournald2+0xb8/0x240 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444941] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444943] [] ? commit_timeout+0x10/0x10 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444949] [] kthread+0xc9/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444952] [] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444965] [] ret_from_fork+0x7c/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444969] [] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.445141] INFO: task dd:21180 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.446595] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.448070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449910] dd D 0002 0 21180 19562 0x0002 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449913] 880485a1b5d8 0002 880485a1b5e8 880485a1bfd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449916] 00014540 00014540 880341833260 88011086cb90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449919] 880485a1b5a8 88046fc94e40 88011086cb90 81204da0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449921] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449927] [] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449930] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449934] [] io_schedule+0x8f/0xd0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449936] [] sleep_on_buffer+0xe/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449940] [] __wait_on_bit+0x62/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449945] [] ? bio_alloc_bioset+0xa0/0x1d0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449947] [] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449951] [] out_of_line_wait_on_bit+0x7c/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449954] [] ? wake_atomic_t_function+0x40/0x40 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449957] [] __wait_on_buffer+0x2e/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449962] [] ext4_wait_block_bitmap+0xd8/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449969] [] ext4
Re: [ceph-users] writing to rbd mapped device produces hang tasks
Hi guys, Following up on my previous message. I've tried to repeat the same experiment with 3.16.2 kernel and I still have the same behaviour. The dd process got stuck after running dd for 3 times in a row. The iostat shows that the rbd0 device is fully utilised, but without any activity on the device itself: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 135.00 0.00 0.00 0.00 0.00 100.00 I've also tried enabling and disabling rbd cache, which didn't make a difference. Could someone help me with debugging the issue and getting to the root cause? Thanks Andrei - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 12:04:15 AM Subject: [ceph-users] writing to rbd mapped device produces hang tasks Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of= bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0003 0 3505 2 0x Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] 8803a10a7c48 0002 88007963b288 8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 00014540 00014540 880469f63260 880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] 8803a10a7c58 8803a10a7d88 88034d142100 880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444938] [] kjournald2+0xb8/0x240 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444941] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444943] [] ? commit_timeout+0x10/0x10 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444949] [] kthread+0xc9/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444952] [] ? 
flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444965] [] ret_from_fork+0x7c/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444969] [] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.445141] INFO: task dd:21180 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.446595] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.448070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449910] dd D 0002 0 21180 19562 0x0002 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449913] 880485a1b5d8 0002 880485a1b5e8 880485a1bfd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449916] 00014540 00014540 880341833260 88011086cb90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449919] 880485a1b5a8 88046fc94e40 88011086cb90 81204da0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449921] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449927] [] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:
Re: [ceph-users] writing to rbd mapped device produces hang tasks
11:29:56 arh-ibstorage1-ib kernel: [ 1200.472476] [] writeback_sb_inodes+0x22e/0x340 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472479] [] __writeback_inodes_wb+0x9e/0xd0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472483] [] wb_writeback+0x28b/0x330 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472487] [] ? get_nr_dirty_inodes+0x52/0x80 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472490] [] wb_check_old_data_flush+0x9f/0xb0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472493] [] wb_do_writeback+0x134/0x1c0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472496] [] ? set_worker_desc+0x6f/0x80 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472500] [] bdi_writeback_workfn+0x78/0x1f0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472503] [] process_one_work+0x17f/0x4c0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472507] [] worker_thread+0x11b/0x3f0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472510] [] ? create_and_start_worker+0x80/0x80 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472513] [] kthread+0xc9/0xe0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472516] [] ? flush_kthread_worker+0xb0/0xb0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472520] [] ret_from_fork+0x7c/0xb0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472523] [] ? flush_kthread_worker+0xb0/0xb0 Cheers - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 11:34:07 AM Subject: Re: [ceph-users] writing to rbd mapped device produces hang tasks Hi guys, Following up on my previous message. I've tried to repeat the same experiment with 3.16.2 kernel and I still have the same behaviour. The dd process got stuck after running dd for 3 times in a row. The iostat shows that the rbd0 device is fully utilised, but without any activity on the device itself: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 135.00 0.00 0.00 0.00 0.00 100.00 I've also tried enabling and disabling rbd cache, which didn't make a difference. Could someone help me with debugging the issue and getting to the root cause? Thanks Andrei - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 12:04:15 AM Subject: [ceph-users] writing to rbd mapped device produces hang tasks Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of= bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. 
Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0003 0 3505 2 0x Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] 8803a10a7c48 0002 88007963b288 8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 00014540 00014540 880469f63260 880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] 8803a10a7c58 8803a10a7d88 88034d142100 880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.4449
Re: [ceph-users] writing to rbd mapped device produces hang tasks
To answer my own question, I think I am hitting bug 8818 - http://tracker.ceph.com/issues/8818 . The solution seems to be to upgrade to the latest 3.17 kernel branch. Cheers - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 11:38:07 AM Subject: Re: [ceph-users] writing to rbd mapped device produces hang tasks Oh, forgot to show the hung task message, which looks different on the 3.16.2 kernel:
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.467439] INFO: task kworker/u25:2:668 blocked for more than 120 seconds.
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.469032] Not tainted 3.16.2-031602-generic #201409052035
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.470474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472288] kworker/u25:2 D 0004 0 668 2 0x
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472299] Workqueue: writeback bdi_writeback_workfn (flush-251:0)
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472301] 88046713afe8 0046 8804560ccc60 88046713bfd8
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472304] 00014400 00014400 880469fb8000 880464753cc0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472307] 88011022f768 88011022f7c8 88011022f7cc 880464753cc0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472309] Call Trace:
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472313] [] schedule+0x29/0x70
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472316] [] schedule_preempt_disabled+0xe/0x10
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472319] [] __mutex_lock_slowpath+0xd5/0x1c0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472322] [] mutex_lock+0x23/0x37
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472330] [] ceph_osdc_start_request+0x42/0x80 [libceph]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472335] [] rbd_obj_request_submit+0x33/0x70 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472340] [] rbd_img_obj_request_submit+0xaa/0x100 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472344] [] rbd_img_request_submit+0x56/0x80 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472349] [] rbd_request_fn+0x2ac/0x350 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472356] [] __blk_run_queue+0x37/0x50
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472359] [] queue_unplugged+0x3d/0xc0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472364] [] blk_flush_plug_list+0x1d2/0x210
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472369] [] ? __wait_on_buffer+0x30/0x30
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472371] [] io_schedule+0x78/0xd0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472374] [] sleep_on_buffer+0xe/0x20
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472377] [] __wait_on_bit+0x62/0x90
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472381] [] ? bio_alloc_bioset+0xa0/0x1d0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472384] [] ? __wait_on_buffer+0x30/0x30
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472387] [] out_of_line_wait_on_bit+0x7c/0x90
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472393] [] ? wake_atomic_t_function+0x40/0x40
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472395] [] __wait_on_buffer+0x2e/0x30
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472401] [] ext4_wait_block_bitmap+0xd8/0xe0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472406] [] ext4_mb_init_cache+0x1de/0x750
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472411] [] ? pagecache_get_page+0xaa/0x1d0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472414] [] ext4_mb_init_group+0xbe/0x110
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472417] [] ext4_mb_load_buddy+0x380/0x3a0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472420] [] ext4_mb_find_by_goal+0xa3/0x310
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472423] [] ext4_mb_regular_allocator+0x5e/0x450
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472428] [] ? ext4_ext_find_extent+0x220/0x2a0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472431] [] ext4_mb_new_blocks+0x40a/0x540
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472435] [] ? ext4_ext_find_extent+0x120/0x2a0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472438] [] ? ext4_ext_check_overlap.isra.27+0xbc/0xd0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472441] [] ext4_ext_map_blocks+0x973/0xad0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472446] [] ? __ext4_es_shrink+0x210/0x2d0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472450] [] ext4_map_blocks+0x15f/0x550
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472455] [] ? __pagevec_release+0x2c/0x40
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472458] [] mpage_map_one_exten
[ceph-users] Bcache / Enhanceio with osds
Hello guys, Was wondering if anyone has used or done any testing with bcache or enhanceio caching in front of ceph osds? I've got a small cluster of 2 osd servers, 16 osds in total and 4 ssds for journals. I've recently purchased four additional ssds to be used for a ceph cache pool, but I've found the performance of guest vms to be slower with the cache pool in many benchmarks. The write performance has slightly improved, but the read performance has suffered a lot (as much as 60% in some tests). Therefore, I am planning to scrap the cache pool (at least until it matures) and use either bcache or enhanceio instead. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
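For anyone who wants to try the bcache route, a minimal sketch of putting bcache in front of a single OSD data disk looks roughly like the following. It assumes bcache-tools is installed and uses hypothetical device names (/dev/sdb for the OSD spindle, /dev/sdc1 for the SSD cache partition); treat it as a starting point to test on a spare OSD, not a verified procedure.

# format the backing and caching devices (hypothetical names)
make-bcache -B /dev/sdb
make-bcache -C /dev/sdc1

# find the cache set uuid and attach the backing device to it
bcache-super-show /dev/sdc1 | grep cset.uuid
echo <cset-uuid> > /sys/block/sdb/bcache/attach

# optional: switch from the default writethrough to writeback caching
echo writeback > /sys/block/bcache0/bcache/cache_mode

# build the OSD filesystem on the resulting bcache device
mkfs.xfs /dev/bcache0
mount /dev/bcache0 /var/lib/ceph/osd/ceph-0

The journal can stay on its existing SSD partition; only the filestore data path goes through the bcache device.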
[ceph-users] Cache pool stats
Hi Does anyone know how to check the basic cache pool stats, i.e. how well the cache layer is working over a recent or historic time frame? Things like the cache hit ratio would be very helpful. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
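In the absence of a dedicated hit-ratio report, a rough picture can be pieced together from the pool and OSD counters. A minimal sketch, assuming a cache pool named 'cache-pool' (hypothetical) and that your release exposes the tier_* perf counters (the exact counter names can vary between versions):

# per-pool usage and object counts
ceph df detail
rados df

# per-pool read/write rates over time
ceph osd pool stats cache-pool

# cache-tier activity counters on an OSD that backs the cache pool
ceph daemon osd.0 perf dump | grep -E 'tier_(promote|evict|flush|dirty|clean)'

Comparing the promote counter against the pool's total reads over an interval gives an approximate hit ratio, since every promotion corresponds to a miss in the cache tier.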
Re: [ceph-users] Bcache / Enhanceio with osds
- Original Message - > From: "Mark Nelson" > To: ceph-users@lists.ceph.com > Sent: Monday, 15 September, 2014 1:13:01 AM > Subject: Re: [ceph-users] Bcache / Enhanceio with osds > On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote: > > Hello guys, > > > > Was wondering if anyone uses or done some testing with using bcache > > or > > enhanceio caching in front of ceph osds? > > > > I've got a small cluster of 2 osd servers, 16 osds in total and 4 > > ssds > > for journals. I've recently purchased four additional ssds to be > > used > > for ceph cache pool, but i've found performance of guest vms to be > > slower with the cache pool for many benchmarks. The write > > performance > > has slightly improved, but the read performance has suffered a lot > > (as > > much as 60% in some tests). > > > > Therefore, I am planning to scrap the cache pool (at least until it > > matures) and use either bcache or enhanceio instead. > We're actually looking at dm-cache a bit right now. (and talking some > of > the developers about the challenges they are facing to help improve > our > own cache tiering) No meaningful benchmarks of dm-cache yet though. > Bcache, enhanceio, and flashcache all look interesting too. Regarding > the cache pool: we've got a couple of ideas that should help improve > performance, especially for reads. Mark, do you mind sharing these ideas with the rest of cephers? Can these ideas be implemented on the existing firefly install? > There are definitely advantages to > keeping cache local to the node though. I think some form of local > node > caching could be pretty useful going forward. What do you mean by the local to the node? Do you mean the use of cache disks on the hypervisor level? Or do you mean using cache ssd disks on the osd servers rather than creating a separate cache tier hardware? Thanks > > > > Thanks > > > > Andrei > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] XenServer and Ceph - any updates?
Hello guys, I was wondering if there have been any updates on getting XenServer ready for ceph? I've seen a howto that was written well over a year ago (I think) for a PoC integration of XenServer and Ceph. However, I've not seen any developments lately. It would be cool to see other hypervisors adopting Ceph )) Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bcache / Enhanceio with osds
I've done a bit of testing with Enhanceio on my cluster and I can see a definite improvement in read performance for cached data. The performance increase is around 3-4 times the cluster speed prior to using enhanceio, based on large block size IO (1M and 4M). I've done a concurrent test of running a single "dd if=/dev/vda of=/dev/null bs=1M/4M iflag=direct" instance over 20 vms which were running on 4 host servers. Prior to enhanceio I was getting around 30-35MB/s per guest vm regardless of how many times I ran the test. With enhanceio (from the second run) I was hitting over 130MB/s per vm. I've not seen any lag in the performance of other vms while using enhanceio, unlike the considerable lag without it. The ssd disk utilisation was not hitting much over 60%. The small block size (4K) performance hasn't changed with enhanceio, which made me think that the performance of the osds themselves is limited when using small block sizes. I wasn't getting much over 2-3MB/s per guest vm. On the contrary, when I tried to use the firefly cache pool on the same hardware, my cluster performed significantly slower with the cache pool. The whole cluster seemed under a lot more load, the performance dropped to around 12-15MB/s and other guest vms were very very slow. The ssd disks were utilised 100% all the time during the test, with the majority of the IO being writes. I admit that these tests shouldn't be considered definitive, thorough performance tests of a ceph cluster, as this is a live cluster with disk io activity outside of the test vms. The average load is not much (300-500 IO/s), mainly reads. However, it still indicates that there is room for improvement in ceph's cache pool implementation. Looking at my results, I think ceph is missing a lot of hits on the read cache, which causes the osds to write a lot of data. With enhanceio I was getting well over 50% read hit ratio, and the main activity on the ssds was read io, unlike with ceph. Outside of the tests, I've left enhanceio running on the osd servers. It has been a few days now and the hit ratio on the osds is around 8-11%, which seems a bit low. I was wondering if I should change the default block size of enhanceio to 2K instead of the default 4K. Taking into account ceph's object size of 4M, I am not sure if this will help the hit ratio. Does anyone have an idea? Andrei - Original Message - > From: "Mark Nelson" > To: "Robert LeBlanc" , "Mark Nelson" > > Cc: ceph-users@lists.ceph.com > Sent: Monday, 22 September, 2014 10:49:42 PM > Subject: Re: [ceph-users] Bcache / Enhanceio with osds > Likely it won't since the OSD is already coalescing journal writes. > FWIW, I ran through a bunch of tests using seekwatcher and blktrace > at > 4k, 128k, and 4m IO sizes on a 4 OSD cluster (3x replication) to get > a > feel for what the IO patterns are like for the dm-cache developers. I > included both the raw blktrace data and seekwatcher graphs here: > http://nhm.ceph.com/firefly_blktrace/ > there are some interesting patterns but they aren't too easy to spot > (I > don't know why the Chris decided to use blue and green by default!) > Mark > On 09/22/2014 04:32 PM, Robert LeBlanc wrote: > > We are still in the middle of testing things, but so far we have > > had > > more improvement with SSD journals than the OSD cached with bcache > > (five > > OSDs fronted by one SSD). We still have yet to test if adding a > > bcache > > layer in addition to the SSD journals provides any additional > > improvements.
> > > > Robert LeBlanc > > > > On Sun, Sep 14, 2014 at 6:13 PM, Mark Nelson > > > <mailto:mark.nel...@inktank.com>> wrote: > > > > On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote: > > > > Hello guys, > > > > Was wondering if anyone uses or done some testing with using > > bcache or > > enhanceio caching in front of ceph osds? > > > > I've got a small cluster of 2 osd servers, 16 osds in total and > > 4 ssds > > for journals. I've recently purchased four additional ssds to be > > used > > for ceph cache pool, but i've found performance of guest vms to be > > slower with the cache pool for many benchmarks. The write > > performance > > has slightly improved, but the read performance has suffered a > > lot (as > > much as 60% in some tests). > > > > Therefore, I am planning to scrap the cache pool (at least until it > > matures) and use either bcache or enhanceio instead. > > > > > > We're actually looking at dm-cache a bit right now. (and talking > > some of the developers about the challenges
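For reference, the hit ratio and block size mentioned above come from EnhanceIO's proc interface. A minimal sketch, assuming the cache was created with eio_cli under the name 'osd_cache' (a hypothetical name; the exact /proc layout may differ between EnhanceIO versions):

# current cache configuration, including block size and caching mode
cat /proc/enhanceio/osd_cache/config

# cumulative counters; reads/writes versus the hit counters give the hit ratio
cat /proc/enhanceio/osd_cache/stats

# error counters are worth keeping an eye on with a live cluster behind the cache
cat /proc/enhanceio/osd_cache/errors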
Re: [ceph-users] ceph backups
Luis, you may want to take a look at the rbd export/import and export-diff / import-diff functionality. This could be used to copy data to another cluster or offsite. S3 has regions, which you could use for async replication. Not sure how CephFS would work for backups. Andrei - Original Message - > From: "Luis Periquito" > To: ceph-users@lists.ceph.com > Sent: Tuesday, 23 September, 2014 11:28:39 AM > Subject: [ceph-users] ceph backups > Hi fellow cephers, > I'm being asked questions around our backup of ceph, mainly due to > data deletion. > We are currently using ceph to store RBD, S3 and eventually cephFS; > and we would like to be able to devise a plan to backup the > information as to avoid issues with data being deleted from the > cluster. > I know RBD has the snapshots, but how can they be automated? Can we > rely on them to perform data recovery? > And for S3/CephFS? Are there any backup methods? Other than copying > all the information into another location? > thanks, > Luis > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
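To make the export-diff suggestion concrete, here is a minimal sketch of a snapshot-based incremental copy of one RBD image to a second cluster. Pool, image and snapshot names are hypothetical, and it assumes the second cluster is reachable locally via /etc/ceph/backup.conf so that --cluster backup works:

# one-off: create a destination image of the same size and ship the initial state
rbd --cluster backup create rbd/vm-disk --size 102400
rbd snap create rbd/vm-disk@backup-1
rbd export-diff rbd/vm-disk@backup-1 - | rbd --cluster backup import-diff - rbd/vm-disk

# subsequent runs: ship only the blocks changed between the last two snapshots
rbd snap create rbd/vm-disk@backup-2
rbd export-diff --from-snap backup-1 rbd/vm-disk@backup-2 - | rbd --cluster backup import-diff - rbd/vm-disk
rbd snap rm rbd/vm-disk@backup-1

import-diff creates the matching snapshot on the destination, so the next incremental run has a consistent reference point; wrapping this in cron covers the automation part of the question.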
[ceph-users] features of the next stable release
Hi cephers, I've got three questions: 1. Does anyone have an estimate of the release date of the next stable ceph branch? 2. Will the new stable release have improvements in the following areas: a) working with ssd disks; b) the cache tier? 3. Will the new stable release introduce support for native RDMA / Infiniband networking without the need to use IP over Infiniband? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] features of the next stable release
- Original Message - > I'm not sure what you mean about improvements for SSD disks, but the > OSD should be generally a bit faster. There are several cache tier > improvements included that should improve performance on most > workloads that others can speak about in more detail than I. What I mean by the SSD disk improvement is that currently a cluster made entirely of ssd disks is pretty slow. You will not get the IO throughput of the SSDs. My tests show the limit seems to be around 3K IOPS, even though the ssds can easily do 50K+ IOPS. This makes it impossible to run a decent database workload on the ceph cluster. > -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
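As a concrete way to reproduce that 3K figure, a minimal sketch using rados bench against an SSD-backed pool (the pool name 'ssd' is hypothetical, and flags may differ slightly between releases); fio inside a guest against an RBD volume gives the application view, but this isolates the RADOS layer:

# 4K writes, 64 concurrent ops, keep the objects around for the read test
rados bench -p ssd 60 write -b 4096 -t 64 --no-cleanup

# 4K reads over the objects written above
rados bench -p ssd 60 rand -t 64

# remove the benchmark objects afterwards
rados -p ssd cleanup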
Re: [ceph-users] ceph Performance with SSD journal
Not sure what exact brands of samsung you have, but i've got the 840 Pro and it sucks big time. its is slow and unreliable and halts to a stand still over a period of time due to the trimming issue. Even after i've left unreserved like 50% of the disk. Unlike the Intel disks (even the consumer brand like 520 and 530 are just way better. I will stay away from any samsung drives in the future. Andrei - Original Message - > From: "Sumit Gaur" > To: "Irek Fasikhov" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 13 February, 2015 1:09:38 PM > Subject: Re: [ceph-users] ceph Performance with SSD journal > Hi Irek, > I am using v0.80.5 Firefly > -sumit > On Fri, Feb 13, 2015 at 1:30 PM, Irek Fasikhov < malm...@gmail.com > > wrote: > > Hi. > > > What version? > > > 2015-02-13 6:04 GMT+03:00 Sumit Gaur < sumitkg...@gmail.com > : > > > > Hi Chir, > > > > > > Please fidn my answer below in blue > > > > > > On Thu, Feb 12, 2015 at 12:42 PM, Chris Hoy Poy < ch...@gopc.net > > > > > > > wrote: > > > > > > > Hi Sumit, > > > > > > > > > > A couple questions: > > > > > > > > > > What brand/model SSD? > > > > > > > > > samsung 480G SSD(PM853T) having random write 90K IOPS (4K, > > > 368MBps) > > > > > > > What brand/model HDD? > > > > > > > > > 64GB memory, 300GB SAS HDD (seagate) , 10Gb nic > > > > > > > Also how they are connected to controller/motherboard? Are they > > > > sharing a bus (ie SATA expander)? > > > > > > > > > no , They are connected with local Bus not the SATA expander. > > > > > > > RAM? > > > > > > > > > 64GB > > > > > > > Also look at the output of "iostat -x" or similiar, are the > > > > SSDs > > > > hitting 100% utilisation? > > > > > > > > > No, SSD was hitting 2000 iops only. > > > > > > > I suspect that the 5:1 ratio of HDDs to SDDs is not ideal, you > > > > now > > > > have 5x the write IO trying to fit into a single SSD. > > > > > > > > > I have not seen any documented reference to calculate the ratio. > > > Could you suggest one. Here I want to mention that results for > > > 1024K > > > write improve a lot. Problem is with 1024K read and 4k write . > > > > > > SSD journal 810 IOPS and 810MBps > > > > > > HDD journal 620 IOPS and 620 MBps > > > > > > > I'll take a punt on it being a SATA connected SSD (most > > > > common), > > > > 5x > > > > ~130 megabytes/second gets very close to most SATA bus limits. > > > > If > > > > its a shared BUS, you possibly hit that limit even earlier > > > > (since > > > > all that data is now being written twice out over the bus). > > > > > > > > > > cheers; > > > > > > > > > > \Chris > > > > > > > > > > From: "Sumit Gaur" < sumitkg...@gmail.com > > > > > > > > > > > To: ceph-users@lists.ceph.com > > > > > > > > > > Sent: Thursday, 12 February, 2015 9:23:35 AM > > > > > > > > > > Subject: [ceph-users] ceph Performance with SSD journal > > > > > > > > > > Hi Ceph -Experts, > > > > > > > > > > Have a small ceph architecture related question > > > > > > > > > > As blogs and documents suggest that ceph perform much better if > > > > we > > > > use journal on SSD . > > > > > > > > > > I have made the ceph cluster with 30 HDD + 6 SSD for 6 OSD > > > > nodes. > > > > 5 > > > > HDD + 1 SSD on each node and each SSD have 5 partition for > > > > journaling 5 OSDs on the node. > > > > > > > > > > Now I ran similar test as I ran for all HDD setup. > > > > > > > > > > What I saw below two reading goes in wrong direction as > > > > expected > > > > > > > > > > 1) 4K write IOPS are less for SSD setup, though not major > > > > difference > > > > but less. 
> > > > > > > > > > 2) 1024K Read IOPS are less for SSD setup than HDD setup. > > > > > > > > > > On the other hand 4K read and 1024K write both have much better > > > > numbers for SSD setup. > > > > > > > > > > Let me know if I am missing some obvious concept. > > > > > > > > > > Thanks > > > > > > > > > > sumit > > > > > > > > > > ___ > > > > > > > > > > ceph-users mailing list > > > > > > > > > > ceph-users@lists.ceph.com > > > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > > ___ > > > > > > ceph-users mailing list > > > > > > ceph-users@lists.ceph.com > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > -- > > > С уважением, Фасихов Ирек Нургаязович > > > Моб.: +79229045757 > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
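One thing worth checking with a setup like this is how the journal SSD copes with the journal's actual write pattern, which is small synchronous writes rather than the large asynchronous writes quoted on spec sheets. A minimal sketch (the device name is hypothetical, and the dd test writes to the raw partition, so only run it against an unused one):

# journal-like workload: small direct, synchronous writes
dd if=/dev/zero of=/dev/sdX1 bs=4k count=10000 oflag=direct,dsync

# the same idea with fio, which also reports IOPS and latency percentiles
fio --name=journal-test --filename=/dev/sdX1 --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 --runtime=60 --time_based

Drives that look fast in ordinary benchmarks can drop to a few hundred IOPS under O_DSYNC, which would explain a 5:1 HDD-to-SSD ratio falling over.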
Re: [ceph-users] Introducing "Learning Ceph" : The First ever Book on Ceph
Yeah, guys, thanks! I've got it a few days ago and done a few chapters already. Well done! Andrei - Original Message - > From: "Wido den Hollander" > To: ceph-users@lists.ceph.com > Sent: Friday, 13 February, 2015 5:38:47 PM > Subject: Re: [ceph-users] Introducing "Learning Ceph" : The First > ever Book on Ceph > On 05-02-15 23:53, Karan Singh wrote: > > Hello Community Members > > > > I am happy to introduce the first book on Ceph with the title > > “*Learning > > Ceph*”. > > > > Me and many folks from the publishing house together with technical > > reviewers spent several months to get this book compiled and > > published. > > > > Finally the book is up for sale on , i hope you would like it and > > surely > > will learn a lot from it. > > > Great! Just ordered myself a copy! > > Amazon : > > http://www.amazon.com/Learning-Ceph-Karan-Singh/dp/1783985623/ref=sr_1_1?s=books&ie=UTF8&qid=1423174441&sr=1-1&keywords=ceph > > Packtpub : > > https://www.packtpub.com/application-development/learning-ceph > > > > You can grab the sample copy from here : > > https://www.dropbox.com/s/ek76r01r9prs6pb/Learning_Ceph_Packt.pdf?dl=0 > > > > *Finally , I would like to express my sincere thanks to * > > > > *Sage Weil* - For developing Ceph and everything around it as well > > as > > writing foreword for “Learning Ceph”. > > *Patrick McGarry *- For his usual off the track support that too > > always. > > > > Last but not the least , to our great community members , who are > > also > > reviewers of the book *Don Talton , Julien Recurt , Sebastien Han > > *and > > *Zihong Chen *, Thank you guys for your efforts. > > > > > > > > Karan Singh > > Systems Specialist , Storage Platforms > > CSC - IT Center for Science, > > Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland > > mobile: +358 503 812758 > > tel. +358 9 4572001 > > fax +358 9 4572302 > > http://www.csc.fi/ > > > > > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- > Wido den Hollander > 42on B.V. > Phone: +31 (0)20 700 9902 > Skype: contact42on > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Mark, many thanks for your effort and the ceph performance tests. This puts things in perspective. Looking at the results, I was a bit concerned that the IOPS performance in neither release comes even marginally close to the capabilities of the underlying ssd device. Even the fastest PCI ssds have only managed to achieve about 1/6th of the IOPS of the raw device. I guess there is a great deal more optimisation to be done in the upcoming LTS releases to bring the IOPS rate close to the raw device performance. I have done some testing in the past and noticed that despite the server having a lot of unused resources (about 40-50% server idle and about 60-70% ssd idle), ceph would not perform well when used with ssds. I was testing with Firefly + auth and my IOPS rate was around the 3K mark. Something is holding ceph back from performing well with ssds ((( Andrei - Original Message - > From: "Mark Nelson" > To: "ceph-devel" > Cc: ceph-users@lists.ceph.com > Sent: Tuesday, 17 February, 2015 5:37:01 PM > Subject: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore > performance comparison > Hi All, > I wrote up a short document describing some tests I ran recently to > look > at how SSD backed OSD performance has changed across our LTS > releases. > This is just looking at RADOS performance and not RBD or RGW. It also > doesn't offer any real explanations regarding the results. It's just > a > first high level step toward understanding some of the behaviors > folks > on the mailing list have reported over the last couple of releases. I > hope you find it useful. > Mark > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
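When chasing this kind of gap it also helps to separate the OSD's internal write path from the client path. A minimal sketch; osd bench accepts optional total-bytes and block-size arguments, but recent releases cap them, so the defaults (roughly 1 GB in 4 MB writes) are used here:

# synthetic write test executed inside osd.0, bypassing the client and network stack
ceph tell osd.0 bench

# commit/apply latency per OSD while a client benchmark is running
ceph osd perf

If osd bench gets close to raw-device speed while client IOPS stay around the 3K mark, the bottleneck is in the messenger and PG layers rather than in the disks themselves.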
Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
Martin, I have been using Samsung 840 Pro for journals about 2 years now and have just replaced all my samsung drives with Intel. We have found a lot of performance issues with 840 Pro (we are using 128mb). In particular, a very strange behaviour with using 4 partitions (with 50% underprovisioning left as empty unpartitioned space on the drive) where the drive would grind to almost a halt after a few weeks of use. I was getting 100% utilisation on the drives doing just 3-4MB/s writes. This was not the case when I've installed the new drives. Manual Trimming helps for a few weeks until the same happens again. This has been happening with all 840 Pro ssds that we have and contacting Samsung Support has proven to be utterly useless. They do not want to speak with you until you install windows and run their monkey utility ((. Also, i've noticed the latencies of the Samsung 840 Pro ssd drives to be about 15-20 slower compared with a consumer grade Intel drives, like Intel 520. According to ceph osd pef, I would consistently get higher figures on the osds with Samsung journal drive compared with the Intel drive on the same server. Something like 2-3ms for Intel vs 40-50ms for Samsungs. At some point we had enough with Samsungs and scrapped them. Andrei - Original Message - > From: "Martin B Nielsen" > To: "Philippe Schwarz" > Cc: ceph-users@lists.ceph.com > Sent: Saturday, 28 February, 2015 11:51:57 AM > Subject: Re: [ceph-users] Extreme slowness in SSD cluster with 3 > nodes and 9 OSD with 3.16-3 kernel > Hi, > I cannot recognize that picture; we've been using samsumg 840 pro in > production for almost 2 years now - and have had 1 fail. > We run a 8node mixed ssd/platter cluster with 4x samsung 840 pro > (500gb) in each so that is 32x ssd. > They've written ~25TB data in avg each. > Using the dd you had inside an existing semi-busy mysql-guest I get: > 10240 bytes (102 MB) copied, 5.58218 s, 18.3 MB/s > Which is still not a lot, but I think it is more a limitation of our > setup/load. > We are using dumpling. > All that aside, I would prob. go with something tried and tested if I > was to redo it today - we haven't had any issues, but it is still > nice to use something you know should have a baseline performance > and can compare to that. > Cheers, > Martin > On Sat, Feb 28, 2015 at 12:32 PM, Philippe Schwarz < > p...@schwarz-fr.net > wrote: > > -BEGIN PGP SIGNED MESSAGE- > > > Hash: SHA1 > > > Le 28/02/2015 12:19, mad Engineer a écrit : > > > > Hello All, > > > > > > > > I am trying ceph-firefly 0.80.8 > > > > (69eaad7f8308f21573c604f121956e64679a52a7) with 9 OSD ,all > > > Samsung > > > > SSD 850 EVO on 3 servers with 24 G RAM,16 cores @2.27 Ghz Ubuntu > > > > 14.04 LTS with 3.16-3 kernel.All are connected to 10G ports with > > > > maximum MTU.There are no extra disks for journaling and also > > > there > > > > are no separate network for replication and data transfer.All 3 > > > > nodes are also hosting monitoring process.Operating system runs > > > on > > > > SATA disk. > > > > > > > > When doing a sequential benchmark using "dd" on RBD, mounted on > > > > client as ext4 its taking 110s to write 100Mb data at an average > > > > speed of 926Kbps. 
> > > > > > > > time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct > > > > 25000+0 records in 25000+0 records out 10240 bytes (102 MB) > > > > copied, 110.582 s, 926 kB/s > > > > > > > > real 1m50.585s user 0m0.106s sys 0m2.233s > > > > > > > > While doing this directly on ssd mount point shows: > > > > > > > > time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct > > > > 25000+0 records in 25000+0 records out 10240 bytes (102 MB) > > > > copied, 1.38567 s, 73.9 MB/s > > > > > > > > OSDs are in XFS with these extra arguments : > > > > > > > > rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M > > > > > > > > ceph.conf > > > > > > > > [global] fsid = 7d889081-7826-439c-9fe5-d4e57480d9be > > > > mon_initial_members = ceph1, ceph2, ceph3 mon_host = > > > > 10.99.10.118,10.99.10.119,10.99.10.120 auth_cluster_required = > > > > cephx auth_service_required = cephx auth_client_required = cephx > > > > filestore_xattr_use_omap = true osd_pool_default_size = 2 > > > > osd_pool_default_min_size = 2 osd_pool_default_pg_num = 450 > > > > osd_pool_default_pgp_num = 450 max_open_files = 131072 > > > > > > > > [osd] osd_mkfs_type = xfs osd_op_threads = 8 osd_disk_threads = 4 > > > > osd_mount_options_xfs = > > > > "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M" > > > > > > > > > > > > on our traditional storage with Full SAS disk, same "dd" > > > completes > > > > in 16s with an average write speed of 6Mbps. > > > > > > > > Rados bench: > > > > > > > > rados bench -p rbd 10 write Maintaining 16 concurrent writes of > > > > 4194304 bytes for up to 10 seconds or 0 objects Object prefix: > > > > benchmark_data_ceph1_2977 sec Cur
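A quick way to tell whether a particular journal SSD is dragging its OSDs down is to compare per-OSD latencies with the drive's own utilisation and health figures. A minimal sketch (device name hypothetical):

# commit/apply latency per OSD; OSDs journalling to a struggling SSD stand out here
ceph osd perf

# utilisation and await of the journal device while a benchmark is running
iostat -x 5

# wear and media health of the SSD itself
smartctl -a /dev/sdX | egrep -i 'wear|percent|reallocated|media'

This is the comparison behind the 2-3ms versus 40-50ms figures mentioned above.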
Re: [ceph-users] SSD selection
I would not use a single ssd for 5 osds. I would recommend the 3-4 osds max per ssd or you will get the bottleneck on the ssd side. I've had a reasonable experience with Intel 520 ssds (which are not produced anymore). I've found Samsung 840 Pro to be horrible! Otherwise, it seems that everyone here recommends the DC3500 or DC3700 and it has the best wear per $ ratio out of all the drives. Andrei - Original Message - > From: "Tony Harris" > To: "Christian Balzer" > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 1 March, 2015 4:19:30 PM > Subject: Re: [ceph-users] SSD selection > Well, although I have 7 now per node, you make a good point and I'm > in a position where I can either increase to 8 and split 4/4 and > have 2 ssds, or reduce to 5 and use a single osd per node (the > system is not in production yet). > Do all the DC lines have caps in them or just the DC s line? > -Tony > On Sat, Feb 28, 2015 at 11:21 PM, Christian Balzer < ch...@gol.com > > wrote: > > On Sat, 28 Feb 2015 20:42:35 -0600 Tony Harris wrote: > > > > Hi all, > > > > > > > > I have a small cluster together and it's running fairly well (3 > > > nodes, 21 > > > > osds). I'm looking to improve the write performance a bit though, > > > which > > > > I was hoping that using SSDs for journals would do. But, I was > > > wondering > > > > what people had as recommendations for SSDs to act as journal > > > drives. > > > > If I read the docs on ceph.com correctly, I'll need 2 ssds per > > > node > > > > (with 7 drives in each node, I think the recommendation was 1ssd > > > per 4-5 > > > > drives?) so I'm looking for drives that will work well without > > > breaking > > > > the bank for where I work (I'll probably have to purchase them > > > myself > > > > and donate, so my budget is somewhat small). Any suggestions? I'd > > > > prefer one that can finish its write in a power outage case, the > > > only > > > > one I know of off hand is the intel dcs3700 I think, but at $300 > > > it's > > > > WAY above my affordability range. > > > Firstly, an uneven number of OSDs (HDDs) per node will bite you in > > the > > > proverbial behind down the road when combined with journal SSDs, as > > one of > > > those SSDs will wear our faster than the other. > > > Secondly, how many SSDs you need is basically a trade-off between > > price, > > > performance, endurance and limiting failure impact. > > > I have cluster where I used 4 100GB DC S3700s with 8 HDD OSDs, > > optimizing > > > the write paths and IOPS and failure domain, but not the sequential > > speed > > > or cost. > > > Depending on what your write load is and the expected lifetime of > > this > > > cluster, you might be able to get away with DC S3500s or even > > better > > the > > > new DC S3610s. > > > Keep in mind that buying a cheap, low endurance SSD now might cost > > you > > > more down the road if you have to replace it after a year (TBW/$). > > > All the cheap alternatives to DC level SSDs tend to wear out too > > fast, > > > have no powercaps and tend to have unpredictable (caused by garbage > > > collection) and steadily decreasing performance. > > > Christian > > > -- > > > Christian Balzer Network/Systems Engineer > > > ch...@gol.com Global OnLine Japan/Fusion Communications > > > http://www.gol.com/ > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD selection
I am not sure about the enterprise grade and underprovisioning, but for the Intel 520s i've got 240gbs (the speeds of 240 is a bit better than 120s). and i've left 50% underprovisioned. I've got 10GB for journals and I am using 4 osds per ssd. Andrei - Original Message - > From: "Tony Harris" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com, "Christian Balzer" > Sent: Sunday, 1 March, 2015 8:49:56 PM > Subject: Re: [ceph-users] SSD selection > Ok, any size suggestion? Can I get a 120 and be ok? I see I can get > DCS3500 120GB for within $120/drive so it's possible to get 6 of > them... > -Tony > On Sun, Mar 1, 2015 at 12:46 PM, Andrei Mikhailovsky < > and...@arhont.com > wrote: > > I would not use a single ssd for 5 osds. I would recommend the 3-4 > > osds max per ssd or you will get the bottleneck on the ssd side. > > > I've had a reasonable experience with Intel 520 ssds (which are not > > produced anymore). I've found Samsung 840 Pro to be horrible! > > > Otherwise, it seems that everyone here recommends the DC3500 or > > DC3700 and it has the best wear per $ ratio out of all the drives. > > > Andrei > > > > From: "Tony Harris" < neth...@gmail.com > > > > > > > To: "Christian Balzer" < ch...@gol.com > > > > > > > Cc: ceph-users@lists.ceph.com > > > > > > Sent: Sunday, 1 March, 2015 4:19:30 PM > > > > > > Subject: Re: [ceph-users] SSD selection > > > > > > Well, although I have 7 now per node, you make a good point and > > > I'm > > > in a position where I can either increase to 8 and split 4/4 and > > > have 2 ssds, or reduce to 5 and use a single osd per node (the > > > system is not in production yet). > > > > > > Do all the DC lines have caps in them or just the DC s line? > > > > > > -Tony > > > > > > On Sat, Feb 28, 2015 at 11:21 PM, Christian Balzer < > > > ch...@gol.com > > > > > > > wrote: > > > > > > > On Sat, 28 Feb 2015 20:42:35 -0600 Tony Harris wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > I have a small cluster together and it's running fairly well > > > > > (3 > > > > > nodes, 21 > > > > > > > > > > > osds). I'm looking to improve the write performance a bit > > > > > though, > > > > > which > > > > > > > > > > > I was hoping that using SSDs for journals would do. But, I > > > > > was > > > > > wondering > > > > > > > > > > > what people had as recommendations for SSDs to act as journal > > > > > drives. > > > > > > > > > > > If I read the docs on ceph.com correctly, I'll need 2 ssds > > > > > per > > > > > node > > > > > > > > > > > (with 7 drives in each node, I think the recommendation was > > > > > 1ssd > > > > > per 4-5 > > > > > > > > > > > drives?) so I'm looking for drives that will work well > > > > > without > > > > > breaking > > > > > > > > > > > the bank for where I work (I'll probably have to purchase > > > > > them > > > > > myself > > > > > > > > > > > and donate, so my budget is somewhat small). Any suggestions? > > > > > I'd > > > > > > > > > > > prefer one that can finish its write in a power outage case, > > > > > the > > > > > only > > > > > > > > > > > one I know of off hand is the intel dcs3700 I think, but at > > > > > $300 > > > > > it's > > > > > > > > > > > WAY above my affordability range. > > > > > > > > > > Firstly, an uneven number of OSDs (HDDs) per node will bite you > > > > in > > > > the > > > > > > > > > > proverbial behind down the road when combined with journal > > > > SSDs, > > > > as > > > > one of > > > > > > > > > > those SSDs will wear our f
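To make the 4-osds-per-ssd layout concrete, here is a minimal sketch of carving four 10GB journal partitions out of one SSD and pointing the OSDs at them. Device names and partition labels are hypothetical, and on an existing OSD the journal has to be flushed and recreated rather than simply repointed:

# four 10 GB journal partitions on the SSD (hypothetical /dev/sdc)
parted -s /dev/sdc mklabel gpt
parted -s /dev/sdc mkpart journal-0 1MiB 10GiB
parted -s /dev/sdc mkpart journal-1 10GiB 20GiB
parted -s /dev/sdc mkpart journal-2 20GiB 30GiB
parted -s /dev/sdc mkpart journal-3 30GiB 40GiB

# ceph.conf fragment
[osd]
osd journal size = 10240

[osd.0]
osd journal = /dev/disk/by-partlabel/journal-0

# moving an existing OSD's journal: stop the OSD first, then
ceph-osd -i 0 --flush-journal
ceph-osd -i 0 --mkjournal
# and start the OSD again

The shell commands and the ceph.conf fragment are shown in one listing only for brevity; the [osd] sections go into /etc/ceph/ceph.conf on the OSD host.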
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
In a long term use I also had some issues with flashcache and enhanceio. I've noticed frequent slow requests. Andrei - Original Message - > From: "Robert LeBlanc" > To: "Nick Fisk" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 20 March, 2015 8:14:16 PM > Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid > We tested bcache and abandoned it for two reasons. > 1. Didn't give us any better performance than journals on SSD. > 2. We had lots of corruption of the OSDs and were rebuilding them > frequently. > Since removing them, the OSDs have been much more stable. > On Fri, Mar 20, 2015 at 4:03 AM, Nick Fisk < n...@fisk.me.uk > wrote: > > > -Original Message- > > > > From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On > > > Behalf Of > > > > Burkhard Linke > > > > Sent: 20 March 2015 09:09 > > > > To: ceph-users@lists.ceph.com > > > > Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition > > > uuid > > > > > > > > Hi, > > > > > > > > On 03/19/2015 10:41 PM, Nick Fisk wrote: > > > > > I'm looking at trialling OSD's with a small flashcache device > > > > over > > > > > them to hopefully reduce the impact of metadata updates when > > > > doing > > > > small block io. > > > > > Inspiration from here:- > > > > > > > > > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083 > > > > > > > > > > One thing I suspect will happen, is that when the OSD node > > > > starts > > > > up > > > > > udev could possibly mount the base OSD partition instead of > > > > > flashcached device, as the base disk will have the ceph > > > > partition > > > > uuid > > > > > type. This could result in quite nasty corruption. > > > > I ran into this problem with an enhanceio based cache for one of > > > our > > > > database servers. > > > > > > > > I think you can prevent this problem by using bcache, which is > > > also > > > integrated > > > > into the official kernel tree. It does not act as a drop in > > > replacement, > > > but > > > > creates a new device that is only available if the cache is > > > initialized > > > correctly. A > > > > GPT partion table on the bcache device should be enough to allow > > > the > > > > standard udev rules to kick in. > > > > > > > > I haven't used bcache in this scenario yet, and I cannot comment > > > on > > > its > > > speed > > > > and reliability compared to other solutions. But from the > > > operational > > > point of > > > > view it is "safer" than enhanceio/flashcache. > > > I did look at bcache, but there are a lot of worrying messages on > > the > > > mailing list about hangs and panics that has discouraged me > > slightly > > from > > > it. I do think it is probably the best solution, but I'm not > > convinced about > > > the stability. > > > > > > > > Best regards, > > > > Burkhard > > > > ___ > > > > ceph-users mailing list > > > > ceph-users@lists.ceph.com > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Preliminary RDMA vs TCP numbers
Hi, Am I the only person noticing disappointing results from the preliminary RDMA testing, or am I reading the numbers wrong? Yes, it's true that on a very small cluster you do see a great improvement with rdma, but in real life rdma is used in large infrastructure projects, not on a few servers with a handful of osds. In fact, from what I've seen in the slides, the rdma implementation scales horribly, to the point that it becomes slower the more osds you throw at it. From my limited knowledge, I would have expected much higher performance gains with rdma, taking into account that you should have much lower latency, overhead and cpu utilisation when using this transport in comparison with tcp. Are we likely to see a great deal of improvement with ceph and rdma in the near future? Is there a roadmap for having stable and reliable rdma protocol support? Thanks Andrei - Original Message - > From: "Andrey Korolyov" > To: "Somnath Roy" > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > Sent: Wednesday, 8 April, 2015 9:28:12 AM > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy > wrote: > > > > Hi, > > Please find the preliminary performance numbers of TCP Vs RDMA > > (XIO) implementation (on top of SSDs) in the following link. > > > > http://www.slideshare.net/somnathroy7568/ceph-on-rdma > > > > The attachment didn't go through it seems, so, I had to use > > slideshare. > > > > Mark, > > If we have time, I can present it in tomorrow's performance > > meeting. > > > > Thanks & Regards > > Somnath > > > Those numbers are really impressive (for small numbers at least)! > What > are TCP settings you using?For example, difference can be lowered on > scale due to less intensive per-connection acceleration on CUBIC on a > larger number of nodes, though I do not believe that it was a main > reason for an observed TCP catchup on a relatively flat workload such > as fio generates. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Preliminary RDMA vs TCP numbers
Somnath, Sounds very promising! I can't wait to try it on my cluster as I am currently using IPOIB instread of the native rdma. Cheers Andrei - Original Message - > From: "Somnath Roy" > To: "Andrei Mikhailovsky" , "Andrey Korolyov" > > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > Sent: Wednesday, 8 April, 2015 5:23:23 PM > Subject: RE: [ceph-users] Preliminary RDMA vs TCP numbers > Andrei, > Yes, I see it has lot of potential and I believe fixing the > performance bottlenecks inside XIO messenger it should go further. > We are working on it and will keep community posted.. > Thanks & Regards > Somnath > From: Andrei Mikhailovsky [mailto:and...@arhont.com] > Sent: Wednesday, April 08, 2015 2:22 AM > To: Andrey Korolyov > Cc: ceph-users@lists.ceph.com; ceph-devel; Somnath Roy > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > Hi, > Am I the only person noticing disappointing results from the > preliminary RDMA testing, or am I reading the numbers wrong? > Yes, it's true that on a very small cluster you do see a great > improvement in rdma, but in real life rdma is used in large > infrastructure projects, not on a few servers with a handful of > osds. In fact, from what i've seen from the slides, the rdma > implementation scales horribly to the point that it becomes slower > the more osds you through at it. > From my limited knowledge, i have expected a much higher performance > gains with rdma, taking into account that you should have much lower > latency and overhead and lower cpu utilisation when using this > transport in comparison with tcp. > Are we likely to see a great deal of improvement with ceph and rdma > in a near future? Is there a roadmap for having a stable and > reliable rdma protocol support? > Thanks > Andrei > - Original Message - > > From: "Andrey Korolyov" < and...@xdel.ru > > > > To: "Somnath Roy" < somnath@sandisk.com > > > > Cc: ceph-users@lists.ceph.com , "ceph-devel" < > > ceph-de...@vger.kernel.org > > > > Sent: Wednesday, 8 April, 2015 9:28:12 AM > > > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > > > On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy < > > somnath@sandisk.com > wrote: > > > > > > > > Hi, > > > > Please find the preliminary performance numbers of TCP Vs RDMA > > > (XIO) implementation (on top of SSDs) in the following link. > > > > > > > > http://www.slideshare.net/somnathroy7568/ceph-on-rdma > > > > > > > > The attachment didn't go through it seems, so, I had to use > > > slideshare. > > > > > > > > Mark, > > > > If we have time, I can present it in tomorrow's performance > > > meeting. > > > > > > > > Thanks & Regards > > > > Somnath > > > > > > > Those numbers are really impressive (for small numbers at least)! > > What > > > are TCP settings you using?For example, difference can be lowered > > on > > > scale due to less intensive per-connection acceleration on CUBIC on > > a > > > larger number of nodes, though I do not believe that it was a main > > > reason for an observed TCP catchup on a relatively flat workload > > such > > > as fio generates. > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > PLEASE NOTE: The information contained in this electronic mail > message is intended only for the use of the designated recipient(s) > named above. 
If the reader of this message is not the intended > recipient, you are hereby notified that you have received this > message in error and that any review, dissemination, distribution, > or copying of this message is strictly prohibited. If you have > received this communication in error, please notify the sender by > telephone or e-mail (as shown above) immediately and destroy any and > all copies of this message in your possession (whether hard copies > or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Preliminary RDMA vs TCP numbers
Mike, yeah, I wouldn't switch to rdma until it is fully supported in a stable release ))) Andrei - Original Message - > From: "Andrei Mikhailovsky" > To: "Somnath Roy" > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > Sent: Wednesday, 8 April, 2015 7:16:40 PM > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > Somnath, > Sounds very promising! I can't wait to try it on my cluster as I am > currently using IPOIB instread of the native rdma. > Cheers > Andrei > - Original Message - > > From: "Somnath Roy" > > > To: "Andrei Mikhailovsky" , "Andrey Korolyov" > > > > > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > > > > Sent: Wednesday, 8 April, 2015 5:23:23 PM > > > Subject: RE: [ceph-users] Preliminary RDMA vs TCP numbers > > > Andrei, > > > Yes, I see it has lot of potential and I believe fixing the > > performance bottlenecks inside XIO messenger it should go further. > > > We are working on it and will keep community posted.. > > > Thanks & Regards > > > Somnath > > > From: Andrei Mikhailovsky [mailto:and...@arhont.com] > > > Sent: Wednesday, April 08, 2015 2:22 AM > > > To: Andrey Korolyov > > > Cc: ceph-users@lists.ceph.com; ceph-devel; Somnath Roy > > > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > > > Hi, > > > Am I the only person noticing disappointing results from the > > preliminary RDMA testing, or am I reading the numbers wrong? > > > Yes, it's true that on a very small cluster you do see a great > > improvement in rdma, but in real life rdma is used in large > > infrastructure projects, not on a few servers with a handful of > > osds. In fact, from what i've seen from the slides, the rdma > > implementation scales horribly to the point that it becomes slower > > the more osds you through at it. > > > From my limited knowledge, i have expected a much higher > > performance > > gains with rdma, taking into account that you should have much > > lower > > latency and overhead and lower cpu utilisation when using this > > transport in comparison with tcp. > > > Are we likely to see a great deal of improvement with ceph and rdma > > in a near future? Is there a roadmap for having a stable and > > reliable rdma protocol support? > > > Thanks > > > Andrei > > > - Original Message - > > > > From: "Andrey Korolyov" < and...@xdel.ru > > > > > > > To: "Somnath Roy" < somnath@sandisk.com > > > > > > > Cc: ceph-users@lists.ceph.com , "ceph-devel" < > > > ceph-de...@vger.kernel.org > > > > > > > Sent: Wednesday, 8 April, 2015 9:28:12 AM > > > > > > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > > > > > > On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy < > > > somnath@sandisk.com > wrote: > > > > > > > > > > > > > > Hi, > > > > > > > Please find the preliminary performance numbers of TCP Vs RDMA > > > > (XIO) implementation (on top of SSDs) in the following link. > > > > > > > > > > > > > > http://www.slideshare.net/somnathroy7568/ceph-on-rdma > > > > > > > > > > > > > > The attachment didn't go through it seems, so, I had to use > > > > slideshare. > > > > > > > > > > > > > > Mark, > > > > > > > If we have time, I can present it in tomorrow's performance > > > > meeting. > > > > > > > > > > > > > > Thanks & Regards > > > > > > > Somnath > > > > > > > > > > > > > Those numbers are really impressive (for small numbers at least)! 
> > > What > > > > > > are TCP settings you using?For example, difference can be lowered > > > on > > > > > > scale due to less intensive per-connection acceleration on CUBIC > > > on > > > a > > > > > > larger number of nodes, though I do not believe that it was a > > > main > > > > > > reason for an observed TCP catchup on a relatively flat workload > > > such > > > > > > as fio generates. > > > > > >
Re: [ceph-users] deep scrubbing causes osd down
Hi JC, I am running ceph 0.87.1 on Ubuntu 12.04 LTS server with latest patches. I am however running kernel version 3.19.3 and not the stock distro one. I am running cfq on all spindles and noop on all ssds (used for journals). I've not done any scrub specific options, but will try and see if it makes a difference. Thanks for your feedback Andrei - Original Message - > From: "LOPEZ Jean-Charles" > To: "Andrei Mikhailovsky" > Cc: "LOPEZ Jean-Charles" , > ceph-users@lists.ceph.com > Sent: Saturday, 11 April, 2015 7:54:18 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi Andrei, > 1) what ceph version are you running? > 2) what distro and version are you running? > 3) have you checked the disk elevator for the OSD devices to be set > to cfq? > 4) Have have you considered exploring the following parameters to > further tune > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > - osd_deep_scrub_stride If you have lowered parameters above, you can > play with this one to fit best your physical disk behaviour. - > osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. > = 0.5 to start with a half second delay > Cheers > JC > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < and...@arhont.com > > > wrote: > > > Hi guys, > > > I was wondering if anyone noticed that the deep scrubbing process > > causes some osd to go down? > > > I have been keeping an eye on a few remaining stability issues in > > my > > test cluster. One of the unsolved issues is the occasional > > reporting > > of osd(s) going down and coming back up after about 20-30 seconds. > > This happens to various osds throughout the cluster. I have a small > > cluster of just 2 osd servers with 9 osds each. > > > The common trend that i see week after week is that whenever there > > is > > a long deep scrubbing activity on the cluster it triggers one or > > more osds to go down for a short period of time. After the osd is > > marked down, it goes back up after about 20 seconds. Obviously > > there > > is a repair process that kicks in which causes more load on the > > cluster. While looking at the logs, i've not seen the osds being > > marked down when the cluster is not deep scrubbing. It _always_ > > happens when there is a deep scrub activity. I am seeing the > > reports > > of osds going down about 3-4 times a week. 
> > > The latest happened just recently with the following log entries: > > > 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : > > cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 > > active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB > > / 50206 GB avail; 1005 B/s rd, 1005 > > > B/s wr, 0 op/s > > > 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : > > cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 reports > > from > > 5 peers after 60.747890 >= grace 46.701350) > > > 2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 3441544 : > > cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in > > > 2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 3441545 : > > cluster [INF] pgmap v50849467: 8508 pgs: 599 stale+active+clean, > > 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 > > active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB > > / 50206 GB avail; 375 B/s rd, 0 op/s > > > osd.6 logs around the same time are: > > > 2015-04-10 19:16:29.110617 7fad6d5ec700 0 log_channel(default) log > > [INF] : 5.3d7 deep-scrub ok > > > 2015-04-10 19:27:47.561389 7fad6bde9700 0 log_channel(default) log > > [INF] : 5.276 deep-scrub ok > > > 2015-04-10 19:31:11.611321 7fad6d5ec700 0 log_channel(default) log > > [INF] : 5.287 deep-scrub ok > > > 2015-04-10 19:31:53.339881 7fad7ce0b700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 > > > 2015-04-10 19:31:53.339887 7fad7ce0b700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 > > > 2015-04-10 19:31:53.339890 7fad7ce0b700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15 > > > 2015-04-10 19:31:53.340050 7fad7e60e700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 > > > 2015-04-10 19:31:53.340053 7fad7e60e700 1 heartbeat_map is_health
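For reference, the settings suggested above translate into a ceph.conf fragment along these lines. It is only a sketch of the values discussed in this thread, not a tested recommendation, and the same values can be pushed to running OSDs with injectargs while experimenting:

[osd]
osd scrub chunk min = 1
osd scrub chunk max = 5
osd scrub sleep = 0.5
# osd deep scrub stride defaults to 524288; tune it to the disks' read size if needed
osd max scrubs = 1

# runtime equivalent, without restarting the daemons
ceph tell 'osd.*' injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_sleep 0.5'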
Re: [ceph-users] deep scrubbing causes osd down
JC, I've implemented the following changes to the ceph.conf and restarted mons and osds. osd_scrub_chunk_min = 1 osd_scrub_chunk_max =5 Things have become considerably worse after the changes. Shortly after doing that, majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. i have tried restarting all osds, but shortly after the restart the cpu usage goes up. The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for > 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9 2015-04-12 08:40:43.570004 7f96dd693700 0 cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000 [In total i've got around 40,000 slow request entries accumulated overnight ((( ] On top of that, I have reports of osds going down and back up as frequently as every 10-20 minutes. This effects all osds and not a particular set of osds. I will restart the osd servers to see if it makes a difference, otherwise, I will need to revert back to the default settings as the cluster as it currently is is not functional. Andrei - Original Message - > From: "LOPEZ Jean-Charles" > To: "Andrei Mikhailovsky" > Cc: "LOPEZ Jean-Charles" , > ceph-users@lists.ceph.com > Sent: Saturday, 11 April, 2015 7:54:18 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi Andrei, > 1) what ceph version are you running? > 2) what distro and version are you running? > 3) have you checked the disk elevator for the OSD devices to be set > to cfq? > 4) Have have you considered exploring the following parameters to > further tune > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > - osd_deep_scrub_stride If you have lowered parameters above, you can > play with this one to fit best your physical disk behaviour. - > osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. > = 0.5 to start with a half second delay > Cheers > JC > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < and...@arhont.com > > > wrote: > > > Hi guys, > > > I was wondering if anyone noticed that the deep scrubbing process > > causes some osd to go down? 
> > > I have been keeping an eye on a few remaining stability issues in > > my > > test cluster. One of the unsolved issues is the occasional > > reporting > > of osd(s) going down and coming back up after about 20-30 seconds. > > This happens to various osds throughout the cluster. I have a small > > cluster of just 2 osd servers with 9 osds each. > > > The common trend that i see week after week is that whenever there > > is > > a long deep scrubbing activity on the cluster it triggers one or > > more osds to go down for a short period of time. After the osd is > > marked down, it goes back up after about 20 seconds. Obviously > > there > > is a repair process that kicks in which causes more load on the > > cluster. While looking at the logs, i've not seen the osds being > > marked down when the cluster is not deep scrubbing. It _always_ > > happens when there is a deep scrub activity. I am seeing the > > reports > > of osds going down about 3-4 times a week. > > > The latest happened just recently with the following log entries: > > > 2015-04-
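When backing the change out it is not necessary to restart every daemon: the previous defaults can be injected back at runtime, and the admin socket shows what the blocked requests are actually waiting on. A minimal sketch (5 and 25 are the stock defaults mentioned earlier in the thread; osd.6 is simply the OSD from the earlier log excerpt):

# restore the default chunk settings without restarting the OSDs
ceph tell 'osd.*' injectargs '--osd_scrub_chunk_min 5 --osd_scrub_chunk_max 25'

# see what the slow requests on a flapping OSD are waiting for
ceph daemon osd.6 dump_ops_in_flight
ceph daemon osd.6 dump_historic_ops

# cluster-wide view of blocked requests and down OSDs
ceph health detail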
Re: [ceph-users] deep scrubbing causes osd down
JC, the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't not seen a single osd disconnect. Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want to have more than 1 or 2 scrub/deep-scrubs running at the same time on my cluster. How do I implement this? Thanks Andrei - Original Message - > From: "Andrei Mikhailovsky" > To: "LOPEZ Jean-Charles" > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 12 April, 2015 9:02:05 AM > Subject: Re: [ceph-users] deep scrubbing causes osd down > JC, > I've implemented the following changes to the ceph.conf and restarted > mons and osds. > osd_scrub_chunk_min = 1 > osd_scrub_chunk_max =5 > Things have become considerably worse after the changes. Shortly > after doing that, majority of osd processes started taking up over > 100% cpu and the cluster has considerably slowed down. All my vms > are reporting high IO wait (between 30-80%), even vms which are > pretty idle and don't do much. > i have tried restarting all osds, but shortly after the restart the > cpu usage goes up. The osds are showing the following logs: > 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log > [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 > 08:38:28.576168: osd_op(client.69637439.0:290325926 > rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size > 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > waiting for missing object > 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log > [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 > 08:38:28.606815: osd_op(client.69637439.0:290325927 > rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size > 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > waiting for missing object > 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log > [WRN] : 7 slow requests, 1 included below; oldest blocked for > > 68.278951 secs > 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log > [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 > 08:39:06.586669: osd_op(client.64965167.0:1607510 > rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size > 4194304 write_size 4194304,write 3584000~69632] 5.30418007 > ack+ondisk+write+known_if_redirected e74834) currently waiting for > subops from 9 > 2015-04-12 08:40:43.570004 7f96dd693700 0 > cls/rgw/cls_rgw.cc:1458: gc_iterate_entries > end_key=1_01428824443.569998000 > [In total i've got around 40,000 slow request entries accumulated > overnight ((( ] > On top of that, I have reports of osds going down and back up as > frequently as every 10-20 minutes. This effects all osds and not a > particular set of osds. > I will restart the osd servers to see if it makes a difference, > otherwise, I will need to revert back to the default settings as the > cluster as it currently is is not functional. > Andrei > - Original Message - > > From: "LOPEZ Jean-Charles" > > > To: "Andrei Mikhailovsky" > > > Cc: "LOPEZ Jean-Charles" , > > ceph-users@lists.ceph.com > > > Sent: Saturday, 11 April, 2015 7:54:18 PM > > > Subject: Re: [ceph-users] deep scrubbing causes osd down > > > Hi Andrei, > > > 1) what ceph version are you running? > > > 2) what distro and version are you running? 
> > > 3) have you checked the disk elevator for the OSD devices to be set > > to cfq? > > > 4) Have have you considered exploring the following parameters to > > further tune > > > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > > > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > > > - osd_deep_scrub_stride If you have lowered parameters above, you > > can > > play with this one to fit best your physical disk behaviour. - > > osd_scrub_sleep introduce a half second sleep between 2 scrubs; > > e.g. > > = 0.5 to start with a half second delay > > > Cheers > > > JC > > > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < and...@arhont.com > > > > > > > wrote: > > > > > > Hi guys, > > > > > > I was wondering if anyone noticed that the deep scrubbing process > > > causes some osd to go down? > > > > > > I have been kee
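For reference, the tuning knobs JC lists above can be inspected and changed on a live cluster without restarting anything; only settings written to ceph.conf survive a restart. A rough sketch (osd.0, the default admin socket and the values shown are just examples taken from this thread):

# Check the current values on one OSD through its admin socket:
ceph daemon osd.0 config get osd_scrub_chunk_min
ceph daemon osd.0 config get osd_scrub_chunk_max
ceph daemon osd.0 config get osd_scrub_sleep

# Inject new values into all running OSDs (not persistent across restarts):
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_sleep 0.5'

# To keep the settings, add the same options to the [osd] section of
# ceph.conf and restart the daemons.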
Re: [ceph-users] deep scrubbing causes osd down
JC, Thanks I think the max scrub option that you refer to is a value per osd and not per cluster. So, the default is not to run more than 1 scrub per osd. So, if you have 100 osds by default it will not run more than 100 scurb processes at the same time. However, I want to limit the number on a cluster basis rather than on an osd basis. Andrei - Original Message - > From: "Jean-Charles Lopez" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 12 April, 2015 5:17:10 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi andrei > There is one parameter, osd_max_scrub I think, that controls the > number of scrubs per OSD. But the default is 1 if I'm correct. > Can you check on one of your OSDs with the admin socket? > Then it remains the option of scheduling the deep scrubs via a cron > job after setting nodeep-scrub to prevent automatic deep scrubbing. > Dan Van Der Ster had a post on this ML regarding this. > JC > While moving. Excuse unintended typos. > On Apr 12, 2015, at 05:21, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > JC, > > > the restart of the osd servers seems to have stabilised the > > cluster. > > It has been a few hours since the restart and I haven't not seen a > > single osd disconnect. > > > Is there a way to limit the total number of scrub and/or deep-scrub > > processes running at the same time? For instance, I do not want to > > have more than 1 or 2 scrub/deep-scrubs running at the same time on > > my cluster. How do I implement this? > > > Thanks > > > Andrei > > > - Original Message - > > > > From: "Andrei Mikhailovsky" < and...@arhont.com > > > > > > > To: "LOPEZ Jean-Charles" < jelo...@redhat.com > > > > > > > Cc: ceph-users@lists.ceph.com > > > > > > Sent: Sunday, 12 April, 2015 9:02:05 AM > > > > > > Subject: Re: [ceph-users] deep scrubbing causes osd down > > > > > > JC, > > > > > > I've implemented the following changes to the ceph.conf and > > > restarted > > > mons and osds. > > > > > > osd_scrub_chunk_min = 1 > > > > > > osd_scrub_chunk_max =5 > > > > > > Things have become considerably worse after the changes. Shortly > > > after doing that, majority of osd processes started taking up > > > over > > > 100% cpu and the cluster has considerably slowed down. All my vms > > > are reporting high IO wait (between 30-80%), even vms which are > > > pretty idle and don't do much. > > > > > > i have tried restarting all osds, but shortly after the restart > > > the > > > cpu usage goes up. 
The osds are showing the following logs: > > > > > > 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 60.277590 seconds old, received at > > > 2015-04-12 > > > 08:38:28.576168: osd_op(client.69637439.0:290325926 > > > rbd_data.265f967a5f7514.4a00 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc > > > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > > > waiting for missing object > > > > > > 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 60.246943 seconds old, received at > > > 2015-04-12 > > > 08:38:28.606815: osd_op(client.69637439.0:290325927 > > > rbd_data.265f967a5f7514.4a00 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc > > > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > > > waiting for missing object > > > > > > 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : 7 slow requests, 1 included below; oldest blocked for > > > > 68.278951 secs > > > > > > 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 30.268450 seconds old, received at > > > 2015-04-12 > > > 08:39:06.586669: osd_op(client.64965167.0:1607510 > > > rbd_data.1f264b2ae8944a.0228 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 3584000~69632] 5.30418007 > > > ack+ondisk+write+known_if_redirected e74834) curre
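As far as I know there is no single cluster-wide limit on concurrent scrubs in this release; osd_max_scrubs is enforced per OSD. The usual workaround, along the lines of the Dan van der Ster post JC mentions, is to disable automatic deep scrubbing and drive it yourself in small batches. A minimal sketch (the PG ids are placeholders; pick the ones with the oldest deep-scrub stamps from 'ceph pg dump'):

# Confirm the per-OSD limit JC refers to (the option name is osd_max_scrubs):
ceph daemon osd.0 config get osd_max_scrubs

# Stop Ceph from scheduling deep scrubs on its own:
ceph osd set nodeep-scrub

# Trigger them manually instead, e.g. one or two PGs per cron run:
ceph pg deep-scrub 5.30
ceph pg deep-scrub 5.31

# Re-enable automatic deep scrubbing when finished:
ceph osd unset nodeep-scrub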
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Hi I have been testing the Samsung 840 Pro (128gb) for quite sometime and I can also confirm that this drive is unsuitable for osd journal. The performance and latency that I get from these drives (according to ceph osd perf) are between 10 - 15 times slower compared to the Intel 520. The Intel 530 drives are also pretty awful. They are meant to be a replacement of the 520 drives, but the performance is pretty bad. I have found Intel 520 to be a reasonable drive for performance per price, for a cluster without a great deal of writes. However they do not make those anymore. Otherwise, it seems that the Intel 3600 and 3700 series is a good performer and has a much longer life expectancy. Andrei - Original Message - > From: "Eneko Lacunza" > To: "J-P Methot" , "Christian Balzer" > , ceph-users@lists.ceph.com > Sent: Tuesday, 21 April, 2015 8:18:20 AM > Subject: Re: [ceph-users] Possible improvements for a slow write > speed (excluding independent SSD journals) > Hi, > I'm just writing to you to stress out what others have already said, > because it is very important that you take it very seriously. > On 20/04/15 19:17, J-P Methot wrote: > > On 4/20/2015 11:01 AM, Christian Balzer wrote: > >> > >>> This is similar to another thread running right now, but since > >>> our > >>> current setup is completely different from the one described in > >>> the > >>> other thread, I thought it may be better to start a new one. > >>> > >>> We are running Ceph Firefly 0.80.8 (soon to be upgraded to > >>> 0.80.9). We > >>> have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs). Each > >>> OSD > >>> is a > >>> Samsung SSD 840 EVO on which I can reach write speeds of roughly > >>> 400 > >>> MB/sec, plugged in jbod on a controller that can theoretically > >>> transfer > >>> at 6gb/sec. All of that is linked to openstack compute nodes on > >>> two > >>> bonded 10gbps links (so a max transfer rate of 20 gbps). > >>> > >> I sure as hell hope you're not planning to write all that much to > >> this > >> cluster. > >> But then again you're worried about write speed, so I guess you > >> do. > >> Those _consumer_ SSDs will be dropping like flies, there are a > >> number of > >> threads about them here. > >> > >> They also might be of the kind that don't play well with O_DSYNC, > >> I > >> can't > >> recall for sure right now, check the archives. > >> Consumer SSDs universally tend to slow down quite a bit when not > >> TRIM'ed > >> and/or subjected to prolonged writes, like those generated by a > >> benchmark. > > I see, yes it looks like these SSDs are not the best for the job. > > We > > will not change them for now, but if they start failing, we will > > replace them with better ones. > I tried to put a Samsung 840 Pro 256GB in a ceph setup. It is > supposed > to be quite better than the EVO right? It was total crap. No "not the > best for the job". TOTAL CRAP. :) > It can't give any useful write performance for a Ceph OSD. Spec sheet > numbers don't matter for this, they don't work for ceph OSD, period. > And > yes, the drive is fine and works like a charm in workstation > workloads. > I suggest you at least get some intel S3700/S3610 and use them for > the > journal of those samsung drives, I think that could help performance > a lot. > Cheers > Eneko > -- > Zuzendari Teknikoa / Director Técnico > Binovo IT Human Project, S.L. > Telf. 943575997 > 943493611 > Astigarraga bidea 2, planta 6 dcha., ofi. 
3-2; 20180 Oiartzun > (Gipuzkoa) > www.binovo.es > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
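For anyone wanting to reproduce the comparison above, the per-OSD latency figures Andrei refers to come straight from the cluster; journals sitting on slow SSDs usually stand out immediately:

# Commit/apply latency per OSD, in milliseconds:
ceph osd perf

# Refresh it every couple of seconds while a benchmark is running:
watch -n 2 ceph osd perf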
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Anthony, I doubt the manufacturer reported 315MB/s for 4K block size. Most likely they've used 1M or 4M as the block size to achieve the 300MB/s+ speeds Andrei - Original Message - > From: "Alexandre DERUMIER" > To: "Anthony Levesque" > Cc: "ceph-users" > Sent: Saturday, 25 April, 2015 5:32:30 PM > Subject: Re: [ceph-users] Possible improvements for a slow write > speed (excluding independent SSD journals) > I'm able to reach around 2-25000iops with 4k block with s3500 > (with o_dsync) (so yes, around 80-100MB/S). > I'l bench new s3610 soon to compare. > - Mail original - > De: "Anthony Levesque" > À: "Christian Balzer" > Cc: "ceph-users" > Envoyé: Vendredi 24 Avril 2015 22:00:44 > Objet: Re: [ceph-users] Possible improvements for a slow write speed > (excluding independent SSD journals) > Hi Christian, > We tested some DC S3500 300GB using dd if=randfile of=/dev/sda bs=4k > count=10 oflag=direct,dsync > we got 96 MB/s which is far from the 315 MB/s from the website. > Can I ask you or anyone on the mailing list how you are testing the > write speed for journals? > Thanks > --- > Anthony Lévesque > GloboTech Communications > Phone: 1-514-907-0050 x 208 > Toll Free: 1-(888)-GTCOMM1 x 208 > Phone Urgency: 1-(514) 907-0047 > 1-(866)-500-1555 > Fax: 1-(514)-907-0750 > aleves...@gtcomm.net > http://www.gtcomm.net > On Apr 23, 2015, at 9:05 PM, Christian Balzer < ch...@gol.com > > wrote: > Hello, > On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote: > BQ_BEGIN > To update you on the current test in our lab: > 1.We tested the Samsung OSD in Recovery mode and the speed was able > to > maxout 2x 10GbE port(transferring data at 2200+ MB/s during > recovery). > So for normal write operation without O_DSYNC writes Samsung drives > seem > ok. > 2.We then tested a couple of different model of SSD we had in stock > with > the following command: > dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync > This was from a blog written by Sebastien Han and I think should be > able > to show how the drives would perform in O_DSYNC writes. For people > interested in some result of what we tested here they are: > Intel DC S3500 120GB = 114 MB/s > Samsung Pro 128GB = 2.4 MB/s > WD Black 1TB (HDD) = 409 KB/s > Intel 330 120GB = 105 MB/s > Intel 520 120GB = 9.4 MB/s > Intel 335 80GB = 9.4 MB/s > Samsung EVO 1TB = 2.5 MB/s > Intel 320 120GB = 78 MB/s > OCZ Revo Drive 240GB = 60.8 MB/s > 4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s > No real surprises here, but a nice summary nonetheless. > You _really_ want to avoid consumer SSDs for journals and have a good > idea > on how much data you'll write per day and how long you expect your > SSDs to > last (the TBW/$ ratio). > BQ_BEGIN > Please let us know if the command we ran was not optimal to test > O_DSYNC > writes > We order larger drive from Intel DC series to see if we could get > more > than 200 MB/s per SSD. We will keep you posted on tests if that > interested you guys. We dint test multiple parallel test yet (to > simulate multiple journal on one SSD). > BQ_END > You can totally trust the numbers on Intel's site: > http://ark.intel.com/products/family/83425/Data-Center-SSDs > The S3500s are by far the slowest and have the lowest endurance. > Again, depending on your expected write level the S3610 or S3700 > models > are going to be a better fit regarding price/performance. > Especially when you consider that loosing a journal SSD will result > in > several dead OSDs. 
> BQ_BEGIN > 3.We remove the Journal from all Samsung OSD and put 2x Intel 330 > 120GB > on all 6 Node to test. The overall speed we were getting from the > rados > bench went from 1000 MB/s(approx.) to 450 MB/s which might only be > because the intel cannot do too much in term of journaling (was > tested > at around 100 MB/s). It will be interesting to test with bigger Intel > DC S3500 drives(and more journals) per node to see if I can back up > to > 1000MB/s or even surpass it. > We also wanted to test if the CPU could be a huge bottle neck so we > swap > the Dual E5-2620v2 from node #6 and replace them with Dual > E5-2609v2(Which are much smaller in core and speed) and the 450 MB/s > we > got from he rados bench went even lower to 180 MB/s. > BQ_END > You really don't have to swap CPUs around, monitor things with atop > or > other tools to see where your bottlenecks are. > BQ_BEGIN > So Im wondering if the 1000MB/s we got when the Journal was shared on > the OSD SSD was not limited by the CPUs (even though the samsung are > not > good for journals on the long run) and not just by the fact Samsung > SSD > are bad in O_DSYNC writes(or maybe both). It is probable that 16 SSD > OSD per node in a full SSD cluster is too much and the major > bottleneck > will be from the CPU. > BQ_END > That's what I kept saying. ^.^ > BQ_BEGIN > 4.Im wondering if we find good SSD for the journal and keep the > samsung > for normal writes and read(
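An alternative to the dd one-liner used in this thread is fio with synchronous 4k writes, which also makes it easy to approximate several journals sharing one SSD by raising numjobs. /dev/sdX is a placeholder and the test overwrites whatever is on that device:

# One journal writer:
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

# Roughly simulate four journals on the same SSD:
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=4 --iodepth=1 \
    --runtime=60 --time_based --group_reporting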
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Piotr, You may also investigate if the cache tier made of a couple of ssds could help you. Not sure how the data is used in your company, but if you have a bunch of hot data that moves around from one vm to another it might greatly speed up the rsync. On the other hand, if a lot of rsync data is cold, it might have an adverse effect on performance. As a test, you could try to create a small pool with a couple of ssds in a cache tier on top of your spinning osds. You don't need to purchase tons of ssds in advance. As a test case, I would suggest 2-4 ssds in a cache tier should be okay for the PoC. Andrei - Original Message - From: "Nick Fisk" To: "Piotr Wachowicz" Cc: ceph-users@lists.ceph.com Sent: Friday, 1 May, 2015 10:42:12 AM Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? Yeah, that’s your problem, doing a single thread rsync when you have quite poor write latency will not be quick. SSD journals should give you a fair performance boost, otherwise you need to coalesce the writes at the client so that Ceph is given bigger IOs at higher queue depths. RBD Cache can help here as well as potentially FS tuning to buffer more aggressively. If writeback RBD cache is enabled, data will be buffered by RBD until a sync is called by the client, so data loss can occur during this period if the app is not issuing fsyncs properly. Once a sync is called data is flushed to the journals and then later to the actual OSD store. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz Sent: 01 May 2015 10:14 To: Nick Fisk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? Thanks for your answer, Nick. Typically it's a single rsync session at a time (sometimes two, but rarely more concurrently). So it's a single ~5GB typical linux filesystem from one random VM to another random VM. Apart from using RBD Cache, is there any other way to improve the overall performance of such a use case in a Ceph cluster? In theory I guess we could always tarball it, and rsync the tarball, thus effectively using sequential IO rather than random. But that's simply not feasible for us at the moment. Any other ways? Sidequestion: does using RBDCache impact the way data is stored on the client? (e.g. a write call returning after data has been written to Journal (fast) vs written all the way to the OSD data store(slow)). I'm guessing it's always the first one, regardless of whether client uses RBDCache or not, right? My logic here is that otherwise that would imply that clients can impact the way OSDs behave, which could be dangerous in some situations. Kind Regards, Piotr On Fri, May 1, 2015 at 10:59 AM, Nick Fisk < n...@fisk.me.uk > wrote: How many Rsync’s are doing at a time? If it is only a couple, you will not be able to take advantage of the full number of OSD’s, as each block of data is only located on 1 OSD (not including replicas). When you look at disk statistics you are seeing an average over time, so it will look like the OSD’s are not very busy, when in fact each one is busy for a very brief period. SSD journals will help your write latency, probably going down from around 15-30ms to under 5ms From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf Of Piotr Wachowicz Sent: 01 May 2015 09:31 To: ceph-users@lists.ceph.com Subject: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? 
Is there any way to confirm (beforehand) that using SSDs for journals will help? We're seeing very disappointing Ceph performance. We have 10GigE interconnect (as a shared public/internal network). We're wondering whether it makes sense to buy SSDs and put journals on them. But we're looking for a way to verify that this will actually help BEFORE we splash cash on SSDs. The problem is that the way we have things configured now, with journals on spinning HDDs (shared with OSDs as the backend storage), apart from slow read/write performance to Ceph I already mention, we're also seeing fairly low disk utilization on OSDs. This low disk utilization suggests that journals are not really used to their max, which begs for the questions whether buying SSDs for journals will help. This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we cannot really confirm that. Our typical data access use case is a lot of small random read/writes. We're doing a lot of rsyncing (entire regular linux filesystems) from one VM to another. We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't really help all that much. So, is there any way to confirm beforehand that using SSDs for journals will help in our case? Kind Regards
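Picking up the cache-tier suggestion at the top of this thread, a small proof of concept only needs a handful of commands once the SSD OSDs are in place. The pool names below are placeholders, and the cache pool is assumed to be mapped onto the SSDs through its own CRUSH rule:

# Attach an SSD pool as a writeback cache in front of the existing pool:
ceph osd tier add rbd-pool ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay rbd-pool ssd-cache

# A hit set and a size limit are needed for sensible eviction behaviour:
ceph osd pool set ssd-cache hit_set_type bloom
ceph osd pool set ssd-cache hit_set_count 1
ceph osd pool set ssd-cache hit_set_period 3600
ceph osd pool set ssd-cache target_max_bytes 200000000000

# To back the PoC out again:
ceph osd tier cache-mode ssd-cache forward
rados -p ssd-cache cache-flush-evict-all
ceph osd tier remove-overlay rbd-pool
ceph osd tier remove rbd-pool ssd-cache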
Re: [ceph-users] rbd performance issue - can't find bottleneck
Hi guys, I also use a combination of intel 520 and 530 for my journals and have noticed that the latency and the speed of 520s is better than 530s. Could someone please confirm that doing the following at start up will stop the dsync on the relevant drives? # echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type Do I need to patch my kernel for this or is this already implementable in vanilla? I am running 3.19.x branch from ubuntu testing repo. Would the above change the performance of 530s to be more like 520s? Cheers Andrei - Original Message - > From: "Alexandre DERUMIER" > To: "Jacek Jarosiewicz" > Cc: "ceph-users" > Sent: Thursday, 18 June, 2015 11:54:42 AM > Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > > Hi, > > for read benchmark > > with fio, what is the iodepth ? > > my fio 4k randr results with > > iodepth=1 : bw=6795.1KB/s, iops=1698 > iodepth=2 : bw=14608KB/s, iops=3652 > iodepth=4 : bw=32686KB/s, iops=8171 > iodepth=8 : bw=76175KB/s, iops=19043 > iodepth=16 :bw=173651KB/s, iops=43412 > iodepth=32 :bw=336719KB/s, iops=84179 > > (This should be similar with rados bench -t (threads) option). > > This is normal because of network latencies + ceph latencies. > Doing more parallism increase iops. > > (doing a bench with "dd" = iodepth=1) > > Theses result are with 1 client/rbd volume. > > > now with more fio client (numjobs=X) > > I can reach up to 300kiops with 8-10 clients. > > > This should be the same with lauching multiple rados bench in parallel > > (BTW, it could be great to have an option in rados bench to do it) > > > - Mail original - > De: "Jacek Jarosiewicz" > À: "Mark Nelson" , "ceph-users" > > Envoyé: Jeudi 18 Juin 2015 11:49:11 > Objet: Re: [ceph-users] rbd performance issue - can't find bottleneck > > On 06/17/2015 04:19 PM, Mark Nelson wrote: > >> SSD's are INTEL SSDSC2BW240A4 > > > > Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see > > this thread by Stefan Priebe: > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html > > > > In fact it was the difference in Intel 520 and Intel 530 performance > > that triggered many of the different investigations that have taken > > place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The > > gist of it is that the 520 is very fast but probably not safe. The 530 > > is safe but not fast. The DC S3700 (and similar drives with super > > capacitors) are thought to be both fast and safe (though some drives > > like the crucual M500 and later misrepresented their power loss > > protection so you have to be very careful!) > > > > Yes, these are Intel 530. > I did the tests described in the thread You pasted and unfortunately > that's my case... I think. 
> > The dd run locally on a mounted ssd partition looks like this: > > [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1 > oflag=direct,dsync > 1+0 records in > 1+0 records out > 358400 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s > > and when I skip the flag dsync it goes fast: > > [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1 > oflag=direct > 1+0 records in > 1+0 records out > 358400 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s > > (I used the same 350k block size as mentioned in the e-mail from the > thread above) > > I tried disabling the dsync like this: > > [root@cf02 ~]# echo temporary write through > > /sys/class/scsi_disk/1\:0\:0\:0/cache_type > > [root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type > write through > > ..and then locally I see the speedup: > > [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1 > oflag=direct,dsync > 1+0 records in > 1+0 records out > 358400 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s > > > ..but when I test it from a client I still get slow results: > > root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct > 100+0 records in > 100+0 records out > 1048576 bytes (10 GB) copied, 122.482 s, 85.6 MB/s > > and fio gives the same 2-3k iops. > > after the change to SSD cache_type I tried remounting the test image, > recreating it and so on - nothing helped. > > I ran rbd bench-write on it, and it's not good either: > > root@cf03:~# rbd bench-write t2 > bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq > SEC OPS OPS/SEC BYTES/SEC > 1 4221 4220.64 32195919.35 > 2 9628 4813.95 36286083.00 > 3 15288 4790.90 35714620.49 > 4 19610 4902.47 36626193.93 > 5 24844 4968.37 37296562.14 > 6 30488 5081.31 38112444.88 > 7 36152 5164.54 38601615.10 > 8 41479 5184.80 38860207.38 > 9 46971 5218.70 39181437.52 > 10 52219 5221.77 39322641.34 > 11 5 5151.36 38761566.30 > 12 62073 5172.71 38855021.35 > 13 65962 5073.95 38182880.49 > 14 71541 5110.02 38431536.17 > 15 77039 5135.85 38615125.42 > 16 82133 5133.31 38692578.98 > 17 87657 5156.24 38849948.84 > 18 92943 5141.03 38
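A quick way to see what every disk in a node currently advertises, before and after applying the echo above (this only reads sysfs; the "temporary write through" setting does not survive a reboot, and note the data-integrity warnings in the replies further down):

for f in /sys/class/scsi_disk/*/cache_type; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done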
Re: [ceph-users] rbd performance issue - can't find bottleneck
Mark, Thanks, I do understand that there is a risk of data loss by doing this. Having said this, ceph is designed to be fault tollerant and self repairing should something happen to individual journals, osds and server nodes. Isn't this a still good measure to compromise between data integrity and speed? So, by faking dsync and not actually doing this, you have a window of opportunity to data loss should a failure happen between the last flash and the moment of failure. Thus, if the ssd disk failure happens, regardless if dsync is used or not, would ceph still consider the osds behind the journal to be unavailable/lost and migrate the data around anyway and perform the necessary checks to make sure the data integrity is not compromised? If this is true, I would still consider using the dsync bypass in favour of the extra speed benefit. Unless I am missing a bigger picture and miscalculated something. Could someone please elaborate on this a bit further to understand the realy world threat of not using the dsync bypass? Cheers Andrei - Original Message - > From: "Mark Nelson" > To: ceph-users@lists.ceph.com > Sent: Friday, 19 June, 2015 3:59:55 PM > Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > > > > On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote: > > Hi guys, > > > > I also use a combination of intel 520 and 530 for my journals and have > > noticed that the latency and the speed of 520s is better than 530s. > > > > Could someone please confirm that doing the following at start up will stop > > the dsync on the relevant drives? > > > > # echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type > > > > Do I need to patch my kernel for this or is this already implementable in > > vanilla? I am running 3.19.x branch from ubuntu testing repo. > > > > Would the above change the performance of 530s to be more like 520s? > > I need to comment that it's *really* not a good idea to do this if you > care about data integrity. There's a reason why the 530 is slower than > the 520. If you need speed and you care about your data, you should > really consider jumping up to the DC S3700. > > There's a possibility that the 730 *may* be ok as it supposedly has > power loss protection, but it's still not using HET MLC so the flash > cells will wear out faster. It's also a consumer grade drive, so no one > will give you support for this kind of use case if you have problems. > > Mark > > > > > Cheers > > > > Andrei > > > > > > > > - Original Message - > >> From: "Alexandre DERUMIER" > >> To: "Jacek Jarosiewicz" > >> Cc: "ceph-users" > >> Sent: Thursday, 18 June, 2015 11:54:42 AM > >> Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > >> > >> Hi, > >> > >> for read benchmark > >> > >> with fio, what is the iodepth ? > >> > >> my fio 4k randr results with > >> > >> iodepth=1 : bw=6795.1KB/s, iops=1698 > >> iodepth=2 : bw=14608KB/s, iops=3652 > >> iodepth=4 : bw=32686KB/s, iops=8171 > >> iodepth=8 : bw=76175KB/s, iops=19043 > >> iodepth=16 :bw=173651KB/s, iops=43412 > >> iodepth=32 :bw=336719KB/s, iops=84179 > >> > >> (This should be similar with rados bench -t (threads) option). > >> > >> This is normal because of network latencies + ceph latencies. > >> Doing more parallism increase iops. > >> > >> (doing a bench with "dd" = iodepth=1) > >> > >> Theses result are with 1 client/rbd volume. > >> > >> > >> now with more fio client (numjobs=X) > >> > >> I can reach up to 300kiops with 8-10 clients. 
> >> > >> > >> This should be the same with lauching multiple rados bench in parallel > >> > >> (BTW, it could be great to have an option in rados bench to do it) > >> > >> > >> - Mail original - > >> De: "Jacek Jarosiewicz" > >> À: "Mark Nelson" , "ceph-users" > >> > >> Envoyé: Jeudi 18 Juin 2015 11:49:11 > >> Objet: Re: [ceph-users] rbd performance issue - can't find bottleneck > >> > >> On 06/17/2015 04:19 PM, Mark Nelson wrote: > >>>> SSD's are INTEL SSDSC2BW240A4 > >>> > >>> Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see > >>> this thread by Stefan Priebe: &g
Re: [ceph-users] rbd performance issue - can't find bottleneck
Mark, thanks for putting it down this way. It does make sense. Does it mean that having the Intel 520s, which bypass the dsync is theat to the data stored on the journals? I do have a few of these installed, alongside with 530s. I did not plan to replace them just yet. Would it make more sense to get a small battery protected raid card in front of the 520s and 530s to protect against these types of scenarios? Cheers - Original Message - > From: "Mark Nelson" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 19 June, 2015 5:08:31 PM > Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > > On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote: > > Mark, > > > > Thanks, I do understand that there is a risk of data loss by doing this. > > Having said this, ceph is designed to be fault tollerant and self > > repairing should something happen to individual journals, osds and server > > nodes. Isn't this a still good measure to compromise between data > > integrity and speed? So, by faking dsync and not actually doing this, you > > have a window of opportunity to data loss should a failure happen between > > the last flash and the moment of failure. > > > > Thus, if the ssd disk failure happens, regardless if dsync is used or not, > > would ceph still consider the osds behind the journal to be > > unavailable/lost and migrate the data around anyway and perform the > > necessary checks to make sure the data integrity is not compromised? If > > this is true, I would still consider using the dsync bypass in favour of > > the extra speed benefit. Unless I am missing a bigger picture and > > miscalculated something. > > > > Could someone please elaborate on this a bit further to understand the > > realy world threat of not using the dsync bypass? > > Hi Andrei, > > Basically the entire point of the Ceph journal is to guarantee that data > hits a persistent medium before the write gets acknowledged. Imagine a > scenario where you lose power just as the write happens. > > Scenario A: You have proper O_DSYNC writes. In this case, assuming the > SSD is behaving properly, you can be fairly confident that the write to > the local journal succeeded (or not). > > Scenario B: You bypass O_DSYNC. The journal write "completes" quickly, > but it's not actually written out to flash, just to the drive cache. If > the SSD has power loss protection it can theoretically write that data > out to the flash before it losses power. For this reason, drives with > PLP can often perform O_DSYNC writes very quickly even without this hack > (ie it can ignore ATA_CMD_FLUSH). > > For a drive like the 530 without PLP, there's no guarantee that the data > in cache will hit the flash. Ceph will *think* it did though, and the > risk is worse because the write "completes" so fast. Now you have a > scenario where ceph thinks something exists but it really doesn't (or > exists in a corrupted state). This leads to all sorts of problems. If > another OSD goes down and you have two copies of the data that disagree > with each other, what do you do? What if not all of the replica writes > succeeded but you have a copy of the data on the primary? Can you trust > it? Everything starts breaking down. 
> > Mark > > > > > Cheers > > > > Andrei > > > > > > - Original Message - > >> From: "Mark Nelson" > >> To: ceph-users@lists.ceph.com > >> Sent: Friday, 19 June, 2015 3:59:55 PM > >> Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > >> > >> > >> > >> On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote: > >>> Hi guys, > >>> > >>> I also use a combination of intel 520 and 530 for my journals and have > >>> noticed that the latency and the speed of 520s is better than 530s. > >>> > >>> Could someone please confirm that doing the following at start up will > >>> stop > >>> the dsync on the relevant drives? > >>> > >>> # echo temporary write through > > >>> /sys/class/scsi_disk/1\:0\:0\:0/cache_type > >>> > >>> Do I need to patch my kernel for this or is this already implementable in > >>> vanilla? I am running 3.19.x branch from ubuntu testing repo. > >>> > >>> Would the above change the performance of 530s to be more like 520s? > >> > >> I need to comment that it's *really* not a good idea to do this if you >
[ceph-users] latest Hammer for Ubuntu precise
Hi, I seem to be missing the latest Hammer release 0.94.2 in the repo for Ubuntu precise. I can see the packages for trusty, but precise still shows 0.94.1. Is this an omission, or did you stop supporting precise? Or has something odd happened with my precise servers? Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
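On a precise host the quickest sanity check is to ask apt what the configured ceph.com repository actually offers after a refresh; if the Candidate line still says 0.94.1, the repository index itself is behind rather than the local machines:

apt-get update
apt-cache policy ceph
# The 'Candidate:' line shows the newest version the configured repo provides.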
Re: [ceph-users] latest Hammer for Ubuntu precise
Thanks Mate, I was under the same impression. Could someone at Inktank please help us with this problem? Is this intentional or has it simply been an error? Thanks Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Gabri Mate" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Monday, 22 June, 2015 6:28:11 PM Subject: Re: [ceph-users] latest Hammer for Ubuntu precise As far as I see the packages are there but the Packages file wasn't updated (correctly?) that's why we, Precise users do not see the updates. I am still wondering whether this is intentional or not. Probably not. :) Hopefully it will be sorted out soon. Mate On 00:14 Mon 22 Jun , Andrei Mikhailovsky wrote: > Hi, > > I seem to be missing the latest Hammer release 0.94.2 in the repo for Ubuntu > precise. I can see the packages for trusty, but precise still shows 0.94.1. > Is there a miss or did you stop supporting precise? Or perhaps something is > odd happened with my precise servers? > > Cheers > > Andrei > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph and EnhanceIO cache
Hi Nick, I've played with Flashcache and EnhanceIO, but I've decided not to use it for production in the end. The reason was that using both has increased the amount of slow requests that I had on the cluster and I have also noticed somewhat higher level of iowait on the vms. At that time, I didn't have much time to investigate the slow requests issue and I wasn't sure what exactly is causing them. All I can say is that after disabling the caching the slow requests have stopped. Perhaps others could share if they had any issues. THanks - Original Message - > From: "Nick Fisk" > To: "Dominik Zalewski" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 26 June, 2015 11:12:25 AM > Subject: Re: [ceph-users] Ceph and EnhanceIO cache > I think flashcache bombs out, I must admit I have tested that yet, but as I > would only be running it in writecache mode, there is no requirement I can > think of for it to keep on running gracefully. > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Dominik Zalewski > Sent: 26 June 2015 10:54 > To: Nick Fisk > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Ceph and EnhanceIO cache > Thanks for your reply. > Do you know by any chance how flashcache handles SSD going offline? > Here is an snip from enhanceio wiki page: > Failure of an SSD device in read-only and write-through modes is > handled gracefully by allowing I/O to continue to/from the > source volume. An application may notice a drop in performance but it > will not receive any I/O errors. > Failure of an SSD device in write-back mode obviously results in the > loss of dirty blocks in the cache. To guard against this data loss, two > SSD devices can be mirrored via RAID 1. > EnhanceIO identifies device failures based on error codes. Depending on > whether the failure is likely to be intermittent or permanent, it takes > the best suited action. > Looking at mailing list and github commits, both flashcache and enhanceio had > not much going on since last year. > Dominik > On Fri, Jun 26, 2015 at 10:28 AM, Nick Fisk < n...@fisk.me.uk > wrote: > > > -Original Message- > > > > From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf > > > Of > > > > Dominik Zalewski > > > > Sent: 26 June 2015 09:59 > > > > To: ceph-users@lists.ceph.com > > > > Subject: [ceph-users] Ceph and EnhanceIO cache > > > > > > > > Hi, > > > > > > > > I came across this blog post mentioning using EnhanceIO (fork of > > > flashcache) > > > > as cache for OSDs. > > > > > > > > http://xo4t.mj.am/link/xo4t/jgu895v/1/DnECCniu-HWfTojpLN1IqA/aHR0cDovL3d3dy5zZWJhc3RpZW4taGFuLmZyL2Jsb2cvMjAxNC8xMC8wNi9jZXBoLWFuZC1lbmhhbmNlaW8v > > > > > > > > http://xo4t.mj.am/link/xo4t/jgu895v/2/FTxs29ShRIqNOekTAo2wKw/aHR0cHM6Ly9naXRodWIuY29tL3N0ZWMtaW5jL0VuaGFuY2VJTw > > > > > > > > I'm planning to test it with 5x 1TB HGST Travelstar 7k1000 2.5inch OSDs > > > and > > > > using 256GB Transcend SSD as enhanceio cache. > > > > > > > > I'm wondering if anyone is using EnhanceIO or others like bcache, > > > dm-cache > > > > with Ceph in production and what is your experience/results. > > > Not using in production, but have been testing all of the above both > > caching > > the OSD and RBD's. > > > If your RBD's are being used in scenarios where small sync writes are > > important (iscsi,database's) then caching the RBD's is almost essential. My > > findings:- > > > FlashCache - Probably the best of the bunch. 
Has sequential override and > > hopefully facebook will continue to maintain it > > > EnhanceIO - Nice that you can hot add the cache, but is likely to no longer > > be maintained, so risky for production > > > DMCache - Well maintained, but biggest problem is that it only caches > > writes > > for blocks that are already in cache > > > Bcache - Didn't really spend much time looking at this. Looks as if > > development activity is dying down and potential stability issues > > > DM-WriteBoost - Great performance, really suits RBD requirements. > > Unfortunately the ram buffering part seems to not play safe with iSCSI use. > > > With something like flashcache on a RBD I can max out the SSD with small > > sequential write IO's and it then passes them to the RBD in nice large > > IO's. > > Bursty random writes also benefit. > > > Caching the OSD's, or more specifically a small section where the levelDB > > lives can greatly improve small block write performance. Flashcache is best > > for this as you can limit the sequential cutoff to the levelDB transaction > > size. DMcache is potentially an option as well. You can probably halve OSD > > latency by doing this. > > > > > > > > Thanks > > > > > > > > Dominik > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@
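For completeness, setting up the flashcache variant Nick rates highest is a one-liner once the kernel module is loaded; the device names here are placeholders and the SSD partition's existing contents are lost:

# Writeback ('back') cache named osdcache in front of a spinning OSD disk:
flashcache_create -p back osdcache /dev/sdX1 /dev/sdY

# The OSD filesystem then goes on /dev/mapper/osdcache rather than the raw
# disk; tearing it down later is done with dmsetup remove and, once the
# cache is no longer in use, flashcache_destroy on the SSD partition.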
Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]
I can confirm that I am having similar issues with ubuntu vm guests using fio with bs=4k direct=1 numjobs=4 iodepth=16. Occasionally i see hang tasks, occasionally guest vm stops responding without leaving anything in the logs and sometimes i see kernel panic on the console. I typically leave the runtime of the fio test for 60 minutes and it tends to stop responding after about 10-30 mins. I am on ubuntu 12.04 with 3.5 kernel backport and using ceph 0.61.7 with qemu 1.5.0 and libvirt 1.0.2 Andrei - Original Message - From: "Oliver Francke" To: "Josh Durgin" Cc: ceph-users@lists.ceph.com, "Mike Dawson" , "Stefan Hajnoczi" , qemu-de...@nongnu.org Sent: Friday, 9 August, 2013 10:22:00 AM Subject: Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686] Hi Josh, just opened http://tracker.ceph.com/issues/5919 with all collected information incl. debug-log. Hope it helps, Oliver. On 08/08/2013 07:01 PM, Josh Durgin wrote: > On 08/08/2013 05:40 AM, Oliver Francke wrote: >> Hi Josh, >> >> I have a session logged with: >> >> debug_ms=1:debug_rbd=20:debug_objectcacher=30 >> >> as you requested from Mike, even if I think, we do have another story >> here, anyway. >> >> Host-kernel is: 3.10.0-rc7, qemu-client 1.6.0-rc2, client-kernel is >> 3.2.0-51-amd... >> >> Do you want me to open a ticket for that stuff? I have about 5MB >> compressed logfile waiting for you ;) > > Yes, that'd be great. If you could include the time when you saw the > guest hang that'd be ideal. I'm not sure if this is one or two bugs, > but it seems likely it's a bug in rbd and not qemu. > > Thanks! > Josh > >> Thnx in advance, >> >> Oliver. >> >> On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote: >>> On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote: Am 02.08.2013 um 23:47 schrieb Mike Dawson : > We can "un-wedge" the guest by opening a NoVNC session or running a > 'virsh screenshot' command. After that, the guest resumes and runs > as expected. At that point we can examine the guest. Each time we'll > see: >>> If virsh screenshot works then this confirms that QEMU itself is still >>> responding. Its main loop cannot be blocked since it was able to >>> process the screendump command. >>> >>> This supports Josh's theory that a callback is not being invoked. The >>> virtio-blk I/O request would be left in a pending state. >>> >>> Now here is where the behavior varies between configurations: >>> >>> On a Windows guest with 1 vCPU, you may see the symptom that the >>> guest no >>> longer responds to ping. >>> >>> On a Linux guest with multiple vCPUs, you may see the hung task message >>> from the guest kernel because other vCPUs are still making progress. >>> Just the vCPU that issued the I/O request and whose task is in >>> UNINTERRUPTIBLE state would really be stuck. >>> >>> Basically, the symptoms depend not just on how QEMU is behaving but >>> also >>> on the guest kernel and how many vCPUs you have configured. >>> >>> I think this can explain how both problems you are observing, Oliver >>> and >>> Mike, are a result of the same bug. At least I hope they are :). 
>>> >>> Stefan >> >> > -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Geschäftsführer: J.Rehpöhler | C.Kunz Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
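For anyone trying to reproduce the hang, the guest-side workload described at the top of this message maps onto a single fio invocation. The mount point is a placeholder, and since the original post does not say which I/O pattern was used, random writes are assumed here:

fio --name=hangtest --directory=/mnt/rbd-disk --size=4G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --numjobs=4 --iodepth=16 --runtime=3600 --time_based --group_reporting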
Re: [ceph-users] SSD pool write performance
Hi i've also tested 4k performance and found similar results with fio and iozone tests as well as simple dd. I've noticed that my io rate doesn't go above 2k-3k in the virtual machines. I've got two servers with ssd journals but spindles for the osd. I've previusly tried to use nfs + zfs on the same hardware with the same drives acting as cache drives. The nfs performance was far better for 4k io. I was hitting around 60k when the storage servers were reading the test file from ram. It looks like some more optimisations have to be done to fix the current bottleneck. Having said this, the read performance from multiple clients would excel NFS by far. In nfs I would not see the total speeds over 450-500 but with ceph i was going over 1GB/s Andrei - Original Message - From: "Sergey Pimkov" To: ceph-users@lists.ceph.com Sent: Thursday, 10 October, 2013 8:47:32 PM Subject: [ceph-users] SSD pool write performance Hello! I'm testing small CEPH pool consists of some SSD drives (without any spinners). Ceph version is 0.67.4. Seems like write performance of this configuration is not so good as possible, when I testing it with small block size (4k). Pool configuration: 2 mons on separated hosts, one host with two OSD. First partition of each disk is used for journal and has 20Gb size, second is formatted as XFS and used for data (mount options: rw,noexec,nodev,noatime,nodiratime,inode64). 20% of space left unformatted. Journal aio and dio turned on. Each disk has about 15k IOPS with 4k blocks, iodepth 1 and 50k IOPS with 4k block, iodepth 16 (tested with fio). Linear throughput of disks is about 420Mb/s. Network throughput is 1Gbit/s. I use rbd pool with size 1 and want this pool to act like RAID0 at this time. Virtual machine (QEMU/KVM) on separated host is configured to use 100Gb RBD as second disk. Fio running in this machine (iodepth 16, buffered=0, direct=1, libaio, 4k randwrite) shows about 2.5-3k IOPS. Multiple quests with the same configuration shows similar summary result. Local kernel RBD on host with OSD also shows about 2-2.5k IOPS. Latency is about 7ms. I also tried to pre-fill RBD without any results. Atop shows about 90% disks utilization during tests. CPU utilization is about 400% (2x Xeon E5504 is installed on ceph node). There is a lot of free memory on host. Blktrace shows that about 4k operations (4k to about 40k bytes) completing every second on every disk. OSD throughput is about 30 MB/s. I expected to see about 2 x 50k/4 = 20-30k IOPS on RBD, so is that too optimistic for CEPH with such load or if I missed up something important? I also tried to use one disk as journal (20GB, last space left unformatted) and configure the next disk as OSD, this configuration have shown almost the same result. Playing with some osd/filestore/journal options with admin socket ended with no result. Please, tell me am I wrong with this setup? Or should I use more disks to get better performance with small concurrent writes? Or is ceph optimized for work with slow spinners and shouldn't be used with SSD disk only? Thank you very much in advance! 
My ceph configuration:

ceph.conf
==
[global]
auth cluster required = none
auth service required = none
auth client required = none

[client]
rbd cache = true
rbd cache max dirty = 0

[osd]
osd journal aio = true
osd max backfills = 4
osd recovery max active = 1
filestore max sync interval = 5

[mon.1]
host = ceph1
mon addr = 10.10.0.1:6789

[mon.2]
host = ceph2
mon addr = 10.10.0.2:6789

[osd.72]
host = ceph7
devs = /dev/sdd2
osd journal = /dev/sdd1

[osd.73]
host = ceph7
devs = /dev/sde2
osd journal = /dev/sde1

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
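To take the guest and the RBD client out of the picture and measure the raw small-block behaviour of the pool, the same workload can be pointed at the cluster directly with rados bench from any client node; the pool name and duration are placeholders:

# 4k writes for 60 seconds against the 'rbd' pool, first with a single
# in-flight op, then with 16 to show the effect of parallelism:
rados bench -p rbd 60 write -b 4096 -t 1
rados bench -p rbd 60 write -b 4096 -t 16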
[ceph-users] CloudStack 4.2 - radosgw / S3 storage issues
Hello guys, I am doing a test ACS setup to see how we can use Ceph for both Primary and Secondary storage services. I have now successfully added both Primary (cluster wide) and Secondary storage. However, I've noticed that my SSVM and CPVM are not being created, so digging in the logs revealed the following exceptions: The radosgw logs show the following: 2013-10-29 00:19:38.289487 7f2aa7d9f780 20 enqueued request req=0x2390060 2013-10-29 00:19:38.289518 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289521 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289529 7f2aa7d9f780 10 allocated request req=0x23452f0 2013-10-29 00:19:38.289572 7f2aa7d9f780 20 enqueued request req=0x23452f0 2013-10-29 00:19:38.289575 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289576 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289578 7f2aa7d9f780 20 req: 0x23452f0 2013-10-29 00:19:38.289610 7f2aa7d9f780 10 allocated request req=0x23a1630 2013-10-29 00:19:38.289613 7f2a54ff9700 20 dequeued request req=0x2390060 2013-10-29 00:19:38.289627 7f2a54ff9700 20 RGWWQ: 2013-10-29 00:19:38.289629 7f2a54ff9700 20 req: 0x23452f0 2013-10-29 00:19:38.289647 7f2a54ff9700 1 == starting new request req=0x2390060 = 2013-10-29 00:19:38.289650 7f2a36fcd700 20 dequeued request req=0x23452f0 2013-10-29 00:19:38.289675 7f2a36fcd700 20 RGWWQ: empty 2013-10-29 00:19:38.289685 7f2a36fcd700 1 == starting new request req=0x23452f0 = 2013-10-29 00:19:38.289715 7f2a54ff9700 2 req 1291:0.69::POST /template%2Ftmpl%2F1%2F4%2Fcentos55-x86_64%2Feec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2::initializing 2013-10-29 00:19:38.289723 7f2a54ff9700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289755 7f2a36fcd700 2 req 1292:0.69::POST /template%2Ftmpl%2F1%2F3%2Frouting-3%2Fsystemvmtemplate-2013-06-12-master-kvm.qcow2.bz2::initializing 2013-10-29 00:19:38.289761 7f2a36fcd700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289761 7f2a54ff9700 10 s->object=tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289770 7f2a54ff9700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289771 7f2a54ff9700 20 SCRIPT_URL=/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 2013-10-29 00:19:38.289773 7f2a54ff9700 20 SCRIPT_URI=http://cloudstack-secondary.arh-ibstorage1domain-name.com/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.b z2 2013-10-29 00:19:38.289775 7f2a54ff9700 20 HTTP_AUTHORIZATION=AWS S3-User-Key:v1NjAqxoFbROJOlBPRWyOSw8IZI= 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_HOST=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_DATE=Tue, 29 Oct 2013 00:19:38 GMT 2013-10-29 00:19:38.289777 7f2a54ff9700 20 HTTP_USER_AGENT=aws-sdk-java/1.3.22 Linux/3.5.0-42-generic OpenJDK_64-Bit_Server_VM/20.0-b12 2013-10-29 00:19:38.289778 7f2a54ff9700 20 CONTENT_TYPE=application/x-bzip2 2013-10-29 00:19:38.289780 7f2a54ff9700 20 HTTP_TRANSFER_ENCODING=chunked 2013-10-29 00:19:38.289782 7f2a54ff9700 20 HTTP_CONNECTION=Keep-Alive 2013-10-29 00:19:38.289784 7f2a54ff9700 20 PATH=/usr/local/bin:/usr/bin:/bin 2013-10-29 00:19:38.289785 7f2a54ff9700 20 SERVER_SIGNATURE= 2013-10-29 00:19:38.289786 7f2a54ff9700 20 SERVER_SOFTWARE=Apache/2.2.22 (Ubuntu) 2013-10-29 00:19:38.289787 7f2a54ff9700 20 SERVER_NAME=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 
00:19:38.289788 7f2a54ff9700 20 SERVER_ADDR=192.168.169.200 2013-10-29 00:19:38.289789 7f2a54ff9700 20 SERVER_PORT=80 2013-10-29 00:19:38.289790 7f2a54ff9700 20 REMOTE_ADDR=192.168.169.1 2013-10-29 00:19:38.289790 7f2a36fcd700 10 s->object=tmpl/1/3/routing-3/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289791 7f2a54ff9700 20 DOCUMENT_ROOT=/var/www 2013-10-29 00:19:38.289794 7f2a54ff9700 20 SCRIPT_FILENAME=/var/www/s3gw.fcgi 2013-10-29 00:19:38.289794 7f2a54ff9700 20 REMOTE_PORT=34613 2013-10-29 00:19:38.289795 7f2a54ff9700 20 GATEWAY_INTERFACE=CGI/1.1 2013-10-29 00:19:38.289796 7f2a54ff9700 20 SERVER_PROTOCOL=HTTP/1.1 2013-10-29 00:19:38.289797 7f2a54ff9700 20 REQUEST_METHOD=POST 2013-10-29 00:19:38.289796 7f2a36fcd700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289798 7f2a54ff9700 20 QUERY_STRING=page=template¶ms=/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2&uploads 2013-10-29 00:19:38.289798 7f2a36fcd700 20 SCRIPT_URL=/template/tmpl/1/3/routing-3/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 2013-10-29 00:19:38.289799 7f2a54ff9700 20 REQUEST_URI=/template%2Ftmpl%2F1%2F4%2Fcentos55-x86_64%2Feec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2?uploads 2013-10-29 00:19:38.289800 7f2a36fcd700 20 SCRIPT_URI=http://cloudstack-secondary.arh-ibstorag
Re: [ceph-users] CloudStack 4.2 - radosgw / S3 storage issues
To answer myself - there was a problem with my api secret key which rados generated. It has escaped the "/", which for some reason CloudStack couldn't understand. Removing the escape (\) character has solved the problem. Andrei - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Tuesday, 29 October, 2013 11:33:44 AM Subject: [ceph-users] CloudStack 4.2 - radosgw / S3 storage issues Hello guys, I am doing a test ACS setup to see how we can use Ceph for both Primary and Secondary storage services. I have now successfully added both Primary (cluster wide) and Secondary storage. However, I've noticed that my SSVM and CPVM are not being created, so digging in the logs revealed the following exceptions: The radosgw logs show the following: 2013-10-29 00:19:38.289487 7f2aa7d9f780 20 enqueued request req=0x2390060 2013-10-29 00:19:38.289518 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289521 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289529 7f2aa7d9f780 10 allocated request req=0x23452f0 2013-10-29 00:19:38.289572 7f2aa7d9f780 20 enqueued request req=0x23452f0 2013-10-29 00:19:38.289575 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289576 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289578 7f2aa7d9f780 20 req: 0x23452f0 2013-10-29 00:19:38.289610 7f2aa7d9f780 10 allocated request req=0x23a1630 2013-10-29 00:19:38.289613 7f2a54ff9700 20 dequeued request req=0x2390060 2013-10-29 00:19:38.289627 7f2a54ff9700 20 RGWWQ: 2013-10-29 00:19:38.289629 7f2a54ff9700 20 req: 0x23452f0 2013-10-29 00:19:38.289647 7f2a54ff9700 1 == starting new request req=0x2390060 = 2013-10-29 00:19:38.289650 7f2a36fcd700 20 dequeued request req=0x23452f0 2013-10-29 00:19:38.289675 7f2a36fcd700 20 RGWWQ: empty 2013-10-29 00:19:38.289685 7f2a36fcd700 1 == starting new request req=0x23452f0 = 2013-10-29 00:19:38.289715 7f2a54ff9700 2 req 1291:0.69::POST /template%2Ftmpl%2F1%2F4%2Fcentos55-x86_64%2Feec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2::initializing 2013-10-29 00:19:38.289723 7f2a54ff9700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289755 7f2a36fcd700 2 req 1292:0.69::POST /template%2Ftmpl%2F1%2F3%2Frouting-3%2Fsystemvmtemplate-2013-06-12-master-kvm.qcow2.bz2::initializing 2013-10-29 00:19:38.289761 7f2a36fcd700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289761 7f2a54ff9700 10 s->object=tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289770 7f2a54ff9700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289771 7f2a54ff9700 20 SCRIPT_URL=/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 2013-10-29 00:19:38.289773 7f2a54ff9700 20 SCRIPT_URI=http://cloudstack-secondary.arh-ibstorage1domain-name.com/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.b z2 2013-10-29 00:19:38.289775 7f2a54ff9700 20 HTTP_AUTHORIZATION=AWS S3-User-Key:v1NjAqxoFbROJOlBPRWyOSw8IZI= 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_HOST=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_DATE=Tue, 29 Oct 2013 00:19:38 GMT 2013-10-29 00:19:38.289777 7f2a54ff9700 20 HTTP_USER_AGENT=aws-sdk-java/1.3.22 Linux/3.5.0-42-generic OpenJDK_64-Bit_Server_VM/20.0-b12 2013-10-29 00:19:38.289778 7f2a54ff9700 20 CONTENT_TYPE=application/x-bzip2 2013-10-29 00:19:38.289780 7f2a54ff9700 20 
HTTP_TRANSFER_ENCODING=chunked 2013-10-29 00:19:38.289782 7f2a54ff9700 20 HTTP_CONNECTION=Keep-Alive 2013-10-29 00:19:38.289784 7f2a54ff9700 20 PATH=/usr/local/bin:/usr/bin:/bin 2013-10-29 00:19:38.289785 7f2a54ff9700 20 SERVER_SIGNATURE= 2013-10-29 00:19:38.289786 7f2a54ff9700 20 SERVER_SOFTWARE=Apache/2.2.22 (Ubuntu) 2013-10-29 00:19:38.289787 7f2a54ff9700 20 SERVER_NAME=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 00:19:38.289788 7f2a54ff9700 20 SERVER_ADDR=192.168.169.200 2013-10-29 00:19:38.289789 7f2a54ff9700 20 SERVER_PORT=80 2013-10-29 00:19:38.289790 7f2a54ff9700 20 REMOTE_ADDR=192.168.169.1 2013-10-29 00:19:38.289790 7f2a36fcd700 10 s->object=tmpl/1/3/routing-3/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289791 7f2a54ff9700 20 DOCUMENT_ROOT=/var/www 2013-10-29 00:19:38.289794 7f2a54ff9700 20 SCRIPT_FILENAME=/var/www/s3gw.fcgi 2013-10-29 00:19:38.289794 7f2a54ff9700 20 REMOTE_PORT=34613 2013-10-29 00:19:38.289795 7f2a54ff9700 20 GATEWAY_INTERFACE=CGI/1.1 2013-10-29 00:19:38.289796 7f2a54ff9700 20 SERVER_PROTOCOL=HTTP/1.1 2013-10-29 00:19:38.289797 7f2a54ff9700 20 REQUEST_METHOD=POST 2013-10-29 00:19:38.289796 7f2a36fcd700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289798 7f2a54ff9700 20 QUERY_STRING=
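For anyone hitting the same problem, a rough sketch of how to check and, if needed, regenerate the radosgw S3 key (the uid "cloudstack" below is only a placeholder for whatever user was created for ACS; note that radosgw-admin prints JSON, which may show a "/" in the secret escaped as "\/" - the backslash is not part of the actual key):

# show the current access/secret key pair for the user
radosgw-admin user info --uid=cloudstack

# if the secret contains characters the S3 client cannot handle,
# generate a fresh secret for the same user and update the client config
radosgw-admin key create --uid=cloudstack --key-type=s3 --gen-secret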
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Ilya, Was wondering if you've had a chance to look into performance issues with rbd and the patched kernel? I've downloaded 3.16.3 and running some dd tests, which were producing hang tasks in the past. I've noticed that i can't get past 20mb/s on the rbd mounted volume. I am sure I was hitting over 60MB/s before. Cheers andrei - Original Message - > From: "Ilya Dryomov" > To: "Micha Krause" > Cc: ceph-users@lists.ceph.com > Sent: Wednesday, 24 September, 2014 3:33:20 PM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > On Wed, Sep 24, 2014 at 4:52 PM, Micha Krause > wrote: > > Hi, > > > >> Like I mentioned in my other reply, I'd be very interested in any > >> > >> similar messages on kernel other than 3.15.*, 3.16.1 and 3.16.2. > >> One > >> hung task stack trace is usually not enough to diagnose this sort > >> of > >> problems. > > > > > > Ok, here is a more complete dmesg output from kernel 3.14: > > > > [ 22.250600] rbd: loaded > > [ 23.159914] libceph: client24407525 fsid > > 46e857ee-855c-4165-8413-8950f8d081be > > [ 23.289691] libceph: mon1 10.210.34.11:6789 session established > > [ 23.890625] rbd2: unknown partition table > > [ 23.890702] rbd: rbd2: added with size 0x100 > > [ 23.937051] rbd0: unknown partition table > > [ 23.937144] rbd: rbd0: added with size 0x100 > > [ 24.052402] rbd1: unknown partition table > > [ 24.052479] rbd: rbd1: added with size 0xa00 > > [ 24.396333] rbd3: unknown partition table > > [ 24.396430] rbd: rbd3: added with size 0x100 > > [ 25.927373] SGI XFS with ACLs, security attributes, realtime, > > large > > block/inode numbers, no debug enabled > > [ 25.960975] XFS (rbd1): Mounting Filesystem > > [ 25.961072] XFS (rbd3): Mounting Filesystem > > [ 25.961637] XFS (rbd2): Mounting Filesystem > > [ 25.961708] XFS (rbd0): Mounting Filesystem > > [ 28.236952] XFS (rbd3): Starting recovery (logdev: internal) > > [ 28.794631] XFS (rbd1): Starting recovery (logdev: internal) > > [ 31.501516] XFS (rbd0): Starting recovery (logdev: internal) > > [ 35.498950] XFS (rbd2): Starting recovery (logdev: internal) > > [ 63.601465] XFS (rbd0): Ending recovery (logdev: internal) > > [ 64.214852] XFS (rbd3): Ending recovery (logdev: internal) > > [ 64.783531] rbd4: unknown partition table > > [ 64.784005] rbd: rbd4: added with size 0x100 > > [ 65.280960] XFS (rbd4): Mounting Filesystem > > [ 68.443439] XFS (rbd2): Ending recovery (logdev: internal) > > [ 69.030358] XFS (rbd4): Starting recovery (logdev: internal) > > [ 69.945523] rbd5: unknown partition table > > [ 69.946021] rbd: rbd5: added with size 0x100 > > [ 70.398567] XFS (rbd5): Mounting Filesystem > > [ 71.187934] XFS (rbd5): Starting recovery (logdev: internal) > > [ 74.144173] rbd6: unknown partition table > > [ 74.144630] rbd: rbd6: added with size 0x100 > > [ 75.402767] XFS (rbd6): Mounting Filesystem > > [ 76.133654] XFS (rbd6): Starting recovery (logdev: internal) > > [ 111.131893] XFS (rbd4): Ending recovery (logdev: internal) > > [ 112.460383] rbd7: unknown partition table > > [ 112.460898] rbd: rbd7: added with size 0x100 > > [ 116.834457] XFS (rbd5): Ending recovery (logdev: internal) > > [ 116.949218] XFS (rbd6): Ending recovery (logdev: internal) > > [ 166.357039] XFS (rbd1): Ending recovery (logdev: internal) > > [ 167.531353] XFS (rbd7): Mounting Filesystem > > [ 168.303166] XFS (rbd7): Starting recovery (logdev: internal) > > [ 172.477811] XFS (rbd7): Ending recovery (logdev: internal) > > [ 2038.723394] INFO: task kthreadd:2 blocked for more than 120 > > seconds. 
> > [ 2038.723497] Not tainted 3.14-0.bpo.1-amd64 #1 > > [ 2038.723553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables > > this message. > > [ 2038.723637] kthreadd D 88003fc14340 0 2 0 > > 0x > > [ 2038.723641] 88003fa3a8e0 0046 88003fa43628 > > 81813480 > > [ 2038.723644] 00014340 88003fa43fd8 00014340 > > 88003fa3a8e0 > > [ 2038.723646] 88003fa43638 886a7410 7fff > > 886a7408 > > [ 2038.723648] Call Trace: > > [ 2038.723660] [] ? schedule_timeout+0x1ed/0x250 > > [ 2038.723665] [] ? blk_finish_plug+0xb/0x30 > > [ 2038.723677] [] ? _xfs_buf_ioapply+0x277/0x2e0 > > [xfs] > > [ 2038.723680] [] ? > > wait_for_completion+0xa4/0x110 > > [ 2038.723685] [] ? try_to_wake_up+0x280/0x280 > > [ 2038.723691] [] ? xfs_bwrite+0x23/0x60 [xfs] > > [ 2038.723696] [] ? xfs_buf_iowait+0x96/0xf0 > > [xfs] > > [ 2038.723703] [] ? xfs_bwrite+0x23/0x60 [xfs] > > [ 2038.723711] [] ? xfs_reclaim_inode+0x2f4/0x310 > > [xfs] > > [ 2038.723720] [] ? > > xfs_reclaim_inodes_ag+0x1f7/0x320 > > [xfs] > > [ 2038.723729] [] ? > > xfs_reclaim_inodes_nr+0x2c/0x40 [xfs] > > [ 2038.723736] [] ? super_cache_scan+0x167/0x170 > > [ 2038.723742] [] ? shrink_slab_node+0x126/0x290 > > [ 2038.723746] [] ? vmpressure+0x23/0xa0 > > [ 2038.723750] [] ? shrink_slab+0x82/0x130 > >
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
I also had the hang task issues with 3.13.0-35-generic - Original Message - > From: "German Anders" > To: "Micha Krause" > Cc: ceph-users@lists.ceph.com > Sent: Wednesday, 24 September, 2014 4:35:15 PM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > 3.13.0-35 -generic? really? I found my self in a similar situation > like yours and making a downgrade to that version works fine, also > you could try 3.14.9-031, it work fine for me also. > German Anders > > --- Original message --- > > > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > > Server > > > From: Micha Krause > > > To: German Anders , Ilya Dryomov > > > > > Cc: > > > Date: Wednesday, 24/09/2014 12:33 > > > Hi, > > > > things work fine on kernel 3.13.0-35 > > > > > I can reproduce this on 3.13.10, and I had it once on 3.13.0-35 as > > well. > > > Micha Krause > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Guys, Have done some testing with 3.16.3-031603-generic downloaded from Ubuntu utopic branch. The hang task problem is gone when using large block size (tested with 1M and 4M) and I could no longer preproduce the hang tasks while doing 100 dd tests in a for loop. However, I can confirm that I am still getting hang tasks while working with a 4K block size. The hang tasks start after about an hour, but they do not cause the server crash. After a while the dd test times out and continues with the loop. This is what I was running: for i in {1..100} ; do time dd if=/dev/zero of=/tmp/mount/1G bs=4K count=25K oflag=direct ; done The following test definately produces the hang tasks like these: [23160.549785] INFO: task dd:2033 blocked for more than 120 seconds. [23160.588364] Tainted: G OE 3.16.3-031603-generic #201409171435 [23160.627998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23160.706856] dd D 000b 0 2033 23859 0x [23160.706861] 88011cec78c8 0082 88011cec78d8 88011cec7fd8 [23160.706865] 000143c0 000143c0 88048661bcc0 880113441440 [23160.706868] 88011cec7898 88067fd54cc0 880113441440 880113441440 [23160.706871] Call Trace: [23160.706883] [] schedule+0x29/0x70 [23160.706887] [] io_schedule+0x8f/0xd0 [23160.706893] [] dio_await_completion+0x54/0xd0 [23160.706897] [] do_blockdev_direct_IO+0x958/0xcc0 [23160.706903] [] ? wake_up_bit+0x2e/0x40 [23160.706908] [] ? jbd2_journal_dirty_metadata+0xc5/0x260 [23160.706914] [] ? ext4_get_block_write+0x20/0x20 [23160.706919] [] __blockdev_direct_IO+0x4c/0x50 [23160.706922] [] ? ext4_get_block_write+0x20/0x20 [23160.706928] [] ext4_ind_direct_IO+0xce/0x410 [23160.706931] [] ? ext4_get_block_write+0x20/0x20 [23160.706935] [] ext4_ext_direct_IO+0x1bb/0x2a0 [23160.706938] [] ? __ext4_journal_stop+0x78/0xa0 [23160.706942] [] ext4_direct_IO+0xec/0x1e0 [23160.706946] [] ? __mark_inode_dirty+0x53/0x2d0 [23160.706952] [] generic_file_direct_write+0xbb/0x180 [23160.706957] [] ? mnt_clone_write+0x12/0x30 [23160.706960] [] __generic_file_write_iter+0x2a7/0x350 [23160.706963] [] ext4_file_write_iter+0x111/0x3d0 [23160.706969] [] ? iov_iter_init+0x14/0x40 [23160.706976] [] new_sync_write+0x7b/0xb0 [23160.706978] [] vfs_write+0xc7/0x1f0 [23160.706980] [] SyS_write+0x4f/0xb0 [23160.706985] [] system_call_fastpath+0x1a/0x1f [23280.705400] INFO: task dd:2033 blocked for more than 120 seconds. [23280.745358] Tainted: G OE 3.16.3-031603-generic #201409171435 [23280.785069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23280.864158] dd D 000b 0 2033 23859 0x [23280.864164] 88011cec78c8 0082 88011cec78d8 88011cec7fd8 [23280.864167] 000143c0 000143c0 88048661bcc0 880113441440 [23280.864170] 88011cec7898 88067fd54cc0 880113441440 880113441440 [23280.864173] Call Trace: [23280.864185] [] schedule+0x29/0x70 [23280.864197] [] io_schedule+0x8f/0xd0 [23280.864203] [] dio_await_completion+0x54/0xd0 [23280.864207] [] do_blockdev_direct_IO+0x958/0xcc0 [23280.864213] [] ? wake_up_bit+0x2e/0x40 [23280.864218] [] ? jbd2_journal_dirty_metadata+0xc5/0x260 [23280.864224] [] ? ext4_get_block_write+0x20/0x20 [23280.864229] [] __blockdev_direct_IO+0x4c/0x50 [23280.864239] [] ? ext4_get_block_write+0x20/0x20 [23280.864244] [] ext4_ind_direct_IO+0xce/0x410 [23280.864247] [] ? ext4_get_block_write+0x20/0x20 [23280.864251] [] ext4_ext_direct_IO+0x1bb/0x2a0 [23280.864254] [] ? __ext4_journal_stop+0x78/0xa0 [23280.864258] [] ext4_direct_IO+0xec/0x1e0 [23280.864263] [] ? 
__mark_inode_dirty+0x53/0x2d0 [23280.864268] [] generic_file_direct_write+0xbb/0x180 [23280.864273] [] ? mnt_clone_write+0x12/0x30 [23280.864284] [] __generic_file_write_iter+0x2a7/0x350 [23280.864289] [] ext4_file_write_iter+0x111/0x3d0 [23280.864295] [] ? iov_iter_init+0x14/0x40 [23280.864300] [] new_sync_write+0x7b/0xb0 [23280.864302] [] vfs_write+0xc7/0x1f0 [23280.864307] [] SyS_write+0x4f/0xb0 [23280.864314] [] system_call_fastpath+0x1a/0x1f [23400.861043] INFO: task dd:2033 blocked for more than 120 seconds. [23400.901529] Tainted: G OE 3.16.3-031603-generic #201409171435 [23400.942255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23401.020985] dd D 000b 0 2033 23859 0x [23401.020991] 88011cec78c8 0082 88011cec78d8 88011cec7fd8 [23401.020995] 000143c0 000143c0 88048661bcc0 880113441440 [23401.020997] 88011cec7898 88067fd54cc0 880113441440 880113441440 [23401.021001] Call Trace: [23401.021014] [] schedule+0x29/0x70 [23401.021025] [] io_schedule+0x8f/0xd0 [23401.021031] [] dio_await_completion+0x54/0xd0 [23401.021035] [] do_blockdev_direct_IO+0x958/0xcc0 [23401.021041] [] ? wake_up_bit+0x2e/0x40 [23401.0210
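A small addition for anyone trying to reproduce this: instead of waiting for the 120 second hung-task watchdog, the stacks of all blocked (D state) tasks can be dumped on demand, assuming the magic sysrq interface is enabled. A minimal sketch:

# check that sysrq is enabled (1, or a mask that allows task dumps)
sysctl kernel.sysrq

# dump stacks of all uninterruptible (blocked) tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# the watchdog interval itself can be inspected or changed here
cat /proc/sys/kernel/hung_task_timeout_secs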
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Right, I've stopped the tests because it is just getting ridiculous. Without rbd cache enabled, dd tests run extremely slow: dd if=/dev/zero of=/tmp/mount/1G bs=1M count=1000 oflag=direct 230+0 records in 230+0 records out 241172480 bytes (241 MB) copied, 929.71 s, 259 kB/s Any thoughts why I am getting 250kb/s instead of expected 100MB/s+ with large block size? How do I investigate what's causing this crappy performance? Cheers Andrei - Original Message - > From: "Andrei Mikhailovsky" > To: "Micha Krause" > Cc: ceph-users@lists.ceph.com > Sent: Thursday, 25 September, 2014 10:58:07 AM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > Guys, > Have done some testing with 3.16.3-031603-generic downloaded from > Ubuntu utopic branch. The hang task problem is gone when using large > block size (tested with 1M and 4M) and I could no longer preproduce > the hang tasks while doing 100 dd tests in a for loop. > However, I can confirm that I am still getting hang tasks while > working with a 4K block size. The hang tasks start after about an > hour, but they do not cause the server crash. After a while the dd > test times out and continues with the loop. This is what I was > running: > for i in {1..100} ; do time dd if=/dev/zero of=/tmp/mount/1G bs=4K > count=25K oflag=direct ; done > The following test definately produces the hang tasks like these: > [23160.549785] INFO: task dd:2033 blocked for more than 120 seconds. > [23160.588364] Tainted: G OE 3.16.3-031603-generic #201409171435 > [23160.627998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [23160.706856] dd D 000b 0 2033 23859 0x > [23160.706861] 88011cec78c8 0082 88011cec78d8 > 88011cec7fd8 > [23160.706865] 000143c0 000143c0 88048661bcc0 > 880113441440 > [23160.706868] 88011cec7898 88067fd54cc0 880113441440 > 880113441440 > [23160.706871] Call Trace: > [23160.706883] [] schedule+0x29/0x70 > [23160.706887] [] io_schedule+0x8f/0xd0 > [23160.706893] [] dio_await_completion+0x54/0xd0 > [23160.706897] [] do_blockdev_direct_IO+0x958/0xcc0 > [23160.706903] [] ? wake_up_bit+0x2e/0x40 > [23160.706908] [] ? > jbd2_journal_dirty_metadata+0xc5/0x260 > [23160.706914] [] ? ext4_get_block_write+0x20/0x20 > [23160.706919] [] __blockdev_direct_IO+0x4c/0x50 > [23160.706922] [] ? ext4_get_block_write+0x20/0x20 > [23160.706928] [] ext4_ind_direct_IO+0xce/0x410 > [23160.706931] [] ? ext4_get_block_write+0x20/0x20 > [23160.706935] [] ext4_ext_direct_IO+0x1bb/0x2a0 > [23160.706938] [] ? __ext4_journal_stop+0x78/0xa0 > [23160.706942] [] ext4_direct_IO+0xec/0x1e0 > [23160.706946] [] ? __mark_inode_dirty+0x53/0x2d0 > [23160.706952] [] > generic_file_direct_write+0xbb/0x180 > [23160.706957] [] ? mnt_clone_write+0x12/0x30 > [23160.706960] [] > __generic_file_write_iter+0x2a7/0x350 > [23160.706963] [] ext4_file_write_iter+0x111/0x3d0 > [23160.706969] [] ? iov_iter_init+0x14/0x40 > [23160.706976] [] new_sync_write+0x7b/0xb0 > [23160.706978] [] vfs_write+0xc7/0x1f0 > [23160.706980] [] SyS_write+0x4f/0xb0 > [23160.706985] [] system_call_fastpath+0x1a/0x1f > [23280.705400] INFO: task dd:2033 blocked for more than 120 seconds. > [23280.745358] Tainted: G OE 3.16.3-031603-generic #201409171435 > [23280.785069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. 
> [23280.864158] dd D 000b 0 2033 23859 0x > [23280.864164] 88011cec78c8 0082 88011cec78d8 > 88011cec7fd8 > [23280.864167] 000143c0 000143c0 88048661bcc0 > 880113441440 > [23280.864170] 88011cec7898 88067fd54cc0 880113441440 > 880113441440 > [23280.864173] Call Trace: > [23280.864185] [] schedule+0x29/0x70 > [23280.864197] [] io_schedule+0x8f/0xd0 > [23280.864203] [] dio_await_completion+0x54/0xd0 > [23280.864207] [] do_blockdev_direct_IO+0x958/0xcc0 > [23280.864213] [] ? wake_up_bit+0x2e/0x40 > [23280.864218] [] ? > jbd2_journal_dirty_metadata+0xc5/0x260 > [23280.864224] [] ? ext4_get_block_write+0x20/0x20 > [23280.864229] [] __blockdev_direct_IO+0x4c/0x50 > [23280.864239] [] ? ext4_get_block_write+0x20/0x20 > [23280.864244] [] ext4_ind_direct_IO+0xce/0x410 > [23280.864247] [] ? ext4_get_block_write+0x20/0x20 > [23280.864251] [] ext4_ext_direct_IO+0x1bb/0x2a0 > [23280.864254] [] ? __ext4_journal_stop+0x78/0xa0 > [23280.864258] [] ext4_direct_IO+0xec/0x1e0 > [23280.864263] [] ? __mark_inode_dirty+0x53/0x2d0 > [23280.864268] [] > generic_file_direct_write+0xbb/0x180 > [23280.864273] [] ? mnt_clone_write+0x12/0x30 > [23280.
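Since rbd cache keeps coming up in this thread, for reference this is what enabling it looks like in the [client] section of ceph.conf on the client host (the numbers below are only example values). One caveat worth flagging: these settings apply to librbd clients such as qemu; the kernel rbd driver used by rbd map does not use the librbd cache, so for a krbd-based gateway they will not change anything.

[client]
    rbd cache = true
    rbd cache size = 67108864                  # 64MB, example value
    rbd cache max dirty = 50331648             # example value
    rbd cache writethrough until flush = true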
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Ilya, I've not used rbd map on older kernels. Just experimenting with rbd map to have an iscsi and nfs gateway service for hypervisors such as xenserver and vmware. I've tried it with the latest ubuntu LTS kernel 3.13 I believe and noticed the issue. Can you not reproduce the hang tasks when doing dd testing? have you tried 4K block sizes and running it for sometime, like I have done? Thanks Andrei - Original Message - > From: "Ilya Dryomov" > To: "Andrei Mikhailovsky" > Cc: "Micha Krause" , ceph-users@lists.ceph.com > Sent: Thursday, 25 September, 2014 12:04:37 PM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > On Thu, Sep 25, 2014 at 1:58 PM, Andrei Mikhailovsky > wrote: > > Guys, > > > > Have done some testing with 3.16.3-031603-generic downloaded from > > Ubuntu > > utopic branch. The hang task problem is gone when using large block > > size > > (tested with 1M and 4M) and I could no longer preproduce the hang > > tasks > > while doing 100 dd tests in a for loop. > > > > > > > > However, I can confirm that I am still getting hang tasks while > > working with > > a 4K block size. The hang tasks start after about an hour, but they > > do not > > cause the server crash. After a while the dd test times out and > > continues > > with the loop. This is what I was running: > > > > for i in {1..100} ; do time dd if=/dev/zero of=/tmp/mount/1G bs=4K > > count=25K > > oflag=direct ; done > > > > The following test definately produces the hang tasks like these: > > > > [23160.549785] INFO: task dd:2033 blocked for more than 120 > > seconds. > > [23160.588364] Tainted: G OE 3.16.3-031603-generic > > #201409171435 > > [23160.627998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables > > this message. > > [23160.706856] dd D 000b 0 2033 23859 > > 0x > > [23160.706861] 88011cec78c8 0082 88011cec78d8 > > 88011cec7fd8 > > [23160.706865] 000143c0 000143c0 88048661bcc0 > > 880113441440 > > [23160.706868] 88011cec7898 88067fd54cc0 880113441440 > > 880113441440 > > [23160.706871] Call Trace: > > [23160.706883] [] schedule+0x29/0x70 > > [23160.706887] [] io_schedule+0x8f/0xd0 > > [23160.706893] [] dio_await_completion+0x54/0xd0 > > [23160.706897] [] > > do_blockdev_direct_IO+0x958/0xcc0 > > [23160.706903] [] ? wake_up_bit+0x2e/0x40 > > [23160.706908] [] ? > > jbd2_journal_dirty_metadata+0xc5/0x260 > > [23160.706914] [] ? > > ext4_get_block_write+0x20/0x20 > > [23160.706919] [] __blockdev_direct_IO+0x4c/0x50 > > [23160.706922] [] ? > > ext4_get_block_write+0x20/0x20 > > [23160.706928] [] ext4_ind_direct_IO+0xce/0x410 > > [23160.706931] [] ? > > ext4_get_block_write+0x20/0x20 > > [23160.706935] [] ext4_ext_direct_IO+0x1bb/0x2a0 > > [23160.706938] [] ? __ext4_journal_stop+0x78/0xa0 > > [23160.706942] [] ext4_direct_IO+0xec/0x1e0 > > [23160.706946] [] ? __mark_inode_dirty+0x53/0x2d0 > > [23160.706952] [] > > generic_file_direct_write+0xbb/0x180 > > [23160.706957] [] ? mnt_clone_write+0x12/0x30 > > [23160.706960] [] > > __generic_file_write_iter+0x2a7/0x350 > > [23160.706963] [] > > ext4_file_write_iter+0x111/0x3d0 > > [23160.706969] [] ? iov_iter_init+0x14/0x40 > > [23160.706976] [] new_sync_write+0x7b/0xb0 > > [23160.706978] [] vfs_write+0xc7/0x1f0 > > [23160.706980] [] SyS_write+0x4f/0xb0 > > [23160.706985] [] system_call_fastpath+0x1a/0x1f > > [23280.705400] INFO: task dd:2033 blocked for more than 120 > > seconds. 
> > [23280.745358] Tainted: G OE 3.16.3-031603-generic > > #201409171435 > > [23280.785069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables > > this message. > > [23280.864158] dd D 000b 0 2033 23859 > > 0x > > [23280.864164] 88011cec78c8 0082 88011cec78d8 > > 88011cec7fd8 > > [23280.864167] 000143c0 000143c0 88048661bcc0 > > 880113441440 > > [23280.864170] 88011cec7898 88067fd54cc0 880113441440 > > 880113441440 > > [23280.864173] Call Trace: > > [23280.864185] [] schedule+0x29/0x70 > > [23280.864197] [] io_schedule+0x8f/0xd0 > > [23280.864203] [] dio_await_completion+0x54/0xd0 > > [23280.864207] [] > > do_blockdev_direct_IO+0x958/0xcc0 > > [23280.864213] [] ? wake_up_bit+0x2e/0x40 > > [23280.86421
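For anyone curious about the kind of setup being discussed, an rbd-map based NFS gateway is roughly the following (pool name, image name, size, mount point and export options are examples only):

# create an image and map it on the gateway host
rbd create --size 102400 nfs-pool/gw-image
rbd map nfs-pool/gw-image
rbd showmapped

# put a filesystem on the mapped device and export it over NFS
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /tmp/mount
echo "/tmp/mount *(rw,no_root_squash,sync)" >> /etc/exports
exportfs -ra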
[ceph-users] OSD log bound mismatch
Hello Cephers, I am having some issues with two osds, which are either flapping or just crashing without recovering back. I've got a log file 100MB or so for these osds which has been generated in a couple of hours if anyone is interested. I am running firefly with the latest updates on Ubuntu 12.04 with the latest LTS kernel. Looking at the osd logs I see a bunch of these entries: 2014-09-26 15:24:08.998918 7f73cb194700 0 log [ERR] : 5.108 log bound mismatch, info (53757'2809698,54690'2817536] actual [53757'2809532,54690'2817536] followed by slow requests like these: 2014-09-26 15:24:16.798355 7f73e247c700 0 log [WRN] : slow request 31.463701 seconds old, received at 2014-09-26 15:23:45.334567: osd_op(client.37190249.0:6372257 rbd_data.3a0cd42ae8944a.280d [set-alloc-hint object_size 4194304 write_size 4194304,write 2203648~4096] 5.27e2bd53 ack+ondisk+write e54691) v4 currently waiting for subops from 8 2014-09-26 15:24:16.798358 7f73e247c700 0 log [WRN] : slow request 31.004246 seconds old, received at 2014-09-26 15:23:45.794022: osd_op(client.38862536.0:2001456 rbd_data.250f7505e5edd7.0f4f [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3813376~4096] 5.5a3d6aa3 ack+ondisk+write e54691) v4 currently waiting for missing object The cluster seems to suffer and the guest vms are running a bit with a lag. Any idea how to fix these issues? Cheers ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
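For anyone else chasing this kind of thing, these are the sort of commands that help narrow down where the requests are stuck (osd.8 and pg 5.108 are taken from the log snippets above; run the daemon commands on the node that hosts the osd in question):

ceph health detail                     # which PGs/OSDs are currently unhappy
ceph pg 5.108 query                    # peering/recovery state of the PG with the bound mismatch
ceph osd perf                          # per-OSD commit/apply latencies
ceph daemon osd.8 dump_ops_in_flight   # ops currently blocked and the step they are waiting on
ceph daemon osd.8 dump_historic_ops    # recent slowest ops with per-stage timestamps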
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefy? Andrei - Original Message - > From: "Timur Nurlygayanov" > To: "Christian Balzer" > Cc: ceph-us...@ceph.com > Sent: Wednesday, 1 October, 2014 1:11:25 PM > Subject: Re: [ceph-users] Why performance of benchmarks with small > blocks is extremely small? > Hello Christian, > Thank you for your detailed answer! > I have other pre-production environment with 4 Ceph servers, 4 SSD > disks per Ceph server (each Ceph OSD node on the separate SSD disk) > Probably I should move journals to other disks or it is not required > in my case? > [root@ceph-node ~]# mount | grep ceph > /dev/sdb4 on /var/lib/ceph/osd/ceph-0 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > /dev/sde4 on /var/lib/ceph/osd/ceph-5 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > /dev/sdd4 on /var/lib/ceph/osd/ceph-2 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > /dev/sdc4 on /var/lib/ceph/osd/ceph-1 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > [root@ceph-node ~]# find /var/lib/ceph/osd/ | grep journal > /var/lib/ceph/osd/ceph-0/journal > /var/lib/ceph/osd/ceph-5/journal > /var/lib/ceph/osd/ceph-1/journal > /var/lib/ceph/osd/ceph-2/journal > My SSD disks have ~ 40k IOPS per disk, but on the VM I can see only ~ > 10k - 14k IOPS for disks operations. > To check this I execute the following command on VM with root > partition mounted on disk in Ceph storage: > root@test-io:/home/ubuntu# rm -rf /tmp/test && spew -d --write -r -b > 4096 10M /tmp/test > WTR: 56506.22 KiB/s Transfer time: 00:00:00 IOPS: 14126.55 > Is it expected result or I can improve the performance and get at > least 30k-40k IOPS on the VM disks? (I have 2x 10Gb/s networks > interfaces in LACP bonding for storage network, looks like network > can't be the bottleneck). > Thank you! > On Wed, Oct 1, 2014 at 6:50 AM, Christian Balzer < ch...@gol.com > > wrote: > > Hello, > > > [reduced to ceph-users] > > > On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote: > > > > Hello all, > > > > > > > > I installed OpenStack with Glance + Ceph OSD with replication > > > factor 2 > > > > and now I can see the write operations are extremly slow. > > > > For example, I can see only 0.04 MB/s write speed when I run > > > rados > > > bench > > > > with 512b blocks: > > > > > > > > rados bench -p test 60 write --no-cleanup -t 1 -b 512 > > > > > > > There are 2 things wrong with that this test: > > > 1. You're using rados bench, when in fact you should be testing > > from > > > within VMs. For starters a VM could make use of the rbd cache you > > enabled, > > > rados bench won't. > > > 2. Given the parameters of this test you're testing network latency > > more > > > than anything else. 
If you monitor the Ceph nodes (atop is a good > > tool for > > > that), you will probably see that neither CPU nor disks resources > > are > > > being exhausted. With a single thread rados puts that tiny block of > > 512 > > > bytes on the wire, the primary OSD for the PG has to write this to > > the > > > journal (on your slow, non-SSD disks) and send it to the secondary > > OSD, > > > which has to ACK the write to its journal back to the primary one, > > which > > > in turn then ACKs it to the client (rados bench) and then rados > > bench > > can > > > send the next packet. > > > You get the drift. > > > Using your parameters I can get 0.17MB/s on a pre-production > > cluster > > > that uses 4xQDR Infiniband (IPoIB) connections, on my shitty test > > cluster > > > with 1GB/s links I get similar results to you, unsurprisingly. > > > Ceph excels only with lots of parallelism, so an individual thread > > might > > > be slow (and in your case HAS to be slow, which has nothing to do > > with > > > Ceph per se) but many parallel ones will utilize the resources > > available. > > > Having data blocks that are adequately sized (4MB, the default > > rados > > size) > > > will help for bandwidth and the rbd cache inside a properly > > configured VM > > > should make that happen. > > > Of course in most real life scenarios you will run out of IOPS long > > before > > > you run out of bandwidth. > > > > Maintaining 1 concurrent writes of 512 bytes for up to
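To make the parallelism point above concrete from inside a guest rather than with rados bench, a test along these lines can be used (/dev/vdb, the run time and the block size are only examples); comparing iodepth=1 against iodepth=32 usually shows the gap between per-request latency and aggregate throughput:

# WARNING: writes directly to the device - use a scratch disk
# one outstanding request: dominated by network + journal latency
fio --name=lat --filename=/dev/vdb --rw=randwrite --bs=4k --iodepth=1 --direct=1 --ioengine=libaio --runtime=60 --time_based

# 32 outstanding requests: lets the OSDs work in parallel
fio --name=par --filename=/dev/vdb --rw=randwrite --bs=4k --iodepth=32 --direct=1 --ioengine=libaio --runtime=60 --time_based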
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
Greg, are they going to be a part of the next stable release? Cheers - Original Message - > From: "Gregory Farnum" > To: "Andrei Mikhailovsky" > Cc: "Timur Nurlygayanov" , "ceph-users" > > Sent: Wednesday, 1 October, 2014 3:04:51 PM > Subject: Re: [ceph-users] Why performance of benchmarks with small > blocks is extremely small? > On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky > wrote: > > Timur, > > > > As far as I know, the latest master has a number of improvements > > for ssd > > disks. If you check the mailing list discussion from a couple of > > weeks back, > > you can see that the latest stable firefly is not that well > > optimised for > > ssd drives and IO is limited. However changes are being made to > > address > > that. > > > > I am well surprised that you can get 10K IOps as in my tests I was > > not > > getting over 3K IOPs on the ssd disks which are capable of doing > > 90K IOps. > > > > P.S. does anyone know if the ssd optimisation code will be added to > > the next > > maintenance release of firefly? > Not a chance. The changes enabling that improved throughput are very > invasive and sprinkled all over the OSD; they aren't the sort of > thing > that one does backport or that one could put on top of a stable > release for any meaningful definition of "stable". :) > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph, ssds, hdds, journals and caching
Hello Cephers, I am a bit lost on the best ways of using ssds and hdds for a ceph cluster which uses rbd + kvm for guest vms. At the moment I've got 2 osd servers which currently have 8 hdd osds (max 16 bays) each and 4 ssd disks. Currently, I am using 2 ssds for osd journals and I've got 2x512GB ssds spare, which are waiting to be utilised. I am running Ubuntu 12.04 with the 3.13 kernel from Ubuntu 14.04 and the latest firefly release. I've tried to use the ceph cache pool tier and the results were not good. My small cluster slowed down by quite a bit and I've disabled the cache tier altogether. My question is: how would one best utilise the ssds to achieve a good performance boost compared to a pure hdd setup? Should I enable block level caching (the likes of bcache or similar) using all my ssds and not bother with ssd journals? Should I keep the journals on two ssds and utilise the remaining two ssds for bcache? Or is there a better alternative? Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
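For anyone setting up a similar layout, one common way to put OSD journals on SSD partitions with ceph-deploy looks roughly like this (host and device names are examples only, one SSD partition per OSD journal):

# hdd osd on /dev/sdc with its journal on the SSD partition /dev/sdb1
ceph-deploy osd create osd-server-1:/dev/sdc:/dev/sdb1
# repeat per OSD, pointing each at its own journal partition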
Re: [ceph-users] ceph, ssds, hdds, journals and caching
From: "Christian Balzer" > To: ceph-users@lists.ceph.com > Sent: Friday, 3 October, 2014 2:06:48 AM > Subject: Re: [ceph-users] ceph, ssds, hdds, journals and caching > On Thu, 2 Oct 2014 21:54:54 +0100 (BST) Andrei Mikhailovsky wrote: > > Hello Cephers, > > > > I am a bit lost on the best ways of using ssd and hdds for ceph > > cluster > > which uses rbd + kvm for guest vms. > > > > At the moment I've got 2 osd servers which currently have 8 hdd > > osds > > (max 16 bays) each and 4 ssd disks. Currently, I am using 2 ssds > > for osd > > journals and I've got 2x512GB ssd spare, which are waiting to be > > utilised. I am running Ubuntu 12.04 with 3.13 kernel from Ubuntu > > 14.04 > > and the latest firefly release. > > > In case you're planning to add more HDDs to those nodes, the obvious > use > case for those SSDs would be additional journals. From what I've seen so far, the two ssds that I currently use for journaling are happy serving 8 osds and I do not have much load on them. Having more osds per server might change that though, you are right. But at the moment I was hoping to improve the read performance, especially for small block sizes, hence I was thinking of adding the caching layer. > Also depending on your use case, a kernel newer than 3.13(which also > is > not getting any upstream updates/support) might be a good idea. Yes, indeed. I am considering the latest supported kernels from the Ubuntu team. > > I've tried to use ceph cache pool tier and the results were not > > good. My > > small cluster slowed down by quite a bit and i've disabled the > > cache > > tier altogether. > > > Yeah, this feature is clearly a case of "wait for the next major > release > or the one after that and try again". Does anyone know if the latest 0.80.6 firefly improves the cache behaviour? I've seen a bunch of changes in the cache tiering, however, I am not sure whether these address the stability of the tier or its efficiency. > > My question is how would one utilise the ssds in the best manner to > > achieve a good performance boost compared to a pure hdd setup? > > Should I > > enable block level caching (likes of bcache or similar) using all > > my > > ssds and do not bother using ssd journals? Should I keep the > > journals on > > two ssds and utilse the remaining two ssds for bcache? Or is there > > a > > better alternative? > > > This has all been discussed very recently here and the results where inconclusive at best. In some cases reads were improved, but for > writes it > was potentially worse than normal Ceph journals. > Have you monitored your storage nodes (I keep recommending atop for > this) > during a high load time? If your SSDs are becoming the bottleneck and > not > the actual disks (doubtful, but verify), more journals. I am monitoring my ceph cluster with Zabbix and I do not have a significant load on the servers at all. My biggest concern is the single thread performance of vms. From what I can see, this is the main downside of ceph. On average, I am not getting much over 35-40MB/s per thread in cold data reads. This is compared with a single hdd read performance of 150-160MB/s. Having about 1/4 of the raw device performance is a bit worrying, especially compared with what I've read: I should be getting about 1/2 of the raw drive performance for a single thread, but I am not. My hope was that with a caching tier I could increase it. > Other than that, maybe create a 1TB (usable space) SSD pool for > guests > with special speed requirements...
I am planning to do this for the database volumes; however, from what I've read so far, there are performance bottlenecks and the current stable firefly is not optimised for ssds. I've not tried it myself, but it doesn't look like having a dedicated ssd pool will bring a significant increase in performance. Has anyone tried using bcache or dm-cache with ceph? Any tips on how to integrate it? From what I've read so far, they require you to format the existing hdd, which is not feasible if you have an existing live cluster. Cheers > Christian > -- > Christian Balzer Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
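For what it's worth, the dedicated SSD pool mentioned above would be built roughly like this on firefly - a sketch only, with example bucket/host/pool names: the SSD OSDs first get their own CRUSH root, then a rule and a pool are pointed at it.

# separate CRUSH root and host bucket for the SSD OSDs
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket osd-server-1-ssd host
ceph osd crush move osd-server-1-ssd root=ssd
ceph osd crush set osd.16 1.0 host=osd-server-1-ssd    # repeat for each SSD OSD

# rule that only selects OSDs under the ssd root, and a pool that uses it
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool create ssd-pool 128 128
ceph osd crush rule dump ssd-rule                      # note the rule id
ceph osd pool set ssd-pool crush_ruleset <rule id from the dump above>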
Re: [ceph-users] ceph, ssds, hdds, journals and caching
That is what I am afraid of! - Original Message - > From: "Vladislav Gorbunov" > To: "Andrei Mikhailovsky" > Cc: "Christian Balzer" , ceph-users@lists.ceph.com > Sent: Friday, 3 October, 2014 12:04:37 PM > Subject: Re: [ceph-users] ceph, ssds, hdds, journals and caching > > Has anyone tried using bcache of dm-cache with ceph? > I'm tested lvmcache (based on dm-cache) with ceph 0.80.5 on CentOS 7. > Got unrecoverable error with xfs and total lost osd server. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph, ssds, hdds, journals and caching
> While I doubt you're hitting any particular bottlenecks on your > storage > servers I don't think Zabbix (very limited experience with it so I > might > be wrong) monitors everything, nor does it so at sufficiently high > freqency to show what is going on during a peak or fio test from a > client. > Thus my suggestion to stare at it live with atop (on all nodes). I will give it a go and see what happens during benchmarks. The Atop is rather informative indeed! There is a zabbix plugin/template for ceph, which gives a good overview of the ceph cluster. It does not provide the level of details that you would get from an admin socket, but rather an overview of the cluster thhroughput and io rates as well as PGs status. > > My biggest concern is the single > > thread performance of vms. From what I can see, this is the main > > downside of ceph. On average, I am not getting much over 35-40MB/s > > per > > thread in cold data reads. This is compared with a single hdd read > > performance of 150-160MB/s. Having about 1/4 of the raw device > > performance is a bit worring, especially compared with what i've > > read. I > > should be getting about 1/2 of the raw drive performance for a > > single > > thread, but I am not. My hope was with caching tier I can increase > > it. > > > Have a look at: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028552.html > Your numbers look very much like mine before increasing the > read_ahead > buffer. How much in performance did you gain by setting the read_ahead values? The performance figures that I get are using the following udev rules: # set read_ahead values ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/read_ahead_kb}="2048" ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/nr_requests}="2048" # set deadline scheduler for non-rotating disks ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop" # # set cfq scheduler for rotating disks ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq" Is there anything else that I am missing? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph, ssds, hdds, journals and caching
> Read the above link again, carefully. ^o^ > In in it I state that: > a) despite reading such in old posts, setting read_ahead on the OSD > nodes > has no or even negative effects. Inside the VM, it is very helpful: > b) the read speed increased about 10 times, from 35MB/s to 380MB/s Christian, are you getting 380MB/s from hdd osds or ssd osds? It seems a bit high for a single thread cold data throughput. Cheers > Regards, > Christian > > # set read_ahead values > > ACTION=="add|change", KERNEL=="sd[a-z]", > > ATTR{queue/rotational}=="1", > > ATTR{queue/read_ahead_kb}="2048" ACTION=="add|change", > > KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", > > ATTR{queue/nr_requests}="2048" # set deadline scheduler for > > non-rotating > > disks ACTION=="add|change", KERNEL=="sd[a-z]", > > ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop" # # set > > cfq > > scheduler for rotating disks ACTION=="add|change", > > KERNEL=="sd[a-z]", > > ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq" > > > > Is there anything else that I am missing? > -- > Christian Balzer Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
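In case it helps anyone following the thread, bumping read_ahead inside the guest (rather than on the OSD nodes) is just the following; vda is an example device name and 4MB is only one value worth trying:

# value in KB
echo 4096 > /sys/block/vda/queue/read_ahead_kb
# or the same thing via blockdev (value in 512-byte sectors)
blockdev --setra 8192 /dev/vda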
Re: [ceph-users] urgent- object unfound
Tuan, I had a similar behaviour when I've connected the cache pool tier. I resolved the issues by restarting all my osds. If your case is the same, try it and see if it works. If not, I guess the guys here and on the ceph irc might be able to help you. Cheers Andrei - Original Message - > From: "Ta Ba Tuan" > To: ceph-users@lists.ceph.com > Sent: Thursday, 16 October, 2014 1:36:01 PM > Subject: [ceph-users] urgent- object unfound > Hi eveyone, I use replicate 3, many unfound object and Ceph very > slow. > pg 6.9d8 is active+recovery_wait+degraded+remapped, acting [22,93], 4 > unfound > pg 6.766 is active+recovery_wait+degraded+remapped, acting [21,36], 1 > unfound > pg 6.73f is active+recovery_wait+degraded+remapped, acting [19,84], 2 > unfound > pg 6.63c is active+recovery_wait+degraded+remapped, acting [10,37], 2 > unfound > pg 6.56c is active+recovery_wait+degraded+remapped, acting [124,93], > 2 > unfound > pg 6.4d3 is active+recovering+degraded+remapped, acting [33,94], 2 > unfound > pg 6.4a5 is active+recovery_wait+degraded+remapped, acting [11,94], 2 > unfound > pg 6.2f9 is active+recovery_wait+degraded+remapped, acting [22,34], 2 > unfound > recovery 535673/52672768 objects degraded (1.017%); 17/17470639 > unfound > (0.000%) > ceph pg map 6.766 > osdmap e94990 pg 6.766 (6.766) -> up [49,36,21] acting [21,36] > I can't resolve it. I need data on those objects. Guide me, please! > Thank you! > -- > Tuan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
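For the archives, the usual places to look for unfound objects are roughly these (pg 6.766 is taken from the output above; marking objects lost discards data, so it is strictly a last resort):

ceph health detail               # lists the PGs with unfound objects
ceph pg 6.766 list_missing       # the unfound objects and which OSDs were queried
ceph pg 6.766 query              # check "might_have_unfound" for down OSDs worth bringing back
# last resort only - this gives up on the unfound objects:
# ceph pg 6.766 mark_unfound_lost revert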
[ceph-users] slow requests - what is causing them?
Hello cephers, I've been testing flashcache and enhanceio block device caching for the osds and I've noticed I have started getting slow requests. The caching type that I use is read-only, so all writes bypass the caching ssds and go directly to the osds, just as they did before introducing the caching layer. Prior to introducing caching, I rarely had slow requests. Judging by the logs, all slow requests look like these: 2014-10-16 01:09:15.600807 osd.7 192.168.168.200:6836/32031 100 : [WRN] slow request 30.999641 seconds old, received at 2014-10-16 01:08:44.601040: osd_op(client.36035566.0:16626375 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2007040~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9 2014-10-16 01:09:15.600811 osd.7 192.168.168.200:6836/32031 101 : [WRN] slow request 30.999581 seconds old, received at 2014-10-16 01:08:44.601100: osd_op(client.36035566.0:16626376 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2039808~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9 2014-10-16 01:09:16.185530 osd.2 192.168.168.200:6811/31891 76 : [WRN] 20 slow requests, 1 included below; oldest blocked for > 57.003961 secs 2014-10-16 01:09:16.185564 osd.2 192.168.168.200:6811/31891 77 : [WRN] slow request 30.098574 seconds old, received at 2014-10-16 01:08:46.086854: osd_op(client.38917806.0:3481697 rbd_data.251d05e3db45a54. 0304 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2732032~8192] 5.e4683bbb ack+ondisk+write e61892) v4 currently waiting for subops from 11 2014-10-16 01:09:16.601020 osd.7 192.168.168.200:6836/32031 102 : [WRN] 16 slow requests, 2 included below; oldest blocked for > 43.531516 secs In general, I see between 0 and about 2,000 slow request log entries per day. On one day I saw over 100k entries, but it only happened once. I am struggling to understand what is causing the slow requests. If all the writes go the same path as before caching was introduced, how come I am getting them? How can I investigate this further? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
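To see which stage the writes are actually stuck at, the admin socket on the OSDs named in the "waiting for subops from N" lines can be queried, and the caching and backing devices watched at the same time (osd.9 and osd.11 are the ones from the log above; run these on the node hosting them):

ceph daemon osd.9 dump_ops_in_flight    # shows the event each blocked op is waiting on
ceph daemon osd.9 dump_historic_ops     # slowest recent ops with per-stage timestamps
iostat -x 2                             # watch util/await on the ssd cache devices vs the hdds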