Re: [ceph-users] Problem with radosgw and some file name characters
Yehuda, sorry for the delay in replying, I was away for a week or so. The problem happens regardless of the client; I've tried a few.

You are right, I've got a load balancer and a reverse proxy in front of the radosgw service. My setup is as follows:

Internet <---> Load Balancer <--> [Apache_Proxy_Server_1 | Apache_Proxy_Server_2] <--> [Radosgw_Server_1 | Radosgw_Server_2]

So, the client hits the Load Balancer first, which redirects traffic either to proxy server 1 or 2. The proxy servers connect either to radosgw server 1 or server 2. I've noticed that if I go directly (Internet <--> Radosgw_Server_1) I do not have any issues with special characters.

Any idea what I am missing? Perhaps something needs changing on the proxy server?

Cheers
Andrei

----- Original Message -----
From: "Yehuda Sadeh"
To: "Andrei Mikhailovsky"
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, 21 May, 2014 4:24:51 PM
Subject: Re: [ceph-users] Problem with radosgw and some file name characters

On Tue, May 20, 2014 at 4:13 AM, Andrei Mikhailovsky wrote:
> Anyone have any idea how to fix the problem with getting 403 when trying to
> upload files with non-standard characters? I am sure I am not the only one
> with these requirements.

It might be the specific client that you're using and the way it signs the requests. Can you try a different S3 client, see whether it works or not? Are you by any chance going through some kind of a load balancer that rewrites the urls?

Yehuda

>
> From: "Andrei Mikhailovsky"
> To: "Yehuda Sadeh"
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, 19 May, 2014 12:38:29 PM
> Subject: Re: [ceph-users] Problem with radosgw and some file name characters
>
> Yehuda,
>
> Never mind my last post, I've found the issue with the rule that you've
> suggested. My fastcgi script is called differently, so that's why I was
> getting the 404.
>
> I've tried your rewrite rule and I am still having the same issues. The same
> characters are failing with the rule you've suggested.
>
> Any idea how to fix the issue?
>
> Cheers
> Andrei
>
> From: "Andrei Mikhailovsky"
> To: "Yehuda Sadeh"
> Cc: ceph-users@lists.ceph.com
> Sent: Monday, 19 May, 2014 9:30:03 AM
> Subject: Re: [ceph-users] Problem with radosgw and some file name characters
>
> Yehuda,
>
> I've tried the rewrite rule that you've suggested, but it is not working for
> me. I get 404 when trying to access the service.
>
> RewriteRule ^/(.*) /s3gw.3.fcgi?%{QUERY_STRING}
> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>
> Any idea what is wrong with this rule?
>
> Cheers
> Andrei
>
> From: "Yehuda Sadeh"
> To: "Andrei Mikhailovsky"
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 16 May, 2014 5:44:52 PM
> Subject: Re: [ceph-users] Problem with radosgw and some file name characters
>
> Was talking about this. There is a different and simpler rule that we
> use nowadays, for some reason it's not well documented:
>
> RewriteRule ^/(.*) /s3gw.3.fcgi?%{QUERY_STRING}
> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>
> I still need to see a more verbose log to make a better educated guess.
>
> Yehuda
>
> On Thu, May 15, 2014 at 3:01 PM, Andrei Mikhailovsky
> wrote:
>>
>> Yehuda,
>>
>> what do you mean by the rewrite rule? Is this for Apache? I've used the
>> ceph documentation to create it. My rule is:
>>
>> RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*)
>> /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING}
>> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>>
>> Or are you talking about something else?
>> >> Cheers >> >> Andrei >> >> From: "Yehuda Sadeh" >> To: "Andrei Mikhailovsky" >> Cc: ceph-users@lists.ceph.com >> Sent: Thursday, 15 May, 2014 4:05:06 PM >> Subject: Re: [ceph-users] Problem with radosgw and some file name >> characters >> >> >> Your rewrite rule might be off a bit. Can you provide log with 'debug rgw >> = >> 20'? >> >> Yehuda >> >> On Thu, May 15, 2014 at 8:02 AM, Andrei Mikhailovsky >> wrote: >>> Hello guys, >>> >>> >>> I am trying to figure out what is the problem here. >>> >>> >>> Currently running U
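A reverse-proxy vhost along the following lines illustrates the Apache settings that usually matter for this kind of 403-on-special-characters problem; the host names and ports are placeholders, not taken from the thread. The idea is that the proxy must pass the request URI and the Authorization header through unmodified, otherwise the signature computed by radosgw no longer matches the one the S3 client calculated:

    <VirtualHost *:80>
        ServerName s3.example.com

        # Do not decode/re-encode the path on its way through the proxy,
        # otherwise escaped characters in object names break the S3 signature.
        AllowEncodedSlashes NoDecode
        ProxyRequests Off
        ProxyPreserveHost On

        # 'nocanon' keeps mod_proxy from canonicalising the URL before
        # forwarding it to the radosgw host.
        ProxyPass / http://radosgw1.example.com:80/ nocanon
        ProxyPassReverse / http://radosgw1.example.com:80/
    </VirtualHost>

This is only a sketch of one possible proxy configuration; whether the load balancer in front of the proxies rewrites URLs would still need to be checked separately.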
[ceph-users] release date for 0.80.2
Hi guys,

Was wondering if 0.80.2 is coming any time soon? I am planning an upgrade from Emperor and was wondering if I should wait for 0.80.2 to come out if the release date is pretty soon. Otherwise, I will go for 0.80.1.

Cheers
Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] release date for 0.80.2
Just thought to save some time ))) - Original Message - From: "Wido den Hollander" To: ceph-users@lists.ceph.com Sent: Thursday, 3 July, 2014 12:11:07 PM Subject: Re: [ceph-users] release date for 0.80.2 On 07/03/2014 10:27 AM, Andrei Mikhailovsky wrote: > Hi guys, > > Was wondering if 0.80.2 is coming any time soon? I am planning na > upgrade from Emperor and was wondering if I should wait for 0.80.2 to > come out if the release date is pretty soon. Otherwise, I will go for > the 0.80.1. > Why bother? Upgrading from 0.80.1 to .2 is not that much work. Or is there a specific bug in 0.80.1 which you don't want to run into? > Cheers > Andrei > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] nginx (tengine) and radosgw
Hi David, Do you mind sharing the howto/documentation with examples of configs, etc.? I am tempted to give it a go and replace the Apache reverse proxy that I am currently using. cheers Andrei - Original Message - From: "David Moreau Simard" To: ceph-users@lists.ceph.com Sent: Sunday, 22 June, 2014 2:37:00 AM Subject: Re: [ceph-users] nginx (tengine) and radosgw Hi, I just wanted to chime in and say that I didn’t notice any problems swapping nginx out in favor of tengine. tengine is used as a load balancer that also handles SSL termination. I found that disabling body buffering saves a lot on upload times as well. I took the time to do a post about it and linked this thread: http://dmsimard.com/2014/06/21/a-use-case-of-tengine-a-drop-in-replacement-and-fork-of-nginx/ - David On May 29, 2014, at 12:20 PM, Michael Lukzak < mis...@vp.pl > wrote: Re[2]: [ceph-users] nginx (tengine) and radosgw Hi, Ups, so I don't read carefully a doc... I will try this solution. Thanks! Michael From the docs, you need this setting in ceph.conf (if you're using nginx/tengine): rgw print continue = false This will fix the 100-continue issues. On 5/29/2014 5:56 AM, Michael Lukzak wrote: Re[2]: [ceph-users] nginx (tengine) and radosgw Hi, I'm also use tengine, works fine with SSL (I have a Wildcard). But I have other issue with HTTP 100-Continue. Clients like boto or Cyberduck hangs if they can't make HTTP 100-Continue. IP_REMOVED - - [29/May/2014:11:27:53 +] "PUT /temp/1b6f6a11d7aa188f06f8255fdf0345b4 HTTP/1.1" 100 0 "-" "Boto/2.27.0 Python/2.7.6 Linux/3.13.0-24-generic" Do You have also problem with that? I used for testing oryginal nginx and also have a problem with 100-Continue. Only Apache 2.x works fine. BR, Michael I haven't tried SSL yet. We currently don't have a wildcard certificate for this, so it hasn't been a concern (and our current use case, all the files are public anyway). On 5/20/2014 4:26 PM, Andrei Mikhailovsky wrote: That looks very interesting indeed. I've tried to use nginx, but from what I recall it had some ssl related issues. Have you tried to make the ssl work so that nginx acts as an ssl proxy in front of the radosgw? Cheers Andrei From: "Brian Rak" To: ceph-users@lists.ceph.com Sent: Tuesday, 20 May, 2014 9:11:58 PM Subject: [ceph-users] nginx (tengine) and radosgw I've just finished converting from nginx/radosgw to tengine/radosgw, and it's fixed all the weird issues I was seeing (uploads failing, random clock skew errors, timeouts). The problem with nginx and radosgw is that nginx insists on buffering all the uploads to disk. This causes a significant performance hit, and prevents larger uploads from working. Supposedly, there is going to be an option in nginx to disable this, but it hasn't been released yet (nor do I see anything on the nginx devel list about it). tengine ( http://tengine.taobao.org/ ) is an nginx fork that implements unbuffered uploads to fastcgi. It's basically a drop in replacement for nginx. 
My configuration looks like this:

server {
    listen 80;
    server_name *.rados.test rados.test;
    client_max_body_size 10g;

    # This is the important option that tengine has, but nginx does not
    fastcgi_request_buffering off;

    location / {
        fastcgi_pass_header Authorization;
        fastcgi_pass_request_headers on;
        if ($request_method = PUT ) {
            rewrite ^ /PUT$request_uri;
        }
        include fastcgi_params;
        fastcgi_pass unix:/path/to/ceph.radosgw.fastcgi.sock;
    }

    location /PUT/ {
        internal;
        fastcgi_pass_header Authorization;
        fastcgi_pass_request_headers on;
        include fastcgi_params;
        fastcgi_param CONTENT_LENGTH $content_length;
        fastcgi_pass unix:/path/to/ceph.radosgw.fastcgi.sock;
    }
}

If anyone else is looking to run radosgw without having to run apache, I would recommend you look into tengine :)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
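As mentioned earlier in the thread, radosgw behind nginx/tengine also needs the 100-continue behaviour disabled in ceph.conf. A minimal client section might look like the following; the instance name and socket path are illustrative and must match the fastcgi_pass socket used in the server block:

    [client.radosgw.gateway]
        rgw socket path = /path/to/ceph.radosgw.fastcgi.sock
        rgw print continue = false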
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Andrija,

I've got at least two more stories of a similar nature, one from a friend running a ceph cluster and one from me. Both of our clusters are pretty small. My cluster has only two osd servers with 8 osds each and 3 mons, with an ssd journal per 4 osds. My friend has a cluster of 3 mons and 3 osd servers with 4 osds each, and an ssd per 4 osds as well. Both clusters are connected with 40gbit/s IP over Infiniband links.

We had the same issue while upgrading to firefly. However, we did not add any new disks; we just ran the "ceph osd crush tunables optimal" command right after the upgrade. Both of our clusters were "down" as far as the virtual machines are concerned: all vms crashed because of the lack of IO. It was a bit problematic, taking into account that ceph is typically so great at staying alive during failures and upgrades. So, there seems to be a problem with the upgrade. I wish the devs had added a big note in red letters saying that running this command will likely affect your cluster performance and most likely all your vms will die, so please shut down your vms if you do not want to have data loss.

I've changed the default values to reduce the load during recovery and also to tune a few things performance wise. My settings were:

osd recovery max chunk = 8388608
osd recovery op priority = 2
osd max backfills = 1
osd recovery max active = 1
osd recovery threads = 1
osd disk threads = 2
filestore max sync interval = 10
filestore op threads = 20
filestore_flusher = false

However, this didn't help much. I noticed that shortly after running the tunables command the iowait on my guest vms quickly jumped to 50%, and to 99% a minute later. This happened on all vms at once. During the recovery phase I ran the "rbd -p ls -l" command several times and it took between 20-40 minutes to complete; it typically takes less than 2 seconds when the cluster is not in recovery mode. My mate's cluster had the same tunables apart from the last three, and he saw exactly the same behaviour.

One other thing I've noticed is that somewhere in the docs I've read that running the tunables optimal command should move no more than 10% of your data. However, in both of our cases the status was just over 30% degraded and it took a good part of 9 hours to complete the data reshuffling.

Any comments from the ceph team or other ceph gurus on:

1. What have we done wrong in our upgrade process?
2. What options should we have used to keep our vms alive?

Cheers
Andrei

----- Original Message -----
From: "Andrija Panic"
To: ceph-users@lists.ceph.com
Sent: Sunday, 13 July, 2014 9:54:17 PM
Subject: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

Hi,

after the ceph upgrade (0.72.2 to 0.80.3) I issued "ceph osd crush tunables optimal", and after only a few minutes I added 2 more OSDs to the CEPH cluster... So these 2 changes were more or less done at the same time - rebalancing because of tunables optimal, and rebalancing because of adding the new OSDs...

Result - all VMs living on CEPH storage went mad, effectively no disk access, blocked so to speak. Since this rebalancing took 5h-6h, I had a bunch of VMs down for that long...

Did I do wrong by causing "2 rebalancings" to happen at the same time? Is this behaviour normal, to cause great load on all VMs because they are unable to access CEPH storage effectively?

Thanks for any input...
-- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
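For reference, the throttling values Andrei lists above normally live in the [osd] section of ceph.conf, and the usual companion step before planned rebalancing is to stop OSDs from being marked out. A sketch only -- the values are the ones quoted in the thread, not a recommendation:

    [osd]
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery threads = 1
        osd recovery op priority = 2

    # before planned maintenance or tunables changes:
    ceph osd set noout
    # ... and once the cluster is healthy again:
    ceph osd unset noout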
Re: [ceph-users] Firefly Upgrade
Quenten,

It has been noted before and I've seen a thread on the mailing list about it. In the long term, I've not noticed a great increase in ram. By that I mean that initially, right after doing the upgrade from emperor to firefly and restarting the osd servers, I did notice about 20-25% more ram usage. However, it has been about a week since I did the upgrade and according to my graphs the memory usage has decreased and is now at about the same level as it was before the upgrade. It does sound strange, but that's the case with me.

Here is my current usage on one of my osd servers:

ps aux |grep ceph-osd
root 23081 5.9 1.3 1889508 334908 ? Ssl Jul12 132:28 /usr/bin/ceph-osd --cluster=ceph -i 0 -f
root 23083 7.7 1.3 2024624 344664 ? Ssl Jul12 171:37 /usr/bin/ceph-osd --cluster=ceph -i 6 -f
root 23152 4.4 1.4 1857568 348068 ? Ssl Jul12 99:04 /usr/bin/ceph-osd --cluster=ceph -i 3 -f
root 23222 4.4 1.0 1807564 254108 ? Ssl Jul12 98:01 /usr/bin/ceph-osd --cluster=ceph -i 8 -f
root 23295 4.5 1.1 1774380 272012 ? Ssl Jul12 100:10 /usr/bin/ceph-osd --cluster=ceph -i 4 -f
root 23369 3.8 1.0 1689284 257152 ? Ssl Jul12 84:09 /usr/bin/ceph-osd --cluster=ceph -i 2 -f
root 23434 7.0 1.2 1963112 299424 ? Ssl Jul12 156:09 /usr/bin/ceph-osd --cluster=ceph -i 7 -f
root 23513 6.2 1.1 1885832 283804 ? Ssl Jul12 137:45 /usr/bin/ceph-osd --cluster=ceph -i 1 -f
root 23545 6.0 1.0 1819448 258408 ? Ssl Jul12 134:23 /usr/bin/ceph-osd --cluster=ceph -i 5 -f

Cheers
Andrei

----- Original Message -----
From: "Quenten Grasso"
To: "ceph-users"
Sent: Monday, 14 July, 2014 11:37:07 AM
Subject: [ceph-users] Firefly Upgrade

Hi All,

Just a quick question for the list, has anyone seen a significant increase in ram usage since firefly? I upgraded from 0.72.2 to 80.3 and now all of my Ceph servers are using about double the ram they used to. The only other significant change to our setup was an upgrade to Kernel 3.13.0-30-generic #55~precise1-Ubuntu SMP.

Any ideas?

Regards,
Quenten

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
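For anyone wanting to compare memory use the same way before and after an upgrade, summing the resident set size of all ceph-osd processes on a node is a quick check (a rough sketch, output in megabytes):

    ps -C ceph-osd -o rss= | awk '{ sum += $1 } END { printf "%.0f MB\n", sum/1024 }'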
[ceph-users] Ceph with Multipath ISCSI
Hello guys, I was wondering if there has been any progress on getting multipath iscsi play nicely with ceph? I've followed the how to and created a single path iscsi over ceph rbd with XenServer. However, it would be nice to have a built in failover using iscsi multipath to another ceph mon or osd server. Cheers Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
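For context, one common way to build such a gateway at the time of this thread is to map the image with the kernel RBD client on a gateway host and export it over iSCSI with tgt. The target and image names below are made up, and keeping two active gateways consistent for multipath failover is exactly the hard part being asked about here, so treat this as a single-gateway sketch only:

    # on a gateway host (pool/image names are examples)
    rbd map rbd/iscsi-vol1
    tgtadm --lld iscsi --mode target --op new --tid 1 \
           --targetname iqn.2014-07.com.example:rbd.iscsi-vol1
    tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
           --backing-store /dev/rbd/rbd/iscsi-vol1
    tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL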
Re: [ceph-users] Working ISCSI target guide
Drew,

I would not use iSCSI with KVM. Instead, I would use the built-in RBD support. However, you would use something like nfs/iscsi if you were to connect other hypervisors to a ceph backend. Having failover capabilities is important here ))

Andrei

--
Andrei Mikhailovsky
Director
Arhont Information Security

Web: http://www.arhont.com http://www.wi-foo.com
Tel: +44 (0)870 4431337
Fax: +44 (0)208 429 3111
PGP: Key ID - 0x2B3438DE
PGP: Server - keyserver.pgp.com

DISCLAIMER

The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy.

----- Original Message -----
From: "Drew Weaver"
To: "ceph-users@lists.ceph.com"
Sent: Tuesday, 15 July, 2014 2:18:53 PM
Subject: Re: [ceph-users] Working ISCSI target guide

One other question: if you are going to be using Ceph as a storage system for KVM virtual machines, does it even matter if you use ISCSI or not? Meaning that if you are just going to use LVM and have several hypervisors sharing that same VG, then using ISCSI isn’t really a requirement unless you are using a Hypervisor like ESXi which only works with ISCSI/NFS, correct?

Thanks,
-Drew

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Drew Weaver
Sent: Tuesday, July 15, 2014 9:03 AM
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] Working ISCSI target guide

Does anyone have a guide or reproducible method of getting multipath ISCSI working in front of ceph? Even if it just means having two front-end ISCSI targets, each with access to the same underlying Ceph volume? This seems like a super popular topic.

Thanks,
-Drew

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Quenten, We've got two monitors sitting on the osd servers and one on a different server. Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Quenten Grasso" To: "Andrija Panic" , "Sage Weil" Cc: ceph-users@lists.ceph.com Sent: Wednesday, 16 July, 2014 1:20:19 PM Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time Hi Sage, Andrija & List I have seen the tuneables issue on our cluster when I upgraded to firefly. I ended up going back to legacy settings after about an hour as my cluster is of 55 3TB OSD’s over 5 nodes and it decided it needed to move around 32% of our data, which after an hour all of our vm’s were frozen and I had to revert the change back to legacy settings and wait about the same time again until our cluster had recovered and reboot our vms. (wasn’t really expecting that one from the patch notes) Also our CPU usage went through the roof as well on our nodes, do you per chance have your metadata servers co-located on your osd nodes as we do? I’ve been thinking about trying to move these to dedicated nodes as it may resolve our issues. Regards, Quenten From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrija Panic Sent: Tuesday, 15 July 2014 8:38 PM To: Sage Weil Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time Hi Sage, since this problem is tunables-related, do we need to expect same behavior or not when we do regular data rebalancing caused by adding new/removing OSD? I guess not, but would like your confirmation. I'm already on optimal tunables, but I'm afraid to test this by i.e. shuting down 1 OSD. Thanks, Andrija On 14 July 2014 18:18, Sage Weil < sw...@redhat.com > wrote: I've added some additional notes/warnings to the upgrade and release notes: https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451 If there is somewhere else where you think a warning flag would be useful, let me know! Generally speaking, we want to be able to cope with huge data rebalances without interrupting service. It's an ongoing process of improving the recovery vs client prioritization, though, and removing sources of overhead related to rebalancing... and it's clearly not perfect yet. :/ sage On Sun, 13 Jul 2014, Andrija Panic wrote: > Hi, > after seting ceph upgrade (0.72.2 to 0.80.3) I have issued "ceph osd crush > tunables optimal" and after only few minutes I have added 2 more OSDs to the > CEPH cluster... > > So these 2 changes were more or a less done at the same time - rebalancing > because of tunables optimal, and rebalancing because of adding new OSD... > > Result - all VMs living on CEPH storage have gone mad, no disk access > efectively, blocked so to speak. 
> > Since this rebalancing took 5h-6h, I had bunch of VMs down for that long... > > Did I do wrong by causing "2 rebalancing" to happen at the same time ? > Is this behaviour normal, to cause great load on all VMs because they are > unable to access CEPH storage efectively ? > > Thanks for any input... > -- > > Andrija Pani? > > -- Andrija Panić -- http://admintweets.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Sage, would it help if you add a cache pool to your cluster? Let's say if you add a few TBs of ssds acting as a cache pool to your cluster, would this help with retaining IO to the guest vms during data recovery or reshuffling? Over the past year and a half that we've been using ceph we had a positive experience for the majority of time. The only downtime we had for our vms was when ceph is doing recovery. It seems that regardless of the tuning options we've used, our vms are still unable to get IO, they get to 98-99% iowait and freeze. This has happened on dumpling, emperor and now firefly releases. Because of this I've set noout flag on my cluster and have to keep an eye on the osds for manual intervention, which is far from ideal case (((. Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Sage Weil" To: "Gregory Farnum" Cc: ceph-users@lists.ceph.com Sent: Thursday, 17 July, 2014 1:06:52 AM Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time On Wed, 16 Jul 2014, Gregory Farnum wrote: > On Wed, Jul 16, 2014 at 4:45 PM, Craig Lewis > wrote: > > One of the things I've learned is that many small changes to the cluster > > are > > better than one large change. Adding 20% more OSDs? Don't add them all at > > once, trickle them in over time. Increasing pg_num & pgp_num from 128 to > > 1024? Go in steps, not one leap. > > > > I try to avoid operations that will touch more than 20% of the disks > > simultaneously. When I had journals on HDD, I tried to avoid going over 10% > > of the disks. > > > > > > Is there a way to execute `ceph osd crush tunables optimal` in a way that > > takes smaller steps? > > Unfortunately not; the crush tunables are changes to the core > placement algorithms at work. Well, there is one way, but it is only somewhat effective. If you decompile the crush maps for bobtail vs firefly the actual difference is tunable chooseleaf_vary_r 1 and this is written such that a value of 1 is the optimal 'new' way, 0 is the legacy old way, but values > 1 are less-painful steps between the two (though mostly closer to the firefly value of 1). So, you could set tunable chooseleaf_vary_r 4 wait for it to settle, and then do tunable chooseleaf_vary_r 3 ...and so forth down to 1. I did some limited testing of the data movement involved and noted it here: https://github.com/ceph/ceph/commit/37f840b499da1d39f74bfb057cf2b92ef4e84dc6 In my test case, going from 0 to 4 was about 1/10th as bad as going straight from 0 to 1, but the final step from 2 to 1 is still about 1/2 as bad. I'm not sure if that means it's not worth the trouble of not just jumping straight to the firefly tunables, or whether it means legacy users should just set (and leave) this at 2 or 3 or 4 and get almost all the benefit without the rebalance pain. 
sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
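For anyone wanting to try the stepped approach Sage describes above, the chooseleaf_vary_r value can be edited by decompiling and recompiling the crush map; the file names here are arbitrary:

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt and set, for example:  tunable chooseleaf_vary_r 4
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    # wait for the cluster to settle, then repeat with 3, 2 and finally 1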
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Comments inline - Original Message - From: "Sage Weil" To: "Quenten Grasso" Cc: ceph-users@lists.ceph.com Sent: Thursday, 17 July, 2014 4:44:45 PM Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time On Thu, 17 Jul 2014, Quenten Grasso wrote: > Hi Sage & List > > I understand this is probably a hard question to answer. > > I mentioned previously our cluster is co-located MON?s on OSD servers, which > are R515?s w/ 1 x AMD 6 Core processor & 11 3TB OSD?s w/ dual 10GBE. > > When our cluster is doing these busy operations and IO has stopped as in my > case, I mentioned earlier running/setting tuneable to optimal or heavy > recovery > > operations is there a way to ensure our IO doesn?t get completely > blocked/stopped/frozen in our vms? > > Could it be as simple as putting all 3 of our mon servers on baremetal > w/ssd?s? (I recall reading somewhere that a mon disk was doing several > thousand IOPS during a recovery operation) > > I assume putting just one on baremetal won?t help because our mon?s will only > ever be as fast as our slowest mon server? I don't think this is related to where the mons are (most likely). The big question for me is whether IO is getting completely blocked, or just slowed enough that the VMs are all timing out. AM: I was looking at the cluster status while the rebalancing was taking place and I was seeing very little client IO reported by ceph -s output. The numbers were around 20-100 whereas our typical IO for the cluster is around 1000. Having said that, this was not enough as _all_ of our vms become unresponsive and didn't recover after rebalancing finished. What slow request messages did you see during the rebalance? AM: As I was experimenting with different options while trying to gain some client IO back i've noticed that when I am limiting the options to 1 per osd ( osd max backfills = 1, osd recovery max active = 1, osd recovery threads = 1), I did not have any slow or blocked requests at all. Increasing these values did produce some blocked requests occasionally, but they were being quickly cleared. What were the op latencies? AM: In general, the latencies were around 5-10 higher compared to the normal cluster ops. The second column of the "ceph osd perf" was around 50s where as it is typically between 3-10. It did occasionally jump to some crazy numbers like 2000-3000 on several osds, but only for 5-10 seconds. It's possible there is a bug here, but it's also possible the cluster is just operating close enough to capacity that the additional rebalancing work pushes it into a place where it can't keep up and the IO latencies are too high. AM: My cluster in particular is under-utilised for the majority of time. I do not typically see osds more than 20-30% utilised and our ssd journals are usually less than 10% utilised. Or that we just have more work to do prioritizing requests.. but it's hard to say without more info. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
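The per-OSD limits discussed above can also be changed at runtime, without editing ceph.conf or restarting daemons, which is handy while watching how the cluster reacts during a rebalance (a sketch; the values are the ones from the thread):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'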
[ceph-users] feature set mismatch after upgrade from Emperor to Firefly
Hello guys,

I have noticed the following message/error after upgrading to firefly. Does anyone know what needs doing to correct it?

Thanks
Andrei

[ 25.911055] libceph: mon1 192.168.168.201:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
[ 25.911698] libceph: mon1 192.168.168.201:6789 socket error on read
[ 35.913049] libceph: mon2 192.168.168.13:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
[ 35.913694] libceph: mon2 192.168.168.13:6789 socket error on read
[ 45.909466] libceph: mon0 192.168.168.200:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
[ 45.910104] libceph: mon0 192.168.168.200:6789 socket error on read

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] feature set mismatch after upgrade from Emperor to Firefly
Thanks guys, I am trying the 3.15 kernel to see how it works.

Andrei

--
Andrei Mikhailovsky
Director
Arhont Information Security

Web: http://www.arhont.com http://www.wi-foo.com
Tel: +44 (0)870 4431337
Fax: +44 (0)208 429 3111
PGP: Key ID - 0x2B3438DE
PGP: Server - keyserver.pgp.com

DISCLAIMER

The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy.

----- Original Message -----
From: "Ilya Dryomov"
To: "Irek Fasikhov"
Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com
Sent: Sunday, 20 July, 2014 7:55:49 PM
Subject: Re: [ceph-users] feature set mismatch after upgrade from Emperor to Firefly

On Sun, Jul 20, 2014 at 10:29 PM, Irek Fasikhov wrote:
> Hi, Andrei.
>
> ceph osd getcrushmap -o /tmp/crush
> crushtool -i /tmp/crush --set-chooseleaf_vary_r 0 -o /tmp/crush.new
> ceph osd setcrushmap -i /tmp/crush.new
>
> Or
>
> update to kernel 3.15.
>
> 2014-07-20 20:19 GMT+04:00 Andrei Mikhailovsky :
>>
>> Hello guys,
>>
>> I have noticed the following message/error after upgrading to firefly.
>> Does anyone know what needs doing to correct it?
>>
>> Thanks
>> Andrei
>>
>> [ 25.911055] libceph: mon1 192.168.168.201:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
>> [ 25.911698] libceph: mon1 192.168.168.201:6789 socket error on read
>> [ 35.913049] libceph: mon2 192.168.168.13:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
>> [ 35.913694] libceph: mon2 192.168.168.13:6789 socket error on read
>> [ 45.909466] libceph: mon0 192.168.168.200:6789 feature set mismatch, my 40002 < server's 20002040002, missing 2000200
>> [ 45.910104] libceph: mon0 192.168.168.200:6789 socket error on read

Your kernel is missing the TUNABLES2 and TUNABLES3 feature bits. For the latter, do what Irek said. To deal with the former the easiest thing is to upgrade to 3.9 or later, but if that's not acceptable to you, try

ceph osd crush tunables legacy

Thanks,
Ilya

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
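To confirm what a cluster is actually running after changes like the above (including chooseleaf_vary_r), the current crush tunables can be printed with:

    ceph osd crush show-tunables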
Re: [ceph-users] Ceph and Infiniband
Ricardo, Thought to share my testing results. I've been using IPoIB with ceph for quite some time now. I've got QDR osd/mon/client servers to serve rbd images to kvm hypervisor. I've done some performance testing using both rados and guest vm benchmarks while running the last three stable versions of ceph. My conclusion was that ceph itself needs to mature and/or be optimised in order to utilise the capabilities of the infiniband link. In my experience, I was not able to reach the limits of the network speeds reported to me by the network performance monitoring tools. I was struggling to push data throughput beyond 1.5GB/s while using between 2 and 64 concurrent tests. This was the case when the benchmark data was using the same data over and over again and the data was cached on the osd servers and was coming directly from server's ram without any access to the osds themselves. My ipoib network performance tests were showing on average 2.5-3GB/s with peaks reaching 3.3GB/s over ipoib. It would be nice to see how ceph is performing over rdma ))). Having said this, perhaps my test gear is somewhat limited or my ceph optimisation was not done correctly. I had 2 osd servers with 8 osds each and three clients running guest vms and rados benchmarks. None of the benchmarks were able to fully utilise the server resources. my osd servers were running on about 50% utilisation during the tests. So, I had to conclude that unless you are running a large cluster with some specific data sets that utilise multithreading you will probably not need to have an infiniband link. A single thread performance for the cold data will be limited to about 1/2 of the speed of a single osd device. So, if your osds are running 150MB/s do not expect to have a single thread faster than 70-80MB/s. On the other hand, if you utilise high performance gear, like cache cards capable of achieving speeds of over gigabytes per second, perhaps infiniband link might be of use. Not sure if the ceph-osd process is capable of "spitting" out this amount of data though. You might be having a CPU bottleneck. Andrei - Original Message - From: "Sage Weil" To: "Riccardo Murri" Cc: ceph-users@lists.ceph.com Sent: Tuesday, 22 July, 2014 9:42:56 PM Subject: Re: [ceph-users] Ceph and Infiniband On Tue, 22 Jul 2014, Riccardo Murri wrote: > Hello, > > a few questions on Ceph's current support for Infiniband > > (A) Can Ceph use Infiniband's native protocol stack, or must it use > IP-over-IB? Google finds a couple of entries in the Ceph wiki related > to native IB support (see [1], [2]), but none of them seems finished > and there is no timeline. > > [1]: > https://wiki.ceph.com/Planning/Blueprints/Emperor/msgr%3A_implement_infiniband_support_via_rsockets > > [2]: http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger This is work in progress. We hope to get basic support into the tree in the next couple of months. > (B) Can we connect to the same Ceph cluster from Infiniband *and* > Ethernet? Some clients do only have Ethernet and will not be > upgraded, some others would have QDR Infiniband -- we would like both > sets to access the same storage cluster. This is further out. Very early refactoring to make this work in wip-addr. > (C) I found this old thread about Ceph's performance on 10GbE and > Infiniband: are the issues reported there still current? > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6816 No idea! :) sage > > > Thanks for any hint! 
>
> Riccardo
>
> --
> Riccardo Murri
> http://www.s3it.uzh.ch/about/team/
>
> S3IT: Services and Support for Science IT
> University of Zurich
> Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
> Tel: +41 44 635 4222
> Fax: +41 44 635 6888
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
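For anyone repeating this kind of test, the multi-threaded throughput numbers Andrei mentions are the sort of thing rados bench produces; a typical run looks roughly like this (pool name, duration and thread count are just examples):

    rados bench -p testpool 60 write -t 32 --no-cleanup
    rados bench -p testpool 60 seq -t 32
    rados -p testpool cleanup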
[ceph-users] Using Crucial MX100 for journals or cache pool
Hello guys, Was wondering if anyone has tried using the Crucial MX100 ssds either for osd journals or cache pool? It seems like a good cost effective alternative to the more expensive drives and read/write performance is very good as well. Thanks -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Using Crucial MX100 for journals or cache pool
Thanks for your comments. Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Christian Balzer" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Friday, 1 August, 2014 10:41:09 AM Subject: Re: [ceph-users] Using Crucial MX100 for journals or cache pool On Fri, 1 Aug 2014 09:38:34 +0100 (BST) Andrei Mikhailovsky wrote: > Hello guys, > > Was wondering if anyone has tried using the Crucial MX100 ssds either > for osd journals or cache pool? It seems like a good cost effective > alternative to the more expensive drives and read/write performance is > very good as well. > If you're going purely by price, a 128GB MX100 doesn't have much of an advantage over a 120GB Intel DC S3500. While the endurance is given as 72TB for all MX100 models, it increases with size for the Intel ones, 275TB for the 480GB model. So a while a 512GB MX100 is cheaper, it compares very poorly to a 480GB DC S3500 when it comes to TBW/$. And of course when it comes to the _consistent_ performance of the DC S3700 SSDs, nothing more needs to be said than the articles David referred to. That's what makes them so well suited for journal operations. If you're looking for a low cost cache pool, keep in mind that the warranty of the Crucial SSDs is just 3 years. If you stick to that time frame, that's about 65GB writes per day. This might be enough, put it is probably a lot harder to predict write loads for a cache pool unlike with a journal. >From my understanding doing something like a backup of your actual pool would write everything to the cache pool in that process. Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
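The ~65 GB/day figure mentioned above follows directly from spreading the rated endurance over the warranty period, roughly:

    72 TB total writes / (3 years x 365 days) = 72000 GB / 1095 days ~ 66 GB of writes per day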
[ceph-users] cache pools on hypervisor servers
Hello guys,

I was hoping to get some answers on how ceph would behave when I install SSDs at the hypervisor level and use them as a cache pool.

Let's say I've got 10 kvm hypervisors and I install one 512GB ssd in each server. I then create a cache pool for my storage cluster using these ssds. My questions are:

1. How would the network IO flow when I am performing reads and writes on the virtual machines? Would writes get stored on the hypervisor's ssd disk right away, or would the writes be directed to the osd servers first and then redirected back to the cache pool on the hypervisor's ssd? Similarly, would reads go to the osd servers and then be redirected to the cache pool on the hypervisors?

2. Would the majority of network traffic shift to the cache pool level and stay at the hypervisor level rather than the hypervisor / osd server level?

Many thanks

Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Anyone have an idea of how it works?

Thanks

----- Original Message -----
From: "Andrei Mikhailovsky"
To: ceph-users@lists.ceph.com
Sent: Monday, 4 August, 2014 10:10:03 AM
Subject: [ceph-users] cache pools on hypervisor servers

Hello guys,

I was hoping to get some answers on how ceph would behave when I install SSDs at the hypervisor level and use them as a cache pool. Let's say I've got 10 kvm hypervisors and I install one 512GB ssd in each server. I then create a cache pool for my storage cluster using these ssds. My questions are:

1. How would the network IO flow when I am performing reads and writes on the virtual machines? Would writes get stored on the hypervisor's ssd disk right away, or would the writes be directed to the osd servers first and then redirected back to the cache pool on the hypervisor's ssd? Similarly, would reads go to the osd servers and then be redirected to the cache pool on the hypervisors?

2. Would the majority of network traffic shift to the cache pool level and stay at the hypervisor level rather than the hypervisor / osd server level?

Many thanks

Andrei

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Robert, thanks for your reply, please see my comments inline - Original Message - > From: "Robert van Leeuwen" > To: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com > Sent: Wednesday, 13 August, 2014 6:57:57 AM > Subject: RE: cache pools on hypervisor servers > > I was hoping to get some answers on how would ceph behaive when I install > > SSDs on the hypervisor level and use them as cache pool. > > Let's say I've got 10 kvm hypervisors and I install one 512GB ssd on each > > server. > >I then create a cache pool for my storage cluster using these ssds. My > >questions are: > > > >1. How would the network IO flow when I am performing read and writes on the > >virtual machines? Would writes get stored on the hypervisor's ssd disk > >right away or would the rights be directed to the osd servers >first and > >then redirected back to the cache pool on the hypervisor's ssd? Similarly, > >would reads go to the osd servers and then redirected to the cache pool on > >the hypervisors? > You would need to make an OSD of your hypervisors. > Data would be "striped" across all hypervisors in the cache pool. > So you would shift traffic from: > hypervisors -> dedicated ceph OSD pool > to > hypervisors -> hypervisors running a OSD with SSD > Note that the local OSD also has to to do OSD replication traffic so you are > increasing the network load on the hypervisors by quite a bit. Personally I am not worried too much about the hypervisor - hypervisor traffic as I am using a dedicated infiniband network for storage. It is not used for the guest to guest or the internet traffic or anything else. I would like to decrease or at least smooth out the traffic peaks between the hypervisors and the SAS/SATA osd storage servers. I guess the ssd cache pool would enable me to do that as the eviction rate should be more structured compared to the random io writes that guest vms generate. > > Would majority of network traffic shift to the cache pool level and stay at > > the hypervisors level rather than hypervisor / osd server level? > I guess it depends on your access patterns and how much data needs to be > migrated back and forth to the regular storage. > I'm very interested in the effect of caching pools in combination with > running VMs on them so I'd be happy to hear what you find ;) I will give it a try and share back the results when we get the ssd kit. > As a side note: Running OSDs on hypervisors would not be my preferred choice > since hypervisor load might impact Ceph performance. Do you think it is not a good idea even if you have a lot of cores on the hypervisors? Like 24 or 32 per host server? According to my monitoring, our osd servers are not that stressed and generally have over 50% of free cpu power. Having said this, ssd osds will generate more io and throughput compared to the sas/sata osds, so the cpu load might be higher. Not really sure here. > I guess you can end up with pretty weird/unwanted results when your > hypervisors get above a certain load threshold. > I would certainly test a lot with high loads before putting it in > production... Definitely! > Cheers, > Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Hi guys, Could someone from the ceph team please comment on running osd cache pool on the hypervisors? Is this a good idea, or will it create a lot of performance issues? Anyone in the ceph community that has done this? Any results to share? Many thanks Andrei - Original Message - > From: "Robert van Leeuwen" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Thursday, 14 August, 2014 9:31:24 AM > Subject: RE: cache pools on hypervisor servers > > Personally I am not worried too much about the hypervisor - hypervisor > > traffic as I am using a dedicated infiniband network for storage. > > It is not used for the guest to guest or the internet traffic or anything > > else. I would like to decrease or at least smooth out the traffic peaks > > between the hypervisors and the SAS/SATA osd storage servers. > > I guess the ssd cache pool would enable me to do that as the eviction rate > > should be more structured compared to the random io writes that guest vms > > generate. > Sounds reasonable > >>I'm very interested in the effect of caching pools in combination with > >>running VMs on them so I'd be happy to hear what you find ;) > > I will give it a try and share back the results when we get the ssd kit. > Excellent, looking forward to it. > >> As a side note: Running OSDs on hypervisors would not be my preferred > >> choice since hypervisor load might impact Ceph performance. > > Do you think it is not a good idea even if you have a lot of cores on the > > hypervisors? > > Like 24 or 32 per host server? > > According to my monitoring, our osd servers are not that stressed and > > generally have over 50% of free cpu power. > The number of cores do not really matter if they are all busy ;) > I honestly do not know how Ceph behaves when it is CPU starved but I guess it > might not be pretty. > Since your whole environment will be crumbling down if your storage becomes > unavailable it is not a risk I would take lightly. > Cheers, > Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cache pools on hypervisor servers
Thanks a lot for your input. I will proceed with putting the cache pool on the storage layer instead. Andrei - Original Message - > From: "Sage Weil" > To: "Andrei Mikhailovsky" > Cc: "Robert van Leeuwen" , > ceph-users@lists.ceph.com > Sent: Thursday, 14 August, 2014 6:33:25 PM > Subject: Re: [ceph-users] cache pools on hypervisor servers > > On Thu, 14 Aug 2014, Andrei Mikhailovsky wrote: > > Hi guys, > > > > Could someone from the ceph team please comment on running osd cache pool > > on > > the hypervisors? Is this a good idea, or will it create a lot of > > performance > > issues? > > It doesn't sound like an especially good idea. In general you want the > cache pool to be significantly faster than the base pool (think PCI > attached flash). And there won't be any particular affinity to the host > where the VM consuming the sotrage happens to be, so I don't think there > is a reason to put the flash in the hypervisor nodes unless there simply > isn't anywhere else to put them. > > Probably what you're after is a client-side write-thru cache? There is > some ongoing work to build this into qemu and possibly librbd, but nothing > is ready yet that I know of. > > sage > > > > > > Anyone in the ceph community that has done this? Any results to share? > > > > Many thanks > > > > Andrei > > > > > > From: "Robert van Leeuwen" > > To: "Andrei Mikhailovsky" > > Cc: ceph-users@lists.ceph.com > > Sent: Thursday, 14 August, 2014 9:31:24 AM > > Subject: RE: cache pools on hypervisor servers > > > > > Personally I am not worried too much about the hypervisor - > > hypervisor traffic as I am using a dedicated infiniband network for > > storage. > > > It is not used for the guest to guest or the internet traffic or > > anything else. I would like to decrease or at least smooth out the > > traffic peaks between the hypervisors and the SAS/SATA osd storage > > servers. > > > I guess the ssd cache pool would enable me to do that as the > > eviction rate should be more structured compared to the random io > > writes that guest vms generate. Sounds reasonable > > > > >>I'm very interested in the effect of caching pools in combination > > with running VMs on them so I'd be happy to hear what you find ;) > > > I will give it a try and share back the results when we get the ssd > > kit. > > Excellent, looking forward to it. > > > > > > >> As a side note: Running OSDs on hypervisors would not be my > > preferred choice since hypervisor load might impact Ceph performance. > > > Do you think it is not a good idea even if you have a lot of cores > > on the hypervisors? > > > Like 24 or 32 per host server? > > > According to my monitoring, our osd servers are not that stressed > > and generally have over 50% of free cpu power. > > > > The number of cores do not really matter if they are all busy ;) > > I honestly do not know how Ceph behaves when it is CPU starved but I > > guess it might not be pretty. > > Since your whole environment will be crumbling down if your storage > > becomes unavailable it is not a risk I would take lightly. > > > > Cheers, > > Robert van Leeuwen > > > > > > > > > > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
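For completeness, attaching an SSD pool as a writeback cache tier on the storage side (the approach settled on above) is done along these lines in firefly; the pool names are examples and the thresholds need sizing for the actual working set:

    ceph osd tier add rbd ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay rbd ssd-cache
    ceph osd pool set ssd-cache hit_set_type bloom
    ceph osd pool set ssd-cache target_max_bytes 500000000000   # example: ~500 GB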
Re: [ceph-users] Serious performance problems with small file writes
Hugo,

I would look at setting up a cache pool made of 4-6 ssds to start with. So, if you have 6 osd servers, stick at least 1 ssd disk in each server for the cache pool. It should greatly reduce the osds' stress of writing a large number of small files. Your cluster should become more responsive and the end users' experience should also improve.

I am planning on doing so in the near future, and according to my friend's experience, introducing a cache pool greatly increased the overall performance of his cluster and removed the performance issues he was having during scrubbing/deep-scrubbing/recovery activities. The size of your working data set should determine the size of the cache pool, but in general it will create a nice speedy buffer between your clients and those terribly slow spindles.

Andrei

----- Original Message -----
From: "Hugo Mills"
To: "Dan Van Der Ster"
Cc: "Ceph Users List"
Sent: Wednesday, 20 August, 2014 4:54:28 PM
Subject: Re: [ceph-users] Serious performance problems with small file writes

Hi, Dan,

Some questions below I can't answer immediately, but I'll spend tomorrow morning irritating people by triggering these events (I think I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files in it) and giving you more details. For the ones I can answer right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
> Do you get slow requests during the slowness incidents?

Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; oldest blocked for > 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 active+clean; 1
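When chasing slow requests like the ones above, the admin socket on the OSD reporting them can show where each operation is spending its time; run on the host carrying osd.5, for example:

    ceph daemon osd.5 dump_ops_in_flight
    ceph daemon osd.5 dump_historic_ops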
[ceph-users] pool with cache pool and rbd export
Hello guys, I am planning to perform regular rbd pool off-site backups with rbd export and export-diff. I've got a small ceph firefly cluster with an active writeback cache pool made of a couple of osds. I've got the following question which I hope the ceph community could answer: Will these rbd export or import operations affect the active hot data in the cache pool, thus evicting the real hot data used by the clients from the cache pool? Or does the process of rbd export/import affect only the backing osds and not touch the cache pool? Many thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
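For reference, a minimal sketch of the export/export-diff backup cycle under discussion. The pool, image and snapshot names are invented for illustration, and it assumes the destination cluster is reachable over ssh and already has an empty image of the right size to receive the first import-diff:

# initial full copy (a diff from the beginning of time); the destination image
# must already exist with the same size, e.g. created with 'rbd create --size ...'
rbd snap create rbd/vm-disk-1@base
rbd export-diff rbd/vm-disk-1@base - | ssh backuphost rbd import-diff - rbd/vm-disk-1

# later incremental runs only move the blocks changed since the previous snapshot
rbd snap create rbd/vm-disk-1@daily-1
rbd export-diff --from-snap base rbd/vm-disk-1@daily-1 - | \
    ssh backuphost rbd import-diff - rbd/vm-disk-1

Each incremental run only transfers the extents that changed between the two snapshots, which is what makes the off-site copy cheap after the first pass.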
Re: [ceph-users] pool with cache pool and rbd export
So it looks like using rbd export / import will negatively effect the client performance, which is unfortunate. Is this really the case? Any plans on changing this behavior in future versions of ceph? Cheers Andrei - Original Message - From: "Robert LeBlanc" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 8:21:08 PM Subject: Re: [ceph-users] pool with cache pool and rbd export My understanding is that all reads are copied to the cache pool. This would indicate that cache will be evicted. I don't know to what extent this will affect the hot cache because we have not used a cache pool yet. I'm currently looking into bcache fronting the disks to provide caching there. Robert LeBlanc On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > wrote: Hello guys, I am planning to perform regular rbd pool off-site backup with rbd export and export-diff. I've got a small ceph firefly cluster with an active writeback cache pool made of couple of osds. I've got the following question which I hope the ceph community could answer: Will this rbd export or import operations affect the active hot data in the cache pool, thus evicting from the cache pool the real hot data used by the clients. Or does the process of rbd export/import effect only the osds and does not touch the cache pool? Many thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pool with cache pool and rbd export
Does that also mean that scrubbing and deep-scrubbing also squishes data out of the cache pool? Could someone from the ceph community confirm this? Thanks - Original Message - From: "Robert LeBlanc" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 8:21:08 PM Subject: Re: [ceph-users] pool with cache pool and rbd export My understanding is that all reads are copied to the cache pool. This would indicate that cache will be evicted. I don't know to what extent this will affect the hot cache because we have not used a cache pool yet. I'm currently looking into bcache fronting the disks to provide caching there. Robert LeBlanc On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > wrote: Hello guys, I am planning to perform regular rbd pool off-site backup with rbd export and export-diff. I've got a small ceph firefly cluster with an active writeback cache pool made of couple of osds. I've got the following question which I hope the ceph community could answer: Will this rbd export or import operations affect the active hot data in the cache pool, thus evicting from the cache pool the real hot data used by the clients. Or does the process of rbd export/import effect only the osds and does not touch the cache pool? Many thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pool with cache pool and rbd export
Sage, I guess this will be a problem with the cache pool when you do the export for the first time. However, after the first export is done, the diff data will be read and copied across and looking at the cache pool, I would say the diff data will be there anyway as the changes are going to be considered as hot data by the pool as it has been recently changed. So, I do not expect the delta exports squeeze too much data from the cache pool. That is if I got the understanding of how cache pools work. Andrei - Original Message - From: "Sage Weil" To: "Andrei Mikhailovsky" Cc: "Robert LeBlanc" , ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 10:34:24 PM Subject: Re: [ceph-users] pool with cache pool and rbd export On Fri, 22 Aug 2014, Andrei Mikhailovsky wrote: > So it looks like using rbd export / import will negatively effect the > client performance, which is unfortunate. Is this really the case? Any > plans on changing this behavior in future versions of ceph? There will always be some impact from import/export as you are incurring IO load on the system. But yes, in the cache case, this isn't very nice. In master we've added the ability to avoid promoting objects on individual reads unless we've seen some previous activity. This isn't backported to firefly yet (although that is likely). Even so, someone needs to do a bit of testing to verify that the rbd export pattern incurs only a single read on the objects and will avoid a promotion in the general case. sage > > Cheers > > Andrei > > > - Original Message - > From: "Robert LeBlanc" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 22 August, 2014 8:21:08 PM > Subject: Re: [ceph-users] pool with cache pool and rbd export > > > My understanding is that all reads are copied to the cache pool. This would > indicate that cache will be evicted. I don't know to what extent this will > affect the hot cache because we have not used a cache pool yet. I'm currently > looking into bcache fronting the disks to provide caching there. > > > Robert LeBlanc > > > > On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > > Hello guys, > > I am planning to perform regular rbd pool off-site backup with rbd export and > export-diff. I've got a small ceph firefly cluster with an active writeback > cache pool made of couple of osds. I've got the following question which I > hope the ceph community could answer: > > Will this rbd export or import operations affect the active hot data in the > cache pool, thus evicting from the cache pool the real hot data used by the > clients. Or does the process of rbd export/import effect only the osds and > does not touch the cache pool? > > Many thanks > > Andrei > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pool with cache pool and rbd export
Sage, would it not be more effective to separate the data as internal and external in a sense. So all maintenance related activities will be classed as internal (like scrubbing, deep-scrubbing, import and export data, etc) and will not effect the cache and all other activities (really client io) will go through cache? Andrei - Original Message - From: "Sage Weil" To: "Andrei Mikhailovsky" Cc: "Robert LeBlanc" , ceph-users@lists.ceph.com Sent: Friday, 22 August, 2014 10:34:24 PM Subject: Re: [ceph-users] pool with cache pool and rbd export On Fri, 22 Aug 2014, Andrei Mikhailovsky wrote: > So it looks like using rbd export / import will negatively effect the > client performance, which is unfortunate. Is this really the case? Any > plans on changing this behavior in future versions of ceph? There will always be some impact from import/export as you are incurring IO load on the system. But yes, in the cache case, this isn't very nice. In master we've added the ability to avoid promoting objects on individual reads unless we've seen some previous activity. This isn't backported to firefly yet (although that is likely). Even so, someone needs to do a bit of testing to verify that the rbd export pattern incurs only a single read on the objects and will avoid a promotion in the general case. sage > > Cheers > > Andrei > > > - Original Message - > From: "Robert LeBlanc" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 22 August, 2014 8:21:08 PM > Subject: Re: [ceph-users] pool with cache pool and rbd export > > > My understanding is that all reads are copied to the cache pool. This would > indicate that cache will be evicted. I don't know to what extent this will > affect the hot cache because we have not used a cache pool yet. I'm currently > looking into bcache fronting the disks to provide caching there. > > > Robert LeBlanc > > > > On Fri, Aug 22, 2014 at 12:41 PM, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > > Hello guys, > > I am planning to perform regular rbd pool off-site backup with rbd export and > export-diff. I've got a small ceph firefly cluster with an active writeback > cache pool made of couple of osds. I've got the following question which I > hope the ceph community could answer: > > Will this rbd export or import operations affect the active hot data in the > cache pool, thus evicting from the cache pool the real hot data used by the > clients. Or does the process of rbd export/import effect only the osds and > does not touch the cache pool? > > Many thanks > > Andrei > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph rbd image checksums
Hello guys, I am planning to do rbd images off-site backup with rbd export-diff and I was wondering if ceph has checksumming functionality so that I can compare source and destination files for consistency? If so, how do I retrieve the checksum values from the ceph cluster? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd clones and export / export-diff functions
Hello guys, Is it possible to export an rbd image while preserving the clone structure? So, if I've got a single base rbd image and 10 vm images that were cloned from it, would rbd export preserve this structure on the destination pool, or would it waste space and create 10 independent vm rbd images without using the clone? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
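As far as I know, rbd export reads through the clone and writes out a flat, standalone image, so the parent/child relationship is not carried across; if the layout matters, it has to be rebuilt by hand on the destination. A rough sketch of that idea, with invented names and assuming format 2 images:

# ship the parent image once, then recreate the snapshot/clone structure remotely
rbd export rbd/golden-image - | ssh desthost rbd import --image-format 2 - rbd/golden-image
ssh desthost rbd snap create rbd/golden-image@template
ssh desthost rbd snap protect rbd/golden-image@template
ssh desthost rbd clone rbd/golden-image@template rbd/vm-01
ssh desthost rbd clone rbd/golden-image@template rbd/vm-02

Whether the per-vm deltas can then be moved across efficiently with export-diff is a separate question.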
Re: [ceph-users] ceph rbd image checksums
Do rbd export and export-diff, and likewise import and import-diff, guarantee the consistency of the data? So that if the image is "damaged" during the transfer, would this be flagged by the other side? Or would it simply leave the broken image on the destination cluster? Cheers - Original Message - From: "Wido den Hollander" To: ceph-users@lists.ceph.com Sent: Monday, 25 August, 2014 10:31:14 AM Subject: Re: [ceph-users] ceph rbd image checksums On 08/24/2014 08:27 PM, Andrei Mikhailovsky wrote: > Hello guys, > > I am planning to do rbd images off-site backup with rbd export-diff and I was > wondering if ceph has checksumming functionality so that I can compare source > and destination files for consistency? If so, how do I retrieve the checksum > values from the ceph cluster? > That would be rather difficult. There is no checksum, but to have a valid checksum you would need to verify the whole RBD image. That means reading the whole image and calculating the checksum based on that. > Thanks > > Andrei > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
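In the spirit of Wido's answer, one way to get a comparable checksum without writing the image to a file is to stream the export through a hashing tool on both clusters. A sketch with invented names, assuming the same snapshot exists on both sides; note this reads the whole image, so it is slow and will generate cache-tier promotions on a tiered pool:

# on the source cluster
rbd export rbd/vm-disk-1@backup-snap - | md5sum

# on the destination cluster
rbd export rbd/vm-disk-1@backup-snap - | md5sum

# matching sums give reasonable confidence that the transfer was consistent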
Re: [ceph-users] Ceph monitor load, low performance
Off the top of my head, it is recommended to use 3 mons in production. Also, for 22 osds your number of PGs looks a bit low; you should look at that. "The performance of the cluster is poor" - this is too vague. What is your current performance, what benchmarks have you tried, what is your data workload and, most importantly, how is your cluster set up: what disks, ssds, network, ram, etc.? Please provide more information so that people can help you. Andrei - Original Message - From: "Mateusz Skała" To: ceph-users@lists.ceph.com Sent: Monday, 25 August, 2014 2:39:16 PM Subject: [ceph-users] Ceph monitor load, low performance Hello, we have deployed a ceph cluster with 4 monitors and 22 osd's. We are using only rbd's. All VM's on KVM have the monitors specified in the same order. One of the monitors (the first on the list in the vm disk specification - ceph35) has more load than the others and the performance of the cluster is poor. How can we fix this problem? Here is 'ceph -s' output: cluster a9d17295-UUID-1cad7724e97f health HEALTH_OK monmap e4: 4 mons at {ceph15=IP.15:6789/0,ceph25=IP.25:6789/0,ceph30=IP.30:6789/0,ceph35=IP.35:6789/0}, election epoch 5750, quorum 0,1,2,3 ceph15,ceph25,ceph30,ceph35 osdmap e7376: 22 osds: 22 up, 22 in pgmap v3387277: 3072 pgs, 3 pools, 2306 GB data, 587 kobjects 6997 GB used, 12270 GB / 19267 GB avail 3071 active+clean 1 active+clean+scrubbing client io 14849 B/s rd, 2887 kB/s wr, 1044 op/s Thanks for help, -- Best Regards Mateusz ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Two osds are spaming dmesg every 900 seconds
Hello I am seeing this message every 900 seconds on the osd servers. My dmesg output is all filled with: [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state OPEN) [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state OPEN) Looking at the ceph-osd logs I see the following at the same time: 2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 >> 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket is 192.168.168.200:54457/0) This happens only on two osds and the rest of osds seem fine. Does anyone know why am I seeing this and how to correct it? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd export poor performance
Hi, I am running a few tests for exporting volumes with rbd export and noticing very poor performance. It takes almost 3 hours to export 100GB volume. Servers are pretty idle during the export. The performance of the cluster itself is way faster. How can I increase the speed of rbd export? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Two osds are spaming dmesg every 900 seconds
Thanks! i thought it's something serious. Andrei - Original Message - From: "Gregory Farnum" To: "Andrei Mikhailovsky" Cc: "ceph-users" Sent: Tuesday, 26 August, 2014 9:00:06 PM Subject: Re: [ceph-users] Two osds are spaming dmesg every 900 seconds This is being output by one of the kernel clients, and it's just saying that the connections to those two OSDs have died from inactivity. Either the other OSD connections are used a lot more, or aren't used at all. In any case, it's not a problem; just a noisy notification. There's not much you can do about it; sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky wrote: > Hello > > I am seeing this message every 900 seconds on the osd servers. My dmesg > output is all filled with: > > [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state > OPEN) > [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state > OPEN) > > > Looking at the ceph-osd logs I see the following at the same time: > > 2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 >> > 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 > c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket > is 192.168.168.200:54457/0) > > > This happens only on two osds and the rest of osds seem fine. Does anyone > know why am I seeing this and how to correct it? > > Thanks > > Andrei > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Cache pool - step by step guide
Hello guys, I was wondering if someone could point me in the direction of a step-by-step guide on setting up a cache pool. I've seen http://ceph.com/docs/firefly/dev/cache-pool/. However, it makes no mention of the first steps that one needs to take. For instance, I've got my ssd disks plugged into the osd servers. What do I do next? How do I create the initial pool made just of these ssds and promote it to cache pool status? How do I choose the right cache pool sizing, number of PGs, etc.? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache pool - step by step guide
Vlad, thanks for the information. I will review it shortly. I do have SSDs and SAS (not sata) disks in the same box. But I guess there shouldn't be much difference between SAS and SATA. At the moment I am running firefly. I've seen some comments that the master branch has a great deal of improvements introduce to accommodate high IO of the SSDs. Does that apply to the improvements of the cache tier as well? Cheers Andrei - Original Message - From: "Vladislav Gorbunov" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Thursday, 4 September, 2014 1:52:05 AM Subject: Re: [ceph-users] Cache pool - step by step guide You mix sata and ssd disks within the same server? Read this: http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ When you have different pools for sata and ssd configure cache-pool: ceph osd tier add satapool ssdpool ceph osd tier cache-mode ssdpool writeback ceph osd pool set ssdpool hit_set_type bloom ceph osd pool set ssdpool hit_set_count 1 In this example 80-85% of the cache pool is equal to 280GB ceph osd pool set ssdpool target_max_bytes $((280*1024*1024*1024)) ceph osd tier set-overlay satapool ssdpool ceph osd pool set ssdpool hit_set_period 300 ceph osd pool set ssdpool cache_min_flush_age 300 # 10 minutes ceph osd pool set ssdpool cache_min_evict_age 1800 # 30 minutes ceph osd pool set ssdpool cache_target_dirty_ratio .4 ceph osd pool set ssdpool cache_target_full_ratio .8 Remember, that the current stable ceph 0.80.5 cache pool osds crashing when data is evicting to underlying storage pool. See http://tracker.ceph.com/issues/8982 2014-09-04 10:55 GMT+12:00 Andrei Mikhailovsky < and...@arhont.com > : Hello guys, I was wondering if someone could point me in the right direction of a step by step guide on setting up a cache pool. I've seen the http://ceph.com/docs/firefly/dev/cache-pool/ . However, it has no mentioning of the first steps that one need to take. For instance, I've got my ssd disks plugged into the osd servers. What do I do next? How do i create the initial pool made just of these ssds and promote it to the cache pool status. How do i choose the right cache pool sizing, number of PGs, etc. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
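To fill in the "first steps" the original question asked about, here is a rough sketch of one way to give the SSDs their own CRUSH root, rule and pool before running the tier commands above. Bucket, rule and pool names are invented, the osd ids and weights must match the real SSD osds, and Sebastien Han's post linked above walks through the same idea in more detail:

# create a separate CRUSH hierarchy for the ssd osds
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket ceph-node1-ssd host
ceph osd crush add-bucket ceph-node2-ssd host
ceph osd crush move ceph-node1-ssd root=ssd
ceph osd crush move ceph-node2-ssd root=ssd

# place the ssd osds (here assumed to be osd.16 and osd.17) under the new hosts
ceph osd crush set osd.16 0.5 root=ssd host=ceph-node1-ssd
ceph osd crush set osd.17 0.5 root=ssd host=ceph-node2-ssd

# a rule that only selects osds from the ssd root, and a pool that uses it
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool create ssdpool 512 512
ceph osd pool set ssdpool crush_ruleset <ssd-rule-id>   # id from 'ceph osd crush rule dump'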
[ceph-users] Cache pool and using btrfs for ssd osds
Hello guys, I was wondering if there is a benefit in using the journal-less btrfs file system on the cache pool osds? Would it speed up writes to the cache tier? Is btrfs with ceph getting close to production level? Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph and TRIM on SSD disks
Hello guys, I was wondering if it is a good idea to enable TRIM (the discard mount option) on the ssd disks which are used for either the cache pool or the osd journals? For performance, is it better to enable it or to run fstrim from cron every once in a while? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
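A small sketch of the fstrim-from-cron option mentioned above, assuming the SSDs carry mounted filesystems under the usual /var/lib/ceph/osd/ceph-N paths (a raw journal partition has nothing to fstrim, and whether continuous discard or a periodic trim behaves better tends to depend on the SSD model):

# run by hand first to see how much gets trimmed
fstrim -v /var/lib/ceph/osd/ceph-16

# root crontab entry on each osd host: weekly trim of the ssd-backed filesystems
0 3 * * 0  /sbin/fstrim /var/lib/ceph/osd/ceph-16 && /sbin/fstrim /var/lib/ceph/osd/ceph-17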
Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?
Keith, You should consider doing regular rbd volume snapshots and keep them for N amount of hours/days/months depending on your needs. Cheers Andrei - Original Message - From: "Keith Phua" To: ceph-users@lists.ceph.com Cc: y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg Sent: Wednesday, 10 September, 2014 3:22:53 AM Subject: [ceph-users] Best practices on Filesystem recovery on RBD block volume? Dear ceph-users, Recently we had an encounter of a XFS filesystem corruption on a NAS box. After repairing the filesystem, we discover the files were gone. This trigger some questions with regards to filesystem on RBD block which I hope the community can enlighten me. 1. If a local filesystem on a rbd block is corrupted, is it fair to say that regardless of how many replicated copies we specified for the pool, unless the filesystem is properly repaired and recovered, we may not get our data back? 2. If the above statement is true, does it mean that severe filesystem corruption on a RBD block constitute a single point of failure, since filesystems corruption can happened when the RBD client is not properly shutdown or due to a kernel bug? 3. Other than existing best practices for a filesystem recovery, does ceph have any other best practices for filesystem on RBD which we can adopt for data recovery? Thanks in advance. Regards, Keith ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
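A minimal sketch of the snapshot rotation being suggested here, purely illustrative (the image name and retention count are made up, and an orchestration layer would normally do this for you, as mentioned further down the thread):

#!/bin/sh
# snapshot one rbd image with a timestamp and keep only the newest 7 auto snapshots
IMG=rbd/nas-volume
KEEP=7
rbd snap create "$IMG@auto-$(date +%Y%m%d%H%M)"
rbd snap ls "$IMG" | awk '/auto-/ {print $2}' | sort | head -n -"$KEEP" | \
    while read snap; do rbd snap rm "$IMG@$snap"; done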
Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?
Keith, I think the hypervisor / infrastructure orchestration layer should be able to handle proper snapshotting with io freezing. For instance, we use CloudStack and you can set up automatic snapshots and snapshot retention policies. Cheers Andrei - Original Message - From: "Ilya Dryomov" To: "Keith Phua" Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com, y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg Sent: Wednesday, 10 September, 2014 11:51:04 AM Subject: Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume? On Wed, Sep 10, 2014 at 2:45 PM, Keith Phua wrote: > Hi Andrei, > > Thanks for the suggestion. > > But a rbd volume snapshots may only work if the filesystem is in a > consistent state, which mean no IO during snapshotting. With cronjob > snapshotting, usually we have no control over client doing any IOs. Having xfs_freeze -f /mnt xfs_freeze -u /mnt Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
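For completeness, the freeze/snapshot/thaw sequence Ilya is pointing at, sketched for a filesystem mounted directly on an rbd device (paths and names are illustrative; fsfreeze does the same job for non-XFS filesystems, and a hypervisor-driven snapshot achieves the equivalent through the guest agent):

# quiesce the filesystem, snapshot the underlying rbd image, then thaw
xfs_freeze -f /mnt/nas
rbd snap create rbd/nas-volume@consistent-$(date +%Y%m%d)
xfs_freeze -u /mnt/nas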
[ceph-users] Cache Pool writing too much on ssds, poor performance?
Hello guys, I am experimenting with a cache pool and running some tests to see how adding the cache pool improves the overall performance of our small cluster. While testing I've noticed that the cache pool seems to be writing far too much to the cache pool ssds. I am not sure what the issue is here; perhaps someone could help me understand what is going on. My test cluster is: 2 x OSD servers (each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds for journals, 2 ssds for the cache pool, 40gbit/s infiniband network capable of 25gbit/s over ipoib). The cache pool is set to 500GB with a replica count of 2. 4 x host servers (128GB ram, 24 cores, 40gbit/s infiniband network capable of 12gbit/s over ipoib). So, my test is: simple runs of the following command: "dd if=/dev/vda of=/dev/null bs=4M count=2000 iflag=direct". I start this command concurrently on 10 virtual machines which are running on the 4 host servers. The aim is to monitor the use of the cache pool when reading the same data over and over again. Running the above command for the first time does what I was expecting. The osds are doing a lot of reads, the cache pool does a lot of writes (around 250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are poor. The output of "ceph -w" shows consistent performance over time. Running the above for the second and subsequent times produces IO patterns which I was not expecting at all. The hdd osds are not doing much (this part I expected), but the cache pool still does a lot of writes and very few reads! The dd results have improved just a little, but not much. The output of "ceph -w" shows performance breaks over time. For instance, I get a peak of throughput in the first couple of seconds (data is probably coming from the osd servers' ram at a high rate). After the peak throughput has finished, the ceph reads proceed in the following way: 2-3 seconds of activity followed by 2 seconds of inactivity, and it keeps doing that throughout the length of the test. So, to put the numbers in perspective, when running the tests over and over again I would get around 2000-3000MB/s for the first two seconds, followed by 0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 seconds, followed by 0MB/s for 2 seconds, followed by 150-250MB/s over 2-3 seconds, followed by 0MB/s over 2 seconds, and the pattern repeats until the test is done. I ran the dd command about 15-20 times and observed the same behaviour. The cache pool does mainly writes (around 200MB/s per ssd) when the guest vms are reading the same data over and over again. There is very little read IO (around 20-40MB/s). Why am I not getting high read IO? I had expected the 80GB of data that is being read by the vms over and over again to be firmly recognised as the hot data, kept in the cache pool and read from it when the guest vms request the data. Instead, I mainly get writes on the cache pool ssds and I am not really sure where these writes are coming from, as my hdd osds are pretty idle. From the overall tests so far, introducing the cache pool has drastically slowed down my cluster (by as much as 50-60%). Thanks for any help Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?
Hi, I have created the cache tier using the following commands: 95 ceph osd pool create cache-pool-ssd 2048 2048 ; ceph osd pool set cache-pool-ssd crush_ruleset 4 124 ceph osd pool set cache-pool-ssd size 2 126 ceph osd pool set cache-pool-ssd min_size 1 130 ceph osd tier add Primary-ubuntu-1 cache-pool-ssd 131 ceph osd tier cache-mode cache-pool-ssd writeback 132 ceph osd tier set-overlay Primary-ubuntu-1 cache-pool-ssd 135 ceph osd pool set cache-pool-ssd hit_set_type bloom 136 ceph osd pool set cache-pool-ssd hit_set_count 1 137 ceph osd pool set cache-pool-ssd hit_set_period 3600 138 ceph osd pool set cache-pool-ssd target_max_bytes 5000 143 ceph osd pool set cache-pool-ssd cache_target_full_ratio 0.8 SInce the initial install i've increased the target_max_bytes to 800GB. The rest of the settings are left as default. Did I miss something that might explain the behaviour that i am experiencing? Cheers Andrei - Original Message - From: "Xiaoxi Chen" To: "Andrei Mikhailovsky" , "ceph-users" Sent: Thursday, 11 September, 2014 2:00:31 AM Subject: RE: Cache Pool writing too much on ssds, poor performance? Could you show your cache tiering configuration? Especially this three parameters. ceph osd pool set hot-storage cache_target_dirty_ratio 0.4 ceph osd pool set hot-storage cache_target_full_ratio 0.8 ceph osd pool set {cachepool} target_max_bytes {#bytes} From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrei Mikhailovsky Sent: Wednesday, September 10, 2014 8:51 PM To: ceph-users Subject: [ceph-users] Cache Pool writing too much on ssds, poor performance? Hello guys, I am experimeting with cache pool and running some tests to see how adding the cache pool improves the overall performance of our small cluster. While doing testing I've noticed that it seems that the cache pool is writing too much on the cache pool ssds. Not sure what the issue here, perhaps someone could help me understand what is going on. My test cluster is: 2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of 25gbit/s over ipoib). Cache pool is set to 500GB with replica of 2. 4 x host servers (128GB ram, 24 core, 40gbit/s infiniband network capable of 12gbit/s over ipoib) So, my test is: Simple tests using the following command: "dd if=/dev/vda of=/dev/null bs=4M count=2000 iflag=direct". I am concurrently starting this command on 10 virtual machines which are running on 4 host servers. The aim is to monitor the use of cache pool when reading the same data over and over again. Running the above command for the first time does what I was expecting. The osds are doing a lot of reads, the cache pool does a lot of writes (around 250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are poor. The results of the "ceph -w" shows consistent performance across the time. Running the above for the second and consequent times produces IO patterns which I was not expecting at all. The hdd osds are not doing much (this part I expected), the cache pool still does a lot of writes and very little reads! The dd results have improved just a little, but not much. The results of the "ceph -w" shows performance breaks over time. For instance, I have a peak of throughput in the first couple of seconds (data is probably coming from the osd server's ram at high rate). 
After the peak throughput has finished, the ceph reads are done in the following way: 2-3 seconds of activity followed by 2 seconds if inactivity) and it keeps doing that throughout the length of the test. So, to put the numbers in perspective, when running tests over and over again I would get around 2000 - 3000MB/s for the first two seconds, followed by 0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 seconds, followed by 0MB/s for 2 seconds, followed 150-250MB/s over 2-3 seconds, followed by 0MB/s over 2 secods, and the pattern repeats until the test is done. I kept running the dd command for about 15-20 times and observed the same behariour. The cache pool does mainly writes (around 200MB/s per ssd) when guest vms are reading the same data over and over again. There is very little read IO (around 20-40MB/s). Why am I not getting high read IO? I have expected the 80GB of data that is being read from the vms over and over again to be firmly recognised as the hot data and kept in the cache pool and read from it when guest vms request the data. Instead, I mainly get writes on the cache pool ssds and I am not really sure where these writes are coming from as my hdd osds are being pretty idle. >From the overall tests so far, introducing the cache pool has drastically >slowed down my cluster (by as much as 50-60%). Thanks for any help
Re: [ceph-users] Rebalancing slow I/O.
Irek, have you changed the ceph.conf file to adjust the recovery priority? Options like these might help with balancing repair/rebuild IO against client IO: osd_recovery_max_chunk = 8388608 osd_recovery_op_priority = 2 osd_max_backfills = 1 osd_recovery_max_active = 1 osd_recovery_threads = 1 Andrei - Original Message - From: "Irek Fasikhov" To: ceph-users@lists.ceph.com Sent: Thursday, 11 September, 2014 1:07:06 PM Subject: [ceph-users] Rebalancing slow I/O. Hi, All. DELL R720X8, 96 OSDs, Network 2x10Gbit LACP. When one of the nodes crashes, I get very slow I/O operations on virtual machines. The cluster map is the default. [ceph@ceph08 ~]$ ceph osd tree # id weight type name up/down reweight -1 262.1 root defaults -2 32.76 host ceph01 0 2.73 osd.0 up 1 ... 11 2.73 osd.11 up 1 -3 32.76 host ceph02 13 2.73 osd.13 up 1 .. 12 2.73 osd.12 up 1 -4 32.76 host ceph03 24 2.73 osd.24 up 1 35 2.73 osd.35 up 1 -5 32.76 host ceph04 37 2.73 osd.37 up 1 . 47 2.73 osd.47 up 1 -6 32.76 host ceph05 48 2.73 osd.48 up 1 ... 59 2.73 osd.59 up 1 -7 32.76 host ceph06 60 2.73 osd.60 down 0 ... 71 2.73 osd.71 down 0 -8 32.76 host ceph07 72 2.73 osd.72 up 1 83 2.73 osd.83 up 1 -9 32.76 host ceph08 84 2.73 osd.84 up 1 95 2.73 osd.95 up 1 If I change the cluster map to the following: root---| | |-rack1 | | | host ceph01 | host ceph02 | host ceph03 | host ceph04 | |---rack2 | host ceph05 host ceph06 host ceph07 host ceph08 What will the behaviour of the cluster be when one node fails? And how much will it affect the performance? Thank you -- Best regards, Фасихов Ирек Нургаязович Mob.: +79229045757 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
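If editing ceph.conf and restarting the osds is inconvenient, the same throttles can usually be pushed into running osds with injectargs; a sketch using the values listed above, applied cluster-wide:

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'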
Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?
Mark, Thanks for a very detailed email. Really apreciate your help on this. I now have a bit more understanding on how it works and understand why I am getting so much write on the cache ssds. I am however, trouble to understand why the cache pool is not keeping the data and flushing it? I've got the pool size of about 7x as large as the current benchmark set (80gb data set vs 500GB pool size), so the benchmark data should fit nicely many times over. I understand if there is a small percentage of the data is a cache miss, but from what it looks like it is missing a considerable amount. Is there a way to check the stats of the cache pool, including hit/miss information and other data? Yes, I am using firefly 0.80.5. Thanks Andrei - Original Message - From: "Mark Nelson" To: ceph-users@lists.ceph.com Sent: Thursday, 11 September, 2014 3:02:40 PM Subject: Re: [ceph-users] Cache Pool writing too much on ssds, poor performance? Something that is very important to keep in mind with the way that the cache tier implementation currently works in Ceph is that cache misses are very expensive. It's really important that your workload have a really big hot/cold data skew otherwise it's not going to work well at all. In your case, you are doing sequential reads which is terrible for this because for each pass you are never re-reading the same blocks, and by the time you get to the end of the test and restart it, the first blocks (apparently) have already been flushed. If you increased the size of the cache tier, you might be able to fit the whole thing in cache which should help dramatically, but that's not easy to guarantee outside of benchmarks. I'm guessing you are using firefly right? To improve this behaviour, Sage implemented a new policy in the recent development releases not to promote reads right away. Instead, we wait to promote until there are several reads of the same object within a certain time period. That should dramatically help in this case. You really don't want big sequential reads being promoted into cache since cache promotion is expensive and the disks are really good at doing that kind of thing anyway. On the flip side, 4MB read misses are bad, but the situation is even worse with 4K misses. Imagine for instance that you are going to do a 4K read from a default 4MB RBD block and that object is not in cache. In the implementation we have in firefly, the whole 4MB object will be promoted to the cache which will in most cases require a transfer of that object over the network to the primary OSD for the associated PG in the cache pool. Now depending on the replication policy, that primary cache pool OSD will fan out and write (by default) 2 extra copies of the data to the other OSDs in the PG, so 3 total. Now assuming your cache tier is on SSDs with co-located journals, each one of those writes will actually be 2 writes, one to the journal, and one to the data store. To recap: *Any* read miss regardless if it's 4KB or 4MB means at least 1 4MB object promotion, times 3 replicas (ie 12MB over the network) times 2 for the journal writes. So 24MB of data written to the cache tier disks, no matter if it's 4KB or 4MB. Imagine you have 200 IOPS worth of 4KB read cache misses. That's roughly 4.8GB/s of writes into the cache tier. If you are seldomly re-reading the same blocks, performance will be absolutely terrible. On the other hand, if you have lots of small random reads from the same set of 4MB objects, the cache tier really can help. 
How much it helps vs just doing the reads from page cache is debatable though. There's some band between page cache and disk where the cache tier fits in, but getting everything right is going to be tricky. The optimal situation imho is that the cache tier only promote objects that have a lot of small random reads hitting them, and be very conservative about promotions, especially for new writes. I don't know whether or not cache promotion might pollute page cache in strange ways, but that's something we also may need to be careful about. For more info, see the following thread: http://www.spinics.net/lists/ceph-devel/msg20189.html Mark On 09/10/2014 07:51 AM, Andrei Mikhailovsky wrote: > Hello guys, > > I am experimeting with cache pool and running some tests to see how > adding the cache pool improves the overall performance of our small cluster. > > While doing testing I've noticed that it seems that the cache pool is > writing too much on the cache pool ssds. Not sure what the issue here, > perhaps someone could help me understand what is going on. > > My test cluster is: > 2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds > journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of > 25gbit/s over ipoib).
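Spelling out the arithmetic behind the "roughly 4.8GB/s" figure in Mark's 4KB read-miss example:

per 4KB read miss:  4 MB object promoted x 3 replicas  = 12 MB moved over the network
journal + data:     12 MB x 2 writes per replica       = 24 MB written to the cache tier SSDs
at 200 misses/s:    24 MB x 200                        = 4800 MB/s, i.e. ~4.8GB/s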
[ceph-users] writing to rbd mapped device produces hang tasks
Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of= bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic . Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0003 0 3505 2 0x Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] 8803a10a7c48 0002 88007963b288 8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 00014540 00014540 880469f63260 880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] 8803a10a7c58 8803a10a7d88 88034d142100 880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444938] [] kjournald2+0xb8/0x240 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444941] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444943] [] ? commit_timeout+0x10/0x10 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444949] [] kthread+0xc9/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444952] [] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444965] [] ret_from_fork+0x7c/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444969] [] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.445141] INFO: task dd:21180 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.446595] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.448070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449910] dd D 0002 0 21180 19562 0x0002 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449913] 880485a1b5d8 0002 880485a1b5e8 880485a1bfd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449916] 00014540 00014540 880341833260 88011086cb90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449919] 880485a1b5a8 88046fc94e40 88011086cb90 81204da0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449921] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449927] [] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449930] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449934] [] io_schedule+0x8f/0xd0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449936] [] sleep_on_buffer+0xe/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449940] [] __wait_on_bit+0x62/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449945] [] ? bio_alloc_bioset+0xa0/0x1d0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449947] [] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449951] [] out_of_line_wait_on_bit+0x7c/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449954] [] ? wake_atomic_t_function+0x40/0x40 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449957] [] __wait_on_buffer+0x2e/0x30 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449962] [] ext4_wait_block_bitmap+0xd8/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449969] [] ext4
Re: [ceph-users] writing to rbd mapped device produces hang tasks
Hi guys, Following up on my previous message. I've tried to repeat the same experiment with 3.16.2 kernel and I still have the same behaviour. The dd process got stuck after running dd for 3 times in a row. The iostat shows that the rbd0 device is fully utilised, but without any activity on the device itself: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 135.00 0.00 0.00 0.00 0.00 100.00 I've also tried enabling and disabling rbd cache, which didn't make a difference. Could someone help me with debugging the issue and getting to the root cause? Thanks Andrei - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 12:04:15 AM Subject: [ceph-users] writing to rbd mapped device produces hang tasks Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of= bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0003 0 3505 2 0x Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] 8803a10a7c48 0002 88007963b288 8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 00014540 00014540 880469f63260 880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] 8803a10a7c58 8803a10a7d88 88034d142100 880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444938] [] kjournald2+0xb8/0x240 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444941] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444943] [] ? commit_timeout+0x10/0x10 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444949] [] kthread+0xc9/0xe0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444952] [] ? 
flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444965] [] ret_from_fork+0x7c/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444969] [] ? flush_kthread_worker+0xb0/0xb0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.445141] INFO: task dd:21180 blocked for more than 120 seconds. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.446595] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.448070] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449910] dd D 0002 0 21180 19562 0x0002 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449913] 880485a1b5d8 0002 880485a1b5e8 880485a1bfd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449916] 00014540 00014540 880341833260 88011086cb90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449919] 880485a1b5a8 88046fc94e40 88011086cb90 81204da0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449921] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.449927] [] ? __wait_on_buffer+0x30/0x30 Sep 13 21:24:
Re: [ceph-users] writing to rbd mapped device produces hang tasks
11:29:56 arh-ibstorage1-ib kernel: [ 1200.472476] [] writeback_sb_inodes+0x22e/0x340 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472479] [] __writeback_inodes_wb+0x9e/0xd0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472483] [] wb_writeback+0x28b/0x330 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472487] [] ? get_nr_dirty_inodes+0x52/0x80 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472490] [] wb_check_old_data_flush+0x9f/0xb0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472493] [] wb_do_writeback+0x134/0x1c0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472496] [] ? set_worker_desc+0x6f/0x80 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472500] [] bdi_writeback_workfn+0x78/0x1f0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472503] [] process_one_work+0x17f/0x4c0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472507] [] worker_thread+0x11b/0x3f0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472510] [] ? create_and_start_worker+0x80/0x80 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472513] [] kthread+0xc9/0xe0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472516] [] ? flush_kthread_worker+0xb0/0xb0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472520] [] ret_from_fork+0x7c/0xb0 Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472523] [] ? flush_kthread_worker+0xb0/0xb0 Cheers - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 11:34:07 AM Subject: Re: [ceph-users] writing to rbd mapped device produces hang tasks Hi guys, Following up on my previous message. I've tried to repeat the same experiment with 3.16.2 kernel and I still have the same behaviour. The dd process got stuck after running dd for 3 times in a row. The iostat shows that the rbd0 device is fully utilised, but without any activity on the device itself: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util rbd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 135.00 0.00 0.00 0.00 0.00 100.00 I've also tried enabling and disabling rbd cache, which didn't make a difference. Could someone help me with debugging the issue and getting to the root cause? Thanks Andrei - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 12:04:15 AM Subject: [ceph-users] writing to rbd mapped device produces hang tasks Hello guys, I've been trying to map an rbd disk to run some testing and I've noticed that while I can successfully read from the rbd image mapped to /dev/rbdX, I am failing to reliably write to it. Sometimes write tests work perfectly well, especially if I am using large block sizes. But often writes hang for a considerable amount of time and I have kernel hang task messages (shown below) in my dmesg. the hang tasks show particularly frequently when using 4K block size. However, with large block sizes writes also sometimes hang, but for sure less frequent I am using simple dd tests like: dd if=/dev/zero of= bs=4K count=250K. I am running firefly on Ubuntu 12.04 on all osd/mon servers. The rbd image is mapped on one of the osd servers. All osd servers are running kernel version 3.15.10-031510-generic. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.439974] INFO: task jbd2/rbd0-8:3505 blocked for more than 120 seconds. 
Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.441586] Not tainted 3.15.10-031510-generic #201408132333 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.443022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444862] jbd2/rbd0-8 D 0003 0 3505 2 0x Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444870] 8803a10a7c48 0002 88007963b288 8803a10a7fd8 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444874] 00014540 00014540 880469f63260 880866969930 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444876] 8803a10a7c58 8803a10a7d88 88034d142100 880848519824 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444879] Call Trace: Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444893] [] schedule+0x29/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444901] [] jbd2_journal_commit_transaction+0x240/0x1510 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444908] [] ? sched_clock_cpu+0x85/0xc0 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444920] [] ? arch_vtime_task_switch+0x8a/0x90 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444923] [] ? vtime_common_task_switch+0x3d/0x50 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444928] [] ? __wake_up_sync+0x20/0x20 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.444933] [] ? try_to_del_timer_sync+0x4f/0x70 Sep 13 21:24:30 arh-ibstorage2-ib kernel: [11880.4449
Re: [ceph-users] writing to rbd mapped device produces hang tasks
To answer my own question, I think I am hitting bug 8818 - http://tracker.ceph.com/issues/8818 . The solution seems to be to upgrade to the latest 3.17 kernel branch. Cheers - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Sunday, 14 September, 2014 11:38:07 AM Subject: Re: [ceph-users] writing to rbd mapped device produces hang tasks Oh, forgot to show the hung task message, which looks different on the 3.16.2 kernel:
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.467439] INFO: task kworker/u25:2:668 blocked for more than 120 seconds.
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.469032] Not tainted 3.16.2-031602-generic #201409052035
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.470474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472288] kworker/u25:2 D 0004 0 668 2 0x
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472299] Workqueue: writeback bdi_writeback_workfn (flush-251:0)
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472301] 88046713afe8 0046 8804560ccc60 88046713bfd8
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472304] 00014400 00014400 880469fb8000 880464753cc0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472307] 88011022f768 88011022f7c8 88011022f7cc 880464753cc0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472309] Call Trace:
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472313] [] schedule+0x29/0x70
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472316] [] schedule_preempt_disabled+0xe/0x10
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472319] [] __mutex_lock_slowpath+0xd5/0x1c0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472322] [] mutex_lock+0x23/0x37
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472330] [] ceph_osdc_start_request+0x42/0x80 [libceph]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472335] [] rbd_obj_request_submit+0x33/0x70 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472340] [] rbd_img_obj_request_submit+0xaa/0x100 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472344] [] rbd_img_request_submit+0x56/0x80 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472349] [] rbd_request_fn+0x2ac/0x350 [rbd]
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472356] [] __blk_run_queue+0x37/0x50
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472359] [] queue_unplugged+0x3d/0xc0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472364] [] blk_flush_plug_list+0x1d2/0x210
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472369] [] ? __wait_on_buffer+0x30/0x30
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472371] [] io_schedule+0x78/0xd0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472374] [] sleep_on_buffer+0xe/0x20
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472377] [] __wait_on_bit+0x62/0x90
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472381] [] ? bio_alloc_bioset+0xa0/0x1d0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472384] [] ? __wait_on_buffer+0x30/0x30
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472387] [] out_of_line_wait_on_bit+0x7c/0x90
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472393] [] ? wake_atomic_t_function+0x40/0x40
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472395] [] __wait_on_buffer+0x2e/0x30
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472401] [] ext4_wait_block_bitmap+0xd8/0xe0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472406] [] ext4_mb_init_cache+0x1de/0x750
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472411] [] ? pagecache_get_page+0xaa/0x1d0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472414] [] ext4_mb_init_group+0xbe/0x110
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472417] [] ext4_mb_load_buddy+0x380/0x3a0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472420] [] ext4_mb_find_by_goal+0xa3/0x310
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472423] [] ext4_mb_regular_allocator+0x5e/0x450
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472428] [] ? ext4_ext_find_extent+0x220/0x2a0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472431] [] ext4_mb_new_blocks+0x40a/0x540
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472435] [] ? ext4_ext_find_extent+0x120/0x2a0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472438] [] ? ext4_ext_check_overlap.isra.27+0xbc/0xd0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472441] [] ext4_ext_map_blocks+0x973/0xad0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472446] [] ? __ext4_es_shrink+0x210/0x2d0
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472450] [] ext4_map_blocks+0x15f/0x550
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472455] [] ? __pagevec_release+0x2c/0x40
Sep 14 11:29:56 arh-ibstorage1-ib kernel: [ 1200.472458] [] mpage_map_one_exten
[ceph-users] Bcache / Enhanceio with osds
Hello guys, Was wondering if anyone has used or done any testing with bcache or enhanceio caching in front of ceph osds? I've got a small cluster of 2 osd servers, 16 osds in total and 4 ssds for journals. I've recently purchased four additional ssds to be used for a ceph cache pool, but I've found the performance of guest vms to be slower with the cache pool in many benchmarks. The write performance has slightly improved, but the read performance has suffered a lot (as much as 60% in some tests). Therefore, I am planning to scrap the cache pool (at least until it matures) and use either bcache or enhanceio instead. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
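For anyone who wants to try the bcache route, a minimal sketch of putting bcache in front of a single OSD data disk looks roughly like the following. It assumes bcache-tools is installed and uses hypothetical device names (/dev/sdb for the OSD spindle, /dev/sdc1 for the SSD cache partition); treat it as a starting point to test on a spare OSD, not a verified procedure.

# format the backing and caching devices (hypothetical names)
make-bcache -B /dev/sdb
make-bcache -C /dev/sdc1

# find the cache set uuid and attach the backing device to it
bcache-super-show /dev/sdc1 | grep cset.uuid
echo <cset-uuid> > /sys/block/sdb/bcache/attach

# optional: switch from the default writethrough to writeback caching
echo writeback > /sys/block/bcache0/bcache/cache_mode

# build the OSD filesystem on the resulting bcache device
mkfs.xfs /dev/bcache0
mount /dev/bcache0 /var/lib/ceph/osd/ceph-0

The journal can stay on its existing SSD partition; only the filestore data path goes through the bcache device.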
[ceph-users] Cache pool stats
Hi Does anyone know how to check the basic cache pool stats, i.e. how well the cache layer is working over a recent or historic time frame? Things like the cache hit ratio would be very helpful. Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
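In the absence of a dedicated hit-ratio report, a rough picture can be pieced together from the pool and OSD counters. A minimal sketch, assuming a cache pool named 'cache-pool' (hypothetical) and that your release exposes the tier_* perf counters (the exact counter names can vary between versions):

# per-pool usage and object counts
ceph df detail
rados df

# per-pool read/write rates over time
ceph osd pool stats cache-pool

# cache-tier activity counters on an OSD that backs the cache pool
ceph daemon osd.0 perf dump | grep -E 'tier_(promote|evict|flush|dirty|clean)'

Comparing the promote counter against the pool's total reads over an interval gives an approximate hit ratio, since every promotion corresponds to a miss in the cache tier.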
Re: [ceph-users] Bcache / Enhanceio with osds
- Original Message - > From: "Mark Nelson" > To: ceph-users@lists.ceph.com > Sent: Monday, 15 September, 2014 1:13:01 AM > Subject: Re: [ceph-users] Bcache / Enhanceio with osds > On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote: > > Hello guys, > > > > Was wondering if anyone uses or done some testing with using bcache > > or > > enhanceio caching in front of ceph osds? > > > > I've got a small cluster of 2 osd servers, 16 osds in total and 4 > > ssds > > for journals. I've recently purchased four additional ssds to be > > used > > for ceph cache pool, but i've found performance of guest vms to be > > slower with the cache pool for many benchmarks. The write > > performance > > has slightly improved, but the read performance has suffered a lot > > (as > > much as 60% in some tests). > > > > Therefore, I am planning to scrap the cache pool (at least until it > > matures) and use either bcache or enhanceio instead. > We're actually looking at dm-cache a bit right now. (and talking some > of > the developers about the challenges they are facing to help improve > our > own cache tiering) No meaningful benchmarks of dm-cache yet though. > Bcache, enhanceio, and flashcache all look interesting too. Regarding > the cache pool: we've got a couple of ideas that should help improve > performance, especially for reads. Mark, do you mind sharing these ideas with the rest of cephers? Can these ideas be implemented on the existing firefly install? > There are definitely advantages to > keeping cache local to the node though. I think some form of local > node > caching could be pretty useful going forward. What do you mean by the local to the node? Do you mean the use of cache disks on the hypervisor level? Or do you mean using cache ssd disks on the osd servers rather than creating a separate cache tier hardware? Thanks > > > > Thanks > > > > Andrei > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] XenServer and Ceph - any updates?
Hello guys, I was wondering if there have been any updates on getting XenServer ready for ceph? I've seen a howto that was written well over a year ago (I think) for a PoC integration of XenServer and Ceph. However, I've not seen any developments lately. It would be cool to see other hypervisors adopting Ceph )) Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bcache / Enhanceio with osds
I've done a bit of testing with Enhanceio on my cluster and I can see a definite improvement in read performance for cached data. The performance increase is around 3-4 times the cluster speed prior to using enhanceio, based on large block size IO (1M and 4M). I've done a concurrent test of running a single "dd if=/dev/vda of=/dev/null bs=1M/4M iflag=direct" instance over 20 vms which were running on 4 host servers. Prior to enhanceio I was getting around 30-35MB/s per guest vm regardless of how many times I ran the test. With enhanceio (from the second run) I was hitting over 130MB/s per vm. I've not seen any lag in the performance of other vms while using enhanceio, unlike the considerable lag without it. The ssd disk utilisation was not hitting much over 60%. The small block size (4K) performance hasn't changed with enhanceio, which made me think that the performance of the osds themselves is limited when using small block sizes. I wasn't getting much over 2-3MB/s per guest vm. On the contrary, when I tried to use the firefly cache pool on the same hardware, my cluster performed significantly slower with the cache pool. The whole cluster seemed under a lot more load, the performance dropped to around 12-15MB/s and other guest vms were very very slow. The ssd disks were utilised 100% all the time during the test, with the majority of the IO being writes. I admit that these tests shouldn't be considered definitive, thorough performance tests of a ceph cluster, as this is a live cluster with disk io activity outside of the test vms. The average load is not much (300-500 IO/s), mainly reads. However, it still indicates that there is room for improvement in ceph's cache pool implementation. Looking at my results, I think ceph is missing a lot of hits on the read cache, which causes the osds to write a lot of data. With enhanceio I was getting well over 50% read hit ratio, and the main activity on the ssds was read io, unlike with ceph. Outside of the tests, I've left enhanceio running on the osd servers. It has been a few days now and the hit ratio on the osds is around 8-11%, which seems a bit low. I was wondering if I should change the default block size of enhanceio to 2K instead of the default 4K. Taking into account ceph's object size of 4M, I am not sure if this will help the hit ratio. Does anyone have an idea? Andrei - Original Message - > From: "Mark Nelson" > To: "Robert LeBlanc" , "Mark Nelson" > > Cc: ceph-users@lists.ceph.com > Sent: Monday, 22 September, 2014 10:49:42 PM > Subject: Re: [ceph-users] Bcache / Enhanceio with osds > Likely it won't since the OSD is already coalescing journal writes. > FWIW, I ran through a bunch of tests using seekwatcher and blktrace > at > 4k, 128k, and 4m IO sizes on a 4 OSD cluster (3x replication) to get > a > feel for what the IO patterns are like for the dm-cache developers. I > included both the raw blktrace data and seekwatcher graphs here: > http://nhm.ceph.com/firefly_blktrace/ > there are some interesting patterns but they aren't too easy to spot > (I > don't know why the Chris decided to use blue and green by default!) > Mark > On 09/22/2014 04:32 PM, Robert LeBlanc wrote: > > We are still in the middle of testing things, but so far we have > > had > > more improvement with SSD journals than the OSD cached with bcache > > (five > > OSDs fronted by one SSD). We still have yet to test if adding a > > bcache > > layer in addition to the SSD journals provides any additional > > improvements.
> > > > Robert LeBlanc > > > > On Sun, Sep 14, 2014 at 6:13 PM, Mark Nelson > > > <mailto:mark.nel...@inktank.com>> wrote: > > > > On 09/14/2014 05:11 PM, Andrei Mikhailovsky wrote: > > > > Hello guys, > > > > Was wondering if anyone uses or done some testing with using > > bcache or > > enhanceio caching in front of ceph osds? > > > > I've got a small cluster of 2 osd servers, 16 osds in total and > > 4 ssds > > for journals. I've recently purchased four additional ssds to be > > used > > for ceph cache pool, but i've found performance of guest vms to be > > slower with the cache pool for many benchmarks. The write > > performance > > has slightly improved, but the read performance has suffered a > > lot (as > > much as 60% in some tests). > > > > Therefore, I am planning to scrap the cache pool (at least until it > > matures) and use either bcache or enhanceio instead. > > > > > > We're actually looking at dm-cache a bit right now. (and talking > > some of the developers about the challenges
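For reference, the hit ratio and block size mentioned above come from EnhanceIO's proc interface. A minimal sketch, assuming the cache was created with eio_cli under the name 'osd_cache' (a hypothetical name; the exact /proc layout may differ between EnhanceIO versions):

# current cache configuration, including block size and caching mode
cat /proc/enhanceio/osd_cache/config

# cumulative counters; reads/writes versus the hit counters give the hit ratio
cat /proc/enhanceio/osd_cache/stats

# error counters are worth keeping an eye on with a live cluster behind the cache
cat /proc/enhanceio/osd_cache/errors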
Re: [ceph-users] ceph backups
Luis, you may want to take a look at the rbd export/import and export-diff / import-diff functionality. This could be used to copy data to another cluster or offsite. S3 has regions, which you could use for async replication. Not sure how CephFS would work for backups. Andrei - Original Message - > From: "Luis Periquito" > To: ceph-users@lists.ceph.com > Sent: Tuesday, 23 September, 2014 11:28:39 AM > Subject: [ceph-users] ceph backups > Hi fellow cephers, > I'm being asked questions around our backup of ceph, mainly due to > data deletion. > We are currently using ceph to store RBD, S3 and eventually cephFS; > and we would like to be able to devise a plan to backup the > information as to avoid issues with data being deleted from the > cluster. > I know RBD has the snapshots, but how can they be automated? Can we > rely on them to perform data recovery? > And for S3/CephFS? Are there any backup methods? Other than copying > all the information into another location? > thanks, > Luis > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
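To make the export-diff suggestion concrete, here is a minimal sketch of a snapshot-based incremental copy of one RBD image to a second cluster. Pool, image and snapshot names are hypothetical, and it assumes the second cluster is reachable locally via /etc/ceph/backup.conf so that --cluster backup works:

# one-off: create a destination image of the same size and ship the initial state
rbd --cluster backup create rbd/vm-disk --size 102400
rbd snap create rbd/vm-disk@backup-1
rbd export-diff rbd/vm-disk@backup-1 - | rbd --cluster backup import-diff - rbd/vm-disk

# subsequent runs: ship only the blocks changed between the last two snapshots
rbd snap create rbd/vm-disk@backup-2
rbd export-diff --from-snap backup-1 rbd/vm-disk@backup-2 - | rbd --cluster backup import-diff - rbd/vm-disk
rbd snap rm rbd/vm-disk@backup-1

import-diff creates the matching snapshot on the destination, so the next incremental run has a consistent reference point; wrapping this in cron covers the automation part of the question.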
[ceph-users] features of the next stable release
Hi cephers, I've got three questions: 1. Does anyone have an estimate of the release date of the next stable ceph branch? 2. Will the new stable release have improvements in the following areas: a) working with ssd disks; b) the cache tier? 3. Will the new stable release introduce support for native RDMA / Infiniband networking without the need to use IP over Infiniband? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] features of the next stable release
- Original Message - > I'm not sure what you mean about improvements for SSD disks, but the > OSD should be generally a bit faster. There are several cache tier > improvements included that should improve performance on most > workloads that others can speak about in more detail than I. What I mean by the SSD disk improvement is that currently a cluster made entirely of ssd disks is pretty slow. You will not get the IO throughput of the SSDs. My tests show the limit seems to be around 3K IOPS, even though the ssds can easily do 50K+ IOPS. This makes it impossible to run a decent database workload on the ceph cluster. > -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
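As a concrete way to reproduce that 3K figure, a minimal sketch using rados bench against an SSD-backed pool (the pool name 'ssd' is hypothetical, and flags may differ slightly between releases); fio inside a guest against an RBD volume gives the application view, but this isolates the RADOS layer:

# 4K writes, 64 concurrent ops, keep the objects around for the read test
rados bench -p ssd 60 write -b 4096 -t 64 --no-cleanup

# 4K reads over the objects written above
rados bench -p ssd 60 rand -t 64

# remove the benchmark objects afterwards
rados -p ssd cleanup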
Re: [ceph-users] ceph Performance with SSD journal
Not sure what exact brands of samsung you have, but i've got the 840 Pro and it sucks big time. its is slow and unreliable and halts to a stand still over a period of time due to the trimming issue. Even after i've left unreserved like 50% of the disk. Unlike the Intel disks (even the consumer brand like 520 and 530 are just way better. I will stay away from any samsung drives in the future. Andrei - Original Message - > From: "Sumit Gaur" > To: "Irek Fasikhov" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 13 February, 2015 1:09:38 PM > Subject: Re: [ceph-users] ceph Performance with SSD journal > Hi Irek, > I am using v0.80.5 Firefly > -sumit > On Fri, Feb 13, 2015 at 1:30 PM, Irek Fasikhov < malm...@gmail.com > > wrote: > > Hi. > > > What version? > > > 2015-02-13 6:04 GMT+03:00 Sumit Gaur < sumitkg...@gmail.com > : > > > > Hi Chir, > > > > > > Please fidn my answer below in blue > > > > > > On Thu, Feb 12, 2015 at 12:42 PM, Chris Hoy Poy < ch...@gopc.net > > > > > > > wrote: > > > > > > > Hi Sumit, > > > > > > > > > > A couple questions: > > > > > > > > > > What brand/model SSD? > > > > > > > > > samsung 480G SSD(PM853T) having random write 90K IOPS (4K, > > > 368MBps) > > > > > > > What brand/model HDD? > > > > > > > > > 64GB memory, 300GB SAS HDD (seagate) , 10Gb nic > > > > > > > Also how they are connected to controller/motherboard? Are they > > > > sharing a bus (ie SATA expander)? > > > > > > > > > no , They are connected with local Bus not the SATA expander. > > > > > > > RAM? > > > > > > > > > 64GB > > > > > > > Also look at the output of "iostat -x" or similiar, are the > > > > SSDs > > > > hitting 100% utilisation? > > > > > > > > > No, SSD was hitting 2000 iops only. > > > > > > > I suspect that the 5:1 ratio of HDDs to SDDs is not ideal, you > > > > now > > > > have 5x the write IO trying to fit into a single SSD. > > > > > > > > > I have not seen any documented reference to calculate the ratio. > > > Could you suggest one. Here I want to mention that results for > > > 1024K > > > write improve a lot. Problem is with 1024K read and 4k write . > > > > > > SSD journal 810 IOPS and 810MBps > > > > > > HDD journal 620 IOPS and 620 MBps > > > > > > > I'll take a punt on it being a SATA connected SSD (most > > > > common), > > > > 5x > > > > ~130 megabytes/second gets very close to most SATA bus limits. > > > > If > > > > its a shared BUS, you possibly hit that limit even earlier > > > > (since > > > > all that data is now being written twice out over the bus). > > > > > > > > > > cheers; > > > > > > > > > > \Chris > > > > > > > > > > From: "Sumit Gaur" < sumitkg...@gmail.com > > > > > > > > > > > To: ceph-users@lists.ceph.com > > > > > > > > > > Sent: Thursday, 12 February, 2015 9:23:35 AM > > > > > > > > > > Subject: [ceph-users] ceph Performance with SSD journal > > > > > > > > > > Hi Ceph -Experts, > > > > > > > > > > Have a small ceph architecture related question > > > > > > > > > > As blogs and documents suggest that ceph perform much better if > > > > we > > > > use journal on SSD . > > > > > > > > > > I have made the ceph cluster with 30 HDD + 6 SSD for 6 OSD > > > > nodes. > > > > 5 > > > > HDD + 1 SSD on each node and each SSD have 5 partition for > > > > journaling 5 OSDs on the node. > > > > > > > > > > Now I ran similar test as I ran for all HDD setup. > > > > > > > > > > What I saw below two reading goes in wrong direction as > > > > expected > > > > > > > > > > 1) 4K write IOPS are less for SSD setup, though not major > > > > difference > > > > but less. 
> > > > > > > > > > 2) 1024K Read IOPS are less for SSD setup than HDD setup. > > > > > > > > > > On the other hand 4K read and 1024K write both have much better > > > > numbers for SSD setup. > > > > > > > > > > Let me know if I am missing some obvious concept. > > > > > > > > > > Thanks > > > > > > > > > > sumit > > > > > > > > > > ___ > > > > > > > > > > ceph-users mailing list > > > > > > > > > > ceph-users@lists.ceph.com > > > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > > ___ > > > > > > ceph-users mailing list > > > > > > ceph-users@lists.ceph.com > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > -- > > > С уважением, Фасихов Ирек Нургаязович > > > Моб.: +79229045757 > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
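One thing worth checking with a setup like this is how the journal SSD copes with the journal's actual write pattern, which is small synchronous writes rather than the large asynchronous writes quoted on spec sheets. A minimal sketch (the device name is hypothetical, and the dd test writes to the raw partition, so only run it against an unused one):

# journal-like workload: small direct, synchronous writes
dd if=/dev/zero of=/dev/sdX1 bs=4k count=10000 oflag=direct,dsync

# the same idea with fio, which also reports IOPS and latency percentiles
fio --name=journal-test --filename=/dev/sdX1 --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 --runtime=60 --time_based

Drives that look fast in ordinary benchmarks can drop to a few hundred IOPS under O_DSYNC, which would explain a 5:1 HDD-to-SSD ratio falling over.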
Re: [ceph-users] Introducing "Learning Ceph" : The First ever Book on Ceph
Yeah, guys, thanks! I've got it a few days ago and done a few chapters already. Well done! Andrei - Original Message - > From: "Wido den Hollander" > To: ceph-users@lists.ceph.com > Sent: Friday, 13 February, 2015 5:38:47 PM > Subject: Re: [ceph-users] Introducing "Learning Ceph" : The First > ever Book on Ceph > On 05-02-15 23:53, Karan Singh wrote: > > Hello Community Members > > > > I am happy to introduce the first book on Ceph with the title > > “*Learning > > Ceph*”. > > > > Me and many folks from the publishing house together with technical > > reviewers spent several months to get this book compiled and > > published. > > > > Finally the book is up for sale on , i hope you would like it and > > surely > > will learn a lot from it. > > > Great! Just ordered myself a copy! > > Amazon : > > http://www.amazon.com/Learning-Ceph-Karan-Singh/dp/1783985623/ref=sr_1_1?s=books&ie=UTF8&qid=1423174441&sr=1-1&keywords=ceph > > Packtpub : > > https://www.packtpub.com/application-development/learning-ceph > > > > You can grab the sample copy from here : > > https://www.dropbox.com/s/ek76r01r9prs6pb/Learning_Ceph_Packt.pdf?dl=0 > > > > *Finally , I would like to express my sincere thanks to * > > > > *Sage Weil* - For developing Ceph and everything around it as well > > as > > writing foreword for “Learning Ceph”. > > *Patrick McGarry *- For his usual off the track support that too > > always. > > > > Last but not the least , to our great community members , who are > > also > > reviewers of the book *Don Talton , Julien Recurt , Sebastien Han > > *and > > *Zihong Chen *, Thank you guys for your efforts. > > > > > > > > Karan Singh > > Systems Specialist , Storage Platforms > > CSC - IT Center for Science, > > Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland > > mobile: +358 503 812758 > > tel. +358 9 4572001 > > fax +358 9 4572302 > > http://www.csc.fi/ > > > > > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- > Wido den Hollander > 42on B.V. > Phone: +31 (0)20 700 9902 > Skype: contact42on > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison
Mark, many thanks for your effort and the ceph performance tests. This puts things in perspective. Looking at the results, I was a bit concerned that the IOPS performance in neither release comes even marginally close to the capabilities of the underlying ssd device. Even the fastest PCI ssds have only managed to achieve about 1/6th of the IOPS of the raw device. I guess there is a great deal more optimisation to be done in the upcoming LTS releases to bring the IOPS rate close to the raw device performance. I have done some testing in the past and noticed that despite the server having a lot of unused resources (about 40-50% server idle and about 60-70% ssd idle), ceph would not perform well when used with ssds. I was testing with Firefly + auth and my IOPS rate was around the 3K mark. Something is holding ceph back from performing well with ssds ((( Andrei - Original Message - > From: "Mark Nelson" > To: "ceph-devel" > Cc: ceph-users@lists.ceph.com > Sent: Tuesday, 17 February, 2015 5:37:01 PM > Subject: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore > performance comparison > Hi All, > I wrote up a short document describing some tests I ran recently to > look > at how SSD backed OSD performance has changed across our LTS > releases. > This is just looking at RADOS performance and not RBD or RGW. It also > doesn't offer any real explanations regarding the results. It's just > a > first high level step toward understanding some of the behaviors > folks > on the mailing list have reported over the last couple of releases. I > hope you find it useful. > Mark > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
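When chasing this kind of gap it also helps to separate the OSD's internal write path from the client path. A minimal sketch; osd bench accepts optional total-bytes and block-size arguments, but recent releases cap them, so the defaults (roughly 1 GB in 4 MB writes) are used here:

# synthetic write test executed inside osd.0, bypassing the client and network stack
ceph tell osd.0 bench

# commit/apply latency per OSD while a client benchmark is running
ceph osd perf

If osd bench gets close to raw-device speed while client IOPS stay around the 3K mark, the bottleneck is in the messenger and PG layers rather than in the disks themselves.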
Re: [ceph-users] Extreme slowness in SSD cluster with 3 nodes and 9 OSD with 3.16-3 kernel
Martin, I have been using Samsung 840 Pro for journals about 2 years now and have just replaced all my samsung drives with Intel. We have found a lot of performance issues with 840 Pro (we are using 128mb). In particular, a very strange behaviour with using 4 partitions (with 50% underprovisioning left as empty unpartitioned space on the drive) where the drive would grind to almost a halt after a few weeks of use. I was getting 100% utilisation on the drives doing just 3-4MB/s writes. This was not the case when I've installed the new drives. Manual Trimming helps for a few weeks until the same happens again. This has been happening with all 840 Pro ssds that we have and contacting Samsung Support has proven to be utterly useless. They do not want to speak with you until you install windows and run their monkey utility ((. Also, i've noticed the latencies of the Samsung 840 Pro ssd drives to be about 15-20 slower compared with a consumer grade Intel drives, like Intel 520. According to ceph osd pef, I would consistently get higher figures on the osds with Samsung journal drive compared with the Intel drive on the same server. Something like 2-3ms for Intel vs 40-50ms for Samsungs. At some point we had enough with Samsungs and scrapped them. Andrei - Original Message - > From: "Martin B Nielsen" > To: "Philippe Schwarz" > Cc: ceph-users@lists.ceph.com > Sent: Saturday, 28 February, 2015 11:51:57 AM > Subject: Re: [ceph-users] Extreme slowness in SSD cluster with 3 > nodes and 9 OSD with 3.16-3 kernel > Hi, > I cannot recognize that picture; we've been using samsumg 840 pro in > production for almost 2 years now - and have had 1 fail. > We run a 8node mixed ssd/platter cluster with 4x samsung 840 pro > (500gb) in each so that is 32x ssd. > They've written ~25TB data in avg each. > Using the dd you had inside an existing semi-busy mysql-guest I get: > 10240 bytes (102 MB) copied, 5.58218 s, 18.3 MB/s > Which is still not a lot, but I think it is more a limitation of our > setup/load. > We are using dumpling. > All that aside, I would prob. go with something tried and tested if I > was to redo it today - we haven't had any issues, but it is still > nice to use something you know should have a baseline performance > and can compare to that. > Cheers, > Martin > On Sat, Feb 28, 2015 at 12:32 PM, Philippe Schwarz < > p...@schwarz-fr.net > wrote: > > -BEGIN PGP SIGNED MESSAGE- > > > Hash: SHA1 > > > Le 28/02/2015 12:19, mad Engineer a écrit : > > > > Hello All, > > > > > > > > I am trying ceph-firefly 0.80.8 > > > > (69eaad7f8308f21573c604f121956e64679a52a7) with 9 OSD ,all > > > Samsung > > > > SSD 850 EVO on 3 servers with 24 G RAM,16 cores @2.27 Ghz Ubuntu > > > > 14.04 LTS with 3.16-3 kernel.All are connected to 10G ports with > > > > maximum MTU.There are no extra disks for journaling and also > > > there > > > > are no separate network for replication and data transfer.All 3 > > > > nodes are also hosting monitoring process.Operating system runs > > > on > > > > SATA disk. > > > > > > > > When doing a sequential benchmark using "dd" on RBD, mounted on > > > > client as ext4 its taking 110s to write 100Mb data at an average > > > > speed of 926Kbps. 
> > > > > > > > time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct > > > > 25000+0 records in 25000+0 records out 10240 bytes (102 MB) > > > > copied, 110.582 s, 926 kB/s > > > > > > > > real 1m50.585s user 0m0.106s sys 0m2.233s > > > > > > > > While doing this directly on ssd mount point shows: > > > > > > > > time dd if=/dev/zero of=hello bs=4k count=25000 oflag=direct > > > > 25000+0 records in 25000+0 records out 10240 bytes (102 MB) > > > > copied, 1.38567 s, 73.9 MB/s > > > > > > > > OSDs are in XFS with these extra arguments : > > > > > > > > rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M > > > > > > > > ceph.conf > > > > > > > > [global] fsid = 7d889081-7826-439c-9fe5-d4e57480d9be > > > > mon_initial_members = ceph1, ceph2, ceph3 mon_host = > > > > 10.99.10.118,10.99.10.119,10.99.10.120 auth_cluster_required = > > > > cephx auth_service_required = cephx auth_client_required = cephx > > > > filestore_xattr_use_omap = true osd_pool_default_size = 2 > > > > osd_pool_default_min_size = 2 osd_pool_default_pg_num = 450 > > > > osd_pool_default_pgp_num = 450 max_open_files = 131072 > > > > > > > > [osd] osd_mkfs_type = xfs osd_op_threads = 8 osd_disk_threads = 4 > > > > osd_mount_options_xfs = > > > > "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M" > > > > > > > > > > > > on our traditional storage with Full SAS disk, same "dd" > > > completes > > > > in 16s with an average write speed of 6Mbps. > > > > > > > > Rados bench: > > > > > > > > rados bench -p rbd 10 write Maintaining 16 concurrent writes of > > > > 4194304 bytes for up to 10 seconds or 0 objects Object prefix: > > > > benchmark_data_ceph1_2977 sec Cur
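A quick way to tell whether a particular journal SSD is dragging its OSDs down is to compare per-OSD latencies with the drive's own utilisation and health figures. A minimal sketch (device name hypothetical):

# commit/apply latency per OSD; OSDs journalling to a struggling SSD stand out here
ceph osd perf

# utilisation and await of the journal device while a benchmark is running
iostat -x 5

# wear and media health of the SSD itself
smartctl -a /dev/sdX | egrep -i 'wear|percent|reallocated|media'

This is the comparison behind the 2-3ms versus 40-50ms figures mentioned above.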
Re: [ceph-users] SSD selection
I would not use a single ssd for 5 osds. I would recommend the 3-4 osds max per ssd or you will get the bottleneck on the ssd side. I've had a reasonable experience with Intel 520 ssds (which are not produced anymore). I've found Samsung 840 Pro to be horrible! Otherwise, it seems that everyone here recommends the DC3500 or DC3700 and it has the best wear per $ ratio out of all the drives. Andrei - Original Message - > From: "Tony Harris" > To: "Christian Balzer" > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 1 March, 2015 4:19:30 PM > Subject: Re: [ceph-users] SSD selection > Well, although I have 7 now per node, you make a good point and I'm > in a position where I can either increase to 8 and split 4/4 and > have 2 ssds, or reduce to 5 and use a single osd per node (the > system is not in production yet). > Do all the DC lines have caps in them or just the DC s line? > -Tony > On Sat, Feb 28, 2015 at 11:21 PM, Christian Balzer < ch...@gol.com > > wrote: > > On Sat, 28 Feb 2015 20:42:35 -0600 Tony Harris wrote: > > > > Hi all, > > > > > > > > I have a small cluster together and it's running fairly well (3 > > > nodes, 21 > > > > osds). I'm looking to improve the write performance a bit though, > > > which > > > > I was hoping that using SSDs for journals would do. But, I was > > > wondering > > > > what people had as recommendations for SSDs to act as journal > > > drives. > > > > If I read the docs on ceph.com correctly, I'll need 2 ssds per > > > node > > > > (with 7 drives in each node, I think the recommendation was 1ssd > > > per 4-5 > > > > drives?) so I'm looking for drives that will work well without > > > breaking > > > > the bank for where I work (I'll probably have to purchase them > > > myself > > > > and donate, so my budget is somewhat small). Any suggestions? I'd > > > > prefer one that can finish its write in a power outage case, the > > > only > > > > one I know of off hand is the intel dcs3700 I think, but at $300 > > > it's > > > > WAY above my affordability range. > > > Firstly, an uneven number of OSDs (HDDs) per node will bite you in > > the > > > proverbial behind down the road when combined with journal SSDs, as > > one of > > > those SSDs will wear our faster than the other. > > > Secondly, how many SSDs you need is basically a trade-off between > > price, > > > performance, endurance and limiting failure impact. > > > I have cluster where I used 4 100GB DC S3700s with 8 HDD OSDs, > > optimizing > > > the write paths and IOPS and failure domain, but not the sequential > > speed > > > or cost. > > > Depending on what your write load is and the expected lifetime of > > this > > > cluster, you might be able to get away with DC S3500s or even > > better > > the > > > new DC S3610s. > > > Keep in mind that buying a cheap, low endurance SSD now might cost > > you > > > more down the road if you have to replace it after a year (TBW/$). > > > All the cheap alternatives to DC level SSDs tend to wear out too > > fast, > > > have no powercaps and tend to have unpredictable (caused by garbage > > > collection) and steadily decreasing performance. > > > Christian > > > -- > > > Christian Balzer Network/Systems Engineer > > > ch...@gol.com Global OnLine Japan/Fusion Communications > > > http://www.gol.com/ > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD selection
I am not sure about the enterprise grade and underprovisioning, but for the Intel 520s i've got 240gbs (the speeds of 240 is a bit better than 120s). and i've left 50% underprovisioned. I've got 10GB for journals and I am using 4 osds per ssd. Andrei - Original Message - > From: "Tony Harris" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com, "Christian Balzer" > Sent: Sunday, 1 March, 2015 8:49:56 PM > Subject: Re: [ceph-users] SSD selection > Ok, any size suggestion? Can I get a 120 and be ok? I see I can get > DCS3500 120GB for within $120/drive so it's possible to get 6 of > them... > -Tony > On Sun, Mar 1, 2015 at 12:46 PM, Andrei Mikhailovsky < > and...@arhont.com > wrote: > > I would not use a single ssd for 5 osds. I would recommend the 3-4 > > osds max per ssd or you will get the bottleneck on the ssd side. > > > I've had a reasonable experience with Intel 520 ssds (which are not > > produced anymore). I've found Samsung 840 Pro to be horrible! > > > Otherwise, it seems that everyone here recommends the DC3500 or > > DC3700 and it has the best wear per $ ratio out of all the drives. > > > Andrei > > > > From: "Tony Harris" < neth...@gmail.com > > > > > > > To: "Christian Balzer" < ch...@gol.com > > > > > > > Cc: ceph-users@lists.ceph.com > > > > > > Sent: Sunday, 1 March, 2015 4:19:30 PM > > > > > > Subject: Re: [ceph-users] SSD selection > > > > > > Well, although I have 7 now per node, you make a good point and > > > I'm > > > in a position where I can either increase to 8 and split 4/4 and > > > have 2 ssds, or reduce to 5 and use a single osd per node (the > > > system is not in production yet). > > > > > > Do all the DC lines have caps in them or just the DC s line? > > > > > > -Tony > > > > > > On Sat, Feb 28, 2015 at 11:21 PM, Christian Balzer < > > > ch...@gol.com > > > > > > > wrote: > > > > > > > On Sat, 28 Feb 2015 20:42:35 -0600 Tony Harris wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > I have a small cluster together and it's running fairly well > > > > > (3 > > > > > nodes, 21 > > > > > > > > > > > osds). I'm looking to improve the write performance a bit > > > > > though, > > > > > which > > > > > > > > > > > I was hoping that using SSDs for journals would do. But, I > > > > > was > > > > > wondering > > > > > > > > > > > what people had as recommendations for SSDs to act as journal > > > > > drives. > > > > > > > > > > > If I read the docs on ceph.com correctly, I'll need 2 ssds > > > > > per > > > > > node > > > > > > > > > > > (with 7 drives in each node, I think the recommendation was > > > > > 1ssd > > > > > per 4-5 > > > > > > > > > > > drives?) so I'm looking for drives that will work well > > > > > without > > > > > breaking > > > > > > > > > > > the bank for where I work (I'll probably have to purchase > > > > > them > > > > > myself > > > > > > > > > > > and donate, so my budget is somewhat small). Any suggestions? > > > > > I'd > > > > > > > > > > > prefer one that can finish its write in a power outage case, > > > > > the > > > > > only > > > > > > > > > > > one I know of off hand is the intel dcs3700 I think, but at > > > > > $300 > > > > > it's > > > > > > > > > > > WAY above my affordability range. > > > > > > > > > > Firstly, an uneven number of OSDs (HDDs) per node will bite you > > > > in > > > > the > > > > > > > > > > proverbial behind down the road when combined with journal > > > > SSDs, > > > > as > > > > one of > > > > > > > > > > those SSDs will wear our f
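To make the 4-osds-per-ssd layout concrete, here is a minimal sketch of carving four 10GB journal partitions out of one SSD and pointing the OSDs at them. Device names and partition labels are hypothetical, and on an existing OSD the journal has to be flushed and recreated rather than simply repointed:

# four 10 GB journal partitions on the SSD (hypothetical /dev/sdc)
parted -s /dev/sdc mklabel gpt
parted -s /dev/sdc mkpart journal-0 1MiB 10GiB
parted -s /dev/sdc mkpart journal-1 10GiB 20GiB
parted -s /dev/sdc mkpart journal-2 20GiB 30GiB
parted -s /dev/sdc mkpart journal-3 30GiB 40GiB

# ceph.conf fragment
[osd]
osd journal size = 10240

[osd.0]
osd journal = /dev/disk/by-partlabel/journal-0

# moving an existing OSD's journal: stop the OSD first, then
ceph-osd -i 0 --flush-journal
ceph-osd -i 0 --mkjournal
# and start the OSD again

The shell commands and the ceph.conf fragment are shown in one listing only for brevity; the [osd] sections go into /etc/ceph/ceph.conf on the OSD host.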
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
In a long term use I also had some issues with flashcache and enhanceio. I've noticed frequent slow requests. Andrei - Original Message - > From: "Robert LeBlanc" > To: "Nick Fisk" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 20 March, 2015 8:14:16 PM > Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition uuid > We tested bcache and abandoned it for two reasons. > 1. Didn't give us any better performance than journals on SSD. > 2. We had lots of corruption of the OSDs and were rebuilding them > frequently. > Since removing them, the OSDs have been much more stable. > On Fri, Mar 20, 2015 at 4:03 AM, Nick Fisk < n...@fisk.me.uk > wrote: > > > -Original Message- > > > > From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On > > > Behalf Of > > > > Burkhard Linke > > > > Sent: 20 March 2015 09:09 > > > > To: ceph-users@lists.ceph.com > > > > Subject: Re: [ceph-users] OSD + Flashcache + udev + Partition > > > uuid > > > > > > > > Hi, > > > > > > > > On 03/19/2015 10:41 PM, Nick Fisk wrote: > > > > > I'm looking at trialling OSD's with a small flashcache device > > > > over > > > > > them to hopefully reduce the impact of metadata updates when > > > > doing > > > > small block io. > > > > > Inspiration from here:- > > > > > > > > > > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083 > > > > > > > > > > One thing I suspect will happen, is that when the OSD node > > > > starts > > > > up > > > > > udev could possibly mount the base OSD partition instead of > > > > > flashcached device, as the base disk will have the ceph > > > > partition > > > > uuid > > > > > type. This could result in quite nasty corruption. > > > > I ran into this problem with an enhanceio based cache for one of > > > our > > > > database servers. > > > > > > > > I think you can prevent this problem by using bcache, which is > > > also > > > integrated > > > > into the official kernel tree. It does not act as a drop in > > > replacement, > > > but > > > > creates a new device that is only available if the cache is > > > initialized > > > correctly. A > > > > GPT partion table on the bcache device should be enough to allow > > > the > > > > standard udev rules to kick in. > > > > > > > > I haven't used bcache in this scenario yet, and I cannot comment > > > on > > > its > > > speed > > > > and reliability compared to other solutions. But from the > > > operational > > > point of > > > > view it is "safer" than enhanceio/flashcache. > > > I did look at bcache, but there are a lot of worrying messages on > > the > > > mailing list about hangs and panics that has discouraged me > > slightly > > from > > > it. I do think it is probably the best solution, but I'm not > > convinced about > > > the stability. > > > > > > > > Best regards, > > > > Burkhard > > > > ___ > > > > ceph-users mailing list > > > > ceph-users@lists.ceph.com > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Preliminary RDMA vs TCP numbers
Hi, Am I the only person noticing disappointing results from the preliminary RDMA testing, or am I reading the numbers wrong? Yes, it's true that on a very small cluster you do see a great improvement with rdma, but in real life rdma is used in large infrastructure projects, not on a few servers with a handful of osds. In fact, from what I've seen in the slides, the rdma implementation scales horribly, to the point that it becomes slower the more osds you throw at it. From my limited knowledge, I would have expected much higher performance gains with rdma, taking into account that you should have much lower latency, overhead and cpu utilisation when using this transport in comparison with tcp. Are we likely to see a great deal of improvement with ceph and rdma in the near future? Is there a roadmap for having stable and reliable rdma protocol support? Thanks Andrei - Original Message - > From: "Andrey Korolyov" > To: "Somnath Roy" > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > Sent: Wednesday, 8 April, 2015 9:28:12 AM > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy > wrote: > > > > Hi, > > Please find the preliminary performance numbers of TCP Vs RDMA > > (XIO) implementation (on top of SSDs) in the following link. > > > > http://www.slideshare.net/somnathroy7568/ceph-on-rdma > > > > The attachment didn't go through it seems, so, I had to use > > slideshare. > > > > Mark, > > If we have time, I can present it in tomorrow's performance > > meeting. > > > > Thanks & Regards > > Somnath > > > Those numbers are really impressive (for small numbers at least)! > What > are TCP settings you using?For example, difference can be lowered on > scale due to less intensive per-connection acceleration on CUBIC on a > larger number of nodes, though I do not believe that it was a main > reason for an observed TCP catchup on a relatively flat workload such > as fio generates. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Preliminary RDMA vs TCP numbers
Somnath, Sounds very promising! I can't wait to try it on my cluster as I am currently using IPOIB instread of the native rdma. Cheers Andrei - Original Message - > From: "Somnath Roy" > To: "Andrei Mikhailovsky" , "Andrey Korolyov" > > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > Sent: Wednesday, 8 April, 2015 5:23:23 PM > Subject: RE: [ceph-users] Preliminary RDMA vs TCP numbers > Andrei, > Yes, I see it has lot of potential and I believe fixing the > performance bottlenecks inside XIO messenger it should go further. > We are working on it and will keep community posted.. > Thanks & Regards > Somnath > From: Andrei Mikhailovsky [mailto:and...@arhont.com] > Sent: Wednesday, April 08, 2015 2:22 AM > To: Andrey Korolyov > Cc: ceph-users@lists.ceph.com; ceph-devel; Somnath Roy > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > Hi, > Am I the only person noticing disappointing results from the > preliminary RDMA testing, or am I reading the numbers wrong? > Yes, it's true that on a very small cluster you do see a great > improvement in rdma, but in real life rdma is used in large > infrastructure projects, not on a few servers with a handful of > osds. In fact, from what i've seen from the slides, the rdma > implementation scales horribly to the point that it becomes slower > the more osds you through at it. > From my limited knowledge, i have expected a much higher performance > gains with rdma, taking into account that you should have much lower > latency and overhead and lower cpu utilisation when using this > transport in comparison with tcp. > Are we likely to see a great deal of improvement with ceph and rdma > in a near future? Is there a roadmap for having a stable and > reliable rdma protocol support? > Thanks > Andrei > - Original Message - > > From: "Andrey Korolyov" < and...@xdel.ru > > > > To: "Somnath Roy" < somnath@sandisk.com > > > > Cc: ceph-users@lists.ceph.com , "ceph-devel" < > > ceph-de...@vger.kernel.org > > > > Sent: Wednesday, 8 April, 2015 9:28:12 AM > > > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > > > On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy < > > somnath@sandisk.com > wrote: > > > > > > > > Hi, > > > > Please find the preliminary performance numbers of TCP Vs RDMA > > > (XIO) implementation (on top of SSDs) in the following link. > > > > > > > > http://www.slideshare.net/somnathroy7568/ceph-on-rdma > > > > > > > > The attachment didn't go through it seems, so, I had to use > > > slideshare. > > > > > > > > Mark, > > > > If we have time, I can present it in tomorrow's performance > > > meeting. > > > > > > > > Thanks & Regards > > > > Somnath > > > > > > > Those numbers are really impressive (for small numbers at least)! > > What > > > are TCP settings you using?For example, difference can be lowered > > on > > > scale due to less intensive per-connection acceleration on CUBIC on > > a > > > larger number of nodes, though I do not believe that it was a main > > > reason for an observed TCP catchup on a relatively flat workload > > such > > > as fio generates. > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > PLEASE NOTE: The information contained in this electronic mail > message is intended only for the use of the designated recipient(s) > named above. 
If the reader of this message is not the intended > recipient, you are hereby notified that you have received this > message in error and that any review, dissemination, distribution, > or copying of this message is strictly prohibited. If you have > received this communication in error, please notify the sender by > telephone or e-mail (as shown above) immediately and destroy any and > all copies of this message in your possession (whether hard copies > or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Preliminary RDMA vs TCP numbers
Mike, yeah, I wouldn't switch to rdma until it is fully supported in a stable release ))) Andrei - Original Message - > From: "Andrei Mikhailovsky" > To: "Somnath Roy" > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > Sent: Wednesday, 8 April, 2015 7:16:40 PM > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > Somnath, > Sounds very promising! I can't wait to try it on my cluster as I am > currently using IPOIB instread of the native rdma. > Cheers > Andrei > - Original Message - > > From: "Somnath Roy" > > > To: "Andrei Mikhailovsky" , "Andrey Korolyov" > > > > > Cc: ceph-users@lists.ceph.com, "ceph-devel" > > > > > Sent: Wednesday, 8 April, 2015 5:23:23 PM > > > Subject: RE: [ceph-users] Preliminary RDMA vs TCP numbers > > > Andrei, > > > Yes, I see it has lot of potential and I believe fixing the > > performance bottlenecks inside XIO messenger it should go further. > > > We are working on it and will keep community posted.. > > > Thanks & Regards > > > Somnath > > > From: Andrei Mikhailovsky [mailto:and...@arhont.com] > > > Sent: Wednesday, April 08, 2015 2:22 AM > > > To: Andrey Korolyov > > > Cc: ceph-users@lists.ceph.com; ceph-devel; Somnath Roy > > > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > > > Hi, > > > Am I the only person noticing disappointing results from the > > preliminary RDMA testing, or am I reading the numbers wrong? > > > Yes, it's true that on a very small cluster you do see a great > > improvement in rdma, but in real life rdma is used in large > > infrastructure projects, not on a few servers with a handful of > > osds. In fact, from what i've seen from the slides, the rdma > > implementation scales horribly to the point that it becomes slower > > the more osds you through at it. > > > From my limited knowledge, i have expected a much higher > > performance > > gains with rdma, taking into account that you should have much > > lower > > latency and overhead and lower cpu utilisation when using this > > transport in comparison with tcp. > > > Are we likely to see a great deal of improvement with ceph and rdma > > in a near future? Is there a roadmap for having a stable and > > reliable rdma protocol support? > > > Thanks > > > Andrei > > > - Original Message - > > > > From: "Andrey Korolyov" < and...@xdel.ru > > > > > > > To: "Somnath Roy" < somnath@sandisk.com > > > > > > > Cc: ceph-users@lists.ceph.com , "ceph-devel" < > > > ceph-de...@vger.kernel.org > > > > > > > Sent: Wednesday, 8 April, 2015 9:28:12 AM > > > > > > Subject: Re: [ceph-users] Preliminary RDMA vs TCP numbers > > > > > > On Wed, Apr 8, 2015 at 11:17 AM, Somnath Roy < > > > somnath@sandisk.com > wrote: > > > > > > > > > > > > > > Hi, > > > > > > > Please find the preliminary performance numbers of TCP Vs RDMA > > > > (XIO) implementation (on top of SSDs) in the following link. > > > > > > > > > > > > > > http://www.slideshare.net/somnathroy7568/ceph-on-rdma > > > > > > > > > > > > > > The attachment didn't go through it seems, so, I had to use > > > > slideshare. > > > > > > > > > > > > > > Mark, > > > > > > > If we have time, I can present it in tomorrow's performance > > > > meeting. > > > > > > > > > > > > > > Thanks & Regards > > > > > > > Somnath > > > > > > > > > > > > > Those numbers are really impressive (for small numbers at least)! 
> > > What > > > > > > are TCP settings you using?For example, difference can be lowered > > > on > > > > > > scale due to less intensive per-connection acceleration on CUBIC > > > on > > > a > > > > > > larger number of nodes, though I do not believe that it was a > > > main > > > > > > reason for an observed TCP catchup on a relatively flat workload > > > such > > > > > > as fio generates. > > > > > >
Re: [ceph-users] deep scrubbing causes osd down
Hi JC, I am running ceph 0.87.1 on Ubuntu 12.04 LTS server with latest patches. I am however running kernel version 3.19.3 and not the stock distro one. I am running cfq on all spindles and noop on all ssds (used for journals). I've not done any scrub specific options, but will try and see if it makes a difference. Thanks for your feedback Andrei - Original Message - > From: "LOPEZ Jean-Charles" > To: "Andrei Mikhailovsky" > Cc: "LOPEZ Jean-Charles" , > ceph-users@lists.ceph.com > Sent: Saturday, 11 April, 2015 7:54:18 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi Andrei, > 1) what ceph version are you running? > 2) what distro and version are you running? > 3) have you checked the disk elevator for the OSD devices to be set > to cfq? > 4) Have have you considered exploring the following parameters to > further tune > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > - osd_deep_scrub_stride If you have lowered parameters above, you can > play with this one to fit best your physical disk behaviour. - > osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. > = 0.5 to start with a half second delay > Cheers > JC > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < and...@arhont.com > > > wrote: > > > Hi guys, > > > I was wondering if anyone noticed that the deep scrubbing process > > causes some osd to go down? > > > I have been keeping an eye on a few remaining stability issues in > > my > > test cluster. One of the unsolved issues is the occasional > > reporting > > of osd(s) going down and coming back up after about 20-30 seconds. > > This happens to various osds throughout the cluster. I have a small > > cluster of just 2 osd servers with 9 osds each. > > > The common trend that i see week after week is that whenever there > > is > > a long deep scrubbing activity on the cluster it triggers one or > > more osds to go down for a short period of time. After the osd is > > marked down, it goes back up after about 20 seconds. Obviously > > there > > is a repair process that kicks in which causes more load on the > > cluster. While looking at the logs, i've not seen the osds being > > marked down when the cluster is not deep scrubbing. It _always_ > > happens when there is a deep scrub activity. I am seeing the > > reports > > of osds going down about 3-4 times a week. 
> > > The latest happened just recently with the following log entries: > > > 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : > > cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 > > active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB > > / 50206 GB avail; 1005 B/s rd, 1005 > > > B/s wr, 0 op/s > > > 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : > > cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 reports > > from > > 5 peers after 60.747890 >= grace 46.701350) > > > 2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 3441544 : > > cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in > > > 2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 3441545 : > > cluster [INF] pgmap v50849467: 8508 pgs: 599 stale+active+clean, > > 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 > > active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB > > / 50206 GB avail; 375 B/s rd, 0 op/s > > > osd.6 logs around the same time are: > > > 2015-04-10 19:16:29.110617 7fad6d5ec700 0 log_channel(default) log > > [INF] : 5.3d7 deep-scrub ok > > > 2015-04-10 19:27:47.561389 7fad6bde9700 0 log_channel(default) log > > [INF] : 5.276 deep-scrub ok > > > 2015-04-10 19:31:11.611321 7fad6d5ec700 0 log_channel(default) log > > [INF] : 5.287 deep-scrub ok > > > 2015-04-10 19:31:53.339881 7fad7ce0b700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 > > > 2015-04-10 19:31:53.339887 7fad7ce0b700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 > > > 2015-04-10 19:31:53.339890 7fad7ce0b700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15 > > > 2015-04-10 19:31:53.340050 7fad7e60e700 1 heartbeat_map is_healthy > > 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 > > > 2015-04-10 19:31:53.340053 7fad7e60e700 1 heartbeat_map is_health
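For reference, the settings suggested above translate into a ceph.conf fragment along these lines. It is only a sketch of the values discussed in this thread, not a tested recommendation, and the same values can be pushed to running OSDs with injectargs while experimenting:

[osd]
osd scrub chunk min = 1
osd scrub chunk max = 5
osd scrub sleep = 0.5
# osd deep scrub stride defaults to 524288; tune it to the disks' read size if needed
osd max scrubs = 1

# runtime equivalent, without restarting the daemons
ceph tell 'osd.*' injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_sleep 0.5'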
Re: [ceph-users] deep scrubbing causes osd down
JC, I've implemented the following changes to the ceph.conf and restarted mons and osds. osd_scrub_chunk_min = 1 osd_scrub_chunk_max =5 Things have become considerably worse after the changes. Shortly after doing that, majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. i have tried restarting all osds, but shortly after the restart the cpu usage goes up. The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for > 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9 2015-04-12 08:40:43.570004 7f96dd693700 0 cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000 [In total i've got around 40,000 slow request entries accumulated overnight ((( ] On top of that, I have reports of osds going down and back up as frequently as every 10-20 minutes. This effects all osds and not a particular set of osds. I will restart the osd servers to see if it makes a difference, otherwise, I will need to revert back to the default settings as the cluster as it currently is is not functional. Andrei - Original Message - > From: "LOPEZ Jean-Charles" > To: "Andrei Mikhailovsky" > Cc: "LOPEZ Jean-Charles" , > ceph-users@lists.ceph.com > Sent: Saturday, 11 April, 2015 7:54:18 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi Andrei, > 1) what ceph version are you running? > 2) what distro and version are you running? > 3) have you checked the disk elevator for the OSD devices to be set > to cfq? > 4) Have have you considered exploring the following parameters to > further tune > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > - osd_deep_scrub_stride If you have lowered parameters above, you can > play with this one to fit best your physical disk behaviour. - > osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. > = 0.5 to start with a half second delay > Cheers > JC > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < and...@arhont.com > > > wrote: > > > Hi guys, > > > I was wondering if anyone noticed that the deep scrubbing process > > causes some osd to go down? 
> > > I have been keeping an eye on a few remaining stability issues in > > my > > test cluster. One of the unsolved issues is the occasional > > reporting > > of osd(s) going down and coming back up after about 20-30 seconds. > > This happens to various osds throughout the cluster. I have a small > > cluster of just 2 osd servers with 9 osds each. > > > The common trend that i see week after week is that whenever there > > is > > a long deep scrubbing activity on the cluster it triggers one or > > more osds to go down for a short period of time. After the osd is > > marked down, it goes back up after about 20 seconds. Obviously > > there > > is a repair process that kicks in which causes more load on the > > cluster. While looking at the logs, i've not seen the osds being > > marked down when the cluster is not deep scrubbing. It _always_ > > happens when there is a deep scrub activity. I am seeing the > > reports > > of osds going down about 3-4 times a week. > > > The latest happened just recently with the following log entries: > > > 2015-04-
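When backing the change out it is not necessary to restart every daemon: the previous defaults can be injected back at runtime, and the admin socket shows what the blocked requests are actually waiting on. A minimal sketch (5 and 25 are the stock defaults mentioned earlier in the thread; osd.6 is simply the OSD from the earlier log excerpt):

# restore the default chunk settings without restarting the OSDs
ceph tell 'osd.*' injectargs '--osd_scrub_chunk_min 5 --osd_scrub_chunk_max 25'

# see what the slow requests on a flapping OSD are waiting for
ceph daemon osd.6 dump_ops_in_flight
ceph daemon osd.6 dump_historic_ops

# cluster-wide view of blocked requests and down OSDs
ceph health detail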
Re: [ceph-users] deep scrubbing causes osd down
JC, the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't not seen a single osd disconnect. Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want to have more than 1 or 2 scrub/deep-scrubs running at the same time on my cluster. How do I implement this? Thanks Andrei - Original Message - > From: "Andrei Mikhailovsky" > To: "LOPEZ Jean-Charles" > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 12 April, 2015 9:02:05 AM > Subject: Re: [ceph-users] deep scrubbing causes osd down > JC, > I've implemented the following changes to the ceph.conf and restarted > mons and osds. > osd_scrub_chunk_min = 1 > osd_scrub_chunk_max =5 > Things have become considerably worse after the changes. Shortly > after doing that, majority of osd processes started taking up over > 100% cpu and the cluster has considerably slowed down. All my vms > are reporting high IO wait (between 30-80%), even vms which are > pretty idle and don't do much. > i have tried restarting all osds, but shortly after the restart the > cpu usage goes up. The osds are showing the following logs: > 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log > [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 > 08:38:28.576168: osd_op(client.69637439.0:290325926 > rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size > 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > waiting for missing object > 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log > [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 > 08:38:28.606815: osd_op(client.69637439.0:290325927 > rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size > 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > waiting for missing object > 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log > [WRN] : 7 slow requests, 1 included below; oldest blocked for > > 68.278951 secs > 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log > [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 > 08:39:06.586669: osd_op(client.64965167.0:1607510 > rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size > 4194304 write_size 4194304,write 3584000~69632] 5.30418007 > ack+ondisk+write+known_if_redirected e74834) currently waiting for > subops from 9 > 2015-04-12 08:40:43.570004 7f96dd693700 0 > cls/rgw/cls_rgw.cc:1458: gc_iterate_entries > end_key=1_01428824443.569998000 > [In total i've got around 40,000 slow request entries accumulated > overnight ((( ] > On top of that, I have reports of osds going down and back up as > frequently as every 10-20 minutes. This effects all osds and not a > particular set of osds. > I will restart the osd servers to see if it makes a difference, > otherwise, I will need to revert back to the default settings as the > cluster as it currently is is not functional. > Andrei > - Original Message - > > From: "LOPEZ Jean-Charles" > > > To: "Andrei Mikhailovsky" > > > Cc: "LOPEZ Jean-Charles" , > > ceph-users@lists.ceph.com > > > Sent: Saturday, 11 April, 2015 7:54:18 PM > > > Subject: Re: [ceph-users] deep scrubbing causes osd down > > > Hi Andrei, > > > 1) what ceph version are you running? > > > 2) what distro and version are you running? 
> > > 3) have you checked the disk elevator for the OSD devices to be set > > to cfq? > > > 4) Have have you considered exploring the following parameters to > > further tune > > > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > > > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > > > - osd_deep_scrub_stride If you have lowered parameters above, you > > can > > play with this one to fit best your physical disk behaviour. - > > osd_scrub_sleep introduce a half second sleep between 2 scrubs; > > e.g. > > = 0.5 to start with a half second delay > > > Cheers > > > JC > > > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < and...@arhont.com > > > > > > > wrote: > > > > > > Hi guys, > > > > > > I was wondering if anyone noticed that the deep scrubbing process > > > causes some osd to go down? > > > > > > I have been kee
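For reference, the tuning knobs JC lists above can be inspected and changed on a live cluster without restarting anything; only settings written to ceph.conf survive a restart. A rough sketch (osd.0, the default admin socket and the values shown are just examples taken from this thread):

# Check the current values on one OSD through its admin socket:
ceph daemon osd.0 config get osd_scrub_chunk_min
ceph daemon osd.0 config get osd_scrub_chunk_max
ceph daemon osd.0 config get osd_scrub_sleep

# Inject new values into all running OSDs (not persistent across restarts):
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_sleep 0.5'

# To keep the settings, add the same options to the [osd] section of
# ceph.conf and restart the daemons.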
Re: [ceph-users] deep scrubbing causes osd down
JC, Thanks I think the max scrub option that you refer to is a value per osd and not per cluster. So, the default is not to run more than 1 scrub per osd. So, if you have 100 osds by default it will not run more than 100 scurb processes at the same time. However, I want to limit the number on a cluster basis rather than on an osd basis. Andrei - Original Message - > From: "Jean-Charles Lopez" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 12 April, 2015 5:17:10 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi andrei > There is one parameter, osd_max_scrub I think, that controls the > number of scrubs per OSD. But the default is 1 if I'm correct. > Can you check on one of your OSDs with the admin socket? > Then it remains the option of scheduling the deep scrubs via a cron > job after setting nodeep-scrub to prevent automatic deep scrubbing. > Dan Van Der Ster had a post on this ML regarding this. > JC > While moving. Excuse unintended typos. > On Apr 12, 2015, at 05:21, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > JC, > > > the restart of the osd servers seems to have stabilised the > > cluster. > > It has been a few hours since the restart and I haven't not seen a > > single osd disconnect. > > > Is there a way to limit the total number of scrub and/or deep-scrub > > processes running at the same time? For instance, I do not want to > > have more than 1 or 2 scrub/deep-scrubs running at the same time on > > my cluster. How do I implement this? > > > Thanks > > > Andrei > > > - Original Message - > > > > From: "Andrei Mikhailovsky" < and...@arhont.com > > > > > > > To: "LOPEZ Jean-Charles" < jelo...@redhat.com > > > > > > > Cc: ceph-users@lists.ceph.com > > > > > > Sent: Sunday, 12 April, 2015 9:02:05 AM > > > > > > Subject: Re: [ceph-users] deep scrubbing causes osd down > > > > > > JC, > > > > > > I've implemented the following changes to the ceph.conf and > > > restarted > > > mons and osds. > > > > > > osd_scrub_chunk_min = 1 > > > > > > osd_scrub_chunk_max =5 > > > > > > Things have become considerably worse after the changes. Shortly > > > after doing that, majority of osd processes started taking up > > > over > > > 100% cpu and the cluster has considerably slowed down. All my vms > > > are reporting high IO wait (between 30-80%), even vms which are > > > pretty idle and don't do much. > > > > > > i have tried restarting all osds, but shortly after the restart > > > the > > > cpu usage goes up. 
The osds are showing the following logs: > > > > > > 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 60.277590 seconds old, received at > > > 2015-04-12 > > > 08:38:28.576168: osd_op(client.69637439.0:290325926 > > > rbd_data.265f967a5f7514.4a00 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc > > > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > > > waiting for missing object > > > > > > 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 60.246943 seconds old, received at > > > 2015-04-12 > > > 08:38:28.606815: osd_op(client.69637439.0:290325927 > > > rbd_data.265f967a5f7514.4a00 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc > > > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > > > waiting for missing object > > > > > > 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : 7 slow requests, 1 included below; oldest blocked for > > > > 68.278951 secs > > > > > > 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 30.268450 seconds old, received at > > > 2015-04-12 > > > 08:39:06.586669: osd_op(client.64965167.0:1607510 > > > rbd_data.1f264b2ae8944a.0228 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 3584000~69632] 5.30418007 > > > ack+ondisk+write+known_if_redirected e74834) curre
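As far as I know there is no single cluster-wide limit on concurrent scrubs in this release; osd_max_scrubs is enforced per OSD. The usual workaround, along the lines of the Dan van der Ster post JC mentions, is to disable automatic deep scrubbing and drive it yourself in small batches. A minimal sketch (the PG ids are placeholders; pick the ones with the oldest deep-scrub stamps from 'ceph pg dump'):

# Confirm the per-OSD limit JC refers to (the option name is osd_max_scrubs):
ceph daemon osd.0 config get osd_max_scrubs

# Stop Ceph from scheduling deep scrubs on its own:
ceph osd set nodeep-scrub

# Trigger them manually instead, e.g. one or two PGs per cron run:
ceph pg deep-scrub 5.30
ceph pg deep-scrub 5.31

# Re-enable automatic deep scrubbing when finished:
ceph osd unset nodeep-scrub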
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Hi I have been testing the Samsung 840 Pro (128gb) for quite sometime and I can also confirm that this drive is unsuitable for osd journal. The performance and latency that I get from these drives (according to ceph osd perf) are between 10 - 15 times slower compared to the Intel 520. The Intel 530 drives are also pretty awful. They are meant to be a replacement of the 520 drives, but the performance is pretty bad. I have found Intel 520 to be a reasonable drive for performance per price, for a cluster without a great deal of writes. However they do not make those anymore. Otherwise, it seems that the Intel 3600 and 3700 series is a good performer and has a much longer life expectancy. Andrei - Original Message - > From: "Eneko Lacunza" > To: "J-P Methot" , "Christian Balzer" > , ceph-users@lists.ceph.com > Sent: Tuesday, 21 April, 2015 8:18:20 AM > Subject: Re: [ceph-users] Possible improvements for a slow write > speed (excluding independent SSD journals) > Hi, > I'm just writing to you to stress out what others have already said, > because it is very important that you take it very seriously. > On 20/04/15 19:17, J-P Methot wrote: > > On 4/20/2015 11:01 AM, Christian Balzer wrote: > >> > >>> This is similar to another thread running right now, but since > >>> our > >>> current setup is completely different from the one described in > >>> the > >>> other thread, I thought it may be better to start a new one. > >>> > >>> We are running Ceph Firefly 0.80.8 (soon to be upgraded to > >>> 0.80.9). We > >>> have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs). Each > >>> OSD > >>> is a > >>> Samsung SSD 840 EVO on which I can reach write speeds of roughly > >>> 400 > >>> MB/sec, plugged in jbod on a controller that can theoretically > >>> transfer > >>> at 6gb/sec. All of that is linked to openstack compute nodes on > >>> two > >>> bonded 10gbps links (so a max transfer rate of 20 gbps). > >>> > >> I sure as hell hope you're not planning to write all that much to > >> this > >> cluster. > >> But then again you're worried about write speed, so I guess you > >> do. > >> Those _consumer_ SSDs will be dropping like flies, there are a > >> number of > >> threads about them here. > >> > >> They also might be of the kind that don't play well with O_DSYNC, > >> I > >> can't > >> recall for sure right now, check the archives. > >> Consumer SSDs universally tend to slow down quite a bit when not > >> TRIM'ed > >> and/or subjected to prolonged writes, like those generated by a > >> benchmark. > > I see, yes it looks like these SSDs are not the best for the job. > > We > > will not change them for now, but if they start failing, we will > > replace them with better ones. > I tried to put a Samsung 840 Pro 256GB in a ceph setup. It is > supposed > to be quite better than the EVO right? It was total crap. No "not the > best for the job". TOTAL CRAP. :) > It can't give any useful write performance for a Ceph OSD. Spec sheet > numbers don't matter for this, they don't work for ceph OSD, period. > And > yes, the drive is fine and works like a charm in workstation > workloads. > I suggest you at least get some intel S3700/S3610 and use them for > the > journal of those samsung drives, I think that could help performance > a lot. > Cheers > Eneko > -- > Zuzendari Teknikoa / Director Técnico > Binovo IT Human Project, S.L. > Telf. 943575997 > 943493611 > Astigarraga bidea 2, planta 6 dcha., ofi. 
3-2; 20180 Oiartzun > (Gipuzkoa) > www.binovo.es > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
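For anyone wanting to reproduce the comparison above, the per-OSD latency figures Andrei refers to come straight from the cluster; journals sitting on slow SSDs usually stand out immediately:

# Commit/apply latency per OSD, in milliseconds:
ceph osd perf

# Refresh it every couple of seconds while a benchmark is running:
watch -n 2 ceph osd perf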
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Anthony, I doubt the manufacturer reported 315MB/s for 4K block size. Most likely they've used 1M or 4M as the block size to achieve the 300MB/s+ speeds Andrei - Original Message - > From: "Alexandre DERUMIER" > To: "Anthony Levesque" > Cc: "ceph-users" > Sent: Saturday, 25 April, 2015 5:32:30 PM > Subject: Re: [ceph-users] Possible improvements for a slow write > speed (excluding independent SSD journals) > I'm able to reach around 2-25000iops with 4k block with s3500 > (with o_dsync) (so yes, around 80-100MB/S). > I'l bench new s3610 soon to compare. > - Mail original - > De: "Anthony Levesque" > À: "Christian Balzer" > Cc: "ceph-users" > Envoyé: Vendredi 24 Avril 2015 22:00:44 > Objet: Re: [ceph-users] Possible improvements for a slow write speed > (excluding independent SSD journals) > Hi Christian, > We tested some DC S3500 300GB using dd if=randfile of=/dev/sda bs=4k > count=10 oflag=direct,dsync > we got 96 MB/s which is far from the 315 MB/s from the website. > Can I ask you or anyone on the mailing list how you are testing the > write speed for journals? > Thanks > --- > Anthony Lévesque > GloboTech Communications > Phone: 1-514-907-0050 x 208 > Toll Free: 1-(888)-GTCOMM1 x 208 > Phone Urgency: 1-(514) 907-0047 > 1-(866)-500-1555 > Fax: 1-(514)-907-0750 > aleves...@gtcomm.net > http://www.gtcomm.net > On Apr 23, 2015, at 9:05 PM, Christian Balzer < ch...@gol.com > > wrote: > Hello, > On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote: > BQ_BEGIN > To update you on the current test in our lab: > 1.We tested the Samsung OSD in Recovery mode and the speed was able > to > maxout 2x 10GbE port(transferring data at 2200+ MB/s during > recovery). > So for normal write operation without O_DSYNC writes Samsung drives > seem > ok. > 2.We then tested a couple of different model of SSD we had in stock > with > the following command: > dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync > This was from a blog written by Sebastien Han and I think should be > able > to show how the drives would perform in O_DSYNC writes. For people > interested in some result of what we tested here they are: > Intel DC S3500 120GB = 114 MB/s > Samsung Pro 128GB = 2.4 MB/s > WD Black 1TB (HDD) = 409 KB/s > Intel 330 120GB = 105 MB/s > Intel 520 120GB = 9.4 MB/s > Intel 335 80GB = 9.4 MB/s > Samsung EVO 1TB = 2.5 MB/s > Intel 320 120GB = 78 MB/s > OCZ Revo Drive 240GB = 60.8 MB/s > 4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s > No real surprises here, but a nice summary nonetheless. > You _really_ want to avoid consumer SSDs for journals and have a good > idea > on how much data you'll write per day and how long you expect your > SSDs to > last (the TBW/$ ratio). > BQ_BEGIN > Please let us know if the command we ran was not optimal to test > O_DSYNC > writes > We order larger drive from Intel DC series to see if we could get > more > than 200 MB/s per SSD. We will keep you posted on tests if that > interested you guys. We dint test multiple parallel test yet (to > simulate multiple journal on one SSD). > BQ_END > You can totally trust the numbers on Intel's site: > http://ark.intel.com/products/family/83425/Data-Center-SSDs > The S3500s are by far the slowest and have the lowest endurance. > Again, depending on your expected write level the S3610 or S3700 > models > are going to be a better fit regarding price/performance. > Especially when you consider that loosing a journal SSD will result > in > several dead OSDs. 
> BQ_BEGIN > 3.We remove the Journal from all Samsung OSD and put 2x Intel 330 > 120GB > on all 6 Node to test. The overall speed we were getting from the > rados > bench went from 1000 MB/s(approx.) to 450 MB/s which might only be > because the intel cannot do too much in term of journaling (was > tested > at around 100 MB/s). It will be interesting to test with bigger Intel > DC S3500 drives(and more journals) per node to see if I can back up > to > 1000MB/s or even surpass it. > We also wanted to test if the CPU could be a huge bottle neck so we > swap > the Dual E5-2620v2 from node #6 and replace them with Dual > E5-2609v2(Which are much smaller in core and speed) and the 450 MB/s > we > got from he rados bench went even lower to 180 MB/s. > BQ_END > You really don't have to swap CPUs around, monitor things with atop > or > other tools to see where your bottlenecks are. > BQ_BEGIN > So Im wondering if the 1000MB/s we got when the Journal was shared on > the OSD SSD was not limited by the CPUs (even though the samsung are > not > good for journals on the long run) and not just by the fact Samsung > SSD > are bad in O_DSYNC writes(or maybe both). It is probable that 16 SSD > OSD per node in a full SSD cluster is too much and the major > bottleneck > will be from the CPU. > BQ_END > That's what I kept saying. ^.^ > BQ_BEGIN > 4.Im wondering if we find good SSD for the journal and keep the > samsung > for normal writes and read(
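An alternative to the dd one-liner used in this thread is fio with synchronous 4k writes, which also makes it easy to approximate several journals sharing one SSD by raising numjobs. /dev/sdX is a placeholder and the test overwrites whatever is on that device:

# One journal writer:
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

# Roughly simulate four journals on the same SSD:
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=4 --iodepth=1 \
    --runtime=60 --time_based --group_reporting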
Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance?
Piotr, You may also investigate if the cache tier made of a couple of ssds could help you. Not sure how the data is used in your company, but if you have a bunch of hot data that moves around from one vm to another it might greatly speed up the rsync. On the other hand, if a lot of rsync data is cold, it might have an adverse effect on performance. As a test, you could try to create a small pool with a couple of ssds in a cache tier on top of your spinning osds. You don't need to purchase tons of ssds in advance. As a test case, I would suggest 2-4 ssds in a cache tier should be okay for the PoC. Andrei - Original Message - From: "Nick Fisk" To: "Piotr Wachowicz" Cc: ceph-users@lists.ceph.com Sent: Friday, 1 May, 2015 10:42:12 AM Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? Yeah, that’s your problem, doing a single thread rsync when you have quite poor write latency will not be quick. SSD journals should give you a fair performance boost, otherwise you need to coalesce the writes at the client so that Ceph is given bigger IOs at higher queue depths. RBD Cache can help here as well as potentially FS tuning to buffer more aggressively. If writeback RBD cache is enabled, data will be buffered by RBD until a sync is called by the client, so data loss can occur during this period if the app is not issuing fsyncs properly. Once a sync is called data is flushed to the journals and then later to the actual OSD store. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Piotr Wachowicz Sent: 01 May 2015 10:14 To: Nick Fisk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? Thanks for your answer, Nick. Typically it's a single rsync session at a time (sometimes two, but rarely more concurrently). So it's a single ~5GB typical linux filesystem from one random VM to another random VM. Apart from using RBD Cache, is there any other way to improve the overall performance of such a use case in a Ceph cluster? In theory I guess we could always tarball it, and rsync the tarball, thus effectively using sequential IO rather than random. But that's simply not feasible for us at the moment. Any other ways? Sidequestion: does using RBDCache impact the way data is stored on the client? (e.g. a write call returning after data has been written to Journal (fast) vs written all the way to the OSD data store(slow)). I'm guessing it's always the first one, regardless of whether client uses RBDCache or not, right? My logic here is that otherwise that would imply that clients can impact the way OSDs behave, which could be dangerous in some situations. Kind Regards, Piotr On Fri, May 1, 2015 at 10:59 AM, Nick Fisk < n...@fisk.me.uk > wrote: How many Rsync’s are doing at a time? If it is only a couple, you will not be able to take advantage of the full number of OSD’s, as each block of data is only located on 1 OSD (not including replicas). When you look at disk statistics you are seeing an average over time, so it will look like the OSD’s are not very busy, when in fact each one is busy for a very brief period. SSD journals will help your write latency, probably going down from around 15-30ms to under 5ms From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf Of Piotr Wachowicz Sent: 01 May 2015 09:31 To: ceph-users@lists.ceph.com Subject: [ceph-users] How to estimate whether putting a journal on SSD will help with performance? 
Is there any way to confirm (beforehand) that using SSDs for journals will help? We're seeing very disappointing Ceph performance. We have 10GigE interconnect (as a shared public/internal network). We're wondering whether it makes sense to buy SSDs and put journals on them. But we're looking for a way to verify that this will actually help BEFORE we splash cash on SSDs. The problem is that the way we have things configured now, with journals on spinning HDDs (shared with OSDs as the backend storage), apart from slow read/write performance to Ceph I already mention, we're also seeing fairly low disk utilization on OSDs. This low disk utilization suggests that journals are not really used to their max, which begs for the questions whether buying SSDs for journals will help. This kind of suggests that the bottleneck is NOT the disk. But,m yeah, we cannot really confirm that. Our typical data access use case is a lot of small random read/writes. We're doing a lot of rsyncing (entire regular linux filesystems) from one VM to another. We're using Ceph for OpenStack storage (kvm). Enabling RBD cache didn't really help all that much. So, is there any way to confirm beforehand that using SSDs for journals will help in our case? Kind Regards
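Picking up the cache-tier suggestion at the top of this thread, a small proof of concept only needs a handful of commands once the SSD OSDs are in place. The pool names below are placeholders, and the cache pool is assumed to be mapped onto the SSDs through its own CRUSH rule:

# Attach an SSD pool as a writeback cache in front of the existing pool:
ceph osd tier add rbd-pool ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay rbd-pool ssd-cache

# A hit set and a size limit are needed for sensible eviction behaviour:
ceph osd pool set ssd-cache hit_set_type bloom
ceph osd pool set ssd-cache hit_set_count 1
ceph osd pool set ssd-cache hit_set_period 3600
ceph osd pool set ssd-cache target_max_bytes 200000000000

# To back the PoC out again:
ceph osd tier cache-mode ssd-cache forward
rados -p ssd-cache cache-flush-evict-all
ceph osd tier remove-overlay rbd-pool
ceph osd tier remove rbd-pool ssd-cache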
Re: [ceph-users] rbd performance issue - can't find bottleneck
Hi guys, I also use a combination of intel 520 and 530 for my journals and have noticed that the latency and the speed of 520s is better than 530s. Could someone please confirm that doing the following at start up will stop the dsync on the relevant drives? # echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type Do I need to patch my kernel for this or is this already implementable in vanilla? I am running 3.19.x branch from ubuntu testing repo. Would the above change the performance of 530s to be more like 520s? Cheers Andrei - Original Message - > From: "Alexandre DERUMIER" > To: "Jacek Jarosiewicz" > Cc: "ceph-users" > Sent: Thursday, 18 June, 2015 11:54:42 AM > Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > > Hi, > > for read benchmark > > with fio, what is the iodepth ? > > my fio 4k randr results with > > iodepth=1 : bw=6795.1KB/s, iops=1698 > iodepth=2 : bw=14608KB/s, iops=3652 > iodepth=4 : bw=32686KB/s, iops=8171 > iodepth=8 : bw=76175KB/s, iops=19043 > iodepth=16 :bw=173651KB/s, iops=43412 > iodepth=32 :bw=336719KB/s, iops=84179 > > (This should be similar with rados bench -t (threads) option). > > This is normal because of network latencies + ceph latencies. > Doing more parallism increase iops. > > (doing a bench with "dd" = iodepth=1) > > Theses result are with 1 client/rbd volume. > > > now with more fio client (numjobs=X) > > I can reach up to 300kiops with 8-10 clients. > > > This should be the same with lauching multiple rados bench in parallel > > (BTW, it could be great to have an option in rados bench to do it) > > > - Mail original - > De: "Jacek Jarosiewicz" > À: "Mark Nelson" , "ceph-users" > > Envoyé: Jeudi 18 Juin 2015 11:49:11 > Objet: Re: [ceph-users] rbd performance issue - can't find bottleneck > > On 06/17/2015 04:19 PM, Mark Nelson wrote: > >> SSD's are INTEL SSDSC2BW240A4 > > > > Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see > > this thread by Stefan Priebe: > > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html > > > > In fact it was the difference in Intel 520 and Intel 530 performance > > that triggered many of the different investigations that have taken > > place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The > > gist of it is that the 520 is very fast but probably not safe. The 530 > > is safe but not fast. The DC S3700 (and similar drives with super > > capacitors) are thought to be both fast and safe (though some drives > > like the crucual M500 and later misrepresented their power loss > > protection so you have to be very careful!) > > > > Yes, these are Intel 530. > I did the tests described in the thread You pasted and unfortunately > that's my case... I think. 
> > The dd run locally on a mounted ssd partition looks like this: > > [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1 > oflag=direct,dsync > 1+0 records in > 1+0 records out > 358400 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s > > and when I skip the flag dsync it goes fast: > > [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1 > oflag=direct > 1+0 records in > 1+0 records out > 358400 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s > > (I used the same 350k block size as mentioned in the e-mail from the > thread above) > > I tried disabling the dsync like this: > > [root@cf02 ~]# echo temporary write through > > /sys/class/scsi_disk/1\:0\:0\:0/cache_type > > [root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type > write through > > ..and then locally I see the speedup: > > [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=1 > oflag=direct,dsync > 1+0 records in > 1+0 records out > 358400 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s > > > ..but when I test it from a client I still get slow results: > > root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct > 100+0 records in > 100+0 records out > 1048576 bytes (10 GB) copied, 122.482 s, 85.6 MB/s > > and fio gives the same 2-3k iops. > > after the change to SSD cache_type I tried remounting the test image, > recreating it and so on - nothing helped. > > I ran rbd bench-write on it, and it's not good either: > > root@cf03:~# rbd bench-write t2 > bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq > SEC OPS OPS/SEC BYTES/SEC > 1 4221 4220.64 32195919.35 > 2 9628 4813.95 36286083.00 > 3 15288 4790.90 35714620.49 > 4 19610 4902.47 36626193.93 > 5 24844 4968.37 37296562.14 > 6 30488 5081.31 38112444.88 > 7 36152 5164.54 38601615.10 > 8 41479 5184.80 38860207.38 > 9 46971 5218.70 39181437.52 > 10 52219 5221.77 39322641.34 > 11 5 5151.36 38761566.30 > 12 62073 5172.71 38855021.35 > 13 65962 5073.95 38182880.49 > 14 71541 5110.02 38431536.17 > 15 77039 5135.85 38615125.42 > 16 82133 5133.31 38692578.98 > 17 87657 5156.24 38849948.84 > 18 92943 5141.03 38
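A quick way to see what every disk in a node currently advertises, before and after applying the echo above (this only reads sysfs; the "temporary write through" setting does not survive a reboot, and note the data-integrity warnings in the replies further down):

for f in /sys/class/scsi_disk/*/cache_type; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done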
Re: [ceph-users] rbd performance issue - can't find bottleneck
Mark, Thanks, I do understand that there is a risk of data loss by doing this. Having said this, ceph is designed to be fault tollerant and self repairing should something happen to individual journals, osds and server nodes. Isn't this a still good measure to compromise between data integrity and speed? So, by faking dsync and not actually doing this, you have a window of opportunity to data loss should a failure happen between the last flash and the moment of failure. Thus, if the ssd disk failure happens, regardless if dsync is used or not, would ceph still consider the osds behind the journal to be unavailable/lost and migrate the data around anyway and perform the necessary checks to make sure the data integrity is not compromised? If this is true, I would still consider using the dsync bypass in favour of the extra speed benefit. Unless I am missing a bigger picture and miscalculated something. Could someone please elaborate on this a bit further to understand the realy world threat of not using the dsync bypass? Cheers Andrei - Original Message - > From: "Mark Nelson" > To: ceph-users@lists.ceph.com > Sent: Friday, 19 June, 2015 3:59:55 PM > Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > > > > On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote: > > Hi guys, > > > > I also use a combination of intel 520 and 530 for my journals and have > > noticed that the latency and the speed of 520s is better than 530s. > > > > Could someone please confirm that doing the following at start up will stop > > the dsync on the relevant drives? > > > > # echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type > > > > Do I need to patch my kernel for this or is this already implementable in > > vanilla? I am running 3.19.x branch from ubuntu testing repo. > > > > Would the above change the performance of 530s to be more like 520s? > > I need to comment that it's *really* not a good idea to do this if you > care about data integrity. There's a reason why the 530 is slower than > the 520. If you need speed and you care about your data, you should > really consider jumping up to the DC S3700. > > There's a possibility that the 730 *may* be ok as it supposedly has > power loss protection, but it's still not using HET MLC so the flash > cells will wear out faster. It's also a consumer grade drive, so no one > will give you support for this kind of use case if you have problems. > > Mark > > > > > Cheers > > > > Andrei > > > > > > > > - Original Message - > >> From: "Alexandre DERUMIER" > >> To: "Jacek Jarosiewicz" > >> Cc: "ceph-users" > >> Sent: Thursday, 18 June, 2015 11:54:42 AM > >> Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > >> > >> Hi, > >> > >> for read benchmark > >> > >> with fio, what is the iodepth ? > >> > >> my fio 4k randr results with > >> > >> iodepth=1 : bw=6795.1KB/s, iops=1698 > >> iodepth=2 : bw=14608KB/s, iops=3652 > >> iodepth=4 : bw=32686KB/s, iops=8171 > >> iodepth=8 : bw=76175KB/s, iops=19043 > >> iodepth=16 :bw=173651KB/s, iops=43412 > >> iodepth=32 :bw=336719KB/s, iops=84179 > >> > >> (This should be similar with rados bench -t (threads) option). > >> > >> This is normal because of network latencies + ceph latencies. > >> Doing more parallism increase iops. > >> > >> (doing a bench with "dd" = iodepth=1) > >> > >> Theses result are with 1 client/rbd volume. > >> > >> > >> now with more fio client (numjobs=X) > >> > >> I can reach up to 300kiops with 8-10 clients. 
> >> > >> > >> This should be the same with lauching multiple rados bench in parallel > >> > >> (BTW, it could be great to have an option in rados bench to do it) > >> > >> > >> - Mail original - > >> De: "Jacek Jarosiewicz" > >> À: "Mark Nelson" , "ceph-users" > >> > >> Envoyé: Jeudi 18 Juin 2015 11:49:11 > >> Objet: Re: [ceph-users] rbd performance issue - can't find bottleneck > >> > >> On 06/17/2015 04:19 PM, Mark Nelson wrote: > >>>> SSD's are INTEL SSDSC2BW240A4 > >>> > >>> Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see > >>> this thread by Stefan Priebe: &g
Re: [ceph-users] rbd performance issue - can't find bottleneck
Mark, thanks for putting it down this way. It does make sense. Does it mean that having the Intel 520s, which bypass the dsync is theat to the data stored on the journals? I do have a few of these installed, alongside with 530s. I did not plan to replace them just yet. Would it make more sense to get a small battery protected raid card in front of the 520s and 530s to protect against these types of scenarios? Cheers - Original Message - > From: "Mark Nelson" > To: "Andrei Mikhailovsky" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 19 June, 2015 5:08:31 PM > Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > > On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote: > > Mark, > > > > Thanks, I do understand that there is a risk of data loss by doing this. > > Having said this, ceph is designed to be fault tollerant and self > > repairing should something happen to individual journals, osds and server > > nodes. Isn't this a still good measure to compromise between data > > integrity and speed? So, by faking dsync and not actually doing this, you > > have a window of opportunity to data loss should a failure happen between > > the last flash and the moment of failure. > > > > Thus, if the ssd disk failure happens, regardless if dsync is used or not, > > would ceph still consider the osds behind the journal to be > > unavailable/lost and migrate the data around anyway and perform the > > necessary checks to make sure the data integrity is not compromised? If > > this is true, I would still consider using the dsync bypass in favour of > > the extra speed benefit. Unless I am missing a bigger picture and > > miscalculated something. > > > > Could someone please elaborate on this a bit further to understand the > > realy world threat of not using the dsync bypass? > > Hi Andrei, > > Basically the entire point of the Ceph journal is to guarantee that data > hits a persistent medium before the write gets acknowledged. Imagine a > scenario where you lose power just as the write happens. > > Scenario A: You have proper O_DSYNC writes. In this case, assuming the > SSD is behaving properly, you can be fairly confident that the write to > the local journal succeeded (or not). > > Scenario B: You bypass O_DSYNC. The journal write "completes" quickly, > but it's not actually written out to flash, just to the drive cache. If > the SSD has power loss protection it can theoretically write that data > out to the flash before it losses power. For this reason, drives with > PLP can often perform O_DSYNC writes very quickly even without this hack > (ie it can ignore ATA_CMD_FLUSH). > > For a drive like the 530 without PLP, there's no guarantee that the data > in cache will hit the flash. Ceph will *think* it did though, and the > risk is worse because the write "completes" so fast. Now you have a > scenario where ceph thinks something exists but it really doesn't (or > exists in a corrupted state). This leads to all sorts of problems. If > another OSD goes down and you have two copies of the data that disagree > with each other, what do you do? What if not all of the replica writes > succeeded but you have a copy of the data on the primary? Can you trust > it? Everything starts breaking down. 
> > Mark > > > > > Cheers > > > > Andrei > > > > > > - Original Message - > >> From: "Mark Nelson" > >> To: ceph-users@lists.ceph.com > >> Sent: Friday, 19 June, 2015 3:59:55 PM > >> Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck > >> > >> > >> > >> On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote: > >>> Hi guys, > >>> > >>> I also use a combination of intel 520 and 530 for my journals and have > >>> noticed that the latency and the speed of 520s is better than 530s. > >>> > >>> Could someone please confirm that doing the following at start up will > >>> stop > >>> the dsync on the relevant drives? > >>> > >>> # echo temporary write through > > >>> /sys/class/scsi_disk/1\:0\:0\:0/cache_type > >>> > >>> Do I need to patch my kernel for this or is this already implementable in > >>> vanilla? I am running 3.19.x branch from ubuntu testing repo. > >>> > >>> Would the above change the performance of 530s to be more like 520s? > >> > >> I need to comment that it's *really* not a good idea to do this if you >
[ceph-users] latest Hammer for Ubuntu precise
Hi, I seem to be missing the latest Hammer release 0.94.2 in the repo for Ubuntu precise. I can see the packages for trusty, but precise still shows 0.94.1. Is this an omission, or did you stop supporting precise? Or has something odd happened with my precise servers? Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
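On a precise host the quickest sanity check is to ask apt what the configured ceph.com repository actually offers after a refresh; if the Candidate line still says 0.94.1, the repository index itself is behind rather than the local machines:

apt-get update
apt-cache policy ceph
# The 'Candidate:' line shows the newest version the configured repo provides.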
Re: [ceph-users] latest Hammer for Ubuntu precise
Thanks Mate, I was under the same impression. Could someone at Inktank please help us with this problem? Is this intentional or has it simply been an error? Thanks Andrei -- Andrei Mikhailovsky Director Arhont Information Security Web: http://www.arhont.com http://www.wi-foo.com Tel: +44 (0)870 4431337 Fax: +44 (0)208 429 3111 PGP: Key ID - 0x2B3438DE PGP: Server - keyserver.pgp.com DISCLAIMER The information contained in this email is intended only for the use of the person(s) to whom it is addressed and may be confidential or contain legally privileged information. If you are not the intended recipient you are hereby notified that any perusal, use, distribution, copying or disclosure is strictly prohibited. If you have received this email in error please immediately advise us by return email at and...@arhont.com and delete and purge the email and any attachments without making a copy. - Original Message - From: "Gabri Mate" To: "Andrei Mikhailovsky" Cc: ceph-users@lists.ceph.com Sent: Monday, 22 June, 2015 6:28:11 PM Subject: Re: [ceph-users] latest Hammer for Ubuntu precise As far as I see the packages are there but the Packages file wasn't updated (correctly?) that's why we, Precise users do not see the updates. I am still wondering whether this is intentional or not. Probably not. :) Hopefully it will be sorted out soon. Mate On 00:14 Mon 22 Jun , Andrei Mikhailovsky wrote: > Hi, > > I seem to be missing the latest Hammer release 0.94.2 in the repo for Ubuntu > precise. I can see the packages for trusty, but precise still shows 0.94.1. > Is there a miss or did you stop supporting precise? Or perhaps something is > odd happened with my precise servers? > > Cheers > > Andrei > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph and EnhanceIO cache
Hi Nick, I've played with Flashcache and EnhanceIO, but I've decided not to use it for production in the end. The reason was that using both has increased the amount of slow requests that I had on the cluster and I have also noticed somewhat higher level of iowait on the vms. At that time, I didn't have much time to investigate the slow requests issue and I wasn't sure what exactly is causing them. All I can say is that after disabling the caching the slow requests have stopped. Perhaps others could share if they had any issues. THanks - Original Message - > From: "Nick Fisk" > To: "Dominik Zalewski" > Cc: ceph-users@lists.ceph.com > Sent: Friday, 26 June, 2015 11:12:25 AM > Subject: Re: [ceph-users] Ceph and EnhanceIO cache > I think flashcache bombs out, I must admit I have tested that yet, but as I > would only be running it in writecache mode, there is no requirement I can > think of for it to keep on running gracefully. > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Dominik Zalewski > Sent: 26 June 2015 10:54 > To: Nick Fisk > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Ceph and EnhanceIO cache > Thanks for your reply. > Do you know by any chance how flashcache handles SSD going offline? > Here is an snip from enhanceio wiki page: > Failure of an SSD device in read-only and write-through modes is > handled gracefully by allowing I/O to continue to/from the > source volume. An application may notice a drop in performance but it > will not receive any I/O errors. > Failure of an SSD device in write-back mode obviously results in the > loss of dirty blocks in the cache. To guard against this data loss, two > SSD devices can be mirrored via RAID 1. > EnhanceIO identifies device failures based on error codes. Depending on > whether the failure is likely to be intermittent or permanent, it takes > the best suited action. > Looking at mailing list and github commits, both flashcache and enhanceio had > not much going on since last year. > Dominik > On Fri, Jun 26, 2015 at 10:28 AM, Nick Fisk < n...@fisk.me.uk > wrote: > > > -Original Message- > > > > From: ceph-users [mailto: ceph-users-boun...@lists.ceph.com ] On Behalf > > > Of > > > > Dominik Zalewski > > > > Sent: 26 June 2015 09:59 > > > > To: ceph-users@lists.ceph.com > > > > Subject: [ceph-users] Ceph and EnhanceIO cache > > > > > > > > Hi, > > > > > > > > I came across this blog post mentioning using EnhanceIO (fork of > > > flashcache) > > > > as cache for OSDs. > > > > > > > > http://xo4t.mj.am/link/xo4t/jgu895v/1/DnECCniu-HWfTojpLN1IqA/aHR0cDovL3d3dy5zZWJhc3RpZW4taGFuLmZyL2Jsb2cvMjAxNC8xMC8wNi9jZXBoLWFuZC1lbmhhbmNlaW8v > > > > > > > > http://xo4t.mj.am/link/xo4t/jgu895v/2/FTxs29ShRIqNOekTAo2wKw/aHR0cHM6Ly9naXRodWIuY29tL3N0ZWMtaW5jL0VuaGFuY2VJTw > > > > > > > > I'm planning to test it with 5x 1TB HGST Travelstar 7k1000 2.5inch OSDs > > > and > > > > using 256GB Transcend SSD as enhanceio cache. > > > > > > > > I'm wondering if anyone is using EnhanceIO or others like bcache, > > > dm-cache > > > > with Ceph in production and what is your experience/results. > > > Not using in production, but have been testing all of the above both > > caching > > the OSD and RBD's. > > > If your RBD's are being used in scenarios where small sync writes are > > important (iscsi,database's) then caching the RBD's is almost essential. My > > findings:- > > > FlashCache - Probably the best of the bunch. 
Has sequential override and > > hopefully facebook will continue to maintain it > > > EnhanceIO - Nice that you can hot add the cache, but is likely to no longer > > be maintained, so risky for production > > > DMCache - Well maintained, but biggest problem is that it only caches > > writes > > for blocks that are already in cache > > > Bcache - Didn't really spend much time looking at this. Looks as if > > development activity is dying down and potential stability issues > > > DM-WriteBoost - Great performance, really suits RBD requirements. > > Unfortunately the ram buffering part seems to not play safe with iSCSI use. > > > With something like flashcache on a RBD I can max out the SSD with small > > sequential write IO's and it then passes them to the RBD in nice large > > IO's. > > Bursty random writes also benefit. > > > Caching the OSD's, or more specifically a small section where the levelDB > > lives can greatly improve small block write performance. Flashcache is best > > for this as you can limit the sequential cutoff to the levelDB transaction > > size. DMcache is potentially an option as well. You can probably halve OSD > > latency by doing this. > > > > > > > > Thanks > > > > > > > > Dominik > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@
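For completeness, setting up the flashcache variant Nick rates highest is a one-liner once the kernel module is loaded; the device names here are placeholders and the SSD partition's existing contents are lost:

# Writeback ('back') cache named osdcache in front of a spinning OSD disk:
flashcache_create -p back osdcache /dev/sdX1 /dev/sdY

# The OSD filesystem then goes on /dev/mapper/osdcache rather than the raw
# disk; tearing it down later is done with dmsetup remove and, once the
# cache is no longer in use, flashcache_destroy on the SSD partition.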
Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]
I can confirm that I am having similar issues with ubuntu vm guests using fio with bs=4k direct=1 numjobs=4 iodepth=16. Occasionally i see hang tasks, occasionally guest vm stops responding without leaving anything in the logs and sometimes i see kernel panic on the console. I typically leave the runtime of the fio test for 60 minutes and it tends to stop responding after about 10-30 mins. I am on ubuntu 12.04 with 3.5 kernel backport and using ceph 0.61.7 with qemu 1.5.0 and libvirt 1.0.2 Andrei - Original Message - From: "Oliver Francke" To: "Josh Durgin" Cc: ceph-users@lists.ceph.com, "Mike Dawson" , "Stefan Hajnoczi" , qemu-de...@nongnu.org Sent: Friday, 9 August, 2013 10:22:00 AM Subject: Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686] Hi Josh, just opened http://tracker.ceph.com/issues/5919 with all collected information incl. debug-log. Hope it helps, Oliver. On 08/08/2013 07:01 PM, Josh Durgin wrote: > On 08/08/2013 05:40 AM, Oliver Francke wrote: >> Hi Josh, >> >> I have a session logged with: >> >> debug_ms=1:debug_rbd=20:debug_objectcacher=30 >> >> as you requested from Mike, even if I think, we do have another story >> here, anyway. >> >> Host-kernel is: 3.10.0-rc7, qemu-client 1.6.0-rc2, client-kernel is >> 3.2.0-51-amd... >> >> Do you want me to open a ticket for that stuff? I have about 5MB >> compressed logfile waiting for you ;) > > Yes, that'd be great. If you could include the time when you saw the > guest hang that'd be ideal. I'm not sure if this is one or two bugs, > but it seems likely it's a bug in rbd and not qemu. > > Thanks! > Josh > >> Thnx in advance, >> >> Oliver. >> >> On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote: >>> On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote: Am 02.08.2013 um 23:47 schrieb Mike Dawson : > We can "un-wedge" the guest by opening a NoVNC session or running a > 'virsh screenshot' command. After that, the guest resumes and runs > as expected. At that point we can examine the guest. Each time we'll > see: >>> If virsh screenshot works then this confirms that QEMU itself is still >>> responding. Its main loop cannot be blocked since it was able to >>> process the screendump command. >>> >>> This supports Josh's theory that a callback is not being invoked. The >>> virtio-blk I/O request would be left in a pending state. >>> >>> Now here is where the behavior varies between configurations: >>> >>> On a Windows guest with 1 vCPU, you may see the symptom that the >>> guest no >>> longer responds to ping. >>> >>> On a Linux guest with multiple vCPUs, you may see the hung task message >>> from the guest kernel because other vCPUs are still making progress. >>> Just the vCPU that issued the I/O request and whose task is in >>> UNINTERRUPTIBLE state would really be stuck. >>> >>> Basically, the symptoms depend not just on how QEMU is behaving but >>> also >>> on the guest kernel and how many vCPUs you have configured. >>> >>> I think this can explain how both problems you are observing, Oliver >>> and >>> Mike, are a result of the same bug. At least I hope they are :). 
>>> >>> Stefan >> >> > -- Oliver Francke filoo GmbH Moltkestraße 25a 0 Gütersloh HRB4355 AG Gütersloh Geschäftsführer: J.Rehpöhler | C.Kunz Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
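For anyone trying to reproduce the hang, the guest-side workload described at the top of this message maps onto a single fio invocation. The mount point is a placeholder, and since the original post does not say which I/O pattern was used, random writes are assumed here:

fio --name=hangtest --directory=/mnt/rbd-disk --size=4G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --numjobs=4 --iodepth=16 --runtime=3600 --time_based --group_reporting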
Re: [ceph-users] SSD pool write performance
Hi i've also tested 4k performance and found similar results with fio and iozone tests as well as simple dd. I've noticed that my io rate doesn't go above 2k-3k in the virtual machines. I've got two servers with ssd journals but spindles for the osd. I've previusly tried to use nfs + zfs on the same hardware with the same drives acting as cache drives. The nfs performance was far better for 4k io. I was hitting around 60k when the storage servers were reading the test file from ram. It looks like some more optimisations have to be done to fix the current bottleneck. Having said this, the read performance from multiple clients would excel NFS by far. In nfs I would not see the total speeds over 450-500 but with ceph i was going over 1GB/s Andrei - Original Message - From: "Sergey Pimkov" To: ceph-users@lists.ceph.com Sent: Thursday, 10 October, 2013 8:47:32 PM Subject: [ceph-users] SSD pool write performance Hello! I'm testing small CEPH pool consists of some SSD drives (without any spinners). Ceph version is 0.67.4. Seems like write performance of this configuration is not so good as possible, when I testing it with small block size (4k). Pool configuration: 2 mons on separated hosts, one host with two OSD. First partition of each disk is used for journal and has 20Gb size, second is formatted as XFS and used for data (mount options: rw,noexec,nodev,noatime,nodiratime,inode64). 20% of space left unformatted. Journal aio and dio turned on. Each disk has about 15k IOPS with 4k blocks, iodepth 1 and 50k IOPS with 4k block, iodepth 16 (tested with fio). Linear throughput of disks is about 420Mb/s. Network throughput is 1Gbit/s. I use rbd pool with size 1 and want this pool to act like RAID0 at this time. Virtual machine (QEMU/KVM) on separated host is configured to use 100Gb RBD as second disk. Fio running in this machine (iodepth 16, buffered=0, direct=1, libaio, 4k randwrite) shows about 2.5-3k IOPS. Multiple quests with the same configuration shows similar summary result. Local kernel RBD on host with OSD also shows about 2-2.5k IOPS. Latency is about 7ms. I also tried to pre-fill RBD without any results. Atop shows about 90% disks utilization during tests. CPU utilization is about 400% (2x Xeon E5504 is installed on ceph node). There is a lot of free memory on host. Blktrace shows that about 4k operations (4k to about 40k bytes) completing every second on every disk. OSD throughput is about 30 MB/s. I expected to see about 2 x 50k/4 = 20-30k IOPS on RBD, so is that too optimistic for CEPH with such load or if I missed up something important? I also tried to use one disk as journal (20GB, last space left unformatted) and configure the next disk as OSD, this configuration have shown almost the same result. Playing with some osd/filestore/journal options with admin socket ended with no result. Please, tell me am I wrong with this setup? Or should I use more disks to get better performance with small concurrent writes? Or is ceph optimized for work with slow spinners and shouldn't be used with SSD disk only? Thank you very much in advance! 
My ceph configuration:

ceph.conf
==
[global]
auth cluster required = none
auth service required = none
auth client required = none

[client]
rbd cache = true
rbd cache max dirty = 0

[osd]
osd journal aio = true
osd max backfills = 4
osd recovery max active = 1
filestore max sync interval = 5

[mon.1]
host = ceph1
mon addr = 10.10.0.1:6789

[mon.2]
host = ceph2
mon addr = 10.10.0.2:6789

[osd.72]
host = ceph7
devs = /dev/sdd2
osd journal = /dev/sdd1

[osd.73]
host = ceph7
devs = /dev/sde2
osd journal = /dev/sde1

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
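To take the guest and the RBD client out of the picture and measure the raw small-block behaviour of the pool, the same workload can be pointed at the cluster directly with rados bench from any client node; the pool name and duration are placeholders:

# 4k writes for 60 seconds against the 'rbd' pool, first with a single
# in-flight op, then with 16 to show the effect of parallelism:
rados bench -p rbd 60 write -b 4096 -t 1
rados bench -p rbd 60 write -b 4096 -t 16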
[ceph-users] CloudStack 4.2 - radosgw / S3 storage issues
Hello guys, I am doing a test ACS setup to see how we can use Ceph for both Primary and Secondary storage services. I have now successfully added both Primary (cluster wide) and Secondary storage. However, I've noticed that my SSVM and CPVM are not being created, so digging in the logs revealed the following exceptions: The radosgw logs show the following: 2013-10-29 00:19:38.289487 7f2aa7d9f780 20 enqueued request req=0x2390060 2013-10-29 00:19:38.289518 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289521 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289529 7f2aa7d9f780 10 allocated request req=0x23452f0 2013-10-29 00:19:38.289572 7f2aa7d9f780 20 enqueued request req=0x23452f0 2013-10-29 00:19:38.289575 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289576 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289578 7f2aa7d9f780 20 req: 0x23452f0 2013-10-29 00:19:38.289610 7f2aa7d9f780 10 allocated request req=0x23a1630 2013-10-29 00:19:38.289613 7f2a54ff9700 20 dequeued request req=0x2390060 2013-10-29 00:19:38.289627 7f2a54ff9700 20 RGWWQ: 2013-10-29 00:19:38.289629 7f2a54ff9700 20 req: 0x23452f0 2013-10-29 00:19:38.289647 7f2a54ff9700 1 == starting new request req=0x2390060 = 2013-10-29 00:19:38.289650 7f2a36fcd700 20 dequeued request req=0x23452f0 2013-10-29 00:19:38.289675 7f2a36fcd700 20 RGWWQ: empty 2013-10-29 00:19:38.289685 7f2a36fcd700 1 == starting new request req=0x23452f0 = 2013-10-29 00:19:38.289715 7f2a54ff9700 2 req 1291:0.69::POST /template%2Ftmpl%2F1%2F4%2Fcentos55-x86_64%2Feec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2::initializing 2013-10-29 00:19:38.289723 7f2a54ff9700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289755 7f2a36fcd700 2 req 1292:0.69::POST /template%2Ftmpl%2F1%2F3%2Frouting-3%2Fsystemvmtemplate-2013-06-12-master-kvm.qcow2.bz2::initializing 2013-10-29 00:19:38.289761 7f2a36fcd700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289761 7f2a54ff9700 10 s->object=tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289770 7f2a54ff9700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289771 7f2a54ff9700 20 SCRIPT_URL=/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 2013-10-29 00:19:38.289773 7f2a54ff9700 20 SCRIPT_URI=http://cloudstack-secondary.arh-ibstorage1domain-name.com/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.b z2 2013-10-29 00:19:38.289775 7f2a54ff9700 20 HTTP_AUTHORIZATION=AWS S3-User-Key:v1NjAqxoFbROJOlBPRWyOSw8IZI= 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_HOST=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_DATE=Tue, 29 Oct 2013 00:19:38 GMT 2013-10-29 00:19:38.289777 7f2a54ff9700 20 HTTP_USER_AGENT=aws-sdk-java/1.3.22 Linux/3.5.0-42-generic OpenJDK_64-Bit_Server_VM/20.0-b12 2013-10-29 00:19:38.289778 7f2a54ff9700 20 CONTENT_TYPE=application/x-bzip2 2013-10-29 00:19:38.289780 7f2a54ff9700 20 HTTP_TRANSFER_ENCODING=chunked 2013-10-29 00:19:38.289782 7f2a54ff9700 20 HTTP_CONNECTION=Keep-Alive 2013-10-29 00:19:38.289784 7f2a54ff9700 20 PATH=/usr/local/bin:/usr/bin:/bin 2013-10-29 00:19:38.289785 7f2a54ff9700 20 SERVER_SIGNATURE= 2013-10-29 00:19:38.289786 7f2a54ff9700 20 SERVER_SOFTWARE=Apache/2.2.22 (Ubuntu) 2013-10-29 00:19:38.289787 7f2a54ff9700 20 SERVER_NAME=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 
00:19:38.289788 7f2a54ff9700 20 SERVER_ADDR=192.168.169.200 2013-10-29 00:19:38.289789 7f2a54ff9700 20 SERVER_PORT=80 2013-10-29 00:19:38.289790 7f2a54ff9700 20 REMOTE_ADDR=192.168.169.1 2013-10-29 00:19:38.289790 7f2a36fcd700 10 s->object=tmpl/1/3/routing-3/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289791 7f2a54ff9700 20 DOCUMENT_ROOT=/var/www 2013-10-29 00:19:38.289794 7f2a54ff9700 20 SCRIPT_FILENAME=/var/www/s3gw.fcgi 2013-10-29 00:19:38.289794 7f2a54ff9700 20 REMOTE_PORT=34613 2013-10-29 00:19:38.289795 7f2a54ff9700 20 GATEWAY_INTERFACE=CGI/1.1 2013-10-29 00:19:38.289796 7f2a54ff9700 20 SERVER_PROTOCOL=HTTP/1.1 2013-10-29 00:19:38.289797 7f2a54ff9700 20 REQUEST_METHOD=POST 2013-10-29 00:19:38.289796 7f2a36fcd700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289798 7f2a54ff9700 20 QUERY_STRING=page=template¶ms=/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2&uploads 2013-10-29 00:19:38.289798 7f2a36fcd700 20 SCRIPT_URL=/template/tmpl/1/3/routing-3/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 2013-10-29 00:19:38.289799 7f2a54ff9700 20 REQUEST_URI=/template%2Ftmpl%2F1%2F4%2Fcentos55-x86_64%2Feec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2?uploads 2013-10-29 00:19:38.289800 7f2a36fcd700 20 SCRIPT_URI=http://cloudstack-secondary.arh-ibstorag
Re: [ceph-users] CloudStack 4.2 - radosgw / S3 storage issues
To answer myself - there was a problem with my api secret key which rados generated. It has escaped the "/", which for some reason CloudStack couldn't understand. Removing the escape (\) character has solved the problem. Andrei - Original Message - From: "Andrei Mikhailovsky" To: ceph-users@lists.ceph.com Sent: Tuesday, 29 October, 2013 11:33:44 AM Subject: [ceph-users] CloudStack 4.2 - radosgw / S3 storage issues Hello guys, I am doing a test ACS setup to see how we can use Ceph for both Primary and Secondary storage services. I have now successfully added both Primary (cluster wide) and Secondary storage. However, I've noticed that my SSVM and CPVM are not being created, so digging in the logs revealed the following exceptions: The radosgw logs show the following: 2013-10-29 00:19:38.289487 7f2aa7d9f780 20 enqueued request req=0x2390060 2013-10-29 00:19:38.289518 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289521 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289529 7f2aa7d9f780 10 allocated request req=0x23452f0 2013-10-29 00:19:38.289572 7f2aa7d9f780 20 enqueued request req=0x23452f0 2013-10-29 00:19:38.289575 7f2aa7d9f780 20 RGWWQ: 2013-10-29 00:19:38.289576 7f2aa7d9f780 20 req: 0x2390060 2013-10-29 00:19:38.289578 7f2aa7d9f780 20 req: 0x23452f0 2013-10-29 00:19:38.289610 7f2aa7d9f780 10 allocated request req=0x23a1630 2013-10-29 00:19:38.289613 7f2a54ff9700 20 dequeued request req=0x2390060 2013-10-29 00:19:38.289627 7f2a54ff9700 20 RGWWQ: 2013-10-29 00:19:38.289629 7f2a54ff9700 20 req: 0x23452f0 2013-10-29 00:19:38.289647 7f2a54ff9700 1 == starting new request req=0x2390060 = 2013-10-29 00:19:38.289650 7f2a36fcd700 20 dequeued request req=0x23452f0 2013-10-29 00:19:38.289675 7f2a36fcd700 20 RGWWQ: empty 2013-10-29 00:19:38.289685 7f2a36fcd700 1 == starting new request req=0x23452f0 = 2013-10-29 00:19:38.289715 7f2a54ff9700 2 req 1291:0.69::POST /template%2Ftmpl%2F1%2F4%2Fcentos55-x86_64%2Feec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2::initializing 2013-10-29 00:19:38.289723 7f2a54ff9700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289755 7f2a36fcd700 2 req 1292:0.69::POST /template%2Ftmpl%2F1%2F3%2Frouting-3%2Fsystemvmtemplate-2013-06-12-master-kvm.qcow2.bz2::initializing 2013-10-29 00:19:38.289761 7f2a36fcd700 10 host=cloudstack-secondary.arh-ibstorage1domain-name.com rgw_dns_name=arh-ibstorage1-ibdomain-name.com 2013-10-29 00:19:38.289761 7f2a54ff9700 10 s->object=tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289770 7f2a54ff9700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289771 7f2a54ff9700 20 SCRIPT_URL=/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2 2013-10-29 00:19:38.289773 7f2a54ff9700 20 SCRIPT_URI=http://cloudstack-secondary.arh-ibstorage1domain-name.com/template/tmpl/1/4/centos55-x86_64/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.b z2 2013-10-29 00:19:38.289775 7f2a54ff9700 20 HTTP_AUTHORIZATION=AWS S3-User-Key:v1NjAqxoFbROJOlBPRWyOSw8IZI= 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_HOST=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 00:19:38.289776 7f2a54ff9700 20 HTTP_DATE=Tue, 29 Oct 2013 00:19:38 GMT 2013-10-29 00:19:38.289777 7f2a54ff9700 20 HTTP_USER_AGENT=aws-sdk-java/1.3.22 Linux/3.5.0-42-generic OpenJDK_64-Bit_Server_VM/20.0-b12 2013-10-29 00:19:38.289778 7f2a54ff9700 20 CONTENT_TYPE=application/x-bzip2 2013-10-29 00:19:38.289780 7f2a54ff9700 20 
HTTP_TRANSFER_ENCODING=chunked 2013-10-29 00:19:38.289782 7f2a54ff9700 20 HTTP_CONNECTION=Keep-Alive 2013-10-29 00:19:38.289784 7f2a54ff9700 20 PATH=/usr/local/bin:/usr/bin:/bin 2013-10-29 00:19:38.289785 7f2a54ff9700 20 SERVER_SIGNATURE= 2013-10-29 00:19:38.289786 7f2a54ff9700 20 SERVER_SOFTWARE=Apache/2.2.22 (Ubuntu) 2013-10-29 00:19:38.289787 7f2a54ff9700 20 SERVER_NAME=cloudstack-secondary.arh-ibstorage1domain-name.com 2013-10-29 00:19:38.289788 7f2a54ff9700 20 SERVER_ADDR=192.168.169.200 2013-10-29 00:19:38.289789 7f2a54ff9700 20 SERVER_PORT=80 2013-10-29 00:19:38.289790 7f2a54ff9700 20 REMOTE_ADDR=192.168.169.1 2013-10-29 00:19:38.289790 7f2a36fcd700 10 s->object=tmpl/1/3/routing-3/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2 s->bucket=template 2013-10-29 00:19:38.289791 7f2a54ff9700 20 DOCUMENT_ROOT=/var/www 2013-10-29 00:19:38.289794 7f2a54ff9700 20 SCRIPT_FILENAME=/var/www/s3gw.fcgi 2013-10-29 00:19:38.289794 7f2a54ff9700 20 REMOTE_PORT=34613 2013-10-29 00:19:38.289795 7f2a54ff9700 20 GATEWAY_INTERFACE=CGI/1.1 2013-10-29 00:19:38.289796 7f2a54ff9700 20 SERVER_PROTOCOL=HTTP/1.1 2013-10-29 00:19:38.289797 7f2a54ff9700 20 REQUEST_METHOD=POST 2013-10-29 00:19:38.289796 7f2a36fcd700 20 FCGI_ROLE=RESPONDER 2013-10-29 00:19:38.289798 7f2a54ff9700 20 QUERY_STRING=
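For anyone hitting the same problem, a rough sketch of how to check and, if needed, regenerate the radosgw S3 key (the uid "cloudstack" below is only a placeholder for whatever user was created for ACS; note that radosgw-admin prints JSON, which may show a "/" in the secret escaped as "\/" - the backslash is not part of the actual key):

# show the current access/secret key pair for the user
radosgw-admin user info --uid=cloudstack

# if the secret contains characters the S3 client cannot handle,
# generate a fresh secret for the same user and update the client config
radosgw-admin key create --uid=cloudstack --key-type=s3 --gen-secret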
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Ilya, Was wondering if you've had a chance to look into performance issues with rbd and the patched kernel? I've downloaded 3.16.3 and running some dd tests, which were producing hang tasks in the past. I've noticed that i can't get past 20mb/s on the rbd mounted volume. I am sure I was hitting over 60MB/s before. Cheers andrei - Original Message - > From: "Ilya Dryomov" > To: "Micha Krause" > Cc: ceph-users@lists.ceph.com > Sent: Wednesday, 24 September, 2014 3:33:20 PM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > On Wed, Sep 24, 2014 at 4:52 PM, Micha Krause > wrote: > > Hi, > > > >> Like I mentioned in my other reply, I'd be very interested in any > >> > >> similar messages on kernel other than 3.15.*, 3.16.1 and 3.16.2. > >> One > >> hung task stack trace is usually not enough to diagnose this sort > >> of > >> problems. > > > > > > Ok, here is a more complete dmesg output from kernel 3.14: > > > > [ 22.250600] rbd: loaded > > [ 23.159914] libceph: client24407525 fsid > > 46e857ee-855c-4165-8413-8950f8d081be > > [ 23.289691] libceph: mon1 10.210.34.11:6789 session established > > [ 23.890625] rbd2: unknown partition table > > [ 23.890702] rbd: rbd2: added with size 0x100 > > [ 23.937051] rbd0: unknown partition table > > [ 23.937144] rbd: rbd0: added with size 0x100 > > [ 24.052402] rbd1: unknown partition table > > [ 24.052479] rbd: rbd1: added with size 0xa00 > > [ 24.396333] rbd3: unknown partition table > > [ 24.396430] rbd: rbd3: added with size 0x100 > > [ 25.927373] SGI XFS with ACLs, security attributes, realtime, > > large > > block/inode numbers, no debug enabled > > [ 25.960975] XFS (rbd1): Mounting Filesystem > > [ 25.961072] XFS (rbd3): Mounting Filesystem > > [ 25.961637] XFS (rbd2): Mounting Filesystem > > [ 25.961708] XFS (rbd0): Mounting Filesystem > > [ 28.236952] XFS (rbd3): Starting recovery (logdev: internal) > > [ 28.794631] XFS (rbd1): Starting recovery (logdev: internal) > > [ 31.501516] XFS (rbd0): Starting recovery (logdev: internal) > > [ 35.498950] XFS (rbd2): Starting recovery (logdev: internal) > > [ 63.601465] XFS (rbd0): Ending recovery (logdev: internal) > > [ 64.214852] XFS (rbd3): Ending recovery (logdev: internal) > > [ 64.783531] rbd4: unknown partition table > > [ 64.784005] rbd: rbd4: added with size 0x100 > > [ 65.280960] XFS (rbd4): Mounting Filesystem > > [ 68.443439] XFS (rbd2): Ending recovery (logdev: internal) > > [ 69.030358] XFS (rbd4): Starting recovery (logdev: internal) > > [ 69.945523] rbd5: unknown partition table > > [ 69.946021] rbd: rbd5: added with size 0x100 > > [ 70.398567] XFS (rbd5): Mounting Filesystem > > [ 71.187934] XFS (rbd5): Starting recovery (logdev: internal) > > [ 74.144173] rbd6: unknown partition table > > [ 74.144630] rbd: rbd6: added with size 0x100 > > [ 75.402767] XFS (rbd6): Mounting Filesystem > > [ 76.133654] XFS (rbd6): Starting recovery (logdev: internal) > > [ 111.131893] XFS (rbd4): Ending recovery (logdev: internal) > > [ 112.460383] rbd7: unknown partition table > > [ 112.460898] rbd: rbd7: added with size 0x100 > > [ 116.834457] XFS (rbd5): Ending recovery (logdev: internal) > > [ 116.949218] XFS (rbd6): Ending recovery (logdev: internal) > > [ 166.357039] XFS (rbd1): Ending recovery (logdev: internal) > > [ 167.531353] XFS (rbd7): Mounting Filesystem > > [ 168.303166] XFS (rbd7): Starting recovery (logdev: internal) > > [ 172.477811] XFS (rbd7): Ending recovery (logdev: internal) > > [ 2038.723394] INFO: task kthreadd:2 blocked for more than 120 > > seconds. 
> > [ 2038.723497] Not tainted 3.14-0.bpo.1-amd64 #1 > > [ 2038.723553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables > > this message. > > [ 2038.723637] kthreadd D 88003fc14340 0 2 0 > > 0x > > [ 2038.723641] 88003fa3a8e0 0046 88003fa43628 > > 81813480 > > [ 2038.723644] 00014340 88003fa43fd8 00014340 > > 88003fa3a8e0 > > [ 2038.723646] 88003fa43638 886a7410 7fff > > 886a7408 > > [ 2038.723648] Call Trace: > > [ 2038.723660] [] ? schedule_timeout+0x1ed/0x250 > > [ 2038.723665] [] ? blk_finish_plug+0xb/0x30 > > [ 2038.723677] [] ? _xfs_buf_ioapply+0x277/0x2e0 > > [xfs] > > [ 2038.723680] [] ? > > wait_for_completion+0xa4/0x110 > > [ 2038.723685] [] ? try_to_wake_up+0x280/0x280 > > [ 2038.723691] [] ? xfs_bwrite+0x23/0x60 [xfs] > > [ 2038.723696] [] ? xfs_buf_iowait+0x96/0xf0 > > [xfs] > > [ 2038.723703] [] ? xfs_bwrite+0x23/0x60 [xfs] > > [ 2038.723711] [] ? xfs_reclaim_inode+0x2f4/0x310 > > [xfs] > > [ 2038.723720] [] ? > > xfs_reclaim_inodes_ag+0x1f7/0x320 > > [xfs] > > [ 2038.723729] [] ? > > xfs_reclaim_inodes_nr+0x2c/0x40 [xfs] > > [ 2038.723736] [] ? super_cache_scan+0x167/0x170 > > [ 2038.723742] [] ? shrink_slab_node+0x126/0x290 > > [ 2038.723746] [] ? vmpressure+0x23/0xa0 > > [ 2038.723750] [] ? shrink_slab+0x82/0x130 > >
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
I also had the hang task issues with 3.13.0-35-generic - Original Message - > From: "German Anders" > To: "Micha Krause" > Cc: ceph-users@lists.ceph.com > Sent: Wednesday, 24 September, 2014 4:35:15 PM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > 3.13.0-35 -generic? really? I found my self in a similar situation > like yours and making a downgrade to that version works fine, also > you could try 3.14.9-031, it work fine for me also. > German Anders > > --- Original message --- > > > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > > Server > > > From: Micha Krause > > > To: German Anders , Ilya Dryomov > > > > > Cc: > > > Date: Wednesday, 24/09/2014 12:33 > > > Hi, > > > > things work fine on kernel 3.13.0-35 > > > > > I can reproduce this on 3.13.10, and I had it once on 3.13.0-35 as > > well. > > > Micha Krause > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Guys, Have done some testing with 3.16.3-031603-generic downloaded from Ubuntu utopic branch. The hang task problem is gone when using large block size (tested with 1M and 4M) and I could no longer preproduce the hang tasks while doing 100 dd tests in a for loop. However, I can confirm that I am still getting hang tasks while working with a 4K block size. The hang tasks start after about an hour, but they do not cause the server crash. After a while the dd test times out and continues with the loop. This is what I was running: for i in {1..100} ; do time dd if=/dev/zero of=/tmp/mount/1G bs=4K count=25K oflag=direct ; done The following test definately produces the hang tasks like these: [23160.549785] INFO: task dd:2033 blocked for more than 120 seconds. [23160.588364] Tainted: G OE 3.16.3-031603-generic #201409171435 [23160.627998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23160.706856] dd D 000b 0 2033 23859 0x [23160.706861] 88011cec78c8 0082 88011cec78d8 88011cec7fd8 [23160.706865] 000143c0 000143c0 88048661bcc0 880113441440 [23160.706868] 88011cec7898 88067fd54cc0 880113441440 880113441440 [23160.706871] Call Trace: [23160.706883] [] schedule+0x29/0x70 [23160.706887] [] io_schedule+0x8f/0xd0 [23160.706893] [] dio_await_completion+0x54/0xd0 [23160.706897] [] do_blockdev_direct_IO+0x958/0xcc0 [23160.706903] [] ? wake_up_bit+0x2e/0x40 [23160.706908] [] ? jbd2_journal_dirty_metadata+0xc5/0x260 [23160.706914] [] ? ext4_get_block_write+0x20/0x20 [23160.706919] [] __blockdev_direct_IO+0x4c/0x50 [23160.706922] [] ? ext4_get_block_write+0x20/0x20 [23160.706928] [] ext4_ind_direct_IO+0xce/0x410 [23160.706931] [] ? ext4_get_block_write+0x20/0x20 [23160.706935] [] ext4_ext_direct_IO+0x1bb/0x2a0 [23160.706938] [] ? __ext4_journal_stop+0x78/0xa0 [23160.706942] [] ext4_direct_IO+0xec/0x1e0 [23160.706946] [] ? __mark_inode_dirty+0x53/0x2d0 [23160.706952] [] generic_file_direct_write+0xbb/0x180 [23160.706957] [] ? mnt_clone_write+0x12/0x30 [23160.706960] [] __generic_file_write_iter+0x2a7/0x350 [23160.706963] [] ext4_file_write_iter+0x111/0x3d0 [23160.706969] [] ? iov_iter_init+0x14/0x40 [23160.706976] [] new_sync_write+0x7b/0xb0 [23160.706978] [] vfs_write+0xc7/0x1f0 [23160.706980] [] SyS_write+0x4f/0xb0 [23160.706985] [] system_call_fastpath+0x1a/0x1f [23280.705400] INFO: task dd:2033 blocked for more than 120 seconds. [23280.745358] Tainted: G OE 3.16.3-031603-generic #201409171435 [23280.785069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23280.864158] dd D 000b 0 2033 23859 0x [23280.864164] 88011cec78c8 0082 88011cec78d8 88011cec7fd8 [23280.864167] 000143c0 000143c0 88048661bcc0 880113441440 [23280.864170] 88011cec7898 88067fd54cc0 880113441440 880113441440 [23280.864173] Call Trace: [23280.864185] [] schedule+0x29/0x70 [23280.864197] [] io_schedule+0x8f/0xd0 [23280.864203] [] dio_await_completion+0x54/0xd0 [23280.864207] [] do_blockdev_direct_IO+0x958/0xcc0 [23280.864213] [] ? wake_up_bit+0x2e/0x40 [23280.864218] [] ? jbd2_journal_dirty_metadata+0xc5/0x260 [23280.864224] [] ? ext4_get_block_write+0x20/0x20 [23280.864229] [] __blockdev_direct_IO+0x4c/0x50 [23280.864239] [] ? ext4_get_block_write+0x20/0x20 [23280.864244] [] ext4_ind_direct_IO+0xce/0x410 [23280.864247] [] ? ext4_get_block_write+0x20/0x20 [23280.864251] [] ext4_ext_direct_IO+0x1bb/0x2a0 [23280.864254] [] ? __ext4_journal_stop+0x78/0xa0 [23280.864258] [] ext4_direct_IO+0xec/0x1e0 [23280.864263] [] ? 
__mark_inode_dirty+0x53/0x2d0 [23280.864268] [] generic_file_direct_write+0xbb/0x180 [23280.864273] [] ? mnt_clone_write+0x12/0x30 [23280.864284] [] __generic_file_write_iter+0x2a7/0x350 [23280.864289] [] ext4_file_write_iter+0x111/0x3d0 [23280.864295] [] ? iov_iter_init+0x14/0x40 [23280.864300] [] new_sync_write+0x7b/0xb0 [23280.864302] [] vfs_write+0xc7/0x1f0 [23280.864307] [] SyS_write+0x4f/0xb0 [23280.864314] [] system_call_fastpath+0x1a/0x1f [23400.861043] INFO: task dd:2033 blocked for more than 120 seconds. [23400.901529] Tainted: G OE 3.16.3-031603-generic #201409171435 [23400.942255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [23401.020985] dd D 000b 0 2033 23859 0x [23401.020991] 88011cec78c8 0082 88011cec78d8 88011cec7fd8 [23401.020995] 000143c0 000143c0 88048661bcc0 880113441440 [23401.020997] 88011cec7898 88067fd54cc0 880113441440 880113441440 [23401.021001] Call Trace: [23401.021014] [] schedule+0x29/0x70 [23401.021025] [] io_schedule+0x8f/0xd0 [23401.021031] [] dio_await_completion+0x54/0xd0 [23401.021035] [] do_blockdev_direct_IO+0x958/0xcc0 [23401.021041] [] ? wake_up_bit+0x2e/0x40 [23401.0210
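A small addition for anyone trying to reproduce this: instead of waiting for the 120 second hung-task watchdog, the stacks of all blocked (D state) tasks can be dumped on demand, assuming the magic sysrq interface is enabled. A minimal sketch:

# check that sysrq is enabled (1, or a mask that allows task dumps)
sysctl kernel.sysrq

# dump stacks of all uninterruptible (blocked) tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# the watchdog interval itself can be inspected or changed here
cat /proc/sys/kernel/hung_task_timeout_secs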
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Right, I've stopped the tests because it is just getting ridiculous. Without rbd cache enabled, dd tests run extremely slow: dd if=/dev/zero of=/tmp/mount/1G bs=1M count=1000 oflag=direct 230+0 records in 230+0 records out 241172480 bytes (241 MB) copied, 929.71 s, 259 kB/s Any thoughts why I am getting 250kb/s instead of expected 100MB/s+ with large block size? How do I investigate what's causing this crappy performance? Cheers Andrei - Original Message - > From: "Andrei Mikhailovsky" > To: "Micha Krause" > Cc: ceph-users@lists.ceph.com > Sent: Thursday, 25 September, 2014 10:58:07 AM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > Guys, > Have done some testing with 3.16.3-031603-generic downloaded from > Ubuntu utopic branch. The hang task problem is gone when using large > block size (tested with 1M and 4M) and I could no longer preproduce > the hang tasks while doing 100 dd tests in a for loop. > However, I can confirm that I am still getting hang tasks while > working with a 4K block size. The hang tasks start after about an > hour, but they do not cause the server crash. After a while the dd > test times out and continues with the loop. This is what I was > running: > for i in {1..100} ; do time dd if=/dev/zero of=/tmp/mount/1G bs=4K > count=25K oflag=direct ; done > The following test definately produces the hang tasks like these: > [23160.549785] INFO: task dd:2033 blocked for more than 120 seconds. > [23160.588364] Tainted: G OE 3.16.3-031603-generic #201409171435 > [23160.627998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [23160.706856] dd D 000b 0 2033 23859 0x > [23160.706861] 88011cec78c8 0082 88011cec78d8 > 88011cec7fd8 > [23160.706865] 000143c0 000143c0 88048661bcc0 > 880113441440 > [23160.706868] 88011cec7898 88067fd54cc0 880113441440 > 880113441440 > [23160.706871] Call Trace: > [23160.706883] [] schedule+0x29/0x70 > [23160.706887] [] io_schedule+0x8f/0xd0 > [23160.706893] [] dio_await_completion+0x54/0xd0 > [23160.706897] [] do_blockdev_direct_IO+0x958/0xcc0 > [23160.706903] [] ? wake_up_bit+0x2e/0x40 > [23160.706908] [] ? > jbd2_journal_dirty_metadata+0xc5/0x260 > [23160.706914] [] ? ext4_get_block_write+0x20/0x20 > [23160.706919] [] __blockdev_direct_IO+0x4c/0x50 > [23160.706922] [] ? ext4_get_block_write+0x20/0x20 > [23160.706928] [] ext4_ind_direct_IO+0xce/0x410 > [23160.706931] [] ? ext4_get_block_write+0x20/0x20 > [23160.706935] [] ext4_ext_direct_IO+0x1bb/0x2a0 > [23160.706938] [] ? __ext4_journal_stop+0x78/0xa0 > [23160.706942] [] ext4_direct_IO+0xec/0x1e0 > [23160.706946] [] ? __mark_inode_dirty+0x53/0x2d0 > [23160.706952] [] > generic_file_direct_write+0xbb/0x180 > [23160.706957] [] ? mnt_clone_write+0x12/0x30 > [23160.706960] [] > __generic_file_write_iter+0x2a7/0x350 > [23160.706963] [] ext4_file_write_iter+0x111/0x3d0 > [23160.706969] [] ? iov_iter_init+0x14/0x40 > [23160.706976] [] new_sync_write+0x7b/0xb0 > [23160.706978] [] vfs_write+0xc7/0x1f0 > [23160.706980] [] SyS_write+0x4f/0xb0 > [23160.706985] [] system_call_fastpath+0x1a/0x1f > [23280.705400] INFO: task dd:2033 blocked for more than 120 seconds. > [23280.745358] Tainted: G OE 3.16.3-031603-generic #201409171435 > [23280.785069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. 
> [23280.864158] dd D 000b 0 2033 23859 0x > [23280.864164] 88011cec78c8 0082 88011cec78d8 > 88011cec7fd8 > [23280.864167] 000143c0 000143c0 88048661bcc0 > 880113441440 > [23280.864170] 88011cec7898 88067fd54cc0 880113441440 > 880113441440 > [23280.864173] Call Trace: > [23280.864185] [] schedule+0x29/0x70 > [23280.864197] [] io_schedule+0x8f/0xd0 > [23280.864203] [] dio_await_completion+0x54/0xd0 > [23280.864207] [] do_blockdev_direct_IO+0x958/0xcc0 > [23280.864213] [] ? wake_up_bit+0x2e/0x40 > [23280.864218] [] ? > jbd2_journal_dirty_metadata+0xc5/0x260 > [23280.864224] [] ? ext4_get_block_write+0x20/0x20 > [23280.864229] [] __blockdev_direct_IO+0x4c/0x50 > [23280.864239] [] ? ext4_get_block_write+0x20/0x20 > [23280.864244] [] ext4_ind_direct_IO+0xce/0x410 > [23280.864247] [] ? ext4_get_block_write+0x20/0x20 > [23280.864251] [] ext4_ext_direct_IO+0x1bb/0x2a0 > [23280.864254] [] ? __ext4_journal_stop+0x78/0xa0 > [23280.864258] [] ext4_direct_IO+0xec/0x1e0 > [23280.864263] [] ? __mark_inode_dirty+0x53/0x2d0 > [23280.864268] [] > generic_file_direct_write+0xbb/0x180 > [23280.864273] [] ? mnt_clone_write+0x12/0x30 > [23280.
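Since rbd cache keeps coming up in this thread, for reference this is what enabling it looks like in the [client] section of ceph.conf on the client host (the numbers below are only example values). One caveat worth flagging: these settings apply to librbd clients such as qemu; the kernel rbd driver used by rbd map does not use the librbd cache, so for a krbd-based gateway they will not change anything.

[client]
    rbd cache = true
    rbd cache size = 67108864                  # 64MB, example value
    rbd cache max dirty = 50331648             # example value
    rbd cache writethrough until flush = true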
Re: [ceph-users] Frequent Crashes on rbd to nfs gateway Server
Ilya, I've not used rbd map on older kernels. Just experimenting with rbd map to have an iscsi and nfs gateway service for hypervisors such as xenserver and vmware. I've tried it with the latest ubuntu LTS kernel 3.13 I believe and noticed the issue. Can you not reproduce the hang tasks when doing dd testing? have you tried 4K block sizes and running it for sometime, like I have done? Thanks Andrei - Original Message - > From: "Ilya Dryomov" > To: "Andrei Mikhailovsky" > Cc: "Micha Krause" , ceph-users@lists.ceph.com > Sent: Thursday, 25 September, 2014 12:04:37 PM > Subject: Re: [ceph-users] Frequent Crashes on rbd to nfs gateway > Server > On Thu, Sep 25, 2014 at 1:58 PM, Andrei Mikhailovsky > wrote: > > Guys, > > > > Have done some testing with 3.16.3-031603-generic downloaded from > > Ubuntu > > utopic branch. The hang task problem is gone when using large block > > size > > (tested with 1M and 4M) and I could no longer preproduce the hang > > tasks > > while doing 100 dd tests in a for loop. > > > > > > > > However, I can confirm that I am still getting hang tasks while > > working with > > a 4K block size. The hang tasks start after about an hour, but they > > do not > > cause the server crash. After a while the dd test times out and > > continues > > with the loop. This is what I was running: > > > > for i in {1..100} ; do time dd if=/dev/zero of=/tmp/mount/1G bs=4K > > count=25K > > oflag=direct ; done > > > > The following test definately produces the hang tasks like these: > > > > [23160.549785] INFO: task dd:2033 blocked for more than 120 > > seconds. > > [23160.588364] Tainted: G OE 3.16.3-031603-generic > > #201409171435 > > [23160.627998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables > > this message. > > [23160.706856] dd D 000b 0 2033 23859 > > 0x > > [23160.706861] 88011cec78c8 0082 88011cec78d8 > > 88011cec7fd8 > > [23160.706865] 000143c0 000143c0 88048661bcc0 > > 880113441440 > > [23160.706868] 88011cec7898 88067fd54cc0 880113441440 > > 880113441440 > > [23160.706871] Call Trace: > > [23160.706883] [] schedule+0x29/0x70 > > [23160.706887] [] io_schedule+0x8f/0xd0 > > [23160.706893] [] dio_await_completion+0x54/0xd0 > > [23160.706897] [] > > do_blockdev_direct_IO+0x958/0xcc0 > > [23160.706903] [] ? wake_up_bit+0x2e/0x40 > > [23160.706908] [] ? > > jbd2_journal_dirty_metadata+0xc5/0x260 > > [23160.706914] [] ? > > ext4_get_block_write+0x20/0x20 > > [23160.706919] [] __blockdev_direct_IO+0x4c/0x50 > > [23160.706922] [] ? > > ext4_get_block_write+0x20/0x20 > > [23160.706928] [] ext4_ind_direct_IO+0xce/0x410 > > [23160.706931] [] ? > > ext4_get_block_write+0x20/0x20 > > [23160.706935] [] ext4_ext_direct_IO+0x1bb/0x2a0 > > [23160.706938] [] ? __ext4_journal_stop+0x78/0xa0 > > [23160.706942] [] ext4_direct_IO+0xec/0x1e0 > > [23160.706946] [] ? __mark_inode_dirty+0x53/0x2d0 > > [23160.706952] [] > > generic_file_direct_write+0xbb/0x180 > > [23160.706957] [] ? mnt_clone_write+0x12/0x30 > > [23160.706960] [] > > __generic_file_write_iter+0x2a7/0x350 > > [23160.706963] [] > > ext4_file_write_iter+0x111/0x3d0 > > [23160.706969] [] ? iov_iter_init+0x14/0x40 > > [23160.706976] [] new_sync_write+0x7b/0xb0 > > [23160.706978] [] vfs_write+0xc7/0x1f0 > > [23160.706980] [] SyS_write+0x4f/0xb0 > > [23160.706985] [] system_call_fastpath+0x1a/0x1f > > [23280.705400] INFO: task dd:2033 blocked for more than 120 > > seconds. 
> > [23280.745358] Tainted: G OE 3.16.3-031603-generic > > #201409171435 > > [23280.785069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables > > this message. > > [23280.864158] dd D 000b 0 2033 23859 > > 0x > > [23280.864164] 88011cec78c8 0082 88011cec78d8 > > 88011cec7fd8 > > [23280.864167] 000143c0 000143c0 88048661bcc0 > > 880113441440 > > [23280.864170] 88011cec7898 88067fd54cc0 880113441440 > > 880113441440 > > [23280.864173] Call Trace: > > [23280.864185] [] schedule+0x29/0x70 > > [23280.864197] [] io_schedule+0x8f/0xd0 > > [23280.864203] [] dio_await_completion+0x54/0xd0 > > [23280.864207] [] > > do_blockdev_direct_IO+0x958/0xcc0 > > [23280.864213] [] ? wake_up_bit+0x2e/0x40 > > [23280.86421
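For anyone curious about the kind of setup being discussed, an rbd-map based NFS gateway is roughly the following (pool name, image name, size, mount point and export options are examples only):

# create an image and map it on the gateway host
rbd create --size 102400 nfs-pool/gw-image
rbd map nfs-pool/gw-image
rbd showmapped

# put a filesystem on the mapped device and export it over NFS
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /tmp/mount
echo "/tmp/mount *(rw,no_root_squash,sync)" >> /etc/exports
exportfs -ra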
[ceph-users] OSD log bound mismatch
Hello Cephers, I am having some issues with two osds, which are either flapping or just crashing without recovering back. I've got a log file 100MB or so for these osds which has been generated in a couple of hours if anyone is interested. I am running firefly with the latest updates on Ubuntu 12.04 with the latest LTS kernel. Looking at the osd logs I see a bunch of these entries: 2014-09-26 15:24:08.998918 7f73cb194700 0 log [ERR] : 5.108 log bound mismatch, info (53757'2809698,54690'2817536] actual [53757'2809532,54690'2817536] followed by slow requests like these: 2014-09-26 15:24:16.798355 7f73e247c700 0 log [WRN] : slow request 31.463701 seconds old, received at 2014-09-26 15:23:45.334567: osd_op(client.37190249.0:6372257 rbd_data.3a0cd42ae8944a.280d [set-alloc-hint object_size 4194304 write_size 4194304,write 2203648~4096] 5.27e2bd53 ack+ondisk+write e54691) v4 currently waiting for subops from 8 2014-09-26 15:24:16.798358 7f73e247c700 0 log [WRN] : slow request 31.004246 seconds old, received at 2014-09-26 15:23:45.794022: osd_op(client.38862536.0:2001456 rbd_data.250f7505e5edd7.0f4f [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3813376~4096] 5.5a3d6aa3 ack+ondisk+write e54691) v4 currently waiting for missing object The cluster seems to suffer and the guest vms are running a bit with a lag. Any idea how to fix these issues? Cheers ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
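For anyone else chasing this kind of thing, these are the sort of commands that help narrow down where the requests are stuck (osd.8 and pg 5.108 are taken from the log snippets above; run the daemon commands on the node that hosts the osd in question):

ceph health detail                     # which PGs/OSDs are currently unhappy
ceph pg 5.108 query                    # peering/recovery state of the PG with the bound mismatch
ceph osd perf                          # per-OSD commit/apply latencies
ceph daemon osd.8 dump_ops_in_flight   # ops currently blocked and the step they are waiting on
ceph daemon osd.8 dump_historic_ops    # recent slowest ops with per-stage timestamps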
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefy? Andrei - Original Message - > From: "Timur Nurlygayanov" > To: "Christian Balzer" > Cc: ceph-us...@ceph.com > Sent: Wednesday, 1 October, 2014 1:11:25 PM > Subject: Re: [ceph-users] Why performance of benchmarks with small > blocks is extremely small? > Hello Christian, > Thank you for your detailed answer! > I have other pre-production environment with 4 Ceph servers, 4 SSD > disks per Ceph server (each Ceph OSD node on the separate SSD disk) > Probably I should move journals to other disks or it is not required > in my case? > [root@ceph-node ~]# mount | grep ceph > /dev/sdb4 on /var/lib/ceph/osd/ceph-0 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > /dev/sde4 on /var/lib/ceph/osd/ceph-5 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > /dev/sdd4 on /var/lib/ceph/osd/ceph-2 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > /dev/sdc4 on /var/lib/ceph/osd/ceph-1 type xfs > (rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k,delaylog,user_xattr,data=writeback) > [root@ceph-node ~]# find /var/lib/ceph/osd/ | grep journal > /var/lib/ceph/osd/ceph-0/journal > /var/lib/ceph/osd/ceph-5/journal > /var/lib/ceph/osd/ceph-1/journal > /var/lib/ceph/osd/ceph-2/journal > My SSD disks have ~ 40k IOPS per disk, but on the VM I can see only ~ > 10k - 14k IOPS for disks operations. > To check this I execute the following command on VM with root > partition mounted on disk in Ceph storage: > root@test-io:/home/ubuntu# rm -rf /tmp/test && spew -d --write -r -b > 4096 10M /tmp/test > WTR: 56506.22 KiB/s Transfer time: 00:00:00 IOPS: 14126.55 > Is it expected result or I can improve the performance and get at > least 30k-40k IOPS on the VM disks? (I have 2x 10Gb/s networks > interfaces in LACP bonding for storage network, looks like network > can't be the bottleneck). > Thank you! > On Wed, Oct 1, 2014 at 6:50 AM, Christian Balzer < ch...@gol.com > > wrote: > > Hello, > > > [reduced to ceph-users] > > > On Sat, 27 Sep 2014 19:17:22 +0400 Timur Nurlygayanov wrote: > > > > Hello all, > > > > > > > > I installed OpenStack with Glance + Ceph OSD with replication > > > factor 2 > > > > and now I can see the write operations are extremly slow. > > > > For example, I can see only 0.04 MB/s write speed when I run > > > rados > > > bench > > > > with 512b blocks: > > > > > > > > rados bench -p test 60 write --no-cleanup -t 1 -b 512 > > > > > > > There are 2 things wrong with that this test: > > > 1. You're using rados bench, when in fact you should be testing > > from > > > within VMs. For starters a VM could make use of the rbd cache you > > enabled, > > > rados bench won't. > > > 2. Given the parameters of this test you're testing network latency > > more > > > than anything else. 
If you monitor the Ceph nodes (atop is a good > > tool for > > > that), you will probably see that neither CPU nor disks resources > > are > > > being exhausted. With a single thread rados puts that tiny block of > > 512 > > > bytes on the wire, the primary OSD for the PG has to write this to > > the > > > journal (on your slow, non-SSD disks) and send it to the secondary > > OSD, > > > which has to ACK the write to its journal back to the primary one, > > which > > > in turn then ACKs it to the client (rados bench) and then rados > > bench > > can > > > send the next packet. > > > You get the drift. > > > Using your parameters I can get 0.17MB/s on a pre-production > > cluster > > > that uses 4xQDR Infiniband (IPoIB) connections, on my shitty test > > cluster > > > with 1GB/s links I get similar results to you, unsurprisingly. > > > Ceph excels only with lots of parallelism, so an individual thread > > might > > > be slow (and in your case HAS to be slow, which has nothing to do > > with > > > Ceph per se) but many parallel ones will utilize the resources > > available. > > > Having data blocks that are adequately sized (4MB, the default > > rados > > size) > > > will help for bandwidth and the rbd cache inside a properly > > configured VM > > > should make that happen. > > > Of course in most real life scenarios you will run out of IOPS long > > before > > > you run out of bandwidth. > > > > Maintaining 1 concurrent writes of 512 bytes for up to
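To make the parallelism point above concrete from inside a guest rather than with rados bench, a test along these lines can be used (/dev/vdb, the run time and the block size are only examples); comparing iodepth=1 against iodepth=32 usually shows the gap between per-request latency and aggregate throughput:

# WARNING: writes directly to the device - use a scratch disk
# one outstanding request: dominated by network + journal latency
fio --name=lat --filename=/dev/vdb --rw=randwrite --bs=4k --iodepth=1 --direct=1 --ioengine=libaio --runtime=60 --time_based

# 32 outstanding requests: lets the OSDs work in parallel
fio --name=par --filename=/dev/vdb --rw=randwrite --bs=4k --iodepth=32 --direct=1 --ioengine=libaio --runtime=60 --time_based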
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
Greg, are they going to be a part of the next stable release? Cheers - Original Message - > From: "Gregory Farnum" > To: "Andrei Mikhailovsky" > Cc: "Timur Nurlygayanov" , "ceph-users" > > Sent: Wednesday, 1 October, 2014 3:04:51 PM > Subject: Re: [ceph-users] Why performance of benchmarks with small > blocks is extremely small? > On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky > wrote: > > Timur, > > > > As far as I know, the latest master has a number of improvements > > for ssd > > disks. If you check the mailing list discussion from a couple of > > weeks back, > > you can see that the latest stable firefly is not that well > > optimised for > > ssd drives and IO is limited. However changes are being made to > > address > > that. > > > > I am well surprised that you can get 10K IOps as in my tests I was > > not > > getting over 3K IOPs on the ssd disks which are capable of doing > > 90K IOps. > > > > P.S. does anyone know if the ssd optimisation code will be added to > > the next > > maintenance release of firefly? > Not a chance. The changes enabling that improved throughput are very > invasive and sprinkled all over the OSD; they aren't the sort of > thing > that one does backport or that one could put on top of a stable > release for any meaningful definition of "stable". :) > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph, ssds, hdds, journals and caching
Hello Cephers, I am a bit lost on the best ways of using ssds and hdds for a ceph cluster which uses rbd + kvm for guest vms. At the moment I've got 2 osd servers which currently have 8 hdd osds (max 16 bays) each and 4 ssd disks. Currently, I am using 2 ssds for osd journals and I've got 2x512GB ssds spare, which are waiting to be utilised. I am running Ubuntu 12.04 with the 3.13 kernel from Ubuntu 14.04 and the latest firefly release. I've tried to use the ceph cache pool tier and the results were not good. My small cluster slowed down by quite a bit and I've disabled the cache tier altogether. My question is: how would one best utilise the ssds to achieve a good performance boost compared to a pure hdd setup? Should I enable block level caching (the likes of bcache or similar) using all my ssds and not bother with ssd journals? Should I keep the journals on two ssds and utilise the remaining two ssds for bcache? Or is there a better alternative? Cheers Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
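For anyone setting up a similar layout, one common way to put OSD journals on SSD partitions with ceph-deploy looks roughly like this (host and device names are examples only, one SSD partition per OSD journal):

# hdd osd on /dev/sdc with its journal on the SSD partition /dev/sdb1
ceph-deploy osd create osd-server-1:/dev/sdc:/dev/sdb1
# repeat per OSD, pointing each at its own journal partition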
Re: [ceph-users] ceph, ssds, hdds, journals and caching
From: "Christian Balzer" > To: ceph-users@lists.ceph.com > Sent: Friday, 3 October, 2014 2:06:48 AM > Subject: Re: [ceph-users] ceph, ssds, hdds, journals and caching > On Thu, 2 Oct 2014 21:54:54 +0100 (BST) Andrei Mikhailovsky wrote: > > Hello Cephers, > > > > I am a bit lost on the best ways of using ssd and hdds for ceph > > cluster > > which uses rbd + kvm for guest vms. > > > > At the moment I've got 2 osd servers which currently have 8 hdd > > osds > > (max 16 bays) each and 4 ssd disks. Currently, I am using 2 ssds > > for osd > > journals and I've got 2x512GB ssd spare, which are waiting to be > > utilised. I am running Ubuntu 12.04 with 3.13 kernel from Ubuntu > > 14.04 > > and the latest firefly release. > > > In case you're planning to add more HDDs to those nodes, the obvious > use > case for those SSDs would be additional journals. From what I've seen so far, the two ssds that I currently use for journaling are happy serving 8 osds and I do not have much load on them. Having more osds per server might change that though, you are right. But at the moment I was hoping to improve the read performance, especially for small block sizes, hence I was thinking of adding the caching layer. > Also depending on your use case, a kernel newer than 3.13(which also > is > not getting any upstream updates/support) might be a good idea. Yes, indeed. I am considering the latest supported kernels from the Ubuntu team. > > I've tried to use ceph cache pool tier and the results were not > > good. My > > small cluster slowed down by quite a bit and i've disabled the > > cache > > tier altogether. > > > Yeah, this feature is clearly a case of "wait for the next major > release > or the one after that and try again". Does anyone know if the latest 0.80.6 firefly improves the cache behaviour? I've seen a bunch of changes in the cache tiering, however, I am not sure whether these address the stability of the tier or its efficiency. > > My question is how would one utilise the ssds in the best manner to > > achieve a good performance boost compared to a pure hdd setup? > > Should I > > enable block level caching (likes of bcache or similar) using all > > my > > ssds and do not bother using ssd journals? Should I keep the > > journals on > > two ssds and utilse the remaining two ssds for bcache? Or is there > > a > > better alternative? > > > This has all been discussed very recently here and the results where inconclusive at best. In some cases reads were improved, but for > writes it > was potentially worse than normal Ceph journals. > Have you monitored your storage nodes (I keep recommending atop for > this) > during a high load time? If your SSDs are becoming the bottleneck and > not > the actual disks (doubtful, but verify), more journals. I am monitoring my ceph cluster with Zabbix and I do not have a significant load on the servers at all. My biggest concern is the single thread performance of vms. From what I can see, this is the main downside of ceph. On average, I am not getting much over 35-40MB/s per thread in cold data reads. This is compared with a single hdd read performance of 150-160MB/s. Having about 1/4 of the raw device performance is a bit worrying, especially compared with what I've read: I should be getting about 1/2 of the raw drive performance for a single thread, but I am not. My hope was that with a caching tier I could increase it. > Other than that, maybe create a 1TB (usable space) SSD pool for > guests > with special speed requirements...
I am planning to do this for the database volumes; however, from what I've read so far, there are performance bottlenecks and the current stable firefly is not optimised for ssds. I've not tried it myself, but it doesn't look like having a dedicated ssd pool will bring a significant increase in performance. Has anyone tried using bcache or dm-cache with ceph? Any tips on how to integrate it? From what I've read so far, they require you to format the existing hdd, which is not feasible if you have an existing live cluster. Cheers > Christian > -- > Christian Balzer Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
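For what it's worth, the dedicated SSD pool mentioned above would be built roughly like this on firefly - a sketch only, with example bucket/host/pool names: the SSD OSDs first get their own CRUSH root, then a rule and a pool are pointed at it.

# separate CRUSH root and host bucket for the SSD OSDs
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket osd-server-1-ssd host
ceph osd crush move osd-server-1-ssd root=ssd
ceph osd crush set osd.16 1.0 host=osd-server-1-ssd    # repeat for each SSD OSD

# rule that only selects OSDs under the ssd root, and a pool that uses it
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd pool create ssd-pool 128 128
ceph osd crush rule dump ssd-rule                      # note the rule id
ceph osd pool set ssd-pool crush_ruleset <rule id from the dump above>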
Re: [ceph-users] ceph, ssds, hdds, journals and caching
That is what I am afraid of! - Original Message - > From: "Vladislav Gorbunov" > To: "Andrei Mikhailovsky" > Cc: "Christian Balzer" , ceph-users@lists.ceph.com > Sent: Friday, 3 October, 2014 12:04:37 PM > Subject: Re: [ceph-users] ceph, ssds, hdds, journals and caching > > Has anyone tried using bcache of dm-cache with ceph? > I'm tested lvmcache (based on dm-cache) with ceph 0.80.5 on CentOS 7. > Got unrecoverable error with xfs and total lost osd server. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph, ssds, hdds, journals and caching
> While I doubt you're hitting any particular bottlenecks on your > storage > servers I don't think Zabbix (very limited experience with it so I > might > be wrong) monitors everything, nor does it so at sufficiently high > freqency to show what is going on during a peak or fio test from a > client. > Thus my suggestion to stare at it live with atop (on all nodes). I will give it a go and see what happens during benchmarks. The Atop is rather informative indeed! There is a zabbix plugin/template for ceph, which gives a good overview of the ceph cluster. It does not provide the level of details that you would get from an admin socket, but rather an overview of the cluster thhroughput and io rates as well as PGs status. > > My biggest concern is the single > > thread performance of vms. From what I can see, this is the main > > downside of ceph. On average, I am not getting much over 35-40MB/s > > per > > thread in cold data reads. This is compared with a single hdd read > > performance of 150-160MB/s. Having about 1/4 of the raw device > > performance is a bit worring, especially compared with what i've > > read. I > > should be getting about 1/2 of the raw drive performance for a > > single > > thread, but I am not. My hope was with caching tier I can increase > > it. > > > Have a look at: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028552.html > Your numbers look very much like mine before increasing the > read_ahead > buffer. How much in performance did you gain by setting the read_ahead values? The performance figures that I get are using the following udev rules: # set read_ahead values ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/read_ahead_kb}="2048" ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/nr_requests}="2048" # set deadline scheduler for non-rotating disks ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop" # # set cfq scheduler for rotating disks ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq" Is there anything else that I am missing? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph, ssds, hdds, journals and caching
> Read the above link again, carefully. ^o^ > In in it I state that: > a) despite reading such in old posts, setting read_ahead on the OSD > nodes > has no or even negative effects. Inside the VM, it is very helpful: > b) the read speed increased about 10 times, from 35MB/s to 380MB/s Christian, are you getting 380MB/s from hdd osds or ssd osds? It seems a bit high for a single thread cold data throughput. Cheers > Regards, > Christian > > # set read_ahead values > > ACTION=="add|change", KERNEL=="sd[a-z]", > > ATTR{queue/rotational}=="1", > > ATTR{queue/read_ahead_kb}="2048" ACTION=="add|change", > > KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", > > ATTR{queue/nr_requests}="2048" # set deadline scheduler for > > non-rotating > > disks ACTION=="add|change", KERNEL=="sd[a-z]", > > ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop" # # set > > cfq > > scheduler for rotating disks ACTION=="add|change", > > KERNEL=="sd[a-z]", > > ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq" > > > > Is there anything else that I am missing? > -- > Christian Balzer Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
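In case it helps anyone following the thread, bumping read_ahead inside the guest (rather than on the OSD nodes) is just the following; vda is an example device name and 4MB is only one value worth trying:

# value in KB
echo 4096 > /sys/block/vda/queue/read_ahead_kb
# or the same thing via blockdev (value in 512-byte sectors)
blockdev --setra 8192 /dev/vda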
Re: [ceph-users] urgent- object unfound
Tuan, I had a similar behaviour when I've connected the cache pool tier. I resolved the issues by restarting all my osds. If your case is the same, try it and see if it works. If not, I guess the guys here and on the ceph irc might be able to help you. Cheers Andrei - Original Message - > From: "Ta Ba Tuan" > To: ceph-users@lists.ceph.com > Sent: Thursday, 16 October, 2014 1:36:01 PM > Subject: [ceph-users] urgent- object unfound > Hi eveyone, I use replicate 3, many unfound object and Ceph very > slow. > pg 6.9d8 is active+recovery_wait+degraded+remapped, acting [22,93], 4 > unfound > pg 6.766 is active+recovery_wait+degraded+remapped, acting [21,36], 1 > unfound > pg 6.73f is active+recovery_wait+degraded+remapped, acting [19,84], 2 > unfound > pg 6.63c is active+recovery_wait+degraded+remapped, acting [10,37], 2 > unfound > pg 6.56c is active+recovery_wait+degraded+remapped, acting [124,93], > 2 > unfound > pg 6.4d3 is active+recovering+degraded+remapped, acting [33,94], 2 > unfound > pg 6.4a5 is active+recovery_wait+degraded+remapped, acting [11,94], 2 > unfound > pg 6.2f9 is active+recovery_wait+degraded+remapped, acting [22,34], 2 > unfound > recovery 535673/52672768 objects degraded (1.017%); 17/17470639 > unfound > (0.000%) > ceph pg map 6.766 > osdmap e94990 pg 6.766 (6.766) -> up [49,36,21] acting [21,36] > I can't resolve it. I need data on those objects. Guide me, please! > Thank you! > -- > Tuan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
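For the archives, the usual places to look for unfound objects are roughly these (pg 6.766 is taken from the output above; marking objects lost discards data, so it is strictly a last resort):

ceph health detail               # lists the PGs with unfound objects
ceph pg 6.766 list_missing       # the unfound objects and which OSDs were queried
ceph pg 6.766 query              # check "might_have_unfound" for down OSDs worth bringing back
# last resort only - this gives up on the unfound objects:
# ceph pg 6.766 mark_unfound_lost revert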
[ceph-users] slow requests - what is causing them?
Hello cephers, I've been testing flashcache and enhanceio block device caching for the osds and I've noticed I have started getting slow requests. The caching type that I use is read-only, so all writes bypass the caching ssds and go directly to the osds, just as they did before introducing the caching layer. Prior to introducing caching, I rarely had slow requests. Judging by the logs, all slow requests look like these: 2014-10-16 01:09:15.600807 osd.7 192.168.168.200:6836/32031 100 : [WRN] slow request 30.999641 seconds old, received at 2014-10-16 01:08:44.601040: osd_op(client.36035566.0:16626375 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2007040~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9 2014-10-16 01:09:15.600811 osd.7 192.168.168.200:6836/32031 101 : [WRN] slow request 30.999581 seconds old, received at 2014-10-16 01:08:44.601100: osd_op(client.36035566.0:16626376 rbd_data.51da686763845e .5a15 [set-alloc-hint object_size 4194304 write_size 4194304,write 2039808~16384] 5.7b16421b snapc c4=[c4] ack+ondisk+write e61892) v4 currently waiting for subops from 9 2014-10-16 01:09:16.185530 osd.2 192.168.168.200:6811/31891 76 : [WRN] 20 slow requests, 1 included below; oldest blocked for > 57.003961 secs 2014-10-16 01:09:16.185564 osd.2 192.168.168.200:6811/31891 77 : [WRN] slow request 30.098574 seconds old, received at 2014-10-16 01:08:46.086854: osd_op(client.38917806.0:3481697 rbd_data.251d05e3db45a54. 0304 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2732032~8192] 5.e4683bbb ack+ondisk+write e61892) v4 currently waiting for subops from 11 2014-10-16 01:09:16.601020 osd.7 192.168.168.200:6836/32031 102 : [WRN] 16 slow requests, 2 included below; oldest blocked for > 43.531516 secs In general, I see between 0 and about 2,000 slow request log entries per day. On one day I saw over 100k entries, but it only happened once. I am struggling to understand what is causing the slow requests. If all the writes go the same path as before caching was introduced, how come I am getting them? How can I investigate this further? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
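To see which stage the writes are actually stuck at, the admin socket on the OSDs named in the "waiting for subops from N" lines can be queried, and the caching and backing devices watched at the same time (osd.9 and osd.11 are the ones from the log above; run these on the node hosting them):

ceph daemon osd.9 dump_ops_in_flight    # shows the event each blocked op is waiting on
ceph daemon osd.9 dump_historic_ops     # slowest recent ops with per-stage timestamps
iostat -x 2                             # watch util/await on the ssd cache devices vs the hdds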