Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-30 Thread Ilya Dryomov
On Tue, Jun 30, 2015 at 8:30 AM, Z Zhang  wrote:
> Hi Ilya,
>
> Thanks for your explanation. This makes sense. Will you make max_segments
> configurable? Could you please point me to the fix you have made? We might
> help to test it.

See "[PATCH] rbd: bump queue_max_segments" on the ceph-devel list.

Thanks,

Ilya
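
For anyone who wants to check whether a mapped krbd device is affected, the
limits the kernel advertises can be read from sysfs; a minimal sketch (the
rbd0 device name is an assumption):

# request-size limits advertised for the mapped device
cat /sys/block/rbd0/queue/max_segments
cat /sys/block/rbd0/queue/max_sectors_kb
cat /sys/block/rbd0/queue/max_hw_sectors_kb

# watch the average request size actually being submitted
iostat -x rbd0 1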
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS posix test performance

2015-06-30 Thread Ilya Dryomov
On Tue, Jun 30, 2015 at 6:57 AM, Yan, Zheng  wrote:
> I tried 4.1 kernel and 0.94.2 ceph-fuse. their performance are about the same.
>
> fuse:
> Files=191, Tests=1964, 60 wallclock secs ( 0.43 usr  0.08 sys +  1.16 cusr  
> 0.65 csys =  2.32 CPU)
>
> kernel:
> Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.08 sys +  1.21 cusr  
> 0.72 csys =  2.46 CPU)

On Friday, I tried stock 3.10 vs 4.1 and they were about the same as
well (a few tests failed in 3.10 though).  However Dan is using
3.10.0-229.7.2.el7.x86_64, which is 3.10 with a lot of backports, so
it's not quite the same.  Dan, are the numbers you are seeing
consistent?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] which version of ceph with my kernel 3.14 ?

2015-06-30 Thread Pascal GREGIS
Hello,

I installed Ceph Firefly (0.80) on my system last year.
I run a 3.14.43 kernel (recently upgraded from 3.14.4). Ceph seems to be 
working well in most cases, though I haven't used it in real production 
so far.
The only thing I noticed recently was some Input/Output Errors on a replicated 
block device that I mounted successively on one node, then on another after 
brutally stopping the first one, then back on the first node. But I'm not 
asking for help on that problem now, as long as I don't know more about it 
(Ceph is probably not to blame).

My questions are:
should I upgrade Ceph?
do I take risks doing it?
what would be the benefits?
are there behaviour changes that could cause problems?

I make minimal use of Ceph:
- only 2 storage nodes, with potentially several dozen terabytes of disk, 
but probably not more than 32 GB of RAM
- the Ceph block device only, used to create a replicated partition spanning 
almost all the disk space, on which backups are made

Thanks

Pascal
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-06-30 Thread Jan Schermer
Not having OSDs and KVMs compete against each other is one thing.
But there are more reasons to do this:

1) not moving the processes and threads between cores as much (better cache 
utilization)
2) aligning the processes with memory on NUMA systems (that means all modern 
dual-socket systems) - you don’t want your OSD running on CPU1 with memory 
allocated to CPU2
3) the same goes for other resources like NICs or storage controllers - but 
that’s less important and not always practical to do
4) you can limit the scheduling domain on Linux if you limit the cpuset for 
your OSDs (I’m not sure how important this is, just best practice)
5) you can easily limit memory or CPU usage and set priorities with much greater 
granularity than without cgroups
6) if you have HyperThreading enabled you get the most gain when the workloads 
on the two threads are dissimilar - so to get the highest throughput you have to 
pin the OSD to thread1 and KVM to thread2 on the same core. We’re not doing that 
because latency and performance of the core can vary depending on what the 
other thread is doing. But it might be useful to someone.

Some workloads exhibit a >100% performance gain when everything aligns in a NUMA 
system, compared to SMP mode on the same hardware. You likely won’t notice it 
on light workloads, as the interconnects (QPI) are very fast and there’s a lot 
of bandwidth, but for stuff like big OLAP databases or other data-manipulation 
workloads there’s a huge difference. And with Ceph being CPU hungry and memory 
intensive, we’re seeing some big gains here just by co-locating the memory with 
the processes….


Jan
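
For illustration, pinning a single OSD by hand with the raw cpuset interface
might look like the sketch below. The cgroup mount point, core list, NUMA node
and OSD id are assumptions; tools such as cgcreate/cgclassify, numactl, or the
scripts promised in this thread automate the same thing.

# create a cpuset restricted to the cores and memory of NUMA node 0
mkdir -p /sys/fs/cgroup/cpuset/ceph-osd.0
echo 0-5 > /sys/fs/cgroup/cpuset/ceph-osd.0/cpuset.cpus   # cores on node 0
echo 0   > /sys/fs/cgroup/cpuset/ceph-osd.0/cpuset.mems   # memory from node 0
echo <pid of ceph-osd -i 0> > /sys/fs/cgroup/cpuset/ceph-osd.0/tasks

# quick-and-dirty alternatives with the tools mentioned in the thread
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 ...
taskset -pc 0-5 <pid of ceph-osd -i 0>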

 
> On 30 Jun 2015, at 08:12, Ray Sun  wrote:
> 
> ​Sound great, any update please let me know.​
> 
> Best Regards
> -- Ray
> 
> On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer wrote:
> I promised you all our scripts for automatic cgroup assignment - they are in 
> our production already and I just need to put them on github, stay tuned 
> tomorrow :-)
> 
> Jan
> 
> 
>> On 29 Jun 2015, at 19:41, Somnath Roy wrote:
>> 
>> Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…
>>  
>> Thanks & Regards
>> Somnath
>>  
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>> ] On Behalf Of Ray Sun
>> Sent: Monday, June 29, 2015 9:19 AM
>> To: ceph-users@lists.ceph.com 
>> Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu 
>> core?
>>  
>> Cephers,
>> I want to bind each of my ceph-osd to a specific cpu core, but I didn't find 
>> any document to explain that, could any one can provide me some detailed 
>> information. Thanks.
>>  
>> Currently, my ceph is running like this:
>>  
>> oot  28692  1  0 Jun23 ?00:37:26 /usr/bin/ceph-mon -i 
>> seed.econe.com  --pid-file 
>> /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  40063  1  1 Jun23 ?02:13:31 /usr/bin/ceph-osd -i 0 
>> --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  42096  1  0 Jun23 ?01:33:42 /usr/bin/ceph-osd -i 1 
>> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  43263  1  0 Jun23 ?01:22:59 /usr/bin/ceph-osd -i 2 
>> --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  44527  1  0 Jun23 ?01:16:53 /usr/bin/ceph-osd -i 3 
>> --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  45863  1  0 Jun23 ?01:25:18 /usr/bin/ceph-osd -i 4 
>> --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  47462  1  0 Jun23 ?01:20:36 /usr/bin/ceph-osd -i 5 
>> --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph
>>  
>> Best Regards
>> -- Ray
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] RHEL 7.1 ceph-disk failures creating OSD with ver 0.94.2

2015-06-30 Thread HEWLETT, Paul (Paul)
We are using Ceph (Hammer) on CentOS 7 and RHEL 7.1 successfully.

One secret is to ensure that the disk is cleaned prior to running the ceph-disk
command. Because GPT tables are used, one must use the 'sgdisk -Z' command
to purge the disk of all partition tables. We usually issue this command
in the Red Hat kickstart file.

The second trick is not to use the mount command explicitly (as shown in
your post below).

The 'ceph-disk prepare' command should automatically start the OSD.

Paul
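
Put together, the sequence described above might look like this sketch. The
device names and cluster uuid are reused from the post below; treat it as
illustrative, not a verified recipe.

sgdisk -Z /dev/sdc      # wipe GPT and MBR structures left over from earlier use
ceph-disk prepare --cluster ceph \
    --cluster-uuid b2c2e866-ab61-4f80-b116-20fa2ea2ca94 \
    --fs-type xfs /dev/sdc /dev/sdb1     # data disk, journal partition
# no explicit mkdir/mount afterwards; udev / ceph-disk activate should mount
# the OSD filesystem and start the daemon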

On 29/06/2015 20:19, "Bruce McFarland" 
wrote:

>Do these issues occur in Centos 7 also?
>
>> -Original Message-
>> From: Bruce McFarland
>> Sent: Monday, June 29, 2015 12:06 PM
>> To: 'Loic Dachary'; 'ceph-users@lists.ceph.com'
>> Subject: RE: [ceph-users] RHEL 7.1 ceph-disk failures creating OSD with
>>ver
>> 0.94.2
>> 
>> Using the "manual" method of creating an OSD on RHEL 7.1 with Ceph 94.2
>> turns up an issue with the ondisk fsid of the journal device. From a
>>quick
>> web search I've found reference to this exact same issue from earlier
>>this
>> year. Is there a version of Ceph that works with RHEL 7.1???
>> 
>> [root@ceph0 ceph]# ceph-disk-prepare --cluster ceph --cluster-uuid
>> b2c2e866-ab61-4f80-b116-20fa2ea2ca94 --fs-type xfs /dev/sdc /dev/sdb1
>> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the
>> same device as the osd data The operation has completed successfully.
>> partx: /dev/sdc: error adding partition 1
>> meta-data=/dev/sdc1  isize=2048   agcount=4,
>>agsize=244188597
>> blks
>>  =   sectsz=512   attr=2, projid32bit=1
>>  =   crc=0finobt=0
>> data =   bsize=4096   blocks=976754385,
>>imaxpct=5
>>  =   sunit=0  swidth=0 blks
>> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
>> log  =internal log   bsize=4096   blocks=476930, version=2
>>  =   sectsz=512   sunit=0 blks, lazy-count=1
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>> The operation has completed successfully.
>> partx: /dev/sdc: error adding partition 1
>> [root@ceph0 ceph]# mkdir /var/lib/ceph/osd/ceph-0
>> [root@ceph0 ceph]# ll /var/lib/ceph/osd/ total 0 drwxr-xr-x. 2 root
>>root 6
>> Jun 29 12:01 ceph-0
>> [root@ceph0 ceph]# mount -t xfs /dev/sdc1 /var/lib/ceph/osd/ceph-0/
>> [root@ceph0 ceph]# mount
>> proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) sysfs on /sys
>>type
>> sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
>> devtmpfs on /dev type devtmpfs
>> (rw,nosuid,seclabel,size=57648336k,nr_inodes=14412084,mode=755)
>> securityfs on /sys/kernel/security type securityfs
>> (rw,nosuid,nodev,noexec,relatime) tmpfs on /dev/shm type tmpfs
>> (rw,nosuid,nodev,seclabel) devpts on /dev/pts type devpts
>> (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
>> tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
>> tmpfs on /sys/fs/cgroup type tmpfs
>> (rw,nosuid,nodev,noexec,seclabel,mode=755)
>> cgroup on /sys/fs/cgroup/systemd type cgroup
>> 
>>(rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/sys
>> temd-cgroups-agent,name=systemd)
>> pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
>> cgroup on /sys/fs/cgroup/cpuset type cgroup
>> (rw,nosuid,nodev,noexec,relatime,cpuset)
>> cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup
>> (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
>> cgroup on /sys/fs/cgroup/memory type cgroup
>> (rw,nosuid,nodev,noexec,relatime,memory)
>> cgroup on /sys/fs/cgroup/devices type cgroup
>> (rw,nosuid,nodev,noexec,relatime,devices)
>> cgroup on /sys/fs/cgroup/freezer type cgroup
>> (rw,nosuid,nodev,noexec,relatime,freezer)
>> cgroup on /sys/fs/cgroup/net_cls type cgroup
>> (rw,nosuid,nodev,noexec,relatime,net_cls)
>> cgroup on /sys/fs/cgroup/blkio type cgroup
>> (rw,nosuid,nodev,noexec,relatime,blkio)
>> cgroup on /sys/fs/cgroup/perf_event type cgroup
>> (rw,nosuid,nodev,noexec,relatime,perf_event)
>> cgroup on /sys/fs/cgroup/hugetlb type cgroup
>> (rw,nosuid,nodev,noexec,relatime,hugetlb)
>> configfs on /sys/kernel/config type configfs (rw,relatime)
>> /dev/mapper/rhel_ceph0-root on / type xfs
>> (rw,relatime,seclabel,attr2,inode64,noquota)
>> selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
>> systemd-1 on /proc/sys/fs/binfmt_misc type autofs
>> (rw,relatime,fd=35,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
>> debugfs on /sys/kernel/debug type debugfs (rw,relatime) mqueue on
>> /dev/mqueue type mqueue (rw,relatime,seclabel) hugetlbfs on
>> /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
>> /dev/mapper/rhel_ceph0-home on /home type xfs
>> (rw,relatime,seclabel,attr2,inode64,noquota)
>> /dev/sda2 on /boot type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
>> binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
>> fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
>> /de

[ceph-users] adding a extra monitor with ceph-deploy

2015-06-30 Thread Makkelie, R (ITCDCC) - KLM
I'm trying to add an extra monitor with ceph-deploy.
The current/first monitor was installed by hand.

When I do
ceph-deploy mon add HOST

the new monitor seems to assimilate the old monitor,
so the old/first monitor ends up in the same state as the new monitor
and is not aware of anything.

I needed to shut down and kill both monitors
and restore the data on the old/first monitor,

but for the first few tries it kept probing the new/failed monitor;
after the 10th try it worked again and stopped probing the failed monitor.

Can someone explain to me how I should approach this,
because I really would like to add extra monitors.

greetz
Ramonskie
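
For comparison, the manual procedure from the Ceph documentation for adding a
monitor goes roughly as follows. Hostnames, IDs and paths are placeholders; a
sketch, not a verified recipe.

# on an existing monitor / admin node
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap

# on the new monitor host
ceph-mon -i newmon --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
ceph mon add newmon <ip>[:6789]      # register it in the monmap
# then start mon.newmon via the init system and check quorum with 'ceph -s'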


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-06-30 Thread Huang Zhiteng
On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer  wrote:

> Not having OSDs and KVMs compete against each other is one thing.
> But there are more reasons to do this
>
> 1) not moving the processes and threads between cores that much (better
> cache utilization)
> 2) aligning the processes with memory on NUMA systems (that means all
> modern dual socket systems) - you don’t want your OSD running on CPU1 with
> memory allocated to CPU2
> 3) the same goes for other resources like NICs or storage controllers -
> but that’s less important and not always practical to do
> 4) you can limit the scheduling domain on linux if you limit the cpuset
> for your OSDs (I’m not sure how important this is, just best practice)
> 5) you can easily limit memory or CPU usage, set priority, with much
> greater granularity than without cgroups
> 6) if you have HyperThreading enabled you get the most gain when the
> workloads on the threads are dissimiliar - so to have the higher throughput
> you have to pin OSD to thread1 and KVM to thread2 on the same core. We’re
> not doing that because latency and performance of the core can vary
> depending on what the other thread is doing. But it might be useful to
> someone.
>
> Some workloads exhibit >100% performance gain when everything aligns in a
> NUMA system, compared to a SMP mode on the same hardware. You likely won’t
> notice it on light workloads, as the interconnects (QPI) are very fast and
> there’s a lot of bandwidth, but for stuff like big OLAP databases or other
> data-manipulation workloads there’s a huge difference. And with CEPH being
> CPU hungy and memory intensive, we’re seeing some big gains here just by
> co-locating the memory with the processes….
>
Could you elaborate a bit on this?  I'm interested to learn in what
situations memory locality helps Ceph, and to what extent.

>
>
> Jan
>
>
>
> On 30 Jun 2015, at 08:12, Ray Sun  wrote:
>
> ​Sound great, any update please let me know.​
>
> Best Regards
> -- Ray
>
> On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer  wrote:
>
>> I promised you all our scripts for automatic cgroup assignment - they are
>> in our production already and I just need to put them on github, stay tuned
>> tomorrow :-)
>>
>> Jan
>>
>>
>> On 29 Jun 2015, at 19:41, Somnath Roy  wrote:
>>
>> Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…
>>
>> Thanks & Regards
>> Somnath
>>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
>> ] *On Behalf Of *Ray Sun
>> *Sent:* Monday, June 29, 2015 9:19 AM
>> *To:* ceph-users@lists.ceph.com
>> *Subject:* [ceph-users] How to use cgroup to bind ceph-osd to a specific
>> cpu core?
>>
>> Cephers,
>> I want to bind each of my ceph-osd to a specific cpu core, but I didn't
>> find any document to explain that, could any one can provide me some
>> detailed information. Thanks.
>>
>> Currently, my ceph is running like this:
>>
>> oot  28692  1  0 Jun23 ?00:37:26 /usr/bin/ceph-mon -i
>> seed.econe.com --pid-file /var/run/ceph/mon.seed.econe.com.pid -c
>> /etc/ceph/ceph.conf --cluster ceph
>> root  40063  1  1 Jun23 ?02:13:31 /usr/bin/ceph-osd -i 0
>> --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  42096  1  0 Jun23 ?01:33:42 /usr/bin/ceph-osd -i 1
>> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  43263  1  0 Jun23 ?01:22:59 /usr/bin/ceph-osd -i 2
>> --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  44527  1  0 Jun23 ?01:16:53 /usr/bin/ceph-osd -i 3
>> --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  45863  1  0 Jun23 ?01:25:18 /usr/bin/ceph-osd -i 4
>> --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph
>> root  47462  1  0 Jun23 ?01:20:36 /usr/bin/ceph-osd -i 5
>> --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph
>>
>> Best Regards
>> -- Ray
>>
>> --
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Node reboot -- OSDs not "logging off" from cluster

2015-06-30 Thread Daniel Schneller

Hi!

We are seeing a strange - and problematic - behavior in our 0.94.1
cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.

When rebooting one of the nodes (e.g. for a kernel upgrade) the OSDs
do not seem to shut down correctly. Clients hang, and ceph osd tree shows
the OSDs of that node still up. Repeated runs of ceph osd tree show
them going down only after a while. For instance, here osd.7 is still up,
even though the machine is in the middle of the reboot cycle.

[C|root@control01]  ~ ➜  ceph osd tree
# id    weight  type name       up/down reweight
-1      36.2    root default
-2      7.24            host node01
0       1.81                    osd.0   up      1
5       1.81                    osd.5   up      1
10      1.81                    osd.10  up      1
15      1.81                    osd.15  up      1
-3      7.24            host node02
1       1.81                    osd.1   up      1
6       1.81                    osd.6   up      1
11      1.81                    osd.11  up      1
16      1.81                    osd.16  up      1
-4      7.24            host node03
2       1.81                    osd.2   down    1
7       1.81                    osd.7   up      1
12      1.81                    osd.12  down    1
17      1.81                    osd.17  down    1
-5      7.24            host node04
3       1.81                    osd.3   up      1
8       1.81                    osd.8   up      1
13      1.81                    osd.13  up      1
18      1.81                    osd.18  up      1
-6      7.24            host node05
4       1.81                    osd.4   up      1
9       1.81                    osd.9   up      1
14      1.81                    osd.14  up      1
19      1.81                    osd.19  up      1

So it seems, the services are either not shut down correctly when the
reboot begins, or they do not get enough time to actually let the
cluster know they are going away.

If I stop the OSDs on that node manually before the reboot, everything
works as expected and clients don't notice any interruptions.

[C|root@node03]  ~ ➜  service ceph-osd stop id=2
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=7
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=12
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=17
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  reboot

The upstart file was not changed from the packaged version.
Interestingly, the same Ceph version on a different cluster does _not_
show this behaviour.

Any ideas as to what is causing this or how to diagnose this?

Cheers,
Daniel
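
A hedged workaround sketch, building on the manual steps above, until the
shutdown ordering is sorted out (the OSD ids are those of node03; Upstart
syntax as used in the post):

ceph osd set noout                 # optional: avoid rebalancing during the reboot
for id in 2 7 12 17; do service ceph-osd stop id=$id; done
reboot
# ...once the node is back up and the OSDs have rejoined:
ceph osd unset noout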


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS posix test performance

2015-06-30 Thread Yan, Zheng

> On Jun 30, 2015, at 15:37, Ilya Dryomov  wrote:
> 
> On Tue, Jun 30, 2015 at 6:57 AM, Yan, Zheng  wrote:
>> I tried 4.1 kernel and 0.94.2 ceph-fuse. their performance are about the 
>> same.
>> 
>> fuse:
>> Files=191, Tests=1964, 60 wallclock secs ( 0.43 usr  0.08 sys +  1.16 cusr  
>> 0.65 csys =  2.32 CPU)
>> 
>> kernel:
>> Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.08 sys +  1.21 cusr  
>> 0.72 csys =  2.46 CPU)
> 
> On Friday, I tried stock 3.10 vs 4.1 and they were about the same as
> well (a few tests failed in 3.10 though).  However Dan is using
> 3.10.0-229.7.2.el7.x86_64, which is 3.10 with a lot of backports, so
> it's not quite the same.  Dan, are the numbers you are seeing
> consistent?
> 

I just tried 3.10.0-229.7.2.el7 kernel. it’s a little slower than 4.1 kernel

4.1:
Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.07 sys +  1.24 cusr  
0.76 csys =  2.52 CPU)

3.10.0-229.7.2.el7:
Files=191, Tests=1964, 75 wallclock secs ( 0.45 usr  0.09 sys +  1.73 cusr  
5.04 csys =  7.31 CPU)

Dan, did you run the test on the same client machine? I think network latency 
affects the run time of this test a lot.

Regards
Yan, Zheng

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Old vs New pool on same OSDs - Performance Difference

2015-06-30 Thread Nick Fisk
> -Original Message-
> From: Somnath Roy [mailto:somnath@sandisk.com]
> Sent: 29 June 2015 23:29
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Old vs New pool on same OSDs - Performance
> Difference
> 
> Nick,
> I think you are probably hitting the issue of crossing the xattr size limit 
> that
> XFS can inline (255 bytes). In your case "_" xattr size is 267 bytes.
> Sage talked about that in one of his earlier mails..You can try to apply the
> following patch (not backported to hammer yet) and see if it is improving
> anything.
> 
> c6cdb4081e366f471b372102905a1192910ab2da

OK, I will see if this is something I can apply. I haven't really got the 
facility to rebuild Ceph at the moment, so I will look into getting a VM set up 
to build some debs.

> 
> But, I am not sure why this will impact one pool but not other !
> In the slow pool do you have lot of snaps/clones/watchers ?

I don't think this is related to particular pools; I think the problem relates 
to RBDs that haven't been written to in a while. Overwriting an RBD's 
contents with a fio run seems to restore performance.

No, that’s the weird thing. I have a few pools with maybe 8-10 RBDs on them, 
and nothing special is being done. Can xattrs grow larger than 255 bytes if 
I'm not using any special features? Is there a way to "dump" the xattrs to see 
why they are taking up so much space?

> 
> 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: Nick Fisk [mailto:n...@fisk.me.uk]
> Sent: Monday, June 29, 2015 3:05 PM
> To: Somnath Roy
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] Old vs New pool on same OSDs - Performance
> Difference
> 
> Sorry, forgot to enable that. Here is another capture with it on and I think 
> you
> are spot on as I can see a 100ms delay doing the getattr request. Any ideas
> how to debug further? Thanks for the help by the way, really appreciated.
> 
> 2015-06-29 22:48:50.851645 7fd8a2a1e700 15 osd.1 26349 enqueue_op
> 0x522bf00 prio 63 cost 0 latency 0.000288 osd_op(client.2796502.0:136
> rb.0.1ba70.238e1f29.00011477 [read 65536~65536] 0.8b312528
> ack+read+known_if_redirected e26349) v5
> 2015-06-29 22:48:50.851735 7fd8b026e700 10 osd.1 26349 dequeue_op
> 0x522bf00 prio 63 cost 0 latency 0.000378 osd_op(client.2796502.0:136
> rb.0.1ba70.238e1f29.00011477 [read 65536~65536] 0.8b312528
> ack+read+known_if_redirected e26349) v5 pg pg[0.128( v 26335'8141858
> (26331'8138777,26335'8141858] local-les=26276 n=5243 ec=1 les/c
> 26276/26305 26264/26272/26272) [1,21,31] r=0 lpr=26272 crt=26335'8141855
> lcod 26335'8141857 mlcod 26335'8141857 active+clean]
> 2015-06-29 22:48:50.852076 7fd8b026e700 20 osd.1 pg_epoch: 26349
> pg[0.128( v 26335'8141858 (26331'8138777,26335'8141858] local-les=26276
> n=5243 ec=1 les/c 26276/26305 26264/26272/26272) [1,21,31] r=0 lpr=26272
> crt=26335'8141855 lcod 26335'8141857 mlcod 26335'8141857 active+clean]
> op_has_sufficient_caps pool=0 (rbd ) owner=0 need_read_cap=1
> need_write_cap=0 need_class_read_cap=0 need_class_write_cap=0 -> yes
> 2015-06-29 22:48:50.852252 7fd8b026e700 10 osd.1 pg_epoch: 26349
> pg[0.128( v 26335'8141858 (26331'8138777,26335'8141858] local-les=26276
> n=5243 ec=1 les/c 26276/26305 26264/26272/26272) [1,21,31] r=0 lpr=26272
> crt=26335'8141855 lcod 26335'8141857 mlcod 26335'8141857 active+clean]
> handle_message: 0x522bf00
> 2015-06-29 22:48:50.852471 7fd8b026e700 10 osd.1 pg_epoch: 26349
> pg[0.128( v 26335'8141858 (26331'8138777,26335'8141858] local-les=26276
> n=5243 ec=1 les/c 26276/26305 26264/26272/26272) [1,21,31] r=0 lpr=26272
> crt=26335'8141855 lcod 26335'8141857 mlcod 26335'8141857 active+clean]
> do_op osd_op(client.2796502.0:136 rb.0.1ba70.238e1f29.00011477 [read
> 65536~65536] 0.8b312528 ack+read+known_if_redirected e26349) v5
> may_read -> read-ordered flags ack+read+known_if_redirected
> 2015-06-29 22:48:50.852960 7fd8b026e700 10 osd.1 pg_epoch: 26349
> pg[0.128( v 26335'8141858 (26331'8138777,26335'8141858] local-les=26276
> n=5243 ec=1 les/c 26276/26305 26264/26272/26272) [1,21,31] r=0 lpr=26272
> crt=26335'8141855 lcod 26335'8141857 mlcod 26335'8141857 active+clean]
> get_object_context: obc NOT found in cache:
> 8b312528/rb.0.1ba70.238e1f29.00011477/head//0
> 2015-06-29 22:48:50.853016 7fd8b026e700 15
> filestore(/var/lib/ceph/osd/ceph-1) getattr
> 0.128_head/8b312528/rb.0.1ba70.238e1f29.00011477/head//0 '_'
> 2015-06-29 22:48:50.953748 7fd8b026e700 10
> filestore(/var/lib/ceph/osd/ceph-1) getattr
> 0.128_head/8b312528/rb.0.1ba70.238e1f29.00011477/head//0 '_' = 267
> 2015-06-29 22:48:50.953951 7fd8b026e700 15
> filestore(/var/lib/ceph/osd/ceph-1) getattr
> 0.128_head/8b312528/rb.0.1ba70.238e1f29.00011477/head//0 'snapset'
> 2015-06-29 22:48:50.954148 7fd8b026e700 10
> filestore(/var/lib/ceph/osd/ceph-1) getattr
> 0.128_head/8b312528/rb.0.1ba70.238e1f29.00011477/head//0 'snapset'
> = 31
> 2015-06-29 22:48:50.954379 7fd8

Re: [ceph-users] Old vs New pool on same OSDs - Performance Difference

2015-06-30 Thread Nick Fisk
Answering my own question, here are the contents of the xattrs for the object:

user.cephos.spill_out:
   30 00  0.

user.ceph._:
   0F 08 05 01 00 00 04 03 41 00 00 00 00 00 00 00A...
0010   20 00 00 00 72 62 2E 30 2E 31 62 61 37 30 2E 32 ...rb.0.1ba70.2
0020   33 38 65 31 66 32 39 2E 30 30 30 30 30 30 30 3138e1f29.0001
0030   31 34 37 37 FE FF FF FF FF FF FF FF 28 25 31 8B1477(%1.
0040   00 00 00 00 00 00 00 00 00 00 00 00 00 06 03 1C
0050   00 00 00 00 00 00 00 00 00 00 00 FF FF FF FF 00
0060   00 00 00 00 00 00 00 FF FF FF FF FF FF FF FF 00
0070   00 00 00 CB FF 11 00 00 00 00 00 70 08 00 00 8C...p
0080   B1 00 00 00 00 00 00 0C 04 00 00 02 02 15 00 00
0090   00 04 01 00 00 00 00 00 00 00 74 FE 28 00 00 00..t.(...
00A0   00 00 00 00 00 00 00 00 10 00 00 00 00 00 5D 04..].
00B0   07 55 00 9C 84 39 02 02 15 00 00 00 00 00 00 00.U...9..
00C0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00D0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00E0   00 00 00 00 00 00 8C B1 00 00 00 00 00 00 00 00
00F0   00 00 00 00 00 00 00 34 00 00 00 5D 04 07 55 B8...4...]..U.
0100   E3 F1 3A 00 5B 0A A5 FF FF FF FF   ..:.[..

user.ceph.snapset:
   02 02 19 00 00 00 00 00 00 00 00 00 00 00 01 00
0010   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ...


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: 30 June 2015 10:51
> To: 'Somnath Roy'
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Old vs New pool on same OSDs - Performance
> Difference
> 
> > -Original Message-
> > From: Somnath Roy [mailto:somnath@sandisk.com]
> > Sent: 29 June 2015 23:29
> > To: Nick Fisk
> > Cc: ceph-users@lists.ceph.com
> > Subject: RE: [ceph-users] Old vs New pool on same OSDs - Performance
> > Difference
> >
> > Nick,
> > I think you are probably hitting the issue of crossing the xattr size
> > limit that XFS can inline (255 bytes). In your case "_" xattr size is 267 
> > bytes.
> > Sage talked about that in one of his earlier mails..You can try to
> > apply the following patch (not backported to hammer yet) and see if it
> > is improving anything.
> >
> > c6cdb4081e366f471b372102905a1192910ab2da
> 
> Ok I will see if this is something I can apply, I haven't really got the 
> facility to
> rebuild Ceph at the moment, so I will look into getting a VM set up to build
> some debs.
> 
> >
> > But, I am not sure why this will impact one pool but not other !
> > In the slow pool do you have lot of snaps/clones/watchers ?
> 
> I don't think this is related to particular pools, I think the problem 
> relates to
> RBD's that haven't been written to in a while. Overwriting the RBD's contents
> with a fio run seems to restore performance.
> 
> No, that’s the weird thing. I have a few pools with maybe 8-10 RBD's on
> them, nothing special is being done. Can there be a case where xattr's can
> grow larger than 255 bytes if I'm not using any special features? Is there a
> way to "dump" the xattrs to see why they are taking up so much space?
> 
> >
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -Original Message-
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Monday, June 29, 2015 3:05 PM
> > To: Somnath Roy
> > Cc: ceph-users@lists.ceph.com
> > Subject: RE: [ceph-users] Old vs New pool on same OSDs - Performance
> > Difference
> >
> > Sorry, forgot to enable that. Here is another capture with it on and I
> > think you are spot on as I can see a 100ms delay doing the getattr
> > request. Any ideas how to debug further? Thanks for the help by the way,
> really appreciated.
> >
> > 2015-06-29 22:48:50.851645 7fd8a2a1e700 15 osd.1 26349 enqueue_op
> > 0x522bf00 prio 63 cost 0 latency 0.000288 osd_op(client.2796502.0:136
> > rb.0.1ba70.238e1f29.00011477 [read 65536~65536] 0.8b312528
> > ack+read+known_if_redirected e26349) v5
> > 2015-06-29 22:48:50.851735 7fd8b026e700 10 osd.1 26349 dequeue_op
> > 0x522bf00 prio 63 cost 0 latency 0.000378 osd_op(client.2796502.0:136
> > rb.0.1ba70.238e1f29.00011477 [read 65536~65536] 0.8b312528
> > ack+read+known_if_redirected e26349) v5 pg pg[0.128( v 26335'8141858
> > (26331'8138777,26335'8141858] local-les=26276 n=5243 ec=1 les/c
> > 26276/26305 26264/26272/26272) [1,21,31] r=0 lpr=26272
> > crt=26335'8141855 lcod 26335'8141857 mlcod 26335'8141857 active+clean]
> > 2015-06-29 22:48:50.852076 7fd8b026e700 20 osd.1 pg_epoch: 26349
> > pg[0.128( v 26335'8141858 (26331'8138777,26335'8141858]
> > local-les=26276
> > n=5243 ec=1 les/c 26276/26305 26264/26272/26272) [1,21,31] r=0
> > lpr=26272
> > crt=26335'8141855 lcod 26335'8141857 mlcod 26335'8141857 act

[ceph-users] CDS Jewel Wed/Thurs

2015-06-30 Thread Patrick McGarry
Hey cephers,

Just a friendly reminder that our Ceph Developer Summit for Jewel
planning is set to run tomorrow and Thursday. The schedule and dial in
information is available on the new wiki:

http://tracker.ceph.com/projects/ceph/wiki/CDS_Jewel

Please let me know if you have any questions. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Tuomas Juntunen
Hi

 

I have been trying to figure out why our 4k random reads in VM's are so bad.
I am using fio to test this.

 

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

 

Our setup is:

3 nodes with 36 OSDs, 18 SSDs (one SSD per two OSDs); each node has 64 GB of
memory and 2x 6-core CPUs

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

 

Any help would be appreciated

 

Thank you in advance.

 

Br,

Tuomas
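
For reference, an in-guest run of this kind of 4k random-read test usually
looks something like the sketch below; the options and the /dev/vdb device are
assumptions, since the exact command used is not shown here.

fio --name=4k-randread --filename=/dev/vdb --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
    --time_based --group_reporting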

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-06-30 Thread Jan Schermer
Hi all,
our script is available on GitHub

https://github.com/prozeta/pincpus 

I haven’t had much time to do a proper README, but I hope the configuration is 
self-explanatory enough for now.
What it does is pin each OSD into the most “empty” cgroup assigned to a NUMA 
node.

Let me know how it works for you!

Jan


> On 30 Jun 2015, at 10:50, Huang Zhiteng  wrote:
> 
> 
> 
> On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer wrote:
> Not having OSDs and KVMs compete against each other is one thing.
> But there are more reasons to do this
> 
> 1) not moving the processes and threads between cores that much (better cache 
> utilization)
> 2) aligning the processes with memory on NUMA systems (that means all modern 
> dual socket systems) - you don’t want your OSD running on CPU1 with memory 
> allocated to CPU2
> 3) the same goes for other resources like NICs or storage controllers - but 
> that’s less important and not always practical to do
> 4) you can limit the scheduling domain on linux if you limit the cpuset for 
> your OSDs (I’m not sure how important this is, just best practice)
> 5) you can easily limit memory or CPU usage, set priority, with much greater 
> granularity than without cgroups
> 6) if you have HyperThreading enabled you get the most gain when the 
> workloads on the threads are dissimiliar - so to have the higher throughput 
> you have to pin OSD to thread1 and KVM to thread2 on the same core. We’re not 
> doing that because latency and performance of the core can vary depending on 
> what the other thread is doing. But it might be useful to someone.
> 
> Some workloads exhibit >100% performance gain when everything aligns in a 
> NUMA system, compared to a SMP mode on the same hardware. You likely won’t 
> notice it on light workloads, as the interconnects (QPI) are very fast and 
> there’s a lot of bandwidth, but for stuff like big OLAP databases or other 
> data-manipulation workloads there’s a huge difference. And with CEPH being 
> CPU hungy and memory intensive, we’re seeing some big gains here just by 
> co-locating the memory with the processes….
> Could you elaborate a it on this?  I'm interested to learn in what situation 
> memory locality helps Ceph to what extend. 
> 
> 
> Jan
> 
>  
>> On 30 Jun 2015, at 08:12, Ray Sun wrote:
>> 
>> ​Sound great, any update please let me know.​
>> 
>> Best Regards
>> -- Ray
>> 
>> On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer wrote:
>> I promised you all our scripts for automatic cgroup assignment - they are in 
>> our production already and I just need to put them on github, stay tuned 
>> tomorrow :-)
>> 
>> Jan
>> 
>> 
>>> On 29 Jun 2015, at 19:41, Somnath Roy wrote:
>>> 
>>> Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…
>>>  
>>> Thanks & Regards
>>> Somnath
>>>  
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>>> ] On Behalf Of Ray Sun
>>> Sent: Monday, June 29, 2015 9:19 AM
>>> To: ceph-users@lists.ceph.com 
>>> Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu 
>>> core?
>>>  
>>> Cephers,
>>> I want to bind each of my ceph-osd to a specific cpu core, but I didn't 
>>> find any document to explain that, could any one can provide me some 
>>> detailed information. Thanks.
>>>  
>>> Currently, my ceph is running like this:
>>>  
>>> oot  28692  1  0 Jun23 ?00:37:26 /usr/bin/ceph-mon -i 
>>> seed.econe.com  --pid-file 
>>> /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  40063  1  1 Jun23 ?02:13:31 /usr/bin/ceph-osd -i 0 
>>> --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  42096  1  0 Jun23 ?01:33:42 /usr/bin/ceph-osd -i 1 
>>> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  43263  1  0 Jun23 ?01:22:59 /usr/bin/ceph-osd -i 2 
>>> --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  44527  1  0 Jun23 ?01:16:53 /usr/bin/ceph-osd -i 3 
>>> --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  45863  1  0 Jun23 ?01:25:18 /usr/bin/ceph-osd -i 4 
>>> --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  47462  1  0 Jun23 ?01:20:36 /usr/bin/ceph-osd -i 5 
>>> --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>  
>>> Best Regards
>>> -- Ray
>>> 
>>> 

Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Somnath Roy
Break it down: try fio-rbd to see what performance you are getting.
But I am really surprised you are getting > 100k IOPS for writes; did you check 
that it is actually hitting the disks?

Thanks & Regards
Somnath
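
For reference, "fio-rbd" here means fio's rbd ioengine, which drives librbd
directly and takes QEMU out of the path. A minimal sketch; the pool, image and
client names are assumptions:

fio --name=rbd-4k-randread --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=testimg --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting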

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas 
Juntunen
Sent: Tuesday, June 30, 2015 8:33 AM
To: 'ceph-users'
Subject: [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VM's are so bad. I 
am using fio to test this.

Write : 170k iops
Random write : 109k iops
Read : 64k iops
Random read : 1k iops

Our setup is:
3 nodes with 36 OSDs, 18 SSD's one SSD for two OSD's, each node has 64gb mem & 
2x6core cpu's
4 monitors running on other servers
40gbit infiniband with IPoIB
Openstack : Qemu-kvm for virtuals

Any help would be appreciated

Thank you in advance.

Br,
Tuomas




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS posix test performance

2015-06-30 Thread Dan van der Ster
On Tue, Jun 30, 2015 at 11:37 AM, Yan, Zheng  wrote:
>
>> On Jun 30, 2015, at 15:37, Ilya Dryomov  wrote:
>>
>> On Tue, Jun 30, 2015 at 6:57 AM, Yan, Zheng  wrote:
>>> I tried 4.1 kernel and 0.94.2 ceph-fuse. their performance are about the 
>>> same.
>>>
>>> fuse:
>>> Files=191, Tests=1964, 60 wallclock secs ( 0.43 usr  0.08 sys +  1.16 cusr  
>>> 0.65 csys =  2.32 CPU)
>>>
>>> kernel:
>>> Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.08 sys +  1.21 cusr  
>>> 0.72 csys =  2.46 CPU)
>>
>> On Friday, I tried stock 3.10 vs 4.1 and they were about the same as
>> well (a few tests failed in 3.10 though).  However Dan is using
>> 3.10.0-229.7.2.el7.x86_64, which is 3.10 with a lot of backports, so
>> it's not quite the same.  Dan, are the numbers you are seeing
>> consistent?
>>
>
> I just tried 3.10.0-229.7.2.el7 kernel. it’s a little slower than 4.1 kernel
>
> 4.1:
> Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.07 sys +  1.24 cusr  
> 0.76 csys =  2.52 CPU)
>
> 3.10.0-229.7.2.el7:
> Files=191, Tests=1964, 75 wallclock secs ( 0.45 usr  0.09 sys +  1.73 cusr  
> 5.04 csys =  7.31 CPU)
>
> Dan, did you run the test on the same client machine. I think network latency 
> affects run time of this test a lots
>

All the tests run on the same client, but it seems there is some
variability in the tests. Now I get:

Linux 3.10.0-229.7.2.el7.x86_64
Files=184, Tests=1957, 91 wallclock secs ( 0.72 usr  0.19 sys +  5.68
cusr 10.09 csys = 16.68 CPU)

Linux 4.1.0-1.el7.elrepo.x86_64
Files=184, Tests=1957, 84 wallclock secs ( 0.75 usr  0.44 sys +  5.17
cusr  9.77 csys = 16.13 CPU)

ceph-fuse 0.94.2:
Files=184, Tests=1957, 78 wallclock secs ( 0.69 usr  0.17 sys +  5.08
cusr  9.93 csys = 15.87 CPU)


I don't know if it's related -- and maybe I misunderstood something
fundamental -- but we don't manage to get FUSE or the kernel client to
use the page cache:

I have fuse_use_invalidate_cb = true, and then used fincore to see what's cached:

# df -h .
Filesystem  Size  Used Avail Use% Mounted on
ceph-fuse   444T  135T  309T  31% /cephfs
# cat zero > /dev/null
# linux-fincore zero
filename    size         total_pages  min_cached page  cached_pages  cached_size  cached_perc
--------    ----         -----------  ---------------  ------------  -----------  -----------
zero        104,857,600  25,600       -1               0             0            0.00
---
total cached size: 0

The kernel client has the same behaviour. Is this expected?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Tuomas Juntunen
Hi

 

It’s probably not hitting the disks, but that really doesn’t matter. The
point is we have very responsive VMs while writing, and that is what the
users will see.

The IOPS we get with sequential reads are good, but random reads are way too
low.

Is using SSDs as OSDs the only way to get this up, or is there some tunable
that would improve it? I would assume Linux caches reads in memory and
serves them from there, but at least for now we don’t see that.

 

Br,

Tuomas

 

 

From: Somnath Roy [mailto:somnath@sandisk.com] 
Sent: 30. kesäkuuta 2015 19:24
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

 

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting > 100k iops for write, did you
check it is hitting the disks ?

 

Thanks & Regards

Somnath

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Tuomas Juntunen
Sent: Tuesday, June 30, 2015 8:33 AM
To: 'ceph-users'
Subject: [ceph-users] Very low 4k randread performance ~1000iops

 

Hi

 

I have been trying to figure out why our 4k random reads in VM’s are so bad.
I am using fio to test this.

 

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

 

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb mem
& 2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

 

Any help would be appreciated

 

Thank you in advance.

 

Br,

Tuomas

 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Stephen Mercier
I ran into the same problem. What we did, and have been using since, is to 
increase the read-ahead buffer in the VMs to 16 MB (the sweet spot we settled 
on after testing). This isn't a solution for all scenarios, but for our uses, 
it was enough to get performance in line with expectations.

In Ubuntu, we added the following udev config to facilitate this:

root@ubuntu:/lib/udev/rules.d# vi /etc/udev/rules.d/99-virtio.rules 

SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", 
KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", 
ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"
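
A small follow-up sketch for applying and verifying the rule without a reboot
(the vda device name is an assumption):

udevadm control --reload-rules && udevadm trigger --subsystem-match=block
cat /sys/block/vda/queue/read_ahead_kb   # should now report 16384
blockdev --getra /dev/vda                # same value in 512-byte sectors (32768)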


Cheers,
-- 
Stephen Mercier
Senior Systems Architect
Attainia, Inc.
Phone: 866-288-2464 ext. 727
Email: stephen.merc...@attainia.com
Web: www.attainia.com

Capital equipment lifecycle planning & budgeting solutions for healthcare



On Jun 30, 2015, at 10:18 AM, Tuomas Juntunen wrote:

> Hi
>  
> It’s not probably hitting the disks, but that really doesn’t matter. The 
> point is we have very responsive VM’s while writing and that is what the 
> users will see.
> The iops we get with sequential read is good, but the random read is way too 
> low.
>  
> Is using SSD’s as OSD’s the only way to get it up? or is there some tunable 
> which would enhance it? I would assume Linux caches reads in memory and 
> serves them from there, but atleast now we don’t see it.
>  
> Br,
> Tuomas
>  
>  
> From: Somnath Roy [mailto:somnath@sandisk.com] 
> Sent: 30. kesäkuuta 2015 19:24
> To: Tuomas Juntunen; 'ceph-users'
> Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops
>  
> Break it down, try fio-rbd to see what is the performance you getting..
> But, I am really surprised you are getting > 100k iops for write, did you 
> check it is hitting the disks ?
>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Tuomas Juntunen
> Sent: Tuesday, June 30, 2015 8:33 AM
> To: 'ceph-users'
> Subject: [ceph-users] Very low 4k randread performance ~1000iops
>  
> Hi
>  
> I have been trying to figure out why our 4k random reads in VM’s are so bad. 
> I am using fio to test this.
>  
> Write : 170k iops
> Random write : 109k iops
> Read : 64k iops
> Random read : 1k iops
>  
> Our setup is:
> 3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb mem 
> & 2x6core cpu’s
> 4 monitors running on other servers
> 40gbit infiniband with IPoIB
> Openstack : Qemu-kvm for virtuals
>  
> Any help would be appreciated
>  
> Thank you in advance.
>  
> Br,
> Tuomas
>  
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Tuomas Juntunen
Hi

 

This is something I was thinking too, but it doesn’t take away the problem.

 

Could you share your setup and how many VMs you are running? That would give
us some starting point for sizing our setup.

 

Thanks

 

Br,

Tuomas

 

From: Stephen Mercier [mailto:stephen.merc...@attainia.com] 
Sent: 30. kesäkuuta 2015 20:32
To: Tuomas Juntunen
Cc: 'Somnath Roy'; 'ceph-users'
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 

I ran into the same problem. What we did, and have been using since, is
increased the read ahead buffer in the VMs to 16MB (The sweet spot we
settled on after testing). This isn't a solution for all scenarios, but for
our uses, it was enough to get performance inline with expectations.

 

In Ubuntu, we added the following udev config to facilitate this:

 

root@ubuntu:/lib/udev/rules.d# vi /etc/udev/rules.d/99-virtio.rules 

 

SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change",
KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384",
ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"

 

 

Cheers,

-- 

Stephen Mercier

Senior Systems Architect

Attainia, Inc.

Phone: 866-288-2464 ext. 727

Email: stephen.merc...@attainia.com

Web: www.attainia.com

 

Capital equipment lifecycle planning & budgeting solutions for healthcare

 

 

 

On Jun 30, 2015, at 10:18 AM, Tuomas Juntunen wrote:





Hi

 

It’s not probably hitting the disks, but that really doesn’t matter. The
point is we have very responsive VM’s while writing and that is what the
users will see.

The iops we get with sequential read is good, but the random read is way too
low.

 

Is using SSD’s as OSD’s the only way to get it up? or is there some tunable
which would enhance it? I would assume Linux caches reads in memory and
serves them from there, but atleast now we don’t see it.

 

Br,

Tuomas

 

 

From: Somnath Roy [mailto:somnath@sandisk.com] 
Sent: 30. kesäkuuta 2015 19:24
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

 

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting > 100k iops for write, did you
check it is hitting the disks ?

 

Thanks & Regards

Somnath

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Tuomas Juntunen
Sent: Tuesday, June 30, 2015 8:33 AM
To: 'ceph-users'
Subject: [ceph-users] Very low 4k randread performance ~1000iops

 

Hi

 

I have been trying to figure out why our 4k random reads in VM’s are so bad.
I am using fio to test this.

 

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

 

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb mem
& 2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

 

Any help would be appreciated

 

Thank you in advance.

 

Br,

Tuomas

 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Mark Nelson

Hi Tuomas,

Can you paste the command you ran to do the test?

Thanks,
Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:

Hi

It’s not probably hitting the disks, but that really doesn’t matter. The
point is we have very responsive VM’s while writing and that is what the
users will see.

The iops we get with sequential read is good, but the random read is way
too low.

Is using SSD’s as OSD’s the only way to get it up? or is there some
tunable which would enhance it? I would assume Linux caches reads in
memory and serves them from there, but atleast now we don’t see it.

Br,

Tuomas

*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* 30. kesäkuuta 2015 19:24
*To:* Tuomas Juntunen; 'ceph-users'
*Subject:* RE: [ceph-users] Very low 4k randread performance ~1000iops

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting > 100k iops for write, did
you check it is hitting the disks ?

Thanks & Regards

Somnath

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
Of *Tuomas Juntunen
*Sent:* Tuesday, June 30, 2015 8:33 AM
*To:* 'ceph-users'
*Subject:* [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VM’s are so
bad. I am using fio to test this.

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb
mem & 2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

Any help would be appreciated

Thank you in advance.

Br,

Tuomas







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Somnath Roy
read_ahead_kb should help you with a sequential workload, but if you are saying 
it also helps your random workload, try setting it both inside the VM and on 
the OSD side, and see if it makes any difference.

Thanks & Regards
Somnath

From: Tuomas Juntunen [mailto:tuomas.juntu...@databasement.fi]
Sent: Tuesday, June 30, 2015 10:49 AM
To: 'Stephen Mercier'
Cc: Somnath Roy; 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

Hi

This is something I was thinking too. But it doesn't take away the problem.

Can you share your setup and how many VM's you are running, that would give us 
some starting point on sizing our setup.

Thanks

Br,
Tuomas

From: Stephen Mercier [mailto:stephen.merc...@attainia.com]
Sent: 30. kesäkuuta 2015 20:32
To: Tuomas Juntunen
Cc: 'Somnath Roy'; 'ceph-users'
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

I ran into the same problem. What we did, and have been using since, is 
increased the read ahead buffer in the VMs to 16MB (The sweet spot we settled 
on after testing). This isn't a solution for all scenarios, but for our uses, 
it was enough to get performance inline with expectations.

In Ubuntu, we added the following udev config to facilitate this:

root@ubuntu:/lib/udev/rules.d# vi 
/etc/udev/rules.d/99-virtio.rules

SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", 
KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", 
ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"


Cheers,
--
Stephen Mercier
Senior Systems Architect
Attainia, Inc.
Phone: 866-288-2464 ext. 727
Email: stephen.merc...@attainia.com
Web: www.attainia.com

Capital equipment lifecycle planning & budgeting solutions for healthcare



On Jun 30, 2015, at 10:18 AM, Tuomas Juntunen wrote:

Hi

It's not probably hitting the disks, but that really doesn't matter. The point 
is we have very responsive VM's while writing and that is what the users will 
see.
The iops we get with sequential read is good, but the random read is way too 
low.

Is using SSD's as OSD's the only way to get it up? or is there some tunable 
which would enhance it? I would assume Linux caches reads in memory and serves 
them from there, but atleast now we don't see it.

Br,
Tuomas


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: 30. kesäkuuta 2015 19:24
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

Break it down, try fio-rbd to see what is the performance you getting..
But, I am really surprised you are getting > 100k iops for write, did you check 
it is hitting the disks ?

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas 
Juntunen
Sent: Tuesday, June 30, 2015 8:33 AM
To: 'ceph-users'
Subject: [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VM's are so bad. I 
am using fio to test this.

Write : 170k iops
Random write : 109k iops
Read : 64k iops
Random read : 1k iops

Our setup is:
3 nodes with 36 OSDs, 18 SSD's one SSD for two OSD's, each node has 64gb mem & 
2x6core cpu's
4 monitors running on other servers
40gbit infiniband with IPoIB
Openstack : Qemu-kvm for virtuals

Any help would be appreciated

Thank you in advance.

Br,
Tuomas



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 403-Forbidden error using radosgw

2015-06-30 Thread B, Naga Venkata
I am also having same issue can somebody help me out. But for me it is 
"HTTP/1.1 404 Not Found".
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-30 Thread Jan Schermer
I don’t run Ceph on btrfs, but isn’t this related to the btrfs snapshotting 
feature ceph uses to ensure a consistent journal?

Jan

> On 19 Jun 2015, at 14:26, Lionel Bouton  wrote:
> 
> On 06/19/15 13:42, Burkhard Linke wrote:
>> 
>> Forget the reply to the list...
>> 
>>  Forwarded Message 
>> Subject: Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
>> Date:Fri, 19 Jun 2015 09:06:33 +0200
>> From:Burkhard Linke 
>>  
>> 
>> To:  Lionel Bouton  
>> 
>> Hi,
>> 
>> On 06/18/2015 11:28 PM, Lionel Bouton wrote:
>> > Hi,
>> *snipsnap*
>> 
>> > - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
>> > of 10s with nearly 0 activity, one interval with a total amount of
>> > writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
>> 
>> Just a guess:
>> 
>> btrfs has a commit interval which defaults to 30 seconds.
>> 
>> You can verify this by changing the interval with the commit=XYZ mount 
>> option.
> 
> I know and I tested commit intervals of 60 and 120 seconds without any 
> change. As this is directly linked to filestore max sync interval I didn't 
> report this test result.
> 
> Best regards,
> 
> Lionel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph's RBD flattening and image options

2015-06-30 Thread Michał Chybowski

Hi,

Lately I've been working on XEN RBD SM and I'm using RBD's built-in 
snapshot functionality.


My system looks like this:
base image -> snapshot -> snaphot is used to create XEN VM's volumes -> 
volume snapshots (via rbd snap..) -> another VMs -> etc.


I'd like to be able to delete one of the volumes "in the chain" but 
without cloning base image blocks that hadn't changed yet (I'm trying to 
save up space) as it is done "by default" while flattening snapshots.

Is this possible and if it is, then how can I achieve this?

Also, we wanted to tune a bit ceph images while they're being created 
and we're unable to set any of the striping options succesfully - We're 
setting variables with correct values (even copying those in 
documentation) and we get errors from librbd during image creation or 
image mapping.


I can post you our ceph.conf and CRUSH map if needed.

--
Regards
Michał Chybowski
Tiktalik.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v9.0.1 released

2015-06-30 Thread Yuri Weinstein
Sage

We are still running nightlies on next and branches.

Just wanted to reaffirm that this is not time yet to start scheduling suites on 
"infernalis"?

Thx
YuriW

- Original Message -
From: "Sage Weil" 
To: ceph-annou...@ceph.com, ceph-de...@vger.kernel.org, ceph-us...@ceph.com, 
ceph-maintain...@ceph.com
Sent: Thursday, June 11, 2015 10:06:38 AM
Subject: v9.0.1 released

This development release is delayed a bit due to tooling changes in the 
build environment.  As a result the next one (v9.0.2) will have a bit more 
work than is usual.

Highlights here include lots of RGW Swift fixes, RBD feature work 
surrounding the new object map feature, more CephFS snapshot fixes, and a 
few important CRUSH fixes.

Notable Changes
---

* auth: cache/reuse crypto lib key objects, optimize msg signature check 
  (Sage Weil)
* build: allow tcmalloc-minimal (Thorsten Behrens)
* build: do not build ceph-dencoder with tcmalloc (#10691 Boris Ranto)
* build: fix pg ref disabling (William A. Kennington III)
* build: install-deps.sh improvements (Loic Dachary)
* build: misc fixes (Boris Ranto, Ken Dreyer, Owen Synge)
* ceph-authtool: fix return code on error (Gerhard Muntingh)
* ceph-disk: fix zap sgdisk invocation (Owen Synge, Thorsten Behrens)
* ceph-disk: pass --cluster arg on prepare subcommand (Kefu Chai)
* ceph-fuse, libcephfs: drop inode when rmdir finishes (#11339 Yan, Zheng)
* ceph-fuse,libcephfs: fix uninline (#11356 Yan, Zheng)
* ceph-monstore-tool: fix store-copy (Huangjun)
* common: add perf counter descriptions (Alyona Kiseleva)
* common: fix throttle max change (Henry Chang)
* crush: fix crash from invalid 'take' argument (#11602 Shiva Rkreddy, 
  Sage Weil)
* crush: fix divide-by-2 in straw2 (#11357 Yann Dupont, Sage Weil)
* deb: fix rest-bench-dbg and ceph-test-dbg dependencies (Ken Dreyer)
* doc: document region hostnames (Robin H. Johnson)
* doc: update release schedule docs (Loic Dachary)
* init-radosgw: run radosgw as root (#11453 Ken Dreyer)
* librados: fadvise flags per op (Jianpeng Ma)
* librbd: allow additional metadata to be stored with the image (Haomai 
  Wang)
* librbd: better handling for dup flatten requests (#11370 Jason Dillaman)
* librbd: cancel in-flight ops on watch error (#11363 Jason Dillaman)
* librbd: default new images to format 2 (#11348 Jason Dillaman)
* librbd: fast diff implementation that leverages object map (Jason 
  Dillaman)
* librbd: fix snapshot creation when other snap is active (#11475 Jason 
  Dillaman)
* librbd: new diff_iterate2 API (Jason Dillaman)
* librbd: object map rebuild support (Jason Dillaman)
* logrotate.d: prefer service over invoke-rc.d (#11330 Win Hierman, Sage 
  Weil)
* mds: avoid getting stuck in XLOCKDONE (#11254 Yan, Zheng)
* mds: fix integer truncation on large client ids (Henry Chang)
* mds: many snapshot and stray fixes (Yan, Zheng)
* mds: persist completed_requests reliably (#11048 John Spray)
* mds: separate safe_pos in Journaler (#10368 John Spray)
* mds: snapshot rename support (#3645 Yan, Zheng)
* mds: warn when clients fail to advance oldest_client_tid (#10657 Yan, 
  Zheng)
* misc cleanups and fixes (Danny Al-Gaaf)
* mon: fix average utilization calc for 'osd df' (Mykola Golub)
* mon: fix variance calc in 'osd df' (Sage Weil)
* mon: improve callout to crushtool (Mykola Golub)
* mon: prevent bucket deletion when referenced by a crush rule (#11602 
  Sage Weil)
* mon: prime pg_temp when CRUSH map changes (Sage Weil)
* monclient: flush_log (John Spray)
* msgr: async: many many fixes (Haomai Wang)
* msgr: simple: fix clear_pipe (#11381 Haomai Wang)
* osd: add latency perf counters for tier operations (Xinze Chi)
* osd: avoid multiple hit set insertions (Zhiqiang Wang)
* osd: break PG removal into multiple iterations (#10198 Guang Yang)
* osd: check scrub state when handling map (Jianpeng Ma)
* osd: fix endless repair when object is unrecoverable (Jianpeng Ma, Kefu 
  Chai)
* osd: fix pg resurrection (#11429 Samuel Just)
* osd: ignore non-existent osds in unfound calc (#10976 Mykola Golub)
* osd: increase default max open files (Owen Synge)
* osd: prepopulate needs_recovery_map when only one peer has missing 
  (#9558 Guang Yang)
* osd: relax reply order on proxy read (#11211 Zhiqiang Wang)
* osd: skip promotion for flush/evict op (Zhiqiang Wang)
* osd: write journal header on clean shutdown (Xinze Chi)
* qa: run-make-check.sh script (Loic Dachary)
* rados bench: misc fixes (Dmitry Yatsushkevich)
* rados: fix error message on failed pool removal (Wido den Hollander)
* radosgw-admin: add 'bucket check' function to repair bucket index 
  (Yehuda Sadeh)
* rbd: allow unmapping by spec (Ilya Dryomov)
* rbd: deprecate --new-format option (Jason Dillman)
* rgw: do not set content-type if length is 0 (#11091 Orit Wasserman)
* rgw: don't use end_marker for namespaced object listing (#11437 Yehuda 
  Sadeh)
* rgw: fail if parts not specified on multipart upload (#11435 Yehuda 
  Sadeh)
* rgw: fix GET on swift account when limit == 0 (#1

Re: [ceph-users] rbd_cache, limiting read on high iops around 40k

2015-06-30 Thread pushpesh sharma
Just an update: there seems to be no proper way to pass the iothread
parameter from openstack-nova (at least not in the Juno release). So a
default single iothread per VM is all we have. In conclusion, a
nova instance's max iops on ceph rbd will be limited to 30-40K.

On Tue, Jun 16, 2015 at 10:08 PM, Alexandre DERUMIER
 wrote:
> Hi,
>
> some news about qemu with tcmalloc vs jemmaloc.
>
> I'm testing with multiple disks (with iothreads) in 1 qemu guest.
>
> And if tcmalloc is a little faster than jemmaloc,
>
> I have hit a lot of time the tcmalloc::ThreadCache::ReleaseToCentralCache bug.
>
> increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, don't help.
>
>
> with multiple disk, I'm around 200k iops with tcmalloc (before hitting the 
> bug) and 350kiops with jemmaloc.
>
> The problem is that when I hit malloc bug, I'm around 4000-1 iops, and 
> only way to fix is is to restart qemu ...
>
>
>
> - Original Message -
> From: "pushpesh sharma" 
> To: "aderumier" 
> Cc: "Somnath Roy" , "Irek Fasikhov" 
> , "ceph-devel" , "ceph-users" 
> 
> Sent: Friday 12 June 2015 08:58:21
> Subject: Re: rbd_cache, limiting read on high iops around 40k
>
> Thanks, posted the question in openstack list. Hopefully will get some
> expert opinion.
>
> On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER
>  wrote:
>> Hi,
>>
>> here a libvirt xml sample from libvirt src
>>
>> (you need to define  number, then assign then in disks).
>>
>> I don't use openstack, so I really don't known how it's working with it.
>>
>>
>> 
>> QEMUGuest1
>> c7a5fdbd-edaf-9455-926a-d65c16db1809
>> 219136
>> 219136
>> 2
>> 2
>> 
>> hvm
>> 
>> 
>> 
>> destroy
>> restart
>> destroy
>> 
>> /usr/bin/qemu
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>
>>
>> - Original Message -
>> From: "pushpesh sharma" 
>> To: "aderumier" 
>> Cc: "Somnath Roy" , "Irek Fasikhov" 
>> , "ceph-devel" , "ceph-users" 
>> 
>> Sent: Friday 12 June 2015 07:52:41
>> Subject: Re: rbd_cache, limiting read on high iops around 40k
>>
>> Hi Alexandre,
>>
>> I agree with your rational, of one iothread per disk. CPU consumed in
>> IOwait is pretty high in each VM. But I am not finding a way to set
>> the same on a nova instance. I am using openstack Juno with QEMU+KVM.
>> As per libvirt documentation for setting iothreads, I can edit
>> domain.xml directly and achieve the same effect. However in as in
>> openstack env domain xml is created by nova with some additional
>> metadata, so editing the domain xml using 'virsh edit' does not seems
>> to work(I agree, it is not a very cloud way of doing things, but a
>> hack). Changes made there vanish after saving them, due to reason
>> libvirt validation fails on the same.
>>
>> #virsh dumpxml instance-00c5 > vm.xml
>> #virt-xml-validate vm.xml
>> Relax-NG validity error : Extra element cpu in interleave
>> vm.xml:1: element domain: Relax-NG validity error : Element domain
>> failed to validate content
>> vm.xml fails to validate
>>
>> Second approach I took was to setting QoS in volumes types. But there
>> is no option to set iothreads per volume, there are parameter realted
>> to max_read/wrirte ops/bytes.
>>
>> Thirdly, editing Nova flavor and proving extra specs like
>> hw:cpu_socket/thread/core, can change guest CPU topology however again
>> no way to set iothread. It does accept hw_disk_iothreads(no type check
>> in place, i believe ), but can not pass the same in domain.xml.
>>
>> Could you suggest me a way to set the same.
>>
>> -Pushpesh
>>
>> On Wed, Jun 10, 2015 at 12:59 PM, Alexandre DERUMIER
>>  wrote:
>I need to try out the performance on qemu soon and may come back to you if 
>I need some qemu setting trick :-)
>>>
>>> Sure no problem.
>>>
>>> (BTW, I can reach around 200k iops in 1 qemu vm with 5 virtio disks with 1 
>>> iothread by disk)
>>>
>>>
>>> - Original Message -
>>> From: "Somnath Roy" 
>>> To: "aderumier" , "Irek Fasikhov" 
>>> Cc: "ceph-devel" , "pushpesh sharma" 
>>> , "ceph-users" 
>>> Sent: Wednesday 10 June 2015 09:06:32
>>> Subject: RE: rbd_cache, limiting read on high iops around 40k
>>>
>>> Hi Alexandre,
>>> Thanks for sharing the data.
>>> I need to try out the performance on qemu soon and may come back to you if 
>>> I need some qemu setting trick :-)
>>>
>>> Regards
>>> Somnath
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Alexandre DERUMIER
>>> Sent: Tuesday, June 09, 2015 10:42 PM
>>> To: Irek Fasikhov
>>> Cc: ceph-devel; pushpesh sharma; ceph-users
>>> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>>>
>Very good work!
>Do you have a rpm-file?
>Thanks.
>>> no sorry, I'm have compiled it manually (and I'm using debian jessie as 
>>> client)
>>>
>>>
>>>
>>> - Original Message -
>>> From: "Irek Fasikhov" 
>>> To: "aderumier" 
>>> Cc: "Robert LeBlanc" , "ceph-devel" 
>>> , "pushpesh sharma" , 
>>> "ceph-users" 
>>> Sent: Wednesday 10 June 2015 07:21:42
>>> O

Re: [ceph-users] low power single disk nodes

2015-06-30 Thread Xu (Simon) Chen
hth,

Any idea what caused the pause? I am curious to know more details.

Thanks.
-Simon

On Friday, April 10, 2015, 10 minus  wrote:

> Hi ,
>
> Question is what do you want to use it for . As an OSD it wont cut it.
> Maybe as an iscsi target and YMMV
>
> I played around with an OEM product from Taiwan ... I dont remember the
> name.
>
> They had an  Armada XP arm soc and a SATA port + 2 ethernet. Pretty
> niffty.
> Downside were:
> -  1GB RAM .
> - OS was using a custom ubuntu
> - Ceph had to be compiled and deployed.
> -  if you put steady  i/o  and/or Cpu load for 15 - 20 min . the hardware
> will just pause.
> Causing ceph to crash
>
> hth
>
>
> On Fri, Apr 10, 2015 at 10:04 AM, Philip Williams  > wrote:
>
>> James Hughes talked about these at MSST last year, and some of his
>> colleagues demonstrated the hardware: <
>> http://storageconference.us/2014/Presentations/Hughes.pdf>
>>
>> For tinkering purposes there is java based simulator: <
>> https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki
>> >
>>
>> The drives do use a key/value interface.
>>
>> --phil
>>
>> > On 9 Apr 2015, at 17:01, Mark Nelson > > wrote:
>> >
>> > Notice that this is under their emerging technologies section.  I don't
>> think you can buy them yet.  Hopefully we'll know more as time goes on. :)
>> >
>> > Mark
>> >
>> >
>> > On 04/09/2015 10:52 AM, Stillwell, Bryan wrote:
>> >> These are really interesting to me, but how can you buy them?  What's
>> the
>> >> performance like in ceph?  Are they using the keyvaluestore backend, or
>> >> something specific to these drives?  Also what kind of chassis do they
>> go
>> >> into (some kind of ethernet JBOD)?
>> >>
>> >> Bryan
>> >>
>> >> On 4/9/15, 9:43 AM, "Mark Nelson" > > wrote:
>> >>
>> >>> How about drives that run Linux with an ARM processor, RAM, and an
>> >>> ethernet port right on the drive?  Notice the Ceph logo. :)
>> >>>
>> >>>
>> https://www.hgst.com/science-of-storage/emerging-technologies/open-etherne
>> >>> t-drive-architecture
>> >>>
>> >>> Mark
>> >>>
>> >>> On 04/09/2015 10:37 AM, Scott Laird wrote:
>>  Minnowboard Max?  2 atom cores, 1 SATA port, and a real (non-USB)
>>  Ethernet port.
>> 
>> 
>>  On Thu, Apr 9, 2015, 8:03 AM p...@philw.com
>>  > >
>>  
>> >
>> wrote:
>> 
>>  Rather expensive option:
>> 
>>  Applied Micro X-Gene, overkill for a single disk, and only really
>>  available in a
>>  development kit format right now.
>> 
>> 
>>  <
>> https://www.apm.com/products/__data-center/x-gene-family/x-__c1-developm
>>  ent-kits/
>> 
>>  <
>> https://www.apm.com/products/data-center/x-gene-family/x-c1-development-
>>  kits/>>
>> 
>>  Better Option:
>> 
>>  Ambedded CY7 - 7 nodes in 1U half Depth, 6 positions for SATA
>> disks,
>>  and one
>>  node with mSATA SSD
>> 
>>  >  >
>> 
>>  --phil
>> 
>>   > On 09 April 2015 at 15:57 Quentin Hartman
>>  >  > qhart...@direwolfdigital.com
>> >>
>>   > wrote:
>>   >
>>   >  I'm skeptical about how well this would work, but a Banana Pi
>>  might be a
>>   > place to start. Like a raspberry pi, but it has a SATA
>> connector:
>>   > http://www.bananapi.org/
>>   >
>>   >  On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg
>>  >  > jer...@update.uu.se 
>> >
>>   > >  > jer...@update.uu.se >>
>> >
>>  wrote:
>>   >> >Hello ceph users,
>>   > >
>>   > >Is anyone running any low powered single disk nodes with
>>  Ceph now?
>>   > > Calxeda seems to be no more according to Wikipedia. I do not
>>  think HP
>>   > > moonshot is what I am looking for - I want stand-alone
>> nodes,
>>  not server
>>   > > cartridges integrated into server chassis. And I do not
>> want to
>>  be locked to
>>   > > a single vendor.
>>   > >
>>   > >I was playing with Raspberry Pi 2 for signage when I
>> thought
>>  of my old
>>   > > experiments with Ceph.
>>   > >
>>   > >I am thinking of for example Odroid-C1 or Odroid-XU3
>> Lite or
>>  maybe
>>   > > something with a low-power Intel x64/x86 processor. Together
>>  with one SSD or
>>   > > one low power HDD the node could get all power via PoE (via
>>  splitter or
>>   > > integrated into board if such boards exist). PoE provide
>> remote
>>  power-on
>>   > > power-off even for consumer grade nodes.
>>   > >
>>   > >The cost for a single low power node should be able to
>>  comp

[ceph-users] runtime Error for creating ceph MON via ceph-deploy

2015-06-30 Thread Vida Ahmadi
Hi all,
I am a new user who wants to deploy a simple ceph cluster.
I start to create ceph monitor node via ceph-deploy and got error:

[ceph_deploy][ERROR ] RuntimeError: remote connection got closed,
ensure ``requiretty`` is disabled for node1

I commented requiretty and I have a password-less access to the node1.

Are there any other issues that could cause this error?

Any kind of help will be appreciated.

Note: I am using centOS 7 and ceph version 1.5.25.
-- 
Best regards,
Vida
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance issue.

2015-06-30 Thread Marcus Forness
hi! Is anyone able to provide some tips on a performance issue on a newly
installed all-flash ceph cluster? When we do write tests we get 900MB/s
write, but read tests are only 200MB/s. All servers are on 10Gbit
connections.

[global]
fsid = 453d2db9-c764-4921-8f3c-ee0f75412e19
mon_initial_members = ceph02, ceph03, ceph04
mon_host = 10.129.23.202,10.129.23.203,10.129.23.204
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.129.0.0/16


this is the ceph conf file
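
One way to rule the client stack in or out is a plain rados bench run against
a scratch pool (the pool name below is just an example):

rados bench -p testpool 60 write --no-cleanup
rados bench -p testpool 60 seq
rados bench -p testpool 60 rand
rados -p testpool cleanup

If the seq/rand read numbers are also far below the write numbers, the
bottleneck is on the cluster side rather than in the client.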
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Explanation for "ceph osd set nodown" and "ceph osd cluster_snap"

2015-06-30 Thread Jan Schermer
Thanks.

Nobody else knows anything about “cluster_snap”?

It is mentioned in the docs, but that’s all…

Jan
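
For reference, the nodown flag discussed below is set and cleared
cluster-wide like any other osdmap flag, e.g.:

ceph osd set nodown
ceph osd dump | grep flags
ceph osd unset nodown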

> On 19 Jun 2015, at 12:49, Carsten Schmitt  
> wrote:
> 
> Hi Jan,
> 
> On 06/18/2015 12:48 AM, Jan Schermer wrote:
>> 1) Flags available in ceph osd set are
>> 
>> pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent
>> 
>> I know or can guess most of them (the docs are a “bit” lacking)
>> 
>> But with "ceph osd set nodown” I have no idea what it should be used for
>> - to keep hammering a faulty OSD?
> 
> I only know the documentation for this one:
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> You can set an OSD to "nodown" if you know for certain that it is not faulty 
> but it gets set to this state by the monitor because of problems with the 
> cluster network.
> 
> Cheers,
> Carsten
> 
>> 
>> 2) looking through the docs there I found reference to "ceph osd
>> cluster_snap”
>> http://ceph.com/docs/v0.67.9/rados/operations/control/
>> 
>> what does it do? how does that work? does it really work? ;-) I got a
>> few hits on google which suggest it might not be something that really
>> works, but looks like something we could certainly use
>> 
>> Thanks
>> 
>> Jan
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW access problem

2015-06-30 Thread I Kozin
Thank you Alex. This is useful.
The Ceph documentation is a bit vague about the subject
http://ceph.com/docs/master/radosgw/admin/#create-a-user
whereas your link clearly states what is escaped by backslash.
To answer your question, I was not using a parser. I was just copy/pasting.
It had not immediately occurred to me that the plain text output has to be
parsed before being used elsewhere.

On 25 June 2015 at 15:59, Alex Muntada  wrote:

> INKozin:
>
> > Where can I find the rules for escape chars in keys?
>
> http://json.org/ clearly states that / must be quoted. What kind
> of parser are you using?
>
> Cheers,
> Alex
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xattrs vs. omap with radosgw

2015-06-30 Thread Zhou, Yuan
FWIW, there was some discussion in OpenStack Swift and their performance tests 
showed that 255 is not the best in recent XFS. They decided to use a large xattr 
boundary size (65535).

https://gist.github.com/smerritt/5e7e650abaa20599ff34


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, June 17, 2015 3:43 AM
To: GuangYang
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: xattrs vs. omap with radosgw

On Tue, 16 Jun 2015, GuangYang wrote:
> Hi Cephers,
> While looking at disk utilization on the OSDs, I noticed the disks were constantly 
> busy with a large number of small writes. Further investigation showed that, as 
> radosgw uses xattrs to store metadata (e.g. etag, content-type, etc.), the 
> xattrs spill from inline (local) storage out to extents, which incurred extra I/O.
> 
> I would like to check if anybody has experience with offloading the metadata 
> to omap:
>   1> Offload everything to omap? If this is the case, should we make the 
> inode size 512 (instead of 2k)?
>   2> Partially offload the metadata to omap, e.g. only offloading the 
> rgw-specific metadata to omap.
> 
> Any sharing is deeply appreciated. Thanks!

Hi Guang,

Is this hammer or firefly?

With hammer the size of object_info_t crossed the 255 byte boundary, which is 
the max xattr value that XFS can inline.  We've since merged something that 
stripes over several small xattrs so that we can keep things inline, but it 
hasn't been backported to hammer yet.  See 
c6cdb4081e366f471b372102905a1192910ab2da.  Perhaps this is what you're seeing?

I think we're still better off with larger XFS inodes and inline xattrs if it 
means we avoid leveldb at all for most objects.

sage
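
For reference, the inode size has to be chosen at mkfs time; a rough sketch
(the device path is a placeholder) would be:

mkfs.xfs -f -i size=2048 /dev/sdX1

or, so that newly prepared OSDs pick it up automatically, in ceph.conf:

[osd]
osd mkfs options xfs = -f -i size=2048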
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd out triggered the pg recovery process, but by the end, why not all pgs are active+clean?

2015-06-30 Thread Cory
Hi ceph experts,
I did some tests on my ceph cluster recently with the following steps:
1. at the beginning, all pgs are active+clean;
2. stop an osd. I observed a lot of pgs become degraded.
3. ceph osd out.
4. then I observed ceph doing the recovery process.


My question is: I expected that by the end all pgs would go back to active+clean,
since the osd is out of the cluster, but why are there still some pgs in
degraded status, and why does the recovery process seem to have stopped forever?


here is the ceph -s output status.
ceph -s
cluster fdecc391-0c75-417d-a980-57ef52bdc1cd
 health HEALTH_ERR 98 pgs degraded; 16 pgs inconsistent; 104 pgs stuck unclean; recovery 58170/14559528 objects degraded (0.400%); 18 scrub errors; clock skew detected on mon.ceph10
 monmap e7: 3 mons at {ceph01=10.195.158.199:6789/0,ceph06=10.195.158.204:6789/0,ceph10=10.195.158.208:6789/0}, election epoch 236, quorum 0,1,2 ceph01,ceph06,ceph10
 osdmap e14456: 81 osds: 80 up, 80 in
  pgmap v1589483: 8320 pgs, 3 pools, 32375 GB data, 4739 kobjects
83877 GB used, 122 TB / 214 TB avail
58170/14559528 objects degraded (0.400%)
8200 active+clean
  16 active+clean+inconsistent
  98 active+degraded
   6 active+remapped
  client io 63614 kB/s rd, 41458 kB/s wr, 2375 op/s
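
A few commands that may help narrow down which pgs are stuck and why
(the pg ids below are placeholders):

ceph health detail
ceph pg dump_stuck unclean
ceph pg <pgid> query

and for the inconsistent pgs reported by scrub:

ceph pg repair <pgid>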

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Tuomas Juntunen
I created a file which has the following parameters


[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64


Br,T
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 30. kesäkuuta 2015 20:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Hi Tuomos,

Can you paste the command you ran to do the test?

Thanks,
Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:
> Hi
>
> It’s not probably hitting the disks, but that really doesn’t matter. 
> The point is we have very responsive VM’s while writing and that is 
> what the users will see.
>
> The iops we get with sequential read is good, but the random read is 
> way too low.
>
> Is using SSD’s as OSD’s the only way to get it up? or is there some 
> tunable which would enhance it? I would assume Linux caches reads in 
> memory and serves them from there, but atleast now we don’t see it.
>
> Br,
>
> Tuomas
>
> *From:*Somnath Roy [mailto:somnath@sandisk.com]
> *Sent:* 30. kesäkuuta 2015 19:24
> *To:* Tuomas Juntunen; 'ceph-users'
> *Subject:* RE: [ceph-users] Very low 4k randread performance ~1000iops
>
> Break it down, try fio-rbd to see what is the performance you getting..
>
> But, I am really surprised you are getting > 100k iops for write, did 
> you check it is hitting the disks ?
>
> Thanks & Regards
>
> Somnath
>
> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
> Behalf Of *Tuomas Juntunen
> *Sent:* Tuesday, June 30, 2015 8:33 AM
> *To:* 'ceph-users'
> *Subject:* [ceph-users] Very low 4k randread performance ~1000iops
>
> Hi
>
> I have been trying to figure out why our 4k random reads in VM’s are 
> so bad. I am using fio to test this.
>
> Write : 170k iops
>
> Random write : 109k iops
>
> Read : 64k iops
>
> Random read : 1k iops
>
> Our setup is:
>
> 3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 
> 64gb mem & 2x6core cpu’s
>
> 4 monitors running on other servers
>
> 40gbit infiniband with IPoIB
>
> Openstack : Qemu-kvm for virtuals
>
> Any help would be appreciated
>
> Thank you in advance.
>
> Br,
>
> Tuomas
>
> --
> --
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Tuomas Juntunen
I have already set readahead to OSD’s before, It is now 2048, this didn’t
affect the random reads, but gave a lot more sequential performance.

 

Br, T

 

 

From: Somnath Roy [mailto:somnath@sandisk.com] 
Sent: 30. kesäkuuta 2015 21:00
To: Tuomas Juntunen; 'Stephen Mercier'
Cc: 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

 

Read_ahead_kb should help you in case of seq workload, but, if you are
saying it is helping your workload in random case also, try to do it both in
VM as well as in OSD side as well and see if it is making any difference.

 

Thanks & Regards

Somnath

 

From: Tuomas Juntunen [mailto:tuomas.juntu...@databasement.fi] 
Sent: Tuesday, June 30, 2015 10:49 AM
To: 'Stephen Mercier'
Cc: Somnath Roy; 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

 

Hi

 

This is something I was thinking too. But it doesn’t take away the problem.

 

Can you share your setup and how many VM’s you are running, that would give
us some starting point on sizing our setup.

 

Thanks

 

Br,

Tuomas

 

From: Stephen Mercier [mailto:stephen.merc...@attainia.com] 
Sent: 30. kesäkuuta 2015 20:32
To: Tuomas Juntunen
Cc: 'Somnath Roy'; 'ceph-users'
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

 

I ran into the same problem. What we did, and have been using since, is
increased the read ahead buffer in the VMs to 16MB (The sweet spot we
settled on after testing). This isn't a solution for all scenarios, but for
our uses, it was enough to get performance inline with expectations.

 

In Ubuntu, we added the following udev config to facilitate this:

 

root@ubuntu:/lib/udev/rules.d#   vi
/etc/udev/rules.d/99-virtio.rules 

 

SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change",
KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384",
ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"

 

 

Cheers,

-- 

Stephen Mercier

Senior Systems Architect

Attainia, Inc.

Phone: 866-288-2464 ext. 727

Email: stephen.merc...@attainia.com

Web: www.attainia.com

 

Capital equipment lifecycle planning & budgeting solutions for healthcare

 

 

 

On Jun 30, 2015, at 10:18 AM, Tuomas Juntunen wrote:

 

Hi

 

It’s not probably hitting the disks, but that really doesn’t matter. The
point is we have very responsive VM’s while writing and that is what the
users will see.

The iops we get with sequential read is good, but the random read is way too
low.

 

Is using SSD’s as OSD’s the only way to get it up? or is there some tunable
which would enhance it? I would assume Linux caches reads in memory and
serves them from there, but atleast now we don’t see it.

 

Br,

Tuomas

 

 

From: Somnath Roy [mailto:somnath@sandisk.com] 
Sent: 30. kesäkuuta 2015 19:24
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

 

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting > 100k iops for write, did you
check it is hitting the disks ?

 

Thanks & Regards

Somnath

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Tuomas Juntunen
Sent: Tuesday, June 30, 2015 8:33 AM
To: 'ceph-users'
Subject: [ceph-users] Very low 4k randread performance ~1000iops

 

Hi

 

I have been trying to figure out why our 4k random reads in VM’s are so bad.
I am using fio to test this.

 

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

 

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb mem
& 2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

 

Any help would be appreciated

 

Thank you in advance.

 

Br,

Tuomas

 

  _  



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Mark Nelson
Seems reasonable.  What's the latency distribution look like in your fio 
output file?  Would be useful to know if it's universally slow or if 
some ops are taking much longer to complete than others.


Mark

On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:

I created a file which has the following parameters


[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64


Br,T
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 30. kesäkuuta 2015 20:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Hi Tuomos,

Can you paste the command you ran to do the test?

Thanks,
Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:

Hi

It’s not probably hitting the disks, but that really doesn’t matter.
The point is we have very responsive VM’s while writing and that is
what the users will see.

The iops we get with sequential read is good, but the random read is
way too low.

Is using SSD’s as OSD’s the only way to get it up? or is there some
tunable which would enhance it? I would assume Linux caches reads in
memory and serves them from there, but atleast now we don’t see it.

Br,

Tuomas

*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* 30. kesäkuuta 2015 19:24
*To:* Tuomas Juntunen; 'ceph-users'
*Subject:* RE: [ceph-users] Very low 4k randread performance ~1000iops

Break it down, try fio-rbd to see what is the performance you getting..

But, I am really surprised you are getting > 100k iops for write, did
you check it is hitting the disks ?

Thanks & Regards

Somnath

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Tuomas Juntunen
*Sent:* Tuesday, June 30, 2015 8:33 AM
*To:* 'ceph-users'
*Subject:* [ceph-users] Very low 4k randread performance ~1000iops

Hi

I have been trying to figure out why our 4k random reads in VM’s are
so bad. I am using fio to test this.

Write : 170k iops

Random write : 109k iops

Read : 64k iops

Random read : 1k iops

Our setup is:

3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has
64gb mem & 2x6core cpu’s

4 monitors running on other servers

40gbit infiniband with IPoIB

Openstack : Qemu-kvm for virtuals

Any help would be appreciated

Thank you in advance.

Br,

Tuomas

--
--





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] runtime Error for creating ceph MON via ceph-deploy

2015-06-30 Thread Alan Johnson
I use sudo visudo and then add in a line under
Defaults requiretty
-->
Defaults:<username> !requiretty

Where <username> is the username.

Hope this helps?

Alan
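
For reference, the line added under the existing Defaults requiretty entry
would look something like this ("cephdeploy" stands in for whatever user
ceph-deploy connects as):

Defaults:cephdeploy !requiretty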

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vida 
Ahmadi
Sent: Monday, June 22, 2015 6:31 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] runtime Error for creating ceph MON via ceph-deploy

Hi all,
I am a new user who want to deploy simple ceph cluster.
I start to create ceph monitor node via ceph-deploy and got error:
[ceph_deploy][ERROR ] RuntimeError: remote connection got closed, ensure 
``requiretty`` is disabled for node1
I commented requiretty and I have a password-less access to the node1.
Is there any other issues for this error?
Any kind of help will be appreciated.
Note: I am using centOS 7 and ceph version 1.5.25.
--
Best regards,
Vida
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Stephen Mercier
I currently have about 250 VMs, ranging from 16GB to 2TB in size. What I found, 
after about a week of testing, sniffing, and observing, is that the larger read 
ahead buffer causes the VM to chunk reads over to ceph, and in doing so, allows 
it to better align with the 4MB block size that Ceph uses. If I dropped the 
cache below 16MB, performance would degrade, almost linearly, all the way down 
to the 16kb standard size. And when I increased it above 16MB, there were some 
intermittent gains, but overall nothing to write home about.

For reference, our ceph cluster is 88TB, spread across 88 1TB SSDs. Each 
storage node has 100GbE connectivity, and each cloud host (proxmox) has 40GbE. 
I'm able to sustain 3400 iops regularly, and seen spikes as high as 5200+ iops 
in our calamari logs. In addition, due to some clever use of LACP and mlag, I'm 
able to sustain 3000+ iops per cloud host simultaneously. Our workload at this 
time for the VMs are MSSQL servers, MySQL servers, and BI servers (Pentaho). We 
also have our ELK stack and collectd/Graphite/Grafana stack in this specific 
cloud. 

In the end, the root cause of the issue, based on my testing and 
investigations, centers around the mismatch of the block sizes between the VMs 
(4kb buffered to 16kb default) and Ceph (4MB blocks).

-- 
Stephen Mercier
Senior Systems Architect
Attainia, Inc.
Phone: 866-288-2464 ext. 727
Email: stephen.merc...@attainia.com
Web: www.attainia.com

Capital equipment lifecycle planning & budgeting solutions for healthcare






On Jun 30, 2015, at 10:49 AM, Tuomas Juntunen wrote:

> Hi
>  
> This is something I was thinking too. But it doesn’t take away the problem.
>  
> Can you share your setup and how many VM’s you are running, that would give 
> us some starting point on sizing our setup.
>  
> Thanks
>  
> Br,
> Tuomas
>  
> From: Stephen Mercier [mailto:stephen.merc...@attainia.com] 
> Sent: 30. kesäkuuta 2015 20:32
> To: Tuomas Juntunen
> Cc: 'Somnath Roy'; 'ceph-users'
> Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops
>  
> I ran into the same problem. What we did, and have been using since, is 
> increased the read ahead buffer in the VMs to 16MB (The sweet spot we settled 
> on after testing). This isn't a solution for all scenarios, but for our uses, 
> it was enough to get performance inline with expectations.
>  
> In Ubuntu, we added the following udev config to facilitate this:
>  
> root@ubuntu:/lib/udev/rules.d# vi /etc/udev/rules.d/99-virtio.rules 
>  
> SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", 
> KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", 
> ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"
>  
>  
> Cheers,
> -- 
> Stephen Mercier
> Senior Systems Architect
> Attainia, Inc.
> Phone: 866-288-2464 ext. 727
> Email: stephen.merc...@attainia.com
> Web: www.attainia.com
>  
> Capital equipment lifecycle planning & budgeting solutions for healthcare
>  
>  
>  
> On Jun 30, 2015, at 10:18 AM, Tuomas Juntunen wrote:
> 
> 
> Hi
>  
> It’s not probably hitting the disks, but that really doesn’t matter. The 
> point is we have very responsive VM’s while writing and that is what the 
> users will see.
> The iops we get with sequential read is good, but the random read is way too 
> low.
>  
> Is using SSD’s as OSD’s the only way to get it up? or is there some tunable 
> which would enhance it? I would assume Linux caches reads in memory and 
> serves them from there, but atleast now we don’t see it.
>  
> Br,
> Tuomas
>  
>  
> From: Somnath Roy [mailto:somnath@sandisk.com] 
> Sent: 30. kesäkuuta 2015 19:24
> To: Tuomas Juntunen; 'ceph-users'
> Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops
>  
> Break it down, try fio-rbd to see what is the performance you getting..
> But, I am really surprised you are getting > 100k iops for write, did you 
> check it is hitting the disks ?
>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Tuomas Juntunen
> Sent: Tuesday, June 30, 2015 8:33 AM
> To: 'ceph-users'
> Subject: [ceph-users] Very low 4k randread performance ~1000iops
>  
> Hi
>  
> I have been trying to figure out why our 4k random reads in VM’s are so bad. 
> I am using fio to test this.
>  
> Write : 170k iops
> Random write : 109k iops
> Read : 64k iops
> Random read : 1k iops
>  
> Our setup is:
> 3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb mem 
> & 2x6core cpu’s
> 4 monitors running on other servers
> 40gbit infiniband with IPoIB
> Openstack : Qemu-kvm for virtuals
>  
> Any help would be appreciated
>  
> Thank you in advance.
>  
> Br,
> Tuomas
>  
> 

[ceph-users] Simple CephFS benchmark

2015-06-30 Thread Hadi Montakhabi
I have set up a ceph storage cluster and I'd like to utilize the cephfs (I
am assuming this is the only way one could use some other code without
using the API).
To do so, I have mounted my cephfs on the client node.
I'd like to know what would be a good benchmark for measuring write and
read performance on the cephfs compared to the local filesystem?

Thanks,
Hadi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Where is what type of IO generated?

2015-06-30 Thread Steffen Tilsch
Hello Cephers,

I got some questions regarding where what type of IO is generated.



As far as I understand it looks like this (please see picture:
http://imageshack.com/a/img673/4563/zctaGA.jpg ) :

1. Clients -> OSD (Journal):

- Is it sequential write?

- Is it parallel due to the many open sockets?

2. OSD journal flush -> Filestore:

- Is it periodic sequential write?

- Is it done in parallel to the Filestore or just with one writer?

3. OSD -> Clients:

- Is it parallel random read (due to distribution of objects to PGs over
many OSD's)?

- Is any reading (from client side) done from journals?



How is the writing to the journal done? Is it like a ring buffer - every new
block is added to the end of the journal so that there is no "overwriting"
in the middle of the journal (which would force the head of the HDD (if not
an SSD) to seek)?



It would be nice if you would correct me if anything is wrong here.



Thanks in advance,

Steffen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Simple CephFS benchmark

2015-06-30 Thread Mark Nelson
Two popular benchmarks in the HPC space for testing distributed file 
systems are IOR and mdtest.  Both use MPI to coordinate processes on 
different clients.  Another option may be to use fio or iozone.  Netmist 
may also be an option, but I haven't used it myself and I'm not sure 
that it's fully open source.


If you are only interested in single-client tests vs a local disk, I'd 
personally use fio.


Mark
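
As a rough sketch, a single-client fio run against the CephFS mount and then
against a local directory could look like this (paths and sizes are just
examples):

fio --name=seq-write --directory=/mnt/cephfs/fiotest --rw=write \
    --bs=4M --size=4G --numjobs=4 --group_reporting

fio --name=rand-read --directory=/mnt/cephfs/fiotest --rw=randread \
    --bs=4k --size=1G --numjobs=4 --runtime=60 --time_based --group_reporting

Repeating the same jobs with --directory pointed at a local filesystem gives
the comparison numbers.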

On 06/30/2015 02:50 PM, Hadi Montakhabi wrote:

I have set up a ceph storage cluster and I'd like to utilize the cephfs
(I am assuming this is the only way one could use some other code
without using the API).
To do so, I have mounted my cephfs on the client node.
I'd like to know what would be a good benchmark for measuring write and
read performance on the cephfs compared to the local filesystem?

Thanks,
Hadi


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-06-30 Thread Ray Sun
Jan,
Thanks a lot. I can do my contribution to this project if I can.

Best Regards
-- Ray
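
For anyone who wants the manual taskset/numactl approach mentioned further
down, a rough sketch (core lists, NUMA node numbers and the pid file path are
placeholders):

# pin an already-running OSD to cores 0-5
taskset -pc 0-5 $(cat /var/run/ceph/osd.0.pid)

# or start an OSD bound to NUMA node 0 for both CPU and memory
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 \
    --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph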

On Tue, Jun 30, 2015 at 11:50 PM, Jan Schermer  wrote:

> Hi all,
> our script is available on GitHub
>
> https://github.com/prozeta/pincpus
>
> I haven’t had much time to do a proper README, but I hope the
> configuration is self explanatory enough for now.
> What it does is pin each OSD into the most “empty” cgroup assigned to a
> NUMA node.
>
> Let me know how it works for you!
>
> Jan
>
>
> On 30 Jun 2015, at 10:50, Huang Zhiteng  wrote:
>
>
>
> On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer  wrote:
>
>> Not having OSDs and KVMs compete against each other is one thing.
>> But there are more reasons to do this
>>
>> 1) not moving the processes and threads between cores that much (better
>> cache utilization)
>> 2) aligning the processes with memory on NUMA systems (that means all
>> modern dual socket systems) - you don’t want your OSD running on CPU1 with
>> memory allocated to CPU2
>> 3) the same goes for other resources like NICs or storage controllers -
>> but that’s less important and not always practical to do
>> 4) you can limit the scheduling domain on linux if you limit the cpuset
>> for your OSDs (I’m not sure how important this is, just best practice)
>> 5) you can easily limit memory or CPU usage, set priority, with much
>> greater granularity than without cgroups
>> 6) if you have HyperThreading enabled you get the most gain when the
>> workloads on the threads are dissimiliar - so to have the higher throughput
>> you have to pin OSD to thread1 and KVM to thread2 on the same core. We’re
>> not doing that because latency and performance of the core can vary
>> depending on what the other thread is doing. But it might be useful to
>> someone.
>>
>> Some workloads exhibit >100% performance gain when everything aligns in a
>> NUMA system, compared to a SMP mode on the same hardware. You likely won’t
>> notice it on light workloads, as the interconnects (QPI) are very fast and
>> there’s a lot of bandwidth, but for stuff like big OLAP databases or other
>> data-manipulation workloads there’s a huge difference. And with CEPH being
>> CPU hungy and memory intensive, we’re seeing some big gains here just by
>> co-locating the memory with the processes….
>>
> Could you elaborate a it on this?  I'm interested to learn in what
> situation memory locality helps Ceph to what extend.
>
>>
>>
>> Jan
>>
>>
>>
>> On 30 Jun 2015, at 08:12, Ray Sun  wrote:
>>
>> ​Sound great, any update please let me know.​
>>
>> Best Regards
>> -- Ray
>>
>> On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer  wrote:
>>
>>> I promised you all our scripts for automatic cgroup assignment - they
>>> are in our production already and I just need to put them on github, stay
>>> tuned tomorrow :-)
>>>
>>> Jan
>>>
>>>
>>> On 29 Jun 2015, at 19:41, Somnath Roy  wrote:
>>>
>>> Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
>>> ] *On Behalf Of *Ray Sun
>>> *Sent:* Monday, June 29, 2015 9:19 AM
>>> *To:* ceph-users@lists.ceph.com
>>> *Subject:* [ceph-users] How to use cgroup to bind ceph-osd to a
>>> specific cpu core?
>>>
>>> Cephers,
>>> I want to bind each of my ceph-osd to a specific cpu core, but I didn't
>>> find any document to explain that, could any one can provide me some
>>> detailed information. Thanks.
>>>
>>> Currently, my ceph is running like this:
>>>
>>> oot  28692  1  0 Jun23 ?00:37:26 /usr/bin/ceph-mon -i
>>> seed.econe.com --pid-file /var/run/ceph/mon.seed.econe.com.pid -c
>>> /etc/ceph/ceph.conf --cluster ceph
>>> root  40063  1  1 Jun23 ?02:13:31 /usr/bin/ceph-osd -i 0
>>> --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  42096  1  0 Jun23 ?01:33:42 /usr/bin/ceph-osd -i 1
>>> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  43263  1  0 Jun23 ?01:22:59 /usr/bin/ceph-osd -i 2
>>> --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  44527  1  0 Jun23 ?01:16:53 /usr/bin/ceph-osd -i 3
>>> --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  45863  1  0 Jun23 ?01:25:18 /usr/bin/ceph-osd -i 4
>>> --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> root  47462  1  0 Jun23 ?01:20:36 /usr/bin/ceph-osd -i 5
>>> --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>
>>> Best Regards
>>> -- Ray
>>>
>>> --
>>>

Re: [ceph-users] CephFS posix test performance

2015-06-30 Thread Yan, Zheng

> On Jul 1, 2015, at 00:34, Dan van der Ster  wrote:
> 
> On Tue, Jun 30, 2015 at 11:37 AM, Yan, Zheng  wrote:
>> 
>>> On Jun 30, 2015, at 15:37, Ilya Dryomov  wrote:
>>> 
>>> On Tue, Jun 30, 2015 at 6:57 AM, Yan, Zheng  wrote:
 I tried 4.1 kernel and 0.94.2 ceph-fuse. their performance are about the 
 same.
 
 fuse:
 Files=191, Tests=1964, 60 wallclock secs ( 0.43 usr  0.08 sys +  1.16 cusr 
  0.65 csys =  2.32 CPU)
 
 kernel:
 Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.08 sys +  1.21 cusr 
  0.72 csys =  2.46 CPU)
>>> 
>>> On Friday, I tried stock 3.10 vs 4.1 and they were about the same as
>>> well (a few tests failed in 3.10 though).  However Dan is using
>>> 3.10.0-229.7.2.el7.x86_64, which is 3.10 with a lot of backports, so
>>> it's not quite the same.  Dan, are the numbers you are seeing
>>> consistent?
>>> 
>> 
>> I just tried 3.10.0-229.7.2.el7 kernel. it’s a little slower than 4.1 kernel
>> 
>> 4.1:
>> Files=191, Tests=2286, 61 wallclock secs ( 0.45 usr  0.07 sys +  1.24 cusr  
>> 0.76 csys =  2.52 CPU)
>> 
>> 3.10.0-229.7.2.el7:
>> Files=191, Tests=1964, 75 wallclock secs ( 0.45 usr  0.09 sys +  1.73 cusr  
>> 5.04 csys =  7.31 CPU)
>> 
>> Dan, did you run the test on the same client machine. I think network 
>> latency affects run time of this test a lots
>> 
> 
> All the tests run on the same client, but it seems there is some
> variability in the tests. Now I get:
> 
> Linux 3.10.0-229.7.2.el7.x86_64
> Files=184, Tests=1957, 91 wallclock secs ( 0.72 usr  0.19 sys +  5.68
> cusr 10.09 csys = 16.68 CPU)
> 
> Linux 4.1.0-1.el7.elrepo.x86_64
> Files=184, Tests=1957, 84 wallclock secs ( 0.75 usr  0.44 sys +  5.17
> cusr  9.77 csys = 16.13 CPU)
> 
> ceph-fuse 0.94.2:
> Files=184, Tests=1957, 78 wallclock secs ( 0.69 usr  0.17 sys +  5.08
> cusr  9.93 csys = 15.87 CPU)
> 
> 
> I don't know if it's related -- and maybe I misunderstood something
> fundamental -- but we don't manage to get FUSE or the kernel client to
> use the page cache:
> 
> I have fuse_use_invalidate_cb = true then used fincore to see what's cached:
> 
> # df -h .
> Filesystem  Size  Used Avail Use% Mounted on
> ceph-fuse   444T  135T  309T  31% /cephfs
> # cat zero > /dev/null
> # linux-fincore zero
> filename    size         total_pages  min_cached page  cached_pages  cached_size  cached_perc
> --------    -----------  -----------  ---------------  ------------  -----------  -----------
> zero        104,857,600  25,600       -1               0             0            0.00
> ---
> total cached size: 0
> 
> The kernel client has the same behaviour. Is this expected?

yes. PJD only tests metadata operations. page cache is not involved in these 
operations.

Regards
Yan, Zheng


> 
> Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Where is what type of IO generated?

2015-06-30 Thread Haomai Wang
On Wed, Jul 1, 2015 at 4:50 AM, Steffen Tilsch  wrote:
> Hello Cephers,
>
> I got some questions regarding where what type of IO is generated.
>
>
>
> As far as I understand it looks like this (please see picture:
> http://imageshack.com/a/img673/4563/zctaGA.jpg ) :
>
> 1. Clients -> OSD (Journal):
>
> - Is it sequential write?

yes

>
> - Is it parallel due to the many open sockets?

Client IO is pg-parallel in the pg layer, and is batched to the journal within
one thread. Finally, IO is also pg-parallel on writeback.

>
> 2. OSD journal flush -> Filestore:
>
> - Is it periodic sequential write?

it depends on the client workload; mostly it is not a perfect sequential write.

>
> - Is it done in parallel to the Filestore or just with one writer?
>
> 3. OSD -> Clients:
>
> - Is it parallel random read (due to distribution of objects to PGs over
> many OSD's)?

mostly yes

>
> - Is any reading (from client side) done from journals?

no

>
>
>
> How is the writing to the journal done? Is it like a ringbuffer - every new
> block is added to the end of the journal so that there is no "overwriting"
> in the middle of the journal (that would force the head of the HDD (if not
> SSD) to seek?

yes, a loopback mode

>
>
>
> It would be nice if you would correct me if anything is wrong here.
>
>
>
> Thanks in advance,
>
> Steffen
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph's RBD flattening and image options

2015-06-30 Thread Haomai Wang
On Tue, Jun 30, 2015 at 9:07 PM, Michał Chybowski wrote:
> Hi,
>
> Lately I've been working on XEN RBD SM and I'm using RBD's built-in snapshot
> functionality.
>
> My system looks like this:
> base image -> snapshot -> snaphot is used to create XEN VM's volumes ->
> volume snapshots (via rbd snap..) -> another VMs -> etc.
>
> I'd like to be able to delete one of the volumes "in the chain" but without
> cloning base image blocks that hadn't changed yet (I'm trying to save up
> space) as it is done "by default" while flattening snapshots.
> Is this possible and if it is, then how can I achieve this?

If I understand correctly, you should implement this logic at an upper layer,
such as CloudStack. RBD itself won't provide this.
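
Just to make the chain concrete, a rough sketch with plain rbd commands (pool and
image names are made up). Note that flatten always copies the parent blocks into the
clone, which is exactly the space cost you are trying to avoid:

rbd snap create rbd/base@gold
rbd snap protect rbd/base@gold
rbd clone rbd/base@gold rbd/vm1    # XEN VM volume backed by the base snapshot
rbd snap create rbd/vm1@s1
rbd snap protect rbd/vm1@s1
rbd clone rbd/vm1@s1 rbd/vm2       # next volume in the chain
rbd children rbd/vm1@s1            # clones that still depend on vm1
rbd flatten rbd/vm2                # copies all parent data into vm2; only then can vm1 go away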

>
> Also, we wanted to tune Ceph images a bit while they're being created, but
> we're unable to set any of the striping options successfully - we're setting
> variables with correct values (even copying those from the documentation) and we
> get errors from librbd during image creation or image mapping.
>

You can't map an image with the stripe* options enabled; kernel rbd doesn't support this.
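
For example (image name and sizes are made up), creating an image with striping via
librbd works, but mapping it is expected to fail with an "unsupported features" style
error, because the kernel client does not implement the striping v2 feature:

rbd create rbd/striped-test --size 10240 --image-format 2 \
    --stripe-unit 65536 --stripe-count 4
rbd map rbd/striped-test    # fails: kernel rbd rejects the striping feature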

> I can post you our ceph.conf and CRUSH map if needed.
>
> --
> Regards
> Michał Chybowski
> Tiktalik.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very low 4k randread performance ~1000iops

2015-06-30 Thread Tuomas Juntunen
Hi

For sequential reads, here are the latencies:
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

Random reads:
lat (usec) : 10=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
lat (msec) : 100=99.31%, 250=0.08%

100 msec seems like a lot to me.
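
For reference, a minimal job file for the fio rbd ioengine that Somnath suggests below
- pool, image and client names are placeholders and the test image must already exist:

[rbd-randread]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=randread
bs=4k
iodepth=64
runtime=60
time_based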

Br,T

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: 30. kesäkuuta 2015 22:01
To: Tuomas Juntunen; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Seems reasonable.  What's the latency distribution look like in your fio
output file?  Would be useful to know if it's universally slow or if some
ops are taking much longer to complete than others.

Mark

On 06/30/2015 01:27 PM, Tuomas Juntunen wrote:
> I created a file which has the following parameters
>
>
> [random-read]
> rw=randread
> size=128m
> directory=/root/asd
> ioengine=libaio
> bs=4k
> #numjobs=8
> iodepth=64
>
>
> Br,T
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Mark Nelson
> Sent: 30. kesäkuuta 2015 20:55
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops
>
> Hi Tuomos,
>
> Can you paste the command you ran to do the test?
>
> Thanks,
> Mark
>
> On 06/30/2015 12:18 PM, Tuomas Juntunen wrote:
>> Hi
>>
>> It’s probably not hitting the disks, but that really doesn’t matter.
>> The point is we have very responsive VM’s while writing and that is 
>> what the users will see.
>>
>> The iops we get with sequential read is good, but the random read is 
>> way too low.
>>
>> Is using SSD’s as OSD’s the only way to get it up? or is there some 
>> tunable which would enhance it? I would assume Linux caches reads in 
>> memory and serves them from there, but atleast now we don’t see it.
>>
>> Br,
>>
>> Tuomas
>>
>> *From:*Somnath Roy [mailto:somnath@sandisk.com]
>> *Sent:* 30. kesäkuuta 2015 19:24
>> *To:* Tuomas Juntunen; 'ceph-users'
>> *Subject:* RE: [ceph-users] Very low 4k randread performance 
>> ~1000iops
>>
>> Break it down - try fio-rbd to see what performance you are getting.
>>
>> But, I am really surprised you are getting > 100k iops for write, did 
>> you check it is hitting the disks ?
>>
>> Thanks & Regards
>>
>> Somnath
>>
>> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
>> Behalf Of *Tuomas Juntunen
>> *Sent:* Tuesday, June 30, 2015 8:33 AM
>> *To:* 'ceph-users'
>> *Subject:* [ceph-users] Very low 4k randread performance ~1000iops
>>
>> Hi
>>
>> I have been trying to figure out why our 4k random reads in VM’s are 
>> so bad. I am using fio to test this.
>>
>> Write : 170k iops
>>
>> Random write : 109k iops
>>
>> Read : 64k iops
>>
>> Random read : 1k iops
>>
>> Our setup is:
>>
>> 3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 
>> 64gb mem & 2x6core cpu’s
>>
>> 4 monitors running on other servers
>>
>> 40gbit infiniband with IPoIB
>>
>> Openstack : Qemu-kvm for virtuals
>>
>> Any help would be appreciated
>>
>> Thank you in advance.
>>
>> Br,
>>
>> Tuomas
>>
>> -
>> -
>> --
>>
>>
>> PLEASE NOTE: The information contained in this electronic mail 
>> message is intended only for the use of the designated recipient(s) named
above.
>> If the reader of this message is not the intended recipient, you are 
>> hereby notified that you have received this message in error and that 
>> any review, dissemination, distribution, or copying of this message 
>> is strictly prohibited. If you have received this communication in 
>> error, please notify the sender by telephone or e-mail (as shown 
>> above) immediately and destroy any and all copies of this message in 
>> your possession (whether hard copies or electronically stored copies).
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Simple CephFS benchmark

2015-06-30 Thread Tuomas Juntunen
Hi

Our Ceph cluster runs on the following hardware:
3 nodes with 36 OSDs and 18 SSDs (one SSD serving two OSDs); each node has 64 GB of memory
& 2 x 6-core CPUs
4 monitors running on other servers
40gbit infiniband with IPoIB

Here are my CephFS fio test results using the following job file, changing the rw
parameter:

[test]
rw=randread # read
size=128m
directory=/var/lib/nova/instances/tmp/fio
ioengine=libaio
#ioengine=rbd
#direct=1
bs=4k
#numjobs=8
iodepth=64


randread:
Jobs: 1 (f=1): [r] [100.0% done] [7532KB/0KB/0KB /s] [1883/0/0 iops] [eta
00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=2159: Wed Jul  1 06:53:42 2015
  read : io=131072KB, bw=7275.4KB/s, iops=1818, runt= 18016msec
slat (usec): min=213, max=3338, avg=543.06, stdev=131.16
clat (usec): min=4, max=41593, avg=34596.58, stdev=2776.77
 lat (usec): min=555, max=42211, avg=35141.07, stdev=2804.49
clat percentiles (usec):
 |  1.00th=[28288],  5.00th=[30080], 10.00th=[31104], 20.00th=[32384],
 | 30.00th=[33024], 40.00th=[34048], 50.00th=[35072], 60.00th=[35584],
 | 70.00th=[36096], 80.00th=[37120], 90.00th=[37632], 95.00th=[38656],
 | 99.00th=[39680], 99.50th=[40192], 99.90th=[40704], 99.95th=[41216],
 | 99.99th=[41728]
bw (KB  /s): min= 6744, max= 7648, per=99.85%, avg=7264.44, stdev=215.44
lat (usec) : 10=0.01%, 750=0.01%
lat (msec) : 2=0.01%, 4=0.02%, 10=0.05%, 20=0.08%, 50=99.84%
  cpu  : usr=2.25%, sys=8.79%, ctx=63468, majf=0, minf=491
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=99.8%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
>=64=0.0%
 issued: total=r=32768/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=7275KB/s, minb=7275KB/s, maxb=7275KB/s,
mint=18016msec, maxt=18016msec

read:
  read : io=131072KB, bw=537180KB/s, iops=134295, runt=   244msec
slat (usec): min=1, max=41811, avg= 6.75, stdev=252.09
clat (usec): min=1, max=41995, avg=463.46, stdev=1985.09
 lat (usec): min=3, max=41996, avg=470.26, stdev=2000.60
clat percentiles (usec):
 |  1.00th=[  126],  5.00th=[  127], 10.00th=[  129], 20.00th=[  129],
 | 30.00th=[  131], 40.00th=[  131], 50.00th=[  133], 60.00th=[  135],
 | 70.00th=[  141], 80.00th=[  153], 90.00th=[  684], 95.00th=[ 2128],
 | 99.00th=[ 4320], 99.50th=[ 4576], 99.90th=[42240], 99.95th=[42240],
 | 99.99th=[42240]
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.04%
lat (usec) : 100=0.07%, 250=84.49%, 500=4.10%, 750=1.41%, 1000=1.39%
lat (msec) : 2=2.53%, 4=4.48%, 10=1.27%, 50=0.19%
  cpu  : usr=1.65%, sys=43.62%, ctx=121, majf=0, minf=71
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=99.8%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
>=64=0.0%
 issued: total=r=32768/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=537180KB/s, minb=537180KB/s, maxb=537180KB/s,
mint=244msec, maxt=244msec

Hope this gives you some idea.
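
For the MPI-based tools Mark mentions below, example invocations might look roughly
like this - paths, sizes and process counts are placeholders and the flags differ a
bit between versions:

mpirun -np 4 mdtest -d /cephfs/mdtest -n 1000 -i 3          # metadata-heavy workload, 4 processes
mpirun -np 4 ior -a POSIX -F -b 1g -t 4m -o /cephfs/ior/f   # one 1 GiB file per process, 4 MiB transfers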

Br, T

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Mark Nelson
Sent: 1. heinäkuuta 2015 0:57
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Simple CephFS benchmark

Two popular benchmarks in the HPC space for testing distributed file systems
are IOR and mdtest.  Both use MPI to coordinate processes on different
clients.  Another option may be to use fio or iozone.  Netmist may also be
an option, but I haven't used it myself and I'm not sure that it's fully
open source.

If you are only interested in single-client tests vs a local disk, I'd
personally use fio.

Mark

On 06/30/2015 02:50 PM, Hadi Montakhabi wrote:
> I have set up a ceph storage cluster and I'd like to utilize the 
> cephfs (I am assuming this is the only way one could use some other 
> code without using the API).
> To do so, I have mounted my cephfs on the client node.
> I'd like to know what would be a good benchmark for measuring write 
> and read performance on the cephfs compared to the local filesystem?
>
> Thanks,
> Hadi
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

2015-06-30 Thread Jan Schermer
Re: your previous question

I will not elaborate on this much more; I hope some of you will try it if you 
have NUMA systems and see for yourselves.

But I can recommend some docs:
http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-ivy-bridge-ep-memory-performance-ww-en.pdf
 


http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf 


RHEL also has some nice documentation on the issue. If you don’t use ancient 
(like RHEL6) systems then your OS+kernel should do the “right thing” by default 
and take NUMA locality into account when scheduling and migrating.
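
For the pinning itself, the rough idea looks something like this - this is not the
pincpus script mentioned below, just a minimal cgroup-v1 cpuset sketch; the cgroup
name is made up and you would repeat it per NUMA node / per OSD:

# pin one OSD (CPU and memory) to NUMA node 0
mkdir -p /sys/fs/cgroup/cpuset/ceph-osd-node0
cat /sys/devices/system/node/node0/cpulist > /sys/fs/cgroup/cpuset/ceph-osd-node0/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/ceph-osd-node0/cpuset.mems
echo $(pidof -s ceph-osd) > /sys/fs/cgroup/cpuset/ceph-osd-node0/tasks

# or, without cgroups, when starting the daemon:
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0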

Jan


> On 01 Jul 2015, at 03:02, Ray Sun  wrote:
> 
> Jan,
> Thanks a lot. I can do my contribution to this project if I can.
> 
> Best Regards
> -- Ray
> 
> On Tue, Jun 30, 2015 at 11:50 PM, Jan Schermer wrote:
> Hi all,
> our script is available on GitHub
> 
> https://github.com/prozeta/pincpus 
> 
> I haven’t had much time to do a proper README, but I hope the configuration 
> is self explanatory enough for now.
> What it does is pin each OSD into the most “empty” cgroup assigned to a NUMA 
> node.
> 
> Let me know how it works for you!
> 
> Jan
> 
> 
>> On 30 Jun 2015, at 10:50, Huang Zhiteng wrote:
>> 
>> 
>> 
>> On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer wrote:
>> Not having OSDs and KVMs compete against each other is one thing.
>> But there are more reasons to do this
>> 
>> 1) not moving the processes and threads between cores that much (better 
>> cache utilization)
>> 2) aligning the processes with memory on NUMA systems (that means all modern 
>> dual socket systems) - you don’t want your OSD running on CPU1 with memory 
>> allocated to CPU2
>> 3) the same goes for other resources like NICs or storage controllers - but 
>> that’s less important and not always practical to do
>> 4) you can limit the scheduling domain on linux if you limit the cpuset for 
>> your OSDs (I’m not sure how important this is, just best practice)
>> 5) you can easily limit memory or CPU usage, set priority, with much greater 
>> granularity than without cgroups
>> 6) if you have HyperThreading enabled you get the most gain when the 
>> workloads on the threads are dissimiliar - so to have the higher throughput 
>> you have to pin OSD to thread1 and KVM to thread2 on the same core. We’re 
>> not doing that because latency and performance of the core can vary 
>> depending on what the other thread is doing. But it might be useful to 
>> someone.
>> 
>> Some workloads exhibit >100% performance gain when everything aligns in a 
>> NUMA system, compared to a SMP mode on the same hardware. You likely won’t 
>> notice it on light workloads, as the interconnects (QPI) are very fast and 
>> there’s a lot of bandwidth, but for stuff like big OLAP databases or other 
>> data-manipulation workloads there’s a huge difference. And with CEPH being 
>> CPU hungy and memory intensive, we’re seeing some big gains here just by 
>> co-locating the memory with the processes….
>> Could you elaborate a bit on this? I'm interested to learn in what situations 
>> memory locality helps Ceph, and to what extent. 
>> 
>> 
>> Jan
>> 
>>  
>>> On 30 Jun 2015, at 08:12, Ray Sun wrote:
>>> 
>>> Sounds great, any update please let me know.
>>> 
>>> Best Regards
>>> -- Ray
>>> 
>>> On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer wrote:
>>> I promised you all our scripts for automatic cgroup assignment - they are 
>>> in our production already and I just need to put them on github, stay tuned 
>>> tomorrow :-)
>>> 
>>> Jan
>>> 
>>> 
 On 29 Jun 2015, at 19:41, Somnath Roy wrote:
 
 Presently, you have to do it by using tool like ‘taskset’ or ‘numactl’…
  
 Thanks & Regards
 Somnath
  
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ray Sun
 Sent: Monday, June 29, 2015 9:19 AM
 To: ceph-users@lists.ceph.com 
 Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu 
 core?
  
 Cephers,
 I want to bind each of my ceph-osd processes to a specific cpu core, but I didn't 
 find any document explaining that; could anyone provide me with some 
 detailed information? Thanks.
  
 Currently, my ceph is running like this:
  
 root  28692  1  0 Jun23 ?        00:37:26 /usr/bin/ceph-mon -i 
 seed.econe.com --pid-file 
 /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph
 root  40063  1  1 Jun23 ?        02:13:31 /usr/bin/ceph-osd -