Re: [ceph-users] Some monitors have still not reached quorum

2016-04-19 Thread 席智勇
some tips:
1. if you enabled auth_cluster_required, you should check the keyring
2. can you reach the monitors from your admin node via ssh without a password?
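A quick way to sanity-check both tips (hostnames and paths here are assumptions based on the quick-start layout, so adjust to your setup):

```shell
# Hypothetical checks, assuming the quick-start node name "monitor" and the
# default deploy directory ~/my-cluster:
ssh monitor true && echo "passwordless ssh OK"        # tip 2
# tip 1: compare the local mon keyring with the one deployed on the monitor
md5sum ~/my-cluster/ceph.mon.keyring
ssh monitor sudo md5sum /var/lib/ceph/mon/ceph-monitor/keyring
```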

2016-04-16 18:16 GMT+08:00 AJ NOURI :

> Followed the preflight and quick start
> http://docs.ceph.com/docs/master/start/quick-ceph-deploy/
>
> Stuck here
>
> ajn@admin-node:~/my-cluster$ ceph-deploy mon create-initial
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/ajn/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.31): /usr/bin/ceph-deploy mon
> create-initial
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  subcommand: create-initial
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  func  : <function mon at 0x7fd1c323a668>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.cli][INFO  ]  keyrings  : None
> [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts monitor
> [ceph_deploy.mon][DEBUG ] detecting platform for host monitor ...
> [monitor][DEBUG ] connection detected need for sudo
> [monitor][DEBUG ] connected to host: monitor
> [monitor][DEBUG ] detect platform information from remote host
> [monitor][DEBUG ] detect machine type
> [monitor][DEBUG ] find the location of an executable
> [ceph_deploy.mon][INFO  ] distro info: Ubuntu 14.04 trusty
> [monitor][DEBUG ] determining if provided host has same hostname in remote
>
> [monitor][DEBUG ] get remote short hostname
> [monitor][DEBUG ] deploying mon to monitor
> [monitor][DEBUG ] get remote short hostname
> [monitor][DEBUG ] remote hostname: monitor
> [monitor][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [monitor][DEBUG ] create the mon path if it does not exist
> [monitor][DEBUG ] checking for done path:
> /var/lib/ceph/mon/ceph-monitor/done
> [monitor][DEBUG ] create a done file to avoid re-doing the mon deployment
> [monitor][DEBUG ] create the init path if it does not exist
> [monitor][INFO  ] Running command: sudo initctl emit ceph-mon cluster=ceph
> id=monitor
> [monitor][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions:
> [Errno 2] No such file or directory
> [monitor][WARNIN] monitor: mon.monitor, might not be running yet
> [monitor][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions:
> [Errno 2] No such file or directory
> [monitor][WARNIN] monitor monitor does not exist in monmap
> [monitor][WARNIN] neither `public_addr` nor `public_network` keys are
> defined for monitors
> [monitor][WARNIN] monitors may not be able to form quorum
> [ceph_deploy.mon][INFO  ] processing monitor mon.monitor
> [monitor][DEBUG ] connection detected need for sudo
> [monitor][DEBUG ] connected to host: monitor
> [monitor][DEBUG ] detect platform information from remote host
> [monitor][DEBUG ] detect machine type
> [monitor][DEBUG ] find the location of an executable
> [monitor][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions:
> [Errno 2] No such file or directory
> [ceph_deploy.mon][WARNIN] mon.monitor monitor is not yet in quorum, tries
> left: 5
> [ceph_deploy.mon][WARNIN] waiting 5 seconds before retrying
> [monitor][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions:
> [Errno 2] No such file or directory
> [ceph_deploy.mon][WARNIN] mon.monitor monitor is not yet in quorum, tries
> left: 4
> [ceph_deploy.mon][WARNIN] waiting 10 seconds before retrying
> [monitor][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions:
> [Errno 2] No such file or directory
> [ceph_deploy.mon][WARNIN] mon.monitor monitor is not yet in quorum, tries
> left: 3
> [ceph_deploy.mon][WARNIN] waiting 10 seconds before retrying
> [monitor][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon
> /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions:
> [Errno 2] N
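For what it's worth, the "neither `public_addr` nor `public_network` keys are defined for monitors" warning in the log above often points at the fix. A hedged sketch, assuming the quick-start layout (the subnet is an example and must match the monitors' actual network):

```shell
# Hypothetical fix: declare the monitor subnet under [global] in
# ~/my-cluster/ceph.conf on the admin node, push it out, and retry.
echo "public_network = 192.168.1.0/24" >> ~/my-cluster/ceph.conf
ceph-deploy --overwrite-conf config push monitor
ceph-deploy mon create-initial
```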

[ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-19 Thread Mike Miller

Hi,

RBD mount
ceph v0.94.5
6 OSD with 9 HDD each
10 GBit/s public and private networks
3 MON nodes 1Gbit/s network

An rbd mounted with the btrfs filesystem format performs really badly when
reading. I tried readahead in all combinations, but that does not help in
any way.


Write rates are very good: in excess of 600 MB/s, up to 1200 MB/s, averaging
800 MB/s.

Read rates on the same mounted rbd are only about 10-30 MB/s !?

Of course, both writes and reads are from a single client machine with a
single write/read command, so I am looking at single-threaded performance.
Actually, I was hoping to see at least 200-300 MB/s when reading, but I am
seeing 10% of that at best.
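For reference, this is roughly how readahead on a mapped krbd device is tuned (device name and values are assumptions; a sketch of the knobs, not a guaranteed fix):

```shell
# Hypothetical example: raise readahead on a mapped rbd device (here /dev/rbd0).
blockdev --getra /dev/rbd0                       # current readahead, in 512-byte sectors
blockdev --setra 4096 /dev/rbd0                  # 4096 sectors = 2 MiB readahead
echo 2048 > /sys/block/rbd0/queue/read_ahead_kb  # equivalent sysfs knob, in KiB
```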


Thanks for your help.

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Day Sunnyvale Presentations

2016-04-19 Thread 席智勇
I have read SK's performance tuning work too; it's a good job, especially
the analysis of write/read latency on the OSD.
I want to ask a question about the 'Long logging time' optimization: what is
meant by 'split logging into another thread and do it later'?
AFAIK, Ceph already does logging asynchronously via a logging thread.
Can you share more info?

2016-04-13 18:11 GMT+08:00 Alexandre DERUMIER :

> >>Based on discussion with them at Ceph Day in Tokyo, JP, they have frozen
> their own copy of the Ceph repository.
> >>And they've been optimizing the code with their own team to meet their
> requirements.
> >>AFAICT they have not submitted any PRs.
>
> Thanks for the info
>
> @cc bspark8.sk.com : maybe can you give us more informations ?
>
>
> - Mail original -
> De: "Shinobu Kinjo" 
> À: "aderumier" 
> Cc: "Patrick McGarry" , "ceph-devel" <
> ceph-de...@vger.kernel.org>, "ceph-users" 
> Envoyé: Mercredi 13 Avril 2016 05:56:01
> Objet: Re: [ceph-users] Ceph Day Sunnyvale Presentations
>
> Alexandre,
>
> Based on discussion with them at Ceph Day in Tokyo, JP, they have frozen
> their own copy of the Ceph repository.
> And they've been optimizing the code with their own team to meet their
> requirements.
> AFAICT they have not submitted any PRs.
>
> Cheers,
> Shinobu
>
> - Original Message -
> From: "Alexandre DERUMIER" 
> To: "Patrick McGarry" 
> Cc: "ceph-devel" , "ceph-users" <
> ceph-us...@ceph.com>
> Sent: Wednesday, April 13, 2016 12:45:31 PM
> Subject: Re: [ceph-users] Ceph Day Sunnyvale Presentations
>
> Hi,
>
> I was reading this presentation from SK telecom about flash optimisations
>
> AFCeph: Ceph Performance Analysis & Improvement on Flash [Slides]
>
> http://fr.slideshare.net/Inktank_Ceph/af-ceph-ceph-performance-analysis-and-improvement-on-flash
> Byung-Su Park, SK Telecom
>
>
> They seem to have made optimisations in the Ceph code. Are there any patch
> references? (applied to infernalis/jewel?)
>
>
> They also seem to have done Ceph config tuning and system tuning, but no
> config details are provided :(
> It would be great to share these with the community :)
>
> Regards,
>
> Alexandre
>
> - Mail original -
> De: "Patrick McGarry" 
> À: "ceph-devel" , "ceph-users" <
> ceph-us...@ceph.com>
> Envoyé: Mercredi 6 Avril 2016 18:18:28
> Objet: [ceph-users] Ceph Day Sunnyvale Presentations
>
> Hey cephers,
>
> I have all but one of the presentations from Ceph Day Sunnyvale, so
> rather than wait for a full hand I went ahead and posted the link to
> the slides on the event page:
>
> http://ceph.com/cephdays/ceph-day-sunnyvale/
>
> The videos probably won't be processed until after next week, but I'll
> add those once we get them. Thanks to all of the presenters and
> attendees who made this another great event.
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com || http://community.redhat.com
> @scuttlemonkey || @ceph


[ceph-users] add mon and move mon

2016-04-19 Thread GuiltyCrown
Dear friends:

Hello, I have a small problem when using Ceph. My cluster has three
monitors, and I want to remove one.
[root@node01 ~]# ceph -s
cluster b0d8bd0d-6269-4ce7-a10b-9adc7ee2c4c8
 health HEALTH_WARN
too many PGs per OSD (682 > max 300)
 monmap e23: 3 mons at 
{node01=172.168.2.185:6789/0,node02=172.168.2.186:6789/0,node03=172.168.2.187:6789/0}
election epoch 472, quorum 0,1,2 node01,node02,node03
 osdmap e7084: 18 osds: 18 up, 18 in
  pgmap v1051011: 4448 pgs, 15 pools, 7915 MB data, 12834 objects
27537 MB used, 23298 GB / 23325 GB avail
4448 active+clean


So I did this:
#ceph-deploy mon destroy node03

Then I added it to the cluster again:

#ceph-deploy mon add node03

node03 was added to the cluster, but after a while the monitor went down.
When I looked at /var/log/messages,
I found this:

Apr 19 11:12:01 node01 systemd: Starting Session 14091 of user root.
Apr 19 11:12:01 node01 systemd: Started Session 14091 of user root.
Apr 19 11:12:39 node01 bash: 2016-04-19 11:12:39.533817 7f6e51ec2700 -1 
mon.node01@0(leader) e23 *** Got Signal Terminated ***
When I start up the monitor, after a while it goes down again.
But I have enough disk space.
[root@node03 ~]# df -TH
FilesystemType  Size  Used Avail Use% Mounted on
/dev/mapper/rhel-root xfs11G  4.7G  6.1G  44% /
devtmpfs  devtmpfs   26G 0   26G   0% /dev
tmpfs tmpfs  26G   82k   26G   1% /dev/shm
tmpfs tmpfs  26G  147M   26G   1% /run
tmpfs tmpfs  26G 0   26G   0% /sys/fs/cgroup
/dev/mapper/rhel-usr  xfs11G  4.1G  6.7G  38% /usr
/dev/mapper/rhel-tmp  xfs11G   34M   11G   1% /tmp
/dev/mapper/rhel-home xfs11G   34M   11G   1% /home
/dev/mapper/rhel-var  xfs11G  1.6G  9.2G  15% /var
/dev/sde1 xfs   2.0T  152M  2.0T   1% /var/lib/ceph/osd/ceph-15
/dev/sdg1 xfs   2.0T  3.8G  2.0T   1% /var/lib/ceph/osd/ceph-17
/dev/sdd1 xfs   2.0T  165M  2.0T   1% /var/lib/ceph/osd/ceph-14
/dev/sda1 xfs   521M  131M  391M  26% /boot
/dev/sdb1 xfs   219G  989M  218G   1% /var/lib/ceph/osd/ceph-4
/dev/sdf1 xfs   2.0T  4.6G  2.0T   1% /var/lib/ceph/osd/ceph-16
/dev/sdc1 xfs   219G  129M  219G   1% /var/lib/ceph/osd/ceph-5
You have new mail in /var/spool/mail/root
[root@node03 ~]#

What is the problem? Is my operation wrong?

Looking forward to your reply.
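Some diagnostics that may help narrow this down (a sketch; the mon name and paths are assumptions, and the removal step destroys the old mon store):

```shell
# Hypothetical diagnostics for a monitor that keeps going down:
ceph health detail                    # look for clock skew / "mon down" details
ceph daemon mon.node03 mon_status     # run on node03: state, rank, monmap epoch
ceph mon dump                         # confirm node03's address in the monmap
# if node03's store is stale, remove it and redeploy from scratch:
ceph mon remove node03
rm -rf /var/lib/ceph/mon/ceph-node03  # run on node03; destroys the old mon store
ceph-deploy mon add node03
```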



--Dingxf48


Sent from Mail for Windows 10



Re: [ceph-users] Powercpu and ceph

2016-04-19 Thread Ilya Dryomov
On Tue, Apr 19, 2016 at 5:28 AM, min fang  wrote:
> I am confused on ceph/ceph-qa-suite and ceph/teuthology. Which one should I
> use? thanks.

The ceph-qa-suite repository contains the test snippets; teuthology is the
test framework that knows how to run them.  It will pull the appropriate
branch of ceph-qa-suite automatically or, in some cases, you can point
it at your own checkout.

Setting it up is not an easy task though, so I'd start with building
and running "make check".
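A rough sketch of what building and running "make check" looked like around this release (the autotools-based steps are assumptions and may differ per branch):

```shell
# Hypothetical build-and-test steps for a Ceph checkout of this era:
git clone https://github.com/ceph/ceph.git && cd ceph
./install-deps.sh           # pull the build dependencies
./autogen.sh && ./configure
make -j"$(nproc)"
make check                  # run the local unit tests
```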

Thanks,

Ilya


Re: [ceph-users] krbd map on Jewel, sysfs write failed when rbd map

2016-04-19 Thread Ilya Dryomov
On Mon, Apr 18, 2016 at 11:58 AM, Tim Bishop  wrote:
> I had the same issue when testing on Ubuntu xenial beta. That has 4.4,
> so it should be fine? I had to create images without the new RBD features
> to make it work.

None of the "new" features are currently supported by krbd.  4.7 will
support exclusive-lock with most of the rest following in 4.8.

You don't have to recreate images: while those features are enabled in
jewel by default, you should be able to dynamically disable them with
"rbd feature disable imagename deep-flatten fast-diff object-map
exclusive-lock".
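In shell form, the two options might look like this (image names are placeholders):

```shell
# Sketch: make an existing jewel-created image mappable with krbd
rbd feature disable myimage deep-flatten fast-diff object-map exclusive-lock
# or create new images with only the krbd-supported layering feature
rbd create --size 10240 --image-feature layering mynewimage
```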

Thanks,

Ilya


[ceph-users] Build Raw Volume from Recovered RBD Objects

2016-04-19 Thread Mike Dawson

All,

I was called in to assist in a failed Ceph environment with the cluster 
in an inoperable state. No rbd volumes are mountable/exportable due to 
missing PGs.


The previous operator was using a replica count of 2. The cluster 
suffered a power outage and various non-catastrophic hardware issues as 
they were starting it back up. At some point during recovery, drives 
were removed from the cluster leaving several PGs missing.


Efforts to restore the missing PGs from the data on the removed drives 
failed using the process detailed in a Red Hat Customer Support blog 
post [0]. Upon starting the OSDs with recovered PGs, a segfault halts 
progress. The original operator isn't clear on when, but there may have 
been a software upgrade applied after the drives were pulled.


I believe the cluster may be irrecoverable at this point.

My recovery assistance has focused on a plan to:

1) Scrape all objects for several key rbd volumes from live OSDs and the 
removed former OSD drives.


2) Compare and deduplicate the two copies of each object.

3) Recombine the objects for each volume into a raw image.

I have completed steps 1 and 2 with apparent success. My initial stab at 
step 3 yielded a raw image that could be mounted and had signs of a 
filesystem, but it could not be read. Could anyone assist me with the 
following questions?


1) Are the rbd objects in order by filename? If not, what is the method 
to determine their order?


2) How should objects smaller than the default 4MB chunk size be 
handled? Should they be padded somehow?


3) If any objects were completely missing and therefore unavailable to 
this process, how should they be handled? I assume we need to offset/pad 
to compensate.
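For what it's worth, here is a minimal sketch of step 3 under the usual assumptions (each object file name ends in a 16-hex-digit object number, the object size is the default 4 MiB, and missing or short objects should read back as zeros); this is an illustration, not an official recovery procedure:

```shell
#!/usr/bin/env bash
# Hypothetical reassembly of scraped rbd_data objects into a raw volume.
# Assumptions: names end in a 16-hex-digit object number, 4 MiB object size,
# and any missing/short object should read back as zeros.
set -e
OBJ_SIZE=$((4 * 1024 * 1024))

reassemble() {  # reassemble <output> <object files...>
    local out=$1; shift
    : > "$out"
    local f idx
    for f in "$@"; do
        idx=$((16#${f##*.}))   # hex object number -> object index
        # seek + notrunc places each object at idx * OBJ_SIZE and leaves
        # holes (reading as zeros) where objects are missing or short
        dd if="$f" of="$out" bs="$OBJ_SIZE" seek="$idx" conv=notrunc status=none
    done
}

# demo on synthetic data: objects 0 and 2 present, object 1 missing
workdir=$(mktemp -d)
printf 'AAAA' > "$workdir/rbd_data.deadbeef.0000000000000000"
printf 'CCCC' > "$workdir/rbd_data.deadbeef.0000000000000002"
reassemble "$workdir/volume.raw" "$workdir"/rbd_data.deadbeef.*
truncate -s $((3 * OBJ_SIZE)) "$workdir/volume.raw"  # pad tail to full size
wc -c < "$workdir/volume.raw"
```

The `seek`/`conv=notrunc` combination answers questions 2 and 3 in one go: holes read back as zeros, so short or missing objects need no explicit padding, and the final `truncate` pads the tail out to the full volume size.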

--
Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250
M: 317-490-3018


[ceph-users] cephfs does not seem to properly free up space

2016-04-19 Thread Simion Rad
Hello,


At my workplace we have a production cephfs cluster (334 TB on 60 OSDs) which
was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on Ubuntu
14.04.3 (linux 3.19.0-33).

It seems that cephfs still doesn't free up space at all, or at least that's
what the df command tells us.

Is there a better way of getting a df-like output with another command for
cephfs?
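In case it helps, these are the usual cluster-side views of usage that can be compared against df (a sketch):

```shell
ceph df detail   # per-pool used space, object counts, max avail
rados df         # per-pool raw usage as seen by RADOS
```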


Thank you,

Marius Rad

SysAdmin

www.propertyshark.com


[ceph-users] ceph-mon.target not enabled

2016-04-19 Thread Ruben Kerkhof
Hi all,

I just installed 3 monitors, using ceph-deploy, on CentOS 7.2. Ceph is 10.1.2.

My ceph-mon processes do not come up after reboot. This is what ceph-deploy 
create-initial did:

[ams1-ceph01-mon01][INFO  ] Running command: sudo systemctl enable ceph.target
[ams1-ceph01-mon01][WARNIN] Created symlink from 
/etc/systemd/system/multi-user.target.wants/ceph.target to 
/usr/lib/systemd/system/ceph.target.
[ams1-ceph01-mon01][INFO  ] Running command: sudo systemctl enable 
ceph-mon@ams1-ceph01-mon01
[ams1-ceph01-mon01][WARNIN] Created symlink from 
/etc/systemd/system/ceph-mon.target.wants/ceph-mon@ams1-ceph01-mon01.service to 
/usr/lib/systemd/system/ceph-mon@.service.
[ams1-ceph01-mon01][INFO  ] Running command: sudo systemctl start 
ceph-mon@ams1-ceph01-mon01

However, it did not enable ceph-mon.target:
$ sudo systemctl is-enabled ceph-mon.target
disabled

Am I supposed to enable ceph-mon.target by hand? I did search the documentation 
but haven't been able to find anything that says so.
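Until that is clarified, the workaround is presumably to enable the target by hand (a sketch):

```shell
sudo systemctl enable ceph-mon.target
sudo systemctl is-enabled ceph-mon.target   # should now report "enabled"
```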

Kind regards,

Ruben Kerkhof


Re: [ceph-users] cephfs does not seem to properly free up space

2016-04-19 Thread John Spray
On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad  wrote:
> Hello,
>
>
> At my workplace we have a production cephfs cluster (334 TB on 60 OSDs)
> which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on
> Ubuntu 14.04.3 (linux 3.19.0-33).
>
> It seems that cephfs still doesn't free up space at all or at least that's
> what df command tells us.

Hmm, historically there were bugs with the purging code, but I thought
we fixed them before Infernalis.

Does the space get freed after you unmount the client?  Some issues
have involved clients holding onto references to unlinked inodes.

John

>
> Is there a better way of getting a df-like output with other command for
> cephfs  ?
>
>
> Thank you,
>
> Marius Rad
>
> SysAdmin
>
> www.propertyshark.com
>
>


Re: [ceph-users] cephfs does not seem to properly free up space

2016-04-19 Thread Simion Rad
Mounting and unmounting doesn't change anything.
The used space reported by the df command is nearly the same as the values
returned by the ceph -s command.

Example 1, df output:
ceph-fuse   334T  134T  200T  41% /cephfs

Example 2, ceph -s output:
 health HEALTH_WARN
mds0: Many clients (22) failing to respond to cache pressure
noscrub,nodeep-scrub,sortbitwise flag(s) set
 monmap e1: 5 mons at 
{r730-12=10.103.213.12:6789/0,r730-4=10.103.213.4:6789/0,r730-5=
10.103.213.5:6789/0,r730-8=10.103.213.8:6789/0,r730-9=10.103.213.9:6789/0}
election epoch 132, quorum 0,1,2,3,4 
r730-4,r730-5,r730-8,r730-9,r730-12
 mdsmap e14637: 1/1/1 up {0=ceph2-mds-2=up:active}
 osdmap e6549: 68 osds: 68 up, 68 in
flags noscrub,nodeep-scrub,sortbitwise
  pgmap v4394151: 896 pgs, 3 pools, 54569 GB data, 56582 kobjects
133 TB used, 199 TB / 333 TB avail
 896 active+clean
  client io 47395 B/s rd, 1979 kB/s wr, 388 op/s



From: John Spray 
Sent: Tuesday, April 19, 2016 22:04
To: Simion Rad
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs does not seem to properly free up space

On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad  wrote:
> Hello,
>
>
> At my workplace we have a production cephfs cluster (334 TB on 60 OSDs)
> which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on
> Ubuntu 14.04.3 (linux 3.19.0-33).
>
> It seems that cephfs still doesn't free up space at all or at least that's
> what df command tells us.

Hmm, historically there were bugs with the purging code, but I thought
we fixed them before Infernalis.

Does the space get freed after you unmount the client?  Some issues
have involved clients holding onto references to unlinked inodes.

John

>
> Is there a better way of getting a df-like output with other command for
> cephfs  ?
>
>
> Thank you,
>
> Marius Rad
>
> SysAdmin
>
> www.propertyshark.com
>
>


[ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Stephen Lord


I have a setup using some Intel P3700 devices as a cache tier, and 33 SATA
drives hosting the pool behind them. I set up the cache tier with writeback,
and gave it a size, max object count, etc.:

 ceph osd pool set target_max_bytes 5000
 ceph osd pool set nvme target_max_bytes 5000
 ceph osd pool set nvme target_max_objects 50
 ceph osd pool set nvme cache_target_dirty_ratio 0.5
 ceph osd pool set nvme cache_target_full_ratio 0.8

This is all running Jewel using bluestore OSDs (I know, experimental). The
cache tier will write at about 900 Mbytes/sec and read at 2.2 Gbytes/sec; the
sata pool can take writes at about 600 Mbytes/sec in aggregate. However, it
looks like the mechanism for cleaning the cache down to the disk layer is
being massively rate limited, and I see about 47 Mbytes/sec of read activity
from each SSD while this is going on.

This means that while I could be pushing data into the cache at high speed, It 
cannot evict old content very fast at all, and it is very easy to hit the high 
water mark and the application I/O drops dramatically as it becomes throttled 
by how fast the cache can flush.

I suspect it is operating on a placement group at a time so ends up targeting a 
very limited number of objects and hence disks at any one time. I can see 
individual disk drives going busy for very short periods, but most of them are 
idle at any one point in time. The only way to drive the disk based OSDs fast 
is to hit a lot of them at once which would mean issuing many cache flush 
operations in parallel.

Are there any controls which can influence this behavior?

Thanks

  Steve

--
The information contained in this transmission may be confidential. Any 
disclosure, copying, or further distribution of confidential information is not 
permitted unless such privilege is explicitly granted in writing by Quantum. 
Quantum reserves the right to have electronic communications, including email 
and attachments, sent across its networks filtered through anti virus and spam 
software programs and retain such messages in order to comply with applicable 
data security and retention requirements. Quantum is not responsible for the 
proper and complete transmission of the substance of this communication or for 
any delay in its receipt.


Re: [ceph-users] cephfs does not seem to properly free up space

2016-04-19 Thread Yan, Zheng
Have you ever used fancy layouts (non-default file striping)?

see http://tracker.ceph.com/issues/15050


On Wed, Apr 20, 2016 at 3:17 AM, Simion Rad  wrote:
> Mounting and unmount doesn't change anyting.
> The used space reported by df command is nearly the same  as the values 
> returned by ceph -s command.
>
> Example 1, df output:
> ceph-fuse   334T  134T  200T  41% /cephfs
>
> Example 2, ceph -s output:
>  health HEALTH_WARN
> mds0: Many clients (22) failing to respond to cache pressure
> noscrub,nodeep-scrub,sortbitwise flag(s) set
>  monmap e1: 5 mons at 
> {r730-12=10.103.213.12:6789/0,r730-4=10.103.213.4:6789/0,r730-5=
> 10.103.213.5:6789/0,r730-8=10.103.213.8:6789/0,r730-9=10.103.213.9:6789/0}
> election epoch 132, quorum 0,1,2,3,4 
> r730-4,r730-5,r730-8,r730-9,r730-12
>  mdsmap e14637: 1/1/1 up {0=ceph2-mds-2=up:active}
>  osdmap e6549: 68 osds: 68 up, 68 in
> flags noscrub,nodeep-scrub,sortbitwise
>   pgmap v4394151: 896 pgs, 3 pools, 54569 GB data, 56582 kobjects
> 133 TB used, 199 TB / 333 TB avail
>  896 active+clean
>   client io 47395 B/s rd, 1979 kB/s wr, 388 op/s
>
>
> 
> From: John Spray 
> Sent: Tuesday, April 19, 2016 22:04
> To: Simion Rad
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] cephfs does not seem to properly free up space
>
> On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad  wrote:
>> Hello,
>>
>>
>> At my workplace we have a production cephfs cluster (334 TB on 60 OSDs)
>> which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on
>> Ubuntu 14.04.3 (linux 3.19.0-33).
>>
>> It seems that cephfs still doesn't free up space at all or at least that's
>> what df command tells us.
>
> Hmm, historically there were bugs with the purging code, but I thought
> we fixed them before Infernalis.
>
> Does the space get freed after you unmount the client?  Some issues
> have involved clients holding onto references to unlinked inodes.
>
> John
>
>>
>> Is there a better way of getting a df-like output with other command for
>> cephfs  ?
>>
>>
>> Thank you,
>>
>> Marius Rad
>>
>> SysAdmin
>>
>> www.propertyshark.com
>>
>>


Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Christian Balzer

Hello,

On Tue, 19 Apr 2016 20:21:39 + Stephen Lord wrote:

> 
> 
> I Have a setup using some Intel P3700 devices as a cache tier, and 33
> sata drives hosting the pool behind them. 

A bit more details about the setup would be nice, as in how many nodes,
interconnect, replication size of the cache tier and the backing HDD
pool, etc. 
And "some" isn't a number, how many P3700s (which size?) in how many nodes?
One assumes there are no further SSDs involved with those SATA HDDs?

>I setup the cache tier with
> writeback, gave it a size and max object count etc:
> 
>  ceph osd pool set target_max_bytes 5000
^^^
This should have given you an error, it needs the pool name, as in your
next line.

>  ceph osd pool set nvme target_max_bytes 5000
>  ceph osd pool set nvme target_max_objects 50
>  ceph osd pool set nvme cache_target_dirty_ratio 0.5
>  ceph osd pool set nvme cache_target_full_ratio 0.8
> 
> This is all running Jewel using bluestore OSDs (I know experimental).
Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-

> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
> aggregate. 
  ^
Key word there.

That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
disappointing result for the supposedly twice as fast BlueStore. 
Again, replication size and topology might explain that up to a point, but
we don't know them (yet).

Also exact methodology of your tests please, i.e. the fio command line, how
was the RBD device (if you tested with one) mounted and where, etc...

> However, it looks like the mechanism for cleaning the cache
> down to the disk layer is being massively rate limited and I see about
> 47 Mbytes/sec of read activity from each SSD while this is going on.
> 
This number is meaningless w/o knowing how many NVMes you have.
That being said, there are 2 levels of flushing past Hammer, but if you
push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you
will get full speed.

> This means that while I could be pushing data into the cache at high
> speed, It cannot evict old content very fast at all, and it is very easy
> to hit the high water mark and the application I/O drops dramatically as
> it becomes throttled by how fast the cache can flush.
> 
> I suspect it is operating on a placement group at a time so ends up
> targeting a very limited number of objects and hence disks at any one
> time. I can see individual disk drives going busy for very short
> periods, but most of them are idle at any one point in time. The only
> way to drive the disk based OSDs fast is to hit a lot of them at once
> which would mean issuing many cache flush operations in parallel.
>
Yes, it is all PG based, so your observations match the expectations and
what everybody else is seeing. 
See also the thread "Cache tier operation clarifications" by me, version 2
is in the works.
There are also some new knobs in Jewel that may be helpful, see:
http://www.spinics.net/lists/ceph-users/msg25679.html

If you have a use case with a clearly defined idle/low-use time and a small
enough growth in dirty objects, consider what I'm doing: dropping the
cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a whole
day) via a cron job, waiting a bit, and then raising it again to its normal
value.

That way flushes won't normally happen at all during your peak usage
times, though in my case that's purely cosmetic, flushes are not
problematic at any time in that cluster currently.
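The cron approach described above might look like this (the pool name "nvme" and the ratios are assumptions; adjust to your idle window):

```shell
# Hypothetical /etc/cron.d entries: flush during a nightly idle window by
# lowering the dirty ratio, then restore it before the day starts.
# m h dom mon dow user command
0 3 * * * root ceph osd pool set nvme cache_target_dirty_ratio 0.45
0 5 * * * root ceph osd pool set nvme cache_target_dirty_ratio 0.5
```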

> Are there any controls which can influence this behavior?
> 
See above (cache_target_dirty_high_ratio).

Aside from that you might want to reflect on what your use case, workload
is going to be and how your testing reflects on it.

As in, are you really going to write MASSIVE amounts of data at very high
speeds, or is it (like in 90% of common cases) the number of small write
IOPS that is really going to be the limiting factor?
Which is something that cache tiers can deal with very well (or
sufficiently large and well designed "plain" clusters).

Another thing to think about is using the "readforward" cache mode,
leaving your cache tier free to just handle writes and thus giving it more
space to work with.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] join the users

2016-04-19 Thread GuiltyCrown
join the users

Sent from Mail for Windows 10



Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Stephen Lord

OK, you asked ;-)

This is all via RBD. I am running a single filesystem on top of 8 RBD devices
in an effort to get data striping across more OSDs; I had been using that
setup before adding the cache tier.

3 nodes with eleven 6 TB SATA drives each for a base RBD pool, set up with
replication size 3. No SSDs are involved in those OSDs, since ceph-disk does
not let you break a bluestore configuration into more than one device at the
moment.

The 600 Mbytes/sec is an approximate sustained number for the data rate I can
get going into this pool via RBD; that turns into 3 times that for the raw
data rate, so at 33 drives that is mid-50s Mbytes/sec per drive. I have
pushed it harder than that from time to time, but the OSD really wants to use
fdatasync a lot, and that tends to eat up a lot of the potential of a device;
these disks will do 160 Mbytes/sec if you stream data to them.

I just checked with rados bench to this set of 33 OSDs with a 3 replica pool,
and 600 Mbytes/sec is what it will do from the same client host.

All the networking is 40 GbE, a single port per host. Generally I can push
2.2 Gbytes/sec in one direction between two hosts over a single TCP link; the
max I have seen is about 2.7 Gbytes/sec coming into a node. Short of going to
RDMA, that appears to be about the limit for these processors.

There are a grand total of two 400 GB P3700s running a pool with a
replication factor of 1; these are in 2 other nodes. Once I add in
replication, performance goes downhill. If I had more hardware I would be
running more of these with replication, but I am out of network cards right
now.

So 5 nodes running OSDs, and a 6th node running the RBD client using the kernel 
implementation.

Complete set of commands for creating the cache tier. I pulled this from
history, so the line in the middle was actually a failed command; sorry for
the red herring.

  982  ceph osd pool create nvme 512 512 replicated_nvme 
  983  ceph osd pool set nvme size 1
  984  ceph osd tier add rbd nvme
  985  ceph osd tier cache-mode  nvme writeback
  986  ceph osd tier set-overlay rbd nvme 
  987  ceph osd pool set nvme  hit_set_type bloom 
  988  ceph osd pool set target_max_bytes 5000 <<—— typo here, so never mind
  989  ceph osd pool set nvme target_max_bytes 5000
  990  ceph osd pool set nvme target_max_objects 50
  991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
  992  ceph osd pool set nvme cache_target_full_ratio 0.8

I wish the cache tier would raise a health warning if it does not have
a max size set; as it stands, it lets you do that, flushes nothing, and fills the OSDs.
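For reference, here is where those thresholds kick in. The `target_max_bytes` value in the history above is unclear (likely truncated), so a 50 GB cache size is assumed purely for illustration:

```python
# Where the tiering agent starts working, assuming (for illustration only)
# a target_max_bytes of 50 GB; the real value in the history above is unclear.
target_max_bytes = 50 * 10**9
dirty_ratio = 0.5    # cache_target_dirty_ratio: start flushing dirty objects
full_ratio = 0.8     # cache_target_full_ratio: start evicting clean objects

flush_at = target_max_bytes * dirty_ratio   # ~25 GB of dirty data
evict_at = target_max_bytes * full_ratio    # ~40 GB resident in the cache

print(f"flushing starts near {flush_at/1e9:.0f} GB dirty, "
      f"eviction near {evict_at/1e9:.0f} GB used")
```

Without `target_max_bytes` set, both ratios are ratios of nothing, which is exactly the silent fill-the-OSDs behaviour complained about above.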

As for what the actual test is: this is 4K uncompressed DPX video frames,
so 50 Mbyte files written at a rate of at least 24 per second on a good day,
ideally more. This needs to sustain around 1.3 Gbytes/sec in either direction
from a single application, and it needs to do it consistently. There is a
certain amount of buffering to deal with fluctuations in perf. I am pushing
4096 of these files sequentially with a queue depth of 32, so there is rather
a lot of data in flight at any one time. I know I do not have enough hardware
to achieve this rate on writes.
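The workload numbers hang together; a quick sanity check (assuming ~51 MB per frame, which is about the size of a 10-bit 4K DPX — the message says "50 Mbyte"):

```python
# Sanity-check the stated workload: 4K uncompressed DPX frames.
# Assumption: ~51 MB per frame (a 10-bit 4K DPX is roughly that size).
frame_bytes = 51 * 10**6
fps = 24
frames = 4096

rate = frame_bytes * fps        # ~1.22 GB/s sustained, in line with the
                                # "around 1.3 Gbytes/sec" requirement
total = frame_bytes * frames    # ~209 GB for the whole test
duration = frames / fps         # ~170 seconds of content

print(f"{rate/1e9:.2f} GB/s, {total/1e9:.0f} GB, {duration:.1f} s")
```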

They are being written with direct I/O into a pool of 8 RBD LUNs. The 8-LUN
setup will not really help here with the small number of OSDs in the cache
pool; it does help when the RBD LUNs go directly to a large pool of
disk-based OSDs, as it gets all the OSDs moving in parallel.

My basic point here is that there is a lot more potential bandwidth to be had
in the backing pool, but I cannot get the cache tier to use more than a small
fraction of the available bandwidth when flushing content. Since the front end
of the cache can sustain around 900 Mbytes/sec over RBD, I am somewhat out of
balance here:

cache input rate 900 Mbytes/sec
backing pool input rate 600 Mbytes/sec

But not by a significant amount.

The question is really about whether there is anything I can do to get cache
flushing to take advantage of more of the bandwidth. If I do this without the
cache tier, then the latency of the disk-based OSDs is too variable and you
cannot sustain a consistent data rate. The NVMe devices are better about
consistent device latency, but the cache tier implementation seems to have a
problem driving the backing pool at anything close to its capabilities. It
really only needs to move 40 or 50 objects in parallel to achieve that.
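A sketch of knobs that may influence flush aggressiveness — these are my understanding of the tiering agent's options and are not verified against this cluster or release:

```shell
# cache_target_dirty_high_ratio (Infernalis and later) adds a second
# threshold above which the agent flushes at high speed rather than
# the throttled background rate:
ceph osd pool set nvme cache_target_dirty_high_ratio 0.6

# The tiering agent's flush/evict parallelism is bounded per OSD, and the
# defaults are small; raising them may let flushing use more of the
# backing pool's bandwidth (exact option names/defaults vary by release):
ceph tell osd.* injectargs '--osd_agent_max_ops 8 --osd_agent_max_high_ops 16'
```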

I am not attempting to provision a cache tier large enough for whole workload,
but as more of a debounce zone to avoid jitter making it back to the 
application.
I am trying to categorize what can and cannot be achieved with ceph here for
this type of workload, not build a complete production setup. My test represents
170 seconds of content and generates 209 Gbytes of data, so this is a small
scale test ;-) fortunately this stuff is not always used realtime.

All of those extra config options look to be around how fast promotion into the
cache can go, not ho

[ceph-users] mds segfault on cephfs snapshot creation

2016-04-19 Thread Brady Deetz
As soon as I create a snapshot on the root of my test cephfs deployment
with a single file within the root, my mds server kernel panics. I
understand that snapshots are not recommended. Is it beneficial to
developers for me to leave my cluster in its present state and provide
whatever debugging information they'd like? I'm not really looking for a
solution to a mission critical issue as much as providing an opportunity
for developers to pull stack traces, logs, etc from a system affected by
some sort of bug in cephfs/mds. This happens every time I create a
directory inside my .snap directory.

Let me know if I should blow my cluster away?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Christian Balzer

Hello,

On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:

> 
> OK, you asked ;-)
>

I certainly did. ^o^
 
> This is all via RBD, I am running a single filesystem on top of 8 RBD
> devices in an effort to get data striping across more OSDs, I had been
> using that setup before adding the cache tier.
>
Nods.
Depending on your use case (sequential writes) actual RADOS striping might
be more advantageous than this (with 4MB writes still going to the same
PG/OSD all the time).
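A small sketch of why striping helps here. This mirrors the usual RADOS file-to-object mapping (stripe_unit, stripe_count, object_size), written from memory as an illustration rather than any specific rbd/libradosstriper API:

```python
# Map a byte offset to an object number under RADOS-style striping.
def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    stripe_no = offset // stripe_unit
    stripe_pos = stripe_no % stripe_count          # which object in the set
    stripes_per_object = object_size // stripe_unit
    object_set = stripe_no // (stripe_count * stripes_per_object)
    return object_set * stripe_count + stripe_pos

MB = 1 << 20
# Default RBD layout (4 MB objects, no fancy striping): sequential 4 MB
# writes hit one object -- hence one PG/OSD -- at a time.
print([object_for_offset(i * MB, 4 * MB, 1, 4 * MB) for i in range(8)])
# -> [0, 0, 0, 0, 1, 1, 1, 1]

# With stripe_unit=1 MB, stripe_count=8: consecutive megabytes fan out
# across 8 objects, engaging 8 PGs/OSDs for the same sequential stream.
print([object_for_offset(i * MB, 1 * MB, 8, 4 * MB) for i in range(8)])
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```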

 
> 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> setup with replication size 3. No SSDs involved in those OSDs, since
> ceph-disk does not let you break a bluestore configuration into more
> than one device at the moment.
> 
That's a pity, but supposedly just a limitation of ceph-disk.
I'd venture you can work around that with symlinks to a raw SSD
partition, same as with current filestore journals.

As Sage recently wrote:
---
BlueStore can use as many as three devices: one for the WAL (journal, 
though it can be much smaller than FileStores, e.g., 128MB), one for 
metadata (e.g., an SSD partition), and one for data.
---

> The 600 Mbytes/sec is an approx sustained number for the data rate I can
> get going into this pool via RBD, that turns into 3 times that for raw
> data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> pushed it harder than that from time to time, but the OSD really wants
> to use fdatasync a lot and that tends to suck up a lot of the potential
> of a device, these disks will do 160 Mbytes/sec if you stream data to
> them.
> 
> I just checked with rados bench to this set of 33 OSDs with a 3 replica
> pool, and 600 Mbytes/sec is what it will do from the same client host.
> 
This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
journals on 4 nodes with a replica of 3.

So BlueStore is indeed faster than filestore.

> All the networking is 40 GB ethernet, single port per host, generally I
> can push 2.2 Gbytes/sec in one direction between two hosts over a single
> tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> node. Short of going to RDMA that appears to be about the limit for
> these processors.
> 
Yeah, I didn't expect your network to be the bottleneck here, but it's
a good data point to have nevertheless.

> There are a grand total of 2 400 GB P3700s which are running a pool with
> a replication factor of 1, these are in 2 other nodes. Once I add in
> replication perf goes downhill. If I had more hardware I would be
> running more of these and using replication, but I am out of network
> cards right now.
> 
Alright, so at 900MB/s you're pretty close to what one would expect from 2
of these: 1080MB/s*2/2(journal).

How much downhill is that?

I have a production cache tier with 2 nodes (replica 2 of course) and 4
800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
is pretty much what I would expect.

> So 5 nodes running OSDs, and a 6th node running the RBD client using the
> kernel implementation.
> 
I assume there's a reason for using the kernel RBD client (which kernel?),
given that it tends to be behind the curve in terms of features and speed?

> Complete set of commands for creating the cache tier, I pulled this from
> history, so the line in the middle was a failed command actually so
> sorry for the red herring.
> 
>   982  ceph osd pool create nvme 512 512 replicated_nvme 
>   983  ceph osd pool set nvme size 1
>   984  ceph osd tier add rbd nvme
>   985  ceph osd tier cache-mode  nvme writeback
>   986  ceph osd tier set-overlay rbd nvme 
>   987  ceph osd pool set nvme  hit_set_type bloom 
>   988  ceph osd pool set target_max_bytes 5000 <<—— typo here, so never mind
>   989  ceph osd pool set nvme target_max_bytes 5000
>   990  ceph osd pool set nvme target_max_objects 50
>   991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
>   992  ceph osd pool set nvme cache_target_full_ratio 0.8
> 
> I wish the cache tier would cause a health warning if it does not have
> a max size set, it lets you do that, flushes nothing and fills the OSDs.
> 
Oh yes, people have been bitten by this over and over again.
At least it's documented now.

> As for what the actual test is, this is 4K uncompressed DPX video frames,
> so 50 Mbyte files written at least 24 a second on a good day, ideally
> more. This needs to sustain around 1.3 Gbytes/sec in either direction
> from a single application and needs to do it consistently. There is a
> certain amount of buffering to deal with fluctuations in perf. I am
> pushing 4096 of these files sequentially with a queue depth of 32 so
> there is rather a lot of data in flight at any one time. I know I do not
> have enough hardware to achieve this rate on writes.
>
So this is your test AND actual intended use case I presume, right? 

> They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> LUN setup will not really help he

Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Josef Johansson
Hi,

response in line

On 20 Apr 2016 7:45 a.m., "Christian Balzer"  wrote:
>
>
> Hello,
>
> On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:
>
> >
> > OK, you asked ;-)
> >
>
> I certainly did. ^o^
>
> > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > devices in an effort to get data striping across more OSDs, I had been
> > using that setup before adding the cache tier.
> >
> Nods.
> Depending on your use case (sequential writes) actual RADOS striping might
> be more advantageous than this (with 4MB writes still going to the same
> PG/OSD all the time).
>
>
> > 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> > setup with replication size 3. No SSDs involved in those OSDs, since
> > ceph-disk does not let you break a bluestore configuration into more
> > than one device at the moment.
> >
> That's a pity, but supposedly just  a limitation of ceph-disk.
> I'd venture you can work around that with symlinks to a raw SSD
> partition, same as with current filestore journals.
>
> As Sage recently wrote:
> ---
> BlueStore can use as many as three devices: one for the WAL (journal,
> though it can be much smaller than FileStores, e.g., 128MB), one for
> metadata (e.g., an SSD partition), and one for data.
> ---

I believe he also mentioned the use of bcache and friends for the OSD;
maybe a way forward in this case?

Regards
Josef
>
> > The 600 Mbytes/sec is an approx sustained number for the data rate I can
> > get going into this pool via RBD, that turns into 3 times that for raw
> > data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> > pushed it harder than that from time to time, but the OSD really wants
> > to use fdatasync a lot and that tends to suck up a lot of the potential
> > of a device, these disks will do 160 Mbytes/sec if you stream data to
> > them.
> >
> > I just checked with rados bench to this set of 33 OSDs with a 3 replica
> > pool, and 600 Mbytes/sec is what it will do from the same client host.
> >
> This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> journals on 4 nodes with a replica of 3.
>
> So BlueStore is indeed faster than filestore.
>
> > All the networking is 40 GB ethernet, single port per host, generally I
> > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> > node. Short of going to RDMA that appears to be about the limit for
> > these processors.
> >
> Yeah, didn't expect your network to be involved here bottleneck wise, but
> a good data point to have nevertheless.
>
> > There are a grand total of 2 400 GB P3700s which are running a pool with
> > a replication factor of 1, these are in 2 other nodes. Once I add in
> > replication perf goes downhill. If I had more hardware I would be
> > running more of these and using replication, but I am out of network
> > cards right now.
> >
> Alright, so at 900MB/s you're pretty close to what one would expect from 2
> of these: 1080MB/s*2/2(journal).
>
> How much downhill is that?
>
> I have a production cache tier with 2 nodes (replica 2 of course) and 4
> 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
> is pretty much what I would expect.
>
> > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > kernel implementation.
> >
> I assume there's a reason for using the kernel RBD client (which kernel?),
> given that it tends to be behind the curve in terms of features and speed?
>
> > Complete set of commands for creating the cache tier, I pulled this from
> > history, so the line in the middle was a failed command actually so
> > sorry for the red herring.
> >
> >   982  ceph osd pool create nvme 512 512 replicated_nvme
> >   983  ceph osd pool set nvme size 1
> >   984  ceph osd tier add rbd nvme
> >   985  ceph osd tier cache-mode  nvme writeback
> >   986  ceph osd tier set-overlay rbd nvme
> >   987  ceph osd pool set nvme  hit_set_type bloom
> >   988  ceph osd pool set target_max_bytes 5000 <<—— typo here, so never mind
> >   989  ceph osd pool set nvme target_max_bytes 5000
> >   990  ceph osd pool set nvme target_max_objects 50
> >   991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
> >   992  ceph osd pool set nvme cache_target_full_ratio 0.8
> >
> > I wish the cache tier would cause a health warning if it does not have
> > a max size set, it lets you do that, flushes nothing and fills the OSDs.
> >
> Oh yes, people have been bitten by this over and over again.
> At least it's documented now.
>
> > As for what the actual test is, this is 4K uncompressed DPX video frames,
> > so 50 Mbyte files written at least 24 a second on a good day, ideally
> > more. This needs to sustain around 1.3 Gbytes/sec in either direction
> > from a single application and needs to do it consistently. There is a
> > certain amount of buffering to deal with fluctuations in perf. I a

Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-19 Thread Udo Lembke
Hi Mike,
I don't have experience with RBD mounts, but I see the same effect with RBD.

You can do some tuning to get better results (disable debug and so on).

As hint some values from a ceph.conf:
[osd]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 filestore_op_threads = 4
 osd max backfills = 1
 osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
 osd mkfs options xfs = "-f -i size=2048"
 osd recovery max active = 1
 osd_disk_thread_ioprio_class = idle
 osd_disk_thread_ioprio_priority = 7
 osd_disk_threads = 1
 osd_enable_op_tracker = false
 osd_op_num_shards = 10
 osd_op_num_threads_per_shard = 1
 osd_op_threads = 4

Udo

On 19.04.2016 11:21, Mike Miller wrote:
> Hi,
>
> RBD mount
> ceph v0.94.5
> 6 OSD with 9 HDD each
> 10 GBit/s public and private networks
> 3 MON nodes 1Gbit/s network
>
> A rbd mounted with btrfs filesystem format performs really badly when
> reading. Tried readahead in all combinations but that does not help in
> any way.
>
> Write rates are very good in excess of 600 MB/s up to 1200 MB/s,
> average 800 MB/s
> Read rates on the same mounted rbd are about 10-30 MB/s !?
>
> Of course, both writes and reads are from a single client machine with
> a single write/read command. So I am looking at single threaded
> performance.
> Actually, I was hoping to see at least 200-300 MB/s when reading, but
> I am seeing 10% of that at best.
>
> Thanks for your help.
>
> Mike
