Re: [ceph-users] Some monitors have still not reached quorum
Some tips:
1. If you enabled auth_cluster_required, you should check the keyring.
2. Can you reach the monitors from your admin node over SSH without a password?

2016-04-16 18:16 GMT+08:00 AJ NOURI :
> Followed the preflight and quick start
> http://docs.ceph.com/docs/master/start/quick-ceph-deploy/
>
> Stuck here
>
> ajn@admin-node:~/my-cluster$ ceph-deploy mon create-initial
>
> [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ajn/.cephdeploy.conf
> [ceph_deploy.cli][INFO ] Invoked (1.5.31): /usr/bin/ceph-deploy mon create-initial
> [ceph_deploy.cli][INFO ] ceph-deploy options:
> [ceph_deploy.cli][INFO ] username : None
> [ceph_deploy.cli][INFO ] verbose : False
> [ceph_deploy.cli][INFO ] overwrite_conf: False
> [ceph_deploy.cli][INFO ] subcommand: create-initial
> [ceph_deploy.cli][INFO ] quiet : False
> [ceph_deploy.cli][INFO ] cd_conf :
> [ceph_deploy.cli][INFO ] cluster : ceph
> [ceph_deploy.cli][INFO ] func : at 0x7fd1c323a668>
> [ceph_deploy.cli][INFO ] ceph_conf : None
> [ceph_deploy.cli][INFO ] default_release : False
> [ceph_deploy.cli][INFO ] keyrings : None
> [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts monitor
> [ceph_deploy.mon][DEBUG ] detecting platform for host monitor ...
> [monitor][DEBUG ] connection detected need for sudo
> [monitor][DEBUG ] connected to host: monitor
> [monitor][DEBUG ] detect platform information from remote host
> [monitor][DEBUG ] detect machine type
> [monitor][DEBUG ] find the location of an executable
> [ceph_deploy.mon][INFO ] distro info: Ubuntu 14.04 trusty
> [monitor][DEBUG ] determining if provided host has same hostname in remote
> [monitor][DEBUG ] get remote short hostname
> [monitor][DEBUG ] deploying mon to monitor
> [monitor][DEBUG ] get remote short hostname
> [monitor][DEBUG ] remote hostname: monitor
> [monitor][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [monitor][DEBUG ] create the mon path if it does not exist
> [monitor][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-monitor/done
> [monitor][DEBUG ] create a done file to avoid re-doing the mon deployment
> [monitor][DEBUG ] create the init path if it does not exist
> [monitor][INFO ] Running command: sudo initctl emit ceph-mon cluster=ceph id=monitor
> [monitor][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
> [monitor][WARNIN] monitor: mon.monitor, might not be running yet
> [monitor][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
> [monitor][WARNIN] monitor monitor does not exist in monmap
> [monitor][WARNIN] neither `public_addr` nor `public_network` keys are defined for monitors
> [monitor][WARNIN] monitors may not be able to form quorum
> [ceph_deploy.mon][INFO ] processing monitor mon.monitor
> [monitor][DEBUG ] connection detected need for sudo
> [monitor][DEBUG ] connected to host: monitor
> [monitor][DEBUG ] detect platform information from remote host
> [monitor][DEBUG ] detect machine type
> [monitor][DEBUG ] find the location of an executable
> [monitor][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
> [ceph_deploy.mon][WARNIN] mon.monitor monitor is not yet in quorum, tries left: 5
> [ceph_deploy.mon][WARNIN] waiting 5 seconds before retrying
> [monitor][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
> [ceph_deploy.mon][WARNIN] mon.monitor monitor is not yet in quorum, tries left: 4
> [ceph_deploy.mon][WARNIN] waiting 10 seconds before retrying
> [monitor][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
> [ceph_deploy.mon][WARNIN] mon.monitor monitor is not yet in quorum, tries left: 3
> [ceph_deploy.mon][WARNIN] waiting 10 seconds before retrying
> [monitor][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.monitor.asok mon_status
> [monitor][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] N
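The warning in the middle of that output points at the likely cause: ceph-deploy could not find a public_addr or public_network for the monitor, so it never registered itself in the monmap. As a hedged sketch (the subnet is a placeholder for the cluster's own network), defining public_network in the admin node's ceph.conf and re-pushing it usually lets the monitor reach quorum:

    # in ~/my-cluster/ceph.conf on the admin node (example subnet only)
    [global]
    public_network = 192.168.10.0/24

    # push the updated conf and retry the monitor deployment
    ceph-deploy --overwrite-conf config push monitor
    ceph-deploy mon create-initial

If auth_cluster_required is enabled, also verify that the keyrings ceph-deploy gathered match what the monitor expects, per the tips above.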
[ceph-users] Slow read on RBD mount, Hammer 0.94.5
Hi,

RBD mount, ceph v0.94.5
6 OSD nodes with 9 HDDs each
10 GBit/s public and private networks
3 MON nodes on a 1 GBit/s network

An RBD mapped and mounted with a btrfs filesystem performs really badly when reading. I tried readahead in all combinations, but that does not help in any way.

Write rates are very good, in excess of 600 MB/s up to 1200 MB/s, average 800 MB/s. Read rates on the same mounted RBD are about 10-30 MB/s!?

Of course, both writes and reads are from a single client machine with a single write/read command, so I am looking at single-threaded performance. Actually, I was hoping to see at least 200-300 MB/s when reading, but I am seeing 10% of that at best.

Thanks for your help.

Mike
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
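To narrow down whether the slow reads come from RADOS itself or from the btrfs-on-RBD layer, a quick single-threaded comparison with rados bench against the backing pool can help; a rough sketch (pool name and runtime are examples only):

    rados bench -p rbd 60 write --no-cleanup   # leave the benchmark objects in place
    rados bench -p rbd 60 seq                  # sequential read of those objects
    rados -p rbd cleanup

If the seq numbers are also poor, the problem is below the filesystem; if they look fine, the readahead/filesystem side of the mount is the place to keep digging.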
Re: [ceph-users] Ceph Day Sunnyvale Presentations
I have read SK's performance tuning work too; it's a good job, especially the analysis of write/read latency on the OSD. I want to ask a question about the optimization for 'Long logging time': what is meant by 'split logging into another thread and do it later'? AFAIK, Ceph already does logging asynchronously via a logging thread. Can you share more info?

2016-04-13 18:11 GMT+08:00 Alexandre DERUMIER : > >>Based on discussion with them at Ceph day in Tokyo JP, they have their > own frozen the Ceph repository. > >>And they've been optimizing codes by their own team to meet their > requirements. > >>AFAICT they had not done any do PR. > > Thanks for the info > > @cc bspark8.sk.com : maybe can you give us more informations ? > > > - Mail original - > De: "Shinobu Kinjo" > À: "aderumier" > Cc: "Patrick McGarry" , "ceph-devel" < > ceph-de...@vger.kernel.org>, "ceph-users" > Envoyé: Mercredi 13 Avril 2016 05:56:01 > Objet: Re: [ceph-users] Ceph Day Sunnyvale Presentations > > Alexandre, > > Based on discussion with them at Ceph day in Tokyo JP, they have their own > frozen the Ceph repository. > And they've been optimizing codes by their own team to meet their > requirements. > AFAICT they had not done any do PR. > > Cheers, > Shinobu > > - Original Message - > From: "Alexandre DERUMIER" > To: "Patrick McGarry" > Cc: "ceph-devel" , "ceph-users" < > ceph-us...@ceph.com> > Sent: Wednesday, April 13, 2016 12:45:31 PM > Subject: Re: [ceph-users] Ceph Day Sunnyvale Presentations > > Hi, > > I was reading this presentation from SK telecom about flash optimisations > > AFCeph: Ceph Performance Analysis & Improvement on Flash [Slides] > > http://fr.slideshare.net/Inktank_Ceph/af-ceph-ceph-performance-analysis-and-improvement-on-flash > Byung-Su Park, SK Telecom > > > They seem to have made optimisations in ceph code. Is there any patches > reference ? (applied to infernalis/jewel ?) > > > They seem also to have done ceph config tuning and system tunning, but no > config details is provided :( > It could be great to share with the community :) > > Regards, > > Alexandre > > - Mail original - > De: "Patrick McGarry" > À: "ceph-devel" , "ceph-users" < > ceph-us...@ceph.com> > Envoyé: Mercredi 6 Avril 2016 18:18:28 > Objet: [ceph-users] Ceph Day Sunnyvale Presentations > > Hey cephers, > > I have all but one of the presentations from Ceph Day Sunnyvale, so > rather than wait for a full hand I went ahead and posted the link to > the slides on the event page: > > http://ceph.com/cephdays/ceph-day-sunnyvale/ > > The videos probably wont be processed until after next week, but I’ll > add those once we get them. Thanks to all of the presenters and > attendees that made this another great event. > > > -- > > Best Regards, > > Patrick McGarry > Director Ceph Community || Red Hat > http://ceph.com || http://community.redhat.com > @scuttlemonkey || @ceph > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] add mon and move mon
Dear friends:

Hello, I have a small problem when I use Ceph. My cluster has three monitors and I want to remove one.

[root@node01 ~]# ceph -s
    cluster b0d8bd0d-6269-4ce7-a10b-9adc7ee2c4c8
     health HEALTH_WARN
            too many PGs per OSD (682 > max 300)
     monmap e23: 3 mons at {node01=172.168.2.185:6789/0,node02=172.168.2.186:6789/0,node03=172.168.2.187:6789/0}
            election epoch 472, quorum 0,1,2 node01,node02,node03
     osdmap e7084: 18 osds: 18 up, 18 in
      pgmap v1051011: 4448 pgs, 15 pools, 7915 MB data, 12834 objects
            27537 MB used, 23298 GB / 23325 GB avail
                4448 active+clean

So I did this:
#ceph-deploy mon destroy node03

Then I added it to the cluster again:
#ceph-deploy mon add node03

node03 is added to the cluster, but after a while the monitor goes down. When I look at /var/log/messages I find:

Apr 19 11:12:01 node01 systemd: Starting Session 14091 of user root.
Apr 19 11:12:01 node01 systemd: Started Session 14091 of user root.
Apr 19 11:12:39 node01 bash: 2016-04-19 11:12:39.533817 7f6e51ec2700 -1 mon.node01@0(leader) e23 *** Got Signal Terminated ***

When I start the monitor up again, it goes down again after a while. But I have enough disk space:

[root@node03 ~]# df -TH
Filesystem            Type      Size  Used  Avail Use% Mounted on
/dev/mapper/rhel-root xfs        11G  4.7G  6.1G  44%  /
devtmpfs              devtmpfs   26G     0   26G   0%  /dev
tmpfs                 tmpfs      26G   82k   26G   1%  /dev/shm
tmpfs                 tmpfs      26G  147M   26G   1%  /run
tmpfs                 tmpfs      26G     0   26G   0%  /sys/fs/cgroup
/dev/mapper/rhel-usr  xfs        11G  4.1G  6.7G  38%  /usr
/dev/mapper/rhel-tmp  xfs        11G   34M   11G   1%  /tmp
/dev/mapper/rhel-home xfs        11G   34M   11G   1%  /home
/dev/mapper/rhel-var  xfs        11G  1.6G  9.2G  15%  /var
/dev/sde1             xfs       2.0T  152M  2.0T   1%  /var/lib/ceph/osd/ceph-15
/dev/sdg1             xfs       2.0T  3.8G  2.0T   1%  /var/lib/ceph/osd/ceph-17
/dev/sdd1             xfs       2.0T  165M  2.0T   1%  /var/lib/ceph/osd/ceph-14
/dev/sda1             xfs       521M  131M  391M  26%  /boot
/dev/sdb1             xfs       219G  989M  218G   1%  /var/lib/ceph/osd/ceph-4
/dev/sdf1             xfs       2.0T  4.6G  2.0T   1%  /var/lib/ceph/osd/ceph-16
/dev/sdc1             xfs       219G  129M  219G   1%  /var/lib/ceph/osd/ceph-5
You have new mail in /var/spool/mail/root
[root@node03 ~]#

What is the problem? Is my procedure wrong? Looking forward to your reply.

--Dingxf48

Sent from Mail for Windows 10
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
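When a re-added monitor keeps getting terminated, it helps to confirm it actually rejoined the monmap and to watch its own log on node03 rather than the leader's messages file; a few hedged checks (hostnames as in the post, exact unit name depends on the init system):

    ceph mon stat                                   # is node03 listed and in quorum?
    ceph daemon mon.node03 mon_status               # run on node03, via the admin socket
    tail -f /var/log/ceph/ceph-mon.node03.log       # or: journalctl -u ceph-mon@node03 -f
    ceph-deploy --overwrite-conf mon add node03     # re-add with the current ceph.conf pushed

The "Got Signal Terminated" line only shows that something sent the daemon SIGTERM; the monitor's own log should say why it was stopped.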
Re: [ceph-users] Powercpu and ceph
On Tue, Apr 19, 2016 at 5:28 AM, min fang wrote: > I am confused on ceph/ceph-qa-suite and ceph/teuthology. Which one should I > use? thanks. ceph-qa-suite repository contains the test snippets, teuthology is the test framework that knows how to run them. It will pull the appropriate branch of ceph-qa-suite automatically or, in some cases, you can point it at your own checkout. Setting it up is not an easy task though, so I'd start with building and running "make check". Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
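For reference, a rough sketch of getting to "make check" on a development checkout from around that time; the exact steps are branch- and distro-dependent, so treat this as an assumption rather than a recipe:

    git clone https://github.com/ceph/ceph.git
    cd ceph
    ./install-deps.sh              # install build dependencies for the local distro
    ./autogen.sh && ./configure
    make -j$(nproc)
    make check                     # unit and standalone tests that don't need teuthology

Anything that needs a real multi-node cluster topology is what ceph-qa-suite plus teuthology cover.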
Re: [ceph-users] krbd map on Jewel, sysfs write failed when rbd map
On Mon, Apr 18, 2016 at 11:58 AM, Tim Bishop wrote:
> I had the same issue when testing on Ubuntu xenial beta. That has 4.4,
> so should be fine? I had to create images without the new RBD features
> to make it work.

None of the "new" features are currently supported by krbd. 4.7 will support exclusive-lock, with most of the rest following in 4.8.

You don't have to recreate images: while those features are enabled in jewel by default, you should be able to dynamically disable them with "rbd feature disable imagename deep-flatten fast-diff object-map exclusive-lock".

Thanks,

Ilya
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
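For completeness, a sketch of the two usual approaches (pool and image names are placeholders): strip the unsupported features from an existing image, or set a conservative default so new images are created krbd-compatible.

    # per image: check what's enabled, then disable what krbd can't handle yet
    rbd info rbd/myimage
    rbd feature disable rbd/myimage deep-flatten fast-diff object-map exclusive-lock

    # or in ceph.conf on the clients that create images: layering only
    [client]
    rbd default features = 1

After that, "rbd map rbd/myimage" should succeed on the older kernels.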
[ceph-users] Build Raw Volume from Recovered RBD Objects
All, I was called in to assist in a failed Ceph environment with the cluster in an inoperable state. No rbd volumes are mountable/exportable due to missing PGs. The previous operator was using a replica count of 2. The cluster suffered a power outage and various non-catastrophic hardware issues as they were starting it back up. At some point during recovery, drives were removed from the cluster leaving several PGs missing. Efforts to restore the missing PGs from the data on the removed drives failed using the process detailed in a Red Hat Customer Support blog post [0]. Upon starting the OSDs with recovered PGs, a segfault halts progress. The original operator isn't clear on when, but there may have been a software upgrade applied after the drives were pulled. I believe the cluster may be irrecoverable at this point. My recovery assistance has focused on a plan to: 1) Scrape all objects for several key rbd volumes from live OSDs and the removed former OSD drives. 2) Compare and deduplicate the two copies of each object. 3) Recombine the objects for each volume into a raw image. I have completed steps 1 and 2 with apparent success. My initial stab at step 3 yielded a raw image that could be mounted and had signs of a filesystem, but it could not be read. Could anyone assist me with the following questions? 1) Are the rbd objects in order by filename? If not, what is the method to determine their order? 2) How should objects smaller than the default 4MB chunk size be handled? Should they be padded somehow? 3) If any objects were completely missing and therefore unavailable to this process, how should they be handled? I assume we need to offset/pad to compensate. -- Thanks, Mike Dawson Co-Founder & Director of Cloud Architecture Cloudapt LLC 6330 East 75th Street, Suite 170 Indianapolis, IN 46250 M: 317-490-3018 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
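On question 1: for format-2 images the data objects are named rbd_data.<image_id>.<object_number>, where the object number is a zero-padded hexadecimal suffix that maps linearly to the byte offset (object_number × object_size, 4 MB by default; format-1 images use an rb.0.* prefix instead). Short objects are simply stripes whose tail was never written, and completely missing objects can stay as zero-filled holes. A hedged sketch along those lines, assuming default striping and a made-up prefix:

    #!/bin/bash
    # Reassemble a raw image from recovered rbd_data.* objects (defaults: 4 MB objects,
    # no fancy striping). Missing objects simply remain sparse (read back as zeros).
    prefix="rbd_data.1234567890ab"      # block_name_prefix from "rbd info" (hypothetical)
    objsize=$((4 * 1024 * 1024))
    out="recovered.raw"

    for f in "${prefix}".*; do
        idx=$(( 16#${f##*.} ))          # hex object number -> decimal index
        # write the object at its offset; notrunc keeps earlier writes, sparse keeps holes
        dd if="$f" of="$out" bs=$objsize seek=$idx conv=notrunc,sparse status=none
    done

    # finally grow the file to the full image size so trailing holes are included, e.g.
    # truncate -s <image size in bytes> recovered.raw

Whether the resulting image is mountable then depends mostly on which of the missing objects held filesystem metadata.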
[ceph-users] cephfs does not seem to properly free up space
Hello, At my workplace we have a production cephfs cluster (334 TB on 60 OSDs) which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on Ubuntu 14.04.3 (linux 3.19.0-33). It seems that cephfs still doesn't free up space at all or at least that's what df command tells us. Is there a better way of getting a df-like output with other command for cephfs ? Thank you, Marius Rad SysAdmin www.propertyshark.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
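On the df-like output question: besides df against the mount, per-pool usage can be read from the cluster directly, and CephFS exposes recursive directory statistics as virtual xattrs; a couple of hedged examples (mount point is a placeholder):

    ceph df detail                              # raw vs. per-pool usage
    rados df                                    # objects and space per pool
    getfattr -n ceph.dir.rbytes /mnt/cephfs     # recursive bytes under a directory
    getfattr -n ceph.dir.rfiles /mnt/cephfs     # recursive file count

If ceph df keeps reporting the space as used after deletes, the data genuinely has not been purged yet, which matches the symptom described.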
[ceph-users] ceph-mon.target not enabled
Hi all, I just installed 3 monitors, using ceph-deploy, on CentOS 7.2. Ceph is 10.1.2. My ceph-mon processes do not come up after reboot. This is what ceph-deploy create-initial did: [ams1-ceph01-mon01][INFO ] Running command: sudo systemctl enable ceph.target [ams1-ceph01-mon01][WARNIN] Created symlink from /etc/systemd/system/multi-user.target.wants/ceph.target to /usr/lib/systemd/system/ceph.target. [ams1-ceph01-mon01][INFO ] Running command: sudo systemctl enable ceph-mon@ams1-ceph01-mon01 [ams1-ceph01-mon01][WARNIN] Created symlink from /etc/systemd/system/ceph-mon.target.wants/ceph-mon@ams1-ceph01-mon01.service to /usr/lib/systemd/system/ceph-mon@.service. [ams1-ceph01-mon01][INFO ] Running command: sudo systemctl start ceph-mon@ams1-ceph01-mon01 However, it did not enable ceph-mon.target: $ sudo systemctl is-enabled ceph-mon.target disabled Am I supposed to enable ceph-mon.target by hand? I did search the documentation but haven't been able to find anything that says so. Kind regards, Ruben Kerkhof ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
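If the target is simply left disabled by the tooling, enabling it by hand is the straightforward workaround so the monitor comes back after a reboot; a minimal sketch (hostname as in the post):

    sudo systemctl enable ceph-mon.target
    sudo systemctl enable ceph.target
    sudo systemctl is-enabled ceph-mon.target ceph-mon@ams1-ceph01-mon01

The same applies to ceph-osd.target / ceph-mds.target on the other roles if they show the same behaviour.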
Re: [ceph-users] cephfs does not seem to properly free up space
On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad wrote: > Hello, > > > At my workplace we have a production cephfs cluster (334 TB on 60 OSDs) > which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on > Ubuntu 14.04.3 (linux 3.19.0-33). > > It seems that cephfs still doesn't free up space at all or at least that's > what df command tells us. Hmm, historically there were bugs with the purging code, but I thought we fixed them before Infernalis. Does the space get freed after you unmount the client? Some issues have involved clients holding onto references to unlinked inodes. John > > Is there a better way of getting a df-like output with other command for > cephfs ? > > > Thank you, > > Marius Rad > > SysAdmin > > www.propertyshark.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs does not seem to properly free up space
Mounting and unmounting doesn't change anything.
The used space reported by df is nearly the same as the values returned by ceph -s.

Example 1, df output:
ceph-fuse 334T 134T 200T 41% /cephfs

Example 2, ceph -s output:
     health HEALTH_WARN
            mds0: Many clients (22) failing to respond to cache pressure
            noscrub,nodeep-scrub,sortbitwise flag(s) set
     monmap e1: 5 mons at {r730-12=10.103.213.12:6789/0,r730-4=10.103.213.4:6789/0,r730-5=10.103.213.5:6789/0,r730-8=10.103.213.8:6789/0,r730-9=10.103.213.9:6789/0}
            election epoch 132, quorum 0,1,2,3,4 r730-4,r730-5,r730-8,r730-9,r730-12
     mdsmap e14637: 1/1/1 up {0=ceph2-mds-2=up:active}
     osdmap e6549: 68 osds: 68 up, 68 in
            flags noscrub,nodeep-scrub,sortbitwise
      pgmap v4394151: 896 pgs, 3 pools, 54569 GB data, 56582 kobjects
            133 TB used, 199 TB / 333 TB avail
                 896 active+clean
  client io 47395 B/s rd, 1979 kB/s wr, 388 op/s

From: John Spray
Sent: Tuesday, April 19, 2016 22:04
To: Simion Rad
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs does not seem to properly free up space

On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad wrote: > Hello, > > > At my workplace we have a production cephfs cluster (334 TB on 60 OSDs) > which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on > Ubuntu 14.04.3 (linux 3.19.0-33). > > It seems that cephfs still doesn't free up space at all or at least that's > what df command tells us. Hmm, historically there were bugs with the purging code, but I thought we fixed them before Infernalis. Does the space get freed after you unmount the client? Some issues have involved clients holding onto references to unlinked inodes. John > > Is there a better way of getting a df-like output with other command for > cephfs ? > > > Thank you, > > Marius Rad > > SysAdmin > > www.propertyshark.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph cache tier clean rate too low
I have a setup using some Intel P3700 devices as a cache tier, and 33 SATA drives hosting the pool behind them. I set up the cache tier with writeback and gave it a size and max object count etc.:

  ceph osd pool set target_max_bytes 5000
  ceph osd pool set nvme target_max_bytes 5000
  ceph osd pool set nvme target_max_objects 50
  ceph osd pool set nvme cache_target_dirty_ratio 0.5
  ceph osd pool set nvme cache_target_full_ratio 0.8

This is all running Jewel using bluestore OSDs (I know, experimental). The cache tier will write at about 900 Mbytes/sec and read at 2.2 Gbytes/sec; the SATA pool can take writes at about 600 Mbytes/sec in aggregate.

However, it looks like the mechanism for cleaning the cache down to the disk layer is being massively rate limited, and I see about 47 Mbytes/sec of read activity from each SSD while this is going on. This means that while I could be pushing data into the cache at high speed, it cannot evict old content very fast at all, and it is very easy to hit the high water mark, at which point the application I/O drops dramatically as it becomes throttled by how fast the cache can flush.

I suspect it is operating on a placement group at a time, so it ends up targeting a very limited number of objects and hence disks at any one time. I can see individual disk drives going busy for very short periods, but most of them are idle at any one point in time. The only way to drive the disk-based OSDs fast is to hit a lot of them at once, which would mean issuing many cache flush operations in parallel.

Are there any controls which can influence this behavior?

Thanks

Steve

--
The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs does not seem to properly free up space
have you ever used fancy layout? see http://tracker.ceph.com/issues/15050 On Wed, Apr 20, 2016 at 3:17 AM, Simion Rad wrote: > Mounting and unmount doesn't change anyting. > The used space reported by df command is nearly the same as the values > returned by ceph -s command. > > Example 1, df output: > ceph-fuse 334T 134T 200T 41% /cephfs > > Example 2, ceph -s output: > health HEALTH_WARN > mds0: Many clients (22) failing to respond to cache pressure > noscrub,nodeep-scrub,sortbitwise flag(s) set > monmap e1: 5 mons at > {r730-12=10.103.213.12:6789/0,r730-4=10.103.213.4:6789/0,r730-5= > 10.103.213.5:6789/0,r730-8=10.103.213.8:6789/0,r730-9=10.103.213.9:6789/0} > election epoch 132, quorum 0,1,2,3,4 > r730-4,r730-5,r730-8,r730-9,r730-12 > mdsmap e14637: 1/1/1 up {0=ceph2-mds-2=up:active} > osdmap e6549: 68 osds: 68 up, 68 in > flags noscrub,nodeep-scrub,sortbitwise > pgmap v4394151: 896 pgs, 3 pools, 54569 GB data, 56582 kobjects > 133 TB used, 199 TB / 333 TB avail > 896 active+clean > client io 47395 B/s rd, 1979 kB/s wr, 388 op/s > > > > From: John Spray > Sent: Tuesday, April 19, 2016 22:04 > To: Simion Rad > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] cephfs does not seem to properly free up space > > On Tue, Apr 19, 2016 at 2:40 PM, Simion Rad wrote: >> Hello, >> >> >> At my workplace we have a production cephfs cluster (334 TB on 60 OSDs) >> which was recently upgraded from Infernalis 9.2.0 to Infernalis 9.2.1 on >> Ubuntu 14.04.3 (linux 3.19.0-33). >> >> It seems that cephfs still doesn't free up space at all or at least that's >> what df command tells us. > > Hmm, historically there were bugs with the purging code, but I thought > we fixed them before Infernalis. > > Does the space get freed after you unmount the client? Some issues > have involved clients holding onto references to unlinked inodes. > > John > >> >> Is there a better way of getting a df-like output with other command for >> cephfs ? >> >> >> Thank you, >> >> Marius Rad >> >> SysAdmin >> >> www.propertyshark.com >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cache tier clean rate too low
Hello, On Tue, 19 Apr 2016 20:21:39 + Stephen Lord wrote: > > > I Have a setup using some Intel P3700 devices as a cache tier, and 33 > sata drives hosting the pool behind them. A bit more details about the setup would be nice, as in how many nodes, interconnect, replication size of the cache tier and the backing HDD pool, etc. And "some" isn't a number, how many P3700s (which size?) in how many nodes? One assumes there are no further SSDs involved with those SATA HDDs? >I setup the cache tier with > writeback, gave it a size and max object count etc: > > ceph osd pool set target_max_bytes 5000 ^^^ This should have given you an error, it needs the pool name, as in your next line. > ceph osd pool set nvme target_max_bytes 5000 > ceph osd pool set nvme target_max_objects 50 > ceph osd pool set nvme cache_target_dirty_ratio 0.5 > ceph osd pool set nvme cache_target_full_ratio 0.8 > > This is all running Jewel using bluestore OSDs (I know experimental). Make sure to report all pyrotechnics, trap doors and sharp edges. ^_- > The cache tier will write at about 900 Mbytes/sec and read at 2.2 > Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in > aggregate. ^ Key word there. That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty disappointing result for the supposedly twice as fast BlueStore. Again, replication size and topology might explain that up to a point, but we don't know them (yet). Also exact methodology of your tests please, i.e. the fio command line, how was the RBD device (if you tested with one) mounted and where, etc... > However, it looks like the mechanism for cleaning the cache > down to the disk layer is being massively rate limited and I see about > 47 Mbytes/sec of read activity from each SSD while this is going on. > This number is meaningless w/o knowing home many NVMe's you have. That being said, there are 2 levels of flushing past Hammer, but if you push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you will get full speed. > This means that while I could be pushing data into the cache at high > speed, It cannot evict old content very fast at all, and it is very easy > to hit the high water mark and the application I/O drops dramatically as > it becomes throttled by how fast the cache can flush. > > I suspect it is operating on a placement group at a time so ends up > targeting a very limited number of objects and hence disks at any one > time. I can see individual disk drives going busy for very short > periods, but most of them are idle at any one point in time. The only > way to drive the disk based OSDs fast is to hit a lot of them at once > which would mean issuing many cache flush operations in parallel. > Yes, it is all PG based, so your observations match the expectations and what everybody else is seeing. See also the thread "Cache tier operation clarifications" by me, version 2 is in the works. There are also some new knobs in Jewel that may be helpful, see: http://www.spinics.net/lists/ceph-users/msg25679.html If you have a use case with a clearly defined idle/low use time and a small enough growth in dirty objects, consider what I'm doing, dropping the cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a whole day) via cron job,wait a bit and then up again to it's normal value. That way flushes won't normally happen at all during your peak usage times, though in my case that's purely cosmetic, flushes are not problematic at any time in that cluster currently. 
> Are there any controls which can influence this behavior? > See above (cache_target_dirty_high_ratio). Aside from that you might want to reflect on what your use case, workload is going to be and how your testing reflects on it. As in, are you really going to write MASSIVE amounts of data at very high speeds or is it (like in 90% of common cases) the amount of small write IOPS that is really going to be the limiting factor. Which is something that cache tiers can deal with very well (or sufficiently large and well designed "plain" clusters). Another thing to think about is using the "readforward" cache mode, leaving your cache tier free to just handle writes and thus giving it more space to work with. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
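A minimal sketch of the cron-driven flush window described above (pool name, percentages and times are assumptions, not recommendations for any specific cluster):

    # /etc/cron.d/ceph-cache-flush (hypothetical)
    # 02:00 - lower the dirty target a few percent so flushing happens off-peak
    0 2 * * * root ceph osd pool set nvme cache_target_dirty_ratio 0.45
    # 04:00 - restore the normal target
    0 4 * * * root ceph osd pool set nvme cache_target_dirty_ratio 0.50

    # and the knob mentioned above: full-speed flushing kicks in at the high ratio, e.g.
    # ceph osd pool set nvme cache_target_dirty_high_ratio 0.6

Keeping the high ratio a safe distance above the normal dirty ratio leaves headroom before full-speed flushing starts competing with client I/O.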
[ceph-users] join the users
join the users

Sent from Mail for Windows 10
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph cache tier clean rate too low
OK, you asked ;-)

This is all via RBD. I am running a single filesystem on top of 8 RBD devices in an effort to get data striping across more OSDs; I had been using that setup before adding the cache tier.

3 nodes with 11 6-Tbyte SATA drives each for a base RBD pool, set up with replication size 3. No SSDs are involved in those OSDs, since ceph-disk does not let you break a bluestore configuration into more than one device at the moment.

The 600 Mbytes/sec is an approximate sustained number for the data rate I can get going into this pool via RBD; that turns into 3 times that for raw data rate, so at 33 drives that is mid-50s Mbytes/sec per drive. I have pushed it harder than that from time to time, but the OSD really wants to use fdatasync a lot and that tends to suck up a lot of the potential of a device; these disks will do 160 Mbytes/sec if you stream data to them.

I just checked with rados bench to this set of 33 OSDs with a 3-replica pool, and 600 Mbytes/sec is what it will do from the same client host.

All the networking is 40 GB ethernet, single port per host. Generally I can push 2.2 Gbytes/sec in one direction between two hosts over a single tcp link; the max I have seen is about 2.7 Gbytes/sec coming into a node. Short of going to RDMA, that appears to be about the limit for these processors.

There are a grand total of two 400 GB P3700s, which are running a pool with a replication factor of 1; these are in 2 other nodes. Once I add in replication, perf goes downhill. If I had more hardware I would be running more of these and using replication, but I am out of network cards right now.

So 5 nodes running OSDs, and a 6th node running the RBD client using the kernel implementation.

Complete set of commands for creating the cache tier; I pulled this from history, so the line in the middle was actually a failed command, sorry for the red herring.

 982  ceph osd pool create nvme 512 512 replicated_nvme
 983  ceph osd pool set nvme size 1
 984  ceph osd tier add rbd nvme
 985  ceph osd tier cache-mode nvme writeback
 986  ceph osd tier set-overlay rbd nvme
 987  ceph osd pool set nvme hit_set_type bloom
 988  ceph osd pool set target_max_bytes 5000   <<—— typo here, so never mind
 989  ceph osd pool set nvme target_max_bytes 5000
 990  ceph osd pool set nvme target_max_objects 50
 991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
 992  ceph osd pool set nvme cache_target_full_ratio 0.8

I wish the cache tier would cause a health warning if it does not have a max size set; it lets you do that, flushes nothing and fills the OSDs.

As for what the actual test is: this is 4K uncompressed DPX video frames, so 50 Mbyte files written at least 24 per second on a good day, ideally more. This needs to sustain around 1.3 Gbytes/sec in either direction from a single application and needs to do it consistently. There is a certain amount of buffering to deal with fluctuations in perf. I am pushing 4096 of these files sequentially with a queue depth of 32, so there is rather a lot of data in flight at any one time. I know I do not have enough hardware to achieve this rate on writes. They are being written with direct I/O into a pool of 8 RBD LUNs. The 8 LUN setup will not really help here with the small number of OSDs in the cache pool; it does help when the RBD LUNs are going directly to a large pool of disk-based OSDs, as it gets all the OSDs moving in parallel.
My basic point here is that there is a lot more potential bandwidth to be had in the backing pool, but I cannot get the cache tier to use more than a small fraction of the available bandwidth when flushing content. Since the front end of the cache can sustain around 900 Mbytes/sec over RBD, I am somewhat out of balance here: cache input rate 900 Mbytes/sec backing pool input rate 600 Mbytes/sec But not by a significant amount. The question is really about is there anything I can do to get cache flushing to take advantage of more of the bandwidth. If I do this without the cache tier then the latency of the disk based OSDs is too variable and you cannot sustain a consistent data rate. The NVMe devices are better about consistent device latency, but the cache tier implementation seems to have a problem driving the backing pool at anything close to its capabilities. It really only needs to move 40 or 50 objects in parallel to achieve that. I am not attempting to provision a cache tier large enough for whole workload, but as more of a debounce zone to avoid jitter making it back to the application. I am trying to categorize what can and cannot be achieved with ceph here for this type of workload, not build a complete production setup. My test represents 170 seconds of content and generates 209 Gbytes of data, so this is a small scale test ;-) fortunately this stuff is not always used realtime. All of those extra config options look to be around how fast promotion into the cache can go, not ho
[ceph-users] mds segfault on cephfs snapshot creation
As soon as I create a snapshot on the root of my test cephfs deployment with a single file within the root, my mds server kernel panics. I understand that snapshots are not recommended. Is it beneficial to developers for me to leave my cluster in its present state and provide whatever debugging information they'd like? I'm not really looking for a solution to a mission critical issue as much as providing an opportunity for developers to pull stack traces, logs, etc from a system affected by some sort of bug in cephfs/mds. This happens every time I create a directory inside my .snap directory. Let me know if I should blow my cluster away? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
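For anyone trying to reproduce or debug this: snapshots were still gated behind a flag at the time, and are created by making a directory under the special .snap directory; a hedged sketch of the reproduction plus turning up MDS logging before the crash (mount point and MDS rank are examples):

    ceph mds set allow_new_snaps true --yes-i-really-mean-it   # snapshots are disabled by default
    ceph tell mds.0 injectargs '--debug-mds 20 --debug-ms 1'   # verbose MDS log for the crash report
    mkdir /mnt/cephfs/.snap/testsnap                           # this is the step that triggers it

The MDS log (/var/log/ceph/ceph-mds.*.log) plus the client/server kernel trace are the pieces developers will most likely ask for, ideally attached to a tracker.ceph.com issue.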
Re: [ceph-users] ceph cache tier clean rate too low
Hello,

On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:

> OK, you asked ;-)
>
I certainly did. ^o^

> This is all via RBD, I am running a single filesystem on top of 8 RBD
> devices in an effort to get data striping across more OSDs, I had been
> using that setup before adding the cache tier.
>
Nods. Depending on your use case (sequential writes) actual RADOS striping might be more advantageous than this (with 4MB writes still going to the same PG/OSD all the time).

> 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> setup with replication size 3. No SSDs involved in those OSDs, since
> ceph-disk does not let you break a bluestore configuration into more
> than one device at the moment.
>
That's a pity, but supposedly just a limitation of ceph-disk. I'd venture you can work around that with symlinks to a raw SSD partition, same as with current filestore journals.

As Sage recently wrote:
---
BlueStore can use as many as three devices: one for the WAL (journal, though it can be much smaller than FileStores, e.g., 128MB), one for metadata (e.g., an SSD partition), and one for data.
---

> The 600 Mbytes/sec is an approx sustained number for the data rate I can
> get going into this pool via RBD, that turns into 3 times that for raw
> data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> pushed it harder than that from time to time, but the OSD really wants
> to use fdatasync a lot and that tends to suck up a lot of the potential
> of a device, these disks will do 160 Mbytes/sec if you stream data to
> them.
>
> I just checked with rados bench to this set of 33 OSDs with a 3 replica
> pool, and 600 Mbytes/sec is what it will do from the same client host.
>
This matches a cluster of mine with 32 OSDs (filestore of course) and SSD journals on 4 nodes with a replica of 3.

So BlueStore is indeed faster than filestore.

> All the networking is 40 GB ethernet, single port per host, generally I
> can push 2.2 Gbytes/sec in one direction between two hosts over a single
> tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> node. Short of going to RDMA that appears to be about the limit for
> these processors.
>
Yeah, didn't expect your network to be involved here bottleneck wise, but a good data point to have nevertheless.

> There are a grand total of 2 400 GB P3700s which are running a pool with
> a replication factor of 1, these are in 2 other nodes. Once I add in
> replication perf goes downhill. If I had more hardware I would be
> running more of these and using replication, but I am out of network
> cards right now.
>
Alright, so at 900MB/s you're pretty close to what one would expect from 2 of these: 1080MB/s*2/2(journal).

How much downhill is that?

I have a production cache tier with 2 nodes (replica 2 of course) and 4 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance is pretty much what I would expect.

> So 5 nodes running OSDs, and a 6th node running the RBD client using the
> kernel implementation.
>
I assume there's a reason for using the kernel RBD client (which kernel?), given that it tends to be behind the curve in terms of features and speed?
> > 982 ceph osd pool create nvme 512 512 replicated_nvme > 983 ceph osd pool set nvme size 1 > 984 ceph osd tier add rbd nvme > 985 ceph osd tier cache-mode nvme writeback > 986 ceph osd tier set-overlay rbd nvme > 987 ceph osd pool set nvme hit_set_type bloom > 988 ceph osd pool set target_max_bytes 5000 <<—— typo here, > so never mind 989 ceph osd pool set nvme target_max_bytes 5000 > 990 ceph osd pool set nvme target_max_objects 50 > 991 ceph osd pool set nvme cache_target_dirty_ratio 0.5 > 992 ceph osd pool set nvme cache_target_full_ratio 0.8 > > I wish the cache tier would cause a health warning if it does not have > a max size set, it lets you do that, flushes nothing and fills the OSDs. > Oh yes, people have been bitten by this over and over again. At least it's documented now. > As for what the actual test is, this is 4K uncompressed DPX video frames, > so 50 Mbyte files written at least 24 a second on a good day, ideally > more. This needs to sustain around 1.3 Gbytes/sec in either direction > from a single application and needs to do it consistently. There is a > certain amount of buffering to deal with fluctuations in perf. I am > pushing 4096 of these files sequentially with a queue depth of 32 so > there is rather a lot of data in flight at any one time. I know I do not > have enough hardware to achieve this rate on writes. > So this is your test AND actual intended use case I presume, right? > The are being written with direct I/O into a pool of 8 RBD LUNs. The 8 > LUN setup will not really help he
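As a loosely hedged illustration of the three-device BlueStore layout Sage describes: the Jewel-era code has config options for pointing the DB and WAL at separate block devices at OSD mkfs time (whether ceph-disk wires this up automatically is a separate question, hence the symlink suggestion above). Partition names and sizes below are placeholders, and this is exactly the experimental territory being discussed in this thread:

    [osd]
    # data stays on the HDD; RocksDB metadata and the WAL go to flash
    bluestore block db path = /dev/disk/by-partlabel/osd0-db
    bluestore block wal path = /dev/disk/by-partlabel/osd0-wal
    bluestore block wal size = 134217728      # 128MB, per the note quoted above

Treat this as a sketch to experiment with rather than a supported recipe.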
Re: [ceph-users] ceph cache tier clean rate too low
Hi, response in line On 20 Apr 2016 7:45 a.m., "Christian Balzer" wrote: > > > Hello, > > On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote: > > > > > OK, you asked ;-) > > > > I certainly did. ^o^ > > > This is all via RBD, I am running a single filesystem on top of 8 RBD > > devices in an effort to get data striping across more OSDs, I had been > > using that setup before adding the cache tier. > > > Nods. > Depending on your use case (sequential writes) actual RADOS striping might > be more advantageous than this (with 4MB writes still going to the same > PG/OSD all the time). > > > > 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is > > setup with replication size 3. No SSDs involved in those OSDs, since > > ceph-disk does not let you break a bluestore configuration into more > > than one device at the moment. > > > That's a pity, but supposedly just a limitation of ceph-disk. > I'd venture you can work around that with symlinks to a raw SSD > partition, same as with current filestore journals. > > As Sage recently wrote: > --- > BlueStore can use as many as three devices: one for the WAL (journal, > though it can be much smaller than FileStores, e.g., 128MB), one for > metadata (e.g., an SSD partition), and one for data. > --- I believe he also mentioned the use of bcache and friends for the osd, maybe a way forward in this case? Regards Josef > > > The 600 Mbytes/sec is an approx sustained number for the data rate I can > > get going into this pool via RBD, that turns into 3 times that for raw > > data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have > > pushed it harder than that from time to time, but the OSD really wants > > to use fdatasync a lot and that tends to suck up a lot of the potential > > of a device, these disks will do 160 Mbytes/sec if you stream data to > > them. > > > > I just checked with rados bench to this set of 33 OSDs with a 3 replica > > pool, and 600 Mbytes/sec is what it will do from the same client host. > > > This matches a cluster of mine with 32 OSDs (filestore of course) and SSD > journals on 4 nodes with a replica of 3. > > So BlueStore is indeed faster than than filestore. > > > All the networking is 40 GB ethernet, single port per host, generally I > > can push 2.2 Gbytes/sec in one direction between two hosts over a single > > tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a > > node. Short of going to RDMA that appears to be about the limit for > > these processors. > > > Yeah, didn't expect your network to be involved here bottleneck wise, but > a good data point to have nevertheless. > > > There are a grand total of 2 400 GB P3700s which are running a pool with > > a replication factor of 1, these are in 2 other nodes. Once I add in > > replication perf goes downhill. If I had more hardware I would be > > running more of these and using replication, but I am out of network > > cards right now. > > > Alright, so at 900MB/s you're pretty close to what one would expect from 2 > of these: 1080MB/s*2/2(journal). > > How much downhill is that? > > I have a production cache tier with 2 nodes (replica 2 of course) and 4 > 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance > is pretty much what I would expect. > > > So 5 nodes running OSDs, and a 6th node running the RBD client using the > > kernel implementation. > > > I assume there's are reason for use the kernel RBD client (which kernel?), > given that it tends to be behind the curve in terms of features and speed? 
> > > Complete set of commands for creating the cache tier, I pulled this from > > history, so the line in the middle was a failed command actually so > > sorry for the red herring. > > > > 982 ceph osd pool create nvme 512 512 replicated_nvme > > 983 ceph osd pool set nvme size 1 > > 984 ceph osd tier add rbd nvme > > 985 ceph osd tier cache-mode nvme writeback > > 986 ceph osd tier set-overlay rbd nvme > > 987 ceph osd pool set nvme hit_set_type bloom > > 988 ceph osd pool set target_max_bytes 5000 <<—— typo here, > > so never mind 989 ceph osd pool set nvme target_max_bytes 5000 > > 990 ceph osd pool set nvme target_max_objects 50 > > 991 ceph osd pool set nvme cache_target_dirty_ratio 0.5 > > 992 ceph osd pool set nvme cache_target_full_ratio 0.8 > > > > I wish the cache tier would cause a health warning if it does not have > > a max size set, it lets you do that, flushes nothing and fills the OSDs. > > > Oh yes, people have been bitten by this over and over again. > At least it's documented now. > > > As for what the actual test is, this is 4K uncompressed DPX video frames, > > so 50 Mbyte files written at least 24 a second on a good day, ideally > > more. This needs to sustain around 1.3 Gbytes/sec in either direction > > from a single application and needs to do it consistently. There is a > > certain amount of buffering to deal with fluctuations in perf. I a
Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5
Hi Mike,

I don't have experience with RBD mounts, but I see the same effect with RBD. You can do some tuning to get better results (disable debug logging and so on). As a hint, some values from a ceph.conf:

[osd]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
filestore_op_threads = 4
osd max backfills = 1
osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
osd mkfs options xfs = "-f -i size=2048"
osd recovery max active = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_disk_threads = 1
osd_enable_op_tracker = false
osd_op_num_shards = 10
osd_op_num_threads_per_shard = 1
osd_op_threads = 4

Udo

On 19.04.2016 11:21, Mike Miller wrote: > Hi, > > RBD mount > ceph v0.94.5 > 6 OSD with 9 HDD each > 10 GBit/s public and private networks > 3 MON nodes 1Gbit/s network > > A rbd mounted with btrfs filesystem format performs really badly when > reading. Tried readahead in all combinations but that does not help in > any way. > > Write rates are very good in excess of 600 MB/s up to 1200 MB/s, > average 800 MB/s > Read rates on the same mounted rbd are about 10-30 MB/s !? > > Of course, both writes and reads are from a single client machine with > a single write/read command. So I am looking at single threaded > performance. > Actually, I was hoping to see at least 200-300 MB/s when reading, but > I am seeing 10% of that at best. > > Thanks for your help. > > Mike > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com