Re: [ceph-users] ceph master build fails on src/gmock, workaround?
On Sat, Jul 09, 2016 at 10:43:52AM +, Kevan Rehm wrote:
> Greetings,
>
> I cloned the master branch of ceph at https://github.com/ceph/ceph.git
> onto a Centos 7 machine, then did
>
> ./autogen.sh
> ./configure --enable-xio
> make

BTW, you should be defaulting to cmake if you don't have a specific need to use the autotools build.

--
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
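For anyone who hasn't switched yet, a cmake build of master usually looks something like the following (rough sketch only; the helper script and the exact options may differ in your checkout, so check the top-level README first):

./do_cmake.sh        # wrapper that configures a cmake build tree under ./build, if present
cd build
make -j$(nproc)

# or, with plain cmake:
# mkdir build && cd build && cmake .. && make -j$(nproc)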
[ceph-users] ceph admin socket protocol
Hi, is the ceph admin socket protocol described anywhere? I want to talk directly to the socket instead of calling the ceph binary. I searched the doc but didn't find anything useful. Thanks, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Filestore merge and split
You need to set the option in the ceph.conf and restart the OSD I think. But it will only take effect when splitting or merging in the future; it won't adjust the current folder layout.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Paul Renner
> Sent: 09 July 2016 22:18
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Filestore merge and split
>
> Hello cephers,
> we have many (millions) of small objects in our RadosGW system and are getting not very good write performance, 100-200 PUTs/sec.
>
> I have read on the mailing list that one possible tuning option would be to increase the max. number of files per directory on OSDs with e.g.
>
> filestore merge threshold = 40
> filestore split multiple = 8
>
> Now my question is, do we need to rebuild the OSDs to make this effective? Or is it a runtime setting?
> I'm asking because when setting this with injectargs I get the message "unchangeable" back.
> Thanks for any insight.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
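To make that concrete, a minimal sketch of the change (the restart command depends on how the cluster was deployed, so treat it as an example only):

# ceph.conf, on each OSD host
[osd]
filestore merge threshold = 40
filestore split multiple = 8

# then restart the OSDs one at a time, e.g.
systemctl restart ceph-osd@3        # systemd-based installs
# or: service ceph restart osd.3    # sysvinit-based installs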
[ceph-users] Drive letters shuffled on reboot
Hi everyone, I have problem with swapping drive and partition names on reboot. My Ceph is Hammer on CentOS7, Dell R730 6xSSD (2xSSD OS RAID1 PERC, 4xSSD=Journal drives), 18x1.8T SAS for OSDs. Whenever I reboot, drives randomly seem to change names. This is extremely dangerous and frustrating when I've initially setup CEPH with ceph-deploy, zap, prepare and activate. It has happened that I've accidentally erased wrong disk too when e.g. /dev/sdX had become /dev/sdY. Please see an output below of how this drive swapping below appears SDC is shifted, indexes and drive names got shuffled. Ceph OSDs didn't come up properly. Please advice on how to get this corrected, with no more drive name shuffling. Can this be due to the PERC HW raid? thx will POST REBOOT 2 (expected outcome.. with sda,sdb,sdc,sdd as journal. sdw is a perc raid1) [cephnode3][INFO ] Running command: sudo /usr/sbin/ceph-disk list [cephnode3][DEBUG ] /dev/sda : [cephnode3][DEBUG ] /dev/sda1 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda2 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda4 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda5 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb : [cephnode3][DEBUG ] /dev/sdb1 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb2 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb4 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb5 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc : [cephnode3][DEBUG ] /dev/sdc1 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc2 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc4 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc5 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd : [cephnode3][DEBUG ] /dev/sdd1 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd2 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd4 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd5 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sde : [cephnode3][DEBUG ] /dev/sde1 ceph data, active, cluster ceph, osd.0 [cephnode3][DEBUG ] /dev/sdf : [cephnode3][DEBUG ] /dev/sdf1 ceph data, active, cluster ceph, osd.1 [cephnode3][DEBUG ] /dev/sdg : [cephnode3][DEBUG ] /dev/sdg1 ceph data, active, cluster ceph, osd.2 [cephnode3][DEBUG ] /dev/sdh : [cephnode3][DEBUG ] /dev/sdh1 ceph data, active, cluster ceph, osd.3 [cephnode3][DEBUG ] /dev/sdi : [cephnode3][DEBUG ] /dev/sdi1 ceph data, active, cluster ceph, osd.4 [cephnode3][DEBUG ] /dev/sdj : [cephnode3][DEBUG ] /dev/sdj1 ceph data, active, cluster ceph, osd.5 [cephnode3][DEBUG ] /dev/sdk : [cephnode3][DEBUG ] /dev/sdk1 ceph data, active, cluster ceph, osd.6 [cephnode3][DEBUG ] /dev/sdl : [cephnode3][DEBUG ] /dev/sdl1 ceph data, active, cluster ceph, osd.7 [cephnode3][DEBUG ] /dev/sdm : [cephnode3][DEBUG ] /dev/sdm1 other, xfs [cephnode3][DEBUG ] /dev/sdn : [cephnode3][DEBUG ] /dev/sdn1 ceph data, active, cluster ceph, osd.9 [cephnode3][DEBUG ] /dev/sdo : [cephnode3][DEBUG ] /dev/sdo1 ceph data, 
active, cluster ceph, osd.10 [cephnode3][DEBUG ] /dev/sdp : [cephnode3][DEBUG ] /dev/sdp1 ceph data, active, cluster ceph, osd.11 [cephnode3][DEBUG ] /dev/sdq : [cephnode3][DEBUG ] /dev/sdq1 ceph data, active, cluster ceph, osd.12 [cephnode3][DEBUG ] /dev/sdr : [cephnode3][DEBUG ] /dev/sdr1 ceph data, active, cluster ceph, osd.13 [cephnode3][DEBUG ] /dev/sds : [cephnode3][DEBUG ] /dev/sds1 ceph data, active, cluster ceph, osd.14 [cephnode3][DEBUG ] /dev/sdt : [cephnode3][DEBUG ] /dev/sdt1 ceph data, active, cluster ceph, osd.15 [cephnode3][DEBUG ] /dev/sdu : [cephnode3][DEBUG ] /dev/sdu1 ceph data, active, cluster ceph, osd.16 [cephnode3][DEBUG ] /dev/sdv : [cephnode3][DEBUG ] /dev/sdv1 ceph data, active, cluster ceph, osd.17 [cephnode3][DEBUG ] /dev/sdw : [cephnode3][DEBUG ] /dev/sdw1 other, xfs, mounted on / [cephnode3][DEBUG ] /dev/sdw2 swap, swap POST REBOOT 1: [cephnode3][DEBUG ] /dev/sda : [cephnode3][DEBUG ] /dev/sda1 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda2 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda4 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda5 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb : [cephnode3][DEBUG ] /dev/sdb1 other, ebd0a0a2-b9e5-4433-87c0-68b6b
Re: [ceph-users] ceph admin socket protocol
If you can read C code, there is a collectd plugin that talks directly to the admin socket: https://github.com/collectd/collectd/blob/master/src/ceph.c On 10/07/16 10:36, Stefan Priebe - Profihost AG wrote: Hi, is the ceph admin socket protocol described anywhere? I want to talk directly to the socket instead of calling the ceph binary. I searched the doc but didn't find anything useful. Thanks, Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph for online file storage
Hello, >Those 2 servers are running Ceph? >If so, be more specific, what's the HW like, CPU, RAM. network, journal >SSDs? Yes, I was hesitating between GlusterFS and Ceph but the latter is much more scalable and is future-proof. Both have the same configuration, namely E5 2628L (6c/12t @ 1.9GHz), 8x16G 2133MHz, 2x10G bonded (we only use 10G and fiber links), multiple 120G SSDs avaailable for journals and caching. >Also, 2 servers indicate a replication of 2, something I'd avoid in >production. This is true. I was thinking about EC instead of replication. >Your first and foremost way to improve IOPS is to have SSD journals, >everybody who deployed Ceph w/o them in any serious production environment >came to regret it. I think it is clear that journal are a must, especially since many small files will be read and written to. >Doubling the OSDs while halving the size will give you the same >space but at a much better performance. It's true, but then the $/TB or even $/PB ratio is much higher. It would be interesting to compare the outcome with more lower-density disks vs less higher-density disks but with more (agressive) caching/journaling. Your overview of the whole system definitely helps sorting things out. As you suggested, it's best I try some combinations to find what suits my use case best. >If you were to use CephFS for storage, putting the metadata on SSDs will >be beneficial, too. All OS drives are SSDs, and considering the system will never use the SSD in full I think it would be safe to partition it for MDS, cache and journal data. -- Sincères salutations, Moïn Danai. Original Message From : ch...@gol.com Date : 01/07/2016 - 04:26 (CEST) To : ceph-users@lists.ceph.com Cc : m.da...@bluewin.ch Subject : Re: [ceph-users] Ceph for online file storage Hello, On Thu, 30 Jun 2016 08:34:12 + (GMT) m.da...@bluewin.ch wrote: > Thank you all for your prompt answers. > > >firstly, wall of text, makes things incredibly hard to read. > >Use paragraphs/returns liberally. > > I actually made sure to use paragraphs. For some reason, the formatting > was removed. > > >Is that your entire experience with Ceph, ML archives and docs? > > Of course not, I have already been through the whole documentation many > times. It's just that I couldn't really decide between the choices I was > given. > > >What's an "online storage"? > >I assume you're talking about what is is commonly referred as "cloud > storage". > > I try not to use the term "cloud", but if you must, then yes that's the > idea behind it. Basically an online hard disk. > While I can certainly agree that "cloud" is overused and often mis-used as well, it makes things clearer in this context. > >10MB is not a small file in my book, 1-4KB (your typical mail) are small > >files. > >How much data (volume/space) are you looking at initially and within a > >year of deployment? > > 10MB is small compared to the larger files, but it is indeed bigger that > smaller, IOPS-intensive files (like the emails you pointed out). > > Right now there are two servers, each with 12x8TB. I expect a growth > rate of about the same size every 2-3 months. > Those 2 servers are running Ceph? If so, be more specific, what's the HW like, CPU, RAM. network, journal SSDs? Also, 2 servers indicate a replication of 2, something I'd avoid in production. > >What usage patterns are you looking at, expecting? > > Since my customers will put their files on this "cloud", it's generally > write once, read many (or at least more reads than writes). 
As they most > likely will store private documents, but some bigger files too, the > smaller files are predominant. > Reads are helped by having plenty of RAM in your storage servers. > >That's quite the blanket statement and sounds like from A sales > >brochure. SSDs for OSD journals are always a good idea. > >Ceph scales first and foremost by adding more storage nodes and OSDs. > > What I meant by scaling is that as the number of customers grows, the > more small files there will be, and so in order to have decent > performance at that point, SSDs are a must. I can add many OSDs, but if > they are all struggling with IOPS then it's no use (except having more > space). > You seem to grasp the fact that IOPS are likely to be your bottleneck, yet are going for 8TB HDDs. Which as Oliver mentioned and plenty of experience shared on this ML shows is a poor choice unless it's for very low IOPS, large data use cases. Now while I certainly understand the appeal of dense storage nodes from cost/space perspective you will want to run several scenarios and calculations to see what actually turns out to be the best fit. Your HDDs can do about 150 IOPS, half of that if they have no SSD journals and then some 30% more lost to FS journals, LevelDB updates, etc. Let's call it 60 IOPS w/o SSD journals and 120 with. Your first and foremost way to improve IOPS is to have SSD journals, everybody who deployed Ceph
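Back of the envelope, using the 60/120 IOPS-per-OSD figures above and the two servers with 12 drives each mentioned earlier:

24 OSDs x  60 IOPS = ~1440 raw write IOPS without SSD journals
24 OSDs x 120 IOPS = ~2880 raw write IOPS with SSD journals

With replication size 2 every client write lands on two OSDs, so the client-visible write ceiling is roughly half of that again: on the order of 700 vs 1400 write IOPS for the whole cluster, before any other overhead.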
Re: [ceph-users] ceph admin socket protocol
On Sun, Jul 10, 2016 at 9:36 AM, Stefan Priebe - Profihost AG wrote: > Hi, > > is the ceph admin socket protocol described anywhere? I want to talk directly > to the socket instead of calling the ceph binary. I searched the doc but > didn't find anything useful. There's no binary involved in sending commands to the admin socket, the CLI is using the python code here: https://github.com/ceph/ceph/blob/master/src/pybind/ceph_daemon.py Cheers, John > Thanks, > Stefan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Filestore merge and split
Thanks... Do you know when splitting or merging will happen? Is it enough that a directory is read, e.g. through scrub? If possible I would like to initiate the process.

Regards
Paul

On Sun, Jul 10, 2016 at 10:47 AM, Nick Fisk wrote:
> You need to set the option in the ceph.conf and restart the OSD I think.
> But it will only take effect when splitting or merging in the future, it
> won't adjust the current folder layout.
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Paul Renner
> > Sent: 09 July 2016 22:18
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Filestore merge and split
> >
> > Hello cephers,
> > we have many (millions) of small objects in our RadosGW system and are getting not very good write performance, 100-200 PUTs/sec.
> >
> > I have read on the mailing list that one possible tuning option would be to increase the max. number of files per directory on OSDs with e.g.
> >
> > filestore merge threshold = 40
> > filestore split multiple = 8
> >
> > Now my question is, do we need to rebuild the OSDs to make this effective? Or is it a runtime setting?
> > I'm asking because when setting this with injectargs I get the message "unchangeable" back.
> > Thanks for any insight.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
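If I remember the filestore behaviour correctly, splitting and merging are only evaluated when objects are created or deleted in a collection, so reads and scrubs won't trigger it; the way to make it happen is simply to keep writing (or deleting) in the affected PGs. The split point is roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects per subdirectory, so with the values quoted above:

8 * 40 * 16 = 5120 files per subdirectory before a split

Treat that formula as from memory rather than gospel; the authoritative answer is in the HashIndex code.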
Re: [ceph-users] ceph admin socket protocol
Am 10.07.2016 um 16:33 schrieb Daniel Swarbrick: > If you can read C code, there is a collectd plugin that talks directly > to the admin socket: > > https://github.com/collectd/collectd/blob/master/src/ceph.c thanks can read that. Stefan > > On 10/07/16 10:36, Stefan Priebe - Profihost AG wrote: >> Hi, >> >> is the ceph admin socket protocol described anywhere? I want to talk >> directly to the socket instead of calling the ceph binary. I searched >> the doc but didn't find anything useful. >> >> Thanks, >> Stefan >> > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph admin socket protocol
Am 10.07.2016 um 20:08 schrieb John Spray: > On Sun, Jul 10, 2016 at 9:36 AM, Stefan Priebe - Profihost AG > wrote: >> Hi, >> >> is the ceph admin socket protocol described anywhere? I want to talk >> directly to the socket instead of calling the ceph binary. I searched the >> doc but didn't find anything useful. > > There's no binary involved in sending commands to the admin socket, > the CLI is using the python code here: > https://github.com/ceph/ceph/blob/master/src/pybind/ceph_daemon.py argh thanks ;-) never noticed the python code there. > > Cheers, > John > >> Thanks, >> Stefan >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph admin socket from non root
Hi,

is there a recommended way to connect to the ceph admin socket as a non-root user, e.g. from a monitoring system? In the past the sockets were created with 777 permissions, but now they're 755, which prevents our monitoring daemon from connecting. I don't like to set CAP_DAC_OVERRIDE for the monitoring agent.

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
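One low-tech workaround (a sketch, with a hypothetical "monitoring" group) is to leave the daemons alone and just re-own the sockets after they start, e.g. from a root cron job or a systemd ExecStartPost:

chgrp monitoring /var/run/ceph/*.asok
chmod g+rw /var/run/ceph/*.asok

Connecting to a unix socket only needs write permission on the socket file, so group rw is enough for the monitoring daemon. The catch is that the sockets are recreated on every daemon restart, so whatever applies this has to run again afterwards.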
Re: [ceph-users] ceph admin socket protocol
On Sun, Jul 10, 2016 at 09:32:33PM +0200, Stefan Priebe - Profihost AG wrote:
> Am 10.07.2016 um 16:33 schrieb Daniel Swarbrick:
> > If you can read C code, there is a collectd plugin that talks directly
> > to the admin socket:
> >
> > https://github.com/collectd/collectd/blob/master/src/ceph.c
>
> thanks can read that.

If you're interested in using the AdminSocketClient, here's some example code.

#include "common/admin_socket_client.h"
#include <iostream>

int main(int argc, char** argv)
{
  // argv[1] is the path to the daemon's admin socket (.asok) file
  std::string response;
  AdminSocketClient client(argv[1]);
  // do_request() sends the JSON command over the socket and fills in the JSON response
  //client.do_request("{\"prefix\":\"help\"}", &response);
  //client.do_request("{\"prefix\":\"help\", \"format\": \"json\"}", &response);
  client.do_request("{\"prefix\":\"perf dump\"}", &response);
  //client.do_request("{\"prefix\":\"perf dump\", \"format\": \"json\"}", &response);
  std::cout << response << '\n';
  return 0;
}

// $ g++ -O2 -std=c++11 ceph-admin-socket-test.cpp -I../ceph/src/ -I../ceph/build/include/ ../ceph/build/lib/libcommon.a

--
Cheers,
Brad

>
> Stefan
>
> > On 10/07/16 10:36, Stefan Priebe - Profihost AG wrote:
> >> Hi,
> >>
> >> is the ceph admin socket protocol described anywhere? I want to talk
> >> directly to the socket instead of calling the ceph binary. I searched
> >> the doc but didn't find anything useful.
> >>
> >> Thanks,
> >> Stefan
> >>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Drive letters shuffled on reboot
Hello, On Sun, 10 Jul 2016 12:46:39 + (UTC) William Josefsson wrote: > Hi everyone, > > I have problem with swapping drive and partition names on reboot. My > Ceph is Hammer on CentOS7, Dell R730 6xSSD (2xSSD OS RAID1 PERC, > 4xSSD=Journal drives), 18x1.8T SAS for OSDs. > > Whenever I reboot, drives randomly seem to change names. This is > extremely dangerous and frustrating when I've initially setup CEPH with > ceph-deploy, zap, prepare and activate. It has happened that I've > accidentally erased wrong disk too when e.g. /dev/sdX had > become /dev/sdY. > This isn't a Ceph specific question per se and you could probably keep things from moving around by enforcing module loads in a particular order. But that of course still wouldn't help if something else changed or a drive totally failed. So in the context of Ceph, it doesn't (shouldn't) care if the OSD (HDD) changes names, especially since you did set it up with ceph-deploy. And to avoid the journals getting jumbled up, do what everybody does (outside of Ceph as well), use /dev/disk/by-id or uuid. Like: --- # ls -la /var/lib/ceph/osd/ceph-28/ journal -> /dev/disk/by-id/wwn-0x55cd2e404b73d569-part3 --- Christian > Please see an output below of how this drive swapping below appears SDC > is shifted, indexes and drive names got shuffled. Ceph OSDs didn't come > up properly. > > Please advice on how to get this corrected, with no more drive name > shuffling. Can this be due to the PERC HW raid? thx will > > > > POST REBOOT 2 (expected outcome.. with sda,sdb,sdc,sdd as journal. sdw > is a perc raid1) > > > [cephnode3][INFO ] Running command: sudo /usr/sbin/ceph-disk list > [cephnode3][DEBUG ] /dev/sda : > [cephnode3][DEBUG ] /dev/sda1 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda2 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sda3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > [cephnode3][DEBUG ] /dev/sda4 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda5 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sdb : [cephnode3][DEBUG ] /dev/sdb1 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb2 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sdb3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > [cephnode3][DEBUG ] /dev/sdb4 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb5 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sdc : [cephnode3][DEBUG ] /dev/sdc1 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc2 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sdc3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > [cephnode3][DEBUG ] /dev/sdc4 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc5 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sdd : [cephnode3][DEBUG ] /dev/sdd1 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd2 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sdd3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > [cephnode3][DEBUG ] /dev/sdd4 other, > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd5 > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > ] /dev/sde : [cephnode3][DEBUG ] /dev/sde1 ceph data, active, cluster > ceph, osd.0 [cephnode3][DEBUG ] /dev/sdf : [cephnode3][DEBUG > ] /dev/sdf1 ceph data, active, cluster ceph, osd.1 [cephnode3][DEBUG > ] /dev/sdg : 
[cephnode3][DEBUG ] /dev/sdg1 ceph data, active, cluster > ceph, osd.2 [cephnode3][DEBUG ] /dev/sdh : [cephnode3][DEBUG > ] /dev/sdh1 ceph data, active, cluster ceph, osd.3 [cephnode3][DEBUG > ] /dev/sdi : [cephnode3][DEBUG ] /dev/sdi1 ceph data, active, cluster > ceph, osd.4 [cephnode3][DEBUG ] /dev/sdj : [cephnode3][DEBUG > ] /dev/sdj1 ceph data, active, cluster ceph, osd.5 [cephnode3][DEBUG > ] /dev/sdk : [cephnode3][DEBUG ] /dev/sdk1 ceph data, active, cluster > ceph, osd.6 [cephnode3][DEBUG ] /dev/sdl : [cephnode3][DEBUG > ] /dev/sdl1 ceph data, active, cluster ceph, osd.7 [cephnode3][DEBUG > ] /dev/sdm : [cephnode3][DEBUG ] /dev/sdm1 other, xfs > [cephnode3][DEBUG ] /dev/sdn : > [cephnode3][DEBUG ] /dev/sdn1 ceph data, active, cluster ceph, osd.9 > [cephnode3][DEBUG ] /dev/sdo : > [cephnode3][DEBUG ] /dev/sdo1 ceph data, active, cluster ceph, osd.10 > [cephnode3][DEBUG ] /dev/sdp : > [cephnode3][DEBUG ] /dev/sdp1 ceph data, active, cluster ceph, osd.11 > [cephnode3][DEBUG ] /dev/sdq : > [cephnode3][DEBUG ] /dev/sdq1 ceph data, active, cluster ceph, osd.12 > [cephnode3][DEBUG ] /dev/sdr : > [cephnode3][DEBUG ] /dev/sdr1 ceph data, active, cluster ceph, osd.13 > [cephnode3][DEBUG ] /dev/sds : > [cephnode3][DEBUG ] /dev/sds1 ceph data, active, cluster ceph, osd.14 > [cephnode3][DEBUG ] /dev/sdt : > [cephnode3][DEBUG ] /dev/sdt1 ceph data, active, cluster ceph, osd.15
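If existing journal symlinks point at /dev/sdX names, re-pointing them at a stable name for the same partition is usually enough. A rough sketch (example uses osd.0 and sda1; double-check which partition really is that OSD's journal before touching anything, and use whichever service commands match your init system):

service ceph stop osd.0                    # or: systemctl stop ceph-osd@0
ls -l /dev/disk/by-partuuid/ | grep sda1   # find the stable name of the current journal partition (by-id works too)
cd /var/lib/ceph/osd/ceph-0
ln -sfn /dev/disk/by-partuuid/<uuid-from-above> journal
service ceph start osd.0                   # or: systemctl start ceph-osd@0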
Re: [ceph-users] Ceph for online file storage
Hello, On Sun, 10 Jul 2016 14:33:36 + (GMT) m.da...@bluewin.ch wrote: > Hello, > > >Those 2 servers are running Ceph? > >If so, be more specific, what's the HW like, CPU, RAM. network, journal > >SSDs? > > Yes, I was hesitating between GlusterFS and Ceph but the latter is much > more scalable and is future-proof. > > Both have the same configuration, namely E5 2628L (6c/12t @ 1.9GHz), > 8x16G 2133MHz, 2x10G bonded (we only use 10G and fiber links), multiple > 120G SSDs avaailable for journals and caching. > With two of these CPUs (and SSD journals) definitely not more than 24 OSDs per node. RAM is plentiful. Which exact SSD models? None of the 120GB ones I can think of would make good journal ones. > >Also, 2 servers indicate a replication of 2, something I'd avoid in > >production. > > This is true. I was thinking about EC instead of replication. > With EC you need to keep several things in mind: 1. Performance, especially IOPS, is worse than replicated. 2. More CPU power is needed. 3. A cache tier is mandatory. 4. Most importantly, you can't start small. With something akin to RAID6 levels of redundancy, you probably want nothing smaller than 8 nodes (K=6,M=2). > >Your first and foremost way to improve IOPS is to have SSD journals, > >everybody who deployed Ceph w/o them in any serious production > >environment came to regret it. > > I think it is clear that journal are a must, especially since many small > files will be read and written to. > > >Doubling the OSDs while halving the size will give you the same > >space but at a much better performance. > > It's true, but then the $/TB or even $/PB ratio is much higher. It would > be interesting to compare the outcome with more lower-density disks vs > less higher-density disks but with more (agressive) caching/journaling. > You may find that it's a zero-sum game, more or less. Basically you have the costs for chassis/MB/network cards per node that push you towards higher density nodes to save costs. OTOH cache-tier nodes (SSDs, NVMEs, CPUs) don't come cheap either. > Your overview of the whole system definitely helps sorting things out. > As you suggested, it's best I try some combinations to find what suits > my use case best. > > >If you were to use CephFS for storage, putting the metadata on SSDs will > >be beneficial, too. > > All OS drives are SSDs, and considering the system will never use the > SSD in full I think it would be safe to partition it for MDS, cache and > journal data. > Again, needs to be right kind of SSD for this to work, but in general, yes. I do share OS/journal SSDs all the time. Note that MDS in and by itself doesn't hold any persistent (on-disk) data, the metadata is all in the Ceph meta-data pool and that's the one you want to put on SSDs. Christian > -- > Sincères salutations, > > Moïn Danai. > Original Message > From : ch...@gol.com > Date : 01/07/2016 - 04:26 (CEST) > To : ceph-users@lists.ceph.com > Cc : m.da...@bluewin.ch > Subject : Re: [ceph-users] Ceph for online file storage > > > Hello, > > On Thu, 30 Jun 2016 08:34:12 + (GMT) m.da...@bluewin.ch wrote: > > > Thank you all for your prompt answers. > > > > >firstly, wall of text, makes things incredibly hard to read. > > >Use paragraphs/returns liberally. > > > > I actually made sure to use paragraphs. For some reason, the formatting > > was removed. > > > > >Is that your entire experience with Ceph, ML archives and docs? > > > > Of course not, I have already been through the whole documentation many > > times. 
It's just that I couldn't really decide between the choices I > > was given. > > > > >What's an "online storage"? > > >I assume you're talking about what is is commonly referred as "cloud > > storage". > > > > I try not to use the term "cloud", but if you must, then yes that's the > > idea behind it. Basically an online hard disk. > > > While I can certainly agree that "cloud" is overused and often mis-used > as well, it makes things clearer in this context. > > > >10MB is not a small file in my book, 1-4KB (your typical mail) are > > >small files. > > >How much data (volume/space) are you looking at initially and within a > > >year of deployment? > > > > 10MB is small compared to the larger files, but it is indeed bigger > > that smaller, IOPS-intensive files (like the emails you pointed out). > > > > Right now there are two servers, each with 12x8TB. I expect a growth > > rate of about the same size every 2-3 months. > > > Those 2 servers are running Ceph? > If so, be more specific, what's the HW like, CPU, RAM. network, journal > SSDs? > > Also, 2 servers indicate a replication of 2, something I'd avoid in > production. > > > > >What usage patterns are you looking at, expecting? > > > > Since my customers will put their files on this "cloud", it's generally > > write once, read many (or at least more reads than writes). As they > > most likely will store private documents, but some bigger fil
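For reference, a k=6/m=2 setup like the one mentioned above is created along these lines (a sketch only; pool name, pg counts and the failure domain all need to be adapted to the actual cluster):

ceph osd erasure-code-profile set ec62 k=6 m=2 ruleset-failure-domain=host
ceph osd erasure-code-profile get ec62
ceph osd pool create ecdata 1024 1024 erasure ec62

And, as noted above, for RBD or CephFS use the EC pool still needs a replicated cache tier in front of it.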
Re: [ceph-users] Drive letters shuffled on reboot
Hello, This is an interesting topic and would like to know a solution to this problem. Does that mean we should never use Dell storage as ceph storage device? I have similar setup with Dell 4 iscsi LUNs attached to openstack controller and compute node in active-active situation. As they were in active active 1 selected first 2 luns as osd on node1 and last 2 as osd on node 2. Is it ok to have this configuration specially when and node will be down or considering live migration. Regards Gaurav Goyal On 10-Jul-2016 9:02 pm, "Christian Balzer" wrote: > > Hello, > > On Sun, 10 Jul 2016 12:46:39 + (UTC) William Josefsson wrote: > > > Hi everyone, > > > > I have problem with swapping drive and partition names on reboot. My > > Ceph is Hammer on CentOS7, Dell R730 6xSSD (2xSSD OS RAID1 PERC, > > 4xSSD=Journal drives), 18x1.8T SAS for OSDs. > > > > Whenever I reboot, drives randomly seem to change names. This is > > extremely dangerous and frustrating when I've initially setup CEPH with > > ceph-deploy, zap, prepare and activate. It has happened that I've > > accidentally erased wrong disk too when e.g. /dev/sdX had > > become /dev/sdY. > > > This isn't a Ceph specific question per se and you could probably keep > things from moving around by enforcing module loads in a particular order. > > But that of course still wouldn't help if something else changed or a > drive totally failed. > > So in the context of Ceph, it doesn't (shouldn't) care if the OSD (HDD) > changes names, especially since you did set it up with ceph-deploy. > > And to avoid the journals getting jumbled up, do what everybody does > (outside of Ceph as well), use /dev/disk/by-id or uuid. > > Like: > --- > # ls -la /var/lib/ceph/osd/ceph-28/ > > journal -> /dev/disk/by-id/wwn-0x55cd2e404b73d569-part3 > --- > > Christian > > Please see an output below of how this drive swapping below appears SDC > > is shifted, indexes and drive names got shuffled. Ceph OSDs didn't come > > up properly. > > > > Please advice on how to get this corrected, with no more drive name > > shuffling. Can this be due to the PERC HW raid? thx will > > > > > > > > POST REBOOT 2 (expected outcome.. with sda,sdb,sdc,sdd as journal. 
sdw > > is a perc raid1) > > > > > > [cephnode3][INFO ] Running command: sudo /usr/sbin/ceph-disk list > > [cephnode3][DEBUG ] /dev/sda : > > [cephnode3][DEBUG ] /dev/sda1 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda2 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sda3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > > [cephnode3][DEBUG ] /dev/sda4 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sda5 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sdb : [cephnode3][DEBUG ] /dev/sdb1 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb2 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sdb3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > > [cephnode3][DEBUG ] /dev/sdb4 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdb5 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sdc : [cephnode3][DEBUG ] /dev/sdc1 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc2 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sdc3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > > [cephnode3][DEBUG ] /dev/sdc4 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdc5 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sdd : [cephnode3][DEBUG ] /dev/sdd1 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd2 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sdd3 other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 > > [cephnode3][DEBUG ] /dev/sdd4 other, > > ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG ] /dev/sdd5 > > other, ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 [cephnode3][DEBUG > > ] /dev/sde : [cephnode3][DEBUG ] /dev/sde1 ceph data, active, cluster > > ceph, osd.0 [cephnode3][DEBUG ] /dev/sdf : [cephnode3][DEBUG > > ] /dev/sdf1 ceph data, active, cluster ceph, osd.1 [cephnode3][DEBUG > > ] /dev/sdg : [cephnode3][DEBUG ] /dev/sdg1 ceph data, active, cluster > > ceph, osd.2 [cephnode3][DEBUG ] /dev/sdh : [cephnode3][DEBUG > > ] /dev/sdh1 ceph data, active, cluster ceph, osd.3 [cephnode3][DEBUG > > ] /dev/sdi : [cephnode3][DEBUG ] /dev/sdi1 ceph data, active, cluster > > ceph, osd.4 [cephnode3][DEBUG ] /dev/sdj : [cephnode3][DEBUG > > ] /dev/sdj1 ceph data, active, cluster ceph, osd.5 [cephnode3][DEBUG > > ] /dev/sdk : [cephnode3][DEBUG ] /dev/sdk1 ceph data, active, cluster > > ceph, osd.6 [cephnode3][DEBUG ] /dev/sdl : [cephnode3][DEBUG > > ] /dev/sdl1 ceph data, active, cluster ceph, osd.7 [cephnode3][DEBUG > > ] /dev/sdm : [cephnode3][DEBUG ] /dev/sdm1 other, xfs > > [cephn
Re: [ceph-users] Backing up RBD snapshots to a different cloud service
Hi Brendan, On Friday, July 8, 2016, Brendan Moloney wrote: > Hi, > > We have a smallish Ceph cluster for RBD images. We use snapshotting for > local incremental backups. I would like to start sending some of these > snapshots to an external cloud service (likely Amazon) for disaster > recovery purposes. > > Does anyone have advice on how to do this? I suppose I could just use the > rbd export/diff commands but some of our RBD images are quite large > (multiple terabytes) so I can imagine this becoming quite inefficient. We > would either need to keep all snapshots indefinitely and retrieve every > single snapshot to recover or we would have to occasionally send a new full > disk image. > > I guess doing the backups on the object level could potentially avoid > these issues, but I am not sure how to go about that. > We are currently rolling out a solution that utilizes merge-diff command to continuously create synthetic fulls at the remote site. The remote site needs to be more than just storage, e.g. a Linux VM or such, but as long as the continuity of snapshots is maintained, you should be able to recover from just the one image. Detecting start and end snapshot of a diff export file is not hard, I asked details earlier on this list, and would be happy to send you code stubs in Perl if you are interested. Another option, which we have not yet tried with RBD exports is the borgbackup project, which offers excellent deduplication. HTH, Alex > > > Any advice is greatly appreciated. > > Thanks, > Brendan > -- -- Alex Gorbachev Storcium ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
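For anyone wanting to experiment with the approach Alex describes, the building blocks look roughly like this (sketch with made-up pool/image/snapshot names; snapshot rotation and error handling left out):

rbd export-diff ssdpool/vm1@snap1 vm1-snap1.diff                      # full export up to snap1
rbd export-diff --from-snap snap1 ssdpool/vm1@snap2 vm1-s1-s2.diff    # incremental between snap1 and snap2
rbd merge-diff vm1-snap1.diff vm1-s1-s2.diff vm1-snap2.diff           # collapse into a synthetic full at the backup site
rbd import-diff vm1-snap2.diff ssdpool/vm1-restored                   # replay onto a freshly created, right-sized image to restore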
[ceph-users] Fwd: Ceph OSD suicide himself
Hi cephers. I need your help for some issues. The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs. I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs). I've experienced one of OSDs was killed himself. Always it issued suicide timeout message. Below is detailed logs. == 0. ceph df detail $ sudo ceph df detail GLOBAL: SIZE AVAIL RAW USED %RAW USED OBJECTS 42989G 24734G 18138G 42.19 23443k POOLS: NAMEID CATEGORY QUOTA OBJECTS QUOTA BYTES USED %USED MAX AVAIL OBJECTS DIRTY READ WRITE RAW USED ha-pool 40 -N/A N/A 1405G 9.81 5270G 22986458 22447k 0 22447k4217G volumes 45 -N/A N/A 4093G 28.57 5270G 933401 911k 648M 649M 12280G images 46 -N/A N/A 53745M 0.37 5270G 6746 6746 1278k 21046 157G backups 47 -N/A N/A 0 0 5270G0 0 0 0 0 vms 48 -N/A N/A 309G 2.16 5270G79426 79426 92612k 46506k 928G 1. ceph no.15 log *(20:02 first timed out message)* 2016-07-08 20:02:01.049483 7fcd3caa5700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 2016-07-08 20:02:01.050403 7fcd3b2a2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 2016-07-08 20:02:01.086792 7fcd3b2a2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 . . (sometimes this logs with..) 2016-07-08 20:02:11.379597 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs 2016-07-08 20:02:11.379608 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937: osd_op(client.895668.0:5302745 45.e2e779c2 rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent 2016-07-08 20:02:11.379612 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406: osd_op(client.895668.0:5302746 45.e2e779c2 rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw locks 2016-07-08 20:02:11.379630 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907: osd_op(client.895668.0:5302747 45.e2e779c2 rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw locks 2016-07-08 20:02:11.379633 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371: osd_op(client.895668.0:5302748 45.e2e779c2 rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw locks 2016-07-08 20:02:11.379636 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852: osd_op(client.895668.0:5302749 45.e2e779c2 rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw locks . . 
(after a lot of same messages) 2016-07-08 20:03:53.682828 7fcd3caa5700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15 2016-07-08 20:03:53.682828 7fcd3caa5700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd2da87700' had timed out after 15 2016-07-08 20:03:53.682829 7fcd3caa5700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd48716700' had timed out after 60 2016-07-08 20:03:53.682830 7fcd3caa5700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd47f15700' had timed out after 60 . . (fault with nothing to send, going to standby massages) 2016-07-08 20:03:53.708665 7fcd15787700 0 -- 10.200.10.145:6818/6462 >> 10.200.10.146:6806/4642 pipe(0x55818727e000 sd=276 :51916 s=2 pgs=2225 cs=1 l=0 c=0x558186f61d80).fault with nothing to send, going to standby 2016-07-08 20:03:53.724928 7fcd072c2700 0 -- 10.200.10.145:6818/6462 >> 10.200.10.146:6800/4336 pipe(0x55818a25b400 sd=109 :6818 s=2 pgs=2440 cs=13 l=0 c=0x55818730f080).fault with nothing to send, going to standby 2016-07-08 20:03:53.738216 7fcd0b7d3700 0 -- 10.200.10.145:6818/6462 >> 10.200.10.144:6814/5069 pipe(0x55816c6a4800 sd=334 :53850 s=2 pgs=43 cs=1 l=0 c=0x55818611f800).fa
Re: [ceph-users] Fwd: Ceph OSD suicide himself
On Mon, Jul 11, 2016 at 11:48:57AM +0900, 한승진 wrote: > Hi cephers. > > I need your help for some issues. > > The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs. > > I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs). > > I've experienced one of OSDs was killed himself. > > Always it issued suicide timeout message. > > Below is detailed logs. > > > == > 0. ceph df detail > $ sudo ceph df detail > GLOBAL: > SIZE AVAIL RAW USED %RAW USED OBJECTS > 42989G 24734G 18138G 42.19 23443k > POOLS: > NAMEID CATEGORY QUOTA OBJECTS QUOTA BYTES USED > %USED MAX AVAIL OBJECTS DIRTY READ WRITE > RAW USED > ha-pool 40 -N/A N/A > 1405G 9.81 5270G 22986458 22447k 0 > 22447k4217G > volumes 45 -N/A N/A > 4093G 28.57 5270G 933401 911k 648M > 649M 12280G > images 46 -N/A N/A > 53745M 0.37 5270G 6746 6746 1278k > 21046 157G > backups 47 -N/A N/A > 0 0 5270G0 0 0 0 > 0 > vms 48 -N/A N/A > 309G 2.16 5270G79426 79426 92612k 46506k > 928G > > 1. ceph no.15 log > > *(20:02 first timed out message)* > 2016-07-08 20:02:01.049483 7fcd3caa5700 1 heartbeat_map is_healthy > 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 > 2016-07-08 20:02:01.050403 7fcd3b2a2700 1 heartbeat_map is_healthy > 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 > 2016-07-08 20:02:01.086792 7fcd3b2a2700 1 heartbeat_map is_healthy > 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 > . > . > (sometimes this logs with..) > 2016-07-08 20:02:11.379597 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : > 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs > 2016-07-08 20:02:11.379608 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : > slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937: > osd_op(client.895668.0:5302745 45.e2e779c2 > rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc > 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent > 2016-07-08 20:02:11.379612 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : > slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406: > osd_op(client.895668.0:5302746 45.e2e779c2 > rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc > 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw > locks > 2016-07-08 20:02:11.379630 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : > slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907: > osd_op(client.895668.0:5302747 45.e2e779c2 > rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc > 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw > locks > 2016-07-08 20:02:11.379633 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : > slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371: > osd_op(client.895668.0:5302748 45.e2e779c2 > rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc > 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw > locks > 2016-07-08 20:02:11.379636 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : > slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852: > osd_op(client.895668.0:5302749 45.e2e779c2 > rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc > 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw > locks > . > . 
> (after a lot of same messages) > 2016-07-08 20:03:53.682828 7fcd3caa5700 1 heartbeat_map is_healthy > 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15 > 2016-07-08 20:03:53.682828 7fcd3caa5700 1 heartbeat_map is_healthy > 'OSD::osd_op_tp thread 0x7fcd2da87700' had timed out after 15 > 2016-07-08 20:03:53.682829 7fcd3caa5700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7fcd48716700' had timed out after 60 > 2016-07-08 20:03:53.682830 7fcd3caa5700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7fcd47f15700' had timed out after 60 > . > . > (fault with nothing to send, going to standby massages) > 2016-07-08 20:03:53.708665 7fcd15787700 0 -- 10.200.10.145:6818/6462 >> > 10.200.10.146:6806/4642 pipe(0x55818727e000 sd=276 :51916 s=2 pgs=2225 cs=1 > l=0 c=0x558186f61d80).fault with nothing to send, going to standby > 2016-07-08 20:03:53.724928 7fcd072c2700 0 -- 10.200.10.145:6818/6462 >> > 10.200.10.146:6800/4336 pipe(0x55818a25b400 sd=109 :681
Re: [ceph-users] Fwd: Ceph OSD suicide himself
On Mon, Jul 11, 2016 at 1:21 PM, Brad Hubbard wrote: > On Mon, Jul 11, 2016 at 11:48:57AM +0900, 한승진 wrote: >> Hi cephers. >> >> I need your help for some issues. >> >> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs. >> >> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs). >> >> I've experienced one of OSDs was killed himself. >> >> Always it issued suicide timeout message. >> >> Below is detailed logs. >> >> >> == >> 0. ceph df detail >> $ sudo ceph df detail >> GLOBAL: >> SIZE AVAIL RAW USED %RAW USED OBJECTS >> 42989G 24734G 18138G 42.19 23443k >> POOLS: >> NAMEID CATEGORY QUOTA OBJECTS QUOTA BYTES USED >> %USED MAX AVAIL OBJECTS DIRTY READ WRITE >> RAW USED >> ha-pool 40 -N/A N/A >> 1405G 9.81 5270G 22986458 22447k 0 >> 22447k4217G >> volumes 45 -N/A N/A >> 4093G 28.57 5270G 933401 911k 648M >> 649M 12280G >> images 46 -N/A N/A >> 53745M 0.37 5270G 6746 6746 1278k >> 21046 157G >> backups 47 -N/A N/A >> 0 0 5270G0 0 0 0 >> 0 >> vms 48 -N/A N/A >> 309G 2.16 5270G79426 79426 92612k 46506k >> 928G >> >> 1. ceph no.15 log >> >> *(20:02 first timed out message)* >> 2016-07-08 20:02:01.049483 7fcd3caa5700 1 heartbeat_map is_healthy >> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 >> 2016-07-08 20:02:01.050403 7fcd3b2a2700 1 heartbeat_map is_healthy >> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 >> 2016-07-08 20:02:01.086792 7fcd3b2a2700 1 heartbeat_map is_healthy >> 'OSD::osd_op_tp thread 0x7fcd2c284700' had timed out after 15 >> . >> . >> (sometimes this logs with..) >> 2016-07-08 20:02:11.379597 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : >> 12 slow requests, 5 included below; oldest blocked for > 30.269577 secs >> 2016-07-08 20:02:11.379608 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : >> slow request 30.269577 seconds old, received at 2016-07-08 20:01:41.109937: >> osd_op(client.895668.0:5302745 45.e2e779c2 >> rbd_data.cc460bc7fc8f.04d8 [stat,write 2596864~516096] snapc >> 0=[] ack+ondisk+write+known_if_redirected e30969) currently commit_sent >> 2016-07-08 20:02:11.379612 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : >> slow request 30.269108 seconds old, received at 2016-07-08 20:01:41.110406: >> osd_op(client.895668.0:5302746 45.e2e779c2 >> rbd_data.cc460bc7fc8f.04d8 [stat,write 3112960~516096] snapc >> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw >> locks >> 2016-07-08 20:02:11.379630 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : >> slow request 30.268607 seconds old, received at 2016-07-08 20:01:41.110907: >> osd_op(client.895668.0:5302747 45.e2e779c2 >> rbd_data.cc460bc7fc8f.04d8 [stat,write 3629056~516096] snapc >> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw >> locks >> 2016-07-08 20:02:11.379633 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : >> slow request 30.268143 seconds old, received at 2016-07-08 20:01:41.111371: >> osd_op(client.895668.0:5302748 45.e2e779c2 >> rbd_data.cc460bc7fc8f.04d8 [stat,write 4145152~516096] snapc >> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw >> locks >> 2016-07-08 20:02:11.379636 7fcd4d8f8700 0 log_channel(cluster) log [WRN] : >> slow request 30.267662 seconds old, received at 2016-07-08 20:01:41.111852: >> osd_op(client.895668.0:5302749 45.e2e779c2 >> rbd_data.cc460bc7fc8f.04d8 [stat,write 4661248~516096] snapc >> 0=[] ack+ondisk+write+known_if_redirected e30969) currently waiting for rw >> locks >> . >> . 
>> (after a lot of same messages) >> 2016-07-08 20:03:53.682828 7fcd3caa5700 1 heartbeat_map is_healthy >> 'OSD::osd_op_tp thread 0x7fcd2d286700' had timed out after 15 >> 2016-07-08 20:03:53.682828 7fcd3caa5700 1 heartbeat_map is_healthy >> 'OSD::osd_op_tp thread 0x7fcd2da87700' had timed out after 15 >> 2016-07-08 20:03:53.682829 7fcd3caa5700 1 heartbeat_map is_healthy >> 'FileStore::op_tp thread 0x7fcd48716700' had timed out after 60 >> 2016-07-08 20:03:53.682830 7fcd3caa5700 1 heartbeat_map is_healthy >> 'FileStore::op_tp thread 0x7fcd47f15700' had timed out after 60 >> . >> . >> (fault with nothing to send, going to standby massages) >> 2016-07-08 20:03:53.708665 7fcd15787700 0 -- 10.200.10.145:6818/6462 >> >> 10.200.10.146:6806/4642 pipe(0x55818727e000 sd=276 :51916 s=2 pgs=2225 cs=1 >> l=0 c=0x558186f61d80).fault with nothing to send, go
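Whatever the underlying cause turns out to be, it helps to capture some state the next time an OSD starts logging those timeouts, before it hits the suicide timeout (osd.15 here being the OSD from the log above; all standard commands, run on the OSD node):

ceph daemon osd.15 dump_ops_in_flight    # what the stuck op threads are waiting on
ceph daemon osd.15 dump_historic_ops     # recently completed slow ops with per-stage timings
iostat -x 1                              # is the underlying disk saturated or erroring?
dmesg | tail                             # btrfs / controller complaints around the time of the timeouts

Also worth saying: most production filestore deployments on this list run XFS; btrfs has a reputation here for fragmenting badly under Ceph's write pattern over time, which tends to show up exactly like this, as op threads timing out in the filestore.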
[ceph-users] Slow performance into windows VM
Hello, guys I to face a task poor performance into windows 2k12r2 instance running on rbd (openstack cluster). RBD disk have a size 17Tb. My ceph cluster consist from: - 3 monitors nodes (Celeron G530/6Gb RAM, DualCore E6500/2Gb RAM, Core2Duo E7500/2Gb RAM). Each node have 1Gbit network to frontend subnet od Ceph cluster - 2 block nodes (Xeon E5620/32Gb RAM/2*1Gbit NIC). Each node have 2*500Gb HDD for operation system and 9*3Tb SATA HDD (WD SE). Total 18 OSD daemons on 2 nodes. Journals placed on same HDD as a rados data. I know that better using for those purpose separate SSD disk. When I test new windows instance performance was good (read/write something about 100Mb/sec). But after I copied 16Tb data to windows instance read performance has down to 10Mb/sec. Type of data on VM - image and video. ceph.conf on client side: [global] auth cluster required = cephx auth service required = cephx auth client required = cephx filestore xattr use omap = true filestore max sync interval = 10 filestore queue max ops = 3000 filestore queue commiting max bytes = 1048576000 filestore queue commiting max ops = 5000 filestore queue max bytes = 1048576000 filestore queue committing max ops = 4096 filestore queue committing max bytes = 16 MiB filestore op threads = 20 filestore flusher = false filestore journal parallel = false filestore journal writeahead = true journal dio = true journal aio = true journal force aio = true journal block align = true journal max write bytes = 1048576000 journal_discard = true osd pool default size = 2 # Write an object n times. osd pool default min size = 1 osd pool default pg num = 333 osd pool default pgp num = 333 osd crush chooseleaf type = 1 [client] rbd cache = true rbd cache size = 67108864 rbd cache max dirty = 50331648 rbd cache target dirty = 33554432 rbd cache max dirty age = 2 rbd cache writethrough until flush = true rados bench show from block node show: rados bench -p scbench 120 write --no-cleanup Total time run: 120.399337 Total writes made: 3538 Write size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 117.542 Stddev Bandwidth: 9.31244 Max bandwidth (MB/sec): 148 Min bandwidth (MB/sec): 92 Average IOPS: 29 Stddev IOPS: 2 Max IOPS: 37 Min IOPS: 23 Average Latency(s): 0.544365 Stddev Latency(s): 0.35825 Max latency(s): 5.42548 Min latency(s): 0.101533 rados bench -p scbench 120 seq Total time run: 120.880920 Total reads made: 1932 Read size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 63.9307 Average IOPS 15 Stddev IOPS: 3 Max IOPS: 25 Min IOPS: 5 Average Latency(s): 0.999095 Max latency(s): 8.50774 Min latency(s): 0.0391591 rados bench -p scbench 120 rand Total time run: 121.059005 Total reads made: 1920 Read size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 63.4401 Average IOPS: 15 Stddev IOPS: 4 Max IOPS: 26 Min IOPS: 1 Average Latency(s): 1.00785 Max latency(s): 6.48138 Min latency(s): 0.038925 On XFS partitions fragmentation no more than 1% On libvirt disk connected so: 4680524c-2c10-47a3-af59-2e1bd12a7ce4 Do anybody some idea? Konstantin___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Slow performance into windows VM
Hello, On Mon, 11 Jul 2016 07:35:02 +0300 K K wrote: > > Hello, guys > > I to face a task poor performance into windows 2k12r2 instance running > on rbd (openstack cluster). RBD disk have a size 17Tb. My ceph cluster > consist from: > - 3 monitors nodes (Celeron G530/6Gb RAM, DualCore E6500/2Gb RAM, > Core2Duo E7500/2Gb RAM). Each node have 1Gbit network to frontend subnet > od Ceph cluster I hope the fastest of these MONs (CPU and storage) has the lowest IP number and thus is the leader. Also what Ceph, OS, kernel version? > - 2 block nodes (Xeon E5620/32Gb RAM/2*1Gbit NIC). Each node have > 2*500Gb HDD for operation system and 9*3Tb SATA HDD (WD SE). Total 18 > OSD daemons on 2 nodes. Two GbE ports, given the "frontend" up there with the MON description I assume that's 1 port per client (front) and cluster (back) network? >Journals placed on same HDD as a rados data. I > know that better using for those purpose separate SSD disk. Indeed... >When I test > new windows instance performance was good (read/write something about > 100Mb/sec). But after I copied 16Tb data to windows instance read > performance has down to 10Mb/sec. Type of data on VM - image and video. > 100MB/s would be absolute perfect with the setup you have, assuming no contention (other clients). Is there any other client on than that Windows VM on your Ceph cluster? > ceph.conf on client side: > [global] > auth cluster required = cephx > auth service required = cephx > auth client required = cephx > filestore xattr use omap = true > filestore max sync interval = 10 > filestore queue max ops = 3000 > filestore queue commiting max bytes = 1048576000 > filestore queue commiting max ops = 5000 > filestore queue max bytes = 1048576000 > filestore queue committing max ops = 4096 > filestore queue committing max bytes = 16 MiB ^^^ Is Ceph understanding this now? Other than that, the queue options aren't likely to do much good with pure HDD OSDs. > filestore op threads = 20 > filestore flusher = false > filestore journal parallel = false > filestore journal writeahead = true > journal dio = true > journal aio = true > journal force aio = true > journal block align = true > journal max write bytes = 1048576000 > journal_discard = true > osd pool default size = 2 # Write an object n times. > osd pool default min size = 1 > osd pool default pg num = 333 > osd pool default pgp num = 333 That should be 512, 1024 really with one RBD pool. http://ceph.com/pgcalc/ > osd crush chooseleaf type = 1 > > [client] > rbd cache = true > rbd cache size = 67108864 > rbd cache max dirty = 50331648 > rbd cache target dirty = 33554432 > rbd cache max dirty age = 2 > rbd cache writethrough until flush = true > > > rados bench show from block node show: Wrong way to test this, test it from a monitor node, another client node (like your openstack nodes). In your 2 node cluster half of the reads or writes will be local, very much skewing your results. > rados bench -p scbench 120 write --no-cleanup Default tests with 4MB "blocks", what are the writes or reads from you client VM like? > Total time run: 120.399337 > Total writes made: 3538 > Write size: 4194304 > Object size: 4194304 > Bandwidth (MB/sec): 117.542 > Stddev Bandwidth: 9.31244 > Max bandwidth (MB/sec): 148 ^^^ That wouldn't be possible from an external client. 
> Min bandwidth (MB/sec): 92 > Average IOPS: 29 > Stddev IOPS: 2 > Max IOPS: 37 > Min IOPS: 23 > Average Latency(s): 0.544365 > Stddev Latency(s): 0.35825 > Max latency(s): 5.42548 Very high max latency, telling us that your cluster ran out of steam at some point. > Min latency(s): 0.101533 > > rados bench -p scbench 120 seq > Total time run: 120.880920 > Total reads made: 1932 > Read size: 4194304 > Object size: 4194304 > Bandwidth (MB/sec): 63.9307 > Average IOPS 15 > Stddev IOPS: 3 > Max IOPS: 25 > Min IOPS: 5 > Average Latency(s): 0.999095 > Max latency(s): 8.50774 > Min latency(s): 0.0391591 > > rados bench -p scbench 120 rand > Total time run: 121.059005 > Total reads made: 1920 > Read size: 4194304 > Object size: 4194304 > Bandwidth (MB/sec): 63.4401 > Average IOPS: 15 > Stddev IOPS: 4 > Max IOPS: 26 > Min IOPS: 1 > Average Latency(s): 1.00785 > Max latency(s): 6.48138 > Min latency(s): 0.038925 > > On XFS partitions fragmentation no more than 1% I'd de-frag anyway, just to rule that out. When doing your tests or normal (busy) operations from the client VM, run atop on your storage nodes and observe your OSD HDDs. Do they get busy, around 100%? Check with iperf or NPtcp that your network to the clients from the storage nodes is fully functional. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph
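To make those two suggestions concrete (pool name as in the thread; adjust to taste): run the benchmark from a node that holds no OSDs, and once with a block size closer to what the VM actually issues, e.g.

rados bench -p scbench 60 write -t 16 -b 4096 --no-cleanup   # 4K writes, run from a compute or monitor node
rados bench -p scbench 60 rand -t 16                         # random reads against the objects left behind

On the pg count: with 18 OSDs, size=2 and the usual target of ~100 PGs per OSD, 18 * 100 / 2 = 900, so 1024 PGs in total across the pools is the natural choice, which matches the 512/1024 suggestion above.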
Re: [ceph-users] Slow performance into windows VM
> I hope the fastest of these MONs (CPU and storage) has the lowest IP > number and thus is the leader. no, the lowest IP has slowest CPU. But zabbix didn't show any load at all mons. > Also what Ceph, OS, kernel version? ubuntu 16.04 kernel 4.4.0-22 > Two GbE ports, given the "frontend" up there with the MON description I > assume that's 1 port per client (front) and cluster (back) network? yes, one GbE for ceph client, one GbE for back network. > Is there any other client on than that Windows VM on your Ceph cluster? Yes, another one instance but without load. > Is Ceph understanding this now? > Other than that, the queue options aren't likely to do much good with pure >HDD OSDs. I can't find those parameter in running config: ceph --admin-daemon /var/run/ceph/ceph-mon.block01.asok config show|grep "filestore_queue" "filestore_queue_max_ops": "3000", "filestore_queue_max_bytes": "1048576000", "filestore_queue_max_delay_multiple": "0", "filestore_queue_high_delay_multiple": "0", "filestore_queue_low_threshhold": "0.3", "filestore_queue_high_threshhold": "0.9", > That should be 512, 1024 really with one RBD pool. Yes, I know. Today for test I added scbench pool with 128 pg There are output status and osd tree: ceph status cluster 830beb43-9898-4fa9-98c1-ee04c1cdf69c health HEALTH_OK monmap e6: 3 mons at {block01=10.30.9.21:6789/0,object01=10.30.9.129:6789/0,object02=10.30.9.130:6789/0} election epoch 238, quorum 0,1,2 block01,object01,object02 osdmap e6887: 18 osds: 18 up, 18 in pgmap v9738812: 1280 pgs, 3 pools, 17440 GB data, 4346 kobjects 35049 GB used, 15218 GB / 50267 GB avail 1275 active+clean 3 active+clean+scrubbing+deep 2 active+clean+scrubbing client io 5030 kB/s rd, 1699 B/s wr, 19 op/s rd, 0 op/s wr ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -1 54.0 root default -2 27.0 host cn802 0 3.0 osd.0 up 1.0 1.0 2 3.0 osd.2 up 1.0 1.0 4 3.0 osd.4 up 1.0 1.0 6 3.0 osd.6 up 0.89995 1.0 8 3.0 osd.8 up 1.0 1.0 10 3.0 osd.10 up 1.0 1.0 12 3.0 osd.12 up 0.8 1.0 16 3.0 osd.16 up 1.0 1.0 18 3.0 osd.18 up 0.90002 1.0 -3 27.0 host cn803 1 3.0 osd.1 up 1.0 1.0 3 3.0 osd.3 up 0.95316 1.0 5 3.0 osd.5 up 1.0 1.0 7 3.0 osd.7 up 1.0 1.0 9 3.0 osd.9 up 1.0 1.0 11 3.0 osd.11 up 0.95001 1.0 13 3.0 osd.13 up 1.0 1.0 17 3.0 osd.17 up 0.84999 1.0 19 3.0 osd.19 up 1.0 1.0 > Wrong way to test this, test it from a monitor node, another client node > (like your openstack nodes). > In your 2 node cluster half of the reads or writes will be local, very > much skewing your results. I have been tested from copmute node also and have same result. 80-100Mb/sec > Very high max latency, telling us that your cluster ran out of steam at some point. I copying data from my windows instance right now. > I'd de-frag anyway, just to rule that out. >When doing your tests or normal (busy) operations from the client VM, run > atop on your storage nodes and observe your OSD HDDs. > Do they get busy, around 100%? Yes, high IO load (600-800 io). But this is very strange on SATA HDD. All HDD have own OSD daemon and presented in OS as hardware RAID0(each block node have hardware RAID). 
Example: avg-cpu: %user %nice %system %iowait %steal %idle 1.44 0.00 3.56 17.56 0.00 77.44 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 649.00 0.00 82912.00 0.00 255.51 8.30 12.74 12.74 0.00 1.26 81.60 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdf 0.00 0.00 761.00 0.00 94308.00 0.00 247.85 8.66 11.26 11.26 0.00 1.18 90.00 sdg 0.00 0.00 761.00 0.00 97408.00 0.00 256.00 7.80 10.22 10.22 0.00 1.01 76.80 sdh 0.00 0.00 801.00 0.00 102344.00 0.00 255.54 8.05 10.05 10.05 0.00 0.96 76.80 sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdj 0.00 0.00 537.00 0.00 68736.00 0.00 256.00 5.54 10.26 10.26 0.00 0.98 52.80 > Check with iperf or NPtcp that your network to the clients from the > storage nodes is fully functional. The network have been tested by iperf. 950-970Mbit among all nodes in clustes (openstack and ceph) Понедельник, 11 июля 2016, 10:58 +05:00 от Christian Balzer : > > >Hello, > >On Mon, 11 Jul 2016 07:35:02 +0300 K K wrote: > >> >> Hello, guys >> >> I to face a task poor performance into windows 2k12r2 instance running >> on rbd (openstack cluster). RBD disk have a size 17Tb. My ceph cluster >> consist from: >> - 3 monitors nodes (Celeron G530/6Gb RAM, DualCore E6500/2Gb RAM, >> Core2Duo E7500/2Gb RAM). Each node have 1Gbit network to frontend subnet >> od Ceph cluster > >I hope the fastest of these MONs (CPU and storage) has the lowest IP >number and thus is th
[ceph-users] drop i386 support
Hi Cephers,

I am proposing that we drop support for i386, as we don't compile Ceph with any i386 gitbuilder now [1] and hence don't test i386 builds on sepia on a regular basis. Also, based on the assumption that people don't use i386 in production, I think we can drop it from the minimum hardware document [2].

If we decide to drop i386 support, we won't explicitly disable the i386 build in code, as we always try to be portable where possible; we just won't claim i386 as an officially supported arch anymore.

What do you think?

---
[1] http://ceph.com/gitbuilder.cgi
[2] http://docs.ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations

--
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com