[ceph-users] PG Calculation query
Hello, We are facing a performance issue with rados benchmarking on a 5 node cluster, comparing PG num 4096 vs 8192. As per the PG calculation, below is our specification:

Size  OSDs  %Data  Target PGs per OSD  PG count
5     340   100    100                 8192
5     340   100    50                  4096

We got better performance with 4096 PGs than with 8192.

With PG count 4096:

Filesize                   256000     512000     1024000    2048000    4096000    12288000
Write bandwidth (MB/s)     1448.38    2503.98    3941.42    5354.7     5333.9     5271.16
Read bandwidth (MB/s)      2924.83    3417.9     4236.65    4469.4     4602.65    4584.6
Write average latency (s)  0.088355   0.102214   0.129855   0.191155   0.377685   1.13953
Write maximum latency (s)  0.280164   0.485391   1.15953    13.5175    27.9876    86.3103
Read average latency (s)   0.0437188  0.0747644  0.120604   0.228535   0.436566   1.30415
Read maximum latency (s)   1.13067    3.21548    2.99734    4.08429    9.0224     16.6047

Average IOPS:
#grep "op/s" cephio_0%.txt | awk 'NF { print $(NF - 1) }'| awk '{ total += $0 } END { print total/NR }'
7517.49

With PG count 8192:

Filesize                   256000     512000     1024000    2048000    4096000    12288000
Write bandwidth (MB/s)     534.749    1020.49    1864.58    3100.92    4717.23    5251.76
Read bandwidth (MB/s)      1615.56    2764.25    4061.55    4265.39    4229.38    4042.18
Write average latency (s)  0.239263   0.250769   0.27448    0.328981   0.427056   1.14352
Write maximum latency (s)  9.21752    10.3353    10.8132    11.2135    12.5497    44.8133
Read average latency (s)   0.0791822  0.0925167  0.12583    0.239571   0.475198   1.47916
Read maximum latency (s)   2.01021    2.29139    3.60456    3.8435     7.43755    37.6106

Average IOPS:
#grep "op/s" cephio_0%.txt | awk 'NF { print $(NF - 1) }'| awk '{ total += $0 } END { print total/NR }'
4970.26

With 4096 PGs - average IOPS - 7517
With 8192 PGs - average IOPS - 4970

For the smaller sizes, performance with 8192 PGs is badly affected. Per our plans we are not adding any nodes in the future, and we usually select 'Target PGs per OSD' as 100 rather than 200/300. We would welcome comments on how to choose the appropriate PG count for a cluster of this size.

ENV:
Kraken - 11.2.0 - bluestore, EC 4+1
RHEL 7.3, kernel 3.10.0-514.10.2.el7.x86_64
5 nodes - 5x68 - 340 OSDs

Thanks
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
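For readers following the arithmetic: the PG counts in the table above match the usual rule of thumb of (OSD count x target PGs per OSD) / pool size, rounded up to the next power of two. A minimal sketch of that calculation (variable names are ours; pool size 5 assumes the EC 4+1 profile listed in the environment):

    osds=340
    target_per_osd=100        # use 50 for the second row of the table
    pool_size=5               # EC 4+1 => k+m = 5 shards per PG
    raw=$(( osds * target_per_osd / pool_size ))            # 6800 (or 3400)
    pg=1; while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
    echo "$pg"                # 8192 (or 4096)

The benchmark numbers above suggest that, for a cluster that will not grow, targeting ~100 PGs per OSD (the lower of the two resulting counts) is the safer choice here.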
Re: [ceph-users] Questions on rbd-mirror
Hi Fulvio,

On 03/24/2017 07:19 PM, Fulvio Galeazzi wrote:

Hallo, apologies for my (silly) questions, I did try to find some doc on rbd-mirror but was unable to, apart from a number of pages explaining how to install it. My environment is CentOS 7 and Ceph 10.2.5. Can anyone help me understand a few minor things:

- is there a cleaner way to configure the user which will be used for rbd-mirror, other than editing the ExecStart in file /usr/lib/systemd/system/ceph-rbd-mirror@.service? For example some line in ceph.conf... looks like the username defaults to the cluster name, am I right?

It should just be "ceph", no matter what the cluster name is, if I read the code correctly.

- is it possible to throttle mirroring? Sure, it's a crazy thing to do for "cinder" pools, but may make sense for slowly changing ones, like a "glance" pool.

The rbd core team is working on this. Jason, right?

- is it possible to set per-pool default features? I read about "rbd default features = ###" but this is a global setting. (Ok, I can still restrict the pools to be mirrored with "ceph auth" for the user doing the mirroring.)

"per-pool default features" sounds like a reasonable feature request. About the "ceph auth" for mirroring, I am working on an rbd ACL design and will consider pool-level, namespace-level and image-level; then I think we can do a permission check on this.

Thanx
Yang

Thanks!

Fulvio

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD image perf counters: usage, access
Hi Yang, Thank you for your reply. This is very useful indeed that there are many ImageCtx objects for one image. But in my setting, I don't have any particular ceph client connected to ceph (I could, but this is not the point). I'm trying to get metrics for particular image while not performing anything with it myself. And I'm trying to get access to performance counters listed in the ImageCtx class, they don't seem to be reported by the perf tool. Thanks! On 27/03/17 12:29, Dongsheng Yang wrote: Hi Masha you can get the counters by perf dump command on the asok file of your client. such as that: $ ceph --admin-daemon out/client.admin.9921.asok perf dump|grep rd "rd": 656754, "rd_bytes": 656754, "rd_latency": { "discard": 0, "discard_bytes": 0, "discard_latency": { "omap_rd": 0, But, note that, this is a counter of this one ImageCtx, but not the counter for this image. There are possible several ImageCtxes reading or writing on the same image. Yang On 03/27/2017 12:23 PM, Masha Atakova wrote: Hi everyone, I was going around trying to figure out how to get ceph metrics on a more detailed level than daemons. Of course, I found and explored API for watching rados objects, but I'm more interested in getting metrics about RBD images. And while I could get list of objects for particular image, and then watch all of them, it doesn't seem like very efficient way to go about it. I checked librbd API and there isn't anything helping with my goal. So I went through the source code and found list of performance counters for image which are incremented by other parts of ceph when making corresponding operations: https://github.com/ceph/ceph/blob/master/src/librbd/ImageCtx.cc#L364 I have 2 questions about it: 1) is there any workaround to use those counters right now? maybe when compiling against ceph the code doing it. Looks like I need to be able to access particular ImageCtx object (instead of creating my own), and I just can't find appropriate class / part of the librbd allowing me to do so. 2) are there any plans on making those counters accessible via API like librbd or librados? I see that these questions might be more appropriate for the devel list, but: - it seems to me that question of getting ceph metrics is more interesting for those who use ceph - I couldn't subscribe to it with an error provided below. Thanks! majord...@vger.kernel.org: SMTP error from remote server for MAIL FROM command, host: vger.kernel.org (209.132.180.67) reason: 553 5.7.1 Hello [74.208.4.201], for your MAIL FROM address policy analysis reported: Your address is not liked source for email --- The header of the original message is following. 
--- [raw SMTP headers of the bounced subscription message omitted] ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD image perf counters: usage, access
On 03/27/2017 04:06 PM, Masha Atakova wrote: Hi Yang, Hi Masha, Thank you for your reply. This is very useful indeed that there are many ImageCtx objects for one image. But in my setting, I don't have any particular ceph client connected to ceph (I could, but this is not the point). I'm trying to get metrics for particular image while not performing anything with it myself. The perf counter you mentioned in your first mail, is just for one particular image client, that means, these perf counter will disappear as the client disconnected. And I'm trying to get access to performance counters listed in the ImageCtx class, they don't seem to be reported by the perf tool. Do you mean get the perf counters via api? At first this counter is only for a particular ImageCtx (connected client), then you can read the counters by the perf dump command in my last mail I think. If you want to get the performance counter for an image (no matter how many ImageCtx, connected or disconnected), maybe you need to wait this one: http://pad.ceph.com/p/ceph-top Yang Thanks! On 27/03/17 12:29, Dongsheng Yang wrote: Hi Masha you can get the counters by perf dump command on the asok file of your client. such as that: $ ceph --admin-daemon out/client.admin.9921.asok perf dump|grep rd "rd": 656754, "rd_bytes": 656754, "rd_latency": { "discard": 0, "discard_bytes": 0, "discard_latency": { "omap_rd": 0, But, note that, this is a counter of this one ImageCtx, but not the counter for this image. There are possible several ImageCtxes reading or writing on the same image. Yang On 03/27/2017 12:23 PM, Masha Atakova wrote: Hi everyone, I was going around trying to figure out how to get ceph metrics on a more detailed level than daemons. Of course, I found and explored API for watching rados objects, but I'm more interested in getting metrics about RBD images. And while I could get list of objects for particular image, and then watch all of them, it doesn't seem like very efficient way to go about it. I checked librbd API and there isn't anything helping with my goal. So I went through the source code and found list of performance counters for image which are incremented by other parts of ceph when making corresponding operations: https://github.com/ceph/ceph/blob/master/src/librbd/ImageCtx.cc#L364 I have 2 questions about it: 1) is there any workaround to use those counters right now? maybe when compiling against ceph the code doing it. Looks like I need to be able to access particular ImageCtx object (instead of creating my own), and I just can't find appropriate class / part of the librbd allowing me to do so. 2) are there any plans on making those counters accessible via API like librbd or librados? I see that these questions might be more appropriate for the devel list, but: - it seems to me that question of getting ceph metrics is more interesting for those who use ceph - I couldn't subscribe to it with an error provided below. Thanks! majord...@vger.kernel.org: SMTP error from remote server for MAIL FROM command, host: vger.kernel.org (209.132.180.67) reason: 553 5.7.1 Hello [74.208.4.201], for your MAIL FROM address policy analysis reported: Your address is not liked source for email --- The header of the original message is following. 
--- [raw SMTP headers of the bounced subscription message omitted]
[ceph-users] Kraken + Bluestore
Hi, Does anyone have any cluster of a decent scale running on Kraken and bluestore? How are you finding it? Have you had any big issues arise? Was it running non bluestore before and have you noticed any improvement? Read ? Write? IOPS? ,Ashley Sent from my iPhone ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] New hardware for OSDs
Hello all, we are currently in the process of buying new hardware to expand an existing Ceph cluster that already has 1200 osds. We are currently using 24 * 4 TB SAS drives per osd with an SSD journal shared among 4 osds. For the upcoming expansion we were thinking of switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to drive down space and cost requirements. Has anyone any experience in mid-sized/large-sized deployment using such hard drives? Our main concern is the rebalance time but we might be overlooking some other aspects. We currently use the cluster as storage for openstack services: Glance, Cinder and VMs' ephemeral disks. Thanks in advance for any advice. Mattia ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recompiling source code - to find exact RPM
Hey Brad, Many thanks for the explanation... > ~~~ > WARNING: the following dangerous and experimental features are enabled: > ~~~ > Can I ask why you want to disable this warning? We using bluestore with kraken, we are aware that this is in tech preview. To hide these warning compiled like this. Thanks On Mon, Mar 27, 2017 at 5:04 AM, Brad Hubbard wrote: > > > On Fri, Mar 24, 2017 at 6:49 PM, nokia ceph > wrote: > > Brad, cool now we are on the same track :) > > > > So whatever change we made after this location src/* as it mapped to > > respective rpm correct? > > > > For eg:- > > src/osd/* -- ceph-osd > > src/common - ceph-common > > src/mon - ceph-mon > > src/mgr - ceph-mgr > > I think this is true in most, if not all, cases. > > > > > Since we are using bluestore with kraken, I though to disable the below > > warning while triggering `ceph -s` > > > > ~~~ > > WARNING: the following dangerous and experimental features are enabled: > > ~~~ > > Can I ask why you want to disable this warning? > > > > > Here I made a comment in this file > > > >>vim src/common/ceph_context.cc > > 307 // if (!cct->_experimental_features.empty()) > > 308 // lderr(cct) << "WARNING: the following dangerous and > experimental > > features are enabled: " > > 309 // << cct->_experimental_features << dendl; > > Right. > > > > > As per my assumption, the change should reflect in this binary > > "ceph-common" > > libceph-common specifically. > > > > > But when I closely looked on librados library as these warning showing > here > > also. > > #strings -a /usr/lib64/librados.so.2 | grep dangerous > > WARNING: the following dangerous and experimental features are enabled: > --> > > > > Then I conclude for this change ceph-common and librados were required. > > > > Please correct me if I'm wrong. > > So I looked at this on current master built on Fedora and see the > following. > > $ for lib in $(find . \! -type l -type f -name lib\*); do strings > $lib|grep "following dangerous and experimenta l"; if [ $? -eq 0 ]; then > echo $lib; fi; done > WARNING: the following dangerous and experimental features are enabled: > ./libcephd.a > WARNING: the following dangerous and experimental features are enabled: > ./libceph-common.so.0 > WARNING: the following dangerous and experimental features are enabled: > ./libcommon.a > > So in my case the only shared object that has this string is > libceph-common. > However, that library is dynamically linked to libceph-common. > > $ ldd librados.so.2.0.0|grep libceph-common > libceph-common.so.0 => > /home/brad/working/src/ceph/build/lib/libceph-common.so.0 > (0x7faa2cf42000) > > I checked a rhel version and sure enough the string is there, because in > that > version on rhel/CentOS we statically linked libcommon.a into librados IIRC. > > # ldd librados.so.2.0.0|grep libceph-common > # > > So if the string shows up in your librados then I'd suggest it is also > statically linked ([1] we only changed this fairly recently) and you will > need > to replace it to reflect your change. > > [1] https://github.com/ceph/ceph/commit/8f7643792c9e6a3d1ba4a06ca7d09b > 0de9af1443 > > > > > On Fri, Mar 24, 2017 at 5:41 AM, Brad Hubbard > wrote: > >> > >> Oh wow, I completely misunderstood your question. > >> > >> Yes, src/osd/PG.cc and src/osd/PG.h are compiled into the ceph-osd > binary > >> which > >> is included in the ceph-osd rpm as you said in your OP. 
> >> > >> On Fri, Mar 24, 2017 at 3:10 AM, nokia ceph > >> wrote: > >> > Hello Piotr, > >> > > >> > I didn't understand, could you please elaborate about this procedure > as > >> > mentioned in the last update. It would be really helpful if you share > >> > any > >> > useful link/doc to understand what you actually meant. Yea correct, > >> > normally > >> > we do this procedure but it takes more time. But here my intention is > to > >> > how > >> > to find out the rpm which caused the change. I think we are in > opposite > >> > direction. > >> > > >> >>> But wouldn't be faster and/or more convenient if you would just > >> >>> recompile > >> >>> binaries in-place (or use network symlinks) > >> > > >> > Thanks > >> > > >> > > >> > > >> > On Thu, Mar 23, 2017 at 6:47 PM, Piotr Dałek < > piotr.da...@corp.ovh.com> > >> > wrote: > >> >> > >> >> On 03/23/2017 02:02 PM, nokia ceph wrote: > >> >> > >> >>> Hello Piotr, > >> >>> > >> >>> We do customizing ceph code for our testing purpose. It's a part of > >> >>> our > >> >>> R&D :) > >> >>> > >> >>> Recompiling source code will create 38 rpm's out of these I need to > >> >>> find > >> >>> which one is the correct rpm which I made change in the source code. > >> >>> That's > >> >>> what I'm try to figure out. > >> >> > >> >> > >> >> Yes, I understand that. But wouldn't be faster and/or more convenient > >> >> if > >> >> you would just recompile binaries in-place (or use network symlinks) > >> >> instead > >> >> of packaging entire Ceph and (re)installing its packages each
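A quick way to confirm which installed package a given binary or library came from (useful when deciding which rebuilt RPM actually needs replacing) is rpm's owning-package query; a small sketch, with example paths:

    rpm -qf /usr/bin/ceph-osd              # which package owns the ceph-osd binary
    rpm -qf /usr/lib64/librados.so.2       # which package owns librados
    rpm -ql ceph-common | grep /usr/lib64  # libraries shipped in ceph-common

Combined with the strings/ldd checks Brad shows above, this narrows a rebuild down to the packages that actually contain the changed code.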
Re: [ceph-users] New hardware for OSDs
Hello, On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote: > Hello all, > we are currently in the process of buying new hardware to expand an > existing Ceph cluster that already has 1200 osds. That's quite sizable, is the expansion driven by the need for more space (big data?) or to increase IOPS (or both)? > We are currently using 24 * 4 TB SAS drives per osd with an SSD journal > shared among 4 osds. For the upcoming expansion we were thinking of > switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to > drive down space and cost requirements. > > Has anyone any experience in mid-sized/large-sized deployment using such > hard drives? Our main concern is the rebalance time but we might be > overlooking some other aspects. > If you researched the ML archives, you should already know to stay well away from SMR HDDs. Both HGST and Seagate have large Enterprise HDDs that have journals/caches (MediaCache in HGST speak IIRC) that drastically improve write IOPS compared to plain HDDs. Even with SSD journals you will want to consider those, as these new HDDs will see at least twice the action than your current ones. Rebalance time is a concern of course, especially if your cluster like most HDD based ones has these things throttled down to not impede actual client I/O. To get a rough idea, take a look at: https://www.memset.com/tools/raid-calculator/ For Ceph with replication 3 and the typical PG distribution, assume 100 disks and the RAID6 with hotspares numbers are relevant. For rebuild speed, consult your experience, you must have had a few failures. ^o^ For example with a recovery speed of 100MB/s, a 1TB disk (used data with Ceph actually) looks decent at 1:16000 DLO/y. At 5TB though it enters scary land Christian > We currently use the cluster as storage for openstack services: Glance, > Cinder and VMs' ephemeral disks. > > Thanks in advance for any advice. > > Mattia > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] leveldb takes a lot of space
> Op 26 maart 2017 om 9:44 schreef Niv Azriel : > > > after network issues, ceph cluster fails. > leveldb grows and takes a lot of space > ceph mon cant write to leveldb because there is not enough space on > filesystem. > (there is a lot of ldb file on /var/lib/ceph/mon) > It is normal that the database will grow as the MON will keep all historic OSDMaps when one or more PGs are not active+clean > ceph compact on start is not helping. > my erasure-code is too big. > > how to fix it? Make sure you have enough space available on your MONs, that is the main advise. Under normal operations <2GB should be enough, but it can grow much bigger. On most clusters I design I make sure there is at least 200GB of space available on each MON on a fast DC-grade SSD. Wido > thanks in advanced > > ceph version: jewel > os : ubuntu16.04 > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
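For completeness, the two compaction knobs usually mentioned for an over-grown monitor store are the on-start option and the online compact command; a minimal sketch (monitor id "a" and the paths are just examples):

    # ceph.conf, [mon] section - compact the store every time the mon starts
    mon compact on start = true

    # or trigger a compaction on a running monitor
    ceph tell mon.a compact

    # keep an eye on the store size
    du -sh /var/lib/ceph/mon/*/store.db

As Wido notes, compaction only really helps once the PGs are active+clean again; while they are not, the old maps are kept on purpose.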
Re: [ceph-users] New hardware for OSDs
> Op 27 maart 2017 om 13:22 schreef Christian Balzer : > > > > Hello, > > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote: > > > Hello all, > > we are currently in the process of buying new hardware to expand an > > existing Ceph cluster that already has 1200 osds. > > That's quite sizable, is the expansion driven by the need for more space > (big data?) or to increase IOPS (or both)? > > > We are currently using 24 * 4 TB SAS drives per osd with an SSD journal > > shared among 4 osds. For the upcoming expansion we were thinking of > > switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to > > drive down space and cost requirements. > > > > Has anyone any experience in mid-sized/large-sized deployment using such > > hard drives? Our main concern is the rebalance time but we might be > > overlooking some other aspects. > > > > If you researched the ML archives, you should already know to stay well > away from SMR HDDs. > Amen! Just don't. Stay away from SMR with Ceph. > Both HGST and Seagate have large Enterprise HDDs that have > journals/caches (MediaCache in HGST speak IIRC) that drastically improve > write IOPS compared to plain HDDs. > Even with SSD journals you will want to consider those, as these new HDDs > will see at least twice the action than your current ones. > I also have good experiences with bcache on NVM-E device in Ceph clusters. A single Intel P3600/P3700 which is the caching device for bcache. > Rebalance time is a concern of course, especially if your cluster like > most HDD based ones has these things throttled down to not impede actual > client I/O. > > To get a rough idea, take a look at: > https://www.memset.com/tools/raid-calculator/ > > For Ceph with replication 3 and the typical PG distribution, assume 100 > disks and the RAID6 with hotspares numbers are relevant. > For rebuild speed, consult your experience, you must have had a few > failures. ^o^ > > For example with a recovery speed of 100MB/s, a 1TB disk (used data with > Ceph actually) looks decent at 1:16000 DLO/y. > At 5TB though it enters scary land > Yes, those recoveries will take a long time. Let's say your 6TB drive is filled for 80% you need to rebalance 4.8TB 4.8TB / 100MB/sec = 13 hours rebuild time 13 hours is a long time. And you will probably not have 100MB/sec sustained, I think that 50MB/sec is much more realistic. That means that a single disk failure will take >24 hours to recover from a rebuild. I don't like very big disks that much. Not in RAID, not in Ceph. Wido > Christian > > > We currently use the cluster as storage for openstack services: Glance, > > Cinder and VMs' ephemeral disks. > > > > Thanks in advance for any advice. > > > > Mattia > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
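To replay the rebuild-time arithmetic above with other disk sizes and recovery rates, here is a small sketch (the 4.8 TB and 100 MB/s figures are the ones from this thread; adjust to taste):

    used_tb=4.8     # data to re-replicate, e.g. a 6TB OSD that was 80% full
    rate_mb=100     # assumed sustained recovery rate in MB/s
    echo "$used_tb $rate_mb" | awk '{ printf "%.1f hours\n", $1 * 1000 * 1000 / $2 / 3600 }'
    # -> 13.3 hours at 100 MB/s, roughly double that at 50 MB/s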
Re: [ceph-users] object store backup tool recommendations
Thanks for the useful reply Robin and sorry for not getting back sooner... > On Fri, Mar 03, 2017 at 18:01:00 +, Robin H. Johnson wrote: > On Fri, Mar 03, 2017 at 10:55:06 +1100, Blair Bethwaite wrote: >> Does anyone have any recommendations for good tools to perform >> file-system/tree backups and restores to/from a RGW object store (Swift or >> S3 APIs)? Happy to hear about both FOSS and commercial options please. > This isn't Ceph specific, but is something that has come up for me, and > I did a lot of research into it for the Gentoo distribution to use on > it's infrastructure. > The wiki page with all of our needs & contenders is here: > https://wiki.gentoo.org/wiki/Project:Infrastructure/Backups_v3 That's a useful resource. > TL;DR: restic is probably the closest fit to your needs, but do evaluate > it carefully. Yes I agree, restic does look like a decent fit and we are planning to trial it soon. Though it took me a while to find that it does in fact support object storage, as that info is buried in the usage docs and (I thought somewhat bizarrely) not mentioned as a prominent feature. Anybody else have recommendations? I'm surprised there were not more suggestions, perhaps (OpenStack-)Swift users will have some suggestions... -- Cheers, ~Blairo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
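For anyone else evaluating restic against RGW: it talks to any S3-compatible endpoint, so pointing it at radosgw only requires the endpoint URL in the repository string. A minimal sketch (the host, bucket and paths are made up; the credentials are an RGW S3 key pair):

    export AWS_ACCESS_KEY_ID=<rgw-access-key>
    export AWS_SECRET_ACCESS_KEY=<rgw-secret-key>
    restic -r s3:https://rgw.example.com/backup-bucket init          # create the repository
    restic -r s3:https://rgw.example.com/backup-bucket backup /srv/data
    restic -r s3:https://rgw.example.com/backup-bucket snapshots     # list backed-up versions

Deduplicated, versioned snapshots cover the differential/restore points from the original requirements list; how well owner/group/xattr metadata is preserved is the part worth verifying in a trial.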
Re: [ceph-users] object store backup tool recommendations
I suppose the other option here, which I initially dismissed because Red Hat are not supporting it, is to have a CephFS dir/tree bound to a cache-tier fronted EC pool. Is anyone having luck with such a setup? On 3 March 2017 at 21:40, Blair Bethwaite wrote: > Hi Marc, > > Whilst I agree CephFS would probably help compared to your present solution, > what I'm looking for something that can talk to a the RadosGW restful object > storage APIs, so that the backing storage can be durable and low-cost, i.e., > on an erasure coded pool. In this case we're looking to backup a Lustre > filesystem. > > Cheers, > > On 3 March 2017 at 21:29, Marc Roos wrote: >> >> >> Hi Blair, >> >> We are also thinking of using ceph for 'backup'. At the moment we are >> using rsync and hardlinks on a drbd setup. But I think when using cephfs >> things could speed up, because file information is gotten from the mds >> daemon, so this should save on one rsync file lookup, and we expect that >> we can run more tasks in parallel. >> >> >> >> >> >> -Original Message- >> From: Blair Bethwaite [mailto:blair.bethwa...@gmail.com] >> Sent: vrijdag 3 maart 2017 0:55 >> To: ceph-users@lists.ceph.com >> Subject: [ceph-users] object store backup tool recommendations >> >> Hi all, >> >> Does anyone have any recommendations for good tools to perform >> file-system/tree backups and restores to/from a RGW object store (Swift >> or S3 APIs)? Happy to hear about both FOSS and commercial options >> please. >> >> I'm interested in: >> 1) tools known to work or not work at all for a basic file-based data >> backup >> >> Plus these extras: >> 2) preserves/restores correct file metadata (e.g. owner, group, acls >> etc) >> 3) preserves/restores xattrs >> 4) backs up empty directories and files >> 5) supports some sort of snapshot/versioning/differential functionality, >> i.e., will keep a copy or diff or last N versions of a file or whole >> backup set, e.g., so that one can restore yesterday's file/s or last >> week's but not have to keep two full copies to achieve it >> 6) is readily able to restore individual files >> 7) can encrypt/decrypt client side >> >> 8) anything else I should be considering >> >> -- >> >> Cheers, >> ~Blairo >> >> > > > > -- > Cheers, > ~Blairo -- Cheers, ~Blairo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSDs cannot match up with fast OSD map changes (epochs) during recovery
> Op 27 maart 2017 om 8:41 schreef Muthusamy Muthiah > : > > > Hi Wido, > > Yes slow map update was happening and CPU hitting 100%. So it indeed seems you are CPU bound at that moment. That's indeed a problem when you have a lot of map changes to work through on the OSDs. It's recommended to have 1 CPU core per OSD as during recovery/boot this power is needed badly by the OSDs. > We also tried to set noup flag to true so that the cluster osdmap remained > in same version . This made each OSD updated to the current map slowly . At > one point we lost patience due to critical timelines and re-insalled the > cluster. However we plan to do this recovery again and find optimum > procedure for recovery . The noup flag can indeed 'help' here to prevent new maps from being produced. > Sage was commenting that there is another solution available in Luminous > which would recover the OSDs at much faster rate than the current one by > skipping some maps instead of going in sequential way. I am not aware of those improvements. Sage (or another dev) would need to comment on that. Wido > > Thanks, > Muthu > > On 20 March 2017 at 22:13, Wido den Hollander wrote: > > > > > > Op 18 maart 2017 om 10:39 schreef Muthusamy Muthiah < > > muthiah.muthus...@gmail.com>: > > > > > > > > > Hi, > > > > > > We had similar issue on one of the 5 node cluster cluster again during > > > recovery(200/335 OSDs are to be recovered) , we see a lot of differences > > > in the OSDmap epocs between OSD which is booting and the current one same > > > is below, > > > > > > - In the current situation the OSD are trying to register with > > an > > > old OSDMAP version *7620 * but he current version in the cluster is > > > higher *13102 > > > *version – as a result it takes longer for OSD to update to this version > > .. > > > > > > > Do you see these OSDs eating 100% CPU at that moment? Eg, could it be that > > the CPUs are not fast enough to process all the map updates quick enough. > > > > iirc map updates are not processed multi-threaded. > > > > Wido > > > > > > > > We also see 2017-03-18 09:19:04.628206 7f2056735700 0 -- > > > 10.139.4.69:6836/777372 >> - conn(0x7f20c1bfa800 :6836 > > > s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to > > > send and in the half accept state just closed messages on many osds which > > > are recovering. > > > > > > Suggestions would be helpful. > > > > > > > > > Thanks, > > > > > > Muthu > > > > > > On 13 February 2017 at 18:14, Wido den Hollander wrote: > > > > > > > > > > > > Op 13 februari 2017 om 12:57 schreef Muthusamy Muthiah < > > > > muthiah.muthus...@gmail.com>: > > > > > > > > > > > > > > > Hi All, > > > > > > > > > > We also have same issue on one of our platforms which was upgraded > > from > > > > > 11.0.2 to 11.2.0 . The issue occurs on one node alone where CPU hits > > 100% > > > > > and OSDs of that node marked down. Issue not seen on cluster which > > was > > > > > installed from scratch with 11.2.0. > > > > > > > > > > > > > How many maps is this OSD behind? > > > > > > > > Does it help if you set the nodown flag for a moment to let it catch > > up? 
> > > > > > > > Wido > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *[r...@cn3.c7.vna ~] # systemctl start ceph-osd@315.service > > > > > [r...@cn3.c7.vna ~] # cd /var/log/ceph/ > > > > > [r...@cn3.c7.vna ceph] # tail -f *osd*315.log 2017-02-13 > > 11:29:46.752897 > > > > > 7f995c79b940 0 > > > > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_ > > > > 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/ > > > > centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ > > > > ceph-11.2.0/src/cls/hello/cls_hello.cc:296: > > > > > loading cls_hello 2017-02-13 11:29:46.753065 7f995c79b940 0 > > _get_class > > > > not > > > > > permitted to load kvs 2017-02-13 11:29:46.757571 7f995c79b940 0 > > > > _get_class > > > > > not permitted to load lua 2017-02-13 11:29:47.058720 7f995c79b940 0 > > > > > osd.315 44703 crush map has features 288514119978713088, adjusting > > msgr > > > > > requires for clients 2017-02-13 11:29:47.058728 7f995c79b940 0 > > osd.315 > > > > > 44703 crush map has features 288514394856620032 was 8705, adjusting > > msgr > > > > > requires for mons 2017-02-13 11:29:47.058732 7f995c79b940 0 osd.315 > > > > 44703 > > > > > crush map has features 288531987042664448, adjusting msgr requires > > for > > > > osds > > > > > 2017-02-13 11:29:48.343979 7f995c79b940 0 osd.315 44703 load_pgs > > > > > 2017-02-13 11:29:55.913550 7f995c79b940 0 osd.315 44703 load_pgs > > opened > > > > > 130 pgs 2017-02-13 11:29:55.913604 7f995c79b940 0 osd.315 44703 > > using 1 > > > > op > > > > > queue with priority op cut off at 64. 2017-02-13 11:29:55.914102 > > > > > 7f995c79b940 -1 osd.315 44703 log_to_monitors {def
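For reference, the flag and checks mentioned in this thread: noup keeps rebooted OSDs from being marked up (and from serving I/O) while they chew through old maps, and the OSD admin socket shows how far behind a given OSD still is. A short sketch, using osd.315 from the log above as the example daemon:

    ceph osd set noup            # let booting OSDs catch up on maps first
    ceph daemon osd.315 status   # run on the OSD's host; shows "oldest_map" / "newest_map"
    ceph osd stat                # current cluster osdmap epoch, for comparison
    ceph osd unset noup          # once the epochs have converged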
Re: [ceph-users] leveldb takes a lot of space
@ Niv Azriel: What is your leveldb version, and has it been fixed now?

@ Wido den Hollander: I have also met a similar problem: the size of my leveldb is about 17GB (300+ osds); there are a lot of sst files (each sst file is 2MB) in /var/lib/ceph/mon. (A network abnormality happened once.) The leveldb version is 1.2 (Ubuntu 12.04, ceph 0.94.5).

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido den Hollander Sent: 27 March 2017 19:30 To: ceph-users@lists.ceph.com; Niv Azriel Subject: Re: [ceph-users] leveldb takes a lot of space > Op 26 maart 2017 om 9:44 schreef Niv Azriel : > > > after network issues, ceph cluster fails. > leveldb grows and takes a lot of space ceph mon cant write to leveldb > because there is not enough space on filesystem. > (there is a lot of ldb file on /var/lib/ceph/mon) > It is normal that the database will grow as the MON will keep all historic OSDMaps when one or more PGs are not active+clean > ceph compact on start is not helping. > my erasure-code is too big. > > how to fix it? Make sure you have enough space available on your MONs, that is the main advise. Under normal operations <2GB should be enough, but it can grow much bigger. On most clusters I design I make sure there is at least 200GB of space available on each MON on a fast DC-grade SSD. Wido > thanks in advanced > > ceph version: jewel > os : ubuntu16.04 > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Questions on rbd-mirror
On Mon, Mar 27, 2017 at 4:00 AM, Dongsheng Yang wrote: > Hi Fulvio, > > On 03/24/2017 07:19 PM, Fulvio Galeazzi wrote: > > Hallo, apologies for my (silly) questions, I did try to find some doc on > rbd-mirror but was unable to, apart from a number of pages explaining how to > install it. > > My environment is CenOS7 and Ceph 10.2.5. > > Can anyone help me understand a few minor things: > > - is there a cleaner way to configure the user which will be used for >rbd-mirror, other than editing the ExecStart in file > /usr/lib/systemd/system/ceph-rbd-mirror@.service ? >For example some line in ceph.conf... looks like the username >defaults to the cluster name, am I right? > > > It should just be "ceph", no matter what the cluster name is, if I read the > code correctly. The user id is passed in via the systemd instance name. For example, if you wanted to use the "mirror" user id to connect to the local cluster, you would run "systemctl enable ceph-rbd-mirror@mirror". > - is it possible to throttle mirroring? Sure, it's a crazy thing to do >for "cinder" pools, but may make sense for slowly changing ones, like >a "glance" pool. > > > The rbd core team is working on this. Jason, right? This is in our backlog of desired items for the rbd-mirror daemon. Having different settings for different pools was not in our original plan, but this is something that also came up during the Vault conference last week. I've added an additional backlog item to cover per-pool settings. > - is it possible to set per-pool default features? I read about > "rbd default features = ###" >but this is a global setting. (Ok, I can still restrict pools to be >mirrored with "ceph auth" for the user doing mirroring) > > > "per-pool default features" sounds like a reasonable feature request. > > About the "ceph auth" for mirroring, I am working on a rbd acl design, > will consider pool-level, namespace-level and image-level. Then I think > we can do a permission check on this. Right now, the best way to achieve that is by using different configs / user ids for different services. For example, if OpenStack glance used "glance" and cinder user "cinder", the ceph.conf's "[client.glance]" section could have different default features as compared to a "[client.cinder]" section. > Thanx > Yang > > > > Thanks! > > Fulvio > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
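To illustrate Jason's two points with concrete snippets (the user id "mirror" and the feature values below are only examples, not the one true setup): the rbd-mirror daemon picks up its cephx user from the systemd instance name, and per-service feature defaults can live in per-client sections of ceph.conf:

    # run the local rbd-mirror daemon as client.mirror
    systemctl enable ceph-rbd-mirror@mirror
    systemctl start ceph-rbd-mirror@mirror

    # ceph.conf - different default features per service user
    [client.glance]
        rbd default features = 1     # layering only for glance images
    [client.cinder]
        rbd default features = 125   # adds exclusive-lock, object-map, fast-diff, deep-flatten, journaling (needed for mirroring)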
Re: [ceph-users] Questions on rbd-mirror
Jason, do you think it's good idea to introduce a rbd_config object to record some configurations of per-pool, such as default_features. That means, we can set some configurations differently in different pool. In this way, we can also handle the per-pool setting in rbd-mirror. Thanx Yang On 27/03/2017, 21:20, Jason Dillaman wrote: On Mon, Mar 27, 2017 at 4:00 AM, Dongsheng Yang wrote: Hi Fulvio, On 03/24/2017 07:19 PM, Fulvio Galeazzi wrote: Hallo, apologies for my (silly) questions, I did try to find some doc on rbd-mirror but was unable to, apart from a number of pages explaining how to install it. My environment is CenOS7 and Ceph 10.2.5. Can anyone help me understand a few minor things: - is there a cleaner way to configure the user which will be used for rbd-mirror, other than editing the ExecStart in file /usr/lib/systemd/system/ceph-rbd-mirror@.service ? For example some line in ceph.conf... looks like the username defaults to the cluster name, am I right? It should just be "ceph", no matter what the cluster name is, if I read the code correctly. The user id is passed in via the systemd instance name. For example, if you wanted to use the "mirror" user id to connect to the local cluster, you would run "systemctl enable ceph-rbd-mirror@mirror". - is it possible to throttle mirroring? Sure, it's a crazy thing to do for "cinder" pools, but may make sense for slowly changing ones, like a "glance" pool. The rbd core team is working on this. Jason, right? This is in our backlog of desired items for the rbd-mirror daemon. Having different settings for different pools was not in our original plan, but this is something that also came up during the Vault conference last week. I've added an additional backlog item to cover per-pool settings. - is it possible to set per-pool default features? I read about "rbd default features = ###" but this is a global setting. (Ok, I can still restrict pools to be mirrored with "ceph auth" for the user doing mirroring) "per-pool default features" sounds like a reasonable feature request. About the "ceph auth" for mirroring, I am working on a rbd acl design, will consider pool-level, namespace-level and image-level. Then I think we can do a permission check on this. Right now, the best way to achieve that is by using different configs / user ids for different services. For example, if OpenStack glance used "glance" and cinder user "cinder", the ceph.conf's "[client.glance]" section could have different default features as compared to a "[client.cinder]" section. Thanx Yang Thanks! Fulvio ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] libjemalloc.so.1 not used?
Hi, we are testing Ceph as block storage (XFS based OSDs) running in a hyper converged setup with KVM as hypervisor. We are using NVMe SSD only (Intel DC P5320) and I would like to use jemalloc on Ubuntu xenial (current kernel 4.4.0-64-generic). I tried to use /etc/default/ceph and uncommented: # /etc/default/ceph # # Environment file for ceph daemon systemd unit files. # # Increase tcmalloc cache size TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 ## use jemalloc instead of tcmalloc # # jemalloc is generally faster for small IO workloads and when # ceph-osd is backed by SSDs. However, memory usage is usually # higher by 200-300mb. # LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 and it looks like the OSDs are using jemalloc: lsof |grep -e "ceph-osd.*8074.*malloc" ceph-osd 8074 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd 8074 8116 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 8116 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd 8074 8117 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 8117 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd 8074 8118 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 8118 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 [...] But perf top shows something different: Samples: 11M of event 'cycles:pp', Event count (approx.): 603904862529620 Overhead Shared Object Symbol 1.86% libtcmalloc.so.4.2.6 [.] operator new[] 1.73% [kernel] [k] mem_cgroup_iter 1.34% libstdc++.so.6.0.21 [.] std::__ostream_insert > 1.29% libpthread-2.23.so[.] pthread_mutex_lock 1.10% [kernel] [k] __switch_to 0.97% libpthread-2.23.so[.] pthread_mutex_unlock 0.94% [kernel] [k] native_queued_spin_lock_slowpath 0.92% [kernel] [k] update_cfs_shares 0.90% libc-2.23.so [.] __memcpy_avx_unaligned 0.87% libtcmalloc.so.4.2.6 [.] operator delete[] 0.80% ceph-osd [.] ceph::buffer::ptr::release 0.80% [kernel] [k] mem_cgroup_zone_lruvec Do my OSDs use jemalloc or don't they? All the best, Florian EveryWare AG Florian Engelmann Systems Engineer Zurlindenstrasse 52a CH-8003 Zürich T +41 44 466 60 00 F +41 44 466 60 10 florian.engelm...@everyware.ch www.everyware.ch smime.p7s Description: S/MIME cryptographic signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
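One way to narrow this down is to check whether the LD_PRELOAD from /etc/default/ceph actually reached the running OSD process, and which allocator libraries are mapped into it; a small sketch (it just picks the first ceph-osd PID it finds):

    pid=$(pidof ceph-osd | awk '{print $1}')
    tr '\0' '\n' < /proc/$pid/environ | grep LD_PRELOAD   # was the preload set for this process?
    grep -c jemalloc /proc/$pid/maps                      # jemalloc mapped in?
    grep -c tcmalloc /proc/$pid/maps                      # tcmalloc mapped in?

Note that both libraries being mapped (as the lsof output above already shows) does not say which allocator is actually serving the allocations; the perf output, which still shows tcmalloc symbols doing the work, is the better indicator here.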
Re: [ceph-users] New hardware for OSDs
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Wido den Hollander > Sent: 27 March 2017 12:35 > To: ceph-users@lists.ceph.com; Christian Balzer > Subject: Re: [ceph-users] New hardware for OSDs > > > > Op 27 maart 2017 om 13:22 schreef Christian Balzer : > > > > > > > > Hello, > > > > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote: > > > > > Hello all, > > > we are currently in the process of buying new hardware to expand an > > > existing Ceph cluster that already has 1200 osds. > > > > That's quite sizable, is the expansion driven by the need for more > > space (big data?) or to increase IOPS (or both)? > > > > > We are currently using 24 * 4 TB SAS drives per osd with an SSD > > > journal shared among 4 osds. For the upcoming expansion we were > > > thinking of switching to either 6 or 8 TB hard drives (9 or 12 per > > > host) in order to drive down space and cost requirements. > > > > > > Has anyone any experience in mid-sized/large-sized deployment using > > > such hard drives? Our main concern is the rebalance time but we > > > might be overlooking some other aspects. > > > > > > > If you researched the ML archives, you should already know to stay > > well away from SMR HDDs. > > > > Amen! Just don't. Stay away from SMR with Ceph. > > > Both HGST and Seagate have large Enterprise HDDs that have > > journals/caches (MediaCache in HGST speak IIRC) that drastically > > improve write IOPS compared to plain HDDs. > > Even with SSD journals you will want to consider those, as these new > > HDDs will see at least twice the action than your current ones. > > I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery for ~70% full disks takes around 3-4 hours, this is for a cluster containing 60 OSD's. I'm usually seeing recovery speeds up around 1GB/s or more. Depends on your workload, mine is for archiving/backups so big disks are a must. I wouldn't recommend using them for more active workloads unless you are planning a beefy cache tier or some other sort of caching solution. The He8 (and He10) drives also use a fair bit less power due to less friction, but I think this only applies to the sata model. My 12x3.5 8TB node with CPU...etc uses ~140W at idle. Hoping to get this down further with a new Xeon-D design on next expansion phase. The only thing I will say about big disks is beware of cold FS inodes/dentry's and PG splitting. The former isn't a problem if you will only be actively accessing a small portion of your data, but I see increases in latency if I access cold data even with VFS cache pressure set to 1. Currently investigating using bcache under the OSD to try and cache this. PG splitting becomes a problem when the disks start to fill up, playing with the split/merge thresholds may help, but you have to be careful you don't end up with massive splits when they do finally happen, as otherwise OSD's start timing out. > > I also have good experiences with bcache on NVM-E device in Ceph clusters. > A single Intel P3600/P3700 which is the caching device for bcache. > > > Rebalance time is a concern of course, especially if your cluster like > > most HDD based ones has these things throttled down to not impede > > actual client I/O. > > > > To get a rough idea, take a look at: > > https://www.memset.com/tools/raid-calculator/ > > > > For Ceph with replication 3 and the typical PG distribution, assume > > 100 disks and the RAID6 with hotspares numbers are relevant. 
> > For rebuild speed, consult your experience, you must have had a few > > failures. ^o^ > > > > For example with a recovery speed of 100MB/s, a 1TB disk (used data > > with Ceph actually) looks decent at 1:16000 DLO/y. > > At 5TB though it enters scary land > > > > Yes, those recoveries will take a long time. Let's say your 6TB drive is filled for > 80% you need to rebalance 4.8TB > > 4.8TB / 100MB/sec = 13 hours rebuild time > > 13 hours is a long time. And you will probably not have 100MB/sec > sustained, I think that 50MB/sec is much more realistic. Are we talking backfill or recovery here? Recovery will go at the combined speed of all the disks in the cluster. If the OP's cluster is already at 1200 OSD's, a single disk will be a tiny percentage per OSD to recover. But yes, backfill will probably crawl along at 50MB/s, but is this a problem? > > That means that a single disk failure will take >24 hours to recover from a > rebuild. > > I don't like very big disks that much. Not in RAID, not in Ceph. > > Wido > > > Christian > > > > > We currently use the cluster as storage for openstack services: > > > Glance, Cinder and VMs' ephemeral disks. > > > > > > Thanks in advance for any advice. > > > > > > Mattia > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > > -- > > Christian BalzerNetwork/S
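Since Nick brings up split/merge thresholds: these are the filestore options usually tuned for that. The values below are only an illustration, not a recommendation - larger values delay splitting but make each split bigger when it finally happens, which is exactly the trade-off he describes:

    # ceph.conf, [osd] section
    filestore merge threshold = 40
    filestore split multiple = 8
    # a directory splits at roughly split_multiple * abs(merge_threshold) * 16 objects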
Re: [ceph-users] New hardware for OSDs
I mistakenly answered to Wido instead of the whole Mailing list (weird ml settings I suppose). Here is my message: Thanks for replying so quickly. I commented inline. On 03/27/2017 01:34 PM, Wido den Hollander wrote: > >> Op 27 maart 2017 om 13:22 schreef Christian Balzer : >> >> >> >> Hello, >> >> On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote: >> >>> Hello all, >>> we are currently in the process of buying new hardware to expand an >>> existing Ceph cluster that already has 1200 osds. >> >> That's quite sizable, is the expansion driven by the need for more space >> (big data?) or to increase IOPS (or both)? >> >>> We are currently using 24 * 4 TB SAS drives per osd with an SSD journal >>> shared among 4 osds. For the upcoming expansion we were thinking of >>> switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to >>> drive down space and cost requirements. >>> >>> Has anyone any experience in mid-sized/large-sized deployment using such >>> hard drives? Our main concern is the rebalance time but we might be >>> overlooking some other aspects. >>> >> >> If you researched the ML archives, you should already know to stay well >> away from SMR HDDs. >> > > Amen! Just don't. Stay away from SMR with Ceph. > We were planning on using regular enterprise disks. No SMR :) We are a bit puzzled about the possible performance gain of the 4k native ones but that's about it. >> Both HGST and Seagate have large Enterprise HDDs that have >> journals/caches (MediaCache in HGST speak IIRC) that drastically improve >> write IOPS compared to plain HDDs. >> Even with SSD journals you will want to consider those, as these new HDDs >> will see at least twice the action than your current ones. >> > > I also have good experiences with bcache on NVM-E device in Ceph clusters. A > single Intel P3600/P3700 which is the caching device for bcache. > No experience with those but I am a bit skeptical in including new solutions in the current cluster as the current setup seems to work quite well (no IOPS problem). Those could be a nice solution for a new cluster, though. >> Rebalance time is a concern of course, especially if your cluster like >> most HDD based ones has these things throttled down to not impede actual >> client I/O. >> >> To get a rough idea, take a look at: >> https://www.memset.com/tools/raid-calculator/ >> >> For Ceph with replication 3 and the typical PG distribution, assume 100 >> disks and the RAID6 with hotspares numbers are relevant. >> For rebuild speed, consult your experience, you must have had a few >> failures. ^o^ >> >> For example with a recovery speed of 100MB/s, a 1TB disk (used data with >> Ceph actually) looks decent at 1:16000 DLO/y. >> At 5TB though it enters scary land >> > > Yes, those recoveries will take a long time. Let's say your 6TB drive is > filled for 80% you need to rebalance 4.8TB > > 4.8TB / 100MB/sec = 13 hours rebuild time > > 13 hours is a long time. And you will probably not have 100MB/sec sustained, > I think that 50MB/sec is much more realistic. > > That means that a single disk failure will take >24 hours to recover from a > rebuild. > > I don't like very big disks that much. Not in RAID, not in Ceph. I don't think I am following the calculations. Maybe I need to provide a few more details on our current network configuration: each host (24 disks/osds) has 4 * 10 Gbit interfaces, 2 for client I/O and 2 for the recovery network. Rebalancing an OSD that was 50% full (2000GB) with the current setup took a little less than 30 mins. 
It would still take 1.5 hours to rebalance 6 TB of data, but that should still be reasonable, no? What am I overlooking here? From our perspective, having nodes with 9 * 8TB drives should provide a better recovery time than the current 24 * 4TB ones if a whole node goes down, provided the rebalance is shared among several hundred osds. Thanks for any additional input. Mattia > > Wido > >> Christian >> [snip] ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] osds down after upgrade hammer to jewel
Hi all, I'm upgrading a ceph cluster from Hammer 0.94.9 to Jewel 10.2.6. The ceph cluster has 3 servers (one mon and one mds each) and another 6 servers with 12 osds each. The mons and mds have been successfully upgraded to the latest jewel release; however, after upgrading the first osd server (12 osds), ceph is not aware of them and they are marked as down.

ceph -s
  cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
  health HEALTH_WARN
  [...]
  12/72 in osds are down
  noout flag(s) set
  osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs
  flags noout
  [...]

ceph osd tree
  3   3.64000  osd.3   down  1.0  1.0
  8   3.64000  osd.8   down  1.0  1.0
  14  3.64000  osd.14  down  1.0  1.0
  18  3.64000  osd.18  down  1.0  1.0
  21  3.64000  osd.21  down  1.0  1.0
  28  3.64000  osd.28  down  1.0  1.0
  31  3.64000  osd.31  down  1.0  1.0
  37  3.64000  osd.37  down  1.0  1.0
  42  3.64000  osd.42  down  1.0  1.0
  47  3.64000  osd.47  down  1.0  1.0
  51  3.64000  osd.51  down  1.0  1.0
  56  3.64000  osd.56  down  1.0  1.0

If I run this command with one of the down osds

ceph osd in 14
  osd.14 is already in.

ceph still doesn't mark it as up and the cluster health remains in a degraded state. Do I have to upgrade all the osds to jewel first? Any help? I'm running out of ideas.

Thanks
Jaime

-- Jaime Ibar High Performance & Research Computing, IS Services Lloyd Building, Trinity College Dublin, Dublin 2, Ireland. http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie Tel: +353-1-896-3725
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
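A few generic places to look when upgraded OSDs stay down (assuming systemd-managed OSDs; id 14 is just the example from the osd tree above). Note also that from jewel onward the daemons run as the "ceph" user, so ownership of the OSD data directories is worth checking after a hammer upgrade:

    systemctl status ceph-osd@14
    journalctl -u ceph-osd@14 --since "1 hour ago"   # why the daemon exits or stalls
    ls -ld /var/lib/ceph/osd/ceph-14                 # should be owned by ceph:ceph on jewel

If re-owning the directories is not an option right away, the jewel release notes describe a "setuser match path = /var/lib/ceph/$type/$cluster-$id" setting that lets the daemons keep running as root on unconverted data directories.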
Re: [ceph-users] radosgw global quotas - how to set in jewel?
I'm following up to myself here, but I'd love to hear if anyone knows how the global quotas can be set in jewel's radosgw. I haven't found anything which has an effect - the documentation says to use: radosgw-admin region-map get > regionmap.json ...edit the json file radosgw-admin region-map set < regionmap.json but this has no effect on jewel. There doesn't seem to be any analogous function in the "period"-related commands which I think would be the right place to look for jewel. Am I missing something, or should I open a bug? Graham On 03/21/2017 03:18 PM, Graham Allan wrote: On 03/17/2017 11:47 AM, Casey Bodley wrote: On 03/16/2017 03:47 PM, Graham Allan wrote: This might be a dumb question, but I'm not at all sure what the "global quotas" in the radosgw region map actually do. It is like a default quota which is applied to all users or buckets, without having to set them individually, or is it a blanket/aggregate quota applied across all users and buckets in the region/zonegroup? Graham They're defaults that are applied in the absence of quota settings on specific users/buckets, not aggregate quotas. I agree that the documentation in http://docs.ceph.com/docs/master/radosgw/admin/ is not clear about the relationship between 'default quotas' and 'global quotas' - they're basically the same thing, except for their scope. Thanks, that's great to know, and exactly what I hoped it would do. It seemed most likely but not 100% obvious! My next question is how to set/enable the master quota, since I'm not sure that the documented procedure still works for jewel. Although radosgw-admin doesn't acknowledge the "region-map" command in its help output any more, it does accept it, however the "region-map set" appears to have no effect. I think I should be using the radosgw-admin period commands, but it's not clear to me how I can update the quotas within the period_config G. -- Graham Allan Minnesota Supercomputing Institute - g...@umn.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
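One avenue worth trying, with the caveat that it has not been verified end to end here: in jewel most region-* commands were renamed to zonegroup-*, and realm/zonegroup changes generally need a period commit before they take effect, so the analogous sequence would be:

    radosgw-admin zonegroup-map get > zonegroupmap.json
    # ...edit the quota settings in zonegroupmap.json
    radosgw-admin zonegroup-map set < zonegroupmap.json
    radosgw-admin period update --commit
    radosgw-admin period get        # check whether the quotas appear under period_config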
[ceph-users] disk timeouts in libvirt/qemu VMs...
In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel), using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and ceph hosts, we occasionally see hung processes (usually during boot, but otherwise as well), with errors reported in the instance logs as shown below. Configuration is vanilla, based on openstack/ceph docs. Neither the compute hosts nor the ceph hosts appear to be overloaded in terms of memory or network bandwidth, none of the 67 osds are over 80% full, nor do any of them appear to be overwhelmed in terms of IO. Compute hosts and ceph cluster are connected via a relatively quiet 1Gb network, with an IBoE net between the ceph nodes. Neither network appears overloaded. I don’t see any related (to my eye) errors in client or server logs, even with 20/20 logging from various components (rbd, rados, client, objectcacher, etc.) I’ve increased the qemu file descriptor limit (currently 64k... overkill for sure.) I “feels” like a performance problem, but I can’t find any capacity issues or constraining bottlenecks. Any suggestions or insights into this situation are appreciated. Thank you for your time, -- Eric [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more than 120 seconds. [Fri Mar 24 20:30:40 2017] Not tainted 3.13.0-52-generic #85-Ubuntu [Fri Mar 24 20:30:40 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D 88043fd13180 0 226 2 0x [Fri Mar 24 20:30:40 2017] 88003728bbd8 0046 88042690 88003728bfd8 [Fri Mar 24 20:30:40 2017] 00013180 00013180 88042690 88043fd13a18 [Fri Mar 24 20:30:40 2017] 88043ffb9478 0002 811ef7c0 88003728bc50 [Fri Mar 24 20:30:40 2017] Call Trace: [Fri Mar 24 20:30:40 2017] [] ? generic_block_bmap+0x50/0x50 [Fri Mar 24 20:30:40 2017] [] io_schedule+0x9d/0x140 [Fri Mar 24 20:30:40 2017] [] sleep_on_buffer+0xe/0x20 [Fri Mar 24 20:30:40 2017] [] __wait_on_bit+0x62/0x90 [Fri Mar 24 20:30:40 2017] [] ? generic_block_bmap+0x50/0x50 [Fri Mar 24 20:30:40 2017] [] out_of_line_wait_on_bit+0x77/0x90 [Fri Mar 24 20:30:40 2017] [] ? autoremove_wake_function+0x40/0x40 [Fri Mar 24 20:30:40 2017] [] __wait_on_buffer+0x2a/0x30 [Fri Mar 24 20:30:40 2017] [] jbd2_journal_commit_transaction+0x185d/0x1ab0 [Fri Mar 24 20:30:40 2017] [] ? try_to_del_timer_sync+0x4f/0x70 [Fri Mar 24 20:30:40 2017] [] kjournald2+0xbd/0x250 [Fri Mar 24 20:30:40 2017] [] ? prepare_to_wait_event+0x100/0x100 [Fri Mar 24 20:30:40 2017] [] ? commit_timeout+0x10/0x10 [Fri Mar 24 20:30:40 2017] [] kthread+0xd2/0xf0 [Fri Mar 24 20:30:40 2017] [] ? kthread_create_on_node+0x1c0/0x1c0 [Fri Mar 24 20:30:40 2017] [] ret_from_fork+0x7c/0xb0 [Fri Mar 24 20:30:40 2017] [] ? kthread_create_on_node+0x1c0/0x1c0 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph OSD network with IPv6 SLAAC networks?
Has anyone run their Ceph OSD cluster network on IPv6 using SLAAC? I know that ceph supports IPv6, but I'm not sure how it would deal with the address rotation in SLAAC, permanent vs outgoing address, etc. It would be very nice for me, as I wouldn't have to run any kind of DHCP server or use static addressing -- just configure RA's and go. On that note, does anyone have any experience with running ceph in a mixed v4 and v6 environment? Thanks, -richard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
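Not an answer on SLAAC specifically, but for reference the knobs involved look roughly like the sketch below (addresses are placeholders). My understanding -- happy to be corrected -- is that the monitors need stable addresses because they are pinned in ceph.conf/the monmap, while OSDs re-register whatever address they bound at startup, so SLAAC address churn would mostly be a problem for the mons:

# ceph.conf sketch for a v6-only cluster
[global]
ms bind ipv6 = true
mon host = [2001:db8::10]:6789, [2001:db8::11]:6789, [2001:db8::12]:6789
public network = 2001:db8::/64
cluster network = 2001:db8:1::/64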
Re: [ceph-users] disk timeouts in libvirt/qemu VMs...
I can't guarantee it's the same as my issue, but from that it sounds the same. Jewel 10.2.4, 10.2.5 tested hypervisors are proxmox qemu-kvm, using librbd 3 ceph nodes with mon+osd on each -faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops and bw limits on client side, jumbo frames, etc. all improve/smooth out performance and mitigate the hangs, but don't prevent it. -hangs are usually associated with blocked requests (I set the complaint time to 5s to see them) -hangs are very easily caused by rbd snapshot + rbd export-diff to do incremental backup (one snap persistent, plus one more during backup) -when qemu VM io hangs, I have to kill -9 the qemu process for it to stop. Some broken VMs don't appear to be hung until I try to live migrate them (live migrating all VMs helped test solutions) Finally I have a workaround... disable exclusive-lock, object-map, and fast-diff rbd features (and restart clients via live migrate). (object-map and fast-diff appear to have no effect on dif or export-diff ... so I don't miss them). I'll file a bug at some point (after I move all VMs back and see if it is still stable). And one other user on IRC said this solved the same problem (also using rbd snapshots). And strangely, they don't seem to hang if I put back those features, until a few days later (making testing much less easy...but now I'm very sure removing them prevents the issue) I hope this works for you (and maybe gets some attention from devs too), so you don't waste months like me. On 03/27/17 19:31, Hall, Eric wrote: > In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel), > using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and ceph > hosts, we occasionally see hung processes (usually during boot, but otherwise > as well), with errors reported in the instance logs as shown below. > Configuration is vanilla, based on openstack/ceph docs. > > Neither the compute hosts nor the ceph hosts appear to be overloaded in terms > of memory or network bandwidth, none of the 67 osds are over 80% full, nor do > any of them appear to be overwhelmed in terms of IO. Compute hosts and ceph > cluster are connected via a relatively quiet 1Gb network, with an IBoE net > between the ceph nodes. Neither network appears overloaded. > > I don’t see any related (to my eye) errors in client or server logs, even > with 20/20 logging from various components (rbd, rados, client, objectcacher, > etc.) I’ve increased the qemu file descriptor limit (currently 64k... > overkill for sure.) > > I “feels” like a performance problem, but I can’t find any capacity issues or > constraining bottlenecks. > > Any suggestions or insights into this situation are appreciated. Thank you > for your time, > -- > Eric > > > [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more than > 120 seconds. > [Fri Mar 24 20:30:40 2017] Not tainted 3.13.0-52-generic #85-Ubuntu > [Fri Mar 24 20:30:40 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D 88043fd13180 0 226 > 2 0x > [Fri Mar 24 20:30:40 2017] 88003728bbd8 0046 > 88042690 88003728bfd8 > [Fri Mar 24 20:30:40 2017] 00013180 00013180 > 88042690 88043fd13a18 > [Fri Mar 24 20:30:40 2017] 88043ffb9478 0002 > 811ef7c0 88003728bc50 > [Fri Mar 24 20:30:40 2017] Call Trace: > [Fri Mar 24 20:30:40 2017] [] ? 
> generic_block_bmap+0x50/0x50 > [Fri Mar 24 20:30:40 2017] [] io_schedule+0x9d/0x140 > [Fri Mar 24 20:30:40 2017] [] sleep_on_buffer+0xe/0x20 > [Fri Mar 24 20:30:40 2017] [] __wait_on_bit+0x62/0x90 > [Fri Mar 24 20:30:40 2017] [] ? > generic_block_bmap+0x50/0x50 > [Fri Mar 24 20:30:40 2017] [] > out_of_line_wait_on_bit+0x77/0x90 > [Fri Mar 24 20:30:40 2017] [] ? > autoremove_wake_function+0x40/0x40 > [Fri Mar 24 20:30:40 2017] [] __wait_on_buffer+0x2a/0x30 > [Fri Mar 24 20:30:40 2017] [] > jbd2_journal_commit_transaction+0x185d/0x1ab0 > [Fri Mar 24 20:30:40 2017] [] ? > try_to_del_timer_sync+0x4f/0x70 > [Fri Mar 24 20:30:40 2017] [] kjournald2+0xbd/0x250 > [Fri Mar 24 20:30:40 2017] [] ? > prepare_to_wait_event+0x100/0x100 > [Fri Mar 24 20:30:40 2017] [] ? commit_timeout+0x10/0x10 > [Fri Mar 24 20:30:40 2017] [] kthread+0xd2/0xf0 > [Fri Mar 24 20:30:40 2017] [] ? > kthread_create_on_node+0x1c0/0x1c0 > [Fri Mar 24 20:30:40 2017] [] ret_from_fork+0x7c/0xb0 > [Fri Mar 24 20:30:40 2017] [] ? > kthread_create_on_node+0x1c0/0x1c0 > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
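To spell out that workaround for anyone trying it: the feature flags have dependencies (fast-diff needs object-map, object-map needs exclusive-lock), so they have to be disabled in that order. The pool/image name below is a placeholder, and as noted above the client has to be restarted (e.g. via live migration) to pick up the change:

rbd feature disable rbd/vm-100-disk-1 fast-diff
rbd feature disable rbd/vm-100-disk-1 object-map
rbd feature disable rbd/vm-100-disk-1 exclusive-lock
rbd info rbd/vm-100-disk-1    # confirm the remaining feature list afterwards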
[ceph-users] Kraken release and RGW --> "S3 bucket lifecycle API has been added. Note that currently it only supports object expiration."
Hi Cephers. I couldn't find any specific documentation about the "S3 object expiration" feature, so I assume it should behave like AWS S3 (?!?) ... BUT ... we have a test cluster based on 11.2.0 - Kraken and I set some object expiration dates via CyberDuck and DragonDisk, but the objects are still there, days after the applied date/time. Am I missing something? Thanks & regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
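In case it helps: AWS-style expiration is configured as a bucket lifecycle rule through the S3 API, not as a per-object attribute, so it may be worth checking what CyberDuck/DragonDisk actually send. A minimal rule is the standard S3 XML below; newer s3cmd can push it (the subcommand name is from memory), and kraken should be able to list/process the rules with the radosgw-admin lc commands -- treat all of this as a sketch, the bucket name and prefix are placeholders:

cat > lifecycle.xml <<'EOF'
<LifecycleConfiguration>
  <Rule>
    <ID>expire-tmp</ID>
    <Prefix>tmp/</Prefix>
    <Status>Enabled</Status>
    <Expiration><Days>1</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
EOF
s3cmd setlifecycle lifecycle.xml s3://mybucket
radosgw-admin lc list      # should show the bucket and its lifecycle status
radosgw-admin lc process   # force a run instead of waiting for the scheduled window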
Re: [ceph-users] osds down after upgrade hammer to jewel
Make sure the OSD processes on the Jewel node are running. If you didn't change the ownership to user ceph, they won't start. > On Mar 27, 2017, at 11:53, Jaime Ibar wrote: > > Hi all, > > I'm upgrading ceph cluster from Hammer 0.94.9 to jewel 10.2.6. > > The ceph cluster has 3 servers (one mon and one mds each) and another 6 > servers with > 12 osds each. > The monitoring and mds have been succesfully upgraded to latest jewel > release, however > after upgrade the first osd server(12 osds), ceph is not aware of them and > are marked as down > > ceph -s > > cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45 > health HEALTH_WARN > [...] >12/72 in osds are down >noout flag(s) set > osdmap e14010: 72 osds: 60 up, 72 in; 14641 remapped pgs >flags noout > [...] > > ceph osd tree > > 3 3.64000 osd.3 down 1.0 1.0 > 8 3.64000 osd.8 down 1.0 1.0 > 14 3.64000 osd.14 down 1.0 1.0 > 18 3.64000 osd.18 down 1.0 1.0 > 21 3.64000 osd.21 down 1.0 1.0 > 28 3.64000 osd.28 down 1.0 1.0 > 31 3.64000 osd.31 down 1.0 1.0 > 37 3.64000 osd.37 down 1.0 1.0 > 42 3.64000 osd.42 down 1.0 1.0 > 47 3.64000 osd.47 down 1.0 1.0 > 51 3.64000 osd.51 down 1.0 1.0 > 56 3.64000 osd.56 down 1.0 1.0 > > If I run this command with one of the down osd > ceph osd in 14 > osd.14 is already in. > however ceph doesn't mark it as up and the cluster health remains > in degraded state. > > Do I have to upgrade all the osds to jewel first? > Any help as I'm running out of ideas? > > Thanks > Jaime > > -- > > Jaime Ibar > High Performance & Research Computing, IS Services > Lloyd Building, Trinity College Dublin, Dublin 2, Ireland. > http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie > Tel: +353-1-896-3725 > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
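To spell that out (paths are the defaults, the osd id is just an example) -- on the upgraded OSD host, with the OSDs stopped:

systemctl stop ceph-osd.target
chown -R ceph:ceph /var/lib/ceph
# if the journals live on separate partitions, their ownership may need the same treatment
systemctl start ceph-osd@14
systemctl status ceph-osd@14    # check it stays up, then start the remaining OSDs

# alternatively, to keep running the daemons as root for now, the release notes
# mention adding this to ceph.conf instead of chowning:
# setuser match path = /var/lib/ceph/$type/$cluster-$id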
Re: [ceph-users] libjemalloc.so.1 not used?
you need to recompile ceph with jemalloc, without have tcmalloc dev librairies. LD_PRELOAD has never work for jemalloc and ceph - Mail original - De: "Engelmann Florian" À: "ceph-users" Envoyé: Lundi 27 Mars 2017 16:54:33 Objet: [ceph-users] libjemalloc.so.1 not used? Hi, we are testing Ceph as block storage (XFS based OSDs) running in a hyper converged setup with KVM as hypervisor. We are using NVMe SSD only (Intel DC P5320) and I would like to use jemalloc on Ubuntu xenial (current kernel 4.4.0-64-generic). I tried to use /etc/default/ceph and uncommented: # /etc/default/ceph # # Environment file for ceph daemon systemd unit files. # # Increase tcmalloc cache size TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 ## use jemalloc instead of tcmalloc # # jemalloc is generally faster for small IO workloads and when # ceph-osd is backed by SSDs. However, memory usage is usually # higher by 200-300mb. # LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 and it looks like the OSDs are using jemalloc: lsof |grep -e "ceph-osd.*8074.*malloc" ceph-osd 8074 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd 8074 8116 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 8116 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd 8074 8117 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 8117 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd 8074 8118 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 ceph-osd 8074 8118 ceph mem REG 252,0 219816 658861 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 [...] But perf top shows something different: Samples: 11M of event 'cycles:pp', Event count (approx.): 603904862529620 Overhead Shared Object Symbol 1.86% libtcmalloc.so.4.2.6 [.] operator new[] 1.73% [kernel] [k] mem_cgroup_iter 1.34% libstdc++.so.6.0.21 [.] std::__ostream_insert > 1.29% libpthread-2.23.so [.] pthread_mutex_lock 1.10% [kernel] [k] __switch_to 0.97% libpthread-2.23.so [.] pthread_mutex_unlock 0.94% [kernel] [k] native_queued_spin_lock_slowpath 0.92% [kernel] [k] update_cfs_shares 0.90% libc-2.23.so [.] __memcpy_avx_unaligned 0.87% libtcmalloc.so.4.2.6 [.] operator delete[] 0.80% ceph-osd [.] ceph::buffer::ptr::release 0.80% [kernel] [k] mem_cgroup_zone_lruvec Do my OSDs use jemalloc or don't they? All the best, Florian EveryWare AG Florian Engelmann Systems Engineer Zurlindenstrasse 52a CH-8003 Zürich T +41 44 466 60 00 F +41 44 466 60 10 florian.engelm...@everyware.ch www.everyware.ch ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
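A quick way to see what the osd binary is actually linked against (as opposed to what merely ends up mapped because of LD_PRELOAD) -- paths are the Ubuntu defaults and the pid is an example:

ldd /usr/bin/ceph-osd | grep -E 'tcmalloc|jemalloc'
grep -E 'tcmalloc|jemalloc' /proc/8074/maps | awk '{print $NF}' | sort -u

If tcmalloc shows up in ldd, the perf top output above is what you'd expect: the daemon was built against tcmalloc and the preload doesn't change that, which matches the advice to rebuild with jemalloc.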
[ceph-users] How to check SMR vs PMR before buying disks?
What's the biggest PMR disk I can buy, and how do I tell if a disk is PMR? I'm well aware that I shouldn't use SMR disks: http://ceph.com/planet/do-not-use-smr-disks-with-ceph/ But newegg and the like don't seem to advertise SMR vs PMR and I can't even find it on manufacturer's websites (at least not from Seagate). Is there any way to tell? Is there a rule of thumb, such as "4TB+ is probably SMR" or "enterprise usually means PMR"? Thanks -- Adam Carheden ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to check SMR vs PMR before buying disks?
On Mon, 27 Mar 2017 17:32:53 -0600 Adam Carheden wrote: > What's the biggest PMR disk I can buy, and how do I tell if a disk is PMR? > > I'm well aware that I shouldn't use SMR disks: > http://ceph.com/planet/do-not-use-smr-disks-with-ceph/ > > But newegg and the like don't seem to advertise SMR vs PMR and I can't > even find it on manufacturer's websites (at least not from Seagate). > You need to work on your google/website scouring foo. http://www.seagate.com/enterprise-storage/hard-disk-drives/archive-hdd/#features Clearly says SMR there, I would assume "archive" is a good hint, too. > Is there any way to tell? Is there a rule of thumb, such "as 4T+ is > probably SMR" or "enterprise usually means PMR"? > Size isn't conclusive, enterprise and non-archive more so. http://www.seagate.com/enterprise-storage/hard-disk-drives/enterprise-capacity-3-5-hdd/ "Proven conventional PMR technology backed by highest field reliability ratings and an MTBF of 2M hours" HTH, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
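For drives you already have on hand there is also a partial host-side check, with the caveat that it only catches host-aware/host-managed SMR on reasonably new kernels (4.10-ish and later); drive-managed SMR still reports "none", so the vendor spec sheet remains the only reliable source:

cat /sys/block/sda/queue/zoned    # "none", "host-aware" or "host-managed"
smartctl -i /dev/sda              # model number to look up against the datasheet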
Re: [ceph-users] New hardware for OSDs
Hello, On Mon, 27 Mar 2017 16:09:09 +0100 Nick Fisk wrote: > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > > Wido den Hollander > > Sent: 27 March 2017 12:35 > > To: ceph-users@lists.ceph.com; Christian Balzer > > Subject: Re: [ceph-users] New hardware for OSDs > > > > > > > Op 27 maart 2017 om 13:22 schreef Christian Balzer : > > > > > > > > > > > > Hello, > > > > > > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote: > > > > > > > Hello all, > > > > we are currently in the process of buying new hardware to expand an > > > > existing Ceph cluster that already has 1200 osds. > > > > > > That's quite sizable, is the expansion driven by the need for more > > > space (big data?) or to increase IOPS (or both)? > > > > > > > We are currently using 24 * 4 TB SAS drives per osd with an SSD > > > > journal shared among 4 osds. For the upcoming expansion we were > > > > thinking of switching to either 6 or 8 TB hard drives (9 or 12 per > > > > host) in order to drive down space and cost requirements. > > > > > > > > Has anyone any experience in mid-sized/large-sized deployment using > > > > such hard drives? Our main concern is the rebalance time but we > > > > might be overlooking some other aspects. > > > > > > > > > > If you researched the ML archives, you should already know to stay > > > well away from SMR HDDs. > > > > > > > Amen! Just don't. Stay away from SMR with Ceph. > > > > > Both HGST and Seagate have large Enterprise HDDs that have > > > journals/caches (MediaCache in HGST speak IIRC) that drastically > > > improve write IOPS compared to plain HDDs. > > > Even with SSD journals you will want to consider those, as these new > > > HDDs will see at least twice the action than your current ones. > > > > > I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery for > ~70% full disks takes around 3-4 hours, this is for a cluster containing 60 > OSD's. I'm usually seeing recovery speeds up around 1GB/s or more. > Good data point. How busy is your cluster at those times, client I/O impact? > Depends on your workload, mine is for archiving/backups so big disks are a > must. I wouldn't recommend using them for more active workloads unless you > are planning a beefy cache tier or some other sort of caching solution. > > The He8 (and He10) drives also use a fair bit less power due to less > friction, but I think this only applies to the sata model. My 12x3.5 8TB > node with CPU...etc uses ~140W at idle. Hoping to get this down further with > a new Xeon-D design on next expansion phase. > > The only thing I will say about big disks is beware of cold FS > inodes/dentry's and PG splitting. The former isn't a problem if you will > only be actively accessing a small portion of your data, but I see increases > in latency if I access cold data even with VFS cache pressure set to 1. > Currently investigating using bcache under the OSD to try and cache this. > I've seen this kind of behavior on my (non-Ceph) mailbox servers. As in, the maximum SLAB space may not be large enough to hold all inodes or the pagecache will eat into it over time when not constantly referenced, despite cache pressure settings. > PG splitting becomes a problem when the disks start to fill up, playing with > the split/merge thresholds may help, but you have to be careful you don't > end up with massive splits when they do finally happen, as otherwise OSD's > start timing out. > Getting this right (and predictable) is one of the darker arts with Ceph. 
OTOH it will go away with Bluestore (just to be replaced by other oddities no doubt). > > > > I also have good experiences with bcache on NVM-E device in Ceph clusters. > > A single Intel P3600/P3700 which is the caching device for bcache. > > > > > Rebalance time is a concern of course, especially if your cluster like > > > most HDD based ones has these things throttled down to not impede > > > actual client I/O. > > > > > > To get a rough idea, take a look at: > > > https://www.memset.com/tools/raid-calculator/ > > > > > > For Ceph with replication 3 and the typical PG distribution, assume > > > 100 disks and the RAID6 with hotspares numbers are relevant. > > > For rebuild speed, consult your experience, you must have had a few > > > failures. ^o^ > > > > > > For example with a recovery speed of 100MB/s, a 1TB disk (used data > > > with Ceph actually) looks decent at 1:16000 DLO/y. > > > At 5TB though it enters scary land > > > > > > > Yes, those recoveries will take a long time. Let's say your 6TB drive is > filled for > > 80% you need to rebalance 4.8TB > > > > 4.8TB / 100MB/sec = 13 hours rebuild time > > > > 13 hours is a long time. And you will probably not have 100MB/sec > > sustained, I think that 50MB/sec is much more realistic. > > Are we talking backfill or recovery here? Recovery will go at the combined > speed of all the disks in the cluster. If the OP's cl
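For what it's worth, the back-of-envelope numbers above in script form (80% full 6TB drive; the sustained recovery rate is the assumption that matters):

# hours to re-replicate one failed drive's worth of data
echo "6*0.8*1000*1000/100/3600" | bc -l    # ~13.3 h at 100MB/s sustained
echo "6*0.8*1000*1000/50/3600"  | bc -l    # ~26.7 h at 50MB/s sustained

In practice the whole cluster recovers in parallel, so per-OSD rates add up and the wall-clock time can be much shorter, as the ~1GB/s figure later in this thread shows.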
Re: [ceph-users] New hardware for OSDs
Hello, On Mon, 27 Mar 2017 17:48:38 +0200 Mattia Belluco wrote: > I mistakenly answered to Wido instead of the whole Mailing list ( weird > ml settings I suppose) > > Here it is my message: > > > Thanks for replying so quickly. I commented inline. > > On 03/27/2017 01:34 PM, Wido den Hollander wrote: > > > >> Op 27 maart 2017 om 13:22 schreef Christian Balzer : > >> > >> > >> > >> Hello, > >> > >> On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote: > >> > >>> Hello all, > >>> we are currently in the process of buying new hardware to expand an > >>> existing Ceph cluster that already has 1200 osds. > >> > >> That's quite sizable, is the expansion driven by the need for more space > >> (big data?) or to increase IOPS (or both)? > >> > >>> We are currently using 24 * 4 TB SAS drives per osd with an SSD journal > >>> shared among 4 osds. For the upcoming expansion we were thinking of > >>> switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to > >>> drive down space and cost requirements. > >>> > >>> Has anyone any experience in mid-sized/large-sized deployment using such > >>> hard drives? Our main concern is the rebalance time but we might be > >>> overlooking some other aspects. > >>> > >> > >> If you researched the ML archives, you should already know to stay well > >> away from SMR HDDs. > >> > > > > Amen! Just don't. Stay away from SMR with Ceph. > > > We were planning on using regular enterprise disks. No SMR :) > We are bit puzzled about the possible performance gain of the 4k native > ones but that's about it. > AFAIK Linux will even with 512e (4K native, 512B emulation) drives do the right thing [TM]. > >> Both HGST and Seagate have large Enterprise HDDs that have > >> journals/caches (MediaCache in HGST speak IIRC) that drastically improve > >> write IOPS compared to plain HDDs. > >> Even with SSD journals you will want to consider those, as these new HDDs > >> will see at least twice the action than your current ones. > >> > > > > I also have good experiences with bcache on NVM-E device in Ceph clusters. > > A single Intel P3600/P3700 which is the caching device for bcache. > > > No experience with those but I am a bit skeptical in including new > solutions in the current cluster as the current setup seems to work > quite well (no IOPS problem). > Those could be a nice solution for a new cluster, though. > I have no experiences (or no current ones at last) with those either and a new cluster (as in late this year or early next year) would likely to be Bluestore based and thus have different needs, tuning knobs, etc. > > >> Rebalance time is a concern of course, especially if your cluster like > >> most HDD based ones has these things throttled down to not impede actual > >> client I/O. > >> > >> To get a rough idea, take a look at: > >> https://www.memset.com/tools/raid-calculator/ > >> > >> For Ceph with replication 3 and the typical PG distribution, assume 100 > >> disks and the RAID6 with hotspares numbers are relevant. > >> For rebuild speed, consult your experience, you must have had a few > >> failures. ^o^ > >> > >> For example with a recovery speed of 100MB/s, a 1TB disk (used data with > >> Ceph actually) looks decent at 1:16000 DLO/y. > >> At 5TB though it enters scary land > >> > > > > Yes, those recoveries will take a long time. Let's say your 6TB drive is > > filled for 80% you need to rebalance 4.8TB > > > > 4.8TB / 100MB/sec = 13 hours rebuild time > > > > 13 hours is a long time. 
And you will probably not have 100MB/sec > > sustained, I think that 50MB/sec is much more realistic. > > > > That means that a single disk failure will take >24 hours to recover from a > > rebuild. > > > > I don't like very big disks that much. Not in RAID, not in Ceph. > I don't think I am followinj the calculations. Maybe I need to provide a > few more details on our current network configuration: > each host (24 disks/osds) has 4 * 10 Gbit interfaces, 2 for client I/O > and 2 for the recovery network. > Rebalancing an OSD that was 50% full (2000GB) with the current setup > tool a little less than 30 mins. It would still take 1.5 hour to > rebalance 6 TB of data but that should still be reasonable,no? > What am I overlooking here? > We're playing devils advocate here, not knowing your configuration. And most of all, if your cluster is busy or busier than usual, those times will go up. Your numbers suggest a recovery speed of around 1GB/s, which is very nice and something I'd expect (hope) to see from such a large cluster. Plunging that into the calculator above with 5TB gives us a 1:6500 DLO/y, not utterly frightening but also quite a bit lower than your current example with 2TB at 1:4. > From our perspective having 9 * 8TB noded should provide a better > recovery time than the current 24 * 4TB ones if a whole node goes down > provide the rebalance is shared among several hundreds osds. > You'll have 25% less data per nod
Re: [ceph-users] RBD image perf counters: usage, access
Hi Yang, > Do you mean get the perf counters via api? At first this counter is only for a particular ImageCtx (connected client), then you can read the counters by the perf dump command in my last mail I think. Yes, I did mean to get counters via API. And looks like I can adapt this admin-daemon command for my purposes. Thanks! Having ceph-top would be just great and much more useful for me, yes. I'm glad there are some discussions about that and I didn't know about them. So thanks for pointing me out :) On 27/03/17 15:38, Dongsheng Yang wrote: On 03/27/2017 04:06 PM, Masha Atakova wrote: Hi Yang, Hi Masha, Thank you for your reply. This is very useful indeed that there are many ImageCtx objects for one image. But in my setting, I don't have any particular ceph client connected to ceph (I could, but this is not the point). I'm trying to get metrics for particular image while not performing anything with it myself. The perf counter you mentioned in your first mail, is just for one particular image client, that means, these perf counter will disappear as the client disconnected. And I'm trying to get access to performance counters listed in the ImageCtx class, they don't seem to be reported by the perf tool. Do you mean get the perf counters via api? At first this counter is only for a particular ImageCtx (connected client), then you can read the counters by the perf dump command in my last mail I think. If you want to get the performance counter for an image (no matter how many ImageCtx, connected or disconnected), maybe you need to wait this one: http://pad.ceph.com/p/ceph-top Yang Thanks! On 27/03/17 12:29, Dongsheng Yang wrote: Hi Masha you can get the counters by perf dump command on the asok file of your client. such as that: $ ceph --admin-daemon out/client.admin.9921.asok perf dump|grep rd "rd": 656754, "rd_bytes": 656754, "rd_latency": { "discard": 0, "discard_bytes": 0, "discard_latency": { "omap_rd": 0, But, note that, this is a counter of this one ImageCtx, but not the counter for this image. There are possible several ImageCtxes reading or writing on the same image. Yang On 03/27/2017 12:23 PM, Masha Atakova wrote: Hi everyone, I was going around trying to figure out how to get ceph metrics on a more detailed level than daemons. Of course, I found and explored API for watching rados objects, but I'm more interested in getting metrics about RBD images. And while I could get list of objects for particular image, and then watch all of them, it doesn't seem like very efficient way to go about it. I checked librbd API and there isn't anything helping with my goal. So I went through the source code and found list of performance counters for image which are incremented by other parts of ceph when making corresponding operations: https://github.com/ceph/ceph/blob/master/src/librbd/ImageCtx.cc#L364 I have 2 questions about it: 1) is there any workaround to use those counters right now? maybe when compiling against ceph the code doing it. Looks like I need to be able to access particular ImageCtx object (instead of creating my own), and I just can't find appropriate class / part of the librbd allowing me to do so. 2) are there any plans on making those counters accessible via API like librbd or librados? I see that these questions might be more appropriate for the devel list, but: - it seems to me that question of getting ceph metrics is more interesting for those who use ceph - I couldn't subscribe to it with an error provided below. Thanks! 
majord...@vger.kernel.org: SMTP error from remote server for MAIL FROM command, host: vger.kernel.org (209.132.180.67) reason: 553 5.7.1 Hello [74.208.4.201], for your MAIL FROM address policy analysis reported: Your address is not liked source for email
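Back on the perf counter question: the prerequisite for the "perf dump" approach is that the librbd client exposes an admin socket at all, which usually means something like the sketch below on the client side (the socket path is the commonly used pattern, adjust to taste; the client process needs write access to the directory, and the pid/cctid in the example are made up):

[client]
admin socket = /var/run/ceph/$cluster-$name.$pid.$cctid.asok

# then, against the socket of the client you care about:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.9921.140123456.asok perf dump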
Re: [ceph-users] Ceph OSD network with IPv6 SLAAC networks?
Nix the second question, as I understand it, ceph doesn't work in mixed IPv6 and legacy IPv4 environments. Still, would like to hear from people running it in SLAAC environments. On Mon, Mar 27, 2017 at 12:49 PM, Richard Hesse wrote: > Has anyone run their Ceph OSD cluster network on IPv6 using SLAAC? I know > that ceph supports IPv6, but I'm not sure how it would deal with the > address rotation in SLAAC, permanent vs outgoing address, etc. It would be > very nice for me, as I wouldn't have to run any kind of DHCP server or use > static addressing -- just configure RA's and go. > > On that note, does anyone have any experience with running ceph in a mixed > v4 and v6 environment? > > Thanks, > -richard > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-rest-api's behavior
Hi Brad, Thanks for your help. I found that's my problem. Forget attach file name with words ''keyring". And sorry to bother you again. Is it possible to create a minimum privilege client for the api to run? Best wishes, Mika 2017-03-24 19:32 GMT+08:00 Brad Hubbard : > On Fri, Mar 24, 2017 at 8:20 PM, Mika c wrote: > > Hi Brad, > > Thanks for your reply. The environment already created keyring file > and > > put it in /etc/ceph but not working. > > What was it called? > > > I have to write config into ceph.conf like below. > > > > ---ceph.conf start--- > > [client.symphony] > > log_file = / > > var/log/ceph/rest-api.log > > > > keyring = /etc/ceph/ceph.client.symphony > > public addr = > > 0.0.0.0 > > :5 > > 000 > > > > restapi base url = /api/v0.1 > > ---ceph.conf > > end > > --- > > > > > > Another question, have I must setting capabilities for this client like > > admin ? > > But I just want to take some information like health or df. > > > > If this client setting > > for a particular > > capabilities > > like.. > > --- > > --- > > > > client.symphony > >key: AQBP8NRYGehDKRAAzyChAvAivydLqRBsHeTPjg== > >caps: [mon] allow r > >caps: [osd] allow r > > x > > --- > > --- > > Error list: > > Traceback (most recent call last): > > File "/usr/bin/ceph-rest-api", line 59, in > >rest, > > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 495, in > > generate_a > > pp > >addr, port = api_setup(app, conf, cluster, clientname, clientid, args) > > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 146, in > > api_setup > >target=('osd', int(osdid))) > > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 84, in > > get_command > > _descriptions > >raise EnvironmentError(ret, err) > > EnvironmentError: [Errno -1] Can't get command descriptions: > > > > > > > > > > Best wishes, > > Mika > > > > > > 2017-03-24 16:21 GMT+08:00 Brad Hubbard : > >> > >> On Fri, Mar 24, 2017 at 4:06 PM, Mika c wrote: > >> > Hi all, > >> > Same question with CEPH 10.2.3 and 11.2.0. > >> > Is this command only for client.admin ? 
> >> > > >> > client.symphony > >> >key: AQD0tdRYjhABEhAAaG49VhVXBTw0MxltAiuvgg== > >> >caps: [mon] allow * > >> >caps: [osd] allow * > >> > > >> > Traceback (most recent call last): > >> > File "/usr/bin/ceph-rest-api", line 43, in > >> >rest, > >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 504, > in > >> > generate_a > >> > pp > >> >addr, port = api_setup(app, conf, cluster, clientname, clientid, > >> > args) > >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 106, > in > >> > api_setup > >> >app.ceph_cluster.connect() > >> > File "rados.pyx", line 811, in rados.Rados.connect > >> > (/tmp/buildd/ceph-11.2.0/obj-x > >> > 86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:10178) > >> > rados.ObjectNotFound: error connecting to the cluster > >> > >> # strace -eopen /bin/ceph-rest-api |& grep keyring > >> open("/etc/ceph/ceph.client.restapi.keyring", O_RDONLY) = -1 ENOENT > >> (No such file or directory) > >> open("/etc/ceph/ceph.keyring", O_RDONLY) = -1 ENOENT (No such file or > >> directory) > >> open("/etc/ceph/keyring", O_RDONLY) = -1 ENOENT (No such file or > >> directory) > >> open("/etc/ceph/keyring.bin", O_RDONLY) = -1 ENOENT (No such file or > >> directory) > >> > >> # ceph auth get-or-create client.restapi mon 'allow *' mds 'allow *' > >> osd 'allow *' >/etc/ceph/ceph.client.restapi.keyring > >> > >> # /bin/ceph-rest-api > >> * Running on http://0.0.0.0:5000/ > >> > >> > > >> > > >> > > >> > Best wishes, > >> > Mika > >> > > >> > > >> > 2016-03-03 12:25 GMT+08:00 Shinobu Kinjo : > >> >> > >> >> Yes. > >> >> > >> >> On Wed, Jan 27, 2016 at 1:10 PM, Dan Mick wrote: > >> >> > Is the client.test-admin key in the keyring read by ceph-rest-api? > >> >> > > >> >> > On 01/22/2016 04:05 PM, Shinobu Kinjo wrote: > >> >> >> Does anyone have any idea about that? > >> >> >> > >> >> >> Rgds, > >> >> >> Shinobu > >> >> >> > >> >> >> - Original Message - > >> >> >> From: "Shinobu Kinjo" > >> >> >> To: "ceph-users" > >> >> >> Sent: Friday, January 22, 2016 7:15:36 AM > >> >> >> Subject: ceph-rest-api's behavior > >> >> >> > >> >> >> Hello, > >> >> >> > >> >> >> "ceph-rest-api" works greatly with client.admin. > >> >> >> But with client.test-admin which I created just after building the > >> >> >> Ceph > >> >> >> cluster , it does not work. > >> >> >> > >> >> >> ~$ ceph auth get-or-create client.test-admin mon 'allow *' mds > >> >> >> 'allow > >> >> >> *' osd 'allow *' > >> >> >> > >> >> >> ~$ sudo ceph auth list > >> >> >> installed auth entries: > >> >> >>... > >> >> >> client.test-admin > >> >> >> key: AQCOVaFWTYr2ORAAKwruANTLXqdHOchkVvRApg== > >> >> >> caps: [mds] allow * > >> >> >> caps: [mon] allow * >
Re: [ceph-users] ceph-rest-api's behavior
I've copied Dan who may have some thoughts on this and has been involved with this code. On Tue, Mar 28, 2017 at 3:58 PM, Mika c wrote: > Hi Brad, >Thanks for your help. I found that's my problem. Forget attach file name > with words ''keyring". > > And sorry to bother you again. Is it possible to create a minimum privilege > client for the api to run? > > > > Best wishes, > Mika > > > 2017-03-24 19:32 GMT+08:00 Brad Hubbard : >> >> On Fri, Mar 24, 2017 at 8:20 PM, Mika c wrote: >> > Hi Brad, >> > Thanks for your reply. The environment already created keyring file >> > and >> > put it in /etc/ceph but not working. >> >> What was it called? >> >> > I have to write config into ceph.conf like below. >> > >> > ---ceph.conf start--- >> > [client.symphony] >> > log_file = / >> > var/log/ceph/rest-api.log >> > >> > keyring = /etc/ceph/ceph.client.symphony >> > public addr = >> > 0.0.0.0 >> > :5 >> > 000 >> > >> > restapi base url = /api/v0.1 >> > ---ceph.conf >> > end >> > --- >> > >> > >> > Another question, have I must setting capabilities for this client like >> > admin ? >> > But I just want to take some information like health or df. >> > >> > If this client setting >> > for a particular >> > capabilities >> > like.. >> > --- >> > --- >> > >> > client.symphony >> >key: AQBP8NRYGehDKRAAzyChAvAivydLqRBsHeTPjg== >> >caps: [mon] allow r >> >caps: [osd] allow r >> > x >> > --- >> > --- >> > Error list: >> > Traceback (most recent call last): >> > File "/usr/bin/ceph-rest-api", line 59, in >> >rest, >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 495, in >> > generate_a >> > pp >> >addr, port = api_setup(app, conf, cluster, clientname, clientid, >> > args) >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 146, in >> > api_setup >> >target=('osd', int(osdid))) >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 84, in >> > get_command >> > _descriptions >> >raise EnvironmentError(ret, err) >> > EnvironmentError: [Errno -1] Can't get command descriptions: >> > >> > >> > >> > >> > Best wishes, >> > Mika >> > >> > >> > 2017-03-24 16:21 GMT+08:00 Brad Hubbard : >> >> >> >> On Fri, Mar 24, 2017 at 4:06 PM, Mika c wrote: >> >> > Hi all, >> >> > Same question with CEPH 10.2.3 and 11.2.0. >> >> > Is this command only for client.admin ? 
>> >> > >> >> > client.symphony >> >> >key: AQD0tdRYjhABEhAAaG49VhVXBTw0MxltAiuvgg== >> >> >caps: [mon] allow * >> >> >caps: [osd] allow * >> >> > >> >> > Traceback (most recent call last): >> >> > File "/usr/bin/ceph-rest-api", line 43, in >> >> >rest, >> >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 504, >> >> > in >> >> > generate_a >> >> > pp >> >> >addr, port = api_setup(app, conf, cluster, clientname, clientid, >> >> > args) >> >> > File "/usr/lib/python2.7/dist-packages/ceph_rest_api.py", line 106, >> >> > in >> >> > api_setup >> >> >app.ceph_cluster.connect() >> >> > File "rados.pyx", line 811, in rados.Rados.connect >> >> > (/tmp/buildd/ceph-11.2.0/obj-x >> >> > 86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:10178) >> >> > rados.ObjectNotFound: error connecting to the cluster >> >> >> >> # strace -eopen /bin/ceph-rest-api |& grep keyring >> >> open("/etc/ceph/ceph.client.restapi.keyring", O_RDONLY) = -1 ENOENT >> >> (No such file or directory) >> >> open("/etc/ceph/ceph.keyring", O_RDONLY) = -1 ENOENT (No such file or >> >> directory) >> >> open("/etc/ceph/keyring", O_RDONLY) = -1 ENOENT (No such file or >> >> directory) >> >> open("/etc/ceph/keyring.bin", O_RDONLY) = -1 ENOENT (No such file or >> >> directory) >> >> >> >> # ceph auth get-or-create client.restapi mon 'allow *' mds 'allow *' >> >> osd 'allow *' >/etc/ceph/ceph.client.restapi.keyring >> >> >> >> # /bin/ceph-rest-api >> >> * Running on http://0.0.0.0:5000/ >> >> >> >> > >> >> > >> >> > >> >> > Best wishes, >> >> > Mika >> >> > >> >> > >> >> > 2016-03-03 12:25 GMT+08:00 Shinobu Kinjo : >> >> >> >> >> >> Yes. >> >> >> >> >> >> On Wed, Jan 27, 2016 at 1:10 PM, Dan Mick wrote: >> >> >> > Is the client.test-admin key in the keyring read by ceph-rest-api? >> >> >> > >> >> >> > On 01/22/2016 04:05 PM, Shinobu Kinjo wrote: >> >> >> >> Does anyone have any idea about that? >> >> >> >> >> >> >> >> Rgds, >> >> >> >> Shinobu >> >> >> >> >> >> >> >> - Original Message - >> >> >> >> From: "Shinobu Kinjo" >> >> >> >> To: "ceph-users" >> >> >> >> Sent: Friday, January 22, 2016 7:15:36 AM >> >> >> >> Subject: ceph-rest-api's behavior >> >> >> >> >> >> >> >> Hello, >> >> >> >> >> >> >> >> "ceph-rest-api" works greatly with client.admin. >> >> >> >> But with client.test-admin which I created just after building >> >> >> >> the >> >> >> >> Ceph >> >> >> >> cluster , it does not work. >> >> >> >> >> >> >> >> ~$ ceph auth get-or-create client.test
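On the minimum-privilege question: I don't have a tested answer, but an untested starting point might be to grant execute on the mon, since the traceback earlier in this thread dies fetching command descriptions from the mon and plain 'r' does not seem to be enough for that -- treat this as a guess to experiment with, not a recommendation:

ceph auth get-or-create client.restapi mon 'allow rx' osd 'allow r' > /etc/ceph/ceph.client.restapi.keyring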
Re: [ceph-users] XFS attempt to access beyond end of device
On 22 March 2017 at 19:36, Brad Hubbard wrote: > On Wed, Mar 22, 2017 at 5:24 PM, Marcus Furlong wrote: >> [435339.965817] [ cut here ] >> [435339.965874] WARNING: at fs/xfs/xfs_aops.c:1244 >> xfs_vm_releasepage+0xcb/0x100 [xfs]() >> [435339.965876] Modules linked in: vfat fat uas usb_storage mpt3sas >> mpt2sas raid_class scsi_transport_sas mptctl mptbase iptable_filter >> dell_rbu team_mode_loadbalance team rpcrdma ib_isert iscsi_target_mod >> ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp >> scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad >> rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp >> intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul >> ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper >> cryptd ipmi_devintf iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas pcspkr >> ipmi_ssif sb_edac edac_core sg mei_me mei lpc_ich shpchp ipmi_si >> ipmi_msghandler wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd >> grace sunrpc ip_tables xfs sd_mod crc_t10dif crct10dif_generic mgag200 >> i2c_algo_bit >> [435339.965942] crct10dif_pclmul crct10dif_common drm_kms_helper >> crc32c_intel syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm >> bnx2x ahci libahci mlx5_core i2c_core libata mdio ptp megaraid_sas >> nvme pps_core libcrc32c fjes dm_mirror dm_region_hash dm_log dm_mod >> [435339.965991] CPU: 8 PID: 223 Comm: kswapd0 Not tainted >> 3.10.0-514.10.2.el7.x86_64 #1 >> [435339.965993] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS >> 2.3.4 11/08/2016 >> [435339.965994] 6ea9561d 881ffc2c7aa0 >> 816863ef >> [435339.965998] 881ffc2c7ad8 81085940 ea00015d4e20 >> ea00015d4e00 >> [435339.966000] 880f4d7c5af8 881ffc2c7da0 ea00015d4e00 >> 881ffc2c7ae8 >> [435339.966003] Call Trace: >> [435339.966010] [] dump_stack+0x19/0x1b >> [435339.966015] [] warn_slowpath_common+0x70/0xb0 >> [435339.966018] [] warn_slowpath_null+0x1a/0x20 >> [435339.966060] [] xfs_vm_releasepage+0xcb/0x100 [xfs] >> [435339.966120] [] try_to_release_page+0x32/0x50 >> [435339.966128] [] shrink_active_list+0x3d6/0x3e0 >> [435339.966133] [] shrink_lruvec+0x3f1/0x770 >> [435339.966138] [] shrink_zone+0x76/0x1a0 >> [435339.966143] [] balance_pgdat+0x48c/0x5e0 >> [435339.966147] [] kswapd+0x173/0x450 >> [435339.966155] [] ? wake_up_atomic_t+0x30/0x30 >> [435339.966158] [] ? balance_pgdat+0x5e0/0x5e0 >> [435339.966161] [] kthread+0xcf/0xe0 >> [435339.966165] [] ? kthread_create_on_node+0x140/0x140 >> [435339.966170] [] ret_from_fork+0x58/0x90 >> [435339.966173] [] ? kthread_create_on_node+0x140/0x140 >> [435339.966175] ---[ end trace 58233bbca77fd5e2 ]--- > > With regards to the above stack trace, > https://bugzilla.redhat.com/show_bug.cgi?id=1079818 was opened, and > remains open, for the same stack. I would suggest discussing this > issue with your kernel support organisation as it is likely unrelated > to the sizing issue IIUC. Hi Brad, Thanks for clarifying that. That bug is not public. Is there any workaround mentioned in it? Cheers, Marcus. -- Marcus Furlong ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] XFS attempt to access beyond end of device
On Tue, Mar 28, 2017 at 4:22 PM, Marcus Furlong wrote: > On 22 March 2017 at 19:36, Brad Hubbard wrote: >> On Wed, Mar 22, 2017 at 5:24 PM, Marcus Furlong wrote: > >>> [435339.965817] [ cut here ] >>> [435339.965874] WARNING: at fs/xfs/xfs_aops.c:1244 >>> xfs_vm_releasepage+0xcb/0x100 [xfs]() >>> [435339.965876] Modules linked in: vfat fat uas usb_storage mpt3sas >>> mpt2sas raid_class scsi_transport_sas mptctl mptbase iptable_filter >>> dell_rbu team_mode_loadbalance team rpcrdma ib_isert iscsi_target_mod >>> ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp >>> scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad >>> rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp >>> intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul >>> ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper >>> cryptd ipmi_devintf iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas pcspkr >>> ipmi_ssif sb_edac edac_core sg mei_me mei lpc_ich shpchp ipmi_si >>> ipmi_msghandler wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd >>> grace sunrpc ip_tables xfs sd_mod crc_t10dif crct10dif_generic mgag200 >>> i2c_algo_bit >>> [435339.965942] crct10dif_pclmul crct10dif_common drm_kms_helper >>> crc32c_intel syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm >>> bnx2x ahci libahci mlx5_core i2c_core libata mdio ptp megaraid_sas >>> nvme pps_core libcrc32c fjes dm_mirror dm_region_hash dm_log dm_mod >>> [435339.965991] CPU: 8 PID: 223 Comm: kswapd0 Not tainted >>> 3.10.0-514.10.2.el7.x86_64 #1 >>> [435339.965993] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS >>> 2.3.4 11/08/2016 >>> [435339.965994] 6ea9561d 881ffc2c7aa0 >>> 816863ef >>> [435339.965998] 881ffc2c7ad8 81085940 ea00015d4e20 >>> ea00015d4e00 >>> [435339.966000] 880f4d7c5af8 881ffc2c7da0 ea00015d4e00 >>> 881ffc2c7ae8 >>> [435339.966003] Call Trace: >>> [435339.966010] [] dump_stack+0x19/0x1b >>> [435339.966015] [] warn_slowpath_common+0x70/0xb0 >>> [435339.966018] [] warn_slowpath_null+0x1a/0x20 >>> [435339.966060] [] xfs_vm_releasepage+0xcb/0x100 [xfs] >>> [435339.966120] [] try_to_release_page+0x32/0x50 >>> [435339.966128] [] shrink_active_list+0x3d6/0x3e0 >>> [435339.966133] [] shrink_lruvec+0x3f1/0x770 >>> [435339.966138] [] shrink_zone+0x76/0x1a0 >>> [435339.966143] [] balance_pgdat+0x48c/0x5e0 >>> [435339.966147] [] kswapd+0x173/0x450 >>> [435339.966155] [] ? wake_up_atomic_t+0x30/0x30 >>> [435339.966158] [] ? balance_pgdat+0x5e0/0x5e0 >>> [435339.966161] [] kthread+0xcf/0xe0 >>> [435339.966165] [] ? kthread_create_on_node+0x140/0x140 >>> [435339.966170] [] ret_from_fork+0x58/0x90 >>> [435339.966173] [] ? kthread_create_on_node+0x140/0x140 >>> [435339.966175] ---[ end trace 58233bbca77fd5e2 ]--- >> >> With regards to the above stack trace, >> https://bugzilla.redhat.com/show_bug.cgi?id=1079818 was opened, and >> remains open, for the same stack. I would suggest discussing this >> issue with your kernel support organisation as it is likely unrelated >> to the sizing issue IIUC. > > Hi Brad, > > Thanks for clarifying that. That bug is not public. Is there any > workaround mentioned in it? No, there isn't. The upstream fix is http://oss.sgi.com/pipermail/xfs/2016-July/050281.html > > Cheers, > Marcus. > > -- > Marcus Furlong -- Cheers, Brad ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com