Re: [ceph-users] radosgw: scrub causing slow requests in the md log
I'm now running the three relevant OSDs with that patch. (Recompiled, replaced /usr/lib64/rados-classes/libcls_log.so with the new version, then restarted the osds). It's working quite well, trimming 10 entries at a time instead of 1000, and no more timeouts. Do you think it would be worth decreasing this hardcoded value in ceph proper? -- Dan On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley wrote: > That patch looks reasonable. You could also try raising the values of > osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on > that osd in order to trim more at a time. > > > On 06/21/2017 09:27 AM, Dan van der Ster wrote: >> >> Hi Casey, >> >> I managed to trim up all shards except for that big #54. The others >> all trimmed within a few seconds. >> >> But 54 is proving difficult. It's still going after several days, and >> now I see that the 1000-key trim is indeed causing osd timeouts. I've >> manually compacted the relevant osd leveldbs, but haven't found any >> way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims >> per op locks things up for quite awhile. >> >> I'm thinking of running those ceph-osd's with this patch: >> >> # git diff >> diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc >> index 89745bb..7dcd933 100644 >> --- a/src/cls/log/cls_log.cc >> +++ b/src/cls/log/cls_log.cc >> @@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, >> bufferlist *in, bufferlist *o >> to_index = op.to_marker; >> } >> >> -#define MAX_TRIM_ENTRIES 1000 >> +#define MAX_TRIM_ENTRIES 10 >> size_t max_entries = MAX_TRIM_ENTRIES; >> >> int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, >> max_entries, &keys); >> >> >> What do you think? >> >> -- Dan >> >> >> >> >> On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley wrote: >>> >>> Hi Dan, >>> >>> That's good news that it can remove 1000 keys at a time without hitting >>> timeouts. The output of 'du' will depend on when the leveldb compaction >>> runs. If you do find that compaction leads to suicide timeouts on this >>> osd >>> (you would see a lot of 'leveldb:' output in the log), consider running >>> offline compaction by adding 'leveldb compact on mount = true' to the osd >>> config and restarting. >>> >>> Casey >>> >>> >>> On 06/19/2017 11:01 AM, Dan van der Ster wrote: On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote: > > On 06/14/2017 05:59 AM, Dan van der Ster wrote: >> >> Dear ceph users, >> >> Today we had O(100) slow requests which were caused by deep-scrubbing >> of the metadata log: >> >> 2017-06-14 11:07:55.373184 osd.155 >> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d >> deep-scrub starts >> ... >> 2017-06-14 11:22:04.143903 osd.155 >> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow >> request 480.140904 seconds old, received at 2017-06-14 >> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d >> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc >> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for >> scrub >> ... >> 2017-06-14 11:22:06.729306 osd.155 >> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d >> deep-scrub ok >> >> We have log_meta: true, log_data: false on this (our only) region [1], >> which IIRC we setup to enable indexless buckets. >> >> I'm obviously unfamiliar with rgw meta and data logging, and have a >> few questions: >> >> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to >> turn >> it off when not using multisite? 
> > > It's a good idea to turn that off, yes. > > First, make sure that you have configured a default > realm/zonegroup/zone: > > $ radosgw-admin realm default --rgw-realm (you can > determine > realm name from 'radosgw-admin realm list') > $ radosgw-admin zonegroup default --rgw-zonegroup default > $ radosgw-admin zone default --rgw-zone default > Thanks. This had already been done, as confirmed with radosgw-admin realm get-default. > Then you can modify the zonegroup (aka region): > > $ radosgw-admin zonegroup get > zonegroup.json > $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json > $ radosgw-admin zonegroup set < zonegroup.json > > Then commit the updated period configuration: > > $ radosgw-admin period update --commit > > Verify that the resulting period contains "log_meta": "false". Take > care > with future radosgw-admin commands on the zone/zonegroup, as they may > revert > log_meta back to true [1]. > Great, this worked. FYI (and for others trying this in future), the period update --commit blocks all rgws for ~30s while they reload the realm. >> 2. I started dumping the output of radosgw-adm
Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals
> Keep in mind that 1.6TB P4600 is going to last about as long as your 400GB > P3700, so if wear-out is a concern, don't put more stress on them. > I've been looking at the 2T ones, but it's about the same as the 400G P3700 > Also the P4600 is only slightly faster in writes than the P3700, so that's > where putting more workload onto them is going to be a notable issue. The latency is somewhat worse than the P3700. When you're talking about a journal device, latency will be more important than bandwidth, especially on small and/or sync writes. > >> I've seen some talk on here regarding this, but wanted to throw an idea >> around. I was okay throwing away 280GB of fast capacity for the purpose of >> providing reliable journals. But with as much free capacity as we'd have >> with a 4600, maybe I could use that extra capacity as a cache tier for >> writes on an rbd ec pool. If I wanted to go that route, I'd probably >> replace several existing 3700s with 4600s to get additional cache capacity. >> But, that sounds risky... >> > Risky as in high failure domain concentration and as mentioned above a > cache-tier with obvious inline journals and thus twice the bandwidth needs > will likely eat into the write speed capacity of the journals. I tend to agree. Also the cache tier only starts to be interesting if it's big enough overall... If you have to keep promoting/demoting because it's full it'll kill the whole cluster very quickly. > > If (and seems to be a big IF) you can find them, the Samsung PM1725a 1.6TB > seems to be a) cheaper and b) at 2GB/s write speed more likely to be > suitable for double duty. > Similar (slightly better on paper) endurance than the P4600, so keep that > in mind, too. As I'm more than happy with the 400G size, and given the price of the P4600 2T, for slightly more (10%) I'm considering the P4800X. This is for a full SSD cluster. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On Wed, Jun 21, 2017 at 4:16 PM, Peter Maloney wrote: > On 06/14/17 11:59, Dan van der Ster wrote: >> Dear ceph users, >> >> Today we had O(100) slow requests which were caused by deep-scrubbing >> of the metadata log: >> >> 2017-06-14 11:07:55.373184 osd.155 >> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d >> deep-scrub starts >> ... >> 2017-06-14 11:22:04.143903 osd.155 >> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow >> request 480.140904 seconds old, received at 2017-06-14 >> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d >> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc >> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for >> scrub >> ... >> 2017-06-14 11:22:06.729306 osd.155 >> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d >> deep-scrub ok > > This looks just like my problem in my thread on ceph-devel "another > scrub bug? blocked for > 10240.948831 secs" except that your scrub > eventually finished (mine ran hours before I stopped it manually), and > I'm not using rgw. > > Sage commented that it is likely related to snaps being removed at some > point and interacting with scrub. > > Restarting the osd that is mentioned there (osd.155 in your case) will > fix it for now. And tuning scrub changes the way it behaves (defaults > make it happen more rarely than what I had before). In my case it's not related to snaps -- there are no snaps (or trimming) in a (normal) set of rgw pools. My problem is about the cls_log class, which tries to do a lot of work in one op, timing out the osds. Well, the *real* problem in my case is about this rgw mdlog, which can grow unboundedly, then eventually become un-scrubbable, leading to this huge amount of cleanup to be done. -- dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Re: Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel
I set mon_data to “/home/ceph/software/ceph/var/lib/ceph/mon”, and its owner has always been “ceph” since we were running Hammer. I also tried setting the permissions to “777”, but that didn’t work either. From: Linh Vu [mailto:v...@unimelb.edu.au] Sent: 22 June 2017 14:26 To: 许雪寒; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel Permissions of your mon data directory under /var/lib/ceph/mon/ might have changed as part of the Hammer -> Jewel upgrade. Have you had a look there? From: ceph-users on behalf of 许雪寒 Sent: Thursday, 22 June 2017 3:32:45 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel Hi, everyone. I upgraded one of our ceph clusters from Hammer to Jewel. After upgrading, I can’t start ceph-mon through “systemctl start ceph-mon@ceph1”, while, on the other hand, I can start ceph-mon, either as user ceph or root, if I directly call “/usr/bin/ceph-mon --cluster ceph --id ceph1 --setuser ceph --setgroup ceph”. I looked in “/var/log/messages” and found that the reason systemctl can’t start ceph-mon is that ceph-mon can’t access its configured data directory. Why can’t ceph-mon access its data directory when it’s called by systemctl? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals
Hi, One of the benefits of PCIe NVMe is that it does not take a disk slot, resulting in a higher density. For example a 6048R-E1CR36N with 3x PCIe NVMe yields 36 OSDs per servers (12 OSD per NVMe) where it yields 30 OSDs per server if using SATA SSDs (6 OSDs per SSD). Since you say that you used 10% of P3700 endurance in 1 year (7.3PB endurance, so 0.73PB/year), so a 400GB P3600 would work for 3 years. Maybe good enough until BlueStore is more stable. Cheers, Maxime On Thu, 22 Jun 2017 at 03:59 Christian Balzer wrote: > > Hello, > > Hmm, gmail client not grokking quoting these days? > > On Wed, 21 Jun 2017 20:40:48 -0500 Brady Deetz wrote: > > > On Jun 21, 2017 8:15 PM, "Christian Balzer" wrote: > > > > On Wed, 21 Jun 2017 19:44:08 -0500 Brady Deetz wrote: > > > > > Hello, > > > I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I > have > > 12 > > > osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe > > > drives providing 10GB journals for groups of 12 6TB spinning rust > drives > > > and 2x lacp 40gbps ethernet. > > > > > > Our hardware provider is recommending that we start deploying P4600 > drives > > > in place of our P3700s due to availability. > > > > > Welcome to the club and make sure to express your displeasure about > > Intel's "strategy" to your vendor. > > > > The P4600s are a poor replacement for P3700s and also still just > > "announced" according to ARK. > > > > Are you happy with your current NVMes? > > Firstly as in, what is their wearout, are you expecting them to easily > > survive 5 years at the current rate? > > Secondly, how about speed? with 12 HDDs and 1GB/s write capacity of the > > NVMe I'd expect them to not be a bottleneck in nearly all real life > > situations. > > > > Keep in mind that 1.6TB P4600 is going to last about as long as your > 400GB > > P3700, so if wear-out is a concern, don't put more stress on them. > > > > > > Oddly enough, the Intel tools are telling me that we've only used about > 10% > > of each drive's endurance over the past year. This honestly surprises me > > due to our workload, but maybe I'm thinking my researchers are doing more > > science than they actually are. > > > That's pretty impressive still, but also lets you do numbers as to what > kind of additional load you _may_ be able to consider, obviously not more > than twice the current amount to stay within 5 years before wearing > them out. > > > > > > Also the P4600 is only slightly faster in writes than the P3700, so > that's > > where putting more workload onto them is going to be a notable issue. > > > > > I've seen some talk on here regarding this, but wanted to throw an idea > > > around. I was okay throwing away 280GB of fast capacity for the > purpose of > > > providing reliable journals. But with as much free capacity as we'd > have > > > with a 4600, maybe I could use that extra capacity as a cache tier for > > > writes on an rbd ec pool. If I wanted to go that route, I'd probably > > > replace several existing 3700s with 4600s to get additional cache > > capacity. > > > But, that sounds risky... > > > > > Risky as in high failure domain concentration and as mentioned above a > > cache-tier with obvious inline journals and thus twice the bandwidth > needs > > will likely eat into the write speed capacity of the journals. > > > > > > Agreed. On the topic of journals and double bandwidth, am I correct in > > thinking that btrfs (as insane as it may be) does not require double > > bandwidth like xfs? 
Furthermore with bluestore being close to stable, > will > > my architecture need to change? > > > BTRFS at this point is indeed a bit insane, given the current levels of > support, issues (search the ML archives) and future developments. > And you'll still wind up with double writes most likely, IIRC. > > These aspects of Bluestore have been discussed here recently, too. > Your SSD/NVMe space requirements will go down, but if you want to have the > same speeds and more importantly low latencies you'll wind up with all > writes going through them again, so endurance wise you're still in that > "Lets make SSDs great again" hellhole. > > > > > If (and seems to be a big IF) you can find them, the Samsung PM1725a > 1.6TB > > seems to be a) cheaper and b) at 2GB/s write speed more likely to be > > suitable for double duty. > > Similar (slightly better on paper) endurance than then P4600, so keep > that > > in mind, too. > > > > > > My vendor is an HPC vendor so /maybe/ they have access to these elusive > > creatures. In which case, how many do you want? Haha > > > I was just looking at availability with a few google searches, our current > needs are amply satisfied with S37xx SSDs, no need for NVMes really. > But as things are going, maybe I'll be forced to Optane and friends simply > by lack of alternatives. > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Rakuten Communications > _
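A quick back-of-the-envelope check of the endurance arithmetic Maxime used above (assuming Intel's quoted ~7.3 PBW rating for the 400GB P3700, which matches the figure in this thread, and ~2.19 PBW for the 400GB P3600 -- verify that figure against ARK before relying on it):

# 10% of a 7.3 PBW device consumed in one year:
echo "7.3 * 0.10" | bc -l    # ~0.73 PB written per year per journal device
# Years a 400GB P3600 (~2.19 PBW) would last at the same write rate:
echo "2.19 / 0.73" | bc -l   # ~3.0 years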
Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals
On 22-6-2017 03:59, Christian Balzer wrote: >> Agreed. On the topic of journals and double bandwidth, am I correct in >> thinking that btrfs (as insane as it may be) does not require double >> bandwidth like xfs? Furthermore with bluestore being close to stable, will >> my architecture need to change? >> > BTRFS at this point is indeed a bit insane, given the current levels of > support, issues (search the ML archives) and future developments. > And you'll still wind up with double writes most likely, IIRC. > > These aspects of Bluestore have been discussed here recently, too. > Your SSD/NVMe space requirements will go down, but if you want to have the > same speeds and more importantly low latencies you'll wind up with all > writes going through them again, so endurance wise you're still in that > "Lets make SSDs great again" hellhole. Please note that I know little about btrfs, but its sister ZFS can include caching/log devices transparently in its architecture. And even better, they are allowed to fail without much of a problem. :) Now the problem I have is that Ceph first journals the writes to its log, then hands the write over to ZFS, where it gets logged again. So those are 2 writes (and in the case of ZFS, the log only gets read if the filesystem had a crash). The thing about ZFS is that the journal log need not be very big: about 5 seconds of maximum required disk writes. I have them at 1GB and they have never filled up yet. But the bandwidth used is going to be doubled due to double the amount of writes. If btrfs logging is anything like this, then you have to look at how you architect the filesystems/devices underlying Ceph. --WjW ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Does CephFS support SELinux?
Hi, Does CephFS support SELinux? I have this issue with OpenShift (with SELinux) + CephFS: http://lists.openshift.redhat.com/openshift-archives/users/2017-June/msg00116.html Best regards, Stéphane -- Stéphane Klein blog: http://stephane-klein.info cv : http://cv.stephane-klein.info Twitter: http://twitter.com/klein_stephane ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Does CephFS support SELinux?
On Thu, Jun 22, 2017 at 10:25 AM, Stéphane Klein wrote: > Hi, > > Does CephFS support SELinux? > > I have this issue with OpenShift (with SELinux) + CephFS: > http://lists.openshift.redhat.com/openshift-archives/users/2017-June/msg00116.html We do test running CephFS server and client bits on machines where selinux is enabled, but we don't test doing selinux stuff inside the filesystem (setting labels etc). As far as I know, the comments in http://tracker.ceph.com/issues/13231 are still relevant. John > Best regards, > Stéphane > -- > Stéphane Klein > blog: http://stephane-klein.info > cv : http://cv.stephane-klein.info > Twitter: http://twitter.com/klein_stephane > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Does CephFS support SELinux?
2017-06-22 11:48 GMT+02:00 John Spray : > On Thu, Jun 22, 2017 at 10:25 AM, Stéphane Klein > wrote: > > Hi, > > > > Does CephFS support SELinux? > > > > I have this issue with OpenShift (with SELinux) + CephFS: > > http://lists.openshift.redhat.com/openshift-archives/users/ > 2017-June/msg00116.html > > We do test running CephFS server and client bits on machines where > selinux is enabled, but we don't test doing selinux stuff inside the > filesystem (setting labels etc). As far as I know, the comments in > http://tracker.ceph.com/issues/13231 are still relevant. > > # mount -t ceph ceph-test-1:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret # touch /mnt/mycephfs/foo # ls /mnt/mycephfs/ -lZ -rw-r--r-- root root ?foo # chcon system_u:object_r:admin_home_t:s0 /mnt/mycephfs/foo chcon: failed to change context of ‘/mnt/mycephfs/foo’ to ‘system_u:object_r:admin_home_t:s0’: Operation not supported Then SELinux isn't supported with CephFS volume :( ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries
Looks like I’ve now got a consistent repro scenario, please find the gory details here http://tracker.ceph.com/issues/20380 Thanks! On 20/06/17, 2:04 PM, "Pavan Rallabhandi" wrote: Hi Orit, No, we do not use multi-site. Thanks, -Pavan. From: Orit Wasserman Date: Tuesday, 20 June 2017 at 12:49 PM To: Pavan Rallabhandi Cc: "ceph-users@lists.ceph.com" Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries Hi Pavan, On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi wrote: Trying one more time with ceph-users On 19/06/17, 11:07 PM, "Pavan Rallabhandi" wrote: On many of our clusters running Jewel (10.2.5+), am running into a strange problem of having stale bucket index entries left over for (some of the) objects deleted. Though it is not reproducible at will, it has been pretty consistent of late and am clueless at this point for the possible reasons to happen so. The symptoms are that the actual delete operation of an object is reported successful in the RGW logs, but a bucket list on the container would still show the deleted object. An attempt to download/stat of the object appropriately results in a failure. No failures are seen in the respective OSDs where the bucket index object is located. And rebuilding the bucket index by running ‘radosgw-admin bucket check –fix’ would fix the issue. Though I could simulate the problem by instrumenting the code, to not to have invoked `complete_del` on the bucket index op https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, but that call is always seem to be made unless there is a cascading error from the actual delete operation of the object, which doesn’t seem to be the case here. I wanted to know the possible reasons where the bucket index would be left in such limbo, any pointers would be much appreciated. FWIW, we are not sharding the buckets and very recently I’ve seen this happen with buckets having as low as < 10 objects, and we are using swift for all the operations. Do you use multisite? Regards, Orit Thanks, -Pavan. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
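For anyone else hitting the same stale-entry symptom, the rebuild step Pavan mentions looks roughly like this (a sketch only; the bucket name is a placeholder):

# Report inconsistencies first, then rebuild the bucket index:
radosgw-admin bucket check --bucket=<bucket-name>
radosgw-admin bucket check --bucket=<bucket-name> --check-objects --fix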
Re: [ceph-users] VMware + CEPH Integration
> -Original Message- > From: Adrian Saul [mailto:adrian.s...@tpgtelecom.com.au] > Sent: 19 June 2017 06:54 > To: n...@fisk.me.uk; 'Alex Gorbachev' > Cc: 'ceph-users' > Subject: RE: [ceph-users] VMware + CEPH Integration > > > Hi Alex, > > > > Have you experienced any problems with timeouts in the monitor action > > in pacemaker? Although largely stable, every now and again in our > > cluster the FS and Exportfs resources timeout in pacemaker. There's no > > mention of any slow requests or any peering..etc from the ceph logs so it's > a bit of a mystery. > > Yes - we have that in our setup which is very similar. Usually I find it > related > to RBD device latency due to scrubbing or similar but even when tuning > some of that down we still get it randomly. > > The most annoying part is that once it comes up, having to use "resource > cleanup" to try and remove the failed usually has more impact than the > actual error. Are you using Stonith? Pacemaker should be able to recover from any sort of failure as long as it can bring the cluster into a known state. I'm still struggling to get to the bottom of it in our environment. When it happens, every RBD on the same client host seems to hang, but all other hosts are fine. This seems to suggest it's not a Ceph cluster issue/performance, as this would affect the majority of RBD's and not just ones on a single client. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals
Christian and everyone else have expertly responded to the SSD capabilities, pros, and cons so I'll ignore that. I believe you were saying that it was risky to swap out your existing journals to a new journal device. That is actually a very simple operation that can be scripted to only take minutes per node with no risk to data. You just stop the osd, flush the journal, delete the old journal partition, create the new partition with the same guid, initialize the journal, and start the osd. On Wed, Jun 21, 2017, 8:44 PM Brady Deetz wrote: > Hello, > I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have > 12 osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe > drives providing 10GB journals for groups of 12 6TB spinning rust drives > and 2x lacp 40gbps ethernet. > > Our hardware provider is recommending that we start deploying P4600 drives > in place of our P3700s due to availability. > > I've seen some talk on here regarding this, but wanted to throw an idea > around. I was okay throwing away 280GB of fast capacity for the purpose of > providing reliable journals. But with as much free capacity as we'd have > with a 4600, maybe I could use that extra capacity as a cache tier for > writes on an rbd ec pool. If I wanted to go that route, I'd probably > replace several existing 3700s with 4600s to get additional cache capacity. > But, that sounds risky... > > What do you guys think? > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
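A minimal sketch of the journal swap David describes above (the OSD id, device names and partition details are placeholders; assumes a healthy cluster, with noout set for the duration):

ceph osd set noout                    # avoid rebalancing while the OSD is briefly down
systemctl stop ceph-osd@<id>          # or the equivalent init command on pre-systemd hosts
ceph-osd -i <id> --flush-journal      # flush pending journal entries to the filestore
# create the new journal partition on the replacement NVMe and point the OSD's
# journal symlink / journal_uuid at it (e.g. with sgdisk or parted)
ceph-osd -i <id> --mkjournal          # initialize the new journal
systemctl start ceph-osd@<id>
ceph osd unset noout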
Re: [ceph-users] SSD OSD's Dual Use
I wouldn't see this as problematic at all. As long as you're watching the disk utilizations and durability, those are the only factors that would eventually tell you that they are busy enough. On Thu, Jun 22, 2017, 1:36 AM Ashley Merrick wrote: > Hello, > > > Currently have a pool of SSD's running as a Cache in front of a EC Pool. > > > The cache is very under used and the SSD's spend most time idle, would > like to create a small SSD Pool for a selection of very small RBD disk's as > scratch disks within the OS, should I expect any issues running the two > pool's (Cache + RBD Data) on the same set of SSD's? > > > ,Ashley > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 答复: Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel
Did you previously edit the init scripts to look in your custom location? Those could have been overwritten. As was mentioned, Jewel changed what user the daemon runs as, but you said that you tested running the daemon manually under the ceph user? Was this without sudo? It used to run as root under Hammer and would have needed to be chown'd recursively to allow the ceph user to run it. On Thu, Jun 22, 2017, 4:39 AM 许雪寒 wrote: > I set mon_data to “/home/ceph/software/ceph/var/lib/ceph/mon”, and its > owner has always been “ceph” since we were running Hammer. > And I also tried to set the permission to “777”, it also didn’t work. > > > 发件人: Linh Vu [mailto:v...@unimelb.edu.au] > 发送时间: 2017年6月22日 14:26 > 收件人: 许雪寒; ceph-users@lists.ceph.com > 主题: Re: [ceph-users] Can't start ceph-mon through systemctl start > ceph-mon@.service > after upgrading from Hammer to Jewel > > Permissions of your mon data directory under /var/lib/ceph/mon/ might have > changed as part of Hammer -> Jewel upgrade. Have you had a look there? > > From: ceph-users on behalf of 许雪寒 < > xuxue...@360.cn> > Sent: Thursday, 22 June 2017 3:32:45 PM > To: ceph-users@lists.ceph.com > Subject: [ceph-users] Can't start ceph-mon through systemctl start > ceph-mon@.service after upgrading from Hammer to Jewel > > Hi, everyone. > > I upgraded one of our ceph clusters from Hammer to Jewel. After upgrading, > I can’t start ceph-mon through “systemctl start ceph-mon@ceph1”, while, > on the other hand, I can start ceph-mon, either as user ceph or root, if I > directly call “/usr/bin/ceph-mon –cluster ceph –id ceph1 –setuser ceph > –setgroup ceph”. I looked “/var/log/messages”, and find that the reason > systemctl can’t start ceph-mon is that ceph-mon can’t access its configured > data directory. Why ceph-mon can’t access its data directory when its > called by systemctl? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
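A quick way to check the two things mentioned above (paths shown are the defaults; adjust for the custom mon_data under /home, and treat the ProtectHome remark as a guess to verify rather than a confirmed diagnosis):

ls -ld /var/lib/ceph/mon/ceph-ceph1        # Jewel runs the daemon as 'ceph', so this must be ceph-owned
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ceph1
systemctl cat ceph-mon@ceph1               # check ExecStart/setuser, and whether the unit sets
                                           # ProtectHome=true, which would hide /home from the daemon
                                           # regardless of filesystem permissions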
Re: [ceph-users] Config parameters for system tuning
Looking at the sources, the config values were in Hammer but not Jewel. For journal config I recommend that journal_queue_max_ops and journal_queue_max_bytes be removed from the docs: http://docs.ceph.com/docs/master/rados/configuration/journal-ref/ Also for the added filestore throttling params: filestore_queue_max_delay_multiple filestore_queue_high_delay_multiple filestore_queue_low_threshhold filestore_queue_high_threshhold again it would be good to update the docs: http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/ I guess all eyes are on Bluestore now :) Maged Mokhtar PetaSAN -- From: "Maged Mokhtar" Sent: Wednesday, June 21, 2017 12:33 AM To: Subject: [ceph-users] Config parameters for system tuning Hi, 1) I am trying to set some of the following config values which seem to be present in most config examples relating to performance tuning: journal_queue_max_ops journal_queue_max_bytes filestore_queue_committing_max_bytes filestore_queue_committing_max_ops I am using 10.2.7 but am not able to set these parameters either via conf file or injection; also, ceph --show-config does not list them. Have they been deprecated and should be ignored? 2) For osd_op_threads I have seen some examples (not the official docs) fixing this to the number of CPU cores; is this the best recommendation or could we use more threads than cores? Cheers Maged Mokhtar PetaSAN ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
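A simple way to confirm whether a given option still exists in the running release (using the option names from this thread as examples):

# Options removed from the code won't appear in the daemon's config at all:
ceph daemon osd.0 config show | grep -E 'journal_queue_max|filestore_queue'
# or, without touching a running daemon:
ceph --show-config | grep -E 'journal_queue_max|filestore_queue'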
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On 06/22/2017 04:00 AM, Dan van der Ster wrote: I'm now running the three relevant OSDs with that patch. (Recompiled, replaced /usr/lib64/rados-classes/libcls_log.so with the new version, then restarted the osds). It's working quite well, trimming 10 entries at a time instead of 1000, and no more timeouts. Do you think it would be worth decreasing this hardcoded value in ceph proper? -- Dan I do, yeah. At least, the trim operation should be able to pass in its own value for that. I opened a ticket for that at http://tracker.ceph.com/issues/20382. I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE operation to trim a range of keys in a single osd op, instead of generating a different op for each key. I have a PR that does this at https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee that leveldb can process the entire range inside of the suicide timeout. Casey On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley wrote: That patch looks reasonable. You could also try raising the values of osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on that osd in order to trim more at a time. On 06/21/2017 09:27 AM, Dan van der Ster wrote: Hi Casey, I managed to trim up all shards except for that big #54. The others all trimmed within a few seconds. But 54 is proving difficult. It's still going after several days, and now I see that the 1000-key trim is indeed causing osd timeouts. I've manually compacted the relevant osd leveldbs, but haven't found any way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims per op locks things up for quite awhile. I'm thinking of running those ceph-osd's with this patch: # git diff diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc index 89745bb..7dcd933 100644 --- a/src/cls/log/cls_log.cc +++ b/src/cls/log/cls_log.cc @@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *o to_index = op.to_marker; } -#define MAX_TRIM_ENTRIES 1000 +#define MAX_TRIM_ENTRIES 10 size_t max_entries = MAX_TRIM_ENTRIES; int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys); What do you think? -- Dan On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley wrote: Hi Dan, That's good news that it can remove 1000 keys at a time without hitting timeouts. The output of 'du' will depend on when the leveldb compaction runs. If you do find that compaction leads to suicide timeouts on this osd (you would see a lot of 'leveldb:' output in the log), consider running offline compaction by adding 'leveldb compact on mount = true' to the osd config and restarting. Casey On 06/19/2017 11:01 AM, Dan van der Ster wrote: On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley wrote: On 06/14/2017 05:59 AM, Dan van der Ster wrote: Dear ceph users, Today we had O(100) slow requests which were caused by deep-scrubbing of the metadata log: 2017-06-14 11:07:55.373184 osd.155 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d deep-scrub starts ... 2017-06-14 11:22:04.143903 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow request 480.140904 seconds old, received at 2017-06-14 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc 0=[] ondisk+write+known_if_redirected e7752) currently waiting for scrub ... 
2017-06-14 11:22:06.729306 osd.155 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d deep-scrub ok We have log_meta: true, log_data: false on this (our only) region [1], which IIRC we setup to enable indexless buckets. I'm obviously unfamiliar with rgw meta and data logging, and have a few questions: 1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn it off when not using multisite? It's a good idea to turn that off, yes. First, make sure that you have configured a default realm/zonegroup/zone: $ radosgw-admin realm default --rgw-realm (you can determine realm name from 'radosgw-admin realm list') $ radosgw-admin zonegroup default --rgw-zonegroup default $ radosgw-admin zone default --rgw-zone default Thanks. This had already been done, as confirmed with radosgw-admin realm get-default. Then you can modify the zonegroup (aka region): $ radosgw-admin zonegroup get > zonegroup.json $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json $ radosgw-admin zonegroup set < zonegroup.json Then commit the updated period configuration: $ radosgw-admin period update --commit Verify that the resulting period contains "log_meta": "false". Take care with future radosgw-admin commands on the zone/zonegroup, as they may revert log_meta back to true [1]. Great, this worked. FYI (and for others trying this in future), the period update --commit blocks all rgws for ~30s while they reload the realm. 2. I started dumping the output of radosgw-admin mdlog list,
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley wrote: > > On 06/22/2017 04:00 AM, Dan van der Ster wrote: >> >> I'm now running the three relevant OSDs with that patch. (Recompiled, >> replaced /usr/lib64/rados-classes/libcls_log.so with the new version, >> then restarted the osds). >> >> It's working quite well, trimming 10 entries at a time instead of >> 1000, and no more timeouts. >> >> Do you think it would be worth decreasing this hardcoded value in ceph >> proper? >> >> -- Dan > > > I do, yeah. At least, the trim operation should be able to pass in its own > value for that. I opened a ticket for that at > http://tracker.ceph.com/issues/20382. > > I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE > operation to trim a range of keys in a single osd op, instead of generating > a different op for each key. I have a PR that does this at > https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee > that leveldb can process the entire range inside of the suicide timeout. I wonder if that would help. Here's what I've learned today: * two of the 3 relevant OSDs have something screwy with their leveldb. The primary and 3rd replica are ~quick at trimming for only a few hundred keys, whilst the 2nd OSD is very very fast always. * After manually compacting the two slow OSDs, they are fast again for just a few hundred trims. So I'm compacting, trimming, ..., in a loop now. * I moved the omaps to SSDs -- doesn't help. (iostat confirms this is not IO bound). * CPU util on the slow OSDs gets quite high during the slow trimming. * perf top is below [1]. leveldb::Block::Iter::Prev and leveldb::InternalKeyComparator::Compare are notable. * The always fast OSD shows no leveldb functions in perf top while trimming. I've tried bigger leveldb cache and block sizes, compression on and off, and played with the bloom size up to 14 bits -- none of these changes make any difference. At this point I'm not confident this trimming will ever complete -- there are ~20 million records to remove at maybe 1Hz. How about I just delete the meta.log object? Would this use a different, perhaps quicker, code path to remove those omap keys? Thanks! Dan [1] 4.92% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023e8d 4.47% libc-2.17.so [.] __memcmp_sse4_1 4.13% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x000273bb 3.81% libleveldb.so.1.0.7 [.] leveldb::Block::Iter::Prev 3.07% libc-2.17.so [.] __memcpy_ssse3_back 2.84% [kernel] [k] port_inb 2.77% libstdc++.so.6.0.19 [.] std::string::_M_mutate 2.75% libstdc++.so.6.0.19 [.] std::string::append 2.53% libleveldb.so.1.0.7 [.] leveldb::InternalKeyComparator::Compare 1.32% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023e77 0.85% [kernel] [k] _raw_spin_lock 0.80% libleveldb.so.1.0.7 [.] leveldb::Block::Iter::Next 0.77% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023a05 0.67% libleveldb.so.1.0.7 [.] leveldb::MemTable::KeyComparator::operator() 0.61% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023a09 0.58% libleveldb.so.1.0.7 [.] leveldb::MemTableIterator::Prev 0.51% [kernel] [k] __schedule 0.48% libruby.so.2.1.0 [.] ruby_yyparse ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
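For reference, the compact-then-trim loop described above has roughly this shape (a sketch only: substitute the exact 'radosgw-admin mdlog trim' arguments used for the other shards, and note that 'ceph tell osd.N compact' may not be available on Jewel, where the 'leveldb compact on mount = true' plus restart approach Casey mentioned is the fallback):

while true; do
    ceph tell osd.155 compact                   # or restart the OSD with compact-on-mount set
    radosgw-admin mdlog trim <same shard/marker arguments as used for the other shards>
    sleep 60
done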
Re: [ceph-users] radosgw: scrub causing slow requests in the md log
On 06/22/2017 10:40 AM, Dan van der Ster wrote: On Thu, Jun 22, 2017 at 4:25 PM, Casey Bodley wrote: On 06/22/2017 04:00 AM, Dan van der Ster wrote: I'm now running the three relevant OSDs with that patch. (Recompiled, replaced /usr/lib64/rados-classes/libcls_log.so with the new version, then restarted the osds). It's working quite well, trimming 10 entries at a time instead of 1000, and no more timeouts. Do you think it would be worth decreasing this hardcoded value in ceph proper? -- Dan I do, yeah. At least, the trim operation should be able to pass in its own value for that. I opened a ticket for that at http://tracker.ceph.com/issues/20382. I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE operation to trim a range of keys in a single osd op, instead of generating a different op for each key. I have a PR that does this at https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee that leveldb can process the entire range inside of the suicide timeout. I wonder if that would help. Here's what I've learned today: * two of the 3 relevant OSDs have something screwy with their leveldb. The primary and 3rd replica are ~quick at trimming for only a few hundred keys, whilst the 2nd OSD is very very fast always. * After manually compacting the two slow OSDs, they are fast again for just a few hundred trims. So I'm compacting, trimming, ..., in a loop now. * I moved the omaps to SSDs -- doesn't help. (iostat confirms this is not IO bound). * CPU util on the slow OSDs gets quite high during the slow trimming. * perf top is below [1]. leveldb::Block::Iter::Prev and leveldb::InternalKeyComparator::Compare are notable. * The always fast OSD shows no leveldb functions in perf top while trimming. I've tried bigger leveldb cache and block sizes, compression on and off, and played with the bloom size up to 14 bits -- none of these changes make any difference. At this point I'm not confident this trimming will ever complete -- there are ~20 million records to remove at maybe 1Hz. How about I just delete the meta.log object? Would this use a different, perhaps quicker, code path to remove those omap keys? Thanks! Dan [1] 4.92% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023e8d 4.47% libc-2.17.so [.] __memcmp_sse4_1 4.13% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x000273bb 3.81% libleveldb.so.1.0.7 [.] leveldb::Block::Iter::Prev 3.07% libc-2.17.so [.] __memcpy_ssse3_back 2.84% [kernel] [k] port_inb 2.77% libstdc++.so.6.0.19 [.] std::string::_M_mutate 2.75% libstdc++.so.6.0.19 [.] std::string::append 2.53% libleveldb.so.1.0.7 [.] leveldb::InternalKeyComparator::Compare 1.32% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023e77 0.85% [kernel] [k] _raw_spin_lock 0.80% libleveldb.so.1.0.7 [.] leveldb::Block::Iter::Next 0.77% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023a05 0.67% libleveldb.so.1.0.7 [.] leveldb::MemTable::KeyComparator::operator() 0.61% libtcmalloc.so.4.2.6;5873e42b (deleted) [.] 0x00023a09 0.58% libleveldb.so.1.0.7 [.] leveldb::MemTableIterator::Prev 0.51% [kernel] [k] __schedule 0.48% libruby.so.2.1.0 [.] ruby_yyparse Hi Dan, Removing an object will try to delete all of its keys at once, which should be much faster. It's also very likely to hit your suicide timeout, so you'll have to keep retrying until it stops killing your osd. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
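If deleting the log object outright is the chosen route, it would look something like the following (the pool name is a guess -- check the zone's log_pool setting first -- and, per Casey's warning, expect the op to block and possibly need retrying through suicide-timeout restarts):

# confirm where the object lives, then remove it along with all of its omap keys:
ceph osd map default.rgw.log meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54
rados -p default.rgw.log rm meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54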
Re: [ceph-users] red IO hang (was disk timeouts in libvirt/qemu VMs...)
After some testing (doing heavy IO on a rdb-based VM with hung_task_timeout_secs=1 while manually requesting deep-scrubs on the underlying pgs (as determined via rados ls->osdmaptool), I don’t think scrubbing is the cause. At least, I can’t make it happen this way… although I can’t *always* make it happen whileeither. I will continue testing as above, but suggestions on improved test methodology are welcome. We occasionally see blocked requests in a running log (ceph –w > log), but not correlated with hung VM IO. Scrubbing doesn’t seem correlated either. -- Eric On 6/21/17, 2:55 PM, "Jason Dillaman" wrote: Do your VMs or OSDs show blocked requests? If you disable scrub or restart the blocked OSD, does the issue go away? If yes, it most likely is this issue [1]. [1] http://tracker.ceph.com/issues/20041 On Wed, Jun 21, 2017 at 3:33 PM, Hall, Eric wrote: > The VMs are using stock Ubuntu14/16 images so yes, there is the default “/sbin/fstrim –all” in /etc/cron.weekly/fstrim. > > -- > Eric > > On 6/21/17, 1:58 PM, "Jason Dillaman" wrote: > > Are some or many of your VMs issuing periodic fstrims to discard > unused extents? > > On Wed, Jun 21, 2017 at 2:36 PM, Hall, Eric wrote: > > After following/changing all suggested items (turning off exclusive-lock > > (and associated object-map and fast-diff), changing host cache behavior, > > etc.) this is still a blocking issue for many uses of our OpenStack/Ceph > > installation. > > > > > > > > We have upgraded Ceph to 10.2.7, are running 4.4.0-62 or later kernels on > > all storage, compute hosts, and VMs, with libvirt 1.3.1 on compute hosts. > > Have also learned quite a bit about producing debug logs. ;) > > > > > > > > I’ve followed the related threads since March with bated breath, but still > > find no resolution. > > > > > > > > Previously, I got pulled away before I could produce/report discussed debug > > info, but am back on the case now. Please let me know how I can help > > diagnose and resolve this problem. > > > > > > > > Any assistance appreciated, > > > > -- > > > > Eric > > > > > > > > On 3/28/17, 3:05 AM, "Marius Vaitiekunas" > > wrote: > > > > > > > > > > > > > > > > On Mon, Mar 27, 2017 at 11:17 PM, Peter Maloney > > wrote: > > > > I can't guarantee it's the same as my issue, but from that it sounds the > > same. > > > > Jewel 10.2.4, 10.2.5 tested > > hypervisors are proxmox qemu-kvm, using librbd > > 3 ceph nodes with mon+osd on each > > > > -faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops > > and bw limits on client side, jumbo frames, etc. all improve/smooth out > > performance and mitigate the hangs, but don't prevent it. > > -hangs are usually associated with blocked requests (I set the complaint > > time to 5s to see them) > > -hangs are very easily caused by rbd snapshot + rbd export-diff to do > > incremental backup (one snap persistent, plus one more during backup) > > -when qemu VM io hangs, I have to kill -9 the qemu process for it to > > stop. Some broken VMs don't appear to be hung until I try to live > > migrate them (live migrating all VMs helped test solutions) > > > > Finally I have a workaround... disable exclusive-lock, object-map, and > > fast-diff rbd features (and restart clients via live migrate). > > (object-map and fast-diff appear to have no effect on dif or export-diff > > ... so I don't miss them). I'll file a bug at some point (after I move > > all VMs back and see if it is still stable). And one other user on IRC > > said this solved the same problem (also using rbd snapshots). 
> > > > And strangely, they don't seem to hang if I put back those features, > > until a few days later (making testing much less easy...but now I'm very > > sure removing them prevents the issue) > > > > I hope this works for you (and maybe gets some attention from devs too), > > so you don't waste months like me. > > > > > > On 03/27/17 19:31, Hall, Eric wrote: > >> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel), > >> using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and > >> ceph hosts, we occasionally see hung processes (usually during boot, but > >> otherwise as well), with errors reported in the ins
[ceph-users] Squeezing Performance of CEPH
Hi everybody, I want to squeeze all the performance out of Ceph (we are using Jewel 10.2.7). We are benchmarking a test environment with 2 nodes having the same configuration:

* CentOS 7.3
* 24 CPUs (12 real cores with hyper-threading)
* 32GB of RAM
* 2x 100Gbit/s ethernet cards
* 2x SSD disks in RAID, dedicated to the OS
* 4x SATA 6Gbit/s SSD disks for OSDs

We are already expecting the following bottlenecks:

* [ SATA speed x n° disks ] = 24Gbit/s
* [ Network speed x n° bonded cards ] = 200Gbit/s

So the minimum between them is 24Gbit/s per node (not taking into account protocol overhead). 24Gbit/s per node x2 = 48Gbit/s of maximum hypothetical theoretical gross speed.

Here are the tests:

/// IPERF2 ///
Tests are quite good, scoring 88% of the bottleneck. Note: iperf2 can use only 1 connection from a bond (it's a well-known issue).

[ ID] Interval Transfer Bandwidth
[ 12] 0.0-10.0 sec 9.55 GBytes 8.21 Gbits/sec
[ 3] 0.0-10.0 sec 10.3 GBytes 8.81 Gbits/sec
[ 5] 0.0-10.0 sec 9.54 GBytes 8.19 Gbits/sec
[ 7] 0.0-10.0 sec 9.52 GBytes 8.18 Gbits/sec
[ 6] 0.0-10.0 sec 9.96 GBytes 8.56 Gbits/sec
[ 8] 0.0-10.0 sec 12.1 GBytes 10.4 Gbits/sec
[ 9] 0.0-10.0 sec 12.3 GBytes 10.6 Gbits/sec
[ 10] 0.0-10.0 sec 10.2 GBytes 8.80 Gbits/sec
[ 11] 0.0-10.0 sec 9.34 GBytes 8.02 Gbits/sec
[ 4] 0.0-10.0 sec 10.3 GBytes 8.82 Gbits/sec
[SUM] 0.0-10.0 sec 103 GBytes 88.6 Gbits/sec

/// RADOS BENCH ///
Taking into consideration the maximum hypothetical speed of 48Gbit/s (due to the disk bottleneck), the results are not good enough:

* Average speed in write is almost 5-7Gbit/sec (12.5% of the max hypothetical speed)
* Average speed in sequential read is almost 24Gbit/sec (50% of the max hypothetical speed)
* Average speed in random read is almost 27Gbit/sec (56.25% of the max hypothetical speed)

Here are the reports.

Write:
# rados bench -p scbench 10 write --no-cleanup
Total time run: 10.229369
Total writes made: 1538
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 601.406
Stddev Bandwidth: 357.012
Max bandwidth (MB/sec): 1080
Min bandwidth (MB/sec): 204
Average IOPS: 150
Stddev IOPS: 89
Max IOPS: 270
Min IOPS: 51
Average Latency(s): 0.106218
Stddev Latency(s): 0.198735
Max latency(s): 1.87401
Min latency(s): 0.0225438

Sequential read:
# rados bench -p scbench 10 seq
Total time run: 2.054359
Total reads made: 1538
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2994.61
Average IOPS: 748
Stddev IOPS: 67
Max IOPS: 802
Min IOPS: 707
Average Latency(s): 0.0202177
Max latency(s): 0.223319
Min latency(s): 0.00589238

Random read:
# rados bench -p scbench 10 rand
Total time run: 10.036816
Total reads made: 8375
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 3337.71
Average IOPS: 834
Stddev IOPS: 78
Max IOPS: 927
Min IOPS: 741
Average Latency(s): 0.0182707
Max latency(s): 0.257397
Min latency(s): 0.00469212

It seems like there is a bottleneck somewhere that we are underestimating. Can you help me find it? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
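One more variable worth ruling out before blaming the hardware: rados bench issues 16 concurrent 4MB ops by default, which a single client may not be able to push hard enough to saturate an all-SSD pool. Something like the following (the -t value is only an example), ideally from more than one client node in parallel, usually gives a clearer picture:

rados bench -p scbench 30 write -t 64 --no-cleanup
rados bench -p scbench 30 seq -t 64
rados bench -p scbench 30 rand -t 64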
[ceph-users] Obtaining perf counters/stats from krbd client
Hi Ceph users, We are currently using the Ceph kernel client module (krbd) in our deployment and we were looking to determine if there are ways by which we can obtain perf counters, log dumps, etc from such a deployment. Has anybody been able to obtain such stats? It looks like the libvirt interface allows for an admin socket to be configured on the client ( http://docs.ceph.com/docs/master/rbd/libvirt/#configuring-ceph) into which you can issue commands, but is this specific to the librbd implementation? Thanks, Prashant -- Prashant Murthy Sr Director, Software Engineering | Salesforce Mobile: 919-961-3041 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
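There is no admin socket for the kernel client as far as I know, but some state is exposed through debugfs and sysfs (paths assume debugfs is mounted at /sys/kernel/debug and may vary with kernel version -- treat this as a sketch, not a full perf-counter equivalent):

cat /sys/kernel/debug/ceph/*/osdc     # in-flight OSD requests for kernel ceph clients (krbd/kcephfs)
cat /sys/kernel/debug/ceph/*/monc     # monitor session state
ls /sys/bus/rbd/devices/0/            # per-mapped-image attributes (pool, image name, size, ...)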
Re: [ceph-users] Squeezing Performance of CEPH
Hello Massimiliano, Based on the configuration below, it appears you have 8 SSDs total (2 nodes with 4 SSDs each)? I'm going to assume you have 3x replication and are you using filestore, so in reality you are writing 3 copies and doing full data journaling for each copy, for 6x writes per client write. Taking this into account, your per-SSD throughput should be somewhere around: Sequential write: ~600 * 3 (copies) * 2 (journal write per copy) / 8 (ssds) = ~450MB/s Sequential read ~3000 / 8 (ssds) = ~375MB/s Random read ~3337 / 8 (ssds) = ~417MB/s These numbers are pretty reasonable for SATA based SSDs, though the read throughput is a little low. You didn't include the model of SSD, but if you look at Intel's DC S3700 which is a fairly popular SSD for ceph: https://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3700-spec.html Sequential read is up to ~500MB/s and Sequential write speeds up to 460MB/s. Not too far off from what you are seeing. You might try playing with readahead on the OSD devices to see if that improves things at all. Still, unless I've missed something these numbers aren't terrible. Mark On 06/22/2017 12:19 PM, Massimiliano Cuttini wrote: Hi everybody, I want to squeeze all the performance of CEPH (we are using jewel 10.2.7). We are testing a testing environment with 2 nodes having the same configuration: * CentOS 7.3 * 24 CPUs (12 for real in hyper threading) * 32Gb of RAM * 2x 100Gbit/s ethernet cards * 2x OS dedicated in raid SSD Disks * 4x OSD SSD Disks SATA 6Gbit/s We are already expecting the following bottlenecks: * [ SATA speed x n° disks ] = 24Gbit/s * [ Networks speed x n° bonded cards ] = 200Gbit/s So the minimum between them is 24 Gbit/s per node (not taking in account protocol loss). 24Gbit/s per node x2 = 48Gbit/s of maximum hypotetical theorical gross speed. Here are the tests: ///IPERF2/// Tests are quite good scoring 88% of the bottleneck. Note: iperf2 can use only 1 connection from a bond.(it's a well know issue). [ ID] Interval Transfer Bandwidth [ 12] 0.0-10.0 sec 9.55 GBytes 8.21 Gbits/sec [ 3] 0.0-10.0 sec 10.3 GBytes 8.81 Gbits/sec [ 5] 0.0-10.0 sec 9.54 GBytes 8.19 Gbits/sec [ 7] 0.0-10.0 sec 9.52 GBytes 8.18 Gbits/sec [ 6] 0.0-10.0 sec 9.96 GBytes 8.56 Gbits/sec [ 8] 0.0-10.0 sec 12.1 GBytes 10.4 Gbits/sec [ 9] 0.0-10.0 sec 12.3 GBytes 10.6 Gbits/sec [ 10] 0.0-10.0 sec 10.2 GBytes 8.80 Gbits/sec [ 11] 0.0-10.0 sec 9.34 GBytes 8.02 Gbits/sec [ 4] 0.0-10.0 sec 10.3 GBytes 8.82 Gbits/sec [SUM] 0.0-10.0 sec 103 GBytes 88.6 Gbits/sec ///RADOS BENCH Take in consideration the maximum hypotetical speed of 48Gbit/s tests (due to disks bottleneck), tests are not good enought. * Average MB/s in write is almost 5-7Gbit/sec (12,5% of the mhs) * Average MB/s in seq read is almost 24Gbit/sec (50% of the mhs) * Average MB/s in random read is almost 27Gbit/se (56,25% of the mhs). Here are the reports. 
Write: # rados bench -p scbench 10 write --no-cleanup Total time run: 10.229369 Total writes made: 1538 Write size: 4194304 Object size:4194304 Bandwidth (MB/sec): 601.406 Stddev Bandwidth: 357.012 Max bandwidth (MB/sec): 1080 Min bandwidth (MB/sec): 204 Average IOPS: 150 Stddev IOPS:89 Max IOPS: 270 Min IOPS: 51 Average Latency(s): 0.106218 Stddev Latency(s): 0.198735 Max latency(s): 1.87401 Min latency(s): 0.0225438 sequential read: # rados bench -p scbench 10 seq Total time run: 2.054359 Total reads made: 1538 Read size:4194304 Object size: 4194304 Bandwidth (MB/sec): 2994.61 Average IOPS 748 Stddev IOPS: 67 Max IOPS: 802 Min IOPS: 707 Average Latency(s): 0.0202177 Max latency(s): 0.223319 Min latency(s): 0.00589238 random read: # rados bench -p scbench 10 rand Total time run: 10.036816 Total reads made: 8375 Read size:4194304 Object size: 4194304 Bandwidth (MB/sec): 3337.71 Average IOPS: 834 Stddev IOPS: 78 Max IOPS: 927 Min IOPS: 741 Average Latency(s): 0.0182707 Max latency(s): 0.257397 Min latency(s): 0.00469212 // It's seems like that there are some bottleneck somewhere that we are understimating. Can you help me to found it? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://
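For the readahead suggestion above, the knob in question is per block device (the device name and the 4096KB value are just placeholders to experiment with):

cat /sys/block/sdb/queue/read_ahead_kb        # current value in KB
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# or equivalently, in 512-byte sectors:
blockdev --setra 8192 /dev/sdb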
Re: [ceph-users] Squeezing Performance of CEPH
Hello, Also as Mark put, one minute your testing bandwidth capacity, next minute your testing disk capacity. No way is a small set of SSD’s going to be able to max your current bandwidth, even if you removed the CEPH / Journal overhead. I would say the speeds you are getting are what you should expect , see with many other setups. ,Ashley Sent from my iPhone On 23 Jun 2017, at 12:42 AM, Mark Nelson mailto:mnel...@redhat.com>> wrote: Hello Massimiliano, Based on the configuration below, it appears you have 8 SSDs total (2 nodes with 4 SSDs each)? I'm going to assume you have 3x replication and are you using filestore, so in reality you are writing 3 copies and doing full data journaling for each copy, for 6x writes per client write. Taking this into account, your per-SSD throughput should be somewhere around: Sequential write: ~600 * 3 (copies) * 2 (journal write per copy) / 8 (ssds) = ~450MB/s Sequential read ~3000 / 8 (ssds) = ~375MB/s Random read ~3337 / 8 (ssds) = ~417MB/s These numbers are pretty reasonable for SATA based SSDs, though the read throughput is a little low. You didn't include the model of SSD, but if you look at Intel's DC S3700 which is a fairly popular SSD for ceph: https://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3700-spec.html Sequential read is up to ~500MB/s and Sequential write speeds up to 460MB/s. Not too far off from what you are seeing. You might try playing with readahead on the OSD devices to see if that improves things at all. Still, unless I've missed something these numbers aren't terrible. Mark On 06/22/2017 12:19 PM, Massimiliano Cuttini wrote: Hi everybody, I want to squeeze all the performance of CEPH (we are using jewel 10.2.7). We are testing a testing environment with 2 nodes having the same configuration: * CentOS 7.3 * 24 CPUs (12 for real in hyper threading) * 32Gb of RAM * 2x 100Gbit/s ethernet cards * 2x OS dedicated in raid SSD Disks * 4x OSD SSD Disks SATA 6Gbit/s We are already expecting the following bottlenecks: * [ SATA speed x n° disks ] = 24Gbit/s * [ Networks speed x n° bonded cards ] = 200Gbit/s So the minimum between them is 24 Gbit/s per node (not taking in account protocol loss). 24Gbit/s per node x2 = 48Gbit/s of maximum hypotetical theorical gross speed. Here are the tests: ///IPERF2/// Tests are quite good scoring 88% of the bottleneck. Note: iperf2 can use only 1 connection from a bond.(it's a well know issue). [ ID] Interval Transfer Bandwidth [ 12] 0.0-10.0 sec 9.55 GBytes 8.21 Gbits/sec [ 3] 0.0-10.0 sec 10.3 GBytes 8.81 Gbits/sec [ 5] 0.0-10.0 sec 9.54 GBytes 8.19 Gbits/sec [ 7] 0.0-10.0 sec 9.52 GBytes 8.18 Gbits/sec [ 6] 0.0-10.0 sec 9.96 GBytes 8.56 Gbits/sec [ 8] 0.0-10.0 sec 12.1 GBytes 10.4 Gbits/sec [ 9] 0.0-10.0 sec 12.3 GBytes 10.6 Gbits/sec [ 10] 0.0-10.0 sec 10.2 GBytes 8.80 Gbits/sec [ 11] 0.0-10.0 sec 9.34 GBytes 8.02 Gbits/sec [ 4] 0.0-10.0 sec 10.3 GBytes 8.82 Gbits/sec [SUM] 0.0-10.0 sec 103 GBytes 88.6 Gbits/sec ///RADOS BENCH Take in consideration the maximum hypotetical speed of 48Gbit/s tests (due to disks bottleneck), tests are not good enought. * Average MB/s in write is almost 5-7Gbit/sec (12,5% of the mhs) * Average MB/s in seq read is almost 24Gbit/sec (50% of the mhs) * Average MB/s in random read is almost 27Gbit/se (56,25% of the mhs). Here are the reports. 
Write:

# rados bench -p scbench 10 write --no-cleanup
Total time run:         10.229369
Total writes made:      1538
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     601.406
Stddev Bandwidth:       357.012
Max bandwidth (MB/sec): 1080
Min bandwidth (MB/sec): 204
Average IOPS:           150
Stddev IOPS:            89
Max IOPS:               270
Min IOPS:               51
Average Latency(s):     0.106218
Stddev Latency(s):      0.198735
Max latency(s):         1.87401
Min latency(s):         0.0225438

sequential read:

# rados bench -p scbench 10 seq
Total time run:       2.054359
Total reads made:     1538
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2994.61
Average IOPS:         748
Stddev IOPS:          67
Max IOPS:             802
Min IOPS:             707
Average Latency(s):   0.0202177
Max latency(s):       0.223319
Min latency(s):       0.00589238

random read:

# rados bench -p scbench 10 rand
Total time run:       10.036816
Total reads made:     8375
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   3337.71
Average IOPS:         834
Stddev IOPS:          78
Max IOPS:             927
Min IOPS:             741
Average Latency(s):   0.0182707
Max latency(s):       0.257397
Min latency(s):       0.00469212

It seems like there is a bottleneck somewhere that we are underestimating. Can you help me find it?
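To make Mark's arithmetic easy to re-run with different assumptions (the replica count, journal multiplier and SSD count below are parameters, not measured values), a small shell helper:

# per-SSD load implied by a client-visible throughput figure
# usage: perssd <client MB/s> <copies> <writes per copy> <number of SSDs>
perssd() { awk -v c="$1" -v r="$2" -v j="$3" -v n="$4" 'BEGIN { printf "~%.0f MB/s per SSD\n", c * r * j / n }'; }

perssd 600  3 2 8    # sequential write: 3 copies, filestore journal double-write -> ~450 MB/s
perssd 3000 1 1 8    # sequential read                                            -> ~375 MB/s
perssd 3337 1 1 8    # random read                                                -> ~417 MB/s

If the journals were moved off the data SSDs, the write-side multiplier for those disks would drop from 6x to 3x, which is the usual first lever to pull on a setup like this.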
Re: [ceph-users] Squeezing Performance of CEPH
Generally you can measure your bottleneck via a tool like atop/collectl/sysstat and see how busy (i.e. %busy, %util) your resources are: cpu/disks/net. As was pointed out, in your case you have most probably maxed out on your disks. But the above tools should help as you grow and tune your cluster.

Cheers,
Maged Mokhtar
PetaSAN

On 2017-06-22 19:19, Massimiliano Cuttini wrote:

> Hi everybody,
>
> I want to squeeze all the performance of CEPH (we are using jewel 10.2.7).
> We have a test environment with 2 nodes, both with the same configuration:
>
> * CentOS 7.3
> * 24 CPUs (12 real cores with hyper-threading)
> * 32Gb of RAM
> * 2x 100Gbit/s ethernet cards
> * 2x OS-dedicated SSD disks in RAID
> * 4x OSD SSD disks, SATA 6Gbit/s
>
> We are already expecting the following bottlenecks:
>
> * [ SATA speed x n° disks ] = 24Gbit/s
> * [ Network speed x n° bonded cards ] = 200Gbit/s
>
> So the minimum between them is 24Gbit/s per node (not taking into account protocol loss).
>
> 24Gbit/s per node x2 = 48Gbit/s of maximum hypothetical theoretical gross speed.
>
> Here are the tests:
>
> /// IPERF2 ///
>
> Tests are quite good, scoring 88% of the bottleneck.
> Note: iperf2 can use only 1 connection from a bond (it's a well-known issue).
>
>> [ ID] Interval       Transfer     Bandwidth
>> [ 12]  0.0-10.0 sec  9.55 GBytes  8.21 Gbits/sec
>> [  3]  0.0-10.0 sec  10.3 GBytes  8.81 Gbits/sec
>> [  5]  0.0-10.0 sec  9.54 GBytes  8.19 Gbits/sec
>> [  7]  0.0-10.0 sec  9.52 GBytes  8.18 Gbits/sec
>> [  6]  0.0-10.0 sec  9.96 GBytes  8.56 Gbits/sec
>> [  8]  0.0-10.0 sec  12.1 GBytes  10.4 Gbits/sec
>> [  9]  0.0-10.0 sec  12.3 GBytes  10.6 Gbits/sec
>> [ 10]  0.0-10.0 sec  10.2 GBytes  8.80 Gbits/sec
>> [ 11]  0.0-10.0 sec  9.34 GBytes  8.02 Gbits/sec
>> [  4]  0.0-10.0 sec  10.3 GBytes  8.82 Gbits/sec
>> [SUM]  0.0-10.0 sec  103 GBytes   88.6 Gbits/sec
>
> /// RADOS BENCH ///
>
> Taking into consideration the maximum hypothetical speed of 48Gbit/s (due to the disk bottleneck), the tests are not good enough:
>
> * Average MB/s in write is almost 5-7Gbit/sec (12.5% of the mhs)
> * Average MB/s in seq read is almost 24Gbit/sec (50% of the mhs)
> * Average MB/s in random read is almost 27Gbit/sec (56.25% of the mhs)
>
> Here are the reports.
>
> Write:
>
>> # rados bench -p scbench 10 write --no-cleanup
>> Total time run:         10.229369
>> Total writes made:      1538
>> Write size:             4194304
>> Object size:            4194304
>> Bandwidth (MB/sec):     601.406
>> Stddev Bandwidth:       357.012
>> Max bandwidth (MB/sec): 1080
>> Min bandwidth (MB/sec): 204
>> Average IOPS:           150
>> Stddev IOPS:            89
>> Max IOPS:               270
>> Min IOPS:               51
>> Average Latency(s):     0.106218
>> Stddev Latency(s):      0.198735
>> Max latency(s):         1.87401
>> Min latency(s):         0.0225438
>
> sequential read:
>
>> # rados bench -p scbench 10 seq
>> Total time run:       2.054359
>> Total reads made:     1538
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   2994.61
>> Average IOPS:         748
>> Stddev IOPS:          67
>> Max IOPS:             802
>> Min IOPS:             707
>> Average Latency(s):   0.0202177
>> Max latency(s):       0.223319
>> Min latency(s):       0.00589238
>
> random read:
>
>> # rados bench -p scbench 10 rand
>> Total time run:       10.036816
>> Total reads made:     8375
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   3337.71
>> Average IOPS:         834
>> Stddev IOPS:          78
>> Max IOPS:             927
>> Min IOPS:             741
>> Average Latency(s):   0.0182707
>> Max latency(s):       0.257397
>> Min latency(s):       0.00469212
>
> It seems like there is a bottleneck somewhere that we are underestimating.
> Can you help me find it?
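For example, while one of the rados bench runs from this thread is in flight (illustrative invocations; iostat and sar come from the sysstat package, atop is separate):

# on each OSD node, in separate terminals:
iostat -dx 1        # per-disk %util and await
sar -n DEV 1        # per-NIC rx/tx throughput
atop 1              # combined cpu / disk / net busy view

If the OSD SSDs sit near 100% util while the NICs and CPUs stay mostly idle, the disks are the bottleneck, as suspected above.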
Re: [ceph-users] Squeezing Performance of CEPH
On 22/06/2017 19:19, Massimiliano Cuttini wrote:
> We are already expecting the following bottlenecks:
>
> * [ SATA speed x n° disks ] = 24Gbit/s
> * [ Network speed x n° bonded cards ] = 200Gbit/s

6Gbps SATA does not mean you can read 6Gbps from that device.
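A quick way to see what a single OSD SSD really delivers outside of Ceph (read-only; /dev/sdX is a placeholder for one of the OSD disks):

dd if=/dev/sdX of=/dev/null bs=4M count=512 iflag=direct

A 6Gbit/s SATA link tops out at roughly 550-600 MB/s after encoding and protocol overhead, and most SATA SSDs sustain somewhat less than that, so 8 of them are nowhere near 2x 100Gbit/s of network.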
Re: [ceph-users] Mon Create currently at the state of probing
David,

SUCCESS!! Thank you so much!

I rebuilt the node because I could not install Jewel over the remnants of Kraken. So, while I did install Jewel, I am not convinced that was the solution. I did something that I had not tried under the Kraken attempts that solved the problem. For future_me, here was the solution:

Removed all references to r710e from the ceph.conf on the ceph-deploy node, in the original deployment folder home/cephadminaccount/ceph-cluster/ceph.conf.
"ceph-deploy --overwrite-conf config push r710a r710b r710c" etc. to all nodes, including the ceph-deploy node, so it is now in /etc/ceph/ceph.conf.
"ceph-deploy install --release jewel r710e"
"ceph-deploy admin r710e"
"sudo chmod +r /etc/ceph/ceph.client.admin.keyring" run on node r710e
"ceph-deploy mon create r710e"

Node was created but still had the very same probing errors. Ugh.

Then I went to home/cephadminaccount/ceph-cluster/ceph.conf, added r710e back in just the way it was before, and pushed it to all nodes:
"ceph-deploy --overwrite-conf config push r710a r710b r710c" etc.
"sudo reboot" on r710g; don't know if this was necessary.

When it came up, ceph -s was good. Rebooted r710e for good measure. Did not reboot r710f.

I am wondering if I had just pushed the ceph.conf back out in the first place, would it have solved the problem? That is for another day.

-Jim

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Wednesday, June 21, 2017 4:19 PM
To: Jim Forde
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mon Create currently at the state of probing

You can specify an option in ceph-deploy to tell it which release of ceph to install: jewel, kraken, hammer, etc. `ceph-deploy --release jewel` would pin the command to using jewel instead of kraken.

While running a mixed environment is supported, it should always be tested before assuming it will work for you in production. The mons are quick enough to upgrade that I always do them together. Following that, I upgrade half of my OSDs in a test environment and leave it there for a couple of weeks (or until adequate testing is done) before upgrading the remaining OSDs, and again wait until the testing is done. I would probably do the MDS before the OSDs, but I don't usually think about that since I don't have them in a production environment. Lastly I would test upgrading the clients (vm hosts, RGW, kernel clients, etc.) and test this state the most thoroughly. In production I haven't had to worry about an upgrade taking longer than a few hours with over 60 OSD nodes, 5 mons, and a dozen clients. I just don't see a need to run a mixed environment in production, even if it is supported.

Back to your problem with adding in the mon. Do your existing mons know about the third mon, or have you removed it from their running config? It might be worth double-checking their config file and restarting the daemons after you know they will pick up the correct settings. It's hard for me to help with this part as I've been lucky enough not to have any problems with the docs online for this when it's come up. I've replaced 5 mons without any issues. I didn't use ceph-deploy, except to install the packages, though, and did the manual steps for it.

Hopefully adding the mon back on Jewel fixes the issue. That would be the easiest outcome. I don't know that the Ceph team has tested adding upgraded mons to an old quorum.

On Wed, Jun 21, 2017 at 4:52 PM Jim Forde <j...@mninc.net> wrote:

David,
Thanks for the reply.
The scenario: a monitor node fails for whatever reason (bad blocks in the HD, motherboard failure, whatever).
Procedure: remove the monitor from the cluster, replace the hardware, reinstall the OS, and add the monitor back to the cluster.

That is exactly what I did. However, my ceph-deploy node had already been upgraded to Kraken. The goal is not to use this as an upgrade path per se, but to recover from a failed monitor node in a cluster where there is an upgrade in progress.

The upgrade notes for Jewel to Kraken say you may upgrade OSDs, monitors and MDSs in any order. Perhaps I am reading too much into this, but I took it as meaning I could proceed with the upgrade at my leisure, making sure each node is successfully upgraded before proceeding to the next node. The implication is that I can run the cluster with different version daemons (at least during the upgrade process).

So that brings me to the problem at hand. What is the correct procedure for replacing a failed monitor node, especially if the failed monitor is a mon_initial_member? Does it have to be the same version as the other monitors in the cluster?

I do have a public network statement in the ceph.conf file. The monitor r710e is listed as one of the mon_initial_members in ceph.conf with the correct IP address, but the error message is:

"[r710e][WARNIN] r710e is not defined in `mon initial members`"

and also:

"[r710e][WARNIN] monitor r710e does not exist in monmap"

Should I manually inject r710e in the monmap?
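For reference, a sketch of how to check what the surviving quorum actually knows about r710e, plus the manual (non-ceph-deploy) add path from the Ceph docs; hostnames and paths are just the ones used in this thread:

# what the running monitors think the monmap contains (run against a mon in quorum):
ceph mon dump
ceph daemon mon.r710a mon_status        # on r710a itself; prints its monmap and quorum view

# manual add of a new mon, roughly (alternative to "ceph-deploy mon create"):
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i r710e --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
systemctl start ceph-mon@r710e          # unit name assumes a systemd-based Jewel install

Manually injecting an edited monmap (ceph-mon --inject-monmap) is normally only needed when quorum itself is lost; with a healthy quorum, the mkfs path above or ceph-deploy should be enough, provided mon_initial_members and the mon addresses in ceph.conf agree on all nodes.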
[ceph-users] osd down but the service is up
Hi All,

I am currently testing a new ceph cluster with SSD journals.

ceph -v
ceph version 10.2.7
cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 Beta (Maipo)

I followed http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/ to replace the journal drive (for testing). All the other ceph services are running, but osd@0 crashed.

# systemctl -l status ceph-osd@0
● ceph-osd@0.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: activating (auto-restart) (Result: signal) since Thu 2017-06-22 15:44:04 EDT; 1s ago
  Process: 9580 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 9535 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 9580 (code=killed, signal=ABRT)

Jun 22 15:44:04 tinsfsceph01.abc.ca systemd[1]: Unit ceph-osd@0.service entered failed state.
Jun 22 15:44:04 tinsfsceph01.abc.ca systemd[1]: ceph-osd@0.service failed.

Log file shows:

--- begin dump of recent events ---
     0> 2017-06-22 15:45:45.396425 7f4df5030800 -1 *** Caught signal (Aborted) **
 in thread 7f4df5030800 thread_name:ceph-osd

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x91d8ea) [0x561eda3988ea]
 2: (()+0xf5e0) [0x7f4df377d5e0]
 3: (gsignal()+0x37) [0x7f4df1d3c1f7]
 4: (abort()+0x148) [0x7f4df1d3d8e8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x561eda4962e7]
 6: (()+0x30640e) [0x561ed9d8140e]
 7: (FileJournal::~FileJournal()+0x24a) [0x561eda17d7ca]
 8: (JournalingObjectStore::journal_replay(unsigned long)+0xff2) [0x561eda18cc52]
 9: (FileStore::mount()+0x3cd6) [0x561eda163576]
 10: (OSD::init()+0x27d) [0x561ed9e21a1d]
 11: (main()+0x2c55) [0x561ed9d86dc5]
 12: (__libc_start_main()+0xf5) [0x7f4df1d28c05]
 13: (()+0x3561e7) [0x561ed9dd11e7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Any help?

Thanks
Alex
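For reference: the exact steps in that blog post are not reproduced here, but the usual filestore journal-replacement sequence (when the old journal device is still readable) looks roughly like this; osd.0 and the paths below are just this thread's examples:

ceph osd set noout
systemctl stop ceph-osd@0
ceph-osd -i 0 --flush-journal                  # flush outstanding entries from the old journal
# repoint the journal symlink at the new device/partition, e.g.:
#   ln -sf /dev/disk/by-partuuid/<new-partuuid> /var/lib/ceph/osd/ceph-0/journal
#   chown ceph:ceph /var/lib/ceph/osd/ceph-0/journal   # device ownership, since the OSD runs as the ceph user
ceph-osd -i 0 --mkjournal                      # create a fresh journal on the new device
systemctl start ceph-osd@0
ceph osd unset noout

The abort inside FileJournal during journal_replay may simply mean the OSD is still pointed at a journal that no longer matches its filestore (for example, the symlink still referencing the old partition), so checking where /var/lib/ceph/osd/ceph-0/journal points is probably the first thing to do. If the old journal was lost before it could be flushed, rebuilding the OSD is often the safer route.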