Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-18 Thread Brad Hubbard
Well OK now.

Before we go setting off the fire alarms all over town let's work out what is
happening, and why. I spent some time reproducing this, and it is indeed tied to
selinux being (at least) permissive. It does not happen when selinux is
disabled.

If we look at the journalctl output in the OP we see that yum reports ceph-base
installed successfully and it is only after that that ceph daemons start
shutting down. Then yum reports that the ceph-selinux package has been installed
so a closer look at that package appears warranted.

# rpm -q --scripts ceph-selinux|head -28
postinstall scriptlet (using /bin/sh):
# backup file_contexts before update
. /etc/selinux/config
FILE_CONTEXT=/etc/selinux/${SELINUXTYPE}/contexts/files/file_contexts
cp ${FILE_CONTEXT} ${FILE_CONTEXT}.pre

# Install the policy
/usr/sbin/semodule -i /usr/share/selinux/packages/ceph.pp

# Load the policy if SELinux is enabled
if ! /usr/sbin/selinuxenabled; then
# Do not relabel if selinux is not enabled
exit 0
fi

if diff ${FILE_CONTEXT} ${FILE_CONTEXT}.pre > /dev/null 2>&1; then
   # Do not relabel if file contexts did not change
   exit 0
fi

# Check whether the daemons are running
/usr/bin/systemctl status ceph.target > /dev/null 2>&1
STATUS=$?

# Stop the daemons if they were running
if test $STATUS -eq 0; then
/usr/bin/systemctl stop ceph.target > /dev/null 2>&1
fi

Note that if selinux is disabled we do nothing, but if selinux is enabled and
the ceph daemons are running, we stop them. That's this section here:

https://github.com/ceph/ceph/blob/28c8e8953c39893978137285a0577cf8c01ebc19/ceph.spec.in#L1671

Note the same thing will happen if you uninstall that package.

https://github.com/ceph/ceph/blob/28c8e8953c39893978137285a0577cf8c01ebc19/ceph.spec.in#L1740

Now, given this code has been there more or less unaltered for a considerable
amount of time, I'd say it hasn't been *extensively tested* in the wild. It's
likely the solution here is something similar to the
CEPH_AUTO_RESTART_ON_UPGRADE solution, but I'll leave it to those who understand
the selinux implications better than I do to critique it. If everyone's happy
that this is the actual issue we are seeing and that we need a bug opened for
it, I'll open a tracker for it tomorrow and we can start moving towards a
solution.
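
For illustration only, a guarded version of that stop could look something like
the sketch below. It mirrors the existing CEPH_AUTO_RESTART_ON_UPGRADE handling
in /etc/sysconfig/ceph and is only an assumption about the shape of a fix, not
the actual patch:

# sketch: only touch running daemons if the admin has opted in
if [ -f /etc/sysconfig/ceph ]; then
    . /etc/sysconfig/ceph
fi
if [ "X$CEPH_AUTO_RESTART_ON_UPGRADE" != "Xyes" ]; then
    # leave the daemons alone; note this also skips the relabel below
    # until the administrator schedules a restart themselves
    exit 0
fi
/usr/bin/systemctl stop ceph.target > /dev/null 2>&1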

On Sun, Sep 17, 2017 at 9:05 AM, Matthias Ferdinand  
wrote:
>> On Fri, Sep 15, 2017 at 3:49 PM, Gregory Farnum  wrote:
>> > On Fri, Sep 15, 2017 at 3:34 PM David Turner  wrote:
>> >>
>> >> I don't understand a single use case where I want updating my packages
>> >> using yum, apt, etc to restart a ceph daemon.  ESPECIALLY when there are 
>> >> so
>> >> many clusters out there with multiple types of daemons running on the same
>> >> server.
>> >>
>> >> My home setup is 3 nodes each running 3 OSDs, a MON, and an MDS server.
>> >> If upgrading the packages restarts all of those daemons at once, then I'm
>> >> mixing MON versions, OSD versions and MDS versions every time I upgrade my
>> >> cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
>> >> then clients.
>> I think the choice one makes with small cluster is the upgrade is
>> going to be disruptive, but for the large redundant cluster
>> it is better that the upgrade do the *full* job for better user
>
> Hi, if upgrades on small clusters are _supposed_ to be disruptive, that
> should be documented very prominently, including the minimum
> requirements to be met for an update to _not_ be disruptive. Smooth
> upgrade experience is probably more important for small clusters.
> Larger installations will have less of a tendency to colocate different
> daemon types and will have deployment/management tools with all the
> necessary bells and whistles. If a ceph cluster has only a few machines
> that does not always mean it can afford downtime.
>
> On Ubuntu/Debian systems, you could create a script at
> /usr/sbin/policy-rc.d with return code 101 to suppress all
> start/stop/restart actions at install time:
> 
> http://blog.zugschlus.de/archives/974-Debians-Policy-rc.d-infrastructure-explained.html
> Remember to remove it afterwards :-)
>
> Don't know about RPM-based systems.
>
> Regards
> Matthias
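
For reference, the policy-rc.d trick Matthias describes boils down to something
like this (a sketch; remember to remove the file again once the packages are
installed):

cat > /usr/sbin/policy-rc.d <<'EOF'
#!/bin/sh
# 101 = "action forbidden by policy": invoke-rc.d skips all
# start/stop/restart actions while this file exists
exit 101
EOF
chmod +x /usr/sbin/policy-rc.d
# ... install/upgrade the ceph packages here ...
rm /usr/sbin/policy-rc.d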



-- 
Cheers,
Brad


Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-18 Thread Florian Haas
On Mon, Sep 18, 2017 at 8:48 AM, Christian Theune  wrote:
> Hi Josh,
>
>> On Sep 16, 2017, at 3:13 AM, Josh Durgin  wrote:
>>
>> (Sorry for top posting, this email client isn't great at editing)
>
> Thanks for taking the time to respond. :)
>
>> The mitigation strategy I mentioned before of forcing backfill could be 
>> backported to jewel, but I don't think it's a very good option for RBD users 
>> without SSDs.
>
> Interestingly enough, we don’t see this problem on our pure SSD pool.

I think it's been established before that for those at liberty to
clobber the problem with hardware, it's unlikely to be that much of a
hassle. The problem with that is that for most cloud operators,
throwing SSD/NVMe hardware at *everything* is usually not a
cost-effective option.

>> In luminous there is a command (something like 'ceph pg force-recovery') 
>> that you can use to prioritize recovery of particular PGs (and thus rbd 
>> images with some scripting). This would at least let you limit the scope of 
>> affected images. A couple folks from OVH added it for just this purpose.
>
> Uhm. I haven’t measured, but my impression is that for us it’s all over the 
> map anyway. I don’t think we’d have many PGs that have objects of only 
> specific rbd images … why would that happen anyway?
>
>> Neither of these is an ideal workaround, but I haven't thought of a better 
>> one for existing versions.
>
> I’ll discuss more strategies with Florian today, however, a few questions 
> arise:
>
> a) Do you have any ideas whether certain settings (recovery / backfill 
> limits, network / disk / cpu saturation, ceph version) may be contributing in 
> a way that this seems to hurt us more than others?

For Josh's and others' benefit, I think you might want to share how
many nodes you operate, as that would be quite relevant to the
discussion. Generally, the larger the *percentage* of OSDs that
simultaneously recover, the more likely it will actually cause a
problem. If you have, say, 100 OSD nodes with 10 OSDs each, then only
1% of your 1,000 OSDs are affected by the reboot of a node, and the
slow request problem would be unlikely to be extremely disruptive.

But, of course, the issue would still be relevant even with
significantly larger clusters, considering such clusters would be
typically using CRUSH rulesets defining racks, aisles, rooms etc. as
failure domains. And while it's great that the simultaneous failure of
all nodes in a rack does not cause any data loss, nor downtime while
the failure is active, it's rather problematic for it to bring VMs to
a crawl after the failure has been resolved.
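
As an aside, the per-image prioritization Josh mentions above could presumably
be scripted along these lines on Luminous (a rough, untested sketch: it assumes
jq is installed and the image lives in pool "rbd"; note that listing every
object in a large pool is itself expensive):

POOL=rbd
IMAGE=myimage
# find the object prefix for this image, e.g. rbd_data.102a74b0dc51
PREFIX=$(rbd info ${POOL}/${IMAGE} --format json | jq -r .block_name_prefix)
# map each of the image's objects to its PG, then force-recover that PG set
rados -p ${POOL} ls | grep "^${PREFIX}" \
  | while read OBJ; do ceph osd map ${POOL} "${OBJ}" --format json | jq -r .pgid; done \
  | sort -u \
  | xargs ceph pg force-recovery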

>I’m also surprised that a prioritized recovery causes 30-60 seconds of 
> delay for a single IOP. I mean, I understand degraded throughput and latency 
> during recovery, but what gets me are those extremely blocked individual 
> operations.

If I read Josh correctly, that would simply be a result of
"everything" having moved to the front of the queue. It's like having
one priority lane at airport security, and then giving everyone
frequent flyer status.

>After we reviewed others' settings, incl. last year's Cern recommendations, 
> we’ve set the following “interesting” options. Did we maybe unintentionally 
> hit a combination that worsens this behaviour? Could the “backfill scan” and 
> “max chunk” options make this worse?
>
>fd cache size = 2048
>filestore max sync interval = 60 # fsync files every 60s
>filestore op threads = 8  # more threads where needed
>filestore queue max ops = 100 # allow more queued ops
>filestore fiemap = true
>osd backfill scan max = 128
>osd backfill scan min = 32
>osd max backfills = 5
>osd recovery max active = 3
>osd recovery max single start = 1
>osd recovery op priority = 1
>osd recovery threads = 1
>osd recovery max chunk = 1048576
>osd disk threads = 1
>osd disk thread ioprio class = idle
>osd disk thread ioprio priority = 0
>osd snap trim sleep = 0.5 # throttle some long lived OSD ops
>osd op threads = 4  # more threads where needed
>
>The full OSD config is here (for a week from now on):
>http://dpaste.com/35ABA0N
>
> b) I just upgraded to hammer 0.94.10 (+segfault fix) in our development 
> environment and _may_ have seen an improvement on this. Could this be
>
> http://tracker.ceph.com/issues/16128
>
> c) Are all Ceph users just silently happy with this and are we the only ones 
> where this makes us feel uneasy? Or are we the only ones hit by this? (Well, 
> I guess others are. Alibaba seems to have been working on the async + partial 
> recovery, too.)
>
> d) with size=3/min_size=2 we like to perform _quick_ maintenance operations 
> (i.e. a simple host reboot) without evacuating the host. However, with the 
> situation regarding recovery having such a high impact I’m now considering to 
> do just that. Is everyone else just doing host evacuations all the time?
>
> We’ve become edgy about 

Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-18 Thread Christian Theune
Hi,

> On Sep 18, 2017, at 9:51 AM, Florian Haas  wrote:
> 
> For Josh's and others' benefit, I think you might want to share how
> many nodes you operate, as that would be quite relevant to the
> discussion.

Sure. See the OSD tree at the end.

We’re doing the typical SSD/non-SSD pool separation. Currently we effectively 
only use 2 pools: rbd.hdd and rbd.ssd. The ~4TB OSDs in the rbd.hdd pool are 
“capacity endurance” SSDs (Micron S610DC). We have 10 machines at the moment 
with 10 OSDs on average (2 SSD, 1-2 capacity SSD and 6-7 HDDs).

cluster d4b91002-eaf4-11e2-bc7c-020311c1
 health HEALTH_OK
 monmap e43: 5 mons at 
{cartman06=172.22.4.42:6789/0,cartman07=172.22.4.43:6789/0,cartman11=172.22.4.54:6789/0,cartman15=172.22.4.69:6789/0,cartman16=172.22.4.70:6789/0}
election epoch 7278, quorum 0,1,2,3,4 
cartman06,cartman07,cartman11,cartman15,cartman16
 osdmap e1360049: 96 osds: 95 up, 95 in
  pgmap v115515846: 6336 pgs, 5 pools, 28786 GB data, 8244 kobjects
92181 GB used, 69333 GB / 157 TB avail
6335 active+clean
   1 active+clean+scrubbing+deep
  client io 379 MB/s rd, 197 MB/s wr, 10313 op/s

ID  WEIGHTTYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-11  20.0 root ssd
-10  20.0 datacenter rzob-ssd
 -9   6.0 rack OB-1.R206.A-5-ssd
 -7   2.0 host cartman06-ssd
 13   1.0 osd.13  up  1.0  1.0
 87   1.0 osd.87  up  0.8  1.0
-20   2.0 host cartman09-ssd
 47   1.0 osd.47  up  1.0  1.0
 24   1.0 osd.24  up  1.0  1.0
-12   2.0 host cartman11-ssd
 43   1.0 osd.43  up  1.0  1.0
 57   1.0 osd.57  up  1.0  1.0
-16   6.0 rack OB-1.R206.D-5-ssd
-17   2.0 host cartman07-ssd
 44   1.0 osd.44  up  0.95000  1.0
 56   1.0 osd.56  up  1.0  1.0
-18   2.0 host cartman08-ssd
 45   1.0 osd.45  up  1.0  1.0
  6   1.0 osd.6   up  1.0  1.0
-19   2.0 host cartman10-ssd
 46   1.0 osd.46  up  1.0  1.0
 27   1.0 osd.27  up  1.0  1.0
-27   8.0 rack OB-1.R206.A-6-ssd
-26   2.0 host cartman16-ssd
 75   1.0 osd.75  up  1.0  1.0
 71   1.0 osd.71  up  1.0  1.0
-28   2.0 host cartman15-ssd
 83   1.0 osd.83  up  1.0  1.0
 80   1.0 osd.80  up  1.0  1.0
-29   2.0 host cartman18-ssd
 62   1.0 osd.62  up  1.0  1.0
 69   1.0 osd.69  up  1.0  1.0
-30   2.0 host cartman17-ssd
 50   1.0 osd.50  up  1.0  0.5
 54   1.0 osd.54  up  1.0  1.0
 -1 143.81467 root default
 -4 143.81467 datacenter rzob
 -3  34.28958 rack OB-1.R206.A-5
 -8  12.58118 host cartman06
  0   1.81799 osd.0   up  1.00
  3   1.81799 osd.3   up  1.00
 14   1.81799 osd.14  up  1.00
 37   1.81799 osd.37  up  1.00
 38   1.81799 osd.38  up  1.00
 86   3.49121 osd.86  up  1.0  1.0
-15  14.39722 host cartman09
 21   1.81799 osd.21  up  1.00
  2   1.81799 osd.2   up  1.00
 25   1.81799 osd.25  up  1.00
 39   1.81799 osd.39  up  1.00
 40   1.81799 osd.40  up  1.00
 41   1.81799 osd.41  up  1.00
 60   3.48926 osd.60  up  1.0  1.0
 -5   7.31119 host cartman11
  5   0.54599 osd.5   up  1.0  0.00999
 26   0.54599 osd.26  up  1.0  0.00999
  8   0.54599 osd.8   up  1.0  0.00999
  9   0.5459

[ceph-users] Ceph 12.2.0 and replica count

2017-09-18 Thread Max Krasilnikov
Hello!

In the days of Hammer it was necessary to have 3 replicas of data to avoid
situations with non-identical data on different OSDs. Now we have full data and
metadata checksumming. So, is it still necessary to have 3 replicas? Does
checksumming free us from the requirement of 3 replicas?

Thanks a lot!


Re: [ceph-users] Ceph 12.2.0 and replica count

2017-09-18 Thread Wido den Hollander

> Op 18 september 2017 om 10:14 schreef Max Krasilnikov :
> 
> 
> Hello!
> 
> In the days of Hammer it was necessary to have 3 replicas of data to avoid
> situations with non-identical data on different OSDs. Now we have full data and
> metadata checksumming. So, is it still necessary to have 3 replicas? Does
> checksumming free us from the requirement of 3 replicas?
> 

No! You still need 3 replicas of your data. Checksumming doesn't help against a 
disk failure, so the recommendation of 3x replication is still valid.

Don't use 2x replication, please. I've seen too much data get lost because of 
people running 2x replication :(
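
For completeness, the per-pool knobs in question (replace <pool> with your pool name):

ceph osd pool set <pool> size 3       # keep three copies of every object
ceph osd pool set <pool> min_size 2   # stop serving I/O when fewer than two copies are available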

Wido

> Thanks a lot!


[ceph-users] Help change civetweb front port error: Permission denied

2017-09-18 Thread 谭林江
Hi


I created a gateway node and changed its frontend to rgw_frontends = "civetweb 
port=80". When I run it, it responds with this error:

2017-09-18 04:25:16.967378 7f2dd72e08c0  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process radosgw, pid 3151
2017-09-18 04:25:17.025703 7f2dd72e08c0  0 framework: civetweb
2017-09-18 04:25:17.025712 7f2dd72e08c0  0 framework conf key: port, val: 80
2017-09-18 04:25:17.025716 7f2dd72e08c0  0 starting handler: civetweb
2017-09-18 04:25:17.025943 7f2dd72e08c0  0 civetweb: 0x55ac3b9bab20: 
set_ports_option: cannot bind to 80: 13 (Permission denied)
2017-09-18 04:25:17.032177 7f2db4ff9700 -1 failed to list objects pool_iterate 
returned r=-2
2017-09-18 04:25:17.032183 7f2db4ff9700  0 ERROR: lists_keys_next(): ret=-2
2017-09-18 04:25:17.032186 7f2db4ff9700  0 ERROR: sync_all_users() returned 
ret=-2




[ceph-users] [RGW] SignatureDoesNotMatch using curl

2017-09-18 Thread junho_k...@tmax.co.kr
I’m trying to use Ceph Object Storage from the CLI.
I used curl to make a request to the RGW the S3 way.

When I use a Python library (boto), everything works fine, but when I 
try to make the same request using curl, I always get the error 
“SignatureDoesNotMatch”.
I don’t know what is going wrong.

Here is my script when I tried to make a request using curl
---
#!/bin/bash

resource="/my-new-bucket/"
dateValue=`date -Ru`
S3KEY="MY_KEY"
S3SECRET="MY_SECRET_KEY"
stringToSign="GET\n\n${dateValue}\n${resource}"
signature=`echo -en ${stringToSign} | openssl sha1 -hmac ${S3SECRET} -binary | 
base64`

curl -X GET \
 -H "authorization: AWS ${S3KEY}:${signature}"\
 -H "date: ${dateValue}"\
 -H "host: 10.0.2.15:7480"\
 http://10.0.2.15:7480/my-new-bucket --verbose



The result 

SignatureDoesNotMatchtx00019-0059bf7de0-5e25-default5e25-default-default


Ceph log in /var/log/ceph/ceph-client.rgw.node0.log is

2017-09-18 16:51:50.922935 7fc996fa5700  1 == starting new request 
req=0x7fc996f9f7e0 =
2017-09-18 16:51:50.923135 7fc996fa5700  1 == req done req=0x7fc996f9f7e0 
op status=0 http_status=403 ==
2017-09-18 16:51:50.923156 7fc996fa5700  1 civetweb: 0x7fc9cc00d0c0: 10.0.2.15 
- - [18/Sep/2017:16:51:50 +0900] "GET /my-new-bucket HTTP/1.1" 403 0 - 
curl/7.47.0
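
For what it's worth, the AWS v2 string-to-sign has a Content-Type field between
Content-MD5 and Date, and the canonicalized resource has to match the path you
actually request. A variant along these lines (a sketch, untested; keys and host
are placeholders as above) is closer to what the S3 docs describe:

resource="/my-new-bucket/"
dateValue=`date -Ru`
S3KEY="MY_KEY"
S3SECRET="MY_SECRET_KEY"
# verb, Content-MD5 (empty), Content-Type (empty), Date, CanonicalizedResource
stringToSign="GET\n\n\n${dateValue}\n${resource}"
signature=`echo -en "${stringToSign}" | openssl sha1 -hmac "${S3SECRET}" -binary | base64`

# request the same path that was signed (trailing slash included)
curl -X GET \
 -H "Authorization: AWS ${S3KEY}:${signature}" \
 -H "Date: ${dateValue}" \
 http://10.0.2.15:7480${resource} --verbose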


Many Thanks
-Juno


Re: [ceph-users] Help change civetweb front port error: Permission denied

2017-09-18 Thread Marcus Haarmann
Ceph runs as a non-root user, and non-root users are normally not permitted 
to listen on ports below 1024. 
This is not specific to Ceph. 

You could trick a listener on port 80 with a redirect via iptables, or you could 
proxy the connection 
through an apache/nginx instance. 
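
For example (a sketch: it assumes radosgw is moved to an unprivileged port such
as 8080 via rgw_frontends = "civetweb port=8080", and that the binary lives at
/usr/bin/radosgw):

# redirect incoming port 80 to the unprivileged civetweb port
iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080
# alternatively, allow the radosgw binary itself to bind privileged ports
setcap 'cap_net_bind_service=+ep' /usr/bin/radosgw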

Marcus Haarmann 


From: "谭林江" 
To: "ceph-users" 
Sent: Monday, 18 September 2017 10:32:53 
Subject: [ceph-users] Help change civetweb front port error: Permission denied 

Hi 


I created a gateway node and changed its frontend to rgw_frontends = "civetweb 
port=80". When I run it, it responds with this error: 

2017-09-18 04:25:16.967378 7f2dd72e08c0 0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process radosgw, pid 3151 
2017-09-18 04:25:17.025703 7f2dd72e08c0 0 framework: civetweb 
2017-09-18 04:25:17.025712 7f2dd72e08c0 0 framework conf key: port, val: 80 
2017-09-18 04:25:17.025716 7f2dd72e08c0 0 starting handler: civetweb 
2017-09-18 04:25:17.025943 7f2dd72e08c0 0 civetweb: 0x55ac3b9bab20: 
set_ports_option: cannot bind to 80: 13 (Permission denied) 
2017-09-18 04:25:17.032177 7f2db4ff9700 -1 failed to list objects pool_iterate 
returned r=-2 
2017-09-18 04:25:17.032183 7f2db4ff9700 0 ERROR: lists_keys_next(): ret=-2 
2017-09-18 04:25:17.032186 7f2db4ff9700 0 ERROR: sync_all_users() returned 
ret=-2 





Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-18 Thread Florian Haas
On 09/16/2017 01:36 AM, Gregory Farnum wrote:
> On Mon, Sep 11, 2017 at 1:10 PM Florian Haas wrote:
> 
> On Mon, Sep 11, 2017 at 8:27 PM, Mclean, Patrick wrote:
> >
> > On 2017-09-08 06:06 PM, Gregory Farnum wrote:
> > > On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick wrote:
> > >
> > >> On a related note, we are very curious why the snapshot id is
> > >> incremented when a snapshot is deleted, this creates lots
> > >> phantom entries in the deleted snapshots set. Interleaved
> > >> deletions and creations will cause massive fragmentation in
> > >> the interval set. The only reason we can come up for this
> > >> is to track if anything changed, but I suspect a different
> > >> value that doesn't inject entries in to the interval set might
> > >> be better for this purpose.
> > > Yes, it's because having a sequence number tied in with the
> snapshots
> > > is convenient for doing comparisons. Those aren't leaked snapids
> that
> > > will make holes; when we increment the snapid to delete something we
> > > also stick it in the removed_snaps set. (I suppose if you alternate
> > > deleting a snapshot with adding one that does increase the size
> until
> > > you delete those snapshots; hrmmm. Another thing to avoid doing I
> > > guess.)
> > >
> >
> >
> > Fair enough, though it seems like these limitations of the
> > snapshot system should be documented.
> 
> This is why I was so insistent on numbers, formulae or even
> rules-of-thumb to predict what works and what does not. Greg's "one
> snapshot per RBD per day is probably OK" from a few months ago seemed
> promising, but looking at your situation it's probably not that useful
> a rule.
> 
> 
> > We most likely would
> > have used a completely different strategy if it was documented
> > that certain snapshot creation and removal patterns could
> > cause the cluster to fall over over time.
> 
> I think right now there are probably very few people, if any, who
> could *describe* the pattern that causes this. That complicates
> matters of documentation. :)
> 
> 
> > >>> It might really just be the osdmap update processing -- that would
> > >>> make me happy as it's a much easier problem to resolve. But
> I'm also
> > >>> surprised it's *that* expensive, even at the scales you've
> described.
> 
> ^^ This is what I mean. It's kind of tough to document things if we're
> still in "surprised that this is causing harm" territory.
> 
> 
> > >> That would be nice, but unfortunately all the data is pointing
> > >> to PGPool::Update(),
> > > Yes, that's the OSDMap update processing I referred to. This is good
> > > in terms of our ability to remove it without changing client
> > > interfaces and things.
> >
> > That is good to hear, hopefully this stuff can be improved soon
> > then.
> 
> Greg, can you comment on just how much potential improvement you see
> here? Is it more like "oh we know we're doing this one thing horribly
> inefficiently, but we never thought this would be an issue so we shied
> away from premature optimization, but we can easily reduce 70% CPU
> utilization to 1%" or rather like "we might be able to improve this by
> perhaps 5%, but 100,000 RBDs is too many if you want to be using
> snapshotting at all, for the foreseeable future"?
> 
> 
> I got the chance to discuss this a bit with Patrick at the Open Source
> Summit Wednesday (good to see you!).
> 
> So the idea in the previously-referenced CDM talk essentially involves
> changing the way we distribute snap deletion instructions from a
> "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that
> gets trimmed once the OSDs report to the manager that they've finished
> removing that snapid. This should entirely resolve the CPU burn they're
> seeing during OSDMap processing on the nodes, as it shrinks the
> intersection operation down from "all the snaps" to merely "the snaps
> not-done-deleting".
> 
> The other reason we maintain the full set of deleted snaps is to prevent
> client operations from re-creating deleted snapshots — we filter all
> client IO which includes snaps against the deleted_snaps set in the PG.
> Apparently this is also big enough in RAM to be a real (but much
> smaller) problem.
> 
> Unfortunately eliminating that is a lot harder

Just checking here, for clarification: what is "that" here? Are you
saying that eliminating the full set of deleted snaps is harder than
introducing a deleting_snaps member, or that both are harder than
potential mitigation strategies that were previously discussed in this
thread?

> and a permanent fix will
> involve changing the client protocol in ways nobody has quite figured
> out h

Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-18 Thread Christian Theune
Hi,

> On Sep 18, 2017, at 10:06 AM, Christian Theune  wrote:
> 
> We’re doing the typical SSD/non-SSD pool separation. Currently we effectively 
> only use 2 pools: rbd.hdd and rbd.ssd. The ~4TB OSDs in the rbd.hdd pool are 
> “capacity endurance” SSDs (Micron S610DC). We have 10 machines at the moment 
> with 10 OSDs on average (2 SSD, 1-2 capacity SSD and 6-7 HDDs).

Maybe the previous mail was a bit confusing as to how our pools are structured, so I’ll try 
to clear this up again:

We have a pool “rbd.ssd” which uses the OSDs in “datacenter rzob-ssd”.
This is an all-flash pool using inline journals and runs on Intel DC S3610.

The other pool is “rbd.hdd” which generally uses different disks:

* 2TB 7.2k SATA HDDs, which have a primary affinity of 0
* a couple of 8x600 GB SAS II HGST 3.5” 15k drives, which have a small primary affinity
* 1-2 Micron S610DC 3.8TB with a primary affinity of 1

The HDD pool has grown over time and we’re slowly moving it towards “endurance 
capacity” SSD models (using external journals on Intel NVME). That’s why it’s 
not a single OSD configuration.
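
(In case anyone wants to reproduce the primary-affinity part, it is set per OSD
roughly like this -- a sketch; on pre-Luminous clusters the mons have to allow
it first:)

# allow non-default primary affinity (hammer/jewel)
ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'
# steer primary reads away from a slow spinner
ceph osd primary-affinity osd.0 0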

Hope this helps,
Christian

Liebe Grüße,
Christian Theune

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick





Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-18 Thread Christian Theune
Hi,

and here’s another update which others might find quite interesting.

Florian and I spent some time discussing the issue further, face to face. I had 
one switch that I brought up again (--osd-recovery-start-delay) which I had looked 
at a few weeks ago, but came to the conclusion that its rules are 
underdocumented and that, from appearances, it didn’t seem to do anything.

After stepping through what we learned about prioritized recovery, I brought 
this up again and we started to experiment with this further - and it turns out 
this switch might be quite helpful.

Here’s what I found and maybe others can chime in whether this is going in the 
right direction or not:

1. Setting --osd-recovery-start-delay (e.g. 60 seconds) causes no PG
   to start its recovery when the OSD boots and goes from ‘down/in’ to
   ‘up/in’.

2. Client traffic starts getting processed immediately.

3. Writes from client traffic cause individual objects to require a
   (prioritized) recovery. As no other recovery is happening, everything
   is pretty relaxed and the recovery happens quickly and no slow
   requests appear. (Even when pushing the complaint time to 15s)

4. When an object from a PG gets recovered this way, the PG is marked as
   ‘active+recovering+degraded’. In my test cluster this went up to ~37
   and made me wonder, because it exceeded my ‘--osd-recovery-max-
   active’ setting. Looking at the recovery rate you can see that no
   objects are recovering, and only every now and then an object
   gets recovered.

5. After two minutes, no sudden “everyone else please start recovering”
   thundering happens. I scratch my head. I think.

   My conclusion is, that the “active+recovering+degraded” marker is
   actually just that: a marker. The organic writes now (implicitly)
   signal Ceph that there is a certain amount of organic traffic that
   requires recovery and pushes the recovering PGs beyond the point
   where “real” recovery would start, because my limits are 3 PGs per
   OSD recovering.

6. After a while your “hot set” of objects that get written to (I used
   two VMs with a random-write fio[1]) is recovered by organic means and
   the ‘recovering’ PG count goes down.

7. Once an OSD’s “recovering” count falls below the limit, it begins
   to start “real” recoveries. However, the hot set is now already
   recovered, so slow requests due to prioritized recoveries
   become unlikely.

This actually feels like a quite nice way to handle this. Yes, recovery time 
will be longer, but with a size=3/min_size=2 this still feels fast enough. (In 
my test setup it took about 1h to recover fully from a 30% failure with heavy 
client traffic).

In my experiment I did see slow requests but none of those were ‘waiting for 
missing object’ or 'waiting for degraded object’.

I consider this a success and wonder what you guys think.

Christian

[1] fio --rw=randwrite --name=test --size=50M --direct=1 --bs=4k-128k 
--numjobs=20 --iodepth=64 --group_reporting --runtime=6000 --time_based
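
(For anyone who wants to repeat the experiment, here is a sketch of applying such
a delay at runtime -- note that in hammer/jewel the option is spelled
osd_recovery_delay_start, so check the exact name with
ceph daemon osd.0 config show | grep recovery first:)

# runtime, all OSDs
ceph tell osd.* injectargs '--osd_recovery_delay_start 60'
# or persistently in ceph.conf:
# [osd]
# osd recovery delay start = 60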


Liebe Grüße,
Christian Theune

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick





Re: [ceph-users] RBD: How many snapshots is too many?

2017-09-18 Thread Piotr Dałek

On 17-09-16 01:36 AM, Gregory Farnum wrote:
> I got the chance to discuss this a bit with Patrick at the Open Source
> Summit Wednesday (good to see you!).
>
> So the idea in the previously-referenced CDM talk essentially involves
> changing the way we distribute snap deletion instructions from a
> "deleted_snaps" member in the OSDMap to a "deleting_snaps" member that gets
> trimmed once the OSDs report to the manager that they've finished removing
> that snapid. This should entirely resolve the CPU burn they're seeing during
> OSDMap processing on the nodes, as it shrinks the intersection operation
> down from "all the snaps" to merely "the snaps not-done-deleting".
>
> The other reason we maintain the full set of deleted snaps is to prevent
> client operations from re-creating deleted snapshots — we filter all client
> IO which includes snaps against the deleted_snaps set in the PG. Apparently
> this is also big enough in RAM to be a real (but much smaller) problem.
>
> Unfortunately eliminating that is a lot harder and a permanent fix will
> involve changing the client protocol in ways nobody has quite figured out
> how to do. But Patrick did suggest storing the full set of deleted snaps
> on-disk and only keeping in-memory the set which covers snapids in the range
> we've actually *seen* from clients. I haven't gone through the code but that
> seems broadly feasible — the hard part will be working out the rules when
> you have to go to disk to read a larger part of the deleted_snaps set.
> (Perfectly feasible.)
>
> PRs are of course welcome! ;)


There you go: https://github.com/ceph/ceph/pull/17493

We are hitting the limitations of the current implementation - we have over 9 
thousand removed snap intervals, with the snap counter over 65. In our 
particular case, this shows up as a bad CPU usage spike every few minutes, 
and it's only going to get worse, as we're going to have more snapshots over 
time. My PR halves that spike, and is a change small enough to be backported 
to both Jewel and Luminous without breaking too much at once - not a final 
solution, but it should make life a bit more tolerable until an actual, working 
solution is in place.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/


[ceph-users] bluestore compression statistics

2017-09-18 Thread Peter Gervai
Hello,

Is there any way to get compression stats of compressed bluestore storage?

Thanks,
Peter


Re: [ceph-users] Collectd issues

2017-09-18 Thread Matthew Vernon
On 13/09/17 15:06, Marc Roos wrote:
> 
> 
> Am I the only one having these JSON issues with collectd, did I do 
> something wrong in configuration/upgrade?

I also see these, although my dashboard seems to mostly be working. I'd
be interested in knowing what the problem is!

> Sep 13 15:44:15 c01 collectd: ceph plugin: ds 
> Bluestore.kvFlushLat.avgtime was not properly initialized.
> Sep 13 15:44:15 c01 collectd: ceph plugin: JSON handler failed with 
> status -1.

[ours is slightly different, as we're not running Bluestore]

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


Re: [ceph-users] Collectd issues

2017-09-18 Thread Matthew Vernon
On 18/09/17 16:37, Matthew Vernon wrote:
> On 13/09/17 15:06, Marc Roos wrote:
>>
>>
>> Am I the only one having these JSON issues with collectd, did I do 
>> something wrong in configuration/upgrade?
> 
> I also see these, although my dashboard seems to mostly be working. I'd
> be interested in knowing what the problem is!
> 
>> Sep 13 15:44:15 c01 collectd: ceph plugin: ds 
>> Bluestore.kvFlushLat.avgtime was not properly initialized.
>> Sep 13 15:44:15 c01 collectd: ceph plugin: JSON handler failed with 
>> status -1.
> 
> [ours is slightly different, as we're not running Bluestore]

To add what might be helpful in tracking this down- we're only seeing
this on our nodes which are running the radosgw...

Sep 18 06:26:27 sto-1-2 collectd[423236]: ceph plugin: ds
ThrottleMsgrDispatchThrottlerRadosclient0x75f799e740.getOrFailF was not
properly initialized.
Sep 18 06:26:27 sto-1-2 collectd[423236]: ceph plugin: JSON handler
failed with status -1.
Sep 18 06:26:27 sto-1-2 collectd[423236]: ceph plugin:
cconn_handle_event(name=client.rgw.sto-1-2,i=60,st=4): error 1
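
If it helps with tracking this down: the counters the plugin chokes on can be
inspected straight from the admin socket it reads them from (a sketch; the
socket name below is inferred from the log line above and will differ per host):

ceph daemon /var/run/ceph/ceph-client.rgw.sto-1-2.asok perf schema | less
# collectd's ceph plugin parses this schema plus "perf dump"; comparing the
# counters it reports as "not properly initialized" against the schema output
# may show which counter types it fails on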

Regards,

Matthew



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


[ceph-users] CephFS Segfault 12.2.0

2017-09-18 Thread Derek Yarnell
We have a recent cluster upgraded from Jewel to Luminous.  Today we had
a segmentation fault that left the file system degraded.  Systemd then
decided to restart the daemon over and over, with a different stack trace
(which can be seen after the 10k events in the log file[0]).

We then tried to fail over to the standby, which also kept failing.  After
shutting down both MDSs for some time we brought one back online, and it
seemed the clients had been out long enough to be evicted.
We were able to then reboot the clients (RHEL 7.4) and have them re-connect
to the file system.

2017-09-18 13:27:12.836699 7f9c0ca51700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f9c0ca51700 thread_name:fn_anonymous

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous
(rc)
 1: (()+0x590c21) [0x55a40867ac21]
 2: (()+0xf5e0) [0x7f9c17cb75e0]
 3:
(Server::handle_client_readdir(boost::intrusive_ptr&)+0xbb9)
[0x55a4083f74b9]
 4:
(Server::dispatch_client_request(boost::intrusive_ptr&)+0x9c1)
[0x55a408428591]
 5: (MDSInternalContextBase::complete(int)+0x1eb) [0x55a408605c0b]
 6: (void finish_contexts(CephContext*,
std::list >&, int)+0xac) [0x55a4083c69ac]
 7: (MDSCacheObject::finish_waiting(unsigned long, int)+0x46)
[0x55a40861d856]
 8: (Locker::eval_gather(SimpleLock*, bool, bool*,
std::list >*)+0x10df) [0x55a40851f93f]
 9: (Locker::wrlock_finish(SimpleLock*, MutationImpl*, bool*)+0x310)
[0x55a408521210]
 10: (Locker::_drop_non_rdlocks(MutationImpl*, std::set, std::allocator >*)+0x22c) [0x55a408524adc]
 11: (Locker::drop_non_rdlocks(MutationImpl*, std::set, std::allocator >*)+0x59) [0x55a4085253d9]
 12: (Server::reply_client_request(boost::intrusive_ptr&,
MClientReply*)+0x433) [0x55a4083f21a3]
 13: (Server::respond_to_request(boost::intrusive_ptr&,
int)+0x459) [0x55a4083f2dd9]
 14: (Server::_unlink_local_finish(boost::intrusive_ptr&,
CDentry*, CDentry*, unsigned long)+0x2ab) [0x55a4083fd7fb]
 15: (MDSIOContextBase::complete(int)+0xa4) [0x55a408605d44]
 16: (MDSLogContextBase::complete(int)+0x3c) [0x55a4086060fc]
 17: (Finisher::finisher_thread_entry()+0x198) [0x55a4086ba718]
 18: (()+0x7e25) [0x7f9c17cafe25]
 19: (clone()+0x6d) [0x7f9c16d9234d]


[0] -
https://obj.umiacs.umd.edu/derek_support/mds_20170918/ceph-mds.objmds01.log?Signature=VJB4qL34j5UKM%2BCxeiR8n0JA1gE%3D&Expires=1508357409&AWSAccessKeyId=936291C3OMB2LBD7FLK4

-- 
Derek T. Yarnell
Director of Computing Facilities
University of Maryland
Institute for Advanced Computer Studies


[ceph-users] Rbd resize, refresh rescan

2017-09-18 Thread Marc Roos

Is there something like this scsi rescan for rbd, to rescan the size of the rbd 
device and make it available (while it is being used)?

echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan






Re: [ceph-users] Rbd resize, refresh rescan

2017-09-18 Thread David Turner
I've never needed to do anything other than extend the partition and/or
filesystem when I increased the size of an RBD.  Particularly if I didn't
partition the RBD I only needed to extend the filesystem.
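
Roughly, for the common case (a sketch: it assumes an unpartitioned RBD carrying
an XFS filesystem mounted at /mnt/myimage; ext4 would use resize2fs instead):

rbd resize mypool/myimage --size 20480   # new size, in MB by default
xfs_growfs /mnt/myimage                  # grow the live filesystem to the new device size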

How are you mapping/mounting the RBD?  Is it through a Hypervisor
or just mapped to a server?  What are you seeing to indicate that the RBD
isn't already reflecting the larger size?  Which version of Ceph are you
using?

On Mon, Sep 18, 2017 at 4:31 PM Marc Roos  wrote:

>
> Is there something like this for scsi, to rescan the size of the rbd
> device and make it available? (while it is being used)
>
> echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan
>
>
>
>


Re: [ceph-users] CephFS Segfault 12.2.0

2017-09-18 Thread Patrick Donnelly
Hi Derek,

On Mon, Sep 18, 2017 at 1:30 PM, Derek Yarnell  wrote:
> We have a recent cluster upgraded from Jewel to Luminous.  Today we had
> a segmentation fault that led to file system degraded.  Systemd then
> decided to restart the daemon over and over with a different stack trace
> (can be seen after the 10k events in the log file[0]).
>
> After trying to fail over to the standby which also kept failing.  After
> shutting down both MDSs for some time we brought one back online and
> what seemed to be the clients had been out long enough to be evicted.
> We were able to then reboot clients (RHEL 7.4) and have them re-connect
> to the file system.

This looks like an instance of:

http://tracker.ceph.com/issues/21070

Upcoming v12.2.1 has the fix. Until then, you will need to apply the
patch locally.

-- 
Patrick Donnelly


Re: [ceph-users] Rbd resize, refresh rescan

2017-09-18 Thread Marc Roos
  
Yes, I think you are right. After I saw this in dmesg, I noticed with 
fdisk that the block device was updated:
 rbd21: detected capacity change from 5368709120 to 6442450944

Maybe this also works (I found something that referred to a /sys/class path, 
which I don’t have): echo 1 > /sys/devices/rbd/21/refresh

(I am trying to increase the size online via KVM; virtio disk in Windows 
2016.)
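
For the KVM/virtio case the guest usually only sees the new size once the
hypervisor is told about it, roughly (a sketch; domain and target names are
placeholders):

virsh blockresize win2016 vda 6G   # notify qemu/the guest that the virtio disk grew
# for kernel-mapped rbd devices the sysfs knob is normally
#   echo 1 > /sys/bus/rbd/devices/<id>/refresh
# rather than /sys/devices/rbd/<id>/refresh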


-Original Message-
From: David Turner [mailto:drakonst...@gmail.com] 
Sent: maandag 18 september 2017 22:42
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] Rbd resize, refresh rescan

I've never needed to do anything other than extend the partition and/or 
filesystem when I increased the size of an RBD.  Particularly if I 
didn't partition the RBD I only needed to extend the filesystem.

Which method are you mapping/mounting the RBD?  Is it through a 
Hypervisor or just mapped to a server?  What are you seeing to indicate 
that the RBD isn't already reflecting the larger size?  Which version of 
Ceph are you using?

On Mon, Sep 18, 2017 at 4:31 PM Marc Roos  
wrote:



Is there something like this for scsi, to rescan the size of the 
rbd
device and make it available? (while it is being used)

echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan






Re: [ceph-users] Rbd resize, refresh rescan

2017-09-18 Thread David Turner
Disk Management in Windows should very easily extend a partition to use the
rest of the disk.  You should just right click the partition and select
"Extend Volume" and that's it.  I did it in Windows 10 over the weekend for
a laptop that had been set up weird.

On Mon, Sep 18, 2017 at 4:49 PM Marc Roos  wrote:

>
> Yes, I think you are right, after I saw this in dmesg, I noticed with
> fdisk the block device was updated
>  rbd21: detected capacity change from 5368709120 to 6442450944
>
> Maybe this also works (found a something that refered to a /sys/class,
> which I don’t have) echo 1 > /sys/devices/rbd/21/refresh
>
> (I am trying to online increase the size via kvm, virtio disk in win
> 2016)
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: maandag 18 september 2017 22:42
> To: Marc Roos; ceph-users
> Subject: Re: [ceph-users] Rbd resize, refresh rescan
>
> I've never needed to do anything other than extend the partition and/or
> filesystem when I increased the size of an RBD.  Particularly if I
> didn't partition the RBD I only needed to extend the filesystem.
>
> Which method are you mapping/mounting the RBD?  Is it through a
> Hypervisor or just mapped to a server?  What are you seeing to indicate
> that the RBD isn't already reflecting the larger size?  Which version of
> Ceph are you using?
>
> On Mon, Sep 18, 2017 at 4:31 PM Marc Roos 
> wrote:
>
>
>
> Is there something like this for scsi, to rescan the size of the
> rbd
> device and make it available? (while it is being used)
>
> echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan
>
>
>
>


Re: [ceph-users] Bluestore aio_nr?

2017-09-18 Thread Sage Weil
On Tue, 19 Sep 2017, Xiaoxi Chen wrote:
> Hi,
>  I just hit an OSD that cannot start due to insufficient aio_nr.  Each
> OSD has a separate SSD partition as db.block

Can you paste the message you saw?  I'm not sure which check you mean.

>  Further checking showed 6144 AIO contexts were required per OSD;
>  could anyone explain a little bit where the 6144 aio contexts go?
> 
>  It looks to me like bdev_aio_max_queue_depth defaults to
> 1024, but how do we end up with 6 bdevs to get to 6144?

I'm guessing this is fallout from the kernel's behavior.  When you set up 
an IO queue you specify how many aios you want to allow (that's where we 
use the max_queue_depth value), but the kernel rounds the buffer up to a 
page boundary, so in reality it will use more.  That can make you hit the 
host maximum sooner.
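
For anyone hitting the limit, the host-wide ceiling can be inspected and raised
via sysctl, roughly (a sketch; the value and the sysctl.d file name are only
examples, size them to the number of OSDs per host):

cat /proc/sys/fs/aio-nr        # aio contexts currently allocated on the host
cat /proc/sys/fs/aio-max-nr    # the ceiling the OSD ran into
sysctl -w fs.aio-max-nr=1048576
echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/90-ceph-aio.conf   # persist across reboots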

sage



Re: [ceph-users] Jewel -> Luminous upgrade, package install stopped all daemons

2017-09-18 Thread Brad Hubbard
On Sat, Sep 16, 2017 at 8:34 AM, David Turner  wrote:
> I don't understand a single use case where I want updating my packages using
> yum, apt, etc to restart a ceph daemon.  ESPECIALLY when there are so many
> clusters out there with multiple types of daemons running on the same
> server.
>
> My home setup is 3 nodes each running 3 OSDs, a MON, and an MDS server.  If
> upgrading the packages restarts all of those daemons at once, then I'm
> mixing MON versions, OSD versions and MDS versions every time I upgrade my
> cluster.  It removes my ability to methodically upgrade my MONs, OSDs, and
> then clients.
>
> Now let's take the Luminous upgrade which REQUIRES you to upgrade all of
> your MONs before anything else... I'm screwed.  I literally can't perform
> the upgrade if it's going to restart all of my daemons because it is
> impossible for me to achieve a paxos quorum of MONs running the Luminous
> binaries BEFORE I upgrade any other daemon in the cluster.  The only way to
> achieve that is to stop the entire cluster and every daemon, upgrade all of
> the packages, then start the mons, then start the rest of the cluster
> again... There is no way that is a desired behavior.
>
> All of this is ignoring large clusters using something like Puppet to manage
> their package versions.  I want to just be able to update the ceph version
> and push that out to the cluster.  It will install the new packages to the
> entire cluster and then my automated scripts can perform a rolling restart
> of the cluster upgrading all of the daemons while ensuring that the cluster
> is healthy every step of the way.  I don't want to add in the time of
> installing the packages on every node DURING the upgrade.  I want that done
> before I initiate my script to be in a mixed version state as little as
> possible.
>
> Claiming that having anything other than an issued command to specifically
> restart a Ceph daemon is anything but a bug and undesirable sounds crazy to
> me.  I don't ever want anything restarting my Ceph daemons that is not
> explicitly called to do so.  That just sounds like it's begging to put my
> entire cluster into a world of hurt by accidentally restarting too many
> daemons at the same time making the data in my cluster inaccessible.
>
> I'm used to the Ubuntu side of things.  I've never seen upgrading the Ceph
> packages to ever affect a daemon before.  If that's actually a thing that is
> done on purpose in RHEL and CentOS... good riddance! That's ridiculous!

Deb based releases are unaffected because we don't ship an selinux
package for them and selinux is turned off by default IIUC.

For the rpm story see my previous email in this thread. The problem
has existed for a long time but not many people have hit it. Now we
understand it we can solve it.

>
> On Fri, Sep 15, 2017 at 6:06 PM Vasu Kulkarni  wrote:
>>
>> On Fri, Sep 15, 2017 at 2:10 PM, David Turner 
>> wrote:
>> > I'm glad that worked for you to finish the upgrade.
>> >
>> > He has multiple MONs, but all of them are on nodes with OSDs as well.
>> > When
>> > he updated the packages on the first node, it restarted the MON and all
>> > of
>> > the OSDs.  This is strictly not supported in the Luminous upgrade as the
>> > OSDs can't be running Luminous code until all of the MONs are running
>> > Luminous.  I have never seen updating Ceph packages cause a restart of
>> > the
>> > daemons because you need to schedule the restarts and wait until the
>> > cluster
>> > is back to healthy before restarting the next node to upgrade the
>> > daemons.
>> > If upgrading the packages is causing a restart of the Ceph daemons, it
>> > is
>> > most definitely a bug and needs to be fixed.
>>
>> The current spec file says that unless CEPH_AUTO_RESTART_ON_UPGRADE is
>> set to "yes", it shouldn't restart, but I remember
>> it does restart in my own testing as well. Although I see no harm,
>> since the underlying binaries have changed and for a cluster
>> in redundant mode restarting the service shouldn't cause any issue. But
>> maybe it's still useful for some use cases.
>>
>>
>> >
>> > On Fri, Sep 15, 2017 at 4:48 PM David  wrote:
>> >>
>> >> Happy to report I got everything up to Luminous, used your tip to keep
>> >> the
>> >> OSDs running, David, thanks again for that.
>> >>
>> >> I'd say this is a potential gotcha for people collocating MONs. It
>> >> appears
>> >> that if you're running selinux, even in permissive mode, upgrading the
>> >> ceph-selinux packages forces a restart on all the OSDs. You're left
>> >> with a
>> >> load of OSDs down that you can't start as you don't have a Luminous mon
>> >> quorum yet.
>> >>
>> >>
>> >> On 15 Sep 2017 4:54 p.m., "David"  wrote:
>> >>
>> >> Hi David
>> >>
>> >> I like your thinking! Thanks for the suggestion. I've got a maintenance
>> >> window later to finish the update so will give it a try.
>> >>
>> >>
>> >> On Thu, Sep 14, 2017 at 6:24 PM, David Turner 
>> >> wrote:
>> >>>
>> >>> This isn't a great solution, but some