If you have two rooms, I'd go for two OSD nodes in each room, a target
replication level of 3 with a minimum of 1 at the node level, and then
five monitors, with the fifth monitor placed outside either room (the
other MONs can share hosts with the OSD nodes if needed). That gives
you 'safe' replication for OSD/node replacement on failure, some
'shuffle' room for when it's needed, and either room can be down while
the external fifth monitor allows the decisions needed for a single
room to keep operating.
There's no way to do a 3/2 MON split that doesn't risk the two-monitor
side being up yet unable to serve data while the three-monitor side is
down, so you'd want to make it a 2/2/1 split instead.
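For concreteness, here's a rough sketch of what that could look like.
The pool, host and bucket names ("osd-a", "room1", etc.) are only
placeholders, and the exact syntax may need adjusting for your Ceph
version:

  # group each pair of OSD hosts into a room bucket in the CRUSH map
  ceph osd crush add-bucket room1 room
  ceph osd crush add-bucket room2 room
  ceph osd crush move room1 root=default
  ceph osd crush move room2 root=default
  ceph osd crush move osd-a room=room1
  ceph osd crush move osd-b room=room1
  ceph osd crush move osd-c room=room2
  ceph osd crush move osd-d room=room2

  # rule (in the decompiled crushmap) spreading 3 replicas over the
  # two rooms: two copies in one room, one in the other
  rule replicated_rooms {
          ruleset 1
          type replicated
          min_size 1
          max_size 3
          step take default
          step choose firstn 2 type room
          step chooseleaf firstn 2 type host
          step emit
  }

With five monitors laid out 2/2/1, quorum needs a strict majority,
(5/2)+1 = 3, so losing either room still leaves three monitors able to
vote; with a 3/2 split, losing the three-monitor room takes quorum
with it.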
-Michael
On 28/07/2014 18:41, Robert Fantini wrote:
OK, for higher availability 5 nodes is better than 3, so we'll run 5.
However, we want normal operations with just 2 nodes up. Is that
possible?
Eventually 2 nodes will be in the next building 10 feet away, with a
brick wall in between, connected with InfiniBand or better. So one room
can go offline while the other stays up. By the flip of a coin, it's
the 3-node room that will probably go down.
All systems will have dual power supplies connected to different UPSes.
In addition we have a power generator. Later we'll add a second
generator, and then the UPSes will be fed from separate lines attached
to those generators somehow.
Also, of course, we never count on one cluster alone to hold our data.
We have 2 co-locations that backups go to frequently, using zfs
send/receive and/or rsync.
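(For reference, the kind of pipeline we mean is roughly this -- the
dataset and host names are made up:

  zfs snapshot tank/data@nightly
  zfs send tank/data@nightly | ssh backup-colo zfs receive -F backup/data
)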
So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or
is that a bad idea?
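Is it just a matter of something like this (I'm guessing at the
commands and the pool name):

  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 1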
PS: any other ideas on how to increase availability are welcome.
On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer <ch...@gol.com> wrote:
On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> On 07/28/2014 08:49 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> >
> >> Hello Christian,
> >>
> >> Let me supply more info and answer some questions.
> >>
> >> * Our main concern is high availability, not speed.
> >> Our storage requirements are not huge.
> >> However we want good keyboard response 99.99% of the time. We
> >> mostly do data entry and reporting. 20-25 users doing mostly
> >> order and invoice processing, and email.
> >>
> >> * DRBD has been very reliable, but I am the SPOF. Meaning that
> >> when split brain occurs [every 18-24 months] it is me or no one
> >> who knows what to do. Try to explain how to deal with split brain
> >> in advance.... For the future, Ceph looks like it will be easier
> >> to maintain.
> >>
> > The DRBD people would of course tell you to configure things in a
> > way that a split brain can't happen. ^o^
> >
> > Note that given the right circumstances (too many OSDs down, MONs
> > down) Ceph can wind up in a similar state.
>
>
> I am not sure what you mean by Ceph winding up in a similar state.
> If you mean 'split brain' in the usual sense of the term, it does
> not occur in Ceph. If it does, you have surely found a bug and you
> should let us know with lots of CAPS.
>
> What you can incur, though, if you have too many monitors down is
> cluster downtime. The monitors ensure you need a strict majority of
> monitors up in order to operate the cluster, and will not serve
> requests if said majority is not in place. The monitors will only
> serve requests when there is a formed 'quorum', and a quorum is only
> formed by (N/2)+1 monitors, N being the total number of monitors in
> the cluster (via the monitor map -- monmap).
>
> That said, if out of 3 monitors you have 2 monitors down, your
> cluster will cease functioning (no admin commands, no writes or
> reads served). As there is no configuration in which you can have
> two strict majorities, and thus no two partitions of the cluster can
> function at the same time, you do not incur split brain.
>
I wrote "similar state", not "same state".
From a user perspective it is purely semantics how and why your shared
storage has seized up; the end result is the same.
And yes, that MON example was exactly what I was aiming for: your
cluster might still have all the data (another potential failure mode,
of course), but it is inaccessible.
DRBD will see and call it a split brain, Ceph will call it a Paxos
voting failure; it doesn't matter one iota to the poor sod relying on
that particular storage.
My point was and is: when you design a cluster of whatever flavor, make
sure you understand how it can (and WILL) fail, how to prevent that
from happening if at all possible, and how to recover from it if not.
Potentially (hopefully) in the case of Ceph that would just mean
getting a missing MON back up.
But the failed MON might have a corrupted leveldb (it happened to me),
which would put Robert back at square one, as in, a highly qualified
engineer has to deal with the issue.
I.e. somebody who can say "screw this dead MON, let's get a new one in"
and is capable of doing so.
Regards,
Christian
> If you are a creative admin, however, you may be able to force a
> split brain by modifying monmaps. In the end you'd obviously end up
> with two distinct monitor clusters, but if you happened not to inform
> the clients about this, there's a fair chance it would cause havoc
> with unforeseen effects. Then again, this would be the operator's
> fault, not Ceph's -- especially because rewriting monitor maps is not
> trivial enough for someone to do something like this by mistake.
>
> -Joao
>
>
--
Christian Balzer Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com