On 08/30/2014 08:03 AM, pragya jain wrote:
Thanks Greg, Joao and David,
The concept of why an odd number of monitors is preferred is clear to me,
but I am still not clear about the working of the Paxos algorithm:
#1. All changes to any monitor data structure, whether it is the monitor
map, OSD map, PG map, MDS map, or CRUSH map, are made through the Paxos
algorithm; and
#2. The Paxos algorithm also establishes a quorum among the monitors for
the most recent copy of the cluster map.
I am unable to understand how these two things are related and connected.
How does Paxos provide these two functionalities?
As Greg mentioned before, Paxos is a consensus algorithm, so we can
leverage it for anything that requires consensus.
There are two portions of the monitors that use a modified version of
Paxos (but still Paxos in nature): map consensus and elections.
Let me give you a (rough) temporal view of how the monitor applies this
once it starts. Say you have 5 monitors total, 2 of which are down.
1. Alive monitors will "probe" all monitors in the monmap (all 4 other
monitors) -- the probing phase is independent from anything Paxos-related
and is meant to make each monitor aware of which peers are up, alive and
reachable.
2. Once enough monitors to form a quorum (i.e., at least (N+1)/2) reply
to the probes, the monitors will enter the election phase.
3. The election phase is a stripped-down version of Paxos and goes
something like this:
- mon.a has rank 0 and thinks it must be the leader
- mon.b has rank 1 and thinks it must be the leader
- mon.c has rank 2 and thinks it must be the leader
- mon.a receives mon.b's and mon.c's leader proposals and ignores
them, as mon.a has a higher rank than mon.b or mon.c (the lower the
value, the higher the rank)
- mon.c receives mon.a's leader proposal and defers to mon.a (a's
rank 0 outranks c's rank 2).
- mon.c receives mon.b's leader proposal and ignores it, as it has
already deferred to a monitor with a higher rank than mon.b's (a's
rank 0 outranks b's rank 1).
- mon.b receives mon.a's leader proposal and defers to mon.a (a's
rank 0 outranks b's rank 1).
- mon.a got 3 accepts (mon.a's + mon.b's + mon.c's), which is an
absolute majority (3 == (N+1)/2, for N = 5). mon.a declares itself the
leader, and every other monitor declares itself a peon.
The election phase follows Paxos 'prepare', 'promise', 'accept' and
'accepted' phases.
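
To make the rank-based deferral concrete, here is a minimal sketch in
plain Python. It is only an illustration of the description above, not
the monitor's actual code; elect() and its parameters are invented
names for this example:

    # Simplified sketch of the rank-based election (illustration only).
    # A lower rank value means a higher rank, so every monitor ends up
    # deferring to the lowest-ranked monitor it can reach.
    def elect(alive_ranks, total_monitors):
        """Return the winning rank, or None if no quorum is possible."""
        majority = total_monitors // 2 + 1   # (N+1)/2 for odd N
        if len(alive_ranks) < majority:
            return None  # too few monitors answered the probes
        # Every alive monitor proposes itself, then defers to any
        # proposal carrying a lower rank value, so all accepts
        # converge on the lowest alive rank.
        return min(alive_ranks)

    # 5 monitors total (ranks 0..4), two of them down:
    print(elect({0, 1, 2}, 5))  # -> 0, i.e. mon.a becomes the leader
    print(elect({3, 4}, 5))     # -> None: only 2 alive, no quorum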
The same goes for maps. Once the leader has been elected and the peons
established, we can state that a quorum has been reached. The quorum is
the set of all monitors currently participating in consensus, and in
this case the quorum will be { mon.a, mon.b, mon.c }. After a quorum has
been established, the monitors will be able to allow map modifications
as needed.
So say a new OSD is added to the cluster. The osdmap needs to reflect
this. The leader handles the modification, keeps it in a temporary,
to-be-committed osdmap, and proposes the changes to all monitors in the
quorum.
1. Leader proposes the modification to all quorum participants. Each
modification is packed with a version and a proposal number.
2. Each monitor will check whether it has seen said proposal number
before. If not, it will take the proposal from the leader, stash it on
disk in a temporary location, and let the leader know that it has been
accepted. If, on the other hand, the monitor sees that said proposal
number has been proposed before, it will not accept the proposal and
will simply ignore the leader.
3. The leader will collect all 'accepts' from the peons. If (N+1)/2
monitors (counting the leader, which accepts its own proposals by
default) accepted the proposal, the leader will issue a 'commit'
instructing everyone to move the proposal from its temporary location to
its final location (for instance, from 'stashed_proposal' to
'osdmap:version_10'). If not enough monitors accepted the proposal
(i.e., fewer than (N+1)/2), a timeout will eventually be triggered and
the quorum will undergo a new election.
This also follows Paxos 'prepare', 'promise', 'accept' and 'accepted'
phases, even if we cut corners to reduce message passing.
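
Here is a rough sketch of that accept/commit round, again in plain
Python. Peon, leader_propose() and the stashed/committed stores are
invented for this example and only mirror the steps above; the real
logic lives in the monitor's Paxos code:

    # Sketch of the map-update round (illustration only, not Ceph code).
    class Peon:
        def __init__(self):
            self.seen = set()    # proposal numbers seen before
            self.stashed = {}    # temporary, to-be-committed proposals
            self.committed = {}  # final locations, e.g. 'osdmap:version_10'

        def accept(self, pnum, key, value):
            if pnum in self.seen:
                return False     # proposed before: ignore the leader
            self.seen.add(pnum)
            self.stashed[pnum] = (key, value)  # stash in temporary location
            return True          # let the leader know we accepted

        def commit(self, pnum):
            key, value = self.stashed.pop(pnum)
            self.committed[key] = value        # move to final location

    def leader_propose(peons, total_monitors, pnum, key, value):
        majority = total_monitors // 2 + 1
        accepted = [p for p in peons if p.accept(pnum, key, value)]
        # The leader counts itself: it accepts its own proposals by default.
        if 1 + len(accepted) >= majority:
            for p in accepted:
                p.commit(pnum)
            return True
        return False  # too few accepts: a timeout and new election follow

    peons = [Peon(), Peon()]  # quorum of 3 counting the leader, N = 5
    print(leader_propose(peons, 5, pnum=1, key="osdmap:version_10",
                         value="osdmap including the new OSD"))  # -> True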
Hope this helps.
-Joao
Please help to clarify these points.
Regards
Pragya Jain
On Saturday, 30 August 2014 7:29 AM, Joao Eduardo Luis
<joao.l...@inktank.com> wrote:
On 08/29/2014 11:22 PM, J David wrote:
> So an even number N of monitors doesn't give you any better fault
> resilience than N-1 monitors. And the more monitors you have, the
> more traffic there is between them. So when N is even, N monitors
> consume more resources and provide no extra benefit compared to N-1
> monitors.
Except for more copies ;)
But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5.
As long as you don't go with 1 you should be okay. Only go with 1 if
you're truly okay with losing whatever you're storing if that one
monitor's disk is fried.
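
To see the arithmetic behind that, here is a quick plain-Python
illustration (not Ceph code) of how many failures each monitor count
tolerates:

    # Majority quorum is floor(N/2) + 1; the rest may be down.
    for n in range(1, 7):
        majority = n // 2 + 1
        print(f"N={n}: quorum={majority}, tolerates {n - majority} down")
    # N=3 and N=4 both tolerate 1 failure; N=5 and N=6 both tolerate 2:
    # an even count adds copies and traffic, but no extra resilience.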
-Joao
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com