Thanks, all, for your great explanations.
Regards
Pragya Jain
On Saturday, 30 August 2014 4:51 PM, Joao Eduardo Luis <joao.l...@inktank.com>
wrote:
>
>
>On 08/30/2014 08:03 AM, pragya jain wrote:
>> Thanks Greg, Joao and David,
>>
>> The reason why an odd number of monitors is preferred is clear to me, but
>> I am still not clear about how the Paxos algorithm works:
>>
>> #1. All changes to any monitor data structure, whether it is the monitor
>> map, OSD map, PG map, MDS map or CRUSH map, are made through the Paxos
>> algorithm, and
>> #2. The Paxos algorithm also establishes a quorum among the monitors for
>> the most recent copy of the cluster map.
>>
>> I am unable to understand how these two things are related and
>> connected. How does Paxos provide these two functionalities?
>
>As Greg mentioned before, Paxos is a consensus algorithm, so we can
>leverage Paxos for anything that may require consensus.
>
>We have two portions of the monitors that will use a modified version of
>Paxos (but still Paxos in nature): map consensus and elections.
>
>Let me give you a (rough) temporal view of how the monitor applies this
>once it starts. Say you have 5 monitors total, 2 of which are down.
>
>1. Alive monitors will "probe" all monitors in the monmap (all other 4
>of them) -- the probing phase is independent from anything-Paxos and is
>meant to raise awareness to the monitors that are up, alive and reachable.
>
>2. Once enough monitors to form a quorum (i.e., at least (N+1)/2) reply
>to the probes, the monitors will enter the election phase.
>
>3. The election phase is a stripped-down version of Paxos and goes
>something like this:
> - mon.a has rank 0 and thinks it must be the leader
> - mon.b has rank 1 and thinks it must be the leader
> - mon.c has rank 2 and thinks it must be the leader
>
> - mon.a receives mon.b's and mon.c's leader proposals and ignores
>them as mon.a has a higher rank than mon.b or mon.c (the lower the value,
>the higher the rank)
>
> - mon.c receives mon.a's leader proposal and defers to mon.a (a's
>rank 0 > c's rank 2).
> - mon.c receives mon.b's leader proposal and ignores it, as it has
>already deferred to a monitor with higher rank than b's (a's rank 0 >
>b's rank 1).
>
> - mon.b receives mon.a's leader proposal and defers to mon.a (a's
>rank 0 > b's rank 1).
>
> - mon.a got 3 accepts (mon.a's + mon.b's + mon.c's), which is an
>absolute majority (3 == (N+1)/2, for N = 5). mon.a declares itself the
>leader, every other monitor declares itself a peon.
>
>The election phase follows Paxos 'prepare', 'promise', 'accept' and
>'accepted' phases.
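
The election walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ceph code: the function name elect_leader and its arguments are hypothetical, and the majority is computed as N // 2 + 1, which equals the (N+1)/2 used above whenever N is odd.

```python
def elect_leader(alive_ranks, total_monitors):
    """Return (leader_rank, peon_ranks), or None if no majority is reachable.

    alive_ranks: ranks of the monitors that answered the probes.
    total_monitors: number of monitors listed in the monmap (N).
    """
    majority = total_monitors // 2 + 1  # same as (N+1)/2 for odd N
    if len(alive_ranks) < majority:
        return None  # not enough monitors to start an election

    # Every monitor proposes itself, and each one defers to the best
    # (numerically lowest) rank it hears from.
    leader = min(alive_ranks)
    deferred_to = {rank: leader for rank in alive_ranks}

    # The leader counts the accepts, including its own.
    accepts = sum(1 for target in deferred_to.values() if target == leader)
    if accepts < majority:
        return None

    peons = sorted(rank for rank in alive_ranks if rank != leader)
    return leader, peons


# 5 monitors in the monmap, 2 of them down: mon.a (0), mon.b (1), mon.c (2)
print(elect_leader([0, 1, 2], total_monitors=5))  # (0, [1, 2])
```

With only two of the five monitors alive, the same call returns None because no majority can be formed.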
>
>Same goes for maps. Once the leader has been elected and the peons
>established we can state that a quorum was reached. The quorum is the
>set of all monitors participating in the cluster, and in this case the
>quorum will be { mon.a, mon.b, mon.c }. After a quorum has been
>established the monitors will be able to allow map modifications as needed.
>
>So say a new OSD is added to the cluster. The osdmap needs to reflect
>this. The leader handles the modification and keeps it on a temporary,
>to-be-committed osdmap, and proposes the changes to all monitors in the
>quorum.
>
>1. Leader proposes the modification to all quorum participants. Each
>modification is packed with a version and a proposal number.
>
>2. Each monitor will check if it has seen said proposal number before.
>If not it will take the proposal from the leader, stash it on disk on a
>temporary location, and will let the leader know that it has been accepted.
>If on the other hand the monitor sees that said proposal number has been
>proposed before, then it will not accept the proposal and simply ignore
>the leader.
>
>3. The leader will collect all 'accepts' from peons. If at least (N+1)/2
>monitors (counting the leader, which accepts its own proposals by
>default) accepted the proposal, then the leader will issue a 'commit'
>instructing everyone to move the proposal from its temporary location to
>its final location (for instance, from 'stashed_proposal' to
>'osdmap:version_10'). If by chance not enough monitors accepted the
>proposal (i.e., fewer than (N+1)/2), eventually a timeout will be
>triggered and the quorum will undergo a new election.
>
>This also follows Paxos 'prepare', 'promise', 'accept' and 'accepted'
>phases, even if we cut corners to reduce message passing.
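
The propose / accept / commit steps above could be modelled roughly as follows. This is a toy, in-memory sketch and not how the Ceph monitor is implemented: the Monitor class, its method names and the propose function are all hypothetical, and the timeout and re-election on failure are reduced to returning False.

```python
class Monitor:
    """Toy quorum member; the leader is modelled the same way as a peon."""

    def __init__(self, name):
        self.name = name
        self.seen_proposals = set()  # proposal numbers already seen
        self.stashed = None          # temporary, to-be-committed value
        self.committed = {}          # version -> committed value

    def handle_propose(self, proposal_number, version, value):
        # Ignore a proposal number that has been proposed before.
        if proposal_number in self.seen_proposals:
            return False
        self.seen_proposals.add(proposal_number)
        self.stashed = (version, value)  # "stash it on disk"
        return True

    def handle_commit(self):
        # Move the stashed proposal to its final location.
        if self.stashed is not None:
            version, value = self.stashed
            self.committed[version] = value
            self.stashed = None


def propose(quorum, total_monitors, proposal_number, version, value):
    """Leader side: returns True if the value was committed."""
    majority = total_monitors // 2 + 1  # same as (N+1)/2 for odd N
    accepts = sum(
        1 for mon in quorum
        if mon.handle_propose(proposal_number, version, value)
    )
    if accepts < majority:
        return False  # in the real system: timeout, then a new election
    for mon in quorum:
        mon.handle_commit()
    return True


quorum = [Monitor("mon.a"), Monitor("mon.b"), Monitor("mon.c")]
ok = propose(quorum, total_monitors=5, proposal_number=1,
             version=10, value="osdmap with the new OSD")
print(ok, quorum[1].committed)
```

Re-sending proposal number 1 to the same quorum is rejected by every member, so nothing new is committed.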
>
>Hope this helps.
>
> -Joao
>
>>
>> Please help to clarify these points.
>>
>> Regards
>> Pragya Jain
>>
>>
>>
>>
>> On Saturday, 30 August 2014 7:29 AM, Joao Eduardo Luis
>> <joao.l...@inktank.com> wrote:
>>
>>
>>
>> On 08/29/2014 11:22 PM, J David wrote:
>>
>> > So an even number N of monitors doesn't give you any better fault
>> > resilience than N-1 monitors. And the more monitors you have, the
>> > more traffic there is between them. So when N is even, N monitors
>> > consume more resources and provide no extra benefit compared to N-1
>> > monitors.
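
The arithmetic behind this point can be checked with a short sketch. It assumes a quorum needs a strict majority of the N monitors, N // 2 + 1; the helper name tolerated_failures is made up for illustration.

```python
def tolerated_failures(n):
    """How many monitors can fail while a strict majority stays alive."""
    majority = n // 2 + 1
    return n - majority


for n in range(1, 7):
    print(n, "monitors tolerate", tolerated_failures(n), "failure(s)")
```

Going from 3 to 4 monitors, or from 5 to 6, leaves the number of tolerated failures unchanged.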
>>
>>
>> Except for more copies ;)
>>
>> But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5.
>> As long as you don't go with 1 you should be okay. Only go with 1 if
>> you're truly okay with losing whatever you're storing if that one
>> monitor's disk is fried.
>>
>> -Joao
>>
>>
>> --
>> Joao Eduardo Luis
>> Software Engineer | http://inktank.com | http://ceph.com
>
>>
>>
>>
>
>
>--
>Joao Eduardo Luis
>Software Engineer | http://inktank.com | http://ceph.com
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com