On 08/30/2014 08:03 AM, pragya jain wrote:
Thanks Greg, Joao and David,
The concept of why an odd number of monitors is preferred is clear to me,
but I am still not clear about the working of the Paxos algorithm:
#1. All changes to any monitor data structure, whether it is the monitor
map, OSD map, PG map, MDS map, or CRUSH map, are made through the Paxos
algorithm; and
#2. The Paxos algorithm also establishes a quorum among the monitors for
the most recent copy of the cluster map.
I am unable to understand how these two things are related and connected.
How does Paxos provide these two functionalities?
As Greg mentioned before, Paxos is a consensus algorithm, so we can
leverage it for anything that requires consensus.
There are two portions of the monitors that use a modified version of
Paxos (but still Paxos in nature): map consensus and elections.
Let me give you a (rough) temporal view of how the monitor applies this
once it starts. Say you have 5 monitors total, 2 of which are down.
1. Alive monitors will "probe" all monitors in the monmap (all 4 other
monitors) -- the probing phase is independent from anything Paxos-related
and is meant to make each monitor aware of which peers are up, alive and
reachable.
2. Once enough monitors to form a quorum (i.e., at least (N+1)/2) reply
to the probes, the monitors will enter the election phase.
3. The election phase is a stripped-down version of Paxos and goes
something like this:
- mon.a has rank 0 and thinks it must be the leader
- mon.b has rank 1 and thinks it must be the leader
- mon.c has rank 2 and thinks it must be the leader
- mon.a receives mon.b's and mon.c's leader proposals and ignores
them, as mon.a has a higher rank than mon.b or mon.c (the lower the
value, the higher the rank)
- mon.c receives mon.a's leader proposal and defers to mon.a (a's
rank 0 outranks c's rank 2).
- mon.c receives mon.b's leader proposal and ignores it, as it has
already deferred to a monitor with a higher rank than mon.b's (a's
rank 0 outranks b's rank 1).
- mon.b receives mon.a's leader proposal and defers to mon.a (a's
rank 0 outranks b's rank 1).
- mon.a got 3 accepts (mon.a's + mon.b's + mon.c's), which is an
absolute majority (3 == (N+1)/2, for N = 5). mon.a declares itself the
leader, and every other monitor declares itself a peon.
The election phase follows Paxos 'prepare', 'promise', 'accept' and
'accepted' phases.
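
To make the rank-based deferral concrete, here is a minimal sketch in
plain Python. It is only an illustration of the description above, not
the monitor's actual code; elect() and its parameters are invented
names for this example:

    # Simplified sketch of the rank-based election (illustration only).
    # A lower rank value means a higher rank, so every monitor ends up
    # deferring to the lowest-ranked monitor it can reach.
    def elect(alive_ranks, total_monitors):
        """Return the winning rank, or None if no quorum is possible."""
        majority = total_monitors // 2 + 1   # (N+1)/2 for odd N
        if len(alive_ranks) < majority:
            return None  # too few monitors answered the probes
        # Every alive monitor proposes itself, then defers to any
        # proposal carrying a lower rank value, so all accepts
        # converge on the lowest alive rank.
        return min(alive_ranks)

    # 5 monitors total (ranks 0..4), two of them down:
    print(elect({0, 1, 2}, 5))  # -> 0, i.e. mon.a becomes the leader
    print(elect({3, 4}, 5))     # -> None: only 2 alive, no quorum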
The same goes for maps. Once the leader has been elected and the peons
established, we can state that a quorum has been reached. The quorum is
the set of all monitors currently participating in consensus, and in
this case the quorum will be { mon.a, mon.b, mon.c }. After a quorum has
been established, the monitors will be able to allow map modifications
as needed.
So say a new OSD is added to the cluster. The osdmap needs to reflect
this. The leader handles the modification, keeps it in a temporary,
to-be-committed osdmap, and proposes the changes to all monitors in the
quorum.
1. Leader proposes the modification to all quorum participants. Each
modification is packed with a version and a proposal number.
2. Each monitor will check whether it has seen said proposal number
before. If not, it will take the proposal from the leader, stash it on
disk in a temporary location, and let the leader know that it has been
accepted. If, on the other hand, the monitor sees that said proposal
number has been proposed before, it will not accept the proposal and
will simply ignore the leader.
3. The leader will collect all 'accepts' from the peons. If (N+1)/2
monitors (counting the leader, which accepts its own proposals by
default) accepted the proposal, the leader will issue a 'commit'
instructing everyone to move the proposal from its temporary location to
its final location (for instance, from 'stashed_proposal' to
'osdmap:version_10'). If not enough monitors accepted the proposal
(i.e., fewer than (N+1)/2), a timeout will eventually be triggered and
the quorum will undergo a new election.
This also follows Paxos 'prepare', 'promise', 'accept' and 'accepted'
phases, even if we cut corners to reduce message passing.
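
Here is a rough sketch of that accept/commit round, again in plain
Python. Peon, leader_propose() and the stashed/committed stores are
invented for this example and only mirror the steps above; the real
logic lives in the monitor's Paxos code:

    # Sketch of the map-update round (illustration only, not Ceph code).
    class Peon:
        def __init__(self):
            self.seen = set()    # proposal numbers seen before
            self.stashed = {}    # temporary, to-be-committed proposals
            self.committed = {}  # final locations, e.g. 'osdmap:version_10'

        def accept(self, pnum, key, value):
            if pnum in self.seen:
                return False     # proposed before: ignore the leader
            self.seen.add(pnum)
            self.stashed[pnum] = (key, value)  # stash in temporary location
            return True          # let the leader know we accepted

        def commit(self, pnum):
            key, value = self.stashed.pop(pnum)
            self.committed[key] = value        # move to final location

    def leader_propose(peons, total_monitors, pnum, key, value):
        majority = total_monitors // 2 + 1
        accepted = [p for p in peons if p.accept(pnum, key, value)]
        # The leader counts itself: it accepts its own proposals by default.
        if 1 + len(accepted) >= majority:
            for p in accepted:
                p.commit(pnum)
            return True
        return False  # too few accepts: a timeout and new election follow

    peons = [Peon(), Peon()]  # quorum of 3 counting the leader, N = 5
    print(leader_propose(peons, 5, pnum=1, key="osdmap:version_10",
                         value="osdmap including the new OSD"))  # -> True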
Hope this helps.
-Joao
Please help to clarify these points.
Regards
Pragya Jain
On Saturday, 30 August 2014 7:29 AM, Joao Eduardo Luis
<joao.l...@inktank.com> wrote:
On 08/29/2014 11:22 PM, J David wrote:
> So an even number N of monitors doesn't give you any better fault
> resilience than N-1 monitors. And the more monitors you have, the
> more traffic there is between them. So when N is even, N monitors
> consume more resources and provide no extra benefit compared to N-1
> monitors.
Except for more copies ;)
But yeah, if you're going with 2 or 4, you'll be better off with 3 or 5.
As long as you don't go with 1 you should be okay. Only go with 1 if
you're truly okay with losing whatever you're storing if that one
monitor's disk is fried.
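
To see the arithmetic behind that, here is a quick plain-Python
illustration (not Ceph code) of how many failures each monitor count
tolerates:

    # Majority quorum is floor(N/2) + 1; the rest may be down.
    for n in range(1, 7):
        majority = n // 2 + 1
        print(f"N={n}: quorum={majority}, tolerates {n - majority} down")
    # N=3 and N=4 both tolerate 1 failure; N=5 and N=6 both tolerate 2:
    # an even count adds copies and traffic, but no extra resilience.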
-Joao
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com