That make sense, I didn't consider the case when ensemble size > write
quorum size. So I think as you mentioned before, we can add an option to
enforce strong rack diversity.

Considering the case of (4,2,2), I think we can't let num of racks = min
(num of racks, write quorum size). Because in this case, num of racks will
become 2, this will still potentially break the rack-diversity.

So if we want to add  such an option in the future, it should be min(num of
racks, ensemble size)?

On Thu, Aug 3, 2017 at 12:53 AM, Sijie Guo <[email protected]> wrote:

> On Wed, Aug 2, 2017 at 11:34 PM, Yiming Zang <[email protected]>
> wrote:
>
> > Hi @Sijie, it make sense to me to choose bookies at the same rack as many
> > as possible and have a least one replica at a remote rack for
> > availability, so that we can have lower latency within the same rack and
> > availability when one rack is down.
> >
> > However it seems the current implementation of RackAwarePlacementPolicy
> > doesn't do exactly as that goal is expected to do. It just start with the
> > same rack for the first bookie, and then for each next bookie, it pick
> the
> > rack that is different from the previous one. So it's not really choosing
> > "as many as possible" bookies from same rack as the client.
> >
>
> It is a variant of that, because the algorithm here need to ensure the
> coverage across all the write quorums.
>
> I think it is easy to explain it just using two examples.
>
> 1) ensemble size == write quorum size: [4, 4, 2]
> 2) ensemble size > write quorum size: [4, 2, 2]
>
>
> if we are going to support only 1) case, it totally makes sense to pick up
> as many bookies from the same rack of client. because any write quorums
> will meet 2-racks requirements.
>
> However, if we apply this algorithm to second case 2), it would fail. lets
> say we have two racks `R1` and `R2`. the client is running on a machine
> under `R1`. so if we pickup as many bookies from local rack, then an
> ensemble would end up like [R1, R2, R1, R1]. In this case, the last two
> bookies are ended at same rack `R1`, this will break the rack-diversity for
> entries written to the quorums formed by those two bookies.
>
> Because we are allowing the existence of `ensmeble size` > `write quorum
> size`, we take a simple approach to make sure any adjacent bookies in an
> ensemble will belong to two different racks. we can guarantee rack
> diversity (2 racks), no matter what is the write quorum size.
>
> Hope this explains.
>
> - Sijie
>
>
>
>
>
> >
> > Or am I missing anything?
> >
> > On Wed, Aug 2, 2017 at 7:29 PM, Sijie Guo <[email protected]> wrote:
> >
> > > On Aug 2, 2017 6:54 PM, "Yiming Zang" <[email protected]>
> wrote:
> > >
> > > I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it
> > turns
> > > out to me that the current implementation of
> > > RackawareEnsemblePlacementPolicy only enforces at least 2 different
> > racks,
> > > it doesn't care about ensemble size, write quorum size or ack quorum
> > size.
> > >
> > > For example, imagine we have the follow rack diversity (r1 means
> rack1):
> > >
> > > ===========
> > > BK1: r1
> > > BK2: r2
> > > BK3: r3
> > > BK4: r3
> > > BK5: r3
> > > ===========
> > >
> > > Now we're creating a ledger with (3,3,2), in which ensemble size and
> > write
> > > quorum size are both 2. My expected behavior is the ensemble must
> > contains
> > > BK1 and BK2, and would choose the third bookie from BK3~BK5. However,
> in
> > > fact, the ensemble violate that rule, which can be BK2, BK3, BK4, where
> > two
> > > of the bookies share the same rack.
> > >
> > > I've also added a test case to validate this theory.
> > >
> > > Question: Is this behavior expected? Shall we use write quorum size as
> > rack
> > > diversity so that we can spread the ledger across more racks?
> > >
> > >
> > > I think that was the expected behavior. The idea was choosing bookies
> at
> > > the same rack as many as possible and have replicas at a remote rack
> for
> > > availability. Because typically inter-rack latency will be higher than
> > > intra-rack. So the rack devesity enforcement isn't aligned with write
> > > quorum size.
> > >
> > > I think we can introduce a setting in rack aware placement policy like
> > > 'rack.diversity.enforced'. if it is true, the number of racks will be
> min
> > > (num of total racks, write quorum size).
> > >
> > > Sijie
> > >
> > >
> > > There core logic is here, where it use "~" to represent a different
> rack
> > > than the previous one:
> > >
> > > // pick nodes by racks, to ensure there is at least two racks per write
> > > quorum.
> > > for (int i = 0; i < ensembleSize; i++) {
> > >     String curRack;
> > >     if (null == prevNode) {
> > >         if ((null == localNode) ||
> > >
> > > localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) {
> > >             curRack = NodeBase.ROOT;
> > >         } else {
> > >             curRack = localNode.getNetworkLocation();
> > >         }
> > >     } else {
> > >         curRack = "~" + prevNode.getNetworkLocation();
> > >     }
> > >     prevNode = selectFromNetworkLocation(curRack, excludeNodes,
> > > ensemble, ensemble);
> > > }
> > > ArrayList<BookieSocketAddress> bookieList = ensemble.toList();
> > > if (ensembleSize != bookieList.size()) {
> > >     LOG.error("Not enough {} bookies are available to form an ensemble
> :
> > > {}.",
> > >               ensembleSize, bookieList);
> > >     throw new BKNotEnoughBookiesException();
> > > }
> > > return bookieList;
> > >
> >
>

Reply via email to