On Wed, Aug 2, 2017 at 11:34 PM, Yiming Zang <yz...@twitter.com.invalid>
wrote:

> Hi @Sijie, it make sense to me to choose bookies at the same rack as many
> as possible and have a least one replica at a remote rack for
> availability, so that we can have lower latency within the same rack and
> availability when one rack is down.
>
> However it seems the current implementation of RackAwarePlacementPolicy
> doesn't do exactly as that goal is expected to do. It just start with the
> same rack for the first bookie, and then for each next bookie, it pick the
> rack that is different from the previous one. So it's not really choosing
> "as many as possible" bookies from same rack as the client.
>

It is a variant of that, because the algorithm here need to ensure the
coverage across all the write quorums.

I think it is easy to explain it just using two examples.

1) ensemble size == write quorum size: [4, 4, 2]
2) ensemble size > write quorum size: [4, 2, 2]


if we are going to support only 1) case, it totally makes sense to pick up
as many bookies from the same rack of client. because any write quorums
will meet 2-racks requirements.

However, if we apply this algorithm to second case 2), it would fail. lets
say we have two racks `R1` and `R2`. the client is running on a machine
under `R1`. so if we pickup as many bookies from local rack, then an
ensemble would end up like [R1, R2, R1, R1]. In this case, the last two
bookies are ended at same rack `R1`, this will break the rack-diversity for
entries written to the quorums formed by those two bookies.

Because we are allowing the existence of `ensmeble size` > `write quorum
size`, we take a simple approach to make sure any adjacent bookies in an
ensemble will belong to two different racks. we can guarantee rack
diversity (2 racks), no matter what is the write quorum size.

Hope this explains.

- Sijie





>
> Or am I missing anything?
>
> On Wed, Aug 2, 2017 at 7:29 PM, Sijie Guo <guosi...@gmail.com> wrote:
>
> > On Aug 2, 2017 6:54 PM, "Yiming Zang" <yz...@twitter.com.invalid> wrote:
> >
> > I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it
> turns
> > out to me that the current implementation of
> > RackawareEnsemblePlacementPolicy only enforces at least 2 different
> racks,
> > it doesn't care about ensemble size, write quorum size or ack quorum
> size.
> >
> > For example, imagine we have the follow rack diversity (r1 means rack1):
> >
> > ===========
> > BK1: r1
> > BK2: r2
> > BK3: r3
> > BK4: r3
> > BK5: r3
> > ===========
> >
> > Now we're creating a ledger with (3,3,2), in which ensemble size and
> write
> > quorum size are both 2. My expected behavior is the ensemble must
> contains
> > BK1 and BK2, and would choose the third bookie from BK3~BK5. However, in
> > fact, the ensemble violate that rule, which can be BK2, BK3, BK4, where
> two
> > of the bookies share the same rack.
> >
> > I've also added a test case to validate this theory.
> >
> > Question: Is this behavior expected? Shall we use write quorum size as
> rack
> > diversity so that we can spread the ledger across more racks?
> >
> >
> > I think that was the expected behavior. The idea was choosing bookies at
> > the same rack as many as possible and have replicas at a remote rack for
> > availability. Because typically inter-rack latency will be higher than
> > intra-rack. So the rack devesity enforcement isn't aligned with write
> > quorum size.
> >
> > I think we can introduce a setting in rack aware placement policy like
> > 'rack.diversity.enforced'. if it is true, the number of racks will be min
> > (num of total racks, write quorum size).
> >
> > Sijie
> >
> >
> > There core logic is here, where it use "~" to represent a different rack
> > than the previous one:
> >
> > // pick nodes by racks, to ensure there is at least two racks per write
> > quorum.
> > for (int i = 0; i < ensembleSize; i++) {
> >     String curRack;
> >     if (null == prevNode) {
> >         if ((null == localNode) ||
> >
> > localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) {
> >             curRack = NodeBase.ROOT;
> >         } else {
> >             curRack = localNode.getNetworkLocation();
> >         }
> >     } else {
> >         curRack = "~" + prevNode.getNetworkLocation();
> >     }
> >     prevNode = selectFromNetworkLocation(curRack, excludeNodes,
> > ensemble, ensemble);
> > }
> > ArrayList<BookieSocketAddress> bookieList = ensemble.toList();
> > if (ensembleSize != bookieList.size()) {
> >     LOG.error("Not enough {} bookies are available to form an ensemble :
> > {}.",
> >               ensembleSize, bookieList);
> >     throw new BKNotEnoughBookiesException();
> > }
> > return bookieList;
> >
>

Reply via email to