Hi @Sijie, it makes sense to me to choose as many bookies as possible from the same rack and keep at least one replica on a remote rack, so that we get lower latency within the rack and stay available when one rack is down.
However, the current implementation of RackAwarePlacementPolicy doesn't seem to do exactly that. It just starts with the same rack for the first bookie, and then for each subsequent bookie it picks a rack that is different from the previous one. So it is not really choosing "as many as possible" bookies from the same rack as the client. Or am I missing something?

On Wed, Aug 2, 2017 at 7:29 PM, Sijie Guo <[email protected]> wrote:

> On Aug 2, 2017 6:54 PM, "Yiming Zang" <[email protected]> wrote:
>
> I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it seems
> to me that the current implementation of
> RackawareEnsemblePlacementPolicy only enforces at least 2 different racks;
> it doesn't take ensemble size, write quorum size or ack quorum size into
> account.
>
> For example, imagine we have the following rack diversity (r1 means rack1):
>
> ===========
> BK1: r1
> BK2: r2
> BK3: r3
> BK4: r3
> BK5: r3
> ===========
>
> Now we're creating a ledger with (3,3,2), in which ensemble size and write
> quorum size are both 3. My expected behavior is that the ensemble must
> contain BK1 and BK2, and would choose the third bookie from BK3~BK5.
> However, in fact, the ensemble can violate that rule: it can be BK2, BK3,
> BK4, where two of the bookies share the same rack.
>
> I've also added a test case to validate this theory.
>
> Question: Is this behavior expected? Shall we use write quorum size as rack
> diversity so that we can spread the ledger across more racks?
>
>
> I think that was the expected behavior. The idea was to choose as many
> bookies as possible at the same rack and have replicas at a remote rack
> for availability, because inter-rack latency is typically higher than
> intra-rack latency. So the rack diversity enforcement isn't aligned with
> write quorum size.
>
> I think we can introduce a setting in the rack-aware placement policy like
> 'rack.diversity.enforced'. If it is true, the number of racks will be
> min(num of total racks, write quorum size).
>
> Sijie
>
>
> The core logic is here, where it uses "~" to represent a different rack
> than the previous one:
>
> // pick nodes by racks, to ensure there is at least two racks per write quorum.
> for (int i = 0; i < ensembleSize; i++) {
>     String curRack;
>     if (null == prevNode) {
>         if ((null == localNode) ||
>                 localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) {
>             curRack = NodeBase.ROOT;
>         } else {
>             curRack = localNode.getNetworkLocation();
>         }
>     } else {
>         curRack = "~" + prevNode.getNetworkLocation();
>     }
>     prevNode = selectFromNetworkLocation(curRack, excludeNodes,
>             ensemble, ensemble);
> }
> ArrayList<BookieSocketAddress> bookieList = ensemble.toList();
> if (ensembleSize != bookieList.size()) {
>     LOG.error("Not enough {} bookies are available to form an ensemble : {}.",
>             ensembleSize, bookieList);
>     throw new BKNotEnoughBookiesException();
> }
> return bookieList;
>
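
To make the discussion concrete, the loop quoted above can be modelled roughly as follows. This is only a minimal standalone sketch, not the BookKeeper code: the class RackPlacementSketch and all of its methods are hypothetical names, the client is assumed to have no local rack (so the first pick can be any bookie), and the random selection and fallback behavior of selectFromNetworkLocation are replaced by enumerating every ensemble the "different rack than the previous pick" rule allows. requiredRacks models the proposed 'rack.diversity.enforced' setting, which is just a suggestion in this thread and not an existing option.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: models only the "~previous rack" rule, nothing else.
class RackPlacementSketch {

    // Rack layout from the example in this thread (bookie -> rack).
    static Map<String, String> exampleRacks() {
        Map<String, String> racks = new LinkedHashMap<>();
        racks.put("BK1", "r1");
        racks.put("BK2", "r2");
        racks.put("BK3", "r3");
        racks.put("BK4", "r3");
        racks.put("BK5", "r3");
        return racks;
    }

    // Every ordered ensemble in which each bookie sits on a different rack
    // than the bookie picked just before it. The first pick is unconstrained
    // (assumption: the client has no local rack).
    static List<List<String>> possibleEnsembles(Map<String, String> racks, int ensembleSize) {
        List<List<String>> out = new ArrayList<>();
        extend(racks, ensembleSize, new ArrayList<>(), out);
        return out;
    }

    private static void extend(Map<String, String> racks, int ensembleSize,
                               List<String> current, List<List<String>> out) {
        if (current.size() == ensembleSize) {
            out.add(new ArrayList<>(current));
            return;
        }
        String prevRack = current.isEmpty() ? null : racks.get(current.get(current.size() - 1));
        for (String bookie : racks.keySet()) {
            if (current.contains(bookie)) {
                continue; // a bookie appears at most once per ensemble
            }
            if (prevRack != null && prevRack.equals(racks.get(bookie))) {
                continue; // "~prevRack": must differ from the previous pick only
            }
            current.add(bookie);
            extend(racks, ensembleSize, current, out);
            current.remove(current.size() - 1);
        }
    }

    // Number of distinct racks covered by an ensemble.
    static int distinctRacks(Map<String, String> racks, List<String> ensemble) {
        Set<String> seen = new HashSet<>();
        for (String bookie : ensemble) {
            seen.add(racks.get(bookie));
        }
        return seen.size();
    }

    // Rack count the proposed setting would require: min(total racks, write quorum size).
    static int requiredRacks(int totalRacks, int writeQuorumSize) {
        return Math.min(totalRacks, writeQuorumSize);
    }

    public static void main(String[] args) {
        Map<String, String> racks = exampleRacks();
        int totalRacks = new HashSet<>(racks.values()).size();
        int required = requiredRacks(totalRacks, 3); // ledger (3,3,2): write quorum 3
        for (List<String> ensemble : possibleEnsembles(racks, 3)) {
            int covered = distinctRacks(racks, ensemble);
            System.out.println(ensemble + " covers " + covered + " rack(s)"
                    + (covered >= required ? "" : "  <- fewer than min(racks, write quorum)"));
        }
    }
}
```

In this model the ensemble [BK3, BK2, BK4] is allowed, because each pick only has to differ from the one right before it, yet it covers just r3 and r2. That matches the behavior described above: the rule guarantees two racks, not a rack count tied to the write quorum size.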
