I was looking into RackawareEnsemblePlacementPolicyImpl.java, and it turns
out to me that the current implementation of
RackawareEnsemblePlacementPolicy only enforces at least 2 different racks,
it doesn't care about ensemble size, write quorum size or ack quorum size.
For example, imagine we have the follow rack diversity (r1 means rack1):
===========
BK1: r1
BK2: r2
BK3: r3
BK4: r3
BK5: r3
===========
Now we're creating a ledger with (3,3,2), in which ensemble size and write
quorum size are both 2. My expected behavior is the ensemble must contains
BK1 and BK2, and would choose the third bookie from BK3~BK5. However, in
fact, the ensemble violate that rule, which can be BK2, BK3, BK4, where two
of the bookies share the same rack.
I've also added a test case to validate this theory.
Question: Is this behavior expected? Shall we use write quorum size as rack
diversity so that we can spread the ledger across more racks?
There core logic is here, where it use "~" to represent a different rack
than the previous one:
// pick nodes by racks, to ensure there is at least two racks per write quorum.
for (int i = 0; i < ensembleSize; i++) {
String curRack;
if (null == prevNode) {
if ((null == localNode) ||
localNode.getNetworkLocation().equals(NetworkTopology.DEFAULT_RACK)) {
curRack = NodeBase.ROOT;
} else {
curRack = localNode.getNetworkLocation();
}
} else {
curRack = "~" + prevNode.getNetworkLocation();
}
prevNode = selectFromNetworkLocation(curRack, excludeNodes,
ensemble, ensemble);
}
ArrayList<BookieSocketAddress> bookieList = ensemble.toList();
if (ensembleSize != bookieList.size()) {
LOG.error("Not enough {} bookies are available to form an ensemble : {}.",
ensembleSize, bookieList);
throw new BKNotEnoughBookiesException();
}
return bookieList;