During the discussion in the hangout, it was mentioned that it would be desirable for consumers to know the rack information of the brokers so that they can consume from a broker in the same rack to reduce latency. As I understand it, this is only beneficial once a consumer can consume from any broker in the ISR, which is not possible today.
I suggest we skip the change to TMR. Once the consumer is changed to be able to consume from any broker in the ISR, the rack information can be added to TMR.

Another thing I want to confirm is the command line behavior. I think the desirable default behavior is to fail fast on the command line for an incomplete rack mapping. The error message can include further instruction telling the user to add an extra argument (like "--allow-partial-rackinfo") to suppress the error and do an imperfect rack aware assignment. If the default behavior is to allow an incomplete mapping, the error can still be easily missed. The affected command line tools are TopicCommand and ReassignPartitionsCommand.

Thanks,
Allen
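To make the proposed default concrete, here is a minimal sketch of the kind of up-front check the tools could run before computing an assignment. The flag "--allow-partial-rackinfo" is only the suggestion above, and every name in this snippet is hypothetical rather than existing Kafka code:

object RackMappingValidator {

  final case class BrokerInfo(id: Int, rack: Option[String])

  // Fail fast when only some brokers carry rack information, unless the caller
  // explicitly accepts a partial (and therefore imperfect) rack aware assignment,
  // e.g. via an "--allow-partial-rackinfo" style switch.
  def validate(brokers: Seq[BrokerInfo], allowPartialRackInfo: Boolean): Unit = {
    val missing = brokers.filter(_.rack.isEmpty).map(_.id)
    val someHaveRack = brokers.exists(_.rack.isDefined)
    if (someHaveRack && missing.nonEmpty && !allowPartialRackInfo)
      throw new IllegalArgumentException(
        s"Brokers ${missing.mkString(", ")} have no rack information. " +
          "Rerun with --allow-partial-rackinfo to accept an imperfect rack aware assignment.")
  }
}

TopicCommand and ReassignPartitionsCommand could run a check of this shape before any assignment is written to ZooKeeper, so the mistake surfaces immediately instead of silently producing a non-rack-aware layout.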
On Mon, Oct 26, 2015 at 12:55 PM, Aditya Auradkar <aaurad...@linkedin.com> wrote:

> Hi Allen,
>
> For TopicMetadataResponse to understand version, you can bump up the request version itself. Based on the version of the request, the response can be appropriately serialized. It shouldn't be a huge change. For example, we went through something similar for ProduceRequest recently (https://reviews.apache.org/r/33378/). I guess the reason protocol information is not included in the TMR is because the topic itself is independent of any particular protocol (SSL vs Plaintext). Having said that, I'm not sure we even need rack information in TMR. What use case were you thinking of initially?
>
> For 1 - I'd be fine with adding an option to the command line tools that checks rack assignment, e.g. "--strict-assignment" or something similar.
>
> Aditya
>
> On Thu, Oct 22, 2015 at 6:44 PM, Allen Wang <allenxw...@gmail.com> wrote:
>
> > For 2 and 3, I have updated the KIP. Please take a look. One thing I have changed is removing the proposal to add rack to TopicMetadataResponse. The reason is that unlike UpdateMetadataRequest, TopicMetadataResponse does not understand version. I don't see a way to include rack without breaking old versions of clients. That's probably why the secure protocol is not included in the TopicMetadataResponse either. I think it will be a much bigger change to include rack in TopicMetadataResponse.
> >
> > For 1, my concern is that doing rack aware assignment without a complete broker to rack mapping will result in an assignment that is not rack aware and fails to provide fault tolerance in the event of a rack outage. This kind of problem will be difficult to surface. And the cost of this problem is high: you have to do partition reassignment if you are lucky enough to spot the problem early on, or face the consequence of data loss during a real rack outage.
> >
> > I do see the concern with fail-fast, as it might also cause data loss if the producer is not able to produce messages due to topic creation failure. Is it feasible to treat dynamic topic creation and command line tools differently? We allow dynamic topic creation with an incomplete broker-rack mapping and fail fast in the command line. Another option is to let the user determine the behavior for the command line. For example, by default fail fast in the command line, but allow an incomplete broker-rack mapping if another switch is provided.
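For illustration of the fault tolerance point above: with a complete mapping, the replicas of a partition can be interleaved across racks, and any broker missing from the mapping simply drops out of that spread. This is a toy sketch, not the assignment algorithm proposed in the KIP, and all names are made up:

object ToyRackAwareAssignment {

  // brokersByRack maps rack -> broker ids; returns the replica list for one partition.
  def assignOnePartition(brokersByRack: Map[String, Seq[Int]],
                         replicationFactor: Int,
                         startIndex: Int = 0): Seq[Int] = {
    require(brokersByRack.nonEmpty, "no brokers with rack information")
    val racks = brokersByRack.keys.toSeq.sorted
    val maxPerRack = brokersByRack.values.map(_.size).max
    // Take the first broker of every rack, then the second of every rack, and so on,
    // so that consecutive replicas land on different racks whenever possible.
    val interleaved = for {
      pos  <- 0 until maxPerRack
      rack <- racks
      if brokersByRack(rack).isDefinedAt(pos)
    } yield brokersByRack(rack)(pos)
    require(replicationFactor <= interleaved.size, "not enough brokers for the replication factor")
    (0 until replicationFactor).map(i => interleaved((startIndex + i) % interleaved.size))
  }
}

With three racks of two brokers each and replication factor 3, every partition gets one replica per rack; a broker whose rack is unknown never appears in brokersByRack, which is exactly the silent degradation described above.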
> > On Tue, Oct 20, 2015 at 10:05 AM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:
> >
> > > Hey Allen,
> > >
> > > 1. If we choose fail fast topic creation, we will have topic creation failures while upgrading the cluster. I really doubt we want this behavior. Ideally, this should be invisible to clients of a cluster. Currently, each broker is effectively its own rack. So we probably can use the rack information whenever possible but not make it a hard requirement. To extend Gwen's example, one badly configured broker should not degrade topic creation for the entire cluster.
> > >
> > > 2. Upgrade scenario - Can you add a section on the upgrade piece to confirm that old clients will not see errors? I believe ZookeeperConsumerConnector reads the Broker objects from ZK. I wanted to confirm that this will not cause any problems.
> > >
> > > 3. Could you elaborate on your proposed changes to the UpdateMetadataRequest in the "Public Interfaces" section? Personally, I find this format easy to read in terms of wire protocol changes: https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-CreateTopicRequest
> > >
> > > Aditya
> > >
> > > On Fri, Oct 16, 2015 at 3:45 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > >
> > > > The KIP is updated to include rack as an optional property for the broker. Please take a look and let me know if more details are needed.
> > > >
> > > > For the case where some brokers have rack and some do not, the current KIP uses the fail-fast behavior. If there are concerns, we can further discuss this in the email thread or the next hangout.
> > > >
> > > > On Thu, Oct 15, 2015 at 10:42 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > > >
> > > > > That's a good question. I can think of three actions if the rack information is incomplete:
> > > > >
> > > > > 1. Treat the node without rack as if it is on its own unique rack
> > > > > 2. Disregard all rack information and fall back to the current algorithm
> > > > > 3. Fail fast
> > > > >
> > > > > Now that I think about it, one and three make more sense. The reason for fail-fast is that a user mistake in not providing the rack may never be found if we tolerate it, and the assignment may not be rack aware as the user expected, which creates debugging problems when things fail.
> > > > >
> > > > > What do you think? If not fail-fast, is there any way we can make the user error stand out?
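Option 1 above is easy to express: a broker without a rack is placed in a synthetic rack of its own, so rack aware placement still runs but gives no guarantee for that broker. A one-line sketch with hypothetical names:

// Sketch of option 1: a broker with no rack occupies its own synthetic rack.
def rackOrSyntheticRack(brokerId: Int, rack: Option[String]): String =
  rack.getOrElse(s"__no-rack-broker-$brokerId")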
> > > > > On Thu, Oct 15, 2015 at 10:17 AM, Gwen Shapira <g...@confluent.io> wrote:
> > > > >
> > > > > > Thanks! Just to clarify, when some brokers have rack assignment and some don't, do we act like none of them have it, or like those without assignment are in their own rack?
> > > > > >
> > > > > > The first scenario is good when first setting up rack-awareness, but the second makes more sense for on-going maintenance (I can totally see someone adding a node and forgetting to set the rack property; we don't want this to change behavior for anything except the new node).
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Gwen
> > > > > >
> > > > > > On Thu, Oct 15, 2015 at 10:13 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > > > > >
> > > > > > > For scenario 1:
> > > > > > >
> > > > > > > - Add the rack information to the broker property file, or dynamically set it in the wrapper code that bootstraps the Kafka server. You would do that for all brokers and restart the brokers one by one.
> > > > > > >
> > > > > > > In this scenario, the complete broker to rack mapping may not be available until every broker is restarted. During that time we fall back to the default replica assignment algorithm.
> > > > > > >
> > > > > > > For scenario 2:
> > > > > > >
> > > > > > > - Add the rack information to the broker property file, or dynamically set it in the wrapper code, and start the broker.
> > > > > > >
> > > > > > > On Wed, Oct 14, 2015 at 2:36 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Can you clarify the workflow for the following scenarios:
> > > > > > > >
> > > > > > > > 1. I currently have 6 brokers and want to add rack information for each
> > > > > > > > 2. I'm adding a new broker and I want to specify which rack it belongs on while adding it.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > On Tue, Oct 13, 2015 at 2:21 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > We discussed the KIP in the hangout today. The recommendation is to make rack a broker property in ZooKeeper. For users with existing rack information stored somewhere else, they would need to retrieve the information at broker start up and dynamically set the rack property, which can be implemented as a wrapper that bootstraps the broker. There will be no interface or pluggable implementation to retrieve the rack information.
> > > > > > > > >
> > > > > > > > > The assumption is that you always need to restart the broker to make a change to the rack.
> > > > > > > > >
> > > > > > > > > Once the rack becomes a broker property, it will be possible to make rack part of the metadata to help the consumer choose which in-sync replica to consume from, as part of the future consumer enhancement.
> > > > > > > > >
> > > > > > > > > I will update the KIP.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Allen
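The "wrapper" idea above amounts to resolving the rack before handing the properties to the normal broker startup path. A minimal sketch, assuming a property simply named "rack" (the final config name was not settled at this point) and a made-up lookup source:

import java.util.Properties

object RackAwareBootstrap {

  // Hypothetical lookup: e.g. an environment variable injected by deployment tooling.
  def lookupRackFor(brokerId: String): Option[String] =
    sys.env.get("BROKER_RACK")

  // Inject the rack, then hand the decorated properties to the usual Kafka startup code.
  def withRack(props: Properties): Properties = {
    lookupRackFor(props.getProperty("broker.id")).foreach(r => props.setProperty("rack", r))
    props
  }
}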
> > > > > > > > > On Thu, Oct 8, 2015 at 9:23 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > I attended Tuesday's KIP hangout but this KIP was not discussed due to time constraints.
> > > > > > > > > >
> > > > > > > > > > However, after hearing the discussion of KIP-35, I have the feeling that the incompatibility (caused by the new broker property) between brokers with different versions will be solved there. In addition, having rack in the broker properties as metadata may also help consumers in the future. So I am open to adding the rack property to the broker.
> > > > > > > > > >
> > > > > > > > > > Hopefully we can discuss this in the next KIP hangout.
> > > > > > > > > >
> > > > > > > > > > On Wed, Sep 30, 2015 at 2:46 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Can you send me the information on the next KIP hangout?
> > > > > > > > > > >
> > > > > > > > > > > Currently the broker-rack mapping is not cached. In KafkaApis, RackLocator.getRackInfo() is called each time the mapping is needed for auto topic creation. This will ensure the latest mapping is used at any time.
> > > > > > > > > > >
> > > > > > > > > > > The ability to get the complete mapping makes it simple to reuse the same interface in command line tools.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Sep 30, 2015 at 11:01 AM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Perhaps we discuss this during the next KIP hangout?
> > > > > > > > > > > >
> > > > > > > > > > > > I do see that a pluggable rack locator can be useful, but I do see a few concerns:
> > > > > > > > > > > >
> > > > > > > > > > > > - The RackLocator (as described in the document) implies that it can discover rack information for any node in the cluster. How does it deal with rack location changes? For example, if I moved broker id 1 from rack X to Y, I only have to start that broker with a newer rack config. If RackLocator discovers broker -> rack information at start up time, any change to a broker will require bouncing the entire cluster, since createTopic requests can be sent to any node in the cluster. For this reason it may be simpler to have each node be aware of its own rack and persist it in ZK at start up time.
> > > > > > > > > > > >
> > > > > > > > > > > > - A pluggable RackLocator relies on an external service being available to serve rack information.
> > > > > > > > > > > >
> > > > > > > > > > > > Out of curiosity, I looked up how a couple of other systems deal with zone/rack awareness. For Cassandra some interesting modes are:
> > > > > > > > > > > >
> > > > > > > > > > > > (Property file configuration) http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchPFSnitch_t.html
> > > > > > > > > > > > (Dynamic inference) http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchRackInf_c.html
> > > > > > > > > > > >
> > > > > > > > > > > > Voldemort does a static node -> zone assignment based on configuration.
> > > > > > > > > > > >
> > > > > > > > > > > > Aditya
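For context, the pluggable locator being debated is roughly this shape, going only by what is described above; everything beyond getRackInfo() (the property name, the example implementation) is illustrative rather than taken from the KIP:

trait RackLocator {
  // Returns the complete broker id -> rack mapping known to the locator.
  def getRackInfo(): Map[Int, String]
}

// Example implementation reading a static mapping such as "1:rack-a,2:rack-b"
// from a broker property, comparable in spirit to Cassandra's property file snitch.
class PropertyFileRackLocator(props: java.util.Properties) extends RackLocator {
  override def getRackInfo(): Map[Int, String] =
    props.getProperty("rack.mapping", "")
      .split(",")
      .filter(_.contains(":"))
      .map { entry =>
        val Array(id, rack) = entry.split(":", 2)
        id.trim.toInt -> rack.trim
      }
      .toMap
}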
> > > > > > > > > > > > On Wed, Sep 30, 2015 at 10:05 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I would like to see if we can do both:
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Make RackLocator pluggable to facilitate migration with an existing broker-rack mapping
> > > > > > > > > > > > > - Make rack an optional property for the broker. If rack is available from the broker, treat it as the source of truth. Users with an existing broker-rack mapping somewhere else can use the pluggable way, or they can transfer the mapping to the broker rack property.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One thing I am not sure about is what happens during a rolling upgrade when we have rack as a broker property. For brokers with an older version of Kafka, will it cause problems for them? If so, is there any workaround? I also think it would be better not to have rack in the controller wire protocol, but I am not sure if that is achievable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Allen
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 28, 2015 at 4:55 PM, Todd Palino <tpal...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I tend to like the idea of a pluggable locator. For example, we already have an interface for discovering information about the physical location of servers. I don't relish the idea of having to maintain data in multiple places.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Todd
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Sep 28, 2015 at 4:48 PM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for starting this KIP, Allen.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with Gwen that having a RackLocator class that is pluggable seems to be too complex. The KIP refers to potentially non-ZK storage for the rack info, which I don't think is necessary.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Perhaps we can persist this info in ZK under /brokers/ids/<broker_id> similar to other broker properties and add a config in KafkaConfig called "rack":
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > {"jmx_port":-1,"endpoints":[...],"host":"xxx","port":yyy, "rack": "abc"}
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Aditya
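To make the suggestion concrete, the registration data would simply carry one extra optional field. A sketch of assembling such a payload by hand (the real registration code path is not shown, and the field name was still only a proposal):

// Sketch only: broker registration JSON with an optional "rack" field.
def registrationJson(host: String, port: Int, jmxPort: Int, rack: Option[String]): String = {
  val rackField = rack.map(r => ",\"rack\":\"" + r + "\"").getOrElse("")
  "{\"jmx_port\":" + jmxPort + ",\"endpoints\":[],\"host\":\"" + host + "\",\"port\":" + port + rackField + "}"
}

// registrationJson("xxx", 9092, -1, Some("abc")) produces
// {"jmx_port":-1,"endpoints":[],"host":"xxx","port":9092,"rack":"abc"}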
> > > > > > > > > > > > > > > On Mon, Sep 28, 2015 at 2:30 PM, Gwen Shapira <g...@confluent.io> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > First, thanks for putting out a KIP for this. This is super important for production deployments of Kafka.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > A few questions:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1) Are we sure we want "as many racks as possible"? I'd want to balance between safety (more racks) and network utilization (traffic within a rack uses the high-bandwidth TOR switch). One replica on a different rack and the rest on the same rack (if possible) sounds better to me.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2) The rack-locator class seems overly complex compared to adding a rack.number property to the broker properties file. Why do we want that?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Gwen
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Sep 28, 2015 at 12:15 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hello Kafka Developers,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I just created KIP-36 for rack aware replica assignment:
> > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The goal is to utilize the isolation provided by the racks in the data center and distribute replicas across racks to provide fault tolerance.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Comments are welcome.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Allen