For 2 and 3, I have updated the KIP. Please take a look. One thing I have changed is removing the proposal to add rack to TopicMetadataResponse. The reason is that, unlike UpdateMetadataRequest, TopicMetadataResponse is not versioned, so I don't see a way to include rack without breaking old versions of clients. That's probably why the secure protocol is not included in TopicMetadataResponse either. Including rack in TopicMetadataResponse would be a much bigger change.
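To illustrate why the versioning matters, here is a minimal sketch (hypothetical classes and field layout, not the actual Kafka code) of how a versioned wire format can carry an optional rack field without breaking old readers:

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for the broker entry serialized in
// UpdateMetadataRequest; rack is optional.
case class Broker(id: Int, host: String, port: Int, rack: Option[String])

object BrokerWireFormat {
  // Length-prefixed string; a length of -1 marks a null value,
  // following the convention used elsewhere in the Kafka protocol.
  private def writeNullableString(buf: ByteBuffer, s: Option[String]): Unit =
    s match {
      case Some(v) =>
        val bytes = v.getBytes(StandardCharsets.UTF_8)
        buf.putShort(bytes.length.toShort)
        buf.put(bytes)
      case None =>
        buf.putShort((-1).toShort)
    }

  def writeBrokerEntry(buf: ByteBuffer, broker: Broker, version: Short): Unit = {
    buf.putInt(broker.id)
    writeNullableString(buf, Some(broker.host))
    buf.putInt(broker.port)
    // The request carries a version in its header, so a v1 writer can
    // append rack while v0 readers never see the extra field. An
    // unversioned response like TopicMetadataResponse has no such gate,
    // so old clients would choke on the new bytes.
    if (version >= 1)
      writeNullableString(buf, broker.rack)
  }
}

The exact field layout for UpdateMetadataRequest is spelled out in the KIP; the point here is only that the version gate is what keeps the change backward compatible.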
For 1, my concern is that doing rack-aware assignment without a complete broker-to-rack mapping will result in an assignment that is not rack aware and fails to provide fault tolerance in the event of a rack outage. This kind of problem will be difficult to surface, and its cost is high: you either have to do a partition reassignment if you are lucky enough to spot the problem early, or face data loss during a real rack outage. I do see the concern with fail-fast, as it might also cause data loss if the producer is unable to produce messages due to a topic creation failure. Is it feasible to treat dynamic topic creation and the command line tools differently? We could allow dynamic topic creation with an incomplete broker-rack mapping but fail fast in the command line tools. Another option is to let the user determine the command line behavior: fail fast by default, but allow an incomplete broker-rack mapping if an extra switch is provided.

On Tue, Oct 20, 2015 at 10:05 AM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:

> Hey Allen,
>
> 1. If we choose fail-fast topic creation, we will have topic creation failures while upgrading the cluster. I really doubt we want this behavior. Ideally, this should be invisible to clients of a cluster. Currently, each broker is effectively its own rack, so we probably can use the rack information whenever possible but not make it a hard requirement. To extend Gwen's example, one badly configured broker should not degrade topic creation for the entire cluster.
>
> 2. Upgrade scenario - Can you add a section on the upgrade piece to confirm that old clients will not see errors? I believe ZookeeperConsumerConnector reads the Broker objects from ZK. I wanted to confirm that this will not cause any problems.
>
> 3. Could you elaborate on your proposed changes to the UpdateMetadataRequest in the "Public Interfaces" section? Personally, I find this format easy to read in terms of wire protocol changes:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-CreateTopicRequest
>
> Aditya
>
> On Fri, Oct 16, 2015 at 3:45 PM, Allen Wang <allenxw...@gmail.com> wrote:
>
> > The KIP is updated to include rack as an optional property for the broker. Please take a look and let me know if more details are needed.
> >
> > For the case where some brokers have rack and some do not, the current KIP uses the fail-fast behavior. If there are concerns, we can further discuss this in the email thread or at the next hangout.
> >
> > On Thu, Oct 15, 2015 at 10:42 AM, Allen Wang <allenxw...@gmail.com> wrote:
> >
> > > That's a good question. I can think of three actions if the rack information is incomplete:
> > >
> > > 1. Treat a node without rack as if it is on its own unique rack
> > > 2. Disregard all rack information and fall back to the current algorithm
> > > 3. Fail fast
> > >
> > > Now that I think about it, one and three make more sense. The reason for fail-fast is that a user's mistake of not providing the rack may never be found if we tolerate it: the assignment may not be rack aware as the user expected, and this creates debugging problems when things fail.
> > >
> > > What do you think? If not fail-fast, is there any way we can make the user error stand out?
> > >
> > > On Thu, Oct 15, 2015 at 10:17 AM, Gwen Shapira <g...@confluent.io> wrote:
> > >
> > >> Thanks!
> > >> Just to clarify, when some brokers have rack assignment and some don't, do we act like none of them have it? Or like those without assignment are in their own racks?
> > >>
> > >> The first scenario is good when first setting up rack awareness, but the second makes more sense for ongoing maintenance (I can totally see someone adding a node and forgetting to set the rack property; we don't want this to change behavior for anything except the new node).
> > >>
> > >> What do you think?
> > >>
> > >> Gwen
> > >>
> > >> On Thu, Oct 15, 2015 at 10:13 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > >>
> > >> > For scenario 1:
> > >> >
> > >> > - Add the rack information to the broker property file, or dynamically set it in the wrapper code that bootstraps the Kafka server. You would do that for all brokers and restart the brokers one by one.
> > >> >
> > >> > In this scenario, the complete broker-to-rack mapping may not be available until every broker is restarted. During that time we fall back to the default replica assignment algorithm.
> > >> >
> > >> > For scenario 2:
> > >> >
> > >> > - Add the rack information to the broker property file, or dynamically set it in the wrapper code, and start the broker.
> > >> >
> > >> > On Wed, Oct 14, 2015 at 2:36 PM, Gwen Shapira <g...@confluent.io> wrote:
> > >> >
> > >> > > Can you clarify the workflow for the following scenarios:
> > >> > >
> > >> > > 1. I currently have 6 brokers and want to add rack information for each.
> > >> > > 2. I'm adding a new broker and want to specify which rack it belongs on while adding it.
> > >> > >
> > >> > > Thanks!
> > >> > >
> > >> > > On Tue, Oct 13, 2015 at 2:21 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > >> > >
> > >> > > > We discussed the KIP in the hangout today. The recommendation is to make rack a broker property in ZooKeeper. Users with existing rack information stored somewhere else would need to retrieve it at broker startup and dynamically set the rack property, which can be implemented as a wrapper that bootstraps the broker. There will be no interface or pluggable implementation to retrieve the rack information.
> > >> > > >
> > >> > > > The assumption is that you always need to restart the broker to make a change to the rack.
> > >> > > >
> > >> > > > Once rack becomes a broker property, it will be possible to make it part of the metadata to help the consumer choose which in-sync replica to consume from, as part of the future consumer enhancement.
> > >> > > >
> > >> > > > I will update the KIP.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Allen
> > >> > > >
> > >> > > > On Thu, Oct 8, 2015 at 9:23 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > >> > > >
> > >> > > > > I attended Tuesday's KIP hangout but this KIP was not discussed due to time constraints.
> > >> > > > >
> > >> > > > > However, after hearing the discussion of KIP-35, I have the feeling that the incompatibility (caused by the new broker property) between brokers with different versions will be solved there.
> > >> > > > > In addition, having rack in the broker properties as metadata may also help consumers in the future. So I am open to adding a rack property to the broker.
> > >> > > > >
> > >> > > > > Hopefully we can discuss this in the next KIP hangout.
> > >> > > > >
> > >> > > > > On Wed, Sep 30, 2015 at 2:46 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > >> > > > >
> > >> > > > >> Can you send me the information on the next KIP hangout?
> > >> > > > >>
> > >> > > > >> Currently the broker-rack mapping is not cached. In KafkaApis, RackLocator.getRackInfo() is called each time the mapping is needed for auto topic creation. This ensures the latest mapping is used at any time.
> > >> > > > >>
> > >> > > > >> The ability to get the complete mapping makes it simple to reuse the same interface in command line tools.
> > >> > > > >>
> > >> > > > >> On Wed, Sep 30, 2015 at 11:01 AM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:
> > >> > > > >>
> > >> > > > >>> Perhaps we can discuss this during the next KIP hangout?
> > >> > > > >>>
> > >> > > > >>> I do see that a pluggable rack locator can be useful, but I have a few concerns:
> > >> > > > >>>
> > >> > > > >>> - The RackLocator (as described in the document) implies that it can discover rack information for any node in the cluster. How does it deal with rack location changes? For example, if I moved broker id (1) from rack X to Y, I only have to start that broker with a newer rack config. If RackLocator discovers broker -> rack information at startup time, any change to a broker will require bouncing the entire cluster, since createTopic requests can be sent to any node in the cluster. For this reason it may be simpler to have each node be aware of its own rack and persist it in ZK at startup time.
> > >> > > > >>>
> > >> > > > >>> - A pluggable RackLocator relies on an external service being available to serve rack information.
> > >> > > > >>>
> > >> > > > >>> Out of curiosity, I looked up how a couple of other systems deal with zone/rack awareness. For Cassandra, some interesting modes are:
> > >> > > > >>>
> > >> > > > >>> (Property file configuration)
> > >> > > > >>> http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchPFSnitch_t.html
> > >> > > > >>>
> > >> > > > >>> (Dynamic inference)
> > >> > > > >>> http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchRackInf_c.html
> > >> > > > >>>
> > >> > > > >>> Voldemort does a static node -> zone assignment based on configuration.
> > >> > > > >>>
> > >> > > > >>> Aditya
> > >> > > > >>>
> > >> > > > >>> On Wed, Sep 30, 2015 at 10:05 AM, Allen Wang <allenxw...@gmail.com> wrote:
> > >> > > > >>>
> > >> > > > >>> > I would like to see if we can do both:
> > >> > > > >>> >
> > >> > > > >>> > - Make RackLocator pluggable to facilitate migration with an existing broker-rack mapping.
> > >> > > > >>> >
> > >> > > > >>> > - Make rack an optional property for the broker. If rack is available from the broker, treat it as the source of truth. Users with an existing broker-rack mapping somewhere else can either use the pluggable way or transfer the mapping to the broker rack property.
> > >> > > > >>> >
> > >> > > > >>> > One thing I am not sure about is what happens at rolling upgrade when we have rack as a broker property. For brokers with an older version of Kafka, will it cause problems? If so, is there any workaround? I also think it would be better not to have rack in the controller wire protocol, but I am not sure if that is achievable.
> > >> > > > >>> >
> > >> > > > >>> > Thanks,
> > >> > > > >>> > Allen
> > >> > > > >>> >
> > >> > > > >>> > On Mon, Sep 28, 2015 at 4:55 PM, Todd Palino <tpal...@gmail.com> wrote:
> > >> > > > >>> >
> > >> > > > >>> > > I tend to like the idea of a pluggable locator. For example, we already have an interface for discovering information about the physical location of servers. I don't relish the idea of having to maintain data in multiple places.
> > >> > > > >>> > >
> > >> > > > >>> > > -Todd
> > >> > > > >>> > >
> > >> > > > >>> > > On Mon, Sep 28, 2015 at 4:48 PM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:
> > >> > > > >>> > >
> > >> > > > >>> > > > Thanks for starting this KIP, Allen.
> > >> > > > >>> > > >
> > >> > > > >>> > > > I agree with Gwen that having a pluggable RackLocator class seems too complex. The KIP refers to potentially non-ZK storage for the rack info, which I don't think is necessary.
> > >> > > > >>> > > >
> > >> > > > >>> > > > Perhaps we can persist this info in ZK under /brokers/ids/<broker_id>, similar to other broker properties, and add a config in KafkaConfig called "rack":
> > >> > > > >>> > > > {"jmx_port":-1,"endpoints":[...],"host":"xxx","port":yyy, "rack": "abc"}
> > >> > > > >>> > > >
> > >> > > > >>> > > > Aditya
> > >> > > > >>> > > >
> > >> > > > >>> > > > On Mon, Sep 28, 2015 at 2:30 PM, Gwen Shapira <g...@confluent.io> wrote:
> > >> > > > >>> > > >
> > >> > > > >>> > > > > Hi,
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > First, thanks for putting out a KIP for this.
> > >> > > > >>> > > > > This is super important for production deployments of Kafka.
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > A few questions:
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > 1) Are we sure we want "as many racks as possible"? I'd want to balance between safety (more racks) and network utilization (traffic within a rack uses the high-bandwidth TOR switch). One replica on a different rack and the rest on the same rack (if possible) sounds better to me.
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > 2) The rack-locator class seems overly complex compared to adding a rack.number property to the broker properties file. Why do we want that?
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > Gwen
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > On Mon, Sep 28, 2015 at 12:15 PM, Allen Wang <allenxw...@gmail.com> wrote:
> > >> > > > >>> > > > >
> > >> > > > >>> > > > > > Hello Kafka Developers,
> > >> > > > >>> > > > > >
> > >> > > > >>> > > > > > I just created KIP-36 for rack aware replica assignment.
> > >> > > > >>> > > > > >
> > >> > > > >>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment
> > >> > > > >>> > > > > >
> > >> > > > >>> > > > > > The goal is to utilize the isolation provided by the racks in a data center and distribute replicas across racks to provide fault tolerance.
> > >> > > > >>> > > > > >
> > >> > > > >>> > > > > > Comments are welcome.
> > >> > > > >>> > > > > >
> > >> > > > >>> > > > > > Thanks,
> > >> > > > >>> > > > > > Allen