Apologies for the delay in response here. It will take a bit of tracking inside the request object to track this error and then handle it in KafkaApis, but it is possible. I am happy to take that preferred approach. I will update the wiki & patch to handle this scenario and re-initiate the vote tomorrow.
Thanks, Grant On Sun, Jun 19, 2016 at 8:59 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote: > I'm on the same page as Jun & Dana wrt disconnecting. Closing a connection > should really be a last resort because we can no longer trust correct > behavior in this session. In this case, we detect a bad request, but > there's no reason to believe it will affect subsequent requests. There are > dependencies to be sure, and if the client doesn't check errors, they may > try to then write to topics that don't exist or something along those > lines, but those requests can also be failed without killing the underlying > TCP connection. > > -Ewen > > On Fri, Jun 17, 2016 at 1:46 PM, Jun Rao <j...@confluent.io> wrote: > > > Grant, > > > > I think Dana has a valid point. Currently, we throw an > > InvalidRequestException and close the connection only when the broker > can't > > deserialize the bytes into a request. In this case, the deserialization > is > > fine. It just that there are some additional constraints that can't be > > specified at the protocol level. We can potentially just remember the > > topics that violated those constraints in the request and handle them > > accordingly with the right error code w/o disconnecting. > > > > Thanks, > > > > Jun > > > > On Fri, Jun 17, 2016 at 8:40 AM, Dana Powers <dana.pow...@gmail.com> > > wrote: > > > > > I'm unconvinced (crazy, right?). Comments below: > > > > > > On Fri, Jun 17, 2016 at 7:27 AM, Grant Henke <ghe...@cloudera.com> > > wrote: > > > > Hi Dana, > > > > > > > > You mentioned one of the reasons I error and disconnect. Because I > > can't > > > > return an error for every request so the cardinality between request > > and > > > > response would be different. Beyond that though, I am handling this > > > > protocol rule/parsing error the same way all other messages do. > > > > > > But you can return an error for every topic, and isn't that the level > > > of error required here? > > > > > > > CreateTopic Response (Version: 0) => [topic_error_codes] > > > > topic_error_codes => topic error_code > > > > topic => STRING > > > > error_code => INT16 > > > > > > If I submit duplicate requests for a topic, it's an error isolated to > > > that topic. If I mess up the partition / replication / etc semantics > > > for a topic, that's an error isolated to that topic. Is there a > > > cardinality problem at this level? > > > > > > > > > > > > > > Parsing is handled in the RequestChannel and any exception that > occurs > > > > during this phase is caught, converted into an > InvalidRequestException > > > and > > > > results in a disconnect: > > > > > > > > > > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L92-L95 > > > > > > > > Since this is an error that could only occur (and would always occur) > > due > > > > to incorrect client implementations, and not because of any cluster > > state > > > > or unusual situation, I felt this behavior was okay and made sense. > For > > > > client developers the broker logging should make it obvious what the > > > issue > > > > is. My patch also clearly documents the protocol rules in the > Protocol > > > > definition. > > > > > > Documentation is great and definitely a must. But requiring client > > > developers to dig through server logs is not ideal. Client developers > > > don't always have direct access to those logs. And even if they do, > > > the brokers may have other traffic, which makes it difficult to track > > > down the exact point in the logs where the error occurred. > > > > > > As discussed above, I don't think you need to or should model this as > > > a request-level parsing error. It may be easier for the current broker > > > implementation to do that and just crash the connection, but I don't > > > think it makes that much sense from a raw api perspective. > > > > > > > In the future having a response header with an error code (and > > optimally > > > > error message) for every response would help solve this problem (for > > all > > > > message types). > > > > > > That will definitely help solve the more general invalid request error > > > problem. But I think given the current state of error handling / > > > feedback from brokers on request-level errors, you should treat > > > connection crash as a last resort. I think there is a good opportunity > > > to avoid it in this case, and I think the api would be better if done > > > that way. > > > > > > -Dana > > > > > > > On Fri, Jun 17, 2016 at 12:04 AM, Dana Powers <dana.pow...@gmail.com > > > > > wrote: > > > > > > > >> Why disconnect the client on a InvalidRequestException? The 2 errors > > > >> you are catching are both topic-level: (1) multiple requests for the > > > >> same topic, and (2) ReplicaAssignment and num_partitions / > > > >> replication_factor both set. Wouldn't it be better to just error the > > > >> offending create_topic_request, not the entire connection? The > > > >> CreateTopicsResponse returns a map of topics to error codes. You > could > > > >> just return the topic that caused the error and an > > > >> InvalidRequestException error code. > > > >> > > > >> -Dana > > > >> > > > >> On Thu, Jun 16, 2016 at 8:37 AM, Grant Henke <ghe...@cloudera.com> > > > wrote: > > > >> > I have updated the wiki and pull request based on the feedback. If > > > there > > > >> > are no objections I will start a vote at the end of the day. > > > >> > > > > >> > Details for this implementation can be read here: > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-CreateTopicRequest > > > >> > > > > >> > The updated pull request can be found here (feel free to review): > > > >> > https://github.com/apache/kafka/pull/1489 > > > >> > > > > >> > Below is the exact content for clarity: > > > >> > > > > >> >> Create Topics Request (KAFKA-2945 > > > >> >> <https://issues.apache.org/jira/browse/KAFKA-2945>) > > > >> >> > > > >> >> > > > >> >> > > > >> >> CreateTopics Request (Version: 0) => [create_topic_requests] > > timeout > > > >> >> create_topic_requests => topic num_partitions > replication_factor > > > >> [replica_assignment] [configs] > > > >> >> topic => STRING > > > >> >> num_partitions => INT32 > > > >> >> replication_factor => INT16 > > > >> >> replica_assignment => partition_id [replicas] > > > >> >> partition_id => INT32 > > > >> >> replicas => INT32 > > > >> >> configs => config_key config_value > > > >> >> config_key => STRING > > > >> >> config_value => STRING > > > >> >> timeout => INT32 > > > >> >> > > > >> >> CreateTopicsRequest is a batch request to initiate topic creation > > > with > > > >> >> either predefined or automatic replica assignment and optionally > > > topic > > > >> >> configuration. > > > >> >> > > > >> >> Request semantics: > > > >> >> > > > >> >> 1. Must be sent to the controller broker > > > >> >> 2. If there are multiple instructions for the same topic in > one > > > >> >> request an InvalidRequestException will be logged on the > broker > > > and > > > >> the > > > >> >> client will be disconnected. > > > >> >> - This is because the list of topics is modeled server side > > as > > > a > > > >> >> map with TopicName as the key > > > >> >> 3. The principal must be authorized to the "Create" Operation > on > > > the > > > >> >> "Cluster" resource to create topics. > > > >> >> - Unauthorized requests will receive a > > > >> ClusterAuthorizationException > > > >> >> 4. > > > >> >> > > > >> >> Only one from ReplicaAssignment or (num_partitions + > > > >> replication_factor > > > >> >> ), can be defined in one instruction. > > > >> >> - If both parameters are specified an InvalidRequestException > > > will be > > > >> >> logged on the broker and the client will be disconnected. > > > >> >> - In the case ReplicaAssignment is defined number of > > partitions > > > >> and > > > >> >> replicas will be calculated from the supplied > > > replica_assignment. > > > >> >> - In the case of defined (num_partitions + > > replication_factor) > > > >> >> replica assignment will be automatically generated by the > > > server. > > > >> >> - One or the other must be defined. The existing broker > side > > > auto > > > >> >> create defaults will not be used > > > >> >> (default.replication.factor, num.partitions). The client > > > >> implementation can > > > >> >> have defaults for these options when generating the > messages. > > > >> >> - The first replica in [replicas] is assumed to be the > > > preferred > > > >> >> leader. This matches current behavior elsewhere. > > > >> >> 5. Setting a timeout > 0 will allow the request to block until > > the > > > >> >> topic metadata is "complete" on the controller node. > > > >> >> - Complete means the local topic metadata cache been > > completely > > > >> >> populated and all partitions have leaders > > > >> >> - The topic metadata is updated when the controller > sends > > > out > > > >> >> update metadata requests to the brokers > > > >> >> - If a timeout error occurs, the topic could still be > created > > > >> >> successfully at a later time. Its up to the client to query > > for > > > >> the state > > > >> >> at that point. > > > >> >> 6. Setting a timeout <= 0 will validate arguments and trigger > > the > > > >> >> create topics and return immediately. > > > >> >> - This is essentially the fully asynchronous mode we have > in > > > the > > > >> >> Zookeeper tools today. > > > >> >> - The error code in the response will either contain an > > > argument > > > >> >> validation exception or a timeout exception. If you > receive a > > > >> timeout > > > >> >> exception, because you asked for 0 timeout, you can assume > > the > > > >> message was > > > >> >> valid and the topic creation was triggered. > > > >> >> 7. The request is not transactional. > > > >> >> 1. If an error occurs on one topic, the others could still > be > > > >> >> created. > > > >> >> 2. Errors are reported independently. > > > >> >> > > > >> >> QA: > > > >> >> > > > >> >> - Why is CreateTopicsRequest a batch request? > > > >> >> - Scenarios where tools or admins want to create many > topics > > > >> should > > > >> >> be able to with fewer requests > > > >> >> - Example: MirrorMaker may want to create the topics > > downstream > > > >> >> - What happens if some topics error immediately? Will it > > > >> >> return immediately? > > > >> >> - The request will block until all topics have either been > > > >> created, > > > >> >> errors, or the timeout has been hit > > > >> >> - There is no "short circuiting" where 1 error stops the > > other > > > >> >> topics from being created > > > >> >> - Why implement "partial blocking" instead of fully async or > > fully > > > >> >> consistent? > > > >> >> - See Cluster Consistent Blocking > > > >> >> < > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-cluster-consistent-blocking > > > >> > > > > >> >> below > > > >> >> - Why require the request to go to the controller? > > > >> >> - The controller is responsible for the cluster metadata > and > > > >> >> its propagation > > > >> >> - See Request Forwarding > > > >> >> < > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-request > > > >> > > > > >> >> below > > > >> >> > > > >> >> Create Topics Response > > > >> >> > > > >> >> > > > >> >> > > > >> >> CreateTopics Response (Version: 0) => [topic_error_codes] > > > >> >> topic_error_codes => topic error_code > > > >> >> topic => STRING > > > >> >> error_code => INT16 > > > >> >> > > > >> >> CreateTopicsResponse contains a map between topic and topic > > creation > > > >> >> result error code (see New Protocol Errors > > > >> >> < > > > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-NewProtocolErrors > > > >> > > > > >> >> ). > > > >> >> > > > >> >> > > > >> > Thank you, > > > >> > Grant > > > >> > > > > >> > > > > >> > On Wed, Jun 15, 2016 at 4:11 PM, Grant Henke <ghe...@cloudera.com > > > > > >> wrote: > > > >> > > > > >> >> Turns out we already have an InvalidRequestException: > > > >> >> > > > >> > > > > > > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L75-L98 > > > >> >> > > > >> >> We just don't map it in Errors.java so it results in an UNKNOWN > > error > > > >> when > > > >> >> sent back to the client. > > > >> >> > > > >> >> I will migrate the InvalidRequestException to the client package, > > > add it > > > >> >> to Errors and use that to signify any protocol parsing/rule > errors. > > > >> >> > > > >> >> > > > >> >> > > > >> >> On Wed, Jun 15, 2016 at 2:55 PM, Dana Powers < > > dana.pow...@gmail.com> > > > >> >> wrote: > > > >> >> > > > >> >>> On Wed, Jun 15, 2016 at 12:19 PM, Ismael Juma < > ism...@juma.me.uk> > > > >> wrote: > > > >> >>> > Hi Grant, > > > >> >>> > > > > >> >>> > Comments below. > > > >> >>> > > > > >> >>> > On Wed, Jun 15, 2016 at 6:52 PM, Grant Henke < > > ghe...@cloudera.com > > > > > > > >> >>> wrote: > > > >> >>> >> > > > >> >>> >> The one thing I want to avoid is to many super specific error > > > >> codes. I > > > >> >>> am > > > >> >>> >> not sure how much of a problem it really is but in the case > of > > > wire > > > >> >>> >> protocol errors like multiple instances of the same topic, do > > you > > > >> have > > > >> >>> any > > > >> >>> >> thoughts on the error? Should we make a generic > InvalidRequest > > > error > > > >> >>> and > > > >> >>> >> log the detailed message on the broker for client authors to > > > debug? > > > >> >>> >> > > > >> >>> > > > > >> >>> > That is a good question. It would be good to get input from > > client > > > >> >>> > developers like Dana on this. > > > >> >>> > > > >> >>> I think generic error codes are fine if the wire protocol > > > requirements > > > >> >>> are documented [i.e., no duplicate topics and > partitions/replicas > > > are > > > >> >>> either/or not both]. If I get a broker error at the protocol > level > > > >> >>> that I don't understand, the first place I look is the protocol > > > docs. > > > >> >>> It may cause a few more emails to the mailing lists asking for > > > >> >>> clarification, but I think those will be easier to triage than > > > >> >>> confused emails like "I said create topic with 10 partitions, > but > > I > > > >> >>> only got 5???" > > > >> >>> > > > >> >>> -Dana > > > >> >>> > > > >> >> > > > >> >> > > > >> >> > > > >> >> -- > > > >> >> Grant Henke > > > >> >> Software Engineer | Cloudera > > > >> >> gr...@cloudera.com | twitter.com/gchenke | > > > linkedin.com/in/granthenke > > > >> >> > > > >> > > > > >> > > > > >> > > > > >> > -- > > > >> > Grant Henke > > > >> > Software Engineer | Cloudera > > > >> > gr...@cloudera.com | twitter.com/gchenke | > > linkedin.com/in/granthenke > > > >> > > > > > > > > > > > > > > > > -- > > > > Grant Henke > > > > Software Engineer | Cloudera > > > > gr...@cloudera.com | twitter.com/gchenke | > linkedin.com/in/granthenke > > > > > > > > > -- > Thanks, > Ewen > -- Grant Henke Software Engineer | Cloudera gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke