[jira] [Resolved] (KAFKA-7215) Improve LogCleaner behavior on error
[ https://issues.apache.org/jira/browse/KAFKA-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7215. Resolution: Fixed Fix Version/s: 2.1.0 Merged to 2.1 and trunk. > Improve LogCleaner behavior on error > > > Key: KAFKA-7215 > URL: https://issues.apache.org/jira/browse/KAFKA-7215 > Project: Kafka > Issue Type: Improvement >Reporter: Stanislav Kozlovski >Assignee: Stanislav Kozlovski >Priority: Minor > Fix For: 2.1.0 > > > For more detailed information see > [KIP-346|https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7366) topic level segment.bytes and segment.ms not taking effect immediately
[ https://issues.apache.org/jira/browse/KAFKA-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7366. Resolution: Fixed Fix Version/s: 2.1.0 Merged to 2.1 and trunk. > topic level segment.bytes and segment.ms not taking effect immediately > -- > > Key: KAFKA-7366 > URL: https://issues.apache.org/jira/browse/KAFKA-7366 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.0, 2.0.0 >Reporter: Jun Rao >Assignee: Manikumar >Priority: Major > Fix For: 2.1.0 > > > It used to be that topic level configs such as segment.bytes take effect > immediately. Because of KAFKA-6324 in 1.1, those configs now only take effect > after the active segment has rolled. The relevant part of KAFKA-6324 is that > in Log.maybeRoll, the checking of the segment rolling is moved to > LogSegment.shouldRoll(). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-3097) Acls for PrincipalType User are case sensitive
[ https://issues.apache.org/jira/browse/KAFKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-3097. Resolution: Fixed Assignee: Manikumar (was: Ashish Singh) Fix Version/s: 2.1.0 Merged to 2.1 and trunk. > Acls for PrincipalType User are case sensitive > -- > > Key: KAFKA-3097 > URL: https://issues.apache.org/jira/browse/KAFKA-3097 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.9.0.0 >Reporter: Thomas Graves >Assignee: Manikumar >Priority: Major > Fix For: 2.1.0 > > > I gave a user acls for READ/WRITE but when I went to actually write to the > topic, it failed with an auth exception. I figured out it was due to me specifying > the user as: user:tgraves rather than User:tgraves. > Seems like it should either fail on assign or be case insensitive. > The principal type of User should also probably be documented. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7482) LeaderAndIsrRequest should be sent to the shutting down broker
[ https://issues.apache.org/jira/browse/KAFKA-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7482. Resolution: Fixed Fix Version/s: 2.1.0 merged to 2.1 and trunk > LeaderAndIsrRequest should be sent to the shutting down broker > -- > > Key: KAFKA-7482 > URL: https://issues.apache.org/jira/browse/KAFKA-7482 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.0, 2.0.0 >Reporter: Jun Rao >Assignee: Jun Rao >Priority: Major > Fix For: 2.1.0 > > > We introduced a regression in KAFKA-5642 in 1.1. Before 1.1, during a > controlled shutdown, the LeaderAndIsrRequest is sent to the shutting down > broker to inform it that it's no longer the leader for partitions whose > leaders have been moved. After 1.1, such a LeaderAndIsrRequest is no longer sent > to the shutting down broker. This can delay the time it takes for the client to find > out the new leader. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7557) optimize LogManager.truncateFullyAndStartAt()
Jun Rao created KAFKA-7557: -- Summary: optimize LogManager.truncateFullyAndStartAt() Key: KAFKA-7557 URL: https://issues.apache.org/jira/browse/KAFKA-7557 Project: Kafka Issue Type: Bug Affects Versions: 2.0.0, 2.1.0 Reporter: Jun Rao When a ReplicaFetcherThread calls LogManager.truncateFullyAndStartAt() for a partition, we call LogManager.checkpointLogRecoveryOffsetsInDir() and then Log.deleteSnapshotsAfterRecoveryPointCheckpoint() on all the logs in that directory. This requires listing all the files in each log dir to figure out the snapshot files. If some logs have many log segment files, this could take some time. This can potentially block a replica fetcher thread, which indirectly causes the request handler threads to be blocked. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7537) Only include live brokers in the UpdateMetadataRequest sent to existing brokers if there is no change in the partition states
[ https://issues.apache.org/jira/browse/KAFKA-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7537. Resolution: Fixed Fix Version/s: 2.2.0 Merged the PR to trunk. > Only include live brokers in the UpdateMetadataRequest sent to existing > brokers if there is no change in the partition states > - > > Key: KAFKA-7537 > URL: https://issues.apache.org/jira/browse/KAFKA-7537 > Project: Kafka > Issue Type: Improvement > Components: controller >Reporter: Zhanxiang (Patrick) Huang >Assignee: Zhanxiang (Patrick) Huang >Priority: Major > Fix For: 2.2.0 > > > Currently, when brokers join/leave the cluster without any partition state > changes, the controller will send out UpdateMetadataRequests containing the > states of all partitions to all brokers. But for existing brokers in the > cluster, the metadata diff between the controller and the broker should only be > the "live_brokers" info. Only the brokers with an empty metadata cache need the > full UpdateMetadataRequest. Sending the full UpdateMetadataRequest to all > brokers can place non-negligible memory pressure on the controller side. > Let's say in total we have N brokers, M partitions in the cluster and we want > to add 1 brand new broker to the cluster. With RF=2, the memory footprint per > partition in the UpdateMetadataRequest is ~200 Bytes. In the current > controller implementation, if each of the N RequestSendThreads serializes and > sends out the UpdateMetadataRequest at roughly the same time (which is very > likely the case), we will end up using *(N+1)*M*200B*. In a large Kafka > cluster, we can have: > {noformat} > N=99 > M=100k > Memory usage to send out UpdateMetadataRequest to all brokers: > 100 * 100K * 200B = 2G > However, we only need to send out the full UpdateMetadataRequest to the newly > added broker. We only need to include live broker ids (4B * 100 brokers) in > the UpdateMetadataRequest sent to the existing 99 brokers. So the amount of > data that is actually needed will be: > 1 * 100K * 200B + 99 * (100 * 4B) = ~21M > We can potentially reduce the memory footprint by 2G / 21M = ~95x, as well as > the data transferred in the network.{noformat} > > This issue kind of hurts the scalability of a Kafka cluster. KIP-380 and > KAFKA-7186 also help to further reduce the controller memory footprint. > > In terms of implementation, we can keep some in-memory state on the > controller side to differentiate existing brokers and uninitialized brokers > (e.g. brand new brokers) so that if there is no change in partition states, > we only send out the live brokers info to existing brokers. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7412) Bug prone response from producer.send(ProducerRecord, Callback) if Kafka broker is not running
[ https://issues.apache.org/jira/browse/KAFKA-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7412. Resolution: Fixed Fix Version/s: 2.2.0 Clarified the javadoc in producer callback. Metadata is not null with non-null exception. Merged the PR to trunk. > Bug prone response from producer.send(ProducerRecord, Callback) if Kafka > broker is not running > -- > > Key: KAFKA-7412 > URL: https://issues.apache.org/jira/browse/KAFKA-7412 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 2.0.0 >Reporter: Michal Turek >Assignee: huxihx >Priority: Major > Fix For: 2.2.0 > > Attachments: both_metadata_and_exception.png, > metadata_when_kafka_is_stopped.png > > > Hi there, I have probably found a bug in the Java Kafka producer client. > Scenario & current behavior: > - Start Kafka broker, single instance. > - Start an application that produces messages to Kafka. > - Let the application load partitions for a topic to warm up the producer, > e.g. send a message to Kafka. I'm not sure if this is a necessary step, but our > code does it. > - Gracefully stop the Kafka broker. > - Application logs now contain "org.apache.kafka.clients.NetworkClient: > [Producer clientId=...] Connection to node 0 could not be established. Broker > may not be available." so the client is aware of the Kafka unavailability. > - Trigger the producer to send a message using the > KafkaProducer.send(ProducerRecord, Callback) method. > - The callback that notifies business code receives non-null RecordMetadata > and null Exception after request.timeout.ms. The metadata contains offset -1, > which is the value of ProduceResponse.INVALID_OFFSET. > Expected behavior: > - If Kafka is not running and the message is not appended to the log, the > callback should contain null RecordMetadata and a non-null Exception. At least > I subjectively understand the Javadoc this way, "exception on production > error" in simple words. > - A developer that is not aware of this behavior and doesn't test for > offset -1 may consider the message successfully sent and properly acked > by the broker. > Known workaround: > - Together with checking for a non-null exception in the callback, add another > condition for ProduceResponse.INVALID_OFFSET. > {noformat} > try { > producer.send(record, (metadata, exception) -> { > if (metadata != null) { > if (metadata.offset() != > ProduceResponse.INVALID_OFFSET) { > // Success > } else { > // Failure > } > } else { > // Failure > } > }); > } catch (Exception e) { > // Failure > } > {noformat} > Used setup > - Latest Kafka 2.0.0 for both broker and Java client. > - Originally found with broker 0.11.0.1 and client 2.0.0. > - Code is analogous to the one in the Javadoc of KafkaProducer.send(). > - Used producer configuration (others use defaults). > {noformat} > bootstrap.servers = "localhost:9092" > client.id = "..." > acks = "all" > retries = 1 > linger.ms = "20" > compression.type = "lz4" > request.timeout.ms = 5000 # The same behavior occurs with the default; this is to > speed up the tests > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
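For reference, a minimal sketch of how such a callback can be written against the public KafkaProducer API now that the javadoc is clarified; the topic name and configuration values below are placeholders, not taken from the issue:

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class CallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // Failure: per the clarified javadoc, metadata may still be non-null here,
                    // but only its topic/partition fields are meaningful (offset is -1).
                    System.err.println("Send failed: " + exception);
                } else {
                    // Success: the record was acked by the broker.
                    System.out.println("Acked at offset " + metadata.offset());
                }
            });
        }
    }
}
{code}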
[jira] [Resolved] (KAFKA-7557) optimize LogManager.truncateFullyAndStartAt()
[ https://issues.apache.org/jira/browse/KAFKA-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7557. Resolution: Fixed Fix Version/s: 2.2.0 merged to trunk > optimize LogManager.truncateFullyAndStartAt() > - > > Key: KAFKA-7557 > URL: https://issues.apache.org/jira/browse/KAFKA-7557 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.0.0, 2.1.0 >Reporter: Jun Rao >Assignee: huxihx >Priority: Major > Fix For: 2.2.0 > > > When a ReplicaFetcherThread calls LogManager.truncateFullyAndStartAt() for a > partition, we call LogManager.checkpointLogRecoveryOffsetsInDir() and then > Log.deleteSnapshotsAfterRecoveryPointCheckpoint() on all the logs in that > directory. This requires listing all the files in each log dir to figure out > the snapshot files. If some logs have many log segment files, this could take > some time. This can potentially block a replica fetcher thread, which > indirectly causes the request handler threads to be blocked. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7680) fetching a refilled chunk of log can cause log divergence
Jun Rao created KAFKA-7680: -- Summary: fetching a refilled chunk of log can cause log divergence Key: KAFKA-7680 URL: https://issues.apache.org/jira/browse/KAFKA-7680 Project: Kafka Issue Type: Bug Reporter: Jun Rao We use FileRecords.writeTo to send a fetch response for a follower. A log could be truncated and refilled in the middle of the send process (due to leader change). Then it's possible for the follower to append some uncommitted messages followed by committed messages. Those uncommitted messages may never be removed, causing log divergence. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7681) new metric for request thread utilization by request type
Jun Rao created KAFKA-7681: -- Summary: new metric for request thread utilization by request type Key: KAFKA-7681 URL: https://issues.apache.org/jira/browse/KAFKA-7681 Project: Kafka Issue Type: Improvement Reporter: Jun Rao When the request thread pool is saturated, it's often useful to know which request type is using the thread pool the most. It would be useful to add a metric that tracks the fraction of request thread pool usage by request type. This would be equivalent to (request rate) * (request local time ms) / 1000, but will be more direct. This would require a new KIP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
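As a rough back-of-the-envelope illustration of that formula (the numbers below are invented for the example and are not from any benchmark):

{code:java}
public class RequestThreadUtilizationExample {
    public static void main(String[] args) {
        // Assumed sample values for a single request type, e.g. FetchFollower.
        double requestsPerSec = 400.0;   // request rate
        double localTimeMs = 5.0;        // average request local time in ms
        int numRequestThreads = 8;       // size of the request handler pool (num.io.threads)

        // (request rate) * (request local time ms) / 1000 = thread-seconds consumed per second,
        // i.e. the equivalent number of request threads kept busy by this request type.
        double busyThreads = requestsPerSec * localTimeMs / 1000.0;   // 2.0 threads
        double poolFraction = busyThreads / numRequestThreads;        // 0.25 of the pool

        System.out.printf("busy threads = %.2f, fraction of pool = %.2f%n", busyThreads, poolFraction);
    }
}
{code}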
[jira] [Created] (KAFKA-7682) turning on request logging for a subset of request types
Jun Rao created KAFKA-7682: -- Summary: turning on request logging for a subset of request types Key: KAFKA-7682 URL: https://issues.apache.org/jira/browse/KAFKA-7682 Project: Kafka Issue Type: Improvement Reporter: Jun Rao Turning on request level logging can be useful for debugging. However, the request logging can be quite verbose. It would be useful to turn it on for a subset of the request types. We already have a jmx bean to turn on/off the request logging dynamically. We could add a new jmx bean to control the request types to be logged. This requires a KIP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7449) Kafka console consumer is not sending topic to deserializer
[ https://issues.apache.org/jira/browse/KAFKA-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7449. Resolution: Fixed Assignee: Mathieu Chataigner Fix Version/s: 2.2.0 Merged to trunk. > Kafka console consumer is not sending topic to deserializer > --- > > Key: KAFKA-7449 > URL: https://issues.apache.org/jira/browse/KAFKA-7449 > Project: Kafka > Issue Type: Bug > Components: core >Reporter: Mathieu Chataigner >Assignee: Mathieu Chataigner >Priority: Major > Labels: easyfix, pull-request-available > Fix For: 2.2.0 > > > We tried to create a custom Deserializer to consume some protobuf topics. > We have a mechanism for getting the protobuf class from the topic name; however, > the console consumer is not forwarding the topic of the consumer > record down to the deserializer. > Topic information is available in the ConsumerRecord. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-1120) Controller could miss a broker state change
[ https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-1120. Resolution: Fixed Assignee: Zhanxiang (Patrick) Huang (was: Mickael Maison) This is fixed by KAFKA-7235. > Controller could miss a broker state change > > > Key: KAFKA-1120 > URL: https://issues.apache.org/jira/browse/KAFKA-1120 > Project: Kafka > Issue Type: Sub-task > Components: core >Affects Versions: 0.8.1 >Reporter: Jun Rao >Assignee: Zhanxiang (Patrick) Huang >Priority: Major > Labels: reliability > Fix For: 2.2.0 > > > When the controller is in the middle of processing a task (e.g., preferred > leader election, broker change), it holds a controller lock. During this > time, a broker could have de-registered and re-registered itself in ZK. After > the controller finishes processing the current task, it will start processing > the logic in the broker change listener. However, it will see no broker > change and therefore won't do anything to the restarted broker. This broker > will be in a weird state since the controller doesn't inform it to become the > leader of any partition. Yet, the cached metadata in other brokers could > still list that broker as the leader for some partitions. Client requests > routed to that broker will then get a TopicOrPartitionNotExistException. This > broker will continue to be in this bad state until it's restarted again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7235) Use brokerZkNodeVersion to prevent broker from processing outdated controller request
[ https://issues.apache.org/jira/browse/KAFKA-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7235. Resolution: Fixed Fix Version/s: 2.2.0 merged to trunk. > Use brokerZkNodeVersion to prevent broker from processing outdated controller > request > - > > Key: KAFKA-7235 > URL: https://issues.apache.org/jira/browse/KAFKA-7235 > Project: Kafka > Issue Type: Improvement >Reporter: Dong Lin >Assignee: Zhanxiang (Patrick) Huang >Priority: Major > Fix For: 2.2.0 > > > Currently a broker can process controller requests that are sent before the > broker is restarted. This could cause a few problems. Here is one example: > Let's assume partitions p1 and p2 exist on broker1. > 1) Controller generates LeaderAndIsrRequest with p1 to be sent to broker1. > 2) Before the controller sends the request, broker1 is quickly restarted. > 3) The LeaderAndIsrRequest with p1 is delivered to broker1. > 4) After processing the first LeaderAndIsrRequest, broker1 starts to > checkpoint the high watermark for all partitions that it owns. Thus it may > overwrite the high watermark checkpoint file with only the hw for partition p1. > The hw for partition p2 is now lost, which could be a problem. > In general, the correctness of broker logic currently relies on a few > assumptions, e.g. the first LeaderAndIsrRequest received by the broker should > contain all partitions hosted by the broker, which could break if the broker can > receive controller requests that were generated before it restarts. > One reasonable solution to the problem is to include the > expectedBrokerNodeZkVersion in the controller requests. The broker should remember > its broker znode zkVersion after it registers itself in ZooKeeper. Then the > broker can reject those controller requests whose expectedBrokerNodeZkVersion > is different from its broker znode zkVersion. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7704) kafka.server.ReplicaFetcherManager.MaxLag.Replica metric is reported incorrectly
[ https://issues.apache.org/jira/browse/KAFKA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7704. Resolution: Fixed Fix Version/s: 2.1.1 2.2.0 Merged to trunk and 2.1. > kafka.server.ReplicaFetcherManager.MaxLag.Replica metric is reported > incorrectly > --- > > Key: KAFKA-7704 > URL: https://issues.apache.org/jira/browse/KAFKA-7704 > Project: Kafka > Issue Type: Bug > Components: metrics >Affects Versions: 2.1.0 >Reporter: Yu Yang >Assignee: huxihx >Priority: Major > Fix For: 2.2.0, 2.1.1 > > Attachments: Screen Shot 2018-12-03 at 4.33.35 PM.png, Screen Shot > 2018-12-05 at 10.13.09 PM.png > > > We recently deployed Kafka 2.1, and noticed a jump in the > kafka.server.ReplicaFetcherManager.MaxLag.Replica metric. At the same time, > there are no under-replicated partitions for the cluster. > The initial analysis shows that Kafka 2.1.0 does not report the metric correctly > for topics that have no incoming traffic right now, but had traffic earlier. > For those topics, ReplicaFetcherManager will consider the maxLag to be the > latest offset. > For instance, we have a topic named `test_topic`: > {code} > [root@kafkabroker03002:/mnt/kafka/test_topic-0]# ls -l > total 8 > -rw-rw-r-- 1 kafka kafka 10485760 Dec 4 00:13 099043947579.index > -rw-rw-r-- 1 kafka kafka 0 Sep 23 03:01 099043947579.log > -rw-rw-r-- 1 kafka kafka 10 Dec 4 00:13 099043947579.snapshot > -rw-rw-r-- 1 kafka kafka 10485756 Dec 4 00:13 099043947579.timeindex > -rw-rw-r-- 1 kafka kafka 4 Dec 4 00:13 leader-epoch-checkpoint > {code} > Kafka reports ReplicaFetcherManager.MaxLag.Replica to be 99043947579 > !Screen Shot 2018-12-03 at 4.33.35 PM.png|width=720px! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7829) Inaccurate description for --reassignment-json-file option in ReassignPartitionsCommand
Jun Rao created KAFKA-7829: -- Summary: Inaccurate description for --reassignment-json-file option in ReassignPartitionsCommand Key: KAFKA-7829 URL: https://issues.apache.org/jira/browse/KAFKA-7829 Project: Kafka Issue Type: Improvement Reporter: Jun Rao In ReassignPartitionsCommand, the --reassignment-json-file option has the following description. This seems inaccurate, since we support moving existing replicas to new log dirs. If absolute log directory path is specified, it is currently required that " + "the replica has not already been created on that broker. The replica will then be created in the specified log directory on the broker later -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7829) Javadoc should show that AdminClient.alterReplicaLogDirs() is supported in Kafka 1.1.0 or later
[ https://issues.apache.org/jira/browse/KAFKA-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7829. Resolution: Fixed Fix Version/s: 2.2.0 Merged to trunk. > Javadoc should show that AdminClient.alterReplicaLogDirs() is supported in > Kafka 1.1.0 or later > --- > > Key: KAFKA-7829 > URL: https://issues.apache.org/jira/browse/KAFKA-7829 > Project: Kafka > Issue Type: Improvement >Reporter: Jun Rao >Assignee: Dong Lin >Priority: Major > Fix For: 2.2.0 > > > In ReassignPartitionsCommand, the --reassignment-json-file option says "...If > absolute log directory path is specified, it is currently required that the > replica has not already been created on that broker...". This is inaccurate > since we support moving existing replicas to new log dirs. > In addition, the Javadoc of AdminClient.alterReplicaLogDirs(...) should show > the API is supported by brokers with version 1.1.0 or later. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
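As a usage sketch of the AdminClient.alterReplicaLogDirs() call referenced above (requires brokers on 1.1.0 or later; the topic name, broker id, and log dir path are placeholders):

{code:java}
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.TopicPartitionReplica;

import java.util.Collections;
import java.util.Properties;

public class AlterReplicaLogDirsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Move the replica of test-topic partition 0 on broker 1 to /data/d2.
            // The replica may already exist on that broker; the broker moves it to the new log dir.
            TopicPartitionReplica replica = new TopicPartitionReplica("test-topic", 0, 1);
            admin.alterReplicaLogDirs(Collections.singletonMap(replica, "/data/d2"))
                 .all()
                 .get();
        }
    }
}
{code}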
[jira] [Created] (KAFKA-7836) The propagation of log dir failure can be delayed due to slowness in closing the file handles
Jun Rao created KAFKA-7836: -- Summary: The propagation of log dir failure can be delayed due to slowness in closing the file handles Key: KAFKA-7836 URL: https://issues.apache.org/jira/browse/KAFKA-7836 Project: Kafka Issue Type: Improvement Reporter: Jun Rao In ReplicaManager.handleLogDirFailure(), we call zkClient.propagateLogDirEvent after logManager.handleLogDirFailure. The latter closes the file handles of the offline replicas, which could take time when the disk is bad. This will delay the new leader election by the controller. In one incident, we have seen the closing of file handles of multiple replicas taking more than 20 seconds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7838) improve logging in Partition.maybeShrinkIsr()
Jun Rao created KAFKA-7838: -- Summary: improve logging in Partition.maybeShrinkIsr() Key: KAFKA-7838 URL: https://issues.apache.org/jira/browse/KAFKA-7838 Project: Kafka Issue Type: Improvement Reporter: Jun Rao When we take a replica out of ISR, it would be useful to further log the fetch offset of the out-of-sync replica and the leader's HW at that point. This could be useful when the admin needs to manually enable unclean leader election. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7837) maybeShrinkIsr may not reflect OfflinePartitions immediately
Jun Rao created KAFKA-7837: -- Summary: maybeShrinkIsr may not reflect OfflinePartitions immediately Key: KAFKA-7837 URL: https://issues.apache.org/jira/browse/KAFKA-7837 Project: Kafka Issue Type: Improvement Reporter: Jun Rao When a partition is marked offline due to a failed disk, the leader is supposed to not shrink its ISR any more. In ReplicaManager.maybeShrinkIsr(), we iterate through all non-offline partitions to shrink the ISR. If an ISR needs to shrink, we need to write the new ISR to ZK, which can take a bit of time. In this window, some partitions could now be marked as offline, but may not be picked up by the iterator since it only reflects the state at that point. This can cause all in-sync followers to be dropped out of ISR unnecessarily and prevent a clean leader election. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
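One possible shape of a mitigation, sketched in Java with hypothetical helper names rather than the broker's actual Scala code (the real fix may look different): re-check each partition's offline status right before shrinking its ISR, instead of trusting the snapshot taken when the iteration started.

{code:java}
import java.util.List;

public class IsrShrinkSketch {
    interface PartitionView {
        boolean isOffline();                 // re-evaluated at call time, not snapshotted
        void maybeShrinkIsr(long maxLagMs);  // may write the new ISR to ZK (slow)
    }

    static void shrinkIsrs(List<PartitionView> nonOfflineSnapshot, long replicaLagTimeMaxMs) {
        for (PartitionView partition : nonOfflineSnapshot) {
            if (partition.isOffline()) {
                // The log dir failed after the snapshot was taken; skip the shrink so that
                // in-sync followers of an offline partition are not dropped from the ISR.
                continue;
            }
            partition.maybeShrinkIsr(replicaLagTimeMaxMs);
        }
    }
}
{code}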
[jira] [Resolved] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower
[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6361. Resolution: Fixed Fix Version/s: 2.0.0 Merged the PR to trunk. > Fast leader fail over can lead to log divergence between leader and follower > > > Key: KAFKA-6361 > URL: https://issues.apache.org/jira/browse/KAFKA-6361 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Anna Povzner >Priority: Major > Labels: reliability > Fix For: 2.0.0 > > > We have observed an edge case in the replication failover logic which can > cause a replica to permanently fall out of sync with the leader or, in the > worst case, actually have localized divergence between logs. This occurs in > spite of the improved truncation logic from KIP-101. > Suppose we have brokers A and B. Initially A is the leader in epoch 1. It > appends two batches: one in the range (0, 10) and the other in the range (11, > 20). The first one successfully replicates to B, but the second one does not. > In other words, the logs on the brokers look like this: > {code} > Broker A: > 0: offsets [0, 10], leader epoch: 1 > 1: offsets [11, 20], leader epoch: 1 > Broker B: > 0: offsets [0, 10], leader epoch: 1 > {code} > Broker A then has a zk session expiration and broker B is elected with epoch > 2. It appends a new batch with offsets (11, n) to its local log. So we now > have this: > {code} > Broker A: > 0: offsets [0, 10], leader epoch: 1 > 1: offsets [11, 20], leader epoch: 1 > Broker B: > 0: offsets [0, 10], leader epoch: 1 > 1: offsets: [11, n], leader epoch: 2 > {code} > Normally we expect broker A to truncate to offset 11 on becoming the > follower, but before it is able to do so, broker B has its own zk session > expiration and broker A again becomes leader, now with epoch 3. It then > appends a new entry in the range (21, 30). The updated logs look like this: > {code} > Broker A: > 0: offsets [0, 10], leader epoch: 1 > 1: offsets [11, 20], leader epoch: 1 > 2: offsets: [21, 30], leader epoch: 3 > Broker B: > 0: offsets [0, 10], leader epoch: 1 > 1: offsets: [11, n], leader epoch: 2 > {code} > Now what happens next depends on the last offset of the batch appended in > epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch > request to broker A with epoch 2. Broker A will respond that epoch 2 ends at > offset 21. There are three cases: > 1) n < 20: In this case, broker B will not do any truncation. It will begin > fetching from offset n, which will ultimately cause an out of order offset > error because broker A will return the full batch beginning from offset 11 > which broker B will be unable to append. > 2) n == 20: Again broker B does not truncate. It will fetch from offset 21 > and everything will appear fine though the logs have actually diverged. > 3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in > the middle of the batch, it will truncate all the way to offset 10. It can > begin fetching from offset 11 and everything is fine. > The case we have actually seen is the first one. The second one would likely > go unnoticed in practice and everything is fine in the third case. To > workaround the issue, we deleted the active segment on the replica which > allowed it to re-replicate consistently from the leader. > I'm not sure the best solution for this scenario. Maybe if the leader isn't > aware of an epoch, it should always respond with {{UNDEFINED_EPOCH_OFFSET}} > instead of using the offset of the next highest epoch. 
That would cause the > follower to truncate using its high watermark. Or perhaps instead of doing > so, it could send another OffsetForLeaderEpoch request at the next previous > cached epoch and then truncate using that. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-6937) In-sync replica delayed during fetch if replica throttle is exceeded
Jun Rao created KAFKA-6937: -- Summary: In-sync replica delayed during fetch if replica throttle is exceeded Key: KAFKA-6937 URL: https://issues.apache.org/jira/browse/KAFKA-6937 Project: Kafka Issue Type: Bug Reporter: Jun Rao Assignee: Jun Rao When replication throttling is enabled, in-sync replica's traffic should never be throttled. However, in DelayedFetch.tryComplete(), we incorrectly delay the completion of an in-sync replica fetch request if replication throttling is engaged. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6937) In-sync replica delayed during fetch if replica throttle is exceeded
[ https://issues.apache.org/jira/browse/KAFKA-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6937. Resolution: Fixed Fix Version/s: 1.1.1 1.0.2 2.0.0 Merged the PR. > In-sync replica delayed during fetch if replica throttle is exceeded > > > Key: KAFKA-6937 > URL: https://issues.apache.org/jira/browse/KAFKA-6937 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.11.0.1, 1.1.0, 1.0.1 >Reporter: Jun Rao >Assignee: Jun Rao >Priority: Major > Fix For: 2.0.0, 1.0.2, 1.1.1 > > > When replication throttling is enabled, in-sync replica's traffic should > never be throttled. However, in DelayedFetch.tryComplete(), we incorrectly > delay the completion of an in-sync replica fetch request if replication > throttling is engaged. > The impact is that the producer may see increased latency if acks = all. The > delivery of the message to the consumer may also be delayed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
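A simplified sketch of the intended condition, written as Java pseudocode with assumed names (the actual logic lives in the Scala DelayedFetch.tryComplete()):

{code:java}
public class ReplicationThrottleSketch {
    /**
     * Whether a follower fetch should be delayed because the replication quota is exceeded.
     * The key point of the fix: a follower that is already in sync must never be delayed,
     * since throttling it can shrink the ISR and stall acks=all producers.
     */
    static boolean shouldDelayFetch(boolean partitionIsThrottled,
                                    boolean quotaExceeded,
                                    boolean followerIsInSync) {
        return partitionIsThrottled && quotaExceeded && !followerIsInSync;
    }
}
{code}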
[jira] [Resolved] (KAFKA-7011) Investigate if it's possible to drop the ResourceNameType field from Java Resource class.
[ https://issues.apache.org/jira/browse/KAFKA-7011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7011. Resolution: Fixed Merged the PR to trunk and 2.0 branch. > Investigate if it's possible to drop the ResourceNameType field from Java > Resource class. > > > Key: KAFKA-7011 > URL: https://issues.apache.org/jira/browse/KAFKA-7011 > Project: Kafka > Issue Type: Sub-task > Components: core, security >Reporter: Andy Coates >Assignee: Andy Coates >Priority: Major > Fix For: 2.0.0 > > > Following on from the PR [#5117|https://github.com/apache/kafka/pull/5117] > and discussions with Colin McCabe... > > The current placement of ResourceNameType as a field in the Resource class is ... less > than ideal. A Resource should be a concrete resource. Look to resolve this. > > Thoughts... > A. I guess you could subclass Resource and have ResourcePrefix - but there is > no 'is-a' relationship here and it would still allow > authorise(ResourcePrefix()) > B. You could move ResourceNameType into AccessControlEntryData - possible. > C. Move ResourceNameType directly into AclBinding / AclBindingFilter - > possible > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7006) Remove duplicate Scala ResourceNameType class
[ https://issues.apache.org/jira/browse/KAFKA-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7006. Resolution: Fixed merged the PR to trunk and 2.0 branch. > Remove duplicate Scala ResourceNameType class > - > > Key: KAFKA-7006 > URL: https://issues.apache.org/jira/browse/KAFKA-7006 > Project: Kafka > Issue Type: Sub-task > Components: core, security >Reporter: Andy Coates >Assignee: Andy Coates >Priority: Major > Fix For: 2.0.0 > > > Relating to one of the outstanding work items in PR > [#5117|https://github.com/apache/kafka/pull/5117]... > The kafka.security.auth.ResourceTypeName class should be dropped in favour of > the Java one. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7005) Remove duplicate Java Resource class.
[ https://issues.apache.org/jira/browse/KAFKA-7005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7005. Resolution: Fixed merged the PR to trunk and 2.0 branch. > Remove duplicate Java Resource class. > - > > Key: KAFKA-7005 > URL: https://issues.apache.org/jira/browse/KAFKA-7005 > Project: Kafka > Issue Type: Sub-task > Components: core, security >Reporter: Andy Coates >Assignee: Andy Coates >Priority: Major > Fix For: 2.0.0 > > > Relating to one of the outstanding work items in PR > [#5117|https://github.com/apache/kafka/pull/5117]... > The o.a.k.c.request.Resource class could be dropped in favour of > o.a.k.c.config.ConfigResource. > This will remove the duplication of `Resource` classes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7007) Use JSON for /kafka-acl-extended-changes path
[ https://issues.apache.org/jira/browse/KAFKA-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7007. Resolution: Fixed merged the PR to trunk and 2.0 branch. > Use JSON for /kafka-acl-extended-changes path > - > > Key: KAFKA-7007 > URL: https://issues.apache.org/jira/browse/KAFKA-7007 > Project: Kafka > Issue Type: Sub-task > Components: core, security >Reporter: Andy Coates >Assignee: Andy Coates >Priority: Major > Fix For: 2.0.0 > > > Relating to one of the outstanding work items in PR > [#5117|https://github.com/apache/kafka/pull/5117]... > > Keep Literal ACLs on the old paths, using the old formats, to maintain > backwards compatibility. > Have Prefixed, and any later types, go on new paths, using JSON (old > brokers are not aware of them). > Add checks to reject any adminClient requests to add prefixed acls before the > cluster is fully upgraded. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7010) Rename ResourceNameType.ANY to MATCH
[ https://issues.apache.org/jira/browse/KAFKA-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7010. Resolution: Fixed merged the PR to trunk and 2.0 branch. > Rename ResourceNameType.ANY to MATCH > > > Key: KAFKA-7010 > URL: https://issues.apache.org/jira/browse/KAFKA-7010 > Project: Kafka > Issue Type: Sub-task > Components: core, security >Reporter: Andy Coates >Assignee: Andy Coates >Priority: Major > Fix For: 2.0.0 > > > Following on from the PR > [#5117|https://github.com/apache/kafka/pull/5117] and discussions with > Colin McCabe... > The current ResourceNameType.ANY may be misleading as it performs pattern > matching for wildcard and prefixed bindings, whereas ResourceName.ANY just > brings back any resource name. > Renaming to ResourceNameType.MATCH and adding more Java doc should clear this > up. > Finally, ResourceNameType is no longer appropriate as the type is used in > ResourcePattern and ResourcePatternFilter. Hence rename to PatternType. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7064) "Unexpected resource type GROUP" when describing broker configs using latest admin client
[ https://issues.apache.org/jira/browse/KAFKA-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7064. Resolution: Fixed Merged [https://github.com/apache/kafka/pull/5245] to trunk and 2.0 branch. > "Unexpected resource type GROUP" when describing broker configs using latest > admin client > - > > Key: KAFKA-7064 > URL: https://issues.apache.org/jira/browse/KAFKA-7064 > Project: Kafka > Issue Type: Bug >Reporter: Rohan Desai >Assignee: Andy Coates >Priority: Blocker > Fix For: 2.0.0 > > > I'm getting the following error when I try to describe broker configs using > the admin client: > {code:java} > org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource > type GROUP for resource 0{code} > I think its due to this commit: > [https://github.com/apache/kafka/commit/49db5a63c043b50c10c2dfd0648f8d74ee917b6a] > > My guess at what's going on is that now that the client is using > ConfigResource instead of Resource it's sending a describe request for > resource type BROKER w/ id 3, while the broker associates id 3 w/ GROUP -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6254) Introduce Incremental FetchRequests to Increase Partition Scalability
[ https://issues.apache.org/jira/browse/KAFKA-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6254. Resolution: Fixed Fix Version/s: 1.1.0 The PR is merged. > Introduce Incremental FetchRequests to Increase Partition Scalability > - > > Key: KAFKA-6254 > URL: https://issues.apache.org/jira/browse/KAFKA-6254 > Project: Kafka > Issue Type: Improvement >Reporter: Colin P. McCabe >Assignee: Colin P. McCabe >Priority: Major > Fix For: 1.1.0 > > > Introduce Incremental FetchRequests to Increase Partition Scalability. See > https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6452) Add documentation for delegation token authentication mechanism
[ https://issues.apache.org/jira/browse/KAFKA-6452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6452. Resolution: Fixed Fix Version/s: (was: 1.2.0) 1.1.0 The PR is merged to 1.1 and trunk. > Add documentation for delegation token authentication mechanism > --- > > Key: KAFKA-6452 > URL: https://issues.apache.org/jira/browse/KAFKA-6452 > Project: Kafka > Issue Type: Sub-task > Components: documentation >Reporter: Manikumar >Assignee: Manikumar >Priority: Major > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6184) report a metric of the lag between the consumer offset and the start offset of the log
[ https://issues.apache.org/jira/browse/KAFKA-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6184. Resolution: Fixed Fix Version/s: 1.2.0 The PR is merged to trunk. > report a metric of the lag between the consumer offset and the start offset > of the log > -- > > Key: KAFKA-6184 > URL: https://issues.apache.org/jira/browse/KAFKA-6184 > Project: Kafka > Issue Type: Improvement >Affects Versions: 1.0.0 >Reporter: Jun Rao >Assignee: huxihx >Priority: Major > Labels: needs-kip > Fix For: 1.2.0 > > > Currently, the consumer reports a metric of the lag between the high > watermark of a log and the consumer offset. It will be useful to report a > similar lag metric between the consumer offset and the start offset of the > log. If this latter lag gets close to 0, it's an indication that the consumer > may lose data soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
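For reference, the same lead can already be computed on the client side with the public consumer API; a sketch (topic, group, and bootstrap settings are placeholders):

{code:java}
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Properties;

public class ConsumerLeadExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "lead-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("test-topic", 0);
            consumer.assign(Collections.singleton(tp));

            long position = consumer.position(tp);
            long logStartOffset = consumer.beginningOffsets(Collections.singleton(tp)).get(tp);

            // If this lead approaches 0, retention may delete records before they are consumed.
            long lead = position - logStartOffset;
            System.out.println("lead = " + lead);
        }
    }
}
{code}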
[jira] [Created] (KAFKA-6624) log segment deletion could cause a disk to be marked offline incorrectly
Jun Rao created KAFKA-6624: -- Summary: log segment deletion could cause a disk to be marked offline incorrectly Key: KAFKA-6624 URL: https://issues.apache.org/jira/browse/KAFKA-6624 Project: Kafka Issue Type: Bug Components: core Affects Versions: 1.1.0 Reporter: Jun Rao Saw the following log. [2018-03-06 23:12:20,721] ERROR Error while flushing log for topic1-0 in dir /data01/kafka-logs with offset 80993 (kafka.server.LogDirFailureChannel) java.nio.channels.ClosedChannelException at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:379) at org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:163) at kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:375) at kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:374) at kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:374) at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) at kafka.log.LogSegment.flush(LogSegment.scala:374) at kafka.log.Log$$anonfun$flush$1$$anonfun$apply$mcV$sp$4.apply(Log.scala:1374) at kafka.log.Log$$anonfun$flush$1$$anonfun$apply$mcV$sp$4.apply(Log.scala:1373) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at kafka.log.Log$$anonfun$flush$1.apply$mcV$sp(Log.scala:1373) at kafka.log.Log$$anonfun$flush$1.apply(Log.scala:1368) at kafka.log.Log$$anonfun$flush$1.apply(Log.scala:1368) at kafka.log.Log.maybeHandleIOException(Log.scala:1669) at kafka.log.Log.flush(Log.scala:1368) at kafka.log.Log$$anonfun$roll$2$$anonfun$apply$1.apply$mcV$sp(Log.scala:1343) at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) [2018-03-06 23:12:20,722] INFO [ReplicaManager broker=0] Stopping serving replicas in dir /data01/kafka-logs (kafka.server.ReplicaManager) It seems that topic1 was being deleted around the time when flushing was called. Then flushing hit an IOException, which caused the disk to be marked offline incorrectly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6624) log segment deletion could cause a disk to be marked offline incorrectly
[ https://issues.apache.org/jira/browse/KAFKA-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6624. Resolution: Fixed Fix Version/s: 1.1.0 > log segment deletion could cause a disk to be marked offline incorrectly > > > Key: KAFKA-6624 > URL: https://issues.apache.org/jira/browse/KAFKA-6624 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.1.0 >Reporter: Jun Rao >Assignee: Dong Lin >Priority: Major > Fix For: 1.1.0 > > > Saw the following log. > [2018-03-06 23:12:20,721] ERROR Error while flushing log for topic1-0 in dir > /data01/kafka-logs with offset 80993 (kafka.server.LogDirFailureChannel) > java.nio.channels.ClosedChannelException > at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) > at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:379) > at > org.apache.kafka.common.record.FileRecords.flush(FileRecords.java:163) > at > kafka.log.LogSegment$$anonfun$flush$1.apply$mcV$sp(LogSegment.scala:375) > at kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:374) > at kafka.log.LogSegment$$anonfun$flush$1.apply(LogSegment.scala:374) > at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) > at kafka.log.LogSegment.flush(LogSegment.scala:374) > at > kafka.log.Log$$anonfun$flush$1$$anonfun$apply$mcV$sp$4.apply(Log.scala:1374) > at > kafka.log.Log$$anonfun$flush$1$$anonfun$apply$mcV$sp$4.apply(Log.scala:1373) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at kafka.log.Log$$anonfun$flush$1.apply$mcV$sp(Log.scala:1373) > at kafka.log.Log$$anonfun$flush$1.apply(Log.scala:1368) > at kafka.log.Log$$anonfun$flush$1.apply(Log.scala:1368) > at kafka.log.Log.maybeHandleIOException(Log.scala:1669) > at kafka.log.Log.flush(Log.scala:1368) > at > kafka.log.Log$$anonfun$roll$2$$anonfun$apply$1.apply$mcV$sp(Log.scala:1343) > at > kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > [2018-03-06 23:12:20,722] INFO [ReplicaManager broker=0] Stopping serving > replicas in dir /data01/kafka-logs (kafka.server.ReplicaManager) > It seems that topic1 was being deleted around the time when flushing was > called. Then flushing hit an IOException, which caused the disk to be marked > offline incorrectly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-6760) responses not logged properly in controller
Jun Rao created KAFKA-6760: -- Summary: responses not logged properly in controller Key: KAFKA-6760 URL: https://issues.apache.org/jira/browse/KAFKA-6760 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 1.1.0 Reporter: Jun Rao Saw the following logging in controller.log. We need to log the StopReplicaResponse properly in KafkaController. [2018-04-05 14:38:41,878] DEBUG [Controller id=0] Delete topic callback invoked for org.apache.kafka.common.requests.StopReplicaResponse@263d40c (kafka.controller.KafkaController) It seems that the same issue exists for LeaderAndIsrResponse as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-3827) log.message.format.version should default to inter.broker.protocol.version
[ https://issues.apache.org/jira/browse/KAFKA-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-3827. Resolution: Won't Fix The current implementation is easier to understand than what I proposed. Closing the jira. > log.message.format.version should default to inter.broker.protocol.version > -- > > Key: KAFKA-3827 > URL: https://issues.apache.org/jira/browse/KAFKA-3827 > Project: Kafka > Issue Type: Improvement >Affects Versions: 0.10.0.0 >Reporter: Jun Rao >Assignee: Manasvi Gupta >Priority: Major > Labels: newbie > > Currently, if one sets inter.broker.protocol.version to 0.9.0 and restarts > the broker, one will get the following exception since > log.message.format.version defaults to 0.10.0. It will be more intuitive if > log.message.format.version defaults to the value of > inter.broker.protocol.version. > java.lang.IllegalArgumentException: requirement failed: > log.message.format.version 0.10.0-IV1 cannot be used when > inter.broker.protocol.version is set to 0.9.0.1 > at scala.Predef$.require(Predef.scala:233) > at kafka.server.KafkaConfig.validateValues(KafkaConfig.scala:1023) > at kafka.server.KafkaConfig.(KafkaConfig.scala:994) > at kafka.server.KafkaConfig$.fromProps(KafkaConfig.scala:743) > at kafka.server.KafkaConfig$.fromProps(KafkaConfig.scala:740) > at > kafka.server.KafkaServerStartable$.fromProps(KafkaServerStartable.scala:28) > at kafka.Kafka$.main(Kafka.scala:58) > at kafka.Kafka.main(Kafka.scala) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-5706) log the name of the error instead of the error code in response objects
[ https://issues.apache.org/jira/browse/KAFKA-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5706. Resolution: Cannot Reproduce [~manasvigupta], this doesn't seem to be an issue any more now that we have converted all requests/responses to java objects using Error enum. Closing the jira. > log the name of the error instead of the error code in response objects > --- > > Key: KAFKA-5706 > URL: https://issues.apache.org/jira/browse/KAFKA-5706 > Project: Kafka > Issue Type: Improvement > Components: clients, core >Affects Versions: 0.11.0.0 >Reporter: Jun Rao >Assignee: Manasvi Gupta >Priority: Major > Labels: newbie > > Currently, when logging the error code in the response objects, we simply log > response.toString(), which contains the error code. It will be useful to log > the name of the corresponding exception for the error, which is more > meaningful than an error code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6752) Unclean leader election metric no longer working
[ https://issues.apache.org/jira/browse/KAFKA-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6752. Resolution: Fixed Fix Version/s: 1.1.1 2.0.0 Merged to trunk and 1.1. > Unclean leader election metric no longer working > > > Key: KAFKA-6752 > URL: https://issues.apache.org/jira/browse/KAFKA-6752 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 1.1.0 >Reporter: Jason Gustafson >Assignee: Manikumar >Priority: Major > Fix For: 2.0.0, 1.1.1 > > > Happened to notice that the unclean leader election meter is no longer being > updated. This was probably lost during the controller overhaul. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6447) Add Delegation Token Operations to KafkaAdminClient
[ https://issues.apache.org/jira/browse/KAFKA-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6447. Resolution: Fixed Fix Version/s: 2.0.0 Merged the PR to trunk. > Add Delegation Token Operations to KafkaAdminClient > --- > > Key: KAFKA-6447 > URL: https://issues.apache.org/jira/browse/KAFKA-6447 > Project: Kafka > Issue Type: Sub-task >Reporter: Manikumar >Assignee: Manikumar >Priority: Major > Fix For: 2.0.0 > > > This JIRA is about adding delegation token operations to the new Admin Client > API. > KIP: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-249%3A+Add+Delegation+Token+Operations+to+KafkaAdminClient -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-6780) log cleaner shouldn't clean messages beyond high watermark
Jun Rao created KAFKA-6780: -- Summary: log cleaner shouldn't clean messages beyond high watermark Key: KAFKA-6780 URL: https://issues.apache.org/jira/browse/KAFKA-6780 Project: Kafka Issue Type: Bug Affects Versions: 1.1.0 Reporter: Jun Rao Currently, the firstUncleanableDirtyOffset computed by the log cleaner is bounded by the first offset in the active segment. It's possible for the high watermark to be smaller than that. This may cause a committed record to be removed because of an uncommitted record. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
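A minimal sketch of the proposed bound, written in Java with invented offsets (the actual change would live in the Scala log cleaner code):

{code:java}
public class UncleanableOffsetSketch {
    public static void main(String[] args) {
        // Assumed example offsets, not real data.
        long highWatermark = 1_200L;            // last committed offset exposed by the leader
        long activeSegmentBaseOffset = 1_500L;  // first offset of the active segment

        // Proposed bound: the cleaner must not clean past the high watermark; otherwise an
        // uncommitted record could cause a committed record with the same key to be removed.
        long firstUncleanableDirtyOffset = Math.min(highWatermark, activeSegmentBaseOffset);

        System.out.println("firstUncleanableDirtyOffset = " + firstUncleanableDirtyOffset);
    }
}
{code}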
[jira] [Resolved] (KAFKA-6650) The controller should be able to handle a partially deleted topic
[ https://issues.apache.org/jira/browse/KAFKA-6650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6650. Resolution: Fixed Fix Version/s: 2.0.0 merged the PR to trunk. > The controller should be able to handle a partially deleted topic > - > > Key: KAFKA-6650 > URL: https://issues.apache.org/jira/browse/KAFKA-6650 > Project: Kafka > Issue Type: Bug >Reporter: Lucas Wang >Assignee: Lucas Wang >Priority: Minor > Fix For: 2.0.0 > > > A previous controller could have deleted some partitions of a topic from ZK, > but not all partitions, and then died. > In that case, the new controller should be able to handle the partially > deleted topic, and finish the deletion. > In the current code base, if there is no leadership info for a replica's > partition, the transition to OfflineReplica state for the replica will fail. > Afterwards the transition to ReplicaDeletionStarted will fail as well since > the only valid previous state for ReplicaDeletionStarted is OfflineReplica. > Furthermore, it means the topic deletion will never finish. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-6834) log cleaner should handle the case when the size of a message set is larger than the max message size
Jun Rao created KAFKA-6834: -- Summary: log cleaner should handle the case when the size of a message set is larger than the max message size Key: KAFKA-6834 URL: https://issues.apache.org/jira/browse/KAFKA-6834 Project: Kafka Issue Type: Bug Reporter: Jun Rao In KAFKA-5316, we added the logic to allow a message (set) larger than the per topic message size to be written to the log during log cleaning. However, the buffer size in the log cleaner is still bounded by the per topic message size. This can cause the log cleaner to die and cause the broker to run out of disk space. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-6857) LeaderEpochFileCache.endOffsetFor() should check for UNDEFINED_EPOCH explicitly
Jun Rao created KAFKA-6857: -- Summary: LeaderEpochFileCache.endOffsetFor() should check for UNDEFINED_EPOCH explicitly Key: KAFKA-6857 URL: https://issues.apache.org/jira/browse/KAFKA-6857 Project: Kafka Issue Type: Bug Components: core Affects Versions: 0.11.0.0 Reporter: Jun Rao In LeaderEpochFileCache.endOffsetFor(), we have the following code. {code:java} if (requestedEpoch == latestEpoch) { leo().messageOffset {code} In the case when the requestedEpoch is UNDEFINED_EPOCH and latestEpoch is also UNDEFINED_EPOCH, we return the leo. This will cause the follower to truncate to a wrong offset. If requestedEpoch is UNDEFINED_EPOCH, we need to return UNDEFINED_EPOCH_OFFSET. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
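A sketch of the suggested explicit check, written as Java pseudocode with assumed names rather than the actual Scala in LeaderEpochFileCache:

{code:java}
public class EndOffsetForSketch {
    static final int UNDEFINED_EPOCH = -1;
    static final long UNDEFINED_EPOCH_OFFSET = -1L;

    static long endOffsetFor(int requestedEpoch, int latestEpoch, long logEndOffset) {
        if (requestedEpoch == UNDEFINED_EPOCH) {
            // The follower does not track leader epochs (e.g. an older message format), so hand
            // back UNDEFINED_EPOCH_OFFSET and let it fall back to truncating by high watermark.
            return UNDEFINED_EPOCH_OFFSET;
        } else if (requestedEpoch == latestEpoch) {
            return logEndOffset;
        }
        // ... otherwise look up the end offset of the requested epoch from the cache.
        return UNDEFINED_EPOCH_OFFSET;
    }
}
{code}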
[jira] [Created] (KAFKA-6858) Log.truncateTo() may truncate to an earlier offset than requested
Jun Rao created KAFKA-6858: -- Summary: Log.truncateTo() may truncate to an earlier offset than requested Key: KAFKA-6858 URL: https://issues.apache.org/jira/browse/KAFKA-6858 Project: Kafka Issue Type: Improvement Reporter: Jun Rao In Log.truncateTo(), if the truncation point is in the middle of a message set, we will actually be truncating to the first offset of the message set. In that case, the replica fetcher thread should adjust the fetch offset to the actual truncated offset. Typically, the truncation point should never be in the middle of a message set. However, this could potentially happen during message format upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7299) batch LeaderAndIsr requests during auto preferred leader election
Jun Rao created KAFKA-7299: -- Summary: batch LeaderAndIsr requests during auto preferred leader election Key: KAFKA-7299 URL: https://issues.apache.org/jira/browse/KAFKA-7299 Project: Kafka Issue Type: Sub-task Components: core Affects Versions: 2.0.0 Reporter: Jun Rao Currently, in KafkaController.checkAndTriggerAutoLeaderRebalance(), we call onPreferredReplicaElection() one partition at a time. This means that the controller will be sending LeaderAndIsrRequest one partition at a time. It would be more efficient to call onPreferredReplicaElection() for a batch of partitions to reduce the number of LeaderAndIsrRequests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7299) batch LeaderAndIsr requests during auto preferred leader election
[ https://issues.apache.org/jira/browse/KAFKA-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7299. Resolution: Fixed Fix Version/s: 2.1.0 Merged the PR to trunk. > batch LeaderAndIsr requests during auto preferred leader election > - > > Key: KAFKA-7299 > URL: https://issues.apache.org/jira/browse/KAFKA-7299 > Project: Kafka > Issue Type: Sub-task > Components: core >Affects Versions: 2.0.0 >Reporter: Jun Rao >Assignee: huxihx >Priority: Major > Fix For: 2.1.0 > > > Currently, in KafkaController.checkAndTriggerAutoLeaderRebalance(), we call > onPreferredReplicaElection() one partition at a time. This means that the > controller will be sending LeaderAndIsrRequest one partition at a time. It > would be more efficient to call onPreferredReplicaElection() for a batch of > partitions to reduce the number of LeaderAndIsrRequests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
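A high-level sketch of the batching idea in Java pseudocode (the interface and method names here are assumptions, not the controller's actual Scala API):

{code:java}
import java.util.HashSet;
import java.util.Set;

public class BatchedPreferredElectionSketch {
    interface Controller {
        Set<String> partitionsNeedingPreferredLeader();       // hypothetical helper
        void onPreferredReplicaElection(Set<String> batch);   // one call for the whole batch
    }

    static void checkAndTriggerAutoLeaderRebalance(Controller controller) {
        // Collect every imbalanced partition first, then trigger a single batched election,
        // so each broker receives one LeaderAndIsrRequest covering many partitions instead of
        // one request per partition.
        Set<String> batch = new HashSet<>(controller.partitionsNeedingPreferredLeader());
        if (!batch.isEmpty()) {
            controller.onPreferredReplicaElection(batch);
        }
    }
}
{code}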
[jira] [Resolved] (KAFKA-6835) Enable topic unclean leader election to be enabled without controller change
[ https://issues.apache.org/jira/browse/KAFKA-6835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6835. Resolution: Fixed Merged the PR to trunk. > Enable topic unclean leader election to be enabled without controller change > > > Key: KAFKA-6835 > URL: https://issues.apache.org/jira/browse/KAFKA-6835 > Project: Kafka > Issue Type: Task > Components: core >Reporter: Rajini Sivaram >Assignee: Manikumar >Priority: Major > Fix For: 2.1.0 > > > Dynamic update of broker's default unclean.leader.election.enable will be > processed without controller change (KAFKA-6526). We should probably do the > same for topic overrides as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-6753) Speed up event processing on the controller
[ https://issues.apache.org/jira/browse/KAFKA-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6753. Resolution: Fixed Fix Version/s: 2.1.0 Merged the PR to trunk. > Speed up event processing on the controller > > > Key: KAFKA-6753 > URL: https://issues.apache.org/jira/browse/KAFKA-6753 > Project: Kafka > Issue Type: Improvement >Reporter: Lucas Wang >Assignee: Lucas Wang >Priority: Minor > Fix For: 2.1.0 > > Attachments: Screen Shot 2018-04-04 at 7.08.55 PM.png > > > The existing controller code updates metrics after processing every event. > This can slow down event processing on the controller tremendously. In one > profiling we see that updating metrics takes nearly 100% of the CPU for the > controller event processing thread. Specifically the slowness can be > attributed to two factors: > 1. Each invocation to update the metrics is expensive. Specifically trying to > calculate the offline partitions count requires iterating through all the > partitions in the cluster to check if the partition is offline; and > calculating the preferred replica imbalance count requires iterating through > all the partitions in the cluster to check if a partition has a leader other > than the preferred leader. In a large cluster, the number of partitions can > be quite large, all seen by the controller. Even if the time spent to check a > single partition is small, the accumulation effect of so many partitions in > the cluster can make the invocation to update metrics quite expensive. One > might argue that maybe the logic for processing each single partition is not > optimized; we checked the CPU percentage of leaf nodes in the profiling > result, and found that inside the loops of collection objects, e.g. the set > of all partitions, no single function dominates the processing. Hence the > large number of the partitions in a cluster is the main contributor to the > slowness of one invocation to update the metrics. > 2. The invocation to update metrics is called many times when there is a high > number of events to be processed by the controller, one invocation after > processing any event. > The patch that will be submitted tries to fix bullet 2 above, i.e. reducing > the number of invocations to update metrics. Instead of updating the metrics > after processing any event, we only periodically check if the metrics need > to be updated, i.e. once every second. > * If after the previous invocation to update metrics, there are other types > of events that changed the controller’s state, then one second later the > metrics will be updated. > * If after the previous invocation, there have been no other types of events, > then the call to update metrics can be bypassed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
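For illustration, a sketch of the rate limiting described in the ticket: record only a cheap flag when an event changes controller state, and let a once-per-second task decide whether the expensive gauges need recomputing. The names are illustrative, not the merged code.
{code:scala}
object PeriodicMetricsSketch {
  @volatile private var stateChanged = false

  // Called from event processing; no metric work happens here.
  def onControllerStateChange(): Unit = stateChanged = true

  // Invoked by a scheduler roughly once per second.
  def maybeUpdateMetrics(recomputeGauges: () => Unit): Unit =
    if (stateChanged) {
      recomputeGauges() // the expensive walk over all partitions
      stateChanged = false
    }
}
{code}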
[jira] [Resolved] (KAFKA-6343) OOM as the result of creation of 5k topics
[ https://issues.apache.org/jira/browse/KAFKA-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6343. Resolution: Fixed Assignee: Alex Dunayevsky Fix Version/s: 2.1.0 Merged the PR to trunk. > OOM as the result of creation of 5k topics > -- > > Key: KAFKA-6343 > URL: https://issues.apache.org/jira/browse/KAFKA-6343 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.1.1, 0.10.2.0, 0.10.2.1, 0.11.0.1, 0.11.0.2, 1.0.0 > Environment: RHEL 7, RAM 755GB per host >Reporter: Alex Dunayevsky >Assignee: Alex Dunayevsky >Priority: Major > Fix For: 2.1.0 > > > *Reproducing*: Create 5k topics *from the code* quickly, without any delays. > Wait until brokers will finish loading them. This will actually never happen, > since all brokers will go down one by one after approx 10-15 minutes or more, > depending on the hardware. > *Heap*: -Xmx/Xms: 5G, 10G, 50G, 256G, 512G > > *Topology*: 3 brokers, 3 zk. > *Code for 5k topic creation:* > {code:java} > package kafka > import kafka.admin.AdminUtils > import kafka.utils.{Logging, ZkUtils} > object TestCreateTopics extends App with Logging { > val zkConnect = "grid978:2185" > var zkUtils = ZkUtils(zkConnect, 6000, 6000, isZkSecurityEnabled = false) > for (topic <- 1 to 5000) { > AdminUtils.createTopic( > topic = s"${topic.toString}", > partitions= 10, > replicationFactor = 2, > zkUtils = zkUtils > ) > logger.info(s"Created topic ${topic.toString}") > } > } > {code} > *Cause of death:* > {code:java} > java.io.IOException: Map failed > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:920) > at kafka.log.AbstractIndex.(AbstractIndex.scala:61) > at kafka.log.OffsetIndex.(OffsetIndex.scala:52) > at kafka.log.LogSegment.(LogSegment.scala:67) > at kafka.log.Log.loadSegments(Log.scala:255) > at kafka.log.Log.(Log.scala:108) > at kafka.log.LogManager.createLog(LogManager.scala:362) > at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:94) > at > kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:174) > at > kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:174) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:174) > at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:168) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234) > at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:242) > at kafka.cluster.Partition.makeLeader(Partition.scala:168) > at > kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:758) > at > kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:757) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at kafka.server.ReplicaManager.makeLeaders(ReplicaManager.scala:757) > at > kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:703) > at > kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:148) > at kafka.server.KafkaApis.handle(KafkaApis.scala:82) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: Map failed > at 
sun.nio.ch.FileChannelImpl.map0(Native Method) > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:917) > ... 28 more > {code} > Broker restart results the same OOM issues. All brokers will not be able to > start again. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7366) topic level segment.bytes and segment.ms not taking effect immediately
Jun Rao created KAFKA-7366: -- Summary: topic level segment.bytes and segment.ms not taking effect immediately Key: KAFKA-7366 URL: https://issues.apache.org/jira/browse/KAFKA-7366 Project: Kafka Issue Type: Bug Affects Versions: 1.1.0 Reporter: Jun Rao It used to be that topic level configs such as segment.bytes takes effect immediately. Because of KAFKA-6324 in 1.1, those configs now only take effect after the active segment has rolled. The relevant part of KAFKA-6324 is that in Log.maybeRoll, the checking of the segment rolling is moved to LogSegment.shouldRoll(). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7287) Set open ACL permissions for old consumer znode path
[ https://issues.apache.org/jira/browse/KAFKA-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7287. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 1.1.2 Also merged the PR to 1.1 branch. > Set open ACL permissions for old consumer znode path > > > Key: KAFKA-7287 > URL: https://issues.apache.org/jira/browse/KAFKA-7287 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Manikumar >Assignee: Manikumar >Priority: Major > Fix For: 1.1.2, 2.0.1, 2.1.0 > > > Old consumer znode path should have open ACL permissions in kerberized > environment. This got missed in kafkaZkClient changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7117) Allow AclCommand to use AdminClient API
[ https://issues.apache.org/jira/browse/KAFKA-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7117. Resolution: Fixed Fix Version/s: 2.1.0 Merged the PR to trunk. > Allow AclCommand to use AdminClient API > --- > > Key: KAFKA-7117 > URL: https://issues.apache.org/jira/browse/KAFKA-7117 > Project: Kafka > Issue Type: Improvement >Reporter: Manikumar >Assignee: Manikumar >Priority: Major > Labels: needs-kip > Fix For: 2.1.0 > > > Currently AclCommand (kafka-acls.sh) uses authorizer class (default > SimpleAclAuthorizer) to manage acls. > We should also allow AclCommand to support AdminClient API based acl > management. This will allow kafka-acls.sh script users to manage acls without > interacting zookeeper/authorizer directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7400) Compacted topic segments that precede the log start offset are not cleaned up
[ https://issues.apache.org/jira/browse/KAFKA-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7400. Resolution: Fixed Fix Version/s: 2.1.0 Merged the PR to trunk. > Compacted topic segments that precede the log start offset are not cleaned up > - > > Key: KAFKA-7400 > URL: https://issues.apache.org/jira/browse/KAFKA-7400 > Project: Kafka > Issue Type: Bug > Components: log >Affects Versions: 1.1.0, 1.1.1, 2.0.0 >Reporter: Bob Barrett >Assignee: Bob Barrett >Priority: Minor > Fix For: 2.1.0 > > > LogManager.cleanupLogs currently checks if a topic is compacted, and skips > any deletion if it is. This means that if the log start offset increases, log > segments that precede the start offset will never be deleted. The log cleanup > logic should be improved to delete these segments even for compacted topics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
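For illustration, a sketch of the retention rule implied by the fix: even for compacted topics, a segment whose records all lie below the log start offset can be deleted. Segment is a simplified stand-in for kafka.log.LogSegment, and the list is assumed sorted by base offset.
{code:scala}
object StartOffsetRetentionSketch {
  final case class Segment(baseOffset: Long, nextOffset: Long)

  // Segments entirely below the log start offset are deletable regardless of compaction.
  def deletableByStartOffset(segments: Seq[Segment], logStartOffset: Long): Seq[Segment] =
    segments.takeWhile(_.nextOffset <= logStartOffset)
}
{code}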
[jira] [Resolved] (KAFKA-7216) Exception while running kafka-acls.sh from 1.0 env on target Kafka env with 1.1.1
[ https://issues.apache.org/jira/browse/KAFKA-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7216. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 1.1.2 1.0.3 Merged to trunk, 2.1, 1.1 and 1.0. > Exception while running kafka-acls.sh from 1.0 env on target Kafka env with > 1.1.1 > - > > Key: KAFKA-7216 > URL: https://issues.apache.org/jira/browse/KAFKA-7216 > Project: Kafka > Issue Type: Bug > Components: admin >Affects Versions: 1.1.0, 1.1.1, 2.0.0 >Reporter: Satish Duggana >Assignee: Manikumar >Priority: Major > Fix For: 1.0.3, 1.1.2, 2.0.1, 2.1.0 > > > When `kafka-acls.sh` with SimpleAclAuthorizer on target Kafka cluster with > 1.1.1 version, it throws the below error. > {code:java} > kafka.common.KafkaException: DelegationToken not a valid resourceType name. > The valid names are Topic,Group,Cluster,TransactionalId > at > kafka.security.auth.ResourceType$$anonfun$fromString$1.apply(ResourceType.scala:56) > at > kafka.security.auth.ResourceType$$anonfun$fromString$1.apply(ResourceType.scala:56) > at scala.Option.getOrElse(Option.scala:121) > at kafka.security.auth.ResourceType$.fromString(ResourceType.scala:56) > at > kafka.security.auth.SimpleAclAuthorizer$$anonfun$loadCache$1$$anonfun$apply$mcV$sp$1.apply(SimpleAclAuthorizer.scala:233) > at > kafka.security.auth.SimpleAclAuthorizer$$anonfun$loadCache$1$$anonfun$apply$mcV$sp$1.apply(SimpleAclAuthorizer.scala:232) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > kafka.security.auth.SimpleAclAuthorizer$$anonfun$loadCache$1.apply$mcV$sp(SimpleAclAuthorizer.scala:232) > at > kafka.security.auth.SimpleAclAuthorizer$$anonfun$loadCache$1.apply(SimpleAclAuthorizer.scala:230) > at > kafka.security.auth.SimpleAclAuthorizer$$anonfun$loadCache$1.apply(SimpleAclAuthorizer.scala:230) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:216) > at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:224) > at > kafka.security.auth.SimpleAclAuthorizer.loadCache(SimpleAclAuthorizer.scala:230) > at > kafka.security.auth.SimpleAclAuthorizer.configure(SimpleAclAuthorizer.scala:114) > at kafka.admin.AclCommand$.withAuthorizer(AclCommand.scala:83) > at kafka.admin.AclCommand$.addAcl(AclCommand.scala:93) > at kafka.admin.AclCommand$.main(AclCommand.scala:53) > at kafka.admin.AclCommand.main(AclCommand.scala) > {code} > > This is because it tries to get all the resource types registered from ZK > path and it throws error when `DelegationToken` resource is not defined in > `ResourceType` of client's Kafka version(which is earlier than 1.1.x) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
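For illustration, a sketch of the kind of tolerant handling that avoids this failure (not necessarily the merged fix): while loading the ACL cache, skip resource type names the local ResourceType enum does not know, such as DelegationToken on a pre-1.1 client, instead of throwing.
{code:scala}
object AclCacheLoadSketch {
  private val knownResourceTypes = Set("Topic", "Group", "Cluster", "TransactionalId")

  // Keep only the resource type names this (possibly older) client understands.
  def filterKnown(resourceTypeNamesInZk: Seq[String]): Seq[String] =
    resourceTypeNamesInZk.filter(knownResourceTypes.contains)
}
{code}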
[jira] [Created] (KAFKA-7482) LeaderAndIsrRequest should be sent to the shutting down broker
Jun Rao created KAFKA-7482: -- Summary: LeaderAndIsrRequest should be sent to the shutting down broker Key: KAFKA-7482 URL: https://issues.apache.org/jira/browse/KAFKA-7482 Project: Kafka Issue Type: Bug Affects Versions: 2.0.0, 1.1.0 Reporter: Jun Rao Assignee: Jun Rao We introduced a regression in KAFKA-5642 in 1.1. Before 1.1, during a controlled shutdown, the LeaderAndIsrRequest is sent to the shutting down broker to inform it that it's no longer the leader for partitions whose leader have been moved. After 1.1, such LeaderAndIsrRequest is no longer sent to the shutting down broker. This can delay the time for the client to find out the new leader. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-5995) Rename AlterReplicaDir to AlterReplicaDirs
[ https://issues.apache.org/jira/browse/KAFKA-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5995. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 3993 [https://github.com/apache/kafka/pull/3993] > Rename AlterReplicaDir to AlterReplicaDirs > -- > > Key: KAFKA-5995 > URL: https://issues.apache.org/jira/browse/KAFKA-5995 > Project: Kafka > Issue Type: Bug >Reporter: Dong Lin >Assignee: Dong Lin > Fix For: 1.0.0 > > > This is needed to follow the naming convention of other AdminClient methods > that are plural. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-5767) Kafka server should halt if IBP < 1.0.0 and there is log directory failure
[ https://issues.apache.org/jira/browse/KAFKA-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5767. Resolution: Fixed Issue resolved by pull request 3718 [https://github.com/apache/kafka/pull/3718] > Kafka server should halt if IBP < 1.0.0 and there is log directory failure > -- > > Key: KAFKA-5767 > URL: https://issues.apache.org/jira/browse/KAFKA-5767 > Project: Kafka > Issue Type: Bug >Reporter: Dong Lin >Assignee: Dong Lin >Priority: Critical > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6071) Use ZookeeperClient in LogManager
Jun Rao created KAFKA-6071: -- Summary: Use ZookeeperClient in LogManager Key: KAFKA-6071 URL: https://issues.apache.org/jira/browse/KAFKA-6071 Project: Kafka Issue Type: Sub-task Components: core Affects Versions: 1.1.0 Reporter: Jun Rao We want to replace the usage of ZkUtils in LogManager with ZookeeperClient. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6072) Use ZookeeperClient in GroupCoordinator
Jun Rao created KAFKA-6072: -- Summary: Use ZookeeperClient in GroupCoordinator Key: KAFKA-6072 URL: https://issues.apache.org/jira/browse/KAFKA-6072 Project: Kafka Issue Type: Sub-task Components: core Affects Versions: 1.1.0 Reporter: Jun Rao We want to replace the usage of ZkUtils in GroupCoordinator with ZookeeperClient. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6073) Use ZookeeperClient in KafkaApis
Jun Rao created KAFKA-6073: -- Summary: Use ZookeeperClient in KafkaApis Key: KAFKA-6073 URL: https://issues.apache.org/jira/browse/KAFKA-6073 Project: Kafka Issue Type: Sub-task Components: core Affects Versions: 1.1.0 Reporter: Jun Rao Fix For: 1.1.0 We want to replace the usage of ZkUtils with ZookeeperClient in KafkaApis. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6074) Use ZookeeperClient in ReplicaManager and Partition
Jun Rao created KAFKA-6074: -- Summary: Use ZookeeperClient in ReplicaManager and Partition Key: KAFKA-6074 URL: https://issues.apache.org/jira/browse/KAFKA-6074 Project: Kafka Issue Type: Sub-task Components: core Affects Versions: 1.1.0 Reporter: Jun Rao Fix For: 1.1.0 We want to replace the usage of ZkUtils with ZookeeperClient in ReplicaManager and Partition. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-5642) Use async ZookeeperClient in Controller
[ https://issues.apache.org/jira/browse/KAFKA-5642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5642. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 3765 [https://github.com/apache/kafka/pull/3765] > Use async ZookeeperClient in Controller > --- > > Key: KAFKA-5642 > URL: https://issues.apache.org/jira/browse/KAFKA-5642 > Project: Kafka > Issue Type: Sub-task >Reporter: Onur Karaman >Assignee: Onur Karaman > Fix For: 1.1.0 > > > Synchronous zookeeper writes means that we wait an entire round trip before > doing the next write. These synchronous writes are happening at a > per-partition granularity in several places, so partition-heavy clusters > suffer from the controller doing many sequential round trips to zookeeper. > * PartitionStateMachine.electLeaderForPartition updates leaderAndIsr in > zookeeper on transition to OnlinePartition. This gets triggered per-partition > sequentially with synchronous writes during controlled shutdown of the > shutting down broker's replicas for which it is the leader. > * ReplicaStateMachine updates leaderAndIsr in zookeeper on transition to > OfflineReplica when calling KafkaController.removeReplicaFromIsr. This gets > triggered per-partition sequentially with synchronous writes for failed or > controlled shutdown brokers. > KAFKA-5501 introduced an async ZookeeperClient that encourages pipelined > requests to zookeeper. We should replace ZkClient's usage with this client. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
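For illustration, a sketch of the pipelining idea independent of the real ZookeeperClient API: fire all updates asynchronously, then wait once for all responses, so N updates cost roughly one round trip of latency rather than N sequential round trips. The `write` function is a stand-in for an asynchronous zookeeper call.
{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration
import scala.concurrent.ExecutionContext.Implicits.global

object PipelinedWritesSketch {
  def updateAll[A, B](items: Seq[A], write: A => Future[B], timeout: FiniteDuration): Seq[B] = {
    val inFlight = items.map(write)                   // send every request up front
    Await.result(Future.sequence(inFlight), timeout)  // block once for all responses
  }
}
{code}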
[jira] [Resolved] (KAFKA-4444) Aggregate requests sent from controller to broker during controlled shutdown
[ https://issues.apache.org/jira/browse/KAFKA-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-4444. Resolution: Duplicate This is now fixed in KAFKA-5642. > Aggregate requests sent from controller to broker during controlled shutdown > > > Key: KAFKA-4444 > URL: https://issues.apache.org/jira/browse/KAFKA-4444 > Project: Kafka > Issue Type: Bug >Reporter: Dong Lin >Assignee: Dong Lin > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-3038) Speeding up partition reassignment after broker failure
[ https://issues.apache.org/jira/browse/KAFKA-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-3038. Resolution: Duplicate This is now fixed in KAFKA-5642. > Speeding up partition reassignment after broker failure > --- > > Key: KAFKA-3038 > URL: https://issues.apache.org/jira/browse/KAFKA-3038 > Project: Kafka > Issue Type: Improvement > Components: controller, core >Affects Versions: 0.9.0.0 >Reporter: Eno Thereska > > After a broker failure the controller does several writes to Zookeeper for > each partition on the failed broker. Writes are done one at a time, in closed > loop, which is slow especially under high latency networks. Zookeeper has > support for batching operations (the "multi" API). It is expected that > substituting serial writes with batched ones should reduce failure handling > time by an order of magnitude. > This is identified as an issue in > https://cwiki.apache.org/confluence/display/KAFKA/kafka+Detailed+Replication+Design+V3 > (section End-to-end latency during a broker failure) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-3083) a soft failure in controller may leave a topic partition in an inconsistent state
[ https://issues.apache.org/jira/browse/KAFKA-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-3083. Resolution: Fixed Assignee: Onur Karaman (was: Mayuresh Gharat) Fix Version/s: 1.1.0 This is now fixed in KAFKA-5642. > a soft failure in controller may leave a topic partition in an inconsistent > state > - > > Key: KAFKA-3083 > URL: https://issues.apache.org/jira/browse/KAFKA-3083 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.9.0.0 >Reporter: Jun Rao >Assignee: Onur Karaman > Labels: reliability > Fix For: 1.1.0 > > > The following sequence can happen. > 1. Broker A is the controller and is in the middle of processing a broker > change event. As part of this process, let's say it's about to shrink the isr > of a partition. > 2. Then broker A's session expires and broker B takes over as the new > controller. Broker B sends the initial leaderAndIsr request to all brokers. > 3. Broker A continues by shrinking the isr of the partition in ZK and sends > the new leaderAndIsr request to the broker (say C) that leads the partition. > Broker C will reject this leaderAndIsr since the request comes from a > controller with an older epoch. Now we could be in a situation that Broker C > thinks the isr has all replicas, but the isr stored in ZK is different. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6093) replica dir not deleted after topic deletion
Jun Rao created KAFKA-6093: -- Summary: replica dir not deleted after topic deletion Key: KAFKA-6093 URL: https://issues.apache.org/jira/browse/KAFKA-6093 Project: Kafka Issue Type: Bug Components: core Affects Versions: 1.0.0 Reporter: Jun Rao Priority: Blocker Did the following test. 1. bin/kafka-topics.sh --zookeeper localhost:2181 --topic test --partitions 1 --replication-factor 1 --create 2. bin/kafka-topics.sh --zookeeper localhost:2181 --topic test --delete Saw the following in the broker log. {code:java} [2017-10-19 17:54:43,413] INFO Log for partition test-0 is renamed to /tmp/kafka-logs/test-0.0b578c684ec540b48bf63ecec1752f02-delete and is scheduled for deletion (kafka.log.LogManager) [2017-10-19 17:55:27,034] ERROR Exception while deleting Log(/tmp/kafka-logs/test-0.0b578c684ec540b48bf63ecec1752f02-delete) in dir /tmp/kafka-logs. (kafka.log.LogManager) org.apache.kafka.common.errors.KafkaStorageException: The log for partition test-0 is offline {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6094) transient failure in testMultipleMarkersOneLeader
Jun Rao created KAFKA-6094: -- Summary: transient failure in testMultipleMarkersOneLeader Key: KAFKA-6094 URL: https://issues.apache.org/jira/browse/KAFKA-6094 Project: Kafka Issue Type: Bug Components: core Affects Versions: 1.1.0 Reporter: Jun Rao Saw the following transient failure in https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/9005/consoleFull kafka.api.TransactionsTest > testMultipleMarkersOneLeader FAILED org.apache.kafka.common.KafkaException: Cannot perform send because at least one previous transactional or idempotent request has failed with errors. at org.apache.kafka.clients.producer.internals.TransactionManager.failIfNotReadyForSend(TransactionManager.java:278) at org.apache.kafka.clients.producer.internals.TransactionManager.maybeAddPartitionToTransaction(TransactionManager.java:263) at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:805) at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:761) at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:649) at kafka.api.TransactionsTest$$anonfun$sendTransactionalMessagesWithValueRange$1.apply(TransactionsTest.scala:500) at kafka.api.TransactionsTest$$anonfun$sendTransactionalMessagesWithValueRange$1.apply(TransactionsTest.scala:499) at scala.collection.immutable.Range.foreach(Range.scala:160) at kafka.api.TransactionsTest.sendTransactionalMessagesWithValueRange(TransactionsTest.scala:499) at kafka.api.TransactionsTest.testMultipleMarkersOneLeader(TransactionsTest.scala:475) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 213 record(s) for largeTopic-8: 10474 ms has passed since batch creation plus linger time -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-6071) Use ZookeeperClient in LogManager
[ https://issues.apache.org/jira/browse/KAFKA-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6071. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 4089 [https://github.com/apache/kafka/pull/4089] > Use ZookeeperClient in LogManager > -- > > Key: KAFKA-6071 > URL: https://issues.apache.org/jira/browse/KAFKA-6071 > Project: Kafka > Issue Type: Sub-task > Components: core >Affects Versions: 1.1.0 >Reporter: Jun Rao >Assignee: Manikumar > Fix For: 1.1.0 > > > We want to replace the usage of ZkUtils in LogManager with ZookeeperClient. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6146) re-register the exist watch on PreferredReplicaElectionZNode after the preferred leader election completes
Jun Rao created KAFKA-6146: -- Summary: re-register the exist watch on PreferredReplicaElectionZNode after the preferred leader election completes Key: KAFKA-6146 URL: https://issues.apache.org/jira/browse/KAFKA-6146 Project: Kafka Issue Type: Sub-task Affects Versions: 1.1.0 Reporter: Jun Rao Fix For: 1.1.0 Currently, after the PreferredReplicaElectionZNode is removed, we don't register the exist watcher on the path again. This means that future preferred replica election event will be missed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
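For illustration, a sketch of the intended behavior with a placeholder zookeeper client interface: after the election znode is handled and deleted, the exists watch has to be registered again, otherwise the next creation of the znode goes unnoticed.
{code:scala}
object ExistsWatchSketch {
  // Minimal stand-in for the zookeeper client used by the controller.
  trait Zk {
    def registerExistsWatch(path: String, onCreated: () => Unit): Unit
    def delete(path: String): Unit
  }

  val electionPath = "/admin/preferred_replica_election"

  def handlePreferredReplicaElection(zk: Zk, runElection: () => Unit): Unit = {
    runElection()
    zk.delete(electionPath)
    // Re-arm the watch so a future write to the znode triggers another election.
    zk.registerExistsWatch(electionPath, () => handlePreferredReplicaElection(zk, runElection))
  }
}
{code}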
[jira] [Resolved] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-4084. Resolution: Fixed Fix Version/s: 1.1.0 The auto leader balancing logic now uses the async ZK api and batches the requests from the controller to the brokers. So, the process should be much faster with many partitions. Closing this for now. > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6184) report a metric of the lag between the consumer offset and the start offset of the log
Jun Rao created KAFKA-6184: -- Summary: report a metric of the lag between the consumer offset and the start offset of the log Key: KAFKA-6184 URL: https://issues.apache.org/jira/browse/KAFKA-6184 Project: Kafka Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Jun Rao Currently, the consumer reports a metric of the lag between the high watermark of a log and the consumer offset. It will be useful to report a similar lag metric between the consumer offset and the start offset of the log. If this latter lag gets close to 0, it's an indication that the consumer may lose data soon. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
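For illustration, the proposed metric boils down to a second difference alongside the existing lag: the lead of the consumed offset over the log start offset, where a lead near zero means the consumer is about to fall off the tail of the log. PartitionOffsets is a simplified stand-in for the consumer's internal state.
{code:scala}
object ConsumerLeadSketch {
  final case class PartitionOffsets(logStartOffset: Long, highWatermark: Long, consumedOffset: Long)

  def recordsLag(p: PartitionOffsets): Long  = p.highWatermark - p.consumedOffset  // existing metric
  def recordsLead(p: PartitionOffsets): Long = p.consumedOffset - p.logStartOffset // proposed metric
}
{code}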
[jira] [Resolved] (KAFKA-6175) AbstractIndex should cache index file to avoid unnecessary disk access during resize()
[ https://issues.apache.org/jira/browse/KAFKA-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-6175. Resolution: Fixed > AbstractIndex should cache index file to avoid unnecessary disk access during > resize() > -- > > Key: KAFKA-6175 > URL: https://issues.apache.org/jira/browse/KAFKA-6175 > Project: Kafka > Issue Type: Improvement >Reporter: Dong Lin >Assignee: Dong Lin > Fix For: 1.0.1 > > > Currently when we shutdown a broker, we will call AbstractIndex.resize() for > all segments on the broker, regardless of whether the log segment is active > or not. AbstractIndex.resize() incurs raf.setLength(), which is expensive > because it accesses disks. If we do a threaddump during either > LogManger.shutdown() or LogManager.loadLogs(), most threads are in RUNNABLE > state at java.io.RandomAccessFile.setLength(). > This patch intends to speed up broker startup and shutdown time by skipping > AbstractIndex.resize() for inactive log segments. > Here is the time of LogManager.shutdown() in various settings. In all these > tests, broker has roughly 6k partitions and 19k segments. > - If broker does not have this patch and KAFKA-6172, LogManager.shutdown() > takes 69 seconds > - If broker has KAFKA-6172 but not this patch, LogManager.shutdown() takes 21 > seconds. > - If broker has KAFKA-6172 and this patch, LogManager.shutdown() takes 1.6 > seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-6320) move ZK metrics in KafkaHealthCheck to ZookeeperClient
Jun Rao created KAFKA-6320: -- Summary: move ZK metrics in KafkaHealthCheck to ZookeeperClient Key: KAFKA-6320 URL: https://issues.apache.org/jira/browse/KAFKA-6320 Project: Kafka Issue Type: Sub-task Affects Versions: 1.0.0 Reporter: Jun Rao In KAFKA-5473, we will be de-commissioning the usage of KafkaHealthCheck. So, we need to move the ZK metrics SessionState and ZooKeeper${eventType}PerSec in that class to somewhere else (e.g. ZookeeperClient). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-7864) AdminZkClient.validateTopicCreate() should validate that partitions are 0-based
Jun Rao created KAFKA-7864: -- Summary: AdminZkClient.validateTopicCreate() should validate that partitions are 0-based Key: KAFKA-7864 URL: https://issues.apache.org/jira/browse/KAFKA-7864 Project: Kafka Issue Type: Improvement Reporter: Jun Rao AdminZkClient.validateTopicCreate() currently doesn't validate that partition ids in a topic are consecutive, starting from 0. The client code depends on that. So, it would be useful to tighten up the check. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
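For illustration, a sketch of the tightened check: the partition ids of a topic must be exactly 0..n-1. Standalone code, not the actual AdminZkClient.validateTopicCreate() implementation.
{code:scala}
object PartitionIdCheckSketch {
  // Returns an error message if the partition ids are not consecutive starting at 0.
  def validatePartitionIds(partitionIds: Set[Int]): Option[String] = {
    val expected = (0 until partitionIds.size).toSet
    if (partitionIds == expected) None
    else Some(s"partition ids must be 0..${partitionIds.size - 1}, got ${partitionIds.toSeq.sorted.mkString(",")}")
  }
}
{code}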
[jira] [Resolved] (KAFKA-7838) improve logging in Partition.maybeShrinkIsr()
[ https://issues.apache.org/jira/browse/KAFKA-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7838. Resolution: Fixed Fix Version/s: 2.2.0 Merged the PR to trunk. > improve logging in Partition.maybeShrinkIsr() > - > > Key: KAFKA-7838 > URL: https://issues.apache.org/jira/browse/KAFKA-7838 > Project: Kafka > Issue Type: Improvement >Reporter: Jun Rao >Assignee: Dhruvil Shah >Priority: Major > Fix For: 2.2.0 > > > When we take a replica out of ISR, it would be useful to further log the > fetch offset of the out-of-sync replica and the leader's HW at that point. > This could be useful when the admin needs to manually enable unclean leader > election. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7837) maybeShrinkIsr may not reflect OfflinePartitions immediately
[ https://issues.apache.org/jira/browse/KAFKA-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7837. Resolution: Fixed Fix Version/s: 2.2.0 Merged the PR to trunk. > maybeShrinkIsr may not reflect OfflinePartitions immediately > > > Key: KAFKA-7837 > URL: https://issues.apache.org/jira/browse/KAFKA-7837 > Project: Kafka > Issue Type: Improvement >Reporter: Jun Rao >Assignee: Dhruvil Shah >Priority: Major > Fix For: 2.2.0 > > > When a partition is marked offline due to a failed disk, the leader is > supposed to not shrink its ISR any more. In ReplicaManager.maybeShrinkIsr(), > we iterate through all non-offline partitions to shrink the ISR. If an ISR > needs to shrink, we need to write the new ISR to ZK, which can take a bit of > time. In this window, some partitions could now be marked as offline, but may > not be picked up by the iterator since it only reflects the state at that > point. This can cause all in-sync followers to be dropped out of ISR > unnecessarily and prevents a clean leader election. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7283) mmap indexes lazily and skip sanity check for segments below recovery point
[ https://issues.apache.org/jira/browse/KAFKA-7283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7283. Resolution: Fixed Fix Version/s: 2.3.0 Merged the PR to trunk. > mmap indexes lazily and skip sanity check for segments below recovery point > --- > > Key: KAFKA-7283 > URL: https://issues.apache.org/jira/browse/KAFKA-7283 > Project: Kafka > Issue Type: New Feature >Reporter: Zhanxiang (Patrick) Huang >Assignee: Zhanxiang (Patrick) Huang >Priority: Major > Fix For: 2.3.0 > > > This is a follow-up ticket for KIP-263. > Currently broker will mmap the index files, read the length as well as the > last entry of the file, and sanity check index files of all log segments in > the log directory after the broker is started. These operations can be slow > because broker needs to open index file and read data into page cache. In > this case, the time to restart a broker will increase proportional to the > number of segments in the log directory. > Per the KIP discussion, we think we can skip sanity check for segments below > the recovery point since Kafka does not provide guarantee for segments > already flushed to disk and sanity checking only index file benefits little > when the segment is also corrupted because of disk failure. Therefore, we can > make the following changes to improve broker startup time: > # Mmap the index file and populate fields of the index file on-demand rather > than performing costly disk operations when creating the index object on > broker startup. > # Skip sanity checks on indexes of segments below the recovery point. > With these changes, the broker startup time will increase only proportional > to the number of partitions in the log directly after cleaned shutdown > because only active segments are mmaped and sanity checked. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
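For illustration, a sketch of the lazy-loading idea with a placeholder index type: keep only the file reference at construction, defer the mmap (and any sanity check) until first access, and decide from the recovery point whether to sanity check at all. Not the actual AbstractIndex code.
{code:scala}
import java.io.File

object LazyIndexSketch {
  final class LazyIndex(file: File, needsSanityCheck: Boolean, load: File => AnyRef) {
    // Nothing is mapped at construction; broker startup only touches file names.
    lazy val underlying: AnyRef = {
      val idx = load(file) // mmap happens here, on first access
      if (needsSanityCheck) { /* verify file length and last entry here */ }
      idx
    }
  }

  // Segments below the recovery point were cleanly flushed, so skip their sanity check.
  def forSegment(file: File, segmentBaseOffset: Long, recoveryPoint: Long, load: File => AnyRef): LazyIndex =
    new LazyIndex(file, needsSanityCheck = segmentBaseOffset >= recoveryPoint, load)
}
{code}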
[jira] [Resolved] (KAFKA-7864) AdminZkClient.validateTopicCreate() should validate that partitions are 0-based
[ https://issues.apache.org/jira/browse/KAFKA-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7864. Resolution: Fixed Fix Version/s: 2.3.0 Merged the PR to trunk. > AdminZkClient.validateTopicCreate() should validate that partitions are > 0-based > --- > > Key: KAFKA-7864 > URL: https://issues.apache.org/jira/browse/KAFKA-7864 > Project: Kafka > Issue Type: Improvement >Reporter: Jun Rao >Assignee: Ryan >Priority: Major > Labels: newbie > Fix For: 2.3.0 > > > AdminZkClient.validateTopicCreate() currently doesn't validate that partition > ids in a topic are consecutive, starting from 0. The client code depends on > that. So, it would be useful to tighten up the check. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7983) supporting replication.throttled.replicas in dynamic broker configuration
Jun Rao created KAFKA-7983: -- Summary: supporting replication.throttled.replicas in dynamic broker configuration Key: KAFKA-7983 URL: https://issues.apache.org/jira/browse/KAFKA-7983 Project: Kafka Issue Type: New Feature Components: core Reporter: Jun Rao In [KIP-226|https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration#KIP-226-DynamicBrokerConfiguration-DefaultTopicconfigs], we added the support to change broker defaults dynamically. However, it didn't support changing leader.replication.throttled.replicas and follower.replication.throttled.replicas. These 2 configs were introduced in [KIP-73|https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas] and control the set of topic partitions on which replication throttling will be engaged. One useful case is to be able to set a default value for both configs to * to allow throttling to be engaged for all topic partitions. Currently, the static default values for both configs are ignored for replication throttling; it would be useful to fix that as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7987) a broker's ZK session may die on transient auth failure
Jun Rao created KAFKA-7987: -- Summary: a broker's ZK session may die on transient auth failure Key: KAFKA-7987 URL: https://issues.apache.org/jira/browse/KAFKA-7987 Project: Kafka Issue Type: Improvement Reporter: Jun Rao After a transient network issue, we saw the following log in a broker. {code:java} [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn) [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. (kafka.zookeeper.ZooKeeperClient) {code} The network issue prevented the broker from communicating to ZK. The broker's ZK session then expired, but the broker didn't know that yet since it couldn't establish a connection to ZK. When the network was back, the broker tried to establish a connection to ZK, but failed due to auth failure (likely due to a transient KDC issue). The current logic just ignores the auth failure without trying to create a new ZK session. Then the broker will be permanently in a state that it's alive, but not registered in ZK. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7986) distinguish the logging from different ZooKeeperClient instances
Jun Rao created KAFKA-7986: -- Summary: distinguish the logging from different ZooKeeperClient instances Key: KAFKA-7986 URL: https://issues.apache.org/jira/browse/KAFKA-7986 Project: Kafka Issue Type: Improvement Reporter: Jun Rao It's possible for each broker to have more than 1 ZooKeeperClient instance. For example, SimpleAclAuthorizer creates a separate ZooKeeperClient instance when configured. It would be useful to distinguish the logging from different ZooKeeperClient instances. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-7956) Avoid blocking in ShutdownableThread.awaitShutdown if the thread has not been started
[ https://issues.apache.org/jira/browse/KAFKA-7956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7956. Resolution: Fixed Assignee: Gardner Vickers Fix Version/s: 2.3.0 Merged the PR to trunk. > Avoid blocking in ShutdownableThread.awaitShutdown if the thread has not been > started > - > > Key: KAFKA-7956 > URL: https://issues.apache.org/jira/browse/KAFKA-7956 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: Gardner Vickers >Assignee: Gardner Vickers >Priority: Minor > Fix For: 2.3.0 > > > Opening this Jira to track [https://github.com/apache/kafka/pull/6218], since > it's a rather subtle change. > In some test cases it's desirable to instantiate a subclass of > `ShutdownableThread` without starting it. Since most subclasses of > `ShutdownableThread` put cleanup logic in `ShutdownableThread.shutdown()`, > being able to call `shutdown()` on the non-running thread would be useful. > This change allows us to avoid blocking in `ShutdownableThread.shutdown()` if > the thread's `run()` method has not been called. We also add a check that > `initiateShutdown()` was called before `awaitShutdown()`, to protect against > the case where a user calls `awaitShutdown()` before the thread has been > started, and unexpectedly is not blocked on the thread shutting down. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
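For illustration, a sketch of the described behavior on top of a plain Thread rather than kafka.utils.ShutdownableThread: awaitShutdown() blocks only if the thread was actually started, and insists that initiateShutdown() was called first.
{code:scala}
import java.util.concurrent.CountDownLatch

object ShutdownSketch {
  abstract class SketchThread extends Thread {
    @volatile private var started = false
    @volatile private var shutdownInitiated = false
    private val shutdownComplete = new CountDownLatch(1)

    def doWork(): Unit

    override def start(): Unit = { started = true; super.start() }

    override def run(): Unit =
      try while (!shutdownInitiated) doWork()
      finally shutdownComplete.countDown()

    def initiateShutdown(): Unit = shutdownInitiated = true

    def awaitShutdown(): Unit = {
      require(shutdownInitiated, "initiateShutdown must be called before awaitShutdown")
      if (started) shutdownComplete.await() // do not block on a run() that never began
    }

    def shutdown(): Unit = { initiateShutdown(); awaitShutdown() }
  }
}
{code}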
[jira] [Resolved] (KAFKA-7977) Flaky Test ReassignPartitionsClusterTest#shouldOnlyThrottleMovingReplicas
[ https://issues.apache.org/jira/browse/KAFKA-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-7977. Resolution: Fixed The PR in KAFKA-8018 is merged, which should reduce the chance for ZK session expiration. Closing this issue for now. If the same issue occurs again, feel free to reopen the ticket. > Flaky Test ReassignPartitionsClusterTest#shouldOnlyThrottleMovingReplicas > - > > Key: KAFKA-7977 > URL: https://issues.apache.org/jira/browse/KAFKA-7977 > Project: Kafka > Issue Type: Bug > Components: admin, unit tests >Affects Versions: 2.2.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > Fix For: 2.3.0, 2.2.1 > > > To get stable nightly builds for `2.2` release, I create tickets for all > observed test failures. > [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/25/] > {quote}org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired for /brokers/topics/topic1 at > org.apache.zookeeper.KeeperException.create(KeeperException.java:130) at > org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at > kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:534) at > kafka.zk.KafkaZkClient.$anonfun$getReplicaAssignmentForTopics$2(KafkaZkClient.scala:579) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > scala.collection.TraversableLike.flatMap(TraversableLike.scala:244) at > scala.collection.TraversableLike.flatMap$(TraversableLike.scala:241) at > scala.collection.AbstractTraversable.flatMap(Traversable.scala:108) at > kafka.zk.KafkaZkClient.getReplicaAssignmentForTopics(KafkaZkClient.scala:574) > at > kafka.admin.ReassignPartitionsCommand$.parseAndValidate(ReassignPartitionsCommand.scala:338) > at > kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:209) > at > kafka.admin.ReassignPartitionsClusterTest.shouldOnlyThrottleMovingReplicas(ReassignPartitionsClusterTest.scala:343){quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (KAFKA-8022) Flaky Test RequestQuotaTest#testExemptRequestTime
[ https://issues.apache.org/jira/browse/KAFKA-8022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-8022. Resolution: Fixed Fix Version/s: (was: 0.11.0.4) 2.2.1 2.3.0 The PR is merged. Closing it for now. > Flaky Test RequestQuotaTest#testExemptRequestTime > - > > Key: KAFKA-8022 > URL: https://issues.apache.org/jira/browse/KAFKA-8022 > Project: Kafka > Issue Type: Bug > Components: core, unit tests >Affects Versions: 0.11.0.3 >Reporter: Matthias J. Sax >Assignee: Jun Rao >Priority: Critical > Labels: flaky-test > Fix For: 2.3.0, 2.2.1 > > > [https://builds.apache.org/blue/organizations/jenkins/kafka-trunk-jdk11/detail/kafka-trunk-jdk11/328/tests] > {quote}kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for > connection while in state: CONNECTING > at > kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:242) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at > kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:238) > at kafka.zookeeper.ZooKeeperClient.(ZooKeeperClient.scala:96) > at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1825) > at kafka.zk.ZooKeeperTestHarness.setUp(ZooKeeperTestHarness.scala:59) > at > kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:90) > at kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:81) > at kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:73) > at kafka.server.RequestQuotaTest.setUp(RequestQuotaTest.scala:81){quote} > STDOUT: > {quote}[2019-03-01 00:40:47,090] ERROR [KafkaApi-0] Error when handling > request: clientId=unauthorized-CONTROLLED_SHUTDOWN, correlationId=1, > api=CONTROLLED_SHUTDOWN, body=\{broker_id=0,broker_epoch=9223372036854775807} > (kafka.server.KafkaApis:76) > org.apache.kafka.common.errors.ClusterAuthorizationException: Request > Request(processor=2, connectionId=127.0.0.1:37894-127.0.0.1:54838-2, > session=Session(User:Unauthorized,/127.0.0.1), > listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, > buffer=null) is not authorized. > [2019-03-01 00:40:47,090] ERROR [KafkaApi-0] Error when handling request: > clientId=unauthorized-STOP_REPLICA, correlationId=1, api=STOP_REPLICA, > body=\{controller_id=0,controller_epoch=2147483647,broker_epoch=9223372036854775807,delete_partitions=true,partitions=[{topic=topic-1,partition_ids=[0]}]} > (kafka.server.KafkaApis:76) > org.apache.kafka.common.errors.ClusterAuthorizationException: Request > Request(processor=0, connectionId=127.0.0.1:37894-127.0.0.1:54822-1, > session=Session(User:Unauthorized,/127.0.0.1), > listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, > buffer=null) is not authorized. 
> [2019-03-01 00:40:47,091] ERROR [KafkaApi-0] Error when handling request: > clientId=unauthorized-UPDATE_METADATA, correlationId=1, api=UPDATE_METADATA, > body=\{controller_id=0,controller_epoch=2147483647,broker_epoch=9223372036854775807,topic_states=[{topic=topic-1,partition_states=[{partition=0,controller_epoch=2147483647,leader=0,leader_epoch=2147483647,isr=[0],zk_version=2,replicas=[0],offline_replicas=[]}]}],live_brokers=[\{id=0,end_points=[{port=0,host=localhost,listener_name=PLAINTEXT,security_protocol_type=0}],rack=null}]} > (kafka.server.KafkaApis:76) > org.apache.kafka.common.errors.ClusterAuthorizationException: Request > Request(processor=1, connectionId=127.0.0.1:37894-127.0.0.1:54836-2, > session=Session(User:Unauthorized,/127.0.0.1), > listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, > buffer=null) is not authorized. > [2019-03-01 00:40:47,090] ERROR [KafkaApi-0] Error when handling request: > clientId=unauthorized-LEADER_AND_ISR, correlationId=1, api=LEADER_AND_ISR, > body=\{controller_id=0,controller_epoch=2147483647,broker_epoch=9223372036854775807,topic_states=[{topic=topic-1,partition_states=[{partition=0,controller_epoch=2147483647,leader=0,leader_epoch=2147483647,isr=[0],zk_version=2,replicas=[0],is_new=true}]}],live_leaders=[\{id=0,host=localhost,port=0}]} > (kafka.server.KafkaApis:76) > org.apache.kafka.common.errors.ClusterAuthorizationException: Request > Request(processor=0, connectionId=127.0.0.1:37894-127.0.0.1:54834-2, > session=Session(User:Unauthorized,/127.0.0.1), > listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, > buffer=null) is not authorized. > [2019-03-01 00:40:47,106] ERROR [KafkaApi-0] Error when handling request: > clientId=unauthorized-WRITE_TXN_MARKERS, correlationId=1, > api=WRITE_TXN_MARKERS, body=\{transaction_markers=[]} > (kafka.server.KafkaApi
[jira] [Commented] (KAFKA-5413) Log cleaner fails due to large offset in segment file
[ https://issues.apache.org/jira/browse/KAFKA-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045572#comment-16045572 ] Jun Rao commented on KAFKA-5413: Could you attach the .index file of the 2 segments too? > Log cleaner fails due to large offset in segment file > - > > Key: KAFKA-5413 > URL: https://issues.apache.org/jira/browse/KAFKA-5413 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.0, 0.10.2.1 > Environment: Ubuntu 14.04 LTS, Oracle Java 8u92, kafka_2.11-0.10.2.0 >Reporter: Nicholas Ngorok > Labels: reliability > Attachments: .log, 002147422683.log > > > The log cleaner thread in our brokers is failing with the trace below > {noformat} > [2017-06-08 15:49:54,822] INFO {kafka-log-cleaner-thread-0} Cleaner 0: > Cleaning segment 0 in log __consumer_offsets-12 (largest timestamp Thu Jun 08 > 15:48:59 PDT 2017) into 0, retaining deletes. (kafka.log.LogCleaner) > [2017-06-08 15:49:54,822] INFO {kafka-log-cleaner-thread-0} Cleaner 0: > Cleaning segment 2147343575 in log __consumer_offsets-12 (largest timestamp > Thu Jun 08 15:49:06 PDT 2017) into 0, retaining deletes. > (kafka.log.LogCleaner) > [2017-06-08 15:49:54,834] ERROR {kafka-log-cleaner-thread-0} > [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner) > java.lang.IllegalArgumentException: requirement failed: largest offset in > message set can not be safely converted to relative offset. > at scala.Predef$.require(Predef.scala:224) > at kafka.log.LogSegment.append(LogSegment.scala:109) > at kafka.log.Cleaner.cleanInto(LogCleaner.scala:478) > at > kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:405) > at > kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:401) > at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:363) > at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:362) > at scala.collection.immutable.List.foreach(List.scala:381) > at kafka.log.Cleaner.clean(LogCleaner.scala:362) > at > kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:241) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:220) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63) > [2017-06-08 15:49:54,835] INFO {kafka-log-cleaner-thread-0} > [kafka-log-cleaner-thread-0], Stopped (kafka.log.LogCleaner) > {noformat} > This seems to point at the specific line [here| > https://github.com/apache/kafka/blob/0.11.0/core/src/main/scala/kafka/log/LogSegment.scala#L92] > in the kafka src where the difference is actually larger than MAXINT as both > baseOffset and offset are of type long. 
It was introduced in this [pr| > https://github.com/apache/kafka/pull/2210/files/56d1f8196b77a47b176b7bbd1e4220a3be827631] > These were the outputs of dumping the first two log segments > {noformat} > :~$ /usr/bin/kafka-run-class kafka.tools.DumpLogSegments --deep-iteration > --files /kafka-logs/__consumer_offsets-12/000 > 0.log > Dumping /kafka-logs/__consumer_offsets-12/.log > Starting offset: 0 > offset: 1810054758 position: 0 NoTimestampType: -1 isvalid: true payloadsize: > -1 magic: 0 compresscodec: NONE crc: 3127861909 keysize: 34 > :~$ /usr/bin/kafka-run-class kafka.tools.DumpLogSegments --deep-iteration > --files /kafka-logs/__consumer_offsets-12/000 > 0002147343575.log > Dumping /kafka-logs/__consumer_offsets-12/002147343575.log > Starting offset: 2147343575 > offset: 2147539884 position: 0 NoTimestampType: -1 isvalid: true paylo > adsize: -1 magic: 0 compresscodec: NONE crc: 2282192097 keysize: 34 > {noformat} > My guess is that since 2147539884 is larger than MAXINT, we are hitting this > exception. Was there a specific reason, this check was added in 0.10.2? > E.g. if the first offset is a key = "key 0" and then we have MAXINT + 1 of > "key 1" following, wouldn't we run into this situation whenever the log > cleaner runs? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (KAFKA-5405) Request log should log throttle time
[ https://issues.apache.org/jira/browse/KAFKA-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5405. Resolution: Fixed Fix Version/s: 0.11.0.0 Merged to trunk and 0.11.0. > Request log should log throttle time > > > Key: KAFKA-5405 > URL: https://issues.apache.org/jira/browse/KAFKA-5405 > Project: Kafka > Issue Type: Improvement >Affects Versions: 0.11.0.0 >Reporter: Jun Rao >Assignee: huxihx > Labels: newbie > Fix For: 0.11.0.0 > > > In RequestChannel, when logging the request and the latency, it would be > useful to include the apiThrottleTime as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046650#comment-16046650 ] Jun Rao commented on KAFKA-5431: Interesting. First, in general, if there is an IOException during writing to the log, the broker will shut down immediately. Second, if the cleaner hits an IOException, currently, we just abort the current cleaning job. The existing log should be intact. The "invalid message" from the fetch follower seems a bit weird. That suggests that the log segment at offset 0 is corrupted somehow. Could you use the DumpLogSegment tool (https://cwiki.apache.org/confluence/display/KAFKA/System+Tools#SystemTools-DumpLogSegment) on that segment in the leader to see if there is any log corruption? > LogCleaner stopped due to > org.apache.kafka.common.errors.CorruptRecordException > --- > > Key: KAFKA-5431 > URL: https://issues.apache.org/jira/browse/KAFKA-5431 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.1 >Reporter: Carsten Rietz > Labels: reliability > > Hey all, > i have a strange problem with our uat cluster of 3 kafka brokers. > the __consumer_offsets topic was replicated to two instances and our disks > ran full due to a wrong configuration of the log cleaner. We fixed the > configuration and updated from 0.10.1.1 to 0.10.2.1 . > Today i increased the replication of the __consumer_offsets topic to 3 and > triggered replication to the third cluster via kafka-reassign-partitions.sh. > That went well but i get many errors like > {code} > [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for > partition [__consumer_offsets,18] offset 0 error Record size is less than the > minimum record overhead (14) (kafka.server.ReplicaFetcherThread) > [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for > partition [__consumer_offsets,24] offset 0 error Record size is less than the > minimum record overhead (14) (kafka.server.ReplicaFetcherThread) > {code} > Which i think are due to the full disk event. > The log cleaner threads died on these wrong messages: > {code} > [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to > (kafka.log.LogCleaner) > org.apache.kafka.common.errors.CorruptRecordException: Record size is less > than the minimum record overhead (14) > [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped > (kafka.log.LogCleaner) > {code} > Looking at the file is see that some are truncated and some are jsut empty: > $ ls -lsh 00594653.log > 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00594653.log > Sadly i do not have the logs any more from the disk full event itsself. > I have three questions: > * What is the best way to clean this up? Deleting the old log files and > restarting the brokers? > * Why did kafka not handle the disk full event well? Is this only affecting > the cleanup or may we also loose data? > * Is this maybe caused by the combination of upgrade and disk full? > And last but not least: Keep up the good work. Kafka is really performing > well while being easy to administer and has good documentation! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5413) Log cleaner fails due to large offset in segment file
[ https://issues.apache.org/jira/browse/KAFKA-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048429#comment-16048429 ] Jun Rao commented on KAFKA-5413: [~Kelvinrutt], yes, that seems to be the issue. segs.head.index.lastOffset doesn't give the precise last offset in the segment, especially in the case where the index is empty. The issue seems to be there even in 0.9.0. About the fix. LogSegment.nextOffset() gives the offset that we want, but can be expensive. Perhaps we could use the next segment's base offset instead. [~Kelvinrutt], do you think you could patch it? Until this is fixed, not sure if there is an easy way to get around the issue. > Log cleaner fails due to large offset in segment file > - > > Key: KAFKA-5413 > URL: https://issues.apache.org/jira/browse/KAFKA-5413 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.0, 0.10.2.1 > Environment: Ubuntu 14.04 LTS, Oracle Java 8u92, kafka_2.11-0.10.2.0 >Reporter: Nicholas Ngorok > Labels: reliability > Attachments: .index.cleaned, > .log, .log.cleaned, > .timeindex.cleaned, 002147422683.log > > > The log cleaner thread in our brokers is failing with the trace below > {noformat} > [2017-06-08 15:49:54,822] INFO {kafka-log-cleaner-thread-0} Cleaner 0: > Cleaning segment 0 in log __consumer_offsets-12 (largest timestamp Thu Jun 08 > 15:48:59 PDT 2017) into 0, retaining deletes. (kafka.log.LogCleaner) > [2017-06-08 15:49:54,822] INFO {kafka-log-cleaner-thread-0} Cleaner 0: > Cleaning segment 2147343575 in log __consumer_offsets-12 (largest timestamp > Thu Jun 08 15:49:06 PDT 2017) into 0, retaining deletes. > (kafka.log.LogCleaner) > [2017-06-08 15:49:54,834] ERROR {kafka-log-cleaner-thread-0} > [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner) > java.lang.IllegalArgumentException: requirement failed: largest offset in > message set can not be safely converted to relative offset. > at scala.Predef$.require(Predef.scala:224) > at kafka.log.LogSegment.append(LogSegment.scala:109) > at kafka.log.Cleaner.cleanInto(LogCleaner.scala:478) > at > kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:405) > at > kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:401) > at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:363) > at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:362) > at scala.collection.immutable.List.foreach(List.scala:381) > at kafka.log.Cleaner.clean(LogCleaner.scala:362) > at > kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:241) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:220) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63) > [2017-06-08 15:49:54,835] INFO {kafka-log-cleaner-thread-0} > [kafka-log-cleaner-thread-0], Stopped (kafka.log.LogCleaner) > {noformat} > This seems to point at the specific line [here| > https://github.com/apache/kafka/blob/0.11.0/core/src/main/scala/kafka/log/LogSegment.scala#L92] > in the kafka src where the difference is actually larger than MAXINT as both > baseOffset and offset are of type long. 
It was introduced in this [pr| > https://github.com/apache/kafka/pull/2210/files/56d1f8196b77a47b176b7bbd1e4220a3be827631] > These were the outputs of dumping the first two log segments > {noformat} > :~$ /usr/bin/kafka-run-class kafka.tools.DumpLogSegments --deep-iteration > --files /kafka-logs/__consumer_offsets-12/000 > 0.log > Dumping /kafka-logs/__consumer_offsets-12/.log > Starting offset: 0 > offset: 1810054758 position: 0 NoTimestampType: -1 isvalid: true payloadsize: > -1 magic: 0 compresscodec: NONE crc: 3127861909 keysize: 34 > :~$ /usr/bin/kafka-run-class kafka.tools.DumpLogSegments --deep-iteration > --files /kafka-logs/__consumer_offsets-12/000 > 0002147343575.log > Dumping /kafka-logs/__consumer_offsets-12/002147343575.log > Starting offset: 2147343575 > offset: 2147539884 position: 0 NoTimestampType: -1 isvalid: true paylo > adsize: -1 magic: 0 compresscodec: NONE crc: 2282192097 keysize: 34 > {noformat} > My guess is that since 2147539884 is larger than MAXINT, we are hitting this > exception. Was there a specific reason, this check was added in 0.10.2? > E.g. if the first offset is a key = "key 0" and then we have MAXINT + 1 of > "key 1" following, wouldn't we run into this situation whenever the log > cleaner runs? --
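The fix sketched in the comment above (prefer the next segment's base offset over the index's lastOffset, and only fall back to a scan for the last segment of the log) can be illustrated with a small, self-contained sketch. The Segment trait below is a hypothetical stand-in for kafka.log.LogSegment, not the actual Kafka code, and only exposes the two members the idea needs.
{code}
// Hypothetical stand-in for kafka.log.LogSegment with only the two members the idea needs.
trait Segment {
  def baseOffset: Long     // first offset that can appear in this segment
  def nextOffset(): Long   // one past the last offset; requires scanning the segment (expensive)
}

// Upper bound on the last offset of the first segment in `segs`:
// use the next segment's base offset when there is one (cheap), and only
// pay the nextOffset() scan for the final segment of the log.
def estimatedLastOffset(segs: List[Segment]): Long = segs match {
  case _ :: next :: _ => next.baseOffset - 1
  case last :: Nil    => last.nextOffset() - 1
  case Nil            => throw new IllegalArgumentException("empty segment list")
}

// The cleaner's safety check then becomes: does that offset still fit in
// Int range relative to the group's base offset?
def fitsInRelativeOffsetRange(groupBaseOffset: Long, lastOffset: Long): Boolean =
  lastOffset - groupBaseOffset <= Int.MaxValue
{code}
Because a segment's base offset is always at least one greater than the last offset of the segment before it, the next segment's base offset minus one is a safe upper bound, and it avoids re-reading index or log data.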
[jira] [Commented] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049229#comment-16049229 ] Jun Rao commented on KAFKA-5431: [~crietz], thanks for the info. I guess the preallocated size is 100MB? It's not clear why .log has the preallocated size. Normally, we shrink the file size to its actual size after rolling. In fact, the size for other segments like 0001.log does look normal. It's also interesting that each segment has only 1 message in it. Did you set a really small log segment size? > LogCleaner stopped due to > org.apache.kafka.common.errors.CorruptRecordException > --- > > Key: KAFKA-5431 > URL: https://issues.apache.org/jira/browse/KAFKA-5431 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.1 >Reporter: Carsten Rietz >Priority: Critical > Labels: reliability > Fix For: 0.11.0.1 > > > Hey all, > I have a strange problem with our uat cluster of 3 kafka brokers. > The __consumer_offsets topic was replicated to two instances and our disks > ran full due to a wrong configuration of the log cleaner. We fixed the > configuration and updated from 0.10.1.1 to 0.10.2.1. > Today I increased the replication of the __consumer_offsets topic to 3 and > triggered replication to the third cluster via kafka-reassign-partitions.sh. > That went well but I get many errors like > {code} > [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for > partition [__consumer_offsets,18] offset 0 error Record size is less than the > minimum record overhead (14) (kafka.server.ReplicaFetcherThread) > [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for > partition [__consumer_offsets,24] offset 0 error Record size is less than the > minimum record overhead (14) (kafka.server.ReplicaFetcherThread) > {code} > Which I think are due to the full disk event. > The log cleaner threads died on these wrong messages: > {code} > [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to > (kafka.log.LogCleaner) > org.apache.kafka.common.errors.CorruptRecordException: Record size is less > than the minimum record overhead (14) > [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped > (kafka.log.LogCleaner) > {code} > Looking at the files I see that some are truncated and some are just empty: > $ ls -lsh 00594653.log > 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00594653.log > Sadly I do not have the logs any more from the disk full event itself. > I have three questions: > * What is the best way to clean this up? Deleting the old log files and > restarting the brokers? > * Why did kafka not handle the disk full event well? Is this only affecting > the cleanup or may we also lose data? > * Is this maybe caused by the combination of upgrade and disk full? > And last but not least: Keep up the good work. Kafka is really performing > well while being easy to administer and has good documentation! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (KAFKA-5413) Log cleaner fails due to large offset in segment file
[ https://issues.apache.org/jira/browse/KAFKA-5413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049239#comment-16049239 ] Jun Rao commented on KAFKA-5413: [~Kelvinrutt], thanks for the patch. We normally contribute code through git pull requests (see details http://kafka.apache.org/contributing) to make the review and testing easy. Do you think you could submit a git PR instead? Thanks, > Log cleaner fails due to large offset in segment file > - > > Key: KAFKA-5413 > URL: https://issues.apache.org/jira/browse/KAFKA-5413 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.0, 0.10.2.1 > Environment: Ubuntu 14.04 LTS, Oracle Java 8u92, kafka_2.11-0.10.2.0 >Reporter: Nicholas Ngorok > Labels: reliability > Fix For: 0.11.0.1 > > Attachments: .index.cleaned, > .log, .log.cleaned, > .timeindex.cleaned, 002147422683.log, > kafka-5413.patch > > > The log cleaner thread in our brokers is failing with the trace below > {noformat} > [2017-06-08 15:49:54,822] INFO {kafka-log-cleaner-thread-0} Cleaner 0: > Cleaning segment 0 in log __consumer_offsets-12 (largest timestamp Thu Jun 08 > 15:48:59 PDT 2017) into 0, retaining deletes. (kafka.log.LogCleaner) > [2017-06-08 15:49:54,822] INFO {kafka-log-cleaner-thread-0} Cleaner 0: > Cleaning segment 2147343575 in log __consumer_offsets-12 (largest timestamp > Thu Jun 08 15:49:06 PDT 2017) into 0, retaining deletes. > (kafka.log.LogCleaner) > [2017-06-08 15:49:54,834] ERROR {kafka-log-cleaner-thread-0} > [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner) > java.lang.IllegalArgumentException: requirement failed: largest offset in > message set can not be safely converted to relative offset. > at scala.Predef$.require(Predef.scala:224) > at kafka.log.LogSegment.append(LogSegment.scala:109) > at kafka.log.Cleaner.cleanInto(LogCleaner.scala:478) > at > kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:405) > at > kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:401) > at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:363) > at kafka.log.Cleaner$$anonfun$clean$4.apply(LogCleaner.scala:362) > at scala.collection.immutable.List.foreach(List.scala:381) > at kafka.log.Cleaner.clean(LogCleaner.scala:362) > at > kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:241) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:220) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63) > [2017-06-08 15:49:54,835] INFO {kafka-log-cleaner-thread-0} > [kafka-log-cleaner-thread-0], Stopped (kafka.log.LogCleaner) > {noformat} > This seems to point at the specific line [here| > https://github.com/apache/kafka/blob/0.11.0/core/src/main/scala/kafka/log/LogSegment.scala#L92] > in the kafka src where the difference is actually larger than MAXINT as both > baseOffset and offset are of type long. 
It was introduced in this [pr| > https://github.com/apache/kafka/pull/2210/files/56d1f8196b77a47b176b7bbd1e4220a3be827631] > These were the outputs of dumping the first two log segments > {noformat} > :~$ /usr/bin/kafka-run-class kafka.tools.DumpLogSegments --deep-iteration > --files /kafka-logs/__consumer_offsets-12/000 > 0.log > Dumping /kafka-logs/__consumer_offsets-12/.log > Starting offset: 0 > offset: 1810054758 position: 0 NoTimestampType: -1 isvalid: true payloadsize: > -1 magic: 0 compresscodec: NONE crc: 3127861909 keysize: 34 > :~$ /usr/bin/kafka-run-class kafka.tools.DumpLogSegments --deep-iteration > --files /kafka-logs/__consumer_offsets-12/000 > 0002147343575.log > Dumping /kafka-logs/__consumer_offsets-12/002147343575.log > Starting offset: 2147343575 > offset: 2147539884 position: 0 NoTimestampType: -1 isvalid: true paylo > adsize: -1 magic: 0 compresscodec: NONE crc: 2282192097 keysize: 34 > {noformat} > My guess is that since 2147539884 is larger than MAXINT, we are hitting this > exception. Was there a specific reason, this check was added in 0.10.2? > E.g. if the first offset is a key = "key 0" and then we have MAXINT + 1 of > "key 1" following, wouldn't we run into this situation whenever the log > cleaner runs? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-5473) handle ZK session expiration properly when a new session can't be established
Jun Rao created KAFKA-5473: -- Summary: handle ZK session expiration properly when a new session can't be established Key: KAFKA-5473 URL: https://issues.apache.org/jira/browse/KAFKA-5473 Project: Kafka Issue Type: Sub-task Affects Versions: 0.9.0.0 Reporter: Jun Rao In https://issues.apache.org/jira/browse/KAFKA-2405, we changed the logic for handling ZK session expiration a bit. If a new ZK session can't be established after session expiration, we just log an error and continue. However, this can leave the broker in a bad state since it's up, but not registered from the controller's perspective. Replicas on this broker may never be in sync. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
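For illustration only, here is a rough sketch of the retry behaviour this ticket argues for, written against a hypothetical reconnect function rather than the real KafkaHealthcheck/zkclient code: instead of logging a fatal error once and giving up, keep retrying with bounded backoff until a new session can be established.
{code}
import scala.annotation.tailrec

// Hypothetical handler: `reconnect` is assumed to throw (e.g. UnknownHostException)
// while the network or DNS is still unavailable, and to return normally once a new
// ZK session has been established.
def handleSessionEstablishmentError(reconnect: () => Unit,
                                    initialBackoffMs: Long = 1000L,
                                    maxBackoffMs: Long = 30000L): Unit = {
  @tailrec
  def retry(delayMs: Long): Unit = {
    Thread.sleep(delayMs)
    val established =
      try { reconnect(); true }
      catch { case _: Exception => false } // log and fall through to the next attempt
    if (!established) retry(math.min(delayMs * 2, maxBackoffMs))
  }
  retry(initialBackoffMs)
}
{code}
Retrying indefinitely (or, failing that, exiting the JVM so an init system can restart the broker) both avoid the silent "up but unregistered" state described in the ticket.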
[jira] [Resolved] (KAFKA-5491) The ProducerPerformance tool should support transactions
[ https://issues.apache.org/jira/browse/KAFKA-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5491. Resolution: Fixed Fix Version/s: 0.11.0.0 Issue resolved by pull request 3400 [https://github.com/apache/kafka/pull/3400] > The ProducerPerformance tool should support transactions > > > Key: KAFKA-5491 > URL: https://issues.apache.org/jira/browse/KAFKA-5491 > Project: Kafka > Issue Type: Bug >Reporter: Apurva Mehta >Assignee: Apurva Mehta > Fix For: 0.11.0.0 > > > We should allow users of the ProducerPerformance tool to run transactional > sends. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
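As a rough idea of what transactional sends in the tool exercise, the sketch below uses the standard 0.11 producer transaction API (initTransactions/beginTransaction/commitTransaction). The topic name, transactional.id and batch size are made-up example values, not anything taken from the actual pull request.
{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.errors.ProducerFencedException

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("transactional.id", "producer-perf-txn-0") // enables idempotence and transactions

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
try {
  producer.beginTransaction()
  (0 until 1000).foreach { i =>
    producer.send(new ProducerRecord("perf-test-topic", s"key-$i", s"value-$i"))
  }
  producer.commitTransaction() // records become visible to read_committed consumers
} catch {
  case e: ProducerFencedException =>
    // fatal: another producer instance took over this transactional.id; just rethrow
    // and let the finally block close the producer
    throw e
  case e: Exception =>
    producer.abortTransaction() // non-fatal: give up on this batch and rethrow
    throw e
} finally {
  producer.close()
}
{code}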
[jira] [Resolved] (KAFKA-5542) Improve Java doc for LeaderEpochFileCache.endOffsetFor()
[ https://issues.apache.org/jira/browse/KAFKA-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5542. Resolution: Fixed Fix Version/s: 0.11.1.0 Issue resolved by pull request 3468 [https://github.com/apache/kafka/pull/3468] > Improve Java doc for LeaderEpochFileCache.endOffsetFor() > > > Key: KAFKA-5542 > URL: https://issues.apache.org/jira/browse/KAFKA-5542 > Project: Kafka > Issue Type: Improvement >Reporter: Dong Lin >Assignee: Ben Stopford > Fix For: 0.11.1.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-5431) LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5431. Resolution: Fixed Fix Version/s: 0.11.1.0 Issue resolved by pull request 3525 [https://github.com/apache/kafka/pull/3525] > LogCleaner stopped due to > org.apache.kafka.common.errors.CorruptRecordException > --- > > Key: KAFKA-5431 > URL: https://issues.apache.org/jira/browse/KAFKA-5431 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 0.10.2.1 >Reporter: Carsten Rietz >Assignee: huxihx > Labels: reliability > Fix For: 0.11.1.0, 0.11.0.1 > > > Hey all, > I have a strange problem with our uat cluster of 3 kafka brokers. > The __consumer_offsets topic was replicated to two instances and our disks > ran full due to a wrong configuration of the log cleaner. We fixed the > configuration and updated from 0.10.1.1 to 0.10.2.1. > Today I increased the replication of the __consumer_offsets topic to 3 and > triggered replication to the third cluster via kafka-reassign-partitions.sh. > That went well but I get many errors like > {code} > [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for > partition [__consumer_offsets,18] offset 0 error Record size is less than the > minimum record overhead (14) (kafka.server.ReplicaFetcherThread) > [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for > partition [__consumer_offsets,24] offset 0 error Record size is less than the > minimum record overhead (14) (kafka.server.ReplicaFetcherThread) > {code} > Which I think are due to the full disk event. > The log cleaner threads died on these wrong messages: > {code} > [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to > (kafka.log.LogCleaner) > org.apache.kafka.common.errors.CorruptRecordException: Record size is less > than the minimum record overhead (14) > [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped > (kafka.log.LogCleaner) > {code} > Looking at the files I see that some are truncated and some are just empty: > $ ls -lsh 00594653.log > 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00594653.log > Sadly I do not have the logs any more from the disk full event itself. > I have three questions: > * What is the best way to clean this up? Deleting the old log files and > restarting the brokers? > * Why did kafka not handle the disk full event well? Is this only affecting > the cleanup or may we also lose data? > * Is this maybe caused by the combination of upgrade and disk full? > And last but not least: Keep up the good work. Kafka is really performing > well while being easy to administer and has good documentation! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (KAFKA-5501) use async zookeeper apis everywhere
[ https://issues.apache.org/jira/browse/KAFKA-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-5501. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 3427 [https://github.com/apache/kafka/pull/3427] > use async zookeeper apis everywhere > --- > > Key: KAFKA-5501 > URL: https://issues.apache.org/jira/browse/KAFKA-5501 > Project: Kafka > Issue Type: Sub-task >Reporter: Onur Karaman >Assignee: Onur Karaman > Fix For: 1.0.0 > > > Synchronous zookeeper writes means that we wait an entire round trip before > doing the next write. These synchronous writes are happening at a > per-partition granularity in several places, so partition-heavy clusters > suffer from the controller doing many sequential round trips to zookeeper. > * PartitionStateMachine.electLeaderForPartition updates leaderAndIsr in > zookeeper on transition to OnlinePartition. This gets triggered per-partition > sequentially with synchronous writes during controlled shutdown of the > shutting down broker's replicas for which it is the leader. > * ReplicaStateMachine updates leaderAndIsr in zookeeper on transition to > OfflineReplica when calling KafkaController.removeReplicaFromIsr. This gets > triggered per-partition sequentially with synchronous writes for failed or > controlled shutdown brokers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
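To make the difference concrete, here is an illustrative sketch (not the actual controller code) of pipelining per-partition writes with ZooKeeper's asynchronous setData instead of paying one full round trip per write; the zk handle, paths and payloads are assumed to exist, and versions are ignored (-1) to keep it short.
{code}
import java.util.concurrent.CountDownLatch
import org.apache.zookeeper.{AsyncCallback, ZooKeeper}
import org.apache.zookeeper.data.Stat

// Sequential synchronous writes: one full round trip per path.
def writeAllSync(zk: ZooKeeper, updates: Map[String, Array[Byte]]): Unit =
  updates.foreach { case (path, data) => zk.setData(path, data, -1) }

// Pipelined asynchronous writes: all requests are in flight at once; wait once at the end.
def writeAllAsync(zk: ZooKeeper, updates: Map[String, Array[Byte]]): Unit = {
  val remaining = new CountDownLatch(updates.size)
  val cb = new AsyncCallback.StatCallback {
    override def processResult(rc: Int, path: String, ctx: Any, stat: Stat): Unit = {
      // Real code would inspect rc (e.g. BADVERSION, CONNECTIONLOSS) and retry as needed.
      remaining.countDown()
    }
  }
  updates.foreach { case (path, data) => zk.setData(path, data, -1, cb, null) }
  remaining.await()
}
{code}
The win comes from not serializing the round trips: with N partitions and a round-trip time of r, the synchronous loop costs roughly N*r, while the pipelined version costs roughly a single round trip plus server-side processing.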
[jira] [Created] (KAFKA-5706) log the name of the error instead of the error code in response objects
Jun Rao created KAFKA-5706: -- Summary: log the name of the error instead of the error code in response objects Key: KAFKA-5706 URL: https://issues.apache.org/jira/browse/KAFKA-5706 Project: Kafka Issue Type: Improvement Components: clients, core Affects Versions: 0.11.0.0 Reporter: Jun Rao Currently, when logging the error code in the response objects, we simply log response.toString(), which contains the error code. It will be useful to log the name of the corresponding exception for the error, which is more meaningful than an error code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
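As a sketch of the mapping being asked for, the public Errors enum in the clients library already translates a numeric code into a symbolic name; the helper below is made up for illustration.
{code}
import org.apache.kafka.common.protocol.Errors

// Hypothetical helper: turn the raw error code carried in a response into something readable.
def describeError(code: Short): String = {
  val error = Errors.forCode(code)          // e.g. 3 -> UNKNOWN_TOPIC_OR_PARTITION
  s"error=${error.name} (code=$code)"
}

// describeError(3) == "error=UNKNOWN_TOPIC_OR_PARTITION (code=3)"
{code}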
[jira] [Resolved] (KAFKA-3984) Broker doesn't retry reconnecting to an expired Zookeeper connection
[ https://issues.apache.org/jira/browse/KAFKA-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Rao resolved KAFKA-3984. Resolution: Duplicate Marking this as duplicate. The fix will be done in KAFKA-5473. > Broker doesn't retry reconnecting to an expired Zookeeper connection > > > Key: KAFKA-3984 > URL: https://issues.apache.org/jira/browse/KAFKA-3984 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.9.0.1, 0.10.1.1 >Reporter: Braedon Vickers > > We've been having issues with the network connectivity of our Kafka cluster, > and this seems to be triggering an issue where the brokers stop trying to > reconnect to Zookeeper, leaving us with a broken cluster even when the > network has recovered. > When network issues begin we see {{java.net.NoRouteToHostException}} > exceptions from {{org.apache.zookeeper.ClientCnxn}} as it attempts to > re-establish the connection. If the network issue resolves itself while we > are only getting these errors the broker seems to reconnect fine. > However, a lot of the time we end up with a message like this: > {code}[2016-07-22 00:21:44,181] FATAL Could not establish session with > zookeeper (kafka.server.KafkaHealthcheck) > org.I0Itec.zkclient.exception.ZkException: Unable to connect to hosts> > at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:71) > at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1279) > ... > Caused by: java.net.UnknownHostException: > at java.net.InetAddress.getAllByName(InetAddress.java:1126) > at java.net.InetAddress.getAllByName(InetAddress.java:1192) > at > org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61) > at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) > ... > {code} > (apologies for the partial stack traces - I'm having to try and reconstruct > them from a less than ideal centralised logging setup.) > If this happens, the broker stops trying to reconnect to Zookeeper, and we > have to restart it. > It looks like while the {{org.apache.zookeeper.ZooKeeper}} client's state > isn't {{Expired}} it will keep retrying the connection, and will recover OK > when the network is back. However, once it changes to {{Expired}} (not > entirely sure how that happens - based on the session timeout perhaps?) > zkclient closes the existing client and attempts to create a new one. If the > network is still down, the client constructor throws a > {{java.net.UnknownHostException}}, zkclient calls > {{handleSessionEstablishmentError()}} on {{KafkaHealthcheck}}, > {{KafkaHealthcheck.handleSessionEstablishmentError()}} logs a "Fatal" error > and does nothing else. > It seems like some form of retry needs to happen here, or the broker is stuck > with no Zookeeper connection > indefinitely. {{KafkaHealthcheck.handleSessionEstablishmentError()}} used to > kill the JVM, but that was removed in > https://issues.apache.org/jira/browse/KAFKA-2405. Killing the JVM would be > better than doing nothing, as then your init system could restart it, > allowing it to recover once the network was back. > Our cluster is running 0.9.0.1, so not sure if it affects 0.10.0.0 as well. > However, it seems likely, as there don't seem to be any code changes in > kafka or zkclient that would affect this behaviour. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (KAFKA-5745) Partition.makeLeader() should convert HW to OffsetMetadata before becoming the leader
Jun Rao created KAFKA-5745: -- Summary: Partition.makeLeader() should convert HW to OffsetMetadata before becoming the leader Key: KAFKA-5745 URL: https://issues.apache.org/jira/browse/KAFKA-5745 Project: Kafka Issue Type: Bug Components: core Affects Versions: 0.11.0.0 Reporter: Jun Rao Saw the following uncaught exception in the log. ERROR [KafkaApi-411] Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 708572; ClientId: client-0; ReplicaId: -1; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes:52428800 bytes; RequestInfo: ([topic1,3],PartitionFetchInfo(953794,1048576)) (kafka.server.KafkaApis) org.apache.kafka.common.KafkaException: 953793 [-1 : -1] cannot compare its segment info with 953794 [829749 : 564535961] since it only has message offset info at kafka.server.LogOffsetMetadata.onOlderSegment(LogOffsetMetadata.scala:48) at kafka.server.DelayedFetch$$anonfun$tryComplete$1.apply(DelayedFetch.scala:93) at kafka.server.DelayedFetch$$anonfun$tryComplete$1.apply(DelayedFetch.scala:77) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at kafka.server.DelayedFetch.tryComplete(DelayedFetch.scala:77) at kafka.server.DelayedOperation.safeTryComplete(DelayedOperation.scala:104) at kafka.server.DelayedOperationPurgatory.tryCompleteElseWatch(DelayedOperation.scala:196) at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:516) at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:534) at kafka.server.KafkaApis.handle(KafkaApis.scala:79) at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
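For readers puzzling over the "only has message offset info" failure above, the toy model below (hypothetical types, not Kafka's actual LogOffsetMetadata or Partition code) shows why the comparison needs segment information on both sides, and what converting the HW to full offset metadata means in spirit.
{code}
// Toy model: a log position either carries full segment info or only a message offset.
final case class OffsetMetadata(messageOffset: Long,
                                segmentBaseOffset: Long = -1L,
                                relativePositionInSegment: Int = -1) {
  def messageOffsetOnly: Boolean = segmentBaseOffset < 0

  // Mirrors the failing check in the stack trace: comparing segments is impossible
  // when one side was created from a bare offset.
  def onOlderSegment(that: OffsetMetadata): Boolean = {
    require(!messageOffsetOnly && !that.messageOffsetOnly,
      s"$this cannot compare its segment info with $that since it only has message offset info")
    this.segmentBaseOffset < that.segmentBaseOffset
  }
}

// What the ticket asks for, in spirit: before the replica starts serving fetches as
// leader, resolve the high watermark's bare offset into full metadata by locating its
// segment (the lookup functions here are assumed to exist).
def resolveHighWatermark(hw: Long,
                         segmentBaseOffsetOf: Long => Long,
                         positionInSegmentOf: Long => Int): OffsetMetadata =
  OffsetMetadata(hw, segmentBaseOffsetOf(hw), positionInSegmentOf(hw))
{code}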
[jira] [Created] (KAFKA-5858) consumer.poll() shouldn't throw exception due to deserialization error
Jun Rao created KAFKA-5858: -- Summary: consumer.poll() shouldn't throw exception due to deserialization error Key: KAFKA-5858 URL: https://issues.apache.org/jira/browse/KAFKA-5858 Project: Kafka Issue Type: Improvement Affects Versions: 0.11.0.0 Reporter: Jun Rao Currently, the new consumer will throw an exception in poll() if it hits a deserialization error. The consumer then can't make progress from this point on. It will be better to throw the deserialization exception only when the key/value of the ConsumerRecord is accessed, like the old consumer does. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
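A minimal sketch of the proposed behaviour, using a made-up LazyRecord wrapper rather than the real ConsumerRecord: poll() would hand back raw bytes, and any SerializationException would surface only when the key or value is actually accessed.
{code}
import org.apache.kafka.common.serialization.Deserializer

// Made-up wrapper: holds raw bytes and defers deserialization until access.
final class LazyRecord[K, V](topic: String,
                             rawKey: Array[Byte],
                             rawValue: Array[Byte],
                             keyDeserializer: Deserializer[K],
                             valueDeserializer: Deserializer[V]) {
  // A SerializationException thrown here affects only this record; the consumer can
  // keep polling and skip or dead-letter the bad record.
  def key(): K   = keyDeserializer.deserialize(topic, rawKey)
  def value(): V = valueDeserializer.deserialize(topic, rawValue)
}
{code}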
[jira] [Created] (KAFKA-5871) bound the throttle time in byte rate quota
Jun Rao created KAFKA-5871: -- Summary: bound the throttle time in byte rate quota Key: KAFKA-5871 URL: https://issues.apache.org/jira/browse/KAFKA-5871 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 0.11.0.0 Reporter: Jun Rao Currently, if a user sets a very small byte rate quota, the calculated throttled time can be higher than the request timeout in the client, which will cause the client request to time out and be retried. On the other hand, for the request time quota, we bound the throttled time with the metric window size. This prevents client timeout and seems to be better. We probably want to implement the same thing in the byte rate quota. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
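A back-of-the-envelope sketch of the bounding being proposed (the names and formula are illustrative, not Kafka's quota code): compute how long the client would have to stay idle for its observed byte rate to drop back under the quota, then cap that delay at the quota metric's window size so it cannot exceed a reasonable client request timeout.
{code}
// Illustrative only. observedBytes is what the client sent/fetched in the current
// quota window of length windowMs; quotaBytesPerSec is the configured byte-rate quota.
def throttleTimeMs(observedBytes: Long,
                   windowMs: Long,
                   quotaBytesPerSec: Double,
                   maxThrottleMs: Long): Long = {
  val allowedBytes = quotaBytesPerSec * windowMs / 1000.0
  val overageBytes = math.max(0.0, observedBytes - allowedBytes)
  val unboundedMs  = (overageBytes / quotaBytesPerSec * 1000.0).toLong
  math.min(unboundedMs, maxThrottleMs) // e.g. maxThrottleMs = the metric window size
}

// With a tiny quota (say 1 KB/s) and a 10 MB request, unboundedMs would be ~10,000s;
// the cap keeps the delay within the window instead of blowing past the client's
// request timeout.
{code}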