[jira] [Commented] (KAFKA-9141) Global state update error: missing recovery or wrong log message
[ https://issues.apache.org/jira/browse/KAFKA-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969047#comment-16969047 ] Matthias J. Sax commented on KAFKA-9141: Calling `cleanUp()` is something a user would need to do explicitly – there is no reason for Kafka Streams to call it. Can you share the full StackTrace/Error message? > Global state update error: missing recovery or wrong log message > > > Key: KAFKA-9141 > URL: https://issues.apache.org/jira/browse/KAFKA-9141 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.3.0 >Reporter: Chris Toomey >Priority: Major > > I'm getting an {{OffsetOutOfRangeException}} accompanied by the log message > "Updating global state failed. You can restart KafkaStreams to recover from > this error." But I've restarted the app several times and it's not > recovering, it keeps failing the same way. > > I see there's a {{cleanUp()}} method on {{KafkaStreams}} that looks like it's > what's needed, but it's not called anywhere in the streams source code. So > either that's a bug and the call should be added to do the recovery, or the > log message is wrong and should be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
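For context, an explicit {{cleanUp()}} call on the application side looks roughly like the sketch below (application id, bootstrap server, and topic are placeholders, not taken from this ticket); {{cleanUp()}} is only valid while the instance is not running, i.e. before {{start()}} or after {{close()}}:

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExplicitCleanUp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "global-store-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        builder.globalTable("some-global-topic");                             // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // Wipes the local state directory for this application id so that global
        // (and local) state is rebuilt from the topics on the next start.
        streams.cleanUp();
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
{code}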
[jira] [Created] (KAFKA-9155) __consuemer_offsets unable to compress properly
yanrui created KAFKA-9155: - Summary: __consuemer_offsets unable to compress properly Key: KAFKA-9155 URL: https://issues.apache.org/jira/browse/KAFKA-9155 Project: Kafka Issue Type: Bug Components: log cleaner Affects Versions: 2.1.1 Reporter: yanrui -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: data.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: data.jpg > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: 32-log.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, data.jpg > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: last_batch.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, data.jpg, last_batch.jpg > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Description: The cleaner thread has been running normally, but partition 32 is never selected as the dirtiest partition. I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is 864717799, but the last offset is 163990182 > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, data.jpg, last_batch.jpg > > > The cleaner thread has been running normally, but partition 32 is never > selected as the dirtiest partition. > I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is > 864717799, but the last offset is 163990182 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: 32data_detail.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, 32data_detail.jpg, data.jpg, last_batch.jpg > > > The clean thread has been cleaned up normally, but the dirtyest partition > selection has not been selected to 32. > Found a strange problem:query log cleanup point the __consumer_offsets-32 is > 864717799,but the last offset is 163990182 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Description: In my environment, the cleaner thread has been running normally. Each internal topic partition has 2 replicas. Take partition 32 as an example: one replica is cleaned normally, but the other is never selected as the dirtiest partition. I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is 864717799, but the last offset is 163990182. I don't know what caused this; asking for help. was: The cleaner thread has been running normally, but partition 32 is never selected as the dirtiest partition. I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is 864717799, but the last offset is 163990182 > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, 32data_detail.jpg, data.jpg, last_batch.jpg > > > In my environment, the cleaner thread has been running normally. > Each internal topic partition has 2 replicas. Take partition 32 as an example: > one replica is cleaned normally, but the other is never selected as the > dirtiest partition. > I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is > 864717799, but the last offset is 163990182. I don't know what caused this; asking > for help. -- This message was sent by Atlassian Jira (v8.3.4#803005)
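The cleanup point the reporter compares against the last offset is persisted per log directory in the cleaner-offset-checkpoint file. As a rough way to inspect it, the sketch below prints its entries, assuming the plain-text checkpoint layout (a version line, an entry-count line, then one "topic partition offset" line per partition); the path is a placeholder:

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DumpCleanerCheckpoint {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point this at one of the broker's log directories.
        String file = "/data/kafka-logs/cleaner-offset-checkpoint";
        List<String> lines = Files.readAllLines(Paths.get(file));
        // Assumed layout: line 0 = format version, line 1 = number of entries,
        // remaining lines = "<topic> <partition> <cleanup offset>".
        for (String line : lines.subList(2, lines.size())) {
            String[] parts = line.split(" ");
            System.out.printf("topic=%s partition=%s cleanup-point=%s%n",
                    parts[0], parts[1], parts[2]);
        }
    }
}
{code}

Comparing that cleanup point with the partition's actual end offset is one way to confirm the mismatch described above.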
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969090#comment-16969090 ] Tim Van Laer commented on KAFKA-8803: - I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with: {code} org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-00, topic=galactica.timeline-aligner.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?] Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId {code} It was automatically restarted every time, but it kept failing (even after stopping the whole group). Yesterday two brokers threw an UNKNOWN_LEADER_EPOCH error, and after that the client started to get into trouble. {code} [2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} {code} [2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} Meta: * Kafka Streams 2.3.1, * Broker: 2.3.1 patched without KAFKA-8724 (see KAFKA-9133) I will give the {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers. > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773.
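Since the error message suggests raising the producer's {{max.block.ms}}, a minimal sketch of how that property is forwarded through Kafka Streams configuration is shown below (application id, bootstrap server, topics, and the 120000 ms value are placeholders, not values from this ticket):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MaxBlockMsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        // Forward max.block.ms to the producers that Streams creates; the default of
        // 60000 ms is the timeout seen in the InitProducerId error above.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120000);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                     // placeholder topics

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
{code}

Note that a larger {{max.block.ms}} only buys the broker more time; it does not by itself address the underlying UNKNOWN_LEADER_EPOCH condition discussed in this thread.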
[jira] [Created] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
shilin Lu created KAFKA-9156: Summary: LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state Key: KAFKA-9156 URL: https://issues.apache.org/jira/browse/KAFKA-9156 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.3.0 Reporter: shilin Lu Attachments: image-2019-11-07-17-42-13-852.png, image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png !image-2019-11-07-17-42-13-852.png! The timeindex get function is not thread safe, so it may create more than one timeindex. !image-2019-11-07-17-44-05-357.png! When more than one timeindex is created, the mappedbytebuffer position may be moved to the end; writing an index entry to that mmap file then causes java.nio.BufferOverflowException. !image-2019-11-07-17-46-53-650.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
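Without the attached screenshots, the description reads like an unsynchronized check-then-act during lazy index creation; a minimal, self-contained illustration of that kind of race is sketched below (class and field names are illustrative, not the actual LazyTimeIndex code):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class LazyIndexRace {
    static final AtomicInteger created = new AtomicInteger();

    // Illustrative stand-in for the index object that maps a file into memory.
    static class Index {
        Index() { created.incrementAndGet(); }
    }

    private volatile Index index; // volatile alone does not make the check below atomic

    Index get() {
        if (index == null) {      // two threads can both observe null here...
            index = new Index();  // ...and both create (and would both mmap) an index
        }
        return index;
    }

    public static void main(String[] args) throws InterruptedException {
        LazyIndexRace lazy = new LazyIndexRace();
        Thread t1 = new Thread(lazy::get);
        Thread t2 = new Thread(lazy::get);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Occasionally prints 2, which is the "more than one timeindex" symptom.
        System.out.println("indexes created: " + created.get());
    }
}
{code}

The usual remedy is to guard the lazy creation with synchronization (or double-checked locking on the volatile field) so that at most one index is ever created and mapped.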
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969107#comment-16969107 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] [~ijuma] Please take a look at this issue, thanks. Our company encountered this problem in a production environment. > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > The timeindex get function is not thread safe, so it may create more than one > timeindex. > !image-2019-11-07-17-44-05-357.png! > When more than one timeindex is created, the mappedbytebuffer position may be moved to > the end; writing an index entry to that mmap file then causes > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shilin Lu updated KAFKA-9156: - Reviewer: Ismael Juma > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > The timeindex get function is not thread safe, so it may create more than one > timeindex. > !image-2019-11-07-17-44-05-357.png! > When more than one timeindex is created, the mappedbytebuffer position may be moved to > the end; writing an index entry to that mmap file then causes > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969090#comment-16969090 ] Tim Van Laer edited comment on KAFKA-8803 at 11/7/19 9:56 AM: -- I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with: {code} org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-00, topic=xyz.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?] Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId {code} It was automatically restarted every time, but it kept failing (even after stopping the whole group). Yesterday two brokers throw a UNKNOWN_LEADER_EPOCH error and after that, the client started to get into troubles. {code} [2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} {code} [2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} Meta: * Kafka Streams 2.3.1, * Broker: patched: 2.3.1 without KAFKA-8724 (see KAFKA-9133) I will give the {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers. was (Author: timvanlaer): I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with: {code} org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-00, topic=galactica.timeline-aligner.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?] 
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?] Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId {code} It was automatically restarted every time, but it kept failing (even after stopping the whole group). Yesterday two brokers throw a UNKNOWN_LEADER_EPOCH error and after that, the client started to get into troubles. {code} [2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} {code} [2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} Meta: * Kafka Streams 2.3.1, * Broker: patched: 2.3.1 without KAFKA-8724 (see KAFKA-9133) I will give the {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers. > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > -
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969161#comment-16969161 ] Tim Van Laer commented on KAFKA-8803: - I can confirm: a rolling restart of the brokers resolved the issue. It's hard to tell as our rolling restart is automated, but I have the impression only a restart of the group coordinator is enough to do the trick. Would that make sense? > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969161#comment-16969161 ] Tim Van Laer edited comment on KAFKA-8803 at 11/7/19 10:54 AM: --- I can confirm: a rolling restart of the brokers resolved the issue. It's hard to tell as our rolling restart is automated, but I have the impression a restart of the group coordinator is enough to do the trick. Would that make sense? was (Author: timvanlaer): I can confirm: a rolling restart of the brokers resolved the issue. It's hard to tell as our rolling restart is automated, but I have the impression only a restart of the group coordinator is enough to do the trick. Would that make sense? > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969309#comment-16969309 ] Mickael Maison commented on KAFKA-8933: --- We saw a similar exception in 2.4/2.5: {code:java} Nov 5 18:10:26 mirrormaker2-6c5bbc-jx85h mirrormaker2 ERROR [MirrorSourceConnector|task-0] Failure during poll. (org.apache.kafka.connect.mirror.MirrorSourceTask:159) java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1071) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:303) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1249) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) at org.apache.kafka.connect.mirror.MirrorSourceTask.poll(MirrorSourceTask.java:137) at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:259) at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:226) at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source){code} Prior to the exception, the client was disconnected. In NetworkClient.processDisconnection() several paths can lead to inProgressRequestVersion being set to null. If there's a MetadataRequest in flight to another node at the time, then the exception is hit when handling the response as inProgressRequestVersion is unboxed to an int. I was able to reproduce reliably with the following setup: - 2 brokers with SASL - consumer consuming from broker 0 - make broker 0 drop consumer connection - the consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Error sending fetch request (sessionId=735517, epoch=2) to node 0: {}. 
org.apache.kafka.common.errors.DisconnectException{code} - make broker 0 fail authentication - consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Failed authentication with localhost/127.0.0.1 (Authentication failed, invalid credentials) ERROR [main]: [Consumer clientId=consumer-2, groupId=null] Connection to node 0 (localhost/127.0.0.1:9093) failed authentication due to: Authentication failed, invalid credentials{code} - force metadata refresh by consumer, for example: consumer.partitionsFor("topic-that-does-not-exist"); - the consumer gets: {code:java} java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1073) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212) at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:368) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1930) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1898) at main.ConsumerTest2.main(ConsumerTest2.java:37) {code} I'm not super familair with this code path but the following patch helped: {code:java} diff --git clients/src/main/java/org/apache/kafka/clients/NetworkClient.java clients/src/main/java/org/apache/kafka/clients/NetworkClient.java index d782df865..d3119f132 100644 --- clients/src/main/java/org/apache/kafka/clients/NetworkClient.java +++ clients/src/main/java/org/apache/kafka/clients/NetworkClient.java @@ -1067,6 +1069,9 @@ public class NetworkClient implements KafkaClient { if (response.brokers().isEmpty()) { log.trace("Ignoring empty metadata response with correlation id {}.", requestHeader.correlationId()); this.metadata.failedUpdate(now, null); +} else if (inProgressRequestVersion == null) { +log.warn("Ignoring metadata resp
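The root cause described above, a boxed Integer that is cleared to null on disconnection and later auto-unboxed to int, is easy to demonstrate in isolation (the field name is illustrative, not the actual NetworkClient code):

{code:java}
public class UnboxingNpe {
    // Stand-in for the in-progress metadata request version, which is cleared on
    // disconnection while a metadata response from another node may still arrive.
    static Integer inProgressRequestVersion = null;

    public static void main(String[] args) {
        // Auto-unboxing a null Integer into an int throws NullPointerException,
        // matching the stack traces from handleCompletedMetadataResponse above.
        int version = inProgressRequestVersion;
        System.out.println(version);
    }
}
{code}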
[jira] [Updated] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mickael Maison updated KAFKA-8933: -- Affects Version/s: 2.4.0 > An unhandled SSL handshake exception in polling event - needed a retry logic > > > Key: KAFKA-8933 > URL: https://issues.apache.org/jira/browse/KAFKA-8933 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.2.1, 2.4.0 > Environment: software platform >Reporter: Remigius >Priority: Critical > > Already client is connected and during polling event, SSL handshake failure > happened. it led to leaving the co-ordinator. Even on SSL handshake failure > which was actually intermittent issue, polling should have some resilient and > retry the polling. Leaving group caused all instances of clients to drop and > left the messages in Kafka for long time until re-subscribe the kafka topic > manually. > > > {noformat} > 2019-09-06 04:03:09,016 ERROR [reactive-kafka-] > org.apache.kafka.clients.NetworkClient [Consumer clientId=aaa, groupId=bbb] > Connection to node 150 (host:port) failed authentication due to: SSL > handshake failed > 2019-09-06 04:03:09,021 ERROR [reactive-kafka-] > reactor.kafka.receiver.internals.DefaultKafkaReceiver Unexpected exception > java.lang.NullPointerException: null > at > org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1012) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:822) > ~[kafka-clients-2.2.1.jar!/:?] > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) > ~[kafka-clients-2.2.1.jar!/:?] 
> at > reactor.kafka.receiver.internals.DefaultKafkaReceiver$PollEvent.run(DefaultKafkaReceiver.java:470) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.doEvent(DefaultKafkaReceiver.java:401) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.lambda$start$14(DefaultKafkaReceiver.java:335) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:130) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:398) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:484) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.kafka.receiver.internals.KafkaSchedulers$EventScheduler.lambda$decorate$1(KafkaSchedulers.java:100) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > org.springframework.cloud.sleuth.instrument.async.TraceCallable.call(TraceCallable.java:70) > ~[spring-cloud-sleuth-core-2.1.1.RELEASE.jar!/:2.1.1.RELEASE] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2019-09-06 04:03:09,023 INFO [reactive-kafka-] > org.apache.kafka.clients.consumer.internals.AbstractCoordinator [Consumer > clientId=aaa, groupId=bbb] Member x_13-081e61ec-1509-4e0e-819e-58063d1ce8f6 > sending LeaveGroup request to coordinator{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969309#comment-16969309 ] Mickael Maison edited comment on KAFKA-8933 at 11/7/19 2:37 PM: We saw a similar exception in 2.4/2.5: {code:java} Nov 5 18:10:26 mirrormaker2-6c5bbc-jx85h mirrormaker2 ERROR [MirrorSourceConnector|task-0] Failure during poll. (org.apache.kafka.connect.mirror.MirrorSourceTask:159) java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1071) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:303) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1249) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) at org.apache.kafka.connect.mirror.MirrorSourceTask.poll(MirrorSourceTask.java:137) at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:259) at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:226) at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source){code} Prior to the exception, the client was disconnected. In NetworkClient.processDisconnection() several paths can lead to inProgressRequestVersion being set to null. If there's a MetadataRequest in flight to another node at the time, then the exception is hit when handling the response as inProgressRequestVersion is unboxed to an int. I was able to reproduce reliably with the following setup: - 2 brokers with SASL - consumer consuming from broker 0 - make broker 0 drop consumer connection - the consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Error sending fetch request (sessionId=735517, epoch=2) to node 0: {}. 
org.apache.kafka.common.errors.DisconnectException{code} - make broker 0 fail authentication - consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Failed authentication with localhost/127.0.0.1 (Authentication failed, invalid credentials) ERROR [main]: [Consumer clientId=consumer-2, groupId=null] Connection to node 0 (localhost/127.0.0.1:9093) failed authentication due to: Authentication failed, invalid credentials{code} - force metadata refresh by consumer, for example: consumer.partitionsFor("topic-that-does-not-exist"); - the consumer gets: {code:java} java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1073) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212) at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:368) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1930) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1898) at main.ConsumerTest2.main(ConsumerTest2.java:37) {code} I'm not super familair with this code path but the following patch helped: {code:java} diff --git clients/src/main/java/org/apache/kafka/clients/NetworkClient.java clients/src/main/java/org/apache/kafka/clients/NetworkClient.java index d782df865..d3119f132 100644 --- clients/src/main/java/org/apache/kafka/clients/NetworkClient.java +++ clients/src/main/java/org/apache/kafka/clients/NetworkClient.java @@ -1067,6 +1069,9 @@ public class NetworkClient implements KafkaClient { if (response.brokers().isEmpty()) { log.trace("Ignoring empty metadata response with correlation id {}.", requestHeader.correlationId()); this.metadata.failedUpdate(now, null); +} else if (inProgressRequestVersion == n
[jira] [Commented] (KAFKA-7016) Reconsider the "avoid the expensive and useless stack trace for api exceptions" practice
[ https://issues.apache.org/jira/browse/KAFKA-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969380#comment-16969380 ] Viliam Durina commented on KAFKA-7016: -- I add my vote to this: the exceptions are much less useful. Exceptions cost nothing if they're not thrown, which is the normal case. If exceptions are used internally in normal execution, that is what should be avoided. I have complex code and don't know where in my code the problem happened. I ended up wrapping each kafka call individually with this to find out: {{handleExc(() -> /* the kafka call here */);}} {{private static void handleExc(RunnableExc action) {}} {{ try {}} {{ action.run();}} {{ } catch (Exception e) {}} {{ throw new RuntimeException(e);}} {{ }}} {{}}} Very annoying. > Reconsider the "avoid the expensive and useless stack trace for api > exceptions" practice > > > Key: KAFKA-7016 > URL: https://issues.apache.org/jira/browse/KAFKA-7016 > Project: Kafka > Issue Type: Bug >Reporter: Martin Vysny >Priority: Major > > I am trying to write a Kafka Consumer; upon running it only prints out: > {{ org.apache.kafka.common.errors.InvalidGroupIdException: The configured > groupId is invalid}} > Note that the stack trace is missing, so I have no information about which > part of my code is bad and needs fixing; I also have no information about which > Kafka Client method has been called. Upon closer examination I found this in > ApiException: > > {{/* avoid the expensive and useless stack trace for api exceptions */}} > {{@Override}} > {{public Throwable fillInStackTrace() {}} > {{ return this;}} > {{}}} > > I think it is a bad practice to hide all useful debugging info and trade it > for dubious performance gains. Exceptions are for exceptional code flow, which > is allowed to be slow. > > This applies to kafka-clients 1.1.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
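For completeness, a self-contained version of the wrapping trick from the comment above; the {{RunnableExc}} functional interface is hypothetical (the comment does not show its definition), and the wrapped call is a stand-in for a real Kafka client call:

{code:java}
public class WrapKafkaCalls {
    // Hypothetical functional interface assumed by handleExc in the comment above.
    @FunctionalInterface
    interface RunnableExc {
        void run() throws Exception;
    }

    static void handleExc(RunnableExc action) {
        try {
            action.run();
        } catch (Exception e) {
            // Wrapping creates a new exception whose stack trace points at the calling
            // site, which the ApiException#fillInStackTrace() override suppresses.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a Kafka client call that may throw an ApiException without a trace.
        handleExc(() -> {
            throw new Exception("simulated client error");
        });
    }
}
{code}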
[jira] [Commented] (KAFKA-9044) Brokers occasionally (randomly?) dropping out of clusters
[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969412#comment-16969412 ] Peter Bukowinski commented on KAFKA-9044: - I don't see any log entries like that. When broker 29 became the controller, this is all I see. Do I need to modify my log4j properties to see the sessionid [2019-10-29 18:10:03,135] INFO [Controller id=29] 29 successfully elected as the controller. Epoch incremented to 17 and epoch zk version is now 16 (kafka.controller.KafkaController) [2019-10-29 18:10:03,135] INFO [Controller id=29] Registering handlers (kafka.controller.KafkaController) [2019-10-29 18:10:03,138] INFO [Controller id=29] Deleting log dir event notifications (kafka.controller.KafkaController) [2019-10-29 18:10:03,139] INFO [Controller id=29] Deleting isr change notifications (kafka.controller.KafkaController) [2019-10-29 18:10:03,140] INFO [Controller id=29] Initializing controller context (kafka.controller.KafkaController) [2019-10-29 18:10:03,164] INFO [Controller id=29] Initialized broker epochs cache: Map(5 -> 42985827819, 10 -> 42985831769, 24 -> 42985842811, 25 -> 42985843630, 14 -> 4298583480, 20 -> 42985839540, 29 -> 42985846752, 6 -> 42985828664, 28 -> 42985845998, 21 -> 42985840362, 9 -> 42985830988, 13 -> 42985834055, 2 -> 42985825412, 17 -> 42985837193, 22 -> 42985841177, 27 -> 42985845228, 12 -> 42985833303, 7 -> 42985829425, 3 -> 42985826221, 18 -> 42985837953, 16 -> 42985836344, 11 -> 42985832529, 26 -> 42985844473, 23 -> 42985842007, 8 -> 42985830214, 30 -> 42985847510, 19 -> 42985838744, 4 -> 42985827024, 15 -> 42985835551) (kafka.controller.KafkaController) [2019-10-29 18:10:03,184] DEBUG [Controller id=29] Register BrokerModifications handler for Set(5, 10, 24, 25, 14, 20, 29, 6, 28, 21, 9, 13, 2, 17, 22, 27, 12, 7, 3, 18, 16, 11, 26, 23, 8, 30, 19, 4, 15) (kafka.controller.KafkaController) [2019-10-29 18:10:03,329] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 18 (kafka.controller.ControllerChannelManager) [2019-10-29 18:10:03,333] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 27 (kafka.controller.ControllerChannelManager) [2019-10-29 18:10:03,334] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 19 (kafka.controller.ControllerChannelManager) [2019-10-29 18:10:03,335] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 17 (kafka.controller.ControllerChannelManager) This is the controller znode for the cluster: {"version":1,"brokerid":29,"timestamp":"1572397803132"} cZxid = 0xa0227fdba ctime = Tue Oct 29 18:10:03 PDT 2019 mZxid = 0xa0227fdba mtime = Tue Oct 29 18:10:03 PDT 2019 pZxid = 0xa0227fdba cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x36b49768579d9d5 dataLength = 55 numChildren = 0 I tried grepping the transaction logs for each of the hex values but came up empty. > Brokers occasionally (randomly?) dropping out of clusters > - > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 >Reporter: Peter Bukowinski >Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. 
When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 360ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/01975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/
[jira] [Created] (KAFKA-9157) logcleaner could generate empty segment files after cleaning
Jun Rao created KAFKA-9157: -- Summary: logcleaner could generate empty segment files after cleaning Key: KAFKA-9157 URL: https://issues.apache.org/jira/browse/KAFKA-9157 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 2.3.0 Reporter: Jun Rao Currently, the log cleaner can only combine segments within a 2-billion offset range. If all records in that range are deleted, an empty segment could be generated. It would be useful to avoid generating such empty segments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9044) Brokers occasionally (randomly?) dropping out of clusters
[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969452#comment-16969452 ] Jun Rao commented on KAFKA-9044: The ZK session is typically created when the broker starts, not when it becomes the controller. Alternatively, the ZK server may log the client IP address for a ZK session when it's created. If you can find out that, you can see if the IP matches the broker IP. > Brokers occasionally (randomly?) dropping out of clusters > - > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 >Reporter: Peter Bukowinski >Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 360ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/01975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/kl/internal_test-52/01975332.index.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,958] INFO Deleted time index > /data/3/kl/internal_test-52/01975332.timeindex.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:42:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offse
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969488#comment-16969488 ] Boyang Chen commented on KAFKA-8803: You mean [~timvanlaer] transaction coordinator right? Group coordinator is handling the consumer related logic > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9044) Brokers occasionally (randomly?) dropping out of clusters
[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969517#comment-16969517 ] Peter Bukowinski commented on KAFKA-9044: - My broker log startup shows these lines with no sessionid. [2019-11-07 09:01:07,626] INFO starting (kafka.server.KafkaServer) [2019-11-07 09:01:07,626] INFO Connecting to zookeeper on zk-a0001.example.com:2181,zk-a0002.example.com:2181,zk-a0003.example.com:2181/kafka (kafka.server.KafkaServer) [2019-11-07 09:01:07,650] INFO [ZooKeeperClient Kafka server] Initializing a new session to zk-a0001.example.com:2181,zk-a0002.example.com:2181,zk-a0003.example.com:2181. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,675] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,699] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,784] INFO Created zookeeper path /kafka (kafka.server.KafkaServer) [2019-11-07 09:01:07,785] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,793] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,794] INFO [ZooKeeperClient Kafka server] Initializing a new session to zk-a0001.example.com:2181,zk-a0002.example.com:2181,zk-a0003.example.com:2181/kafka. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,794] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,801] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:08,035] INFO Cluster ID = KQW1r4YbRK-8LzVUdbKnZg (kafka.server.KafkaServer) > Brokers occasionally (randomly?) dropping out of clusters > - > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 >Reporter: Peter Bukowinski >Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 360ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/01975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/kl/internal_test-52/01975332.index.deleted. 
> (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,958] INFO Deleted time index > /data/3/kl/internal_test-52/01975332.timeindex.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinat
[jira] [Commented] (KAFKA-9141) Global state update error: missing recovery or wrong log message
[ https://issues.apache.org/jira/browse/KAFKA-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969529#comment-16969529 ] Chris Toomey commented on KAFKA-9141: - Here's the log message and stack trace. It comes from [this line|https://github.com/apache/kafka/blob/2.4/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStreamThread.java#L248] in {{GlobalStreamThread}}. {code:java} 2019-10-31 17:12:15.894 - - ERROR o.a.k.s.p.i.GlobalStreamThread$StateConsumer service-platformAPI-revokedJwtsCache-91814a54-b18b-47a8-9d35-926e6abcb5f6-GlobalStreamThread - global-stream-thread [service-platformAPI-revokedJwtsCache-91814a54-b18b-47a8-9d35-926e6abcb5f6-GlobalStreamThread] Updating global state failed. You can restart KafkaStreams to recover from this error. org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {livongo.public.entity.system.auth.jwt.revoked-0=15579} at org.apache.kafka.clients.consumer.internals.Fetcher.initializePartitionRecords(Fetcher.java:1266) at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:605) at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1283) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1214) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1190) at org.apache.kafka.streams.processor.internals.GlobalStreamThread$StateConsumer.pollAndUpdate(GlobalStreamThread.java:239) at org.apache.kafka.streams.processor.internals.GlobalStreamThread.run(GlobalStreamThread.java:290) {code} The context in which this occurred was switching the application to point to a different kafka broker in a development environment, which caused the offset problem. Several of us on my team got this error and took the message "You can restart KafkaStreams to recover from this error" to mean "restarting KafkaStreams will fix this error", which it doesn't, hence this ticket. How could an application detect and self-recover by calling {{cleanUp()}} from this, given that it happens asynchronously on the global stream thread? > Global state update error: missing recovery or wrong log message > > > Key: KAFKA-9141 > URL: https://issues.apache.org/jira/browse/KAFKA-9141 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.3.0 >Reporter: Chris Toomey >Priority: Major > > I'm getting an {{OffsetOutOfRangeException}} accompanied by the log message > "Updating global state failed. You can restart KafkaStreams to recover from > this error." But I've restarted the app several times and it's not > recovering, it keeps failing the same way. > > I see there's a {{cleanUp()}} method on {{KafkaStreams}} that looks like it's > what's needed, but it's not called anywhere in the streams source code. So > either that's a bug and the call should be added to do the recovery, or the > log message is wrong and should be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
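One possible application-side shape for the self-recovery question above, offered only as a minimal sketch: it assumes the application can rebuild its own Topology and Properties (buildTopology()/buildConfig() are hypothetical placeholders) and that the global thread's death surfaces through the instance-level uncaught exception handler with OffsetOutOfRangeException somewhere in the cause chain, which is worth verifying before relying on it. Since a closed KafkaStreams instance cannot be restarted, recovery here means closing, calling cleanUp(), and constructing a fresh instance.

{code:java}
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

import java.util.Properties;

public class SelfHealingStreams {

    // Hypothetical helpers: the real application would return its actual topology
    // and a fully populated config (application.id, bootstrap.servers, ...).
    static Topology buildTopology() { return new Topology(); }
    static Properties buildConfig() { return new Properties(); }

    public static void main(String[] args) {
        startNewInstance();
    }

    static void startNewInstance() {
        final KafkaStreams streams = new KafkaStreams(buildTopology(), buildConfig());
        streams.setUncaughtExceptionHandler((thread, error) -> {
            // Walk the cause chain; the global stream thread may wrap the consumer error.
            for (Throwable t = error; t != null; t = t.getCause()) {
                if (t instanceof OffsetOutOfRangeException) {
                    // close()/cleanUp() must not run on the dying stream thread itself,
                    // so hand the recovery off to a separate thread.
                    new Thread(() -> {
                        streams.close();
                        streams.cleanUp();   // legal only while the instance is not running
                        startNewInstance();  // a closed KafkaStreams cannot be restarted
                    }).start();
                    return;
                }
            }
        });
        streams.start();
    }
}
{code}

Whether wiping local state and rebootstrapping the global store is acceptable depends on the application, so this is a sketch of one recovery strategy, not a general answer to the asynchronous-failure question.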
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969532#comment-16969532 ] Tim Van Laer commented on KAFKA-8803: - [~bchen225242] good point, I guess so. Any way to find the transaction coordinator broker for this particular consumer? > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
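On locating the transaction coordinator: a hedged sketch of the usual mapping, under the assumptions that the coordinator for a transactional.id is the current leader of the __transaction_state partition selected by hashing the id (analogous to the group coordinator and __consumer_offsets), that transaction.state.log.num.partitions is at its default of 50, and that Streams builds its transactional.id as "<application.id>-<taskId>". These are assumptions to verify against the broker config and logs, not an official lookup API.

{code:java}
public class TxnCoordinatorLookup {

    // Mirrors the masking Kafka's Utils.abs applies, so Integer.MIN_VALUE hash codes are safe.
    static int transactionStatePartitionFor(String transactionalId, int numTransactionStatePartitions) {
        return (transactionalId.hashCode() & 0x7fffffff) % numTransactionStatePartitions;
    }

    public static void main(String[] args) {
        // Assumed transactional.id format for the stuck Streams task 0_36.
        String transactionalId = "my-streams-app-0_36";
        int partition = transactionStatePartitionFor(transactionalId, 50);
        // The coordinator should be the current leader of __transaction_state-<partition>,
        // which can be read off `kafka-topics.sh --describe --topic __transaction_state`.
        System.out.println("__transaction_state partition: " + partition);
    }
}
{code}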
[jira] [Commented] (KAFKA-9133) LogCleaner thread dies with: currentLog cannot be empty on an unexpected exception
[ https://issues.apache.org/jira/browse/KAFKA-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969589#comment-16969589 ] ASF GitHub Bot commented on KAFKA-9133: --- hachikuji commented on pull request #7662: KAFKA-9133; Cleaner should handle log start offset larger than active segment base offset URL: https://github.com/apache/kafka/pull/7662 This was a regression in 2.3.1. In the case of a DeleteRecords call, the log start offset may be higher than the active segment base offset. The cleaner should allow for this case gracefully. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > LogCleaner thread dies with: currentLog cannot be empty on an unexpected > exception > -- > > Key: KAFKA-9133 > URL: https://issues.apache.org/jira/browse/KAFKA-9133 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.3.1 >Reporter: Karolis Pocius >Assignee: Jason Gustafson >Priority: Blocker > Fix For: 2.4.0, 2.3.2 > > > Log cleaner thread dies without a clear reference to which log is causing it: > {code} > [2019-11-02 11:59:59,078] INFO Starting the log cleaner (kafka.log.LogCleaner) > [2019-11-02 11:59:59,144] INFO [kafka-log-cleaner-thread-0]: Starting > (kafka.log.LogCleaner) > [2019-11-02 11:59:59,199] ERROR [kafka-log-cleaner-thread-0]: Error due to > (kafka.log.LogCleaner) > java.lang.IllegalStateException: currentLog cannot be empty on an unexpected > exception > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:346) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:307) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89) > Caused by: java.lang.IllegalArgumentException: Illegal request for non-active > segments beginning at offset 5033130, which is larger than the active > segment's base offset 5019648 > at kafka.log.Log.nonActiveLogSegmentsFrom(Log.scala:1933) > at > kafka.log.LogCleanerManager$.maxCompactionDelay(LogCleanerManager.scala:491) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$4(LogCleanerManager.scala:184) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$1(LogCleanerManager.scala:181) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at > kafka.log.LogCleanerManager.grabFilthiestCompactedLog(LogCleanerManager.scala:171) > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:321) > ... 2 more > [2019-11-02 11:59:59,200] INFO [kafka-log-cleaner-thread-0]: Stopped > (kafka.log.LogCleaner) > {code} > If I try to ressurect it by dynamically bumping {{log.cleaner.threads}} it > instantly dies with the exact same error. > Not sure if this is something KAFKA-8725 is supposed to address. -- This message was sent by Atlassian Jira (v8.3.4#803005)
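The fix itself lives in the Scala log cleaner, but the guard the PR description calls for can be illustrated in a few lines. This is only an illustration of the idea, not the code from PR #7662.

{code:java}
// After a DeleteRecords request, logStartOffset may have advanced past the active
// segment's base offset. The cleaner must clamp the dirty range instead of asking
// for non-active segments that begin beyond the active segment, which is what
// raised the IllegalArgumentException shown in the ticket above.
public final class CleanableRange {
    static long firstDirtyOffset(long checkpointedCleanOffset,
                                 long logStartOffset,
                                 long activeSegmentBaseOffset) {
        long candidate = Math.max(checkpointedCleanOffset, logStartOffset);
        return Math.min(candidate, activeSegmentBaseOffset);
    }
}
{code}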
[jira] [Resolved] (KAFKA-9101) Create a fetch.max.bytes configuration for the broker
[ https://issues.apache.org/jira/browse/KAFKA-9101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin McCabe resolved KAFKA-9101. - Resolution: Fixed https://github.com/apache/kafka/pull/7595 > Create a fetch.max.bytes configuration for the broker > - > > Key: KAFKA-9101 > URL: https://issues.apache.org/jira/browse/KAFKA-9101 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9158) producer fetch metadata until buffer is full
Brandon Jiang created KAFKA-9158: Summary: producer fetch metadata until buffer is full Key: KAFKA-9158 URL: https://issues.apache.org/jira/browse/KAFKA-9158 Project: Kafka Issue Type: Improvement Components: producer Affects Versions: 2.2.1 Reporter: Brandon Jiang Currently, based on my understanding of the KafkaProducer.doSend() method, when the Kafka Java client is invoked with {code:java} org.apache.kafka.clients.producer.Producer.send(message);{code} the current behavior is to fetch metadata first and only then queue the message for sending. This can cause the producer to hang for up to max.block.ms when the Kafka brokers are down. Is it possible to change the Producer.send() behavior to queue all messages first and only fetch metadata asynchronously when the queue is full? e.g. {code:java} if (msgQueue.batchIsFull) { clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs); this.sender.wakeup(); }{code} I am happy to make this change, but before doing that I just want to check the current design logic behind producer.send(). I have seen there is a lot of discussion about the producer send() behavior, e.g. https://issues.apache.org/jira/browse/KAFKA-2948 and https://issues.apache.org/jira/browse/KAFKA-5369, so I am trying to make sure I did not misunderstand the current behavior here. -- This message was sent by Atlassian Jira (v8.3.4#803005)
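For readers reproducing the behavior described above: today the blocking happens inside send() while it waits for metadata, bounded by max.block.ms. A minimal sketch, with placeholder broker address and topic name:

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class BoundedBlockingSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Bound how long send() may block while waiting for metadata; the default is 60000 ms.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                // With the brokers down, this call blocks up to max.block.ms fetching
                // metadata; the resulting TimeoutException surfaces either synchronously
                // or through the returned future, depending on the client version.
                Future<RecordMetadata> result =
                        producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
                result.get();
            } catch (KafkaException | ExecutionException e) {
                System.err.println("Send did not complete: " + e);
            }
        }
    }
}
{code}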
[jira] [Created] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
zhangzhanchang created KAFKA-9159: - Summary: Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change Key: KAFKA-9159 URL: https://issues.apache.org/jira/browse/KAFKA-9159 Project: Kafka Issue Type: Bug Components: consumer Affects Versions: 2.3.0, 2.0.0, 0.10.2.0 Reporter: zhangzhanchang Case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a broker is killed with kill -9, but after a leader change the same loop runs without problems. Case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a broker the same loop runs without problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
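Until the client behaves consistently across versions, the application-side workaround is a bounded retry around the call. A minimal hedged sketch, with the partition supplied by the caller:

{code:java}
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.TimeoutException;

import java.util.Collections;
import java.util.Map;

public class EndOffsetsWithRetry {

    // Retries endOffsets so that a transient leader change or broker failure does not
    // immediately surface as "Failed to get offsets by times in 30000ms" to the caller.
    static Map<TopicPartition, Long> endOffsets(Consumer<?, ?> consumer, TopicPartition tp, int maxAttempts) {
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return consumer.endOffsets(Collections.singleton(tp));
            } catch (TimeoutException e) {
                last = e; // metadata may still point at the old leader; the next call refreshes it
            }
        }
        throw last;
    }
}
{code}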
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969720#comment-16969720 ] Guozhang Wang commented on KAFKA-9156: -- [~lushilin] Thanks for reporting this ticket, and I think it is indeed an issue: although Option is immutable we are trying to reassign the var. I think we can fix it by replacing the `var Option` with a `val AtomicReference` in which we can rely on `compareAndSet` to set the values atomically. Would you like to provide a patch for it? > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
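For readers following along, the suggested `val AtomicReference` + `compareAndSet` shape looks roughly like the generic Java sketch below. This is illustrative only: the real classes are the Scala LazyTimeIndex/LazyOffsetIndex, and because constructing a TimeIndex mmaps a file, a thread that loses the CAS would additionally have to dispose of the instance it built.

{code:java}
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Generic lazy holder: compareAndSet guarantees that exactly one instance is ever
// published, no matter how many threads race on the first access.
public final class LazyRef<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();
    private final Supplier<T> factory;

    public LazyRef(Supplier<T> factory) {
        this.factory = factory;
    }

    public T get() {
        T existing = ref.get();
        if (existing != null) {
            return existing;
        }
        T created = factory.get();            // may run in several threads concurrently
        if (ref.compareAndSet(null, created)) {
            return created;                    // this thread's instance won the race
        }
        // Another thread published first; its instance is the one everybody uses.
        // (For a resource like a TimeIndex, 'created' would need to be closed here.)
        return ref.get();
    }
}
{code}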
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969747#comment-16969747 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] thanks for reply. i will fix it and provide a patch for this issue at this weekend.before this, i need to learn the process of submitting a patch. > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- External issue URL: https://github.com/apache/kafka/pull/7660 > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0, 2.3.0 >Reporter: zhangzhanchang >Priority: Major > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- External issue URL: (was: https://github.com/apache/kafka/pull/7660) > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0, 2.3.0 >Reporter: zhangzhanchang >Priority: Major > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Attachment: image-2019-11-08-10-28-19-881.png image-2019-11-08-10-29-06-282.png Description: case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after kill -9 broker,but a leader change ,loop call Consumer.endOffsets no problem !image-2019-11-08-10-28-19-881.png! case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after a leader change,but kill -9 broker,loop call Consumer.endOffsets no problem !image-2019-11-08-10-29-06-282.png! was: case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after kill -9 broker,but a leader change ,loop call Consumer.endOffsets no problem case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after a leader change,but kill -9 broker,loop call Consumer.endOffsets no problem > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0, 2.3.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Reviewer: Ismael Juma (was: Guozhang Wang) Affects Version/s: (was: 2.3.0) > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969765#comment-16969765 ] zhangzhanchang commented on KAFKA-9159: --- The proposed change is to use if (metadata.updateRequested() || (future.failed() && future.exception() instanceof InvalidMetadataException)), which is compatible with both cases. > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Reviewer: Guozhang Wang (was: Ismael Juma) > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9138) Add system test covering Foreign Key joins (KIP-213)
[ https://issues.apache.org/jira/browse/KAFKA-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969772#comment-16969772 ] ASF GitHub Bot commented on KAFKA-9138: --- vvcephei commented on pull request #7664: KAFKA-9138: Add system test for relational joins URL: https://github.com/apache/kafka/pull/7664 Adds a system test to verify the new foreign-key join introduced in KIP-213. Also, adds a unit test and integration test to verify the test logic itself. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add system test covering Foreign Key joins (KIP-213) > > > Key: KAFKA-9138 > URL: https://issues.apache.org/jira/browse/KAFKA-9138 > Project: Kafka > Issue Type: Test > Components: streams >Reporter: John Roesler >Assignee: John Roesler >Priority: Major > > There are unit and integration tests, but we should really have a system test > as well. > I plan to create a new test, since this feature is pretty different than the > existing topology/data set of smoke test. Although, it might be possible for > the new test to subsume smoke test. I'd give the new test a few releases to > burn in before considering a merge, though. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9148) Consider forking RocksDB for Streams
[ https://issues.apache.org/jira/browse/KAFKA-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969778#comment-16969778 ] Sophie Blee-Goldman commented on KAFKA-9148: I guess we'd still have to make sure Flink doesn't pull a rocksdb and change any public APIs out from under the users, or allow the APIs to significantly diverge unless we want to go down a path of maintaining entirely distinct RocksDbStateStore and FRocksDbStateStore classes (rather than just swapping out the engine underneath based on the user choice) – in the end it seems like we can't avoid _some_ amount of extra work, so I guess it's a matter of betting on how much it will be and how much can we take on > Consider forking RocksDB for Streams > - > > Key: KAFKA-9148 > URL: https://issues.apache.org/jira/browse/KAFKA-9148 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > > We recently upgraded our RocksDB dependency to 5.18 for its memory-management > abilities (namely the WriteBufferManager, see KAFKA-8215). Unfortunately, > someone from Flink recently discovered a ~8% [performance > regression|https://github.com/facebook/rocksdb/issues/5774] that exists in > all versions 5.18+ (up through the current newest version, 6.2.2). Flink was > able to react to this by downgrading to 5.17 and [picking the > WriteBufferManage|https://github.com/dataArtisans/frocksdb/pull/4]r to their > fork (fRocksDB). > Due to this and other reasons enumerated below, we should consider also > forking our own RocksDB for Streams. > Pros: > * We can avoid passing sudden breaking changes on to our users, such removal > of methods with no deprecation period (see discussion on KAFKA-8897) > * We can pick whichever version has the best performance for our needs, and > pick over any new features, metrics, etc that we need to use rather than > being forced to upgrade (and breaking user code, introducing regression, etc) > * The Java API seems to be a very low priority to the rocksdb folks. > ** They leave out critical functionality, features, and configuration > options that have been in the c++ API for a very long time > ** Those that do make it over often have random gaps in the API such as > setters but no getters (see [rocksdb PR > #5186|https://github.com/facebook/rocksdb/pull/5186]) > ** Others are poorly designed and require too many trips across the JNI, > making otherwise incredibly useful features prohibitively expensive. > *** [Custom comparator|#issuecomment-83145980]]: a custom comparator could > significantly improve the performance of session windows. This is trivial to > do but given the high performance cost of crossing the jni, it is currently > only practical to use a c++ comparator > *** [Prefix Seek|https://github.com/facebook/rocksdb/issues/6004]: not > currently used by Streams but a commonly requested feature, and may also > allow improved range queries > ** Even when an external contributor develops a solution for poorly > performing Java functionality and helpfully tries to contribute their patch > back to rocksdb, it gets ignored by the rocksdb people ([rocksdb PR > #2283|https://github.com/facebook/rocksdb/pull/2283]) > Cons: > * more work > Given that we rarely upgrade the Rocks dependency, use only some fraction of > its features, and would need or want to make only minimal changes ourselves, > it seems like we could actually get away with very little extra work by > forking rocksdb. 
Note that as of this writing the frocksdb repo has only > needed to open 5 PRs on top of the actual rocksdb (two of them trivial). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9148) Consider forking RocksDB for Streams
[ https://issues.apache.org/jira/browse/KAFKA-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969779#comment-16969779 ] Sophie Blee-Goldman commented on KAFKA-9148: FWIW I actually do think we could gain a lot by forking rocksdb just for the custom comparator alone, which would mean practically no extra work as there's no reason that should conflict with anything else in rocks > Consider forking RocksDB for Streams > - > > Key: KAFKA-9148 > URL: https://issues.apache.org/jira/browse/KAFKA-9148 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > > We recently upgraded our RocksDB dependency to 5.18 for its memory-management > abilities (namely the WriteBufferManager, see KAFKA-8215). Unfortunately, > someone from Flink recently discovered a ~8% [performance > regression|https://github.com/facebook/rocksdb/issues/5774] that exists in > all versions 5.18+ (up through the current newest version, 6.2.2). Flink was > able to react to this by downgrading to 5.17 and [picking the > WriteBufferManage|https://github.com/dataArtisans/frocksdb/pull/4]r to their > fork (fRocksDB). > Due to this and other reasons enumerated below, we should consider also > forking our own RocksDB for Streams. > Pros: > * We can avoid passing sudden breaking changes on to our users, such removal > of methods with no deprecation period (see discussion on KAFKA-8897) > * We can pick whichever version has the best performance for our needs, and > pick over any new features, metrics, etc that we need to use rather than > being forced to upgrade (and breaking user code, introducing regression, etc) > * The Java API seems to be a very low priority to the rocksdb folks. > ** They leave out critical functionality, features, and configuration > options that have been in the c++ API for a very long time > ** Those that do make it over often have random gaps in the API such as > setters but no getters (see [rocksdb PR > #5186|https://github.com/facebook/rocksdb/pull/5186]) > ** Others are poorly designed and require too many trips across the JNI, > making otherwise incredibly useful features prohibitively expensive. > *** [Custom comparator|#issuecomment-83145980]]: a custom comparator could > significantly improve the performance of session windows. This is trivial to > do but given the high performance cost of crossing the jni, it is currently > only practical to use a c++ comparator > *** [Prefix Seek|https://github.com/facebook/rocksdb/issues/6004]: not > currently used by Streams but a commonly requested feature, and may also > allow improved range queries > ** Even when an external contributor develops a solution for poorly > performing Java functionality and helpfully tries to contribute their patch > back to rocksdb, it gets ignored by the rocksdb people ([rocksdb PR > #2283|https://github.com/facebook/rocksdb/pull/2283]) > Cons: > * more work > Given that we rarely upgrade the Rocks dependency, use only some fraction of > its features, and would need or want to make only minimal changes ourselves, > it seems like we could actually get away with very little extra work by > forking rocksdb. Note that as of this writing the frocksdb repo has only > needed to open 5 PRs on top of the actual rocksdb (two of them trivial). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder {code} def get: TimeIndex = \{ if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 3:24 AM: --- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} was (Author: lushilin): [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder {code} def get: TimeIndex = \{ if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 3:27 AM: --- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instabce.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} was (Author: lushilin): [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 3:27 AM: --- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instance.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} was (Author: lushilin): [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instabce.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shilin Lu updated KAFKA-9156: - Comment: was deleted (was: [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instance.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code}) > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9160) Use a common stateMachine to make the state transfer simpler and clearer.
maobaolong created KAFKA-9160: - Summary: Use a common stateMachine to make the state transfer simpler and clearer. Key: KAFKA-9160 URL: https://issues.apache.org/jira/browse/KAFKA-9160 Project: Kafka Issue Type: Improvement Components: controller Affects Versions: 2.5.0 Reporter: maobaolong We can use the repo https://github.com/maobaolong/statemachine_demo as a reference. -- This message was sent by Atlassian Jira (v8.3.4#803005)
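To make the proposal concrete, a common transition-table state machine could look roughly like the sketch below. The state names and edges are made up for illustration; they are not taken from the controller code or from the linked repo.

{code:java}
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum DemoState { NEW, RUNNING, STOPPED, CLOSED }

// A single reusable transition table: every component declares its valid edges once,
// and every transition goes through the same validation and error message.
final class StateMachine {
    private final Map<DemoState, Set<DemoState>> validTransitions = Map.of(
            DemoState.NEW,     EnumSet.of(DemoState.RUNNING),
            DemoState.RUNNING, EnumSet.of(DemoState.STOPPED),
            DemoState.STOPPED, EnumSet.of(DemoState.RUNNING, DemoState.CLOSED),
            DemoState.CLOSED,  EnumSet.noneOf(DemoState.class));

    private DemoState current = DemoState.NEW;

    synchronized void transitionTo(DemoState target) {
        if (!validTransitions.get(current).contains(target)) {
            throw new IllegalStateException("Invalid transition " + current + " -> " + target);
        }
        current = target;
    }

    synchronized DemoState current() {
        return current;
    }
}
{code}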
[jira] [Commented] (KAFKA-6747) kafka-streams Invalid transition attempted from state READY to state ABORTING_TRANSACTION
[ https://issues.apache.org/jira/browse/KAFKA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969814#comment-16969814 ] Leon commented on KAFKA-6747: - What is the valid transition? I got Invalid transition attempted from state COMMITTING_TRANSACTION to state ABORTING_TRANSACTION, when the brokers are down. The code flow is to abort when commit throws exception. > kafka-streams Invalid transition attempted from state READY to state > ABORTING_TRANSACTION > - > > Key: KAFKA-6747 > URL: https://issues.apache.org/jira/browse/KAFKA-6747 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.1.0 >Reporter: Frederic Arno >Assignee: Zhihong Yu >Priority: Major > Fix For: 0.11.0.3, 1.0.2, 1.1.1, 2.0.0 > > > [~frederica] running tests against kafka-streams 1.1 and get the following > stack trace (everything was working alright using kafka-streams 1.0): > {code} > ERROR org.apache.kafka.streams.processor.internals.AssignedStreamsTasks - > stream-thread [feedBuilder-XXX-StreamThread-4] Failed to close stream task, > 0_2 > org.apache.kafka.common.KafkaException: TransactionalId feedBuilder-0_2: > Invalid transition attempted from state READY to state ABORTING_TRANSACTION > at > org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:757) > at > org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:751) > at > org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:230) > at > org.apache.kafka.clients.producer.KafkaProducer.abortTransaction(KafkaProducer.java:660) > at > org.apache.kafka.streams.processor.internals.StreamTask.closeSuspended(StreamTask.java:486) > at > org.apache.kafka.streams.processor.internals.StreamTask.close(StreamTask.java:546) > at > org.apache.kafka.streams.processor.internals.AssignedTasks.closeNonRunningTasks(AssignedTasks.java:166) > at > org.apache.kafka.streams.processor.internals.AssignedTasks.suspend(AssignedTasks.java:151) > at > org.apache.kafka.streams.processor.internals.TaskManager.suspendTasksAndState(TaskManager.java:242) > at > org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsRevoked(StreamThread.java:291) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:414) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:359) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:290) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1149) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115) > at > org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:827) > at > org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:784) > at > org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750) > at > org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720) > {code} > This happens when starting the same stream-processing application on 3 JVMs > all running on the same linux box, JVMs are named JVM-[2-4]. All 3 instances > use separate stream state.dir. 
No record is ever processed because the input > kafka topics are empty at this stage. > JVM-2 starts first, joined shortly after by JVM-4 and JVM-3, find the state > transition logs below. The above stacktrace is from JVM-4 > {code} > [JVM-2] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from REBALANCING to > RUNNING > [JVM-4] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from REBALANCING to > RUNNING > [JVM-3] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > JVM-4 crashes here with above stacktrace > [JVM-2] stream-client [feedBuilder-XXX] State transition from REBALANCING to > RUNNING > [JVM-4] stream-client [feedBuilder-XXX] Sta
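For context on which errors the transactional producer is expected to abort versus treat as fatal, the send loop documented on KafkaProducer looks roughly like the sketch below (topic and transactional.id are placeholders). Whether an abort is still legal once a commit has already started failing mid-flight is exactly the grey area the comment above asks about.

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalSendLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-id");      // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            producer.commitTransaction();
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: the only legal follow-up is to close the producer.
            producer.close();
        } catch (KafkaException e) {
            // Abortable errors raised inside an open transaction: abort and start over.
            producer.abortTransaction();
        }
        producer.close();
    }
}
{code}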
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969820#comment-16969820 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] i think only use val AtomicReference can't perform as well. {code:java} // code placeholder class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) { @volatile private var timeIndex: Option[TimeIndex] = None def file: File = { if (timeIndex.isDefined) timeIndex.get.file else _file } def file_=(f: File) { if (timeIndex.isDefined) timeIndex.get.file = f else _file = f } def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } }{code} the file will rename by def file_=(f: File) this function. we should keep it Multi-thread safe. i think this code can work.i use synchronized because i think this function is lightweight. if you think has some problem, i will modify it. thanks {code:java} // code placeholder class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) { @volatile private var timeIndex: Option[TimeIndex] = None def file: File = { this synchronized { if (timeIndex.isDefined) timeIndex.get.file else _file } } def file_=(f: File) { this synchronized { if (timeIndex.isDefined) timeIndex.get.file = f else _file = f } } def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } }{code} > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9161) Close gaps in Streams configs documentation
Sophie Blee-Goldman created KAFKA-9161: -- Summary: Close gaps in Streams configs documentation Key: KAFKA-9161 URL: https://issues.apache.org/jira/browse/KAFKA-9161 Project: Kafka Issue Type: Bug Components: documentation, streams Reporter: Sophie Blee-Goldman There are a number of Streams configs that aren't documented in the "Configuring a Streams Application" section of the docs. As of 2.3 the missing configs are: # default.windowed.key.serde.inner * # default.windowed.value.serde.inner * # max.task.idle.ms # rocksdb.config.setter. ** # topology.optimization # upgrade.from * these configs are also missing the corresponding DOC string ** this one actually does appear on that page, but instead of being included in the list of Streams configs it is for some reason under "Consumer and Producer Configuration Parameters" ? There are also a few configs whose documented name is slightly incorrect, ie it is missing the "default" prefix that appears in the actual code. I assume this was not intentional because there are other configs that do have the "default" prefix included, so it doesn't seem to be dropped on purpose. The "missing-default" configs are: # key.serde # value.serde # timestamp.extractor -- This message was sent by Atlassian Jira (v8.3.4#803005)
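A minimal sketch of two of the items on that list, mainly to record where the names live in code (the RocksDB option chosen below is arbitrary): the serde keys carry the "default." prefix via StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG / DEFAULT_VALUE_SERDE_CLASS_CONFIG, and rocksdb.config.setter points at a class implementing RocksDBConfigSetter.

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

import java.util.Map;
import java.util.Properties;

public class StreamsConfigExample {

    // Minimal RocksDB config setter, registered via rocksdb.config.setter below.
    public static class BoundedMemtables implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options,
                              final Map<String, Object> configs) {
            options.setMaxWriteBufferNumber(2); // shrink the per-store memtable count
        }
    }

    public static Properties buildProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "docs-demo");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // The actual config keys carry the "default." prefix the docs sometimes omit.
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemtables.class);
        return props;
    }
}
{code}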
[jira] [Assigned] (KAFKA-9161) Close gaps in Streams configs documentation
[ https://issues.apache.org/jira/browse/KAFKA-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sophie Blee-Goldman reassigned KAFKA-9161: -- Assignee: Sophie Blee-Goldman > Close gaps in Streams configs documentation > --- > > Key: KAFKA-9161 > URL: https://issues.apache.org/jira/browse/KAFKA-9161 > Project: Kafka > Issue Type: Bug > Components: documentation, streams >Reporter: Sophie Blee-Goldman >Assignee: Sophie Blee-Goldman >Priority: Major > > There are a number of Streams configs that aren't documented in the > "Configuring a Streams Application" section of the docs. As of 2.3 the > missing configs are: > # default.windowed.key.serde.inner * > # default.windowed.value.serde.inner * > # max.task.idle.ms > # rocksdb.config.setter. ** > # topology.optimization > # upgrade.from > * these configs are also missing the corresponding DOC string > ** this one actually does appear on that page, but instead of being included > in the list of Streams configs it is for some reason under "Consumer and > Producer Configuration Parameters" ? > There are also a few configs whose documented name is slightly incorrect, ie > it is missing the "default" prefix that appears in the actual code. I assume > this was not intentional because there are other configs that do have the > "default" prefix included, so it doesn't seem to be dropped on purpose. The > "missing-default" configs are: > # key.serde > # value.serde > # timestamp.extractor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969828#comment-16969828 ] Manikumar commented on KAFKA-8933: -- [~mimaison] If this existing bug in previous releases, then it's not a blocker. Any regressions in 2.4 branch or data loss issues considered as blocker. > An unhandled SSL handshake exception in polling event - needed a retry logic > > > Key: KAFKA-8933 > URL: https://issues.apache.org/jira/browse/KAFKA-8933 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.2.1, 2.4.0 > Environment: software platform >Reporter: Remigius >Priority: Critical > > Already client is connected and during polling event, SSL handshake failure > happened. it led to leaving the co-ordinator. Even on SSL handshake failure > which was actually intermittent issue, polling should have some resilient and > retry the polling. Leaving group caused all instances of clients to drop and > left the messages in Kafka for long time until re-subscribe the kafka topic > manually. > > > {noformat} > 2019-09-06 04:03:09,016 ERROR [reactive-kafka-] > org.apache.kafka.clients.NetworkClient [Consumer clientId=aaa, groupId=bbb] > Connection to node 150 (host:port) failed authentication due to: SSL > handshake failed > 2019-09-06 04:03:09,021 ERROR [reactive-kafka-] > reactor.kafka.receiver.internals.DefaultKafkaReceiver Unexpected exception > java.lang.NullPointerException: null > at > org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1012) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:822) > ~[kafka-clients-2.2.1.jar!/:?] > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) > ~[kafka-clients-2.2.1.jar!/:?] 
> at > reactor.kafka.receiver.internals.DefaultKafkaReceiver$PollEvent.run(DefaultKafkaReceiver.java:470) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.doEvent(DefaultKafkaReceiver.java:401) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.lambda$start$14(DefaultKafkaReceiver.java:335) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:130) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:398) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:484) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.kafka.receiver.internals.KafkaSchedulers$EventScheduler.lambda$decorate$1(KafkaSchedulers.java:100) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > org.springframework.cloud.sleuth.instrument.async.TraceCallable.call(TraceCallable.java:70) > ~[spring-cloud-sleuth-core-2.1.1.RELEASE.jar!/:2.1.1.RELEASE] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2019-09-06 04:03:09,023 INFO [reactive-kafka-] > org.apache.kafka.clients.consumer.internals.AbstractCoordinator [Consumer > clientId=aaa, groupId=bbb] Member x_13-081e61ec-1509-4e0e-819e-58063d1ce8f6 > sending LeaveGroup request to coordinator{noformat
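Until the client itself treats this as retriable, one application-level mitigation is to catch authentication failures from poll(), back off, and re-create the consumer rather than giving up. A minimal, illustrative sketch under that assumption (property names, backoff values, and the handler are placeholders, and this does not address the NullPointerException shown in the trace above):
{code:java}
import java.time.Duration;
import java.util.Collection;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.AuthenticationException;

public class ResilientPoller {
    // Poll in a loop; on an (often intermittent) authentication failure such as an
    // SSL handshake error, back off and retry instead of giving up and leaving the group.
    public static void pollWithRetry(Properties consumerProps,
                                     Collection<String> topics,
                                     java.util.function.Consumer<ConsumerRecords<String, String>> handler)
            throws InterruptedException {
        long backoffMs = 1_000L;                 // illustrative starting backoff
        while (!Thread.currentThread().isInterrupted()) {
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(topics);
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    handler.accept(records);
                    backoffMs = 1_000L;          // reset backoff after a successful poll
                }
            } catch (AuthenticationException e) {
                // e.g. an intermittent SSL handshake failure surfaced by poll()
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 60_000L);
            }
        }
    }
}
{code}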
[jira] [Comment Edited] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969828#comment-16969828 ] Manikumar edited comment on KAFKA-8933 at 11/8/19 4:40 AM: --- [~mimaison] If this is existing bug in previous releases, then it's not a blocker. Any regressions in 2.4 branch or data loss issues considered as blocker. was (Author: omkreddy): [~mimaison] If this existing bug in previous releases, then it's not a blocker. Any regressions in 2.4 branch or data loss issues considered as blocker. > An unhandled SSL handshake exception in polling event - needed a retry logic > > > Key: KAFKA-8933 > URL: https://issues.apache.org/jira/browse/KAFKA-8933 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.2.1, 2.4.0 > Environment: software platform >Reporter: Remigius >Priority: Critical > > Already client is connected and during polling event, SSL handshake failure > happened. it led to leaving the co-ordinator. Even on SSL handshake failure > which was actually intermittent issue, polling should have some resilient and > retry the polling. Leaving group caused all instances of clients to drop and > left the messages in Kafka for long time until re-subscribe the kafka topic > manually. > > > {noformat} > 2019-09-06 04:03:09,016 ERROR [reactive-kafka-] > org.apache.kafka.clients.NetworkClient [Consumer clientId=aaa, groupId=bbb] > Connection to node 150 (host:port) failed authentication due to: SSL > handshake failed > 2019-09-06 04:03:09,021 ERROR [reactive-kafka-] > reactor.kafka.receiver.internals.DefaultKafkaReceiver Unexpected exception > java.lang.NullPointerException: null > at > org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1012) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:822) > ~[kafka-clients-2.2.1.jar!/:?] > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) > ~[kafka-clients-2.2.1.jar!/:?] 
> at > reactor.kafka.receiver.internals.DefaultKafkaReceiver$PollEvent.run(DefaultKafkaReceiver.java:470) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.doEvent(DefaultKafkaReceiver.java:401) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.lambda$start$14(DefaultKafkaReceiver.java:335) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:130) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:398) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:484) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.kafka.receiver.internals.KafkaSchedulers$EventScheduler.lambda$decorate$1(KafkaSchedulers.java:100) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > org.springframework.cloud.sleuth.instrument.async.TraceCallable.call(TraceCallable.java:70) > ~[spring-cloud-sleuth-core-2.1.1.RELEASE.jar!/:2.1.1.RELEASE] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2019-09-06 04:03:09,023 INFO
[jira] [Commented] (KAFKA-9134) Refactor the StreamsPartitionAssignor for more code sharing with the FutureStreamsPartitionAssignor
[ https://issues.apache.org/jira/browse/KAFKA-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969829#comment-16969829 ] John Roesler commented on KAFKA-9134: - Thanks for calling this out [~ableegoldman]. Recently, while chasing down some problems with the version probing test, I actually developed the opinion that the blame is in the opposite direction... Namely, that the FutureStreamsPartitionAssignor shares too much code with the StreamsPartitionAssignor. Reading over the details of your thinking, though, I'm wondering if these are really just two ways of saying the same thing. Actually, I recently made some changes that sound directionally similar to what you propose. Can you take a look at https://github.com/apache/kafka/pull/7649 (and the other changes in that PR) and comment on whether you think it addressed (some of) your concerns in this ticket? > Refactor the StreamsPartitionAssignor for more code sharing with the > FutureStreamsPartitionAssignor > --- > > Key: KAFKA-9134 > URL: https://issues.apache.org/jira/browse/KAFKA-9134 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > > Frequently when fixing bugs in the StreamsPartitionAssignor, version probing, > or other assignor-related matters, we make a change in the > StreamsPartitionAssignor and re-run the version_probing_upgrade system test > only to see that the issue is not fixed because we forgot to mirror that > change in the FutureStreamsPartitionAssignor. Worse yet, we are making a new > change or fixing something that doesn't directly affect version probing, so > we update only the StreamsPartitionAssignor and don't even run the version > probing test, and discover later the version probing system test has started > failing. Then we often waste time digging through old changes just to > discover it was just because we forgot to copy any changes to the > StreamsUpgradeTest classes (includes also it's future version of the > SubscriptionInfo and AssignmentInfo classes) > > We should refactor the StreamsPartitionAssignor so that the future version > can rely more heavily on the original class. This will probably involve > either or all of: > * making the class's latestSupportedVersion configurable in the constructor, > to be used only by the system test. this will probably also require making it > configurable in some other classes such as SubscriptionInfo and AssignmentInfo > * breaking up the methods such as onAssignment and assign so that the future > assignor can call smaller pieces at a time – part of the original problem is > that the normal assignor will throw an exception partway through these > methods if the latest version is a "future" one, so we end up having to call > the method, catch the exception, and recode the remainder of the work in that > method -- This message was sent by Atlassian Jira (v8.3.4#803005)
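As a rough, hypothetical illustration of the first bullet above (this is not the real StreamsPartitionAssignor API; class and method names are invented): making the latest supported version an injectable constructor argument lets a "future" test assignor reuse the production logic instead of duplicating it.
{code:java}
// Hypothetical sketch: inject the version ceiling instead of hard-coding it,
// so a "future" system-test assignor can reuse the production code path.
public class VersionedAssignor {

    public static final int LATEST_SUPPORTED_VERSION = 5;   // placeholder value

    private final int latestSupportedVersion;

    // Production code uses the real constant...
    public VersionedAssignor() {
        this(LATEST_SUPPORTED_VERSION);
    }

    // ...while the system test passes LATEST_SUPPORTED_VERSION + 1 to simulate a future version.
    VersionedAssignor(final int latestSupportedVersion) {
        this.latestSupportedVersion = latestSupportedVersion;
    }

    public int effectiveVersion(final int requestedVersion) {
        // Instead of throwing when requestedVersion exceeds the compiled-in maximum,
        // clamp to the injected ceiling so the same method body serves both assignors.
        return Math.min(requestedVersion, latestSupportedVersion);
    }
}
{code}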
[jira] [Updated] (KAFKA-9161) Close gaps in Streams configs documentation
[ https://issues.apache.org/jira/browse/KAFKA-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias J. Sax updated KAFKA-9161: --- Labels: beginner newbie (was: ) > Close gaps in Streams configs documentation > --- > > Key: KAFKA-9161 > URL: https://issues.apache.org/jira/browse/KAFKA-9161 > Project: Kafka > Issue Type: Bug > Components: documentation, streams >Reporter: Sophie Blee-Goldman >Assignee: Sophie Blee-Goldman >Priority: Major > Labels: beginner, newbie > > There are a number of Streams configs that aren't documented in the > "Configuring a Streams Application" section of the docs. As of 2.3 the > missing configs are: > # default.windowed.key.serde.inner * > # default.windowed.value.serde.inner * > # max.task.idle.ms > # rocksdb.config.setter. ** > # topology.optimization > # upgrade.from > * these configs are also missing the corresponding DOC string > ** this one actually does appear on that page, but instead of being included > in the list of Streams configs it is for some reason under "Consumer and > Producer Configuration Parameters" ? > There are also a few configs whose documented name is slightly incorrect, ie > it is missing the "default" prefix that appears in the actual code. I assume > this was not intentional because there are other configs that do have the > "default" prefix included, so it doesn't seem to be dropped on purpose. The > "missing-default" configs are: > # key.serde > # value.serde > # timestamp.extractor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969820#comment-16969820 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 5:07 AM: --- [~guozhang] I think using only a val AtomicReference does not work well here.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    if (timeIndex.isDefined) timeIndex.get.file else _file
  }

  def file_=(f: File) {
    if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
The file is renamed through def file_=(f: File), so the _file and timeIndex fields are not safe under multi-threaded access; we should keep them thread safe. I think the code below works. I use synchronized because these functions are lightweight. If you see any problem with it, I will modify it. Thanks.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file else _file
    }
  }

  def file_=(f: File) {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
    }
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
was (Author: lushilin): [~guozhang] I think using only a val AtomicReference does not work well here.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    if (timeIndex.isDefined) timeIndex.get.file else _file
  }

  def file_=(f: File) {
    if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
The file is renamed through def file_=(f: File); we should keep it thread safe. I think the code below works. I use synchronized because these functions are lightweight. If you see any problem with it, I will modify it. Thanks.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file else _file
    }
  }

  def file_=(f: File) {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
    }
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
> LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > The TimeIndex get function is not thread safe, so it may create more than one TimeIndex. > !image-2019-11-07-17-44-05-357.png! > When more than one TimeIndex is created, the MappedByteBuffer position may move to the end; writing an index entry to this mmap file then causes > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
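For readers less familiar with the pattern under discussion, the proposal above is double-checked locking around the lazy creation, plus synchronizing the accessors that can race with it. A generic Java sketch of that idea, illustrative only and not Kafka's actual LazyTimeIndex/LazyOffsetIndex code:
{code:java}
import java.io.File;

// Generic double-checked lazy holder: renames and reads of the backing file are
// synchronized against the lazy creation so only one underlying index is ever built.
public class LazyIndex<T> {

    public interface Factory<T> { T create(File file); }

    private volatile File file;
    private volatile T index;                 // null until first use
    private final Factory<T> factory;

    public LazyIndex(File file, Factory<T> factory) {
        this.file = file;
        this.factory = factory;
    }

    public synchronized File file() {
        return file;
    }

    public synchronized void renameTo(File newFile) {
        // synchronized so a rename cannot interleave with the creation in get()
        this.file = newFile;
    }

    public T get() {
        T result = index;
        if (result == null) {                 // first check without the lock
            synchronized (this) {
                result = index;
                if (result == null) {         // second check under the lock
                    result = factory.create(file);
                    index = result;
                }
            }
        }
        return result;
    }
}
{code}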
[jira] [Updated] (KAFKA-9158) Make producer fetch metadata only when buffer is full
[ https://issues.apache.org/jira/browse/KAFKA-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Jiang updated KAFKA-9158: - Affects Version/s: (was: 2.2.1) 2.3.0 > Make producer fetch metadata only when buffer is full > - > > Key: KAFKA-9158 > URL: https://issues.apache.org/jira/browse/KAFKA-9158 > Project: Kafka > Issue Type: Improvement > Components: producer >Affects Versions: 2.3.0 >Reporter: Brandon Jiang >Priority: Major > > Currently, based on my understanding of the KafkaProducer.doSend() method, when > triggering the Kafka Java client with > {code:java} > org.apache.kafka.clients.producer.Producer.send(message);{code} > the current behavior is to do the metadata fetch first and then queue the > message for sending? This can result in the Kafka producer hanging for up to > max.block.ms when the Kafka servers are down. > Is it possible to change the Producer.send() behavior to queue all > the messages first and only do the metadata fetch asynchronously when the queue is > full? e.g. > {code:java} > if (msgQueue.batchIsFull ) { > clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), > maxBlockTimeMs); > . > this.sender.wakeup(); > }{code} > I am happy to make this change, but before doing that I just want to check > the current design logic behind producer.send(). > I have seen there are lots of discussions regarding the producer send() behavior, > e.g. https://issues.apache.org/jira/browse/KAFKA-2948, > https://issues.apache.org/jira/browse/KAFKA-5369 > So I am trying to make sure I did not misunderstand the current behavior here. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9158) Make producer fetch metadata only when buffer is full
[ https://issues.apache.org/jira/browse/KAFKA-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Jiang updated KAFKA-9158: - Summary: Make producer fetch metadata only when buffer is full (was: producer fetch metadata until buffer is full) > Make producer fetch metadata only when buffer is full > - > > Key: KAFKA-9158 > URL: https://issues.apache.org/jira/browse/KAFKA-9158 > Project: Kafka > Issue Type: Improvement > Components: producer >Affects Versions: 2.2.1 >Reporter: Brandon Jiang >Priority: Major > > Currently, based on my understanding of the KafkaProducer.doSend() method, when > triggering the Kafka Java client with > {code:java} > org.apache.kafka.clients.producer.Producer.send(message);{code} > the current behavior is to do the metadata fetch first and then queue the > message for sending? This can result in the Kafka producer hanging for up to > max.block.ms when the Kafka servers are down. > Is it possible to change the Producer.send() behavior to queue all > the messages first and only do the metadata fetch asynchronously when the queue is > full? e.g. > {code:java} > if (msgQueue.batchIsFull ) { > clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), > maxBlockTimeMs); > . > this.sender.wakeup(); > }{code} > I am happy to make this change, but before doing that I just want to check > the current design logic behind producer.send(). > I have seen there are lots of discussions regarding the producer send() behavior, > e.g. https://issues.apache.org/jira/browse/KAFKA-2948, > https://issues.apache.org/jira/browse/KAFKA-5369 > So I am trying to make sure I did not misunderstand the current behavior here. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
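For context on the current behavior described in this ticket: the time send() can spend blocked on that initial metadata fetch is bounded by max.block.ms, so applications that cannot tolerate the hang usually lower it and handle the resulting timeout. A minimal sketch under those assumptions (the topic name, timeout value, and error handling are illustrative; depending on the client version the timeout may surface as a thrown exception or as a failed Future):
{code:java}
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class BoundedBlockingSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Bound how long send() may block waiting for metadata (the default is 60000 ms).
        props.put("max.block.ms", "5000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                Future<RecordMetadata> future =
                        producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
                future.get();                       // surfaces broker-side or send errors
            } catch (Exception e) {
                // With brokers down, the metadata wait times out after max.block.ms;
                // handle or buffer the record at the application level here.
                System.err.println("send failed: " + e);
            }
        }
    }
}
{code}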
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Description: case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after kill -9 of a broker, but after a leader change the same loop has no problem !image-2019-11-08-10-28-19-881.png|width=416,height=299! case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a broker the same loop has no problem !image-2019-11-08-10-29-06-282.png|width=412,height=314! was: case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after kill -9 of a broker, but after a leader change the same loop has no problem !image-2019-11-08-10-28-19-881.png! case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a broker the same loop has no problem !image-2019-11-08-10-29-06-282.png! > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 30000ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: > Failed to get offsets by times in 30000ms after kill -9 of a broker, but after a leader > change the same loop has no problem > !image-2019-11-08-10-28-19-881.png|width=416,height=299! > case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: > Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a > broker the same loop has no problem > !image-2019-11-08-10-29-06-282.png|width=412,height=314! -- This message was sent by Atlassian Jira (v8.3.4#803005)
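Until the underlying issue is understood, callers on 2.0+ clients can at least bound and retry the lookup themselves, since endOffsets accepts an explicit timeout. A minimal, illustrative sketch of that workaround (the retry count and backoff are assumptions):
{code:java}
import java.time.Duration;
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.TimeoutException;

public class EndOffsetsWithRetry {
    // Retry endOffsets a few times, e.g. to ride out a leader change,
    // instead of failing on the first TimeoutException.
    public static Map<TopicPartition, Long> endOffsets(Consumer<?, ?> consumer,
                                                       Collection<TopicPartition> partitions)
            throws InterruptedException {
        final int maxAttempts = 5;                       // illustrative
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return consumer.endOffsets(partitions, Duration.ofSeconds(30));
            } catch (TimeoutException e) {
                last = e;                                // metadata may still point at the old leader
                Thread.sleep(1_000L);                    // illustrative backoff
            }
        }
        throw last;
    }
}
{code}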
[jira] [Assigned] (KAFKA-8677) Flakey test GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl
[ https://issues.apache.org/jira/browse/KAFKA-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Bradstreet reassigned KAFKA-8677: --- Assignee: (was: Lucas Bradstreet) > Flakey test > GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > > > Key: KAFKA-8677 > URL: https://issues.apache.org/jira/browse/KAFKA-8677 > Project: Kafka > Issue Type: Bug > Components: core, security, unit tests >Affects Versions: 2.4.0 >Reporter: Boyang Chen >Priority: Blocker > Labels: flaky-test > Fix For: 2.4.0 > > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/6325/console] > > *18:43:39* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl STARTED*18:44:00* > kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > failed, log available in > /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.12/core/build/reports/testOutput/kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl.test.stdout*18:44:00* > *18:44:00* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl FAILED*18:44:00* > org.scalatest.exceptions.TestFailedException: Consumed 0 records before > timeout instead of the expected 1 records > --- > I found this flaky test is actually exposing a real bug in consumer: within > {{KafkaConsumer.poll}}, we have an optimization to try to send the next fetch > request before returning the data in order to pipelining the fetch requests: > {code} > if (!records.isEmpty()) { > // before returning the fetched records, we can send off > the next round of fetches > // and avoid block waiting for their responses to enable > pipelining while the user > // is handling the fetched records. > // > // NOTE: since the consumed position has already been > updated, we must not allow > // wakeups or any other errors to be triggered prior to > returning the fetched records. > if (fetcher.sendFetches() > 0 || > client.hasPendingRequests()) { > client.pollNoWakeup(); > } > return this.interceptors.onConsume(new > ConsumerRecords<>(records)); > } > {code} > As the NOTE mentioned, this pollNoWakeup should NOT throw any exceptions, > since at this point the fetch position has been updated. If an exception is > thrown here, and the callers decides to capture and continue, those records > would never be returned again, causing data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-8677) Flakey test GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl
[ https://issues.apache.org/jira/browse/KAFKA-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Bradstreet reassigned KAFKA-8677: --- Assignee: Lucas Bradstreet (was: Guozhang Wang) > Flakey test > GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > > > Key: KAFKA-8677 > URL: https://issues.apache.org/jira/browse/KAFKA-8677 > Project: Kafka > Issue Type: Bug > Components: core, security, unit tests >Affects Versions: 2.4.0 >Reporter: Boyang Chen >Assignee: Lucas Bradstreet >Priority: Blocker > Labels: flaky-test > Fix For: 2.4.0 > > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/6325/console] > > *18:43:39* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl STARTED*18:44:00* > kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > failed, log available in > /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.12/core/build/reports/testOutput/kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl.test.stdout*18:44:00* > *18:44:00* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl FAILED*18:44:00* > org.scalatest.exceptions.TestFailedException: Consumed 0 records before > timeout instead of the expected 1 records > --- > I found this flaky test is actually exposing a real bug in consumer: within > {{KafkaConsumer.poll}}, we have an optimization to try to send the next fetch > request before returning the data in order to pipelining the fetch requests: > {code} > if (!records.isEmpty()) { > // before returning the fetched records, we can send off > the next round of fetches > // and avoid block waiting for their responses to enable > pipelining while the user > // is handling the fetched records. > // > // NOTE: since the consumed position has already been > updated, we must not allow > // wakeups or any other errors to be triggered prior to > returning the fetched records. > if (fetcher.sendFetches() > 0 || > client.hasPendingRequests()) { > client.pollNoWakeup(); > } > return this.interceptors.onConsume(new > ConsumerRecords<>(records)); > } > {code} > As the NOTE mentioned, this pollNoWakeup should NOT throw any exceptions, > since at this point the fetch position has been updated. If an exception is > thrown here, and the callers decides to capture and continue, those records > would never be returned again, causing data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
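The NOTE quoted above also has a practical corollary for applications: catching an unexpected exception from poll() and simply continuing can skip records whose positions were already advanced. A small illustrative sketch of the safer shape (the topic name and handling are assumptions, and this is not the internal fix being discussed):
{code:java}
import java.time.Duration;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class PollErrorHandling {
    public static void consume(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("demo-topic"));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
                consumer.commitSync();
            }
        } catch (WakeupException e) {
            // expected on shutdown, nothing to do
        } catch (RuntimeException e) {
            // Do NOT swallow and keep polling: if positions were already advanced when the
            // exception was raised, continuing from here could silently skip fetched records.
            throw e;
        } finally {
            consumer.close();
        }
    }
}
{code}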
[jira] [Created] (KAFKA-9162) Re-assignment of Partition gets infinitely stuck "Still in progress" state
vikash kumar created KAFKA-9162: --- Summary: Re-assignment of Partition gets infinitely stuck "Still in progress" state Key: KAFKA-9162 URL: https://issues.apache.org/jira/browse/KAFKA-9162 Project: Kafka Issue Type: Bug Affects Versions: 0.10.1.1 Environment: Amazon Linux 1 x86_64 GNU/Linux Reporter: vikash kumar We have a 6 node kafka cluster. Out of the 6 machines, 3 of them have both Kafka + zookeeper, and the remaining 3 of them have just kafka. Recently, we added one more kafka node. While re-assigning the partitions to all the nodes (including the newer one), we executed the below command: /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --execute --zookeeper localhost:2181 However, when we verified the status by using the below command, /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 we get the below output; for some of the partitions, re-assignment is still in progress. /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 | grep 'progress' Reassignment of partition [topic-name,854] is still in progress Reassignment of partition [topic-name,674] is still in progress Reassignment of partition [topic-name,944] is still in progress Reassignment of partition [topic-name,404] is still in progress Reassignment of partition [topic-name,314] is still in progress Reassignment of partition [topic-name,853] is still in progress Reassignment of partition [prom-metrics,403] is still in progress Reassignment of partition [prom-metrics,134] is still in progress There is no way to either: 1. Cancel the on-going partition re-assignment, or 2. Roll it back (when we try doing that, it says "There is an existing assignment running."). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9163) Re-assignment of Partition gets infinitely stuck "Still in progress" state
vikash kumar created KAFKA-9163: --- Summary: Re-assignment of Partition gets infinitely stuck "Still in progress" state Key: KAFKA-9163 URL: https://issues.apache.org/jira/browse/KAFKA-9163 Project: Kafka Issue Type: Bug Affects Versions: 0.10.1.1 Environment: Amazon Linux 1 x86_64 GNU/Linux Reporter: vikash kumar We have a 6 node kafka cluster. Out of the 6 machines, 3 of them have both Kafka + zookeeper, and the remaining 3 of them have just kafka. Recently, we added one more kafka node. While re-assigning the partitions to all the nodes (including the newer one), we executed the below command: /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --execute --zookeeper localhost:2181 However, when we verified the status by using the below command, /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 we get the below output; for some of the partitions, re-assignment is still in progress. /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 | grep 'progress' Reassignment of partition [topic-name,854] is still in progress Reassignment of partition [topic-name,674] is still in progress Reassignment of partition [topic-name,944] is still in progress Reassignment of partition [topic-name,404] is still in progress Reassignment of partition [topic-name,314] is still in progress Reassignment of partition [topic-name,853] is still in progress Reassignment of partition [prom-metrics,403] is still in progress Reassignment of partition [prom-metrics,134] is still in progress There is no way to either: 1. Cancel the on-going partition re-assignment, or 2. Roll it back (when we try doing that, it says "There is an existing assignment running."). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9133) LogCleaner thread dies with: currentLog cannot be empty on an unexpected exception
[ https://issues.apache.org/jira/browse/KAFKA-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969917#comment-16969917 ] ASF GitHub Bot commented on KAFKA-9133: --- timvlaer commented on pull request #7639: KAFKA-9133: add extra bounds check for dirty offset before cleaning URL: https://github.com/apache/kafka/pull/7639 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > LogCleaner thread dies with: currentLog cannot be empty on an unexpected > exception > -- > > Key: KAFKA-9133 > URL: https://issues.apache.org/jira/browse/KAFKA-9133 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.3.1 >Reporter: Karolis Pocius >Assignee: Jason Gustafson >Priority: Blocker > Fix For: 2.4.0, 2.3.2 > > > Log cleaner thread dies without a clear reference to which log is causing it: > {code} > [2019-11-02 11:59:59,078] INFO Starting the log cleaner (kafka.log.LogCleaner) > [2019-11-02 11:59:59,144] INFO [kafka-log-cleaner-thread-0]: Starting > (kafka.log.LogCleaner) > [2019-11-02 11:59:59,199] ERROR [kafka-log-cleaner-thread-0]: Error due to > (kafka.log.LogCleaner) > java.lang.IllegalStateException: currentLog cannot be empty on an unexpected > exception > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:346) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:307) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89) > Caused by: java.lang.IllegalArgumentException: Illegal request for non-active > segments beginning at offset 5033130, which is larger than the active > segment's base offset 5019648 > at kafka.log.Log.nonActiveLogSegmentsFrom(Log.scala:1933) > at > kafka.log.LogCleanerManager$.maxCompactionDelay(LogCleanerManager.scala:491) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$4(LogCleanerManager.scala:184) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$1(LogCleanerManager.scala:181) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at > kafka.log.LogCleanerManager.grabFilthiestCompactedLog(LogCleanerManager.scala:171) > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:321) > ... 2 more > [2019-11-02 11:59:59,200] INFO [kafka-log-cleaner-thread-0]: Stopped > (kafka.log.LogCleaner) > {code} > If I try to ressurect it by dynamically bumping {{log.cleaner.threads}} it > instantly dies with the exact same error. > Not sure if this is something KAFKA-8725 is supposed to address. -- This message was sent by Atlassian Jira (v8.3.4#803005)