[jira] [Commented] (KAFKA-9141) Global state update error: missing recovery or wrong log message
[ https://issues.apache.org/jira/browse/KAFKA-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969047#comment-16969047 ] Matthias J. Sax commented on KAFKA-9141: Calling `cleanUp()` is something a user would need to do explicitly – there is no reason for Kafka Streams to call it. Can you share the full StackTrace/Error message? > Global state update error: missing recovery or wrong log message > > > Key: KAFKA-9141 > URL: https://issues.apache.org/jira/browse/KAFKA-9141 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.3.0 >Reporter: Chris Toomey >Priority: Major > > I'm getting an {{OffsetOutOfRangeException}} accompanied by the log message > "Updating global state failed. You can restart KafkaStreams to recover from > this error." But I've restarted the app several times and it's not > recovering, it keeps failing the same way. > > I see there's a {{cleanUp()}} method on {{KafkaStreams}} that looks like it's > what's needed, but it's not called anywhere in the streams source code. So > either that's a bug and the call should be added to do the recovery, or the > log message is wrong and should be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
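For context, an explicit {{cleanUp()}} call on the application side looks roughly like the sketch below (application id, bootstrap server, and topic are placeholders, not taken from this ticket); {{cleanUp()}} is only valid while the instance is not running, i.e. before {{start()}} or after {{close()}}:

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExplicitCleanUp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "global-store-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        builder.globalTable("some-global-topic");                             // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // Wipes the local state directory for this application id so that global
        // (and local) state is rebuilt from the topics on the next start.
        streams.cleanUp();
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
{code}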
[jira] [Created] (KAFKA-9155) __consuemer_offsets unable to compress properly
yanrui created KAFKA-9155: - Summary: __consuemer_offsets unable to compress properly Key: KAFKA-9155 URL: https://issues.apache.org/jira/browse/KAFKA-9155 Project: Kafka Issue Type: Bug Components: log cleaner Affects Versions: 2.1.1 Reporter: yanrui -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: data.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: data.jpg > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: 32-log.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, data.jpg > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: last_batch.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, data.jpg, last_batch.jpg > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Description: The cleaner thread has been running normally, but partition 32 is never selected as the dirtiest partition. I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is 864717799, but the last offset is 163990182 > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, data.jpg, last_batch.jpg > > > The cleaner thread has been running normally, but partition 32 is never > selected as the dirtiest partition. > I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is > 864717799, but the last offset is 163990182 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Attachment: 32data_detail.jpg > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, 32data_detail.jpg, data.jpg, last_batch.jpg > > > The clean thread has been cleaned up normally, but the dirtyest partition > selection has not been selected to 32. > Found a strange problem:query log cleanup point the __consumer_offsets-32 is > 864717799,but the last offset is 163990182 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9155) __consuemer_offsets unable to compress properly
[ https://issues.apache.org/jira/browse/KAFKA-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanrui updated KAFKA-9155: -- Description: In my environment, the cleaner thread has been running normally. Each internal topic partition has 2 replicas. Take partition 32 as an example: one replica is cleaned normally, but the other is never selected as the dirtiest partition. I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is 864717799, but the last offset is 163990182. I don't know what caused this; asking for help. was: The cleaner thread has been running normally, but partition 32 is never selected as the dirtiest partition. I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is 864717799, but the last offset is 163990182 > __consuemer_offsets unable to compress properly > --- > > Key: KAFKA-9155 > URL: https://issues.apache.org/jira/browse/KAFKA-9155 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.1.1 >Reporter: yanrui >Priority: Major > Attachments: 32-log.jpg, 32data_detail.jpg, data.jpg, last_batch.jpg > > > In my environment, the cleaner thread has been running normally. > Each internal topic partition has 2 replicas. Take partition 32 as an example: > one replica is cleaned normally, but the other is never selected as the > dirtiest partition. > I found a strange problem: the queried log cleanup point for __consumer_offsets-32 is > 864717799, but the last offset is 163990182. I don't know what caused this; asking > for help. -- This message was sent by Atlassian Jira (v8.3.4#803005)
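The cleanup point the reporter compares against the last offset is persisted per log directory in the cleaner-offset-checkpoint file. As a rough way to inspect it, the sketch below prints its entries, assuming the plain-text checkpoint layout (a version line, an entry-count line, then one "topic partition offset" line per partition); the path is a placeholder:

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DumpCleanerCheckpoint {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point this at one of the broker's log directories.
        String file = "/data/kafka-logs/cleaner-offset-checkpoint";
        List<String> lines = Files.readAllLines(Paths.get(file));
        // Assumed layout: line 0 = format version, line 1 = number of entries,
        // remaining lines = "<topic> <partition> <cleanup offset>".
        for (String line : lines.subList(2, lines.size())) {
            String[] parts = line.split(" ");
            System.out.printf("topic=%s partition=%s cleanup-point=%s%n",
                    parts[0], parts[1], parts[2]);
        }
    }
}
{code}

Comparing that cleanup point with the partition's actual end offset is one way to confirm the mismatch described above.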
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969090#comment-16969090 ] Tim Van Laer commented on KAFKA-8803: - I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with: {code} org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-00, topic=galactica.timeline-aligner.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?] Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId {code} It was automatically restarted every time, but it kept failing (even after stopping the whole group). Yesterday two brokers threw an UNKNOWN_LEADER_EPOCH error, and after that the client started to get into trouble. {code} [2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} {code} [2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} Meta: * Kafka Streams 2.3.1, * Broker: 2.3.1 patched without KAFKA-8724 (see KAFKA-9133) I will give the {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers. > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773.
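Since the error message suggests raising the producer's {{max.block.ms}}, a minimal sketch of how that property is forwarded through Kafka Streams configuration is shown below (application id, bootstrap server, topics, and the 120000 ms value are placeholders, not values from this ticket):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MaxBlockMsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        // Forward max.block.ms to the producers that Streams creates; the default of
        // 60000 ms is the timeout seen in the InitProducerId error above.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120000);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                     // placeholder topics

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
{code}

Note that a larger {{max.block.ms}} only buys the broker more time; it does not by itself address the underlying UNKNOWN_LEADER_EPOCH condition discussed in this thread.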
[jira] [Created] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
shilin Lu created KAFKA-9156: Summary: LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state Key: KAFKA-9156 URL: https://issues.apache.org/jira/browse/KAFKA-9156 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.3.0 Reporter: shilin Lu Attachments: image-2019-11-07-17-42-13-852.png, image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png !image-2019-11-07-17-42-13-852.png! The timeindex get function is not thread safe, so it may create more than one timeindex. !image-2019-11-07-17-44-05-357.png! When more than one timeindex is created, the mappedbytebuffer position may be moved to the end; writing an index entry to that mmap file then causes java.nio.BufferOverflowException. !image-2019-11-07-17-46-53-650.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
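Without the attached screenshots, the description reads like an unsynchronized check-then-act during lazy index creation; a minimal, self-contained illustration of that kind of race is sketched below (class and field names are illustrative, not the actual LazyTimeIndex code):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class LazyIndexRace {
    static final AtomicInteger created = new AtomicInteger();

    // Illustrative stand-in for the index object that maps a file into memory.
    static class Index {
        Index() { created.incrementAndGet(); }
    }

    private volatile Index index; // volatile alone does not make the check below atomic

    Index get() {
        if (index == null) {      // two threads can both observe null here...
            index = new Index();  // ...and both create (and would both mmap) an index
        }
        return index;
    }

    public static void main(String[] args) throws InterruptedException {
        LazyIndexRace lazy = new LazyIndexRace();
        Thread t1 = new Thread(lazy::get);
        Thread t2 = new Thread(lazy::get);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Occasionally prints 2, which is the "more than one timeindex" symptom.
        System.out.println("indexes created: " + created.get());
    }
}
{code}

The usual remedy is to guard the lazy creation with synchronization (or double-checked locking on the volatile field) so that at most one index is ever created and mapped.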
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969107#comment-16969107 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] [~ijuma] Please take a look at this issue, thanks. Our company encountered this problem in a production environment. > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > The timeindex get function is not thread safe, so it may create more than one > timeindex. > !image-2019-11-07-17-44-05-357.png! > When more than one timeindex is created, the mappedbytebuffer position may be moved to > the end; writing an index entry to that mmap file then causes > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shilin Lu updated KAFKA-9156: - Reviewer: Ismael Juma > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > The timeindex get function is not thread safe, so it may create more than one > timeindex. > !image-2019-11-07-17-44-05-357.png! > When more than one timeindex is created, the mappedbytebuffer position may be moved to > the end; writing an index entry to that mmap file then causes > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969090#comment-16969090 ] Tim Van Laer edited comment on KAFKA-8803 at 11/7/19 9:56 AM: -- I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with: {code} org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-00, topic=xyz.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?] Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId {code} It was automatically restarted every time, but it kept failing (even after stopping the whole group). Yesterday two brokers throw a UNKNOWN_LEADER_EPOCH error and after that, the client started to get into troubles. {code} [2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} {code} [2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} Meta: * Kafka Streams 2.3.1, * Broker: patched: 2.3.1 without KAFKA-8724 (see KAFKA-9133) I will give the {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers. was (Author: timvanlaer): I ran into the same issue. One stream instance (the one dealing with partition 52) kept failing with: {code} org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_52, processor=KSTREAM-SOURCE-00, topic=galactica.timeline-aligner.entries-internal.0, partition=52, offset=5151450, stacktrace=org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:380) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912) ~[timeline-aligner.jar:?] at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819) ~[timeline-aligner.jar:?] 
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788) ~[timeline-aligner.jar:?] Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6milliseconds while awaiting InitProducerId {code} It was automatically restarted every time, but it kept failing (even after stopping the whole group). Yesterday two brokers throw a UNKNOWN_LEADER_EPOCH error and after that, the client started to get into troubles. {code} [2019-11-06 11:53:42,499] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} {code} [2019-11-06 10:06:56,652] INFO [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Retrying leaderEpoch request for partition xyz.entries-internal.0-52 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) {code} Meta: * Kafka Streams 2.3.1, * Broker: patched: 2.3.1 without KAFKA-8724 (see KAFKA-9133) I will give the {{max.block.ms}} a shot, but we're first trying a rolling restart of the brokers. > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > -
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969161#comment-16969161 ] Tim Van Laer commented on KAFKA-8803: - I can confirm: a rolling restart of the brokers resolved the issue. It's hard to tell as our rolling restart is automated, but I have the impression only a restart of the group coordinator is enough to do the trick. Would that make sense? > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969161#comment-16969161 ] Tim Van Laer edited comment on KAFKA-8803 at 11/7/19 10:54 AM: --- I can confirm: a rolling restart of the brokers resolved the issue. It's hard to tell as our rolling restart is automated, but I have the impression a restart of the group coordinator is enough to do the trick. Would that make sense? was (Author: timvanlaer): I can confirm: a rolling restart of the brokers resolved the issue. It's hard to tell as our rolling restart is automated, but I have the impression only a restart of the group coordinator is enough to do the trick. Would that make sense? > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969309#comment-16969309 ] Mickael Maison commented on KAFKA-8933: --- We saw a similar exception in 2.4/2.5: {code:java} Nov 5 18:10:26 mirrormaker2-6c5bbc-jx85h mirrormaker2 ERROR [MirrorSourceConnector|task-0] Failure during poll. (org.apache.kafka.connect.mirror.MirrorSourceTask:159) java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1071) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:303) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1249) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) at org.apache.kafka.connect.mirror.MirrorSourceTask.poll(MirrorSourceTask.java:137) at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:259) at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:226) at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source){code} Prior to the exception, the client was disconnected. In NetworkClient.processDisconnection() several paths can lead to inProgressRequestVersion being set to null. If there's a MetadataRequest in flight to another node at the time, then the exception is hit when handling the response as inProgressRequestVersion is unboxed to an int. I was able to reproduce reliably with the following setup: - 2 brokers with SASL - consumer consuming from broker 0 - make broker 0 drop consumer connection - the consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Error sending fetch request (sessionId=735517, epoch=2) to node 0: {}. 
org.apache.kafka.common.errors.DisconnectException{code} - make broker 0 fail authentication - consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Failed authentication with localhost/127.0.0.1 (Authentication failed, invalid credentials) ERROR [main]: [Consumer clientId=consumer-2, groupId=null] Connection to node 0 (localhost/127.0.0.1:9093) failed authentication due to: Authentication failed, invalid credentials{code} - force metadata refresh by consumer, for example: consumer.partitionsFor("topic-that-does-not-exist"); - the consumer gets: {code:java} java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1073) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212) at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:368) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1930) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1898) at main.ConsumerTest2.main(ConsumerTest2.java:37) {code} I'm not super familair with this code path but the following patch helped: {code:java} diff --git clients/src/main/java/org/apache/kafka/clients/NetworkClient.java clients/src/main/java/org/apache/kafka/clients/NetworkClient.java index d782df865..d3119f132 100644 --- clients/src/main/java/org/apache/kafka/clients/NetworkClient.java +++ clients/src/main/java/org/apache/kafka/clients/NetworkClient.java @@ -1067,6 +1069,9 @@ public class NetworkClient implements KafkaClient { if (response.brokers().isEmpty()) { log.trace("Ignoring empty metadata response with correlation id {}.", requestHeader.correlationId()); this.metadata.failedUpdate(now, null); +} else if (inProgressRequestVersion == null) { +log.warn("Ignoring metadata resp
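The root cause described above, a boxed Integer that is cleared to null on disconnection and later auto-unboxed to int, is easy to demonstrate in isolation (the field name is illustrative, not the actual NetworkClient code):

{code:java}
public class UnboxingNpe {
    // Stand-in for the in-progress metadata request version, which is cleared on
    // disconnection while a metadata response from another node may still arrive.
    static Integer inProgressRequestVersion = null;

    public static void main(String[] args) {
        // Auto-unboxing a null Integer into an int throws NullPointerException,
        // matching the stack traces from handleCompletedMetadataResponse above.
        int version = inProgressRequestVersion;
        System.out.println(version);
    }
}
{code}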
[jira] [Updated] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mickael Maison updated KAFKA-8933: -- Affects Version/s: 2.4.0 > An unhandled SSL handshake exception in polling event - needed a retry logic > > > Key: KAFKA-8933 > URL: https://issues.apache.org/jira/browse/KAFKA-8933 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.2.1, 2.4.0 > Environment: software platform >Reporter: Remigius >Priority: Critical > > Already client is connected and during polling event, SSL handshake failure > happened. it led to leaving the co-ordinator. Even on SSL handshake failure > which was actually intermittent issue, polling should have some resilient and > retry the polling. Leaving group caused all instances of clients to drop and > left the messages in Kafka for long time until re-subscribe the kafka topic > manually. > > > {noformat} > 2019-09-06 04:03:09,016 ERROR [reactive-kafka-] > org.apache.kafka.clients.NetworkClient [Consumer clientId=aaa, groupId=bbb] > Connection to node 150 (host:port) failed authentication due to: SSL > handshake failed > 2019-09-06 04:03:09,021 ERROR [reactive-kafka-] > reactor.kafka.receiver.internals.DefaultKafkaReceiver Unexpected exception > java.lang.NullPointerException: null > at > org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1012) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:822) > ~[kafka-clients-2.2.1.jar!/:?] > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) > ~[kafka-clients-2.2.1.jar!/:?] 
> at > reactor.kafka.receiver.internals.DefaultKafkaReceiver$PollEvent.run(DefaultKafkaReceiver.java:470) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.doEvent(DefaultKafkaReceiver.java:401) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.lambda$start$14(DefaultKafkaReceiver.java:335) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:130) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:398) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:484) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.kafka.receiver.internals.KafkaSchedulers$EventScheduler.lambda$decorate$1(KafkaSchedulers.java:100) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > org.springframework.cloud.sleuth.instrument.async.TraceCallable.call(TraceCallable.java:70) > ~[spring-cloud-sleuth-core-2.1.1.RELEASE.jar!/:2.1.1.RELEASE] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2019-09-06 04:03:09,023 INFO [reactive-kafka-] > org.apache.kafka.clients.consumer.internals.AbstractCoordinator [Consumer > clientId=aaa, groupId=bbb] Member x_13-081e61ec-1509-4e0e-819e-58063d1ce8f6 > sending LeaveGroup request to coordinator{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969309#comment-16969309 ] Mickael Maison edited comment on KAFKA-8933 at 11/7/19 2:37 PM: We saw a similar exception in 2.4/2.5: {code:java} Nov 5 18:10:26 mirrormaker2-6c5bbc-jx85h mirrormaker2 ERROR [MirrorSourceConnector|task-0] Failure during poll. (org.apache.kafka.connect.mirror.MirrorSourceTask:159) java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1071) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:303) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1249) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211) at org.apache.kafka.connect.mirror.MirrorSourceTask.poll(MirrorSourceTask.java:137) at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:259) at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:226) at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177) at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source){code} Prior to the exception, the client was disconnected. In NetworkClient.processDisconnection() several paths can lead to inProgressRequestVersion being set to null. If there's a MetadataRequest in flight to another node at the time, then the exception is hit when handling the response as inProgressRequestVersion is unboxed to an int. I was able to reproduce reliably with the following setup: - 2 brokers with SASL - consumer consuming from broker 0 - make broker 0 drop consumer connection - the consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Error sending fetch request (sessionId=735517, epoch=2) to node 0: {}. 
org.apache.kafka.common.errors.DisconnectException{code} - make broker 0 fail authentication - consumer gets: {code:java} INFO [main]: [Consumer clientId=consumer-2, groupId=null] Failed authentication with localhost/127.0.0.1 (Authentication failed, invalid credentials) ERROR [main]: [Consumer clientId=consumer-2, groupId=null] Connection to node 0 (localhost/127.0.0.1:9093) failed authentication due to: Authentication failed, invalid credentials{code} - force metadata refresh by consumer, for example: consumer.partitionsFor("topic-that-does-not-exist"); - the consumer gets: {code:java} java.lang.NullPointerException at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1073) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:847) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212) at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:368) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1930) at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1898) at main.ConsumerTest2.main(ConsumerTest2.java:37) {code} I'm not super familair with this code path but the following patch helped: {code:java} diff --git clients/src/main/java/org/apache/kafka/clients/NetworkClient.java clients/src/main/java/org/apache/kafka/clients/NetworkClient.java index d782df865..d3119f132 100644 --- clients/src/main/java/org/apache/kafka/clients/NetworkClient.java +++ clients/src/main/java/org/apache/kafka/clients/NetworkClient.java @@ -1067,6 +1069,9 @@ public class NetworkClient implements KafkaClient { if (response.brokers().isEmpty()) { log.trace("Ignoring empty metadata response with correlation id {}.", requestHeader.correlationId()); this.metadata.failedUpdate(now, null); +} else if (inProgressRequestVersion == n
[jira] [Commented] (KAFKA-7016) Reconsider the "avoid the expensive and useless stack trace for api exceptions" practice
[ https://issues.apache.org/jira/browse/KAFKA-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969380#comment-16969380 ] Viliam Durina commented on KAFKA-7016: -- I add my vote to this: the exceptions are much less useful. Exceptions cost nothing if they're not thrown, which is the normal case. If exceptions are used internally in normal execution, that is what should be avoided. I have complex code and don't know where in my code the problem happened. I ended up wrapping each kafka call individually with this to find out: {{handleExc(() -> /* the kafka call here */);}} {{private static void handleExc(RunnableExc action) {}} {{ try {}} {{ action.run();}} {{ } catch (Exception e) {}} {{ throw new RuntimeException(e);}} {{ }}} {{}}} Very annoying. > Reconsider the "avoid the expensive and useless stack trace for api > exceptions" practice > > > Key: KAFKA-7016 > URL: https://issues.apache.org/jira/browse/KAFKA-7016 > Project: Kafka > Issue Type: Bug >Reporter: Martin Vysny >Priority: Major > > I am trying to write a Kafka Consumer; upon running it only prints out: > {{ org.apache.kafka.common.errors.InvalidGroupIdException: The configured > groupId is invalid}} > Note that the stack trace is missing, so I have no information about which > part of my code is bad and needs fixing; I also have no information about which > Kafka Client method has been called. Upon closer examination I found this in > ApiException: > > {{/* avoid the expensive and useless stack trace for api exceptions */}} > {{@Override}} > {{public Throwable fillInStackTrace() {}} > {{ return this;}} > {{}}} > > I think it is a bad practice to hide all useful debugging info and trade it > for dubious performance gains. Exceptions are for exceptional code flow, which > is allowed to be slow. > > This applies to kafka-clients 1.1.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
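For completeness, a self-contained version of the wrapping trick from the comment above; the {{RunnableExc}} functional interface is hypothetical (the comment does not show its definition), and the wrapped call is a stand-in for a real Kafka client call:

{code:java}
public class WrapKafkaCalls {
    // Hypothetical functional interface assumed by handleExc in the comment above.
    @FunctionalInterface
    interface RunnableExc {
        void run() throws Exception;
    }

    static void handleExc(RunnableExc action) {
        try {
            action.run();
        } catch (Exception e) {
            // Wrapping creates a new exception whose stack trace points at the calling
            // site, which the ApiException#fillInStackTrace() override suppresses.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a Kafka client call that may throw an ApiException without a trace.
        handleExc(() -> {
            throw new Exception("simulated client error");
        });
    }
}
{code}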
[jira] [Commented] (KAFKA-9044) Brokers occasionally (randomly?) dropping out of clusters
[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969412#comment-16969412 ] Peter Bukowinski commented on KAFKA-9044: - I don't see any log entries like that. When broker 29 became the controller, this is all I see. Do I need to modify my log4j properties to see the sessionid [2019-10-29 18:10:03,135] INFO [Controller id=29] 29 successfully elected as the controller. Epoch incremented to 17 and epoch zk version is now 16 (kafka.controller.KafkaController) [2019-10-29 18:10:03,135] INFO [Controller id=29] Registering handlers (kafka.controller.KafkaController) [2019-10-29 18:10:03,138] INFO [Controller id=29] Deleting log dir event notifications (kafka.controller.KafkaController) [2019-10-29 18:10:03,139] INFO [Controller id=29] Deleting isr change notifications (kafka.controller.KafkaController) [2019-10-29 18:10:03,140] INFO [Controller id=29] Initializing controller context (kafka.controller.KafkaController) [2019-10-29 18:10:03,164] INFO [Controller id=29] Initialized broker epochs cache: Map(5 -> 42985827819, 10 -> 42985831769, 24 -> 42985842811, 25 -> 42985843630, 14 -> 4298583480, 20 -> 42985839540, 29 -> 42985846752, 6 -> 42985828664, 28 -> 42985845998, 21 -> 42985840362, 9 -> 42985830988, 13 -> 42985834055, 2 -> 42985825412, 17 -> 42985837193, 22 -> 42985841177, 27 -> 42985845228, 12 -> 42985833303, 7 -> 42985829425, 3 -> 42985826221, 18 -> 42985837953, 16 -> 42985836344, 11 -> 42985832529, 26 -> 42985844473, 23 -> 42985842007, 8 -> 42985830214, 30 -> 42985847510, 19 -> 42985838744, 4 -> 42985827024, 15 -> 42985835551) (kafka.controller.KafkaController) [2019-10-29 18:10:03,184] DEBUG [Controller id=29] Register BrokerModifications handler for Set(5, 10, 24, 25, 14, 20, 29, 6, 28, 21, 9, 13, 2, 17, 22, 27, 12, 7, 3, 18, 16, 11, 26, 23, 8, 30, 19, 4, 15) (kafka.controller.KafkaController) [2019-10-29 18:10:03,329] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 18 (kafka.controller.ControllerChannelManager) [2019-10-29 18:10:03,333] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 27 (kafka.controller.ControllerChannelManager) [2019-10-29 18:10:03,334] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 19 (kafka.controller.ControllerChannelManager) [2019-10-29 18:10:03,335] DEBUG [Channel manager on controller 29]: Controller 29 trying to connect to broker 17 (kafka.controller.ControllerChannelManager) This is the controller znode for the cluster: {"version":1,"brokerid":29,"timestamp":"1572397803132"} cZxid = 0xa0227fdba ctime = Tue Oct 29 18:10:03 PDT 2019 mZxid = 0xa0227fdba mtime = Tue Oct 29 18:10:03 PDT 2019 pZxid = 0xa0227fdba cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x36b49768579d9d5 dataLength = 55 numChildren = 0 I tried grepping the transaction logs for each of the hex values but came up empty. > Brokers occasionally (randomly?) dropping out of clusters > - > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 >Reporter: Peter Bukowinski >Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. 
When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 360ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/01975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/
[jira] [Created] (KAFKA-9157) logcleaner could generate empty segment files after cleaning
Jun Rao created KAFKA-9157: -- Summary: logcleaner could generate empty segment files after cleaning Key: KAFKA-9157 URL: https://issues.apache.org/jira/browse/KAFKA-9157 Project: Kafka Issue Type: Improvement Components: core Affects Versions: 2.3.0 Reporter: Jun Rao Currently, the log cleaner can only combine segments within a 2-billion offset range. If all records in that range are deleted, an empty segment could be generated. It would be useful to avoid generating such empty segments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9044) Brokers occasionally (randomly?) dropping out of clusters
[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969452#comment-16969452 ] Jun Rao commented on KAFKA-9044: The ZK session is typically created when the broker starts, not when it becomes the controller. Alternatively, the ZK server may log the client IP address for a ZK session when it's created. If you can find out that, you can see if the IP matches the broker IP. > Brokers occasionally (randomly?) dropping out of clusters > - > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 >Reporter: Peter Bukowinski >Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 360ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/01975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/kl/internal_test-52/01975332.index.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,958] INFO Deleted time index > /data/3/kl/internal_test-52/01975332.timeindex.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:42:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offse
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969488#comment-16969488 ] Boyang Chen commented on KAFKA-8803: You mean [~timvanlaer] transaction coordinator right? Group coordinator is handling the consumer related logic > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9044) Brokers occasionally (randomly?) dropping out of clusters
[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969517#comment-16969517 ] Peter Bukowinski commented on KAFKA-9044: - My broker log startup shows these lines with no sessionid. [2019-11-07 09:01:07,626] INFO starting (kafka.server.KafkaServer) [2019-11-07 09:01:07,626] INFO Connecting to zookeeper on zk-a0001.example.com:2181,zk-a0002.example.com:2181,zk-a0003.example.com:2181/kafka (kafka.server.KafkaServer) [2019-11-07 09:01:07,650] INFO [ZooKeeperClient Kafka server] Initializing a new session to zk-a0001.example.com:2181,zk-a0002.example.com:2181,zk-a0003.example.com:2181. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,675] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,699] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,784] INFO Created zookeeper path /kafka (kafka.server.KafkaServer) [2019-11-07 09:01:07,785] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,793] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,794] INFO [ZooKeeperClient Kafka server] Initializing a new session to zk-a0001.example.com:2181,zk-a0002.example.com:2181,zk-a0003.example.com:2181/kafka. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,794] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:07,801] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) [2019-11-07 09:01:08,035] INFO Cluster ID = KQW1r4YbRK-8LzVUdbKnZg (kafka.server.KafkaServer) > Brokers occasionally (randomly?) dropping out of clusters > - > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 >Reporter: Peter Bukowinski >Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 360ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/01975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/kl/internal_test-52/01975332.index.deleted. 
> (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,958] INFO Deleted time index > /data/3/kl/internal_test-52/01975332.timeindex.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinat
[jira] [Commented] (KAFKA-9141) Global state update error: missing recovery or wrong log message
[ https://issues.apache.org/jira/browse/KAFKA-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969529#comment-16969529 ] Chris Toomey commented on KAFKA-9141: - Here's the log message and stack trace. It comes from [this line|https://github.com/apache/kafka/blob/2.4/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStreamThread.java#L248] in {{GlobalStreamThread}}. {code:java} 2019-10-31 17:12:15.894 - - ERROR o.a.k.s.p.i.GlobalStreamThread$StateConsumer service-platformAPI-revokedJwtsCache-91814a54-b18b-47a8-9d35-926e6abcb5f6-GlobalStreamThread - global-stream-thread [service-platformAPI-revokedJwtsCache-91814a54-b18b-47a8-9d35-926e6abcb5f6-GlobalStreamThread] Updating global state failed. You can restart KafkaStreams to recover from this error. org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {livongo.public.entity.system.auth.jwt.revoked-0=15579} at org.apache.kafka.clients.consumer.internals.Fetcher.initializePartitionRecords(Fetcher.java:1266) at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:605) at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1283) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1214) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1190) at org.apache.kafka.streams.processor.internals.GlobalStreamThread$StateConsumer.pollAndUpdate(GlobalStreamThread.java:239) at org.apache.kafka.streams.processor.internals.GlobalStreamThread.run(GlobalStreamThread.java:290) {code} The context in which this occurred was switching the application to point to a different kafka broker in a development environment, which caused the offset problem. Several of us on my team got this error and took the message "You can restart KafkaStreams to recover from this error" to mean "restarting KafkaStreams will fix this error", which it doesn't, hence this ticket. How could an application detect and self-recover by calling {{cleanUp()}} from this, given that it happens asynchronously on the global stream thread? > Global state update error: missing recovery or wrong log message > > > Key: KAFKA-9141 > URL: https://issues.apache.org/jira/browse/KAFKA-9141 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.3.0 >Reporter: Chris Toomey >Priority: Major > > I'm getting an {{OffsetOutOfRangeException}} accompanied by the log message > "Updating global state failed. You can restart KafkaStreams to recover from > this error." But I've restarted the app several times and it's not > recovering, it keeps failing the same way. > > I see there's a {{cleanUp()}} method on {{KafkaStreams}} that looks like it's > what's needed, but it's not called anywhere in the streams source code. So > either that's a bug and the call should be added to do the recovery, or the > log message is wrong and should be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
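One possible application-side shape for the self-recovery question above, offered only as a minimal sketch: it assumes the application can rebuild its own Topology and Properties (buildTopology()/buildConfig() are hypothetical placeholders) and that the global thread's death surfaces through the instance-level uncaught exception handler with OffsetOutOfRangeException somewhere in the cause chain, which is worth verifying before relying on it. Since a closed KafkaStreams instance cannot be restarted, recovery here means closing, calling cleanUp(), and constructing a fresh instance.

{code:java}
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.Topology;

import java.util.Properties;

public class SelfHealingStreams {

    // Hypothetical helpers: the real application would return its actual topology
    // and a fully populated config (application.id, bootstrap.servers, ...).
    static Topology buildTopology() { return new Topology(); }
    static Properties buildConfig() { return new Properties(); }

    public static void main(String[] args) {
        startNewInstance();
    }

    static void startNewInstance() {
        final KafkaStreams streams = new KafkaStreams(buildTopology(), buildConfig());
        streams.setUncaughtExceptionHandler((thread, error) -> {
            // Walk the cause chain; the global stream thread may wrap the consumer error.
            for (Throwable t = error; t != null; t = t.getCause()) {
                if (t instanceof OffsetOutOfRangeException) {
                    // close()/cleanUp() must not run on the dying stream thread itself,
                    // so hand the recovery off to a separate thread.
                    new Thread(() -> {
                        streams.close();
                        streams.cleanUp();   // legal only while the instance is not running
                        startNewInstance();  // a closed KafkaStreams cannot be restarted
                    }).start();
                    return;
                }
            }
        });
        streams.start();
    }
}
{code}

Whether wiping local state and rebootstrapping the global store is acceptable depends on the application, so this is a sketch of one recovery strategy, not a general answer to the asynchronous-failure question.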
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969532#comment-16969532 ] Tim Van Laer commented on KAFKA-8803: - [~bchen225242] good point, I guess so. Any way to find the transaction coordinator broker for this particular consumer? > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
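On locating the transaction coordinator: a hedged sketch of the usual mapping, under the assumptions that the coordinator for a transactional.id is the current leader of the __transaction_state partition selected by hashing the id (analogous to the group coordinator and __consumer_offsets), that transaction.state.log.num.partitions is at its default of 50, and that Streams builds its transactional.id as "<application.id>-<taskId>". These are assumptions to verify against the broker config and logs, not an official lookup API.

{code:java}
public class TxnCoordinatorLookup {

    // Mirrors the masking Kafka's Utils.abs applies, so Integer.MIN_VALUE hash codes are safe.
    static int transactionStatePartitionFor(String transactionalId, int numTransactionStatePartitions) {
        return (transactionalId.hashCode() & 0x7fffffff) % numTransactionStatePartitions;
    }

    public static void main(String[] args) {
        // Assumed transactional.id format for the stuck Streams task 0_36.
        String transactionalId = "my-streams-app-0_36";
        int partition = transactionStatePartitionFor(transactionalId, 50);
        // The coordinator should be the current leader of __transaction_state-<partition>,
        // which can be read off `kafka-topics.sh --describe --topic __transaction_state`.
        System.out.println("__transaction_state partition: " + partition);
    }
}
{code}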
[jira] [Commented] (KAFKA-9133) LogCleaner thread dies with: currentLog cannot be empty on an unexpected exception
[ https://issues.apache.org/jira/browse/KAFKA-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969589#comment-16969589 ] ASF GitHub Bot commented on KAFKA-9133: --- hachikuji commented on pull request #7662: KAFKA-9133; Cleaner should handle log start offset larger than active segment base offset URL: https://github.com/apache/kafka/pull/7662 This was a regression in 2.3.1. In the case of a DeleteRecords call, the log start offset may be higher than the active segment base offset. The cleaner should allow for this case gracefully. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > LogCleaner thread dies with: currentLog cannot be empty on an unexpected > exception > -- > > Key: KAFKA-9133 > URL: https://issues.apache.org/jira/browse/KAFKA-9133 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.3.1 >Reporter: Karolis Pocius >Assignee: Jason Gustafson >Priority: Blocker > Fix For: 2.4.0, 2.3.2 > > > Log cleaner thread dies without a clear reference to which log is causing it: > {code} > [2019-11-02 11:59:59,078] INFO Starting the log cleaner (kafka.log.LogCleaner) > [2019-11-02 11:59:59,144] INFO [kafka-log-cleaner-thread-0]: Starting > (kafka.log.LogCleaner) > [2019-11-02 11:59:59,199] ERROR [kafka-log-cleaner-thread-0]: Error due to > (kafka.log.LogCleaner) > java.lang.IllegalStateException: currentLog cannot be empty on an unexpected > exception > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:346) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:307) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89) > Caused by: java.lang.IllegalArgumentException: Illegal request for non-active > segments beginning at offset 5033130, which is larger than the active > segment's base offset 5019648 > at kafka.log.Log.nonActiveLogSegmentsFrom(Log.scala:1933) > at > kafka.log.LogCleanerManager$.maxCompactionDelay(LogCleanerManager.scala:491) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$4(LogCleanerManager.scala:184) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$1(LogCleanerManager.scala:181) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at > kafka.log.LogCleanerManager.grabFilthiestCompactedLog(LogCleanerManager.scala:171) > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:321) > ... 2 more > [2019-11-02 11:59:59,200] INFO [kafka-log-cleaner-thread-0]: Stopped > (kafka.log.LogCleaner) > {code} > If I try to ressurect it by dynamically bumping {{log.cleaner.threads}} it > instantly dies with the exact same error. > Not sure if this is something KAFKA-8725 is supposed to address. -- This message was sent by Atlassian Jira (v8.3.4#803005)
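The fix itself lives in the Scala log cleaner, but the guard the PR description calls for can be illustrated in a few lines. This is only an illustration of the idea, not the code from PR #7662.

{code:java}
// After a DeleteRecords request, logStartOffset may have advanced past the active
// segment's base offset. The cleaner must clamp the dirty range instead of asking
// for non-active segments that begin beyond the active segment, which is what
// raised the IllegalArgumentException shown in the ticket above.
public final class CleanableRange {
    static long firstDirtyOffset(long checkpointedCleanOffset,
                                 long logStartOffset,
                                 long activeSegmentBaseOffset) {
        long candidate = Math.max(checkpointedCleanOffset, logStartOffset);
        return Math.min(candidate, activeSegmentBaseOffset);
    }
}
{code}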
[jira] [Resolved] (KAFKA-9101) Create a fetch.max.bytes configuration for the broker
[ https://issues.apache.org/jira/browse/KAFKA-9101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin McCabe resolved KAFKA-9101. - Resolution: Fixed https://github.com/apache/kafka/pull/7595 > Create a fetch.max.bytes configuration for the broker > - > > Key: KAFKA-9101 > URL: https://issues.apache.org/jira/browse/KAFKA-9101 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9158) producer fetch metadata until buffer is full
Brandon Jiang created KAFKA-9158: Summary: producer fetch metadata until buffer is full Key: KAFKA-9158 URL: https://issues.apache.org/jira/browse/KAFKA-9158 Project: Kafka Issue Type: Improvement Components: producer Affects Versions: 2.2.1 Reporter: Brandon Jiang Currently, based on my understanding of the KafkaProducer.doSend() method, when the Kafka Java client is invoked with {code:java} org.apache.kafka.clients.producer.Producer.send(message);{code} the current behavior is to fetch metadata first and only then queue the message for sending. This can cause the producer to hang for up to max.block.ms when the Kafka brokers are down. Is it possible to change the Producer.send() behavior to queue all messages first and only fetch metadata asynchronously when the queue is full? e.g. {code:java} if (msgQueue.batchIsFull) { clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs); this.sender.wakeup(); }{code} I am happy to make this change, but before doing that I just want to check the current design logic behind producer.send(). I have seen there is a lot of discussion about the producer send() behavior, e.g. https://issues.apache.org/jira/browse/KAFKA-2948 and https://issues.apache.org/jira/browse/KAFKA-5369, so I am trying to make sure I did not misunderstand the current behavior here. -- This message was sent by Atlassian Jira (v8.3.4#803005)
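For readers reproducing the behavior described above: today the blocking happens inside send() while it waits for metadata, bounded by max.block.ms. A minimal sketch, with placeholder broker address and topic name:

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class BoundedBlockingSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Bound how long send() may block while waiting for metadata; the default is 60000 ms.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                // With the brokers down, this call blocks up to max.block.ms fetching
                // metadata; the resulting TimeoutException surfaces either synchronously
                // or through the returned future, depending on the client version.
                Future<RecordMetadata> result =
                        producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
                result.get();
            } catch (KafkaException | ExecutionException e) {
                System.err.println("Send did not complete: " + e);
            }
        }
    }
}
{code}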
[jira] [Created] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
zhangzhanchang created KAFKA-9159: - Summary: Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change Key: KAFKA-9159 URL: https://issues.apache.org/jira/browse/KAFKA-9159 Project: Kafka Issue Type: Bug Components: consumer Affects Versions: 2.3.0, 2.0.0, 0.10.2.0 Reporter: zhangzhanchang Case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a broker is killed with kill -9, but after a leader change the same loop runs without problems. Case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a broker the same loop runs without problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
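Until the client behaves consistently across versions, the application-side workaround is a bounded retry around the call. A minimal hedged sketch, with the partition supplied by the caller:

{code:java}
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.TimeoutException;

import java.util.Collections;
import java.util.Map;

public class EndOffsetsWithRetry {

    // Retries endOffsets so that a transient leader change or broker failure does not
    // immediately surface as "Failed to get offsets by times in 30000ms" to the caller.
    static Map<TopicPartition, Long> endOffsets(Consumer<?, ?> consumer, TopicPartition tp, int maxAttempts) {
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return consumer.endOffsets(Collections.singleton(tp));
            } catch (TimeoutException e) {
                last = e; // metadata may still point at the old leader; the next call refreshes it
            }
        }
        throw last;
    }
}
{code}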
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969720#comment-16969720 ] Guozhang Wang commented on KAFKA-9156: -- [~lushilin] Thanks for reporting this ticket, and I think it is indeed an issue: although Option is immutable we are trying to reassign the var. I think we can fix it by replacing the `var Option` with a `val AtomicReference` in which we can rely on `compareAndSet` to set the values atomically. Would you like to provide a patch for it? > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
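For readers following along, the suggested `val AtomicReference` + `compareAndSet` shape looks roughly like the generic Java sketch below. This is illustrative only: the real classes are the Scala LazyTimeIndex/LazyOffsetIndex, and because constructing a TimeIndex mmaps a file, a thread that loses the CAS would additionally have to dispose of the instance it built.

{code:java}
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Generic lazy holder: compareAndSet guarantees that exactly one instance is ever
// published, no matter how many threads race on the first access.
public final class LazyRef<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();
    private final Supplier<T> factory;

    public LazyRef(Supplier<T> factory) {
        this.factory = factory;
    }

    public T get() {
        T existing = ref.get();
        if (existing != null) {
            return existing;
        }
        T created = factory.get();            // may run in several threads concurrently
        if (ref.compareAndSet(null, created)) {
            return created;                    // this thread's instance won the race
        }
        // Another thread published first; its instance is the one everybody uses.
        // (For a resource like a TimeIndex, 'created' would need to be closed here.)
        return ref.get();
    }
}
{code}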
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969747#comment-16969747 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] thanks for reply. i will fix it and provide a patch for this issue at this weekend.before this, i need to learn the process of submitting a patch. > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- External issue URL: https://github.com/apache/kafka/pull/7660 > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0, 2.3.0 >Reporter: zhangzhanchang >Priority: Major > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- External issue URL: (was: https://github.com/apache/kafka/pull/7660) > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0, 2.3.0 >Reporter: zhangzhanchang >Priority: Major > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Attachment: image-2019-11-08-10-28-19-881.png image-2019-11-08-10-29-06-282.png Description: case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after kill -9 broker,but a leader change ,loop call Consumer.endOffsets no problem !image-2019-11-08-10-28-19-881.png! case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after a leader change,but kill -9 broker,loop call Consumer.endOffsets no problem !image-2019-11-08-10-29-06-282.png! was: case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after kill -9 broker,but a leader change ,loop call Consumer.endOffsets no problem case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 3ms after a leader change,but kill -9 broker,loop call Consumer.endOffsets no problem > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0, 2.3.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Reviewer: Ismael Juma (was: Guozhang Wang) Affects Version/s: (was: 2.3.0) > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969765#comment-16969765 ] zhangzhanchang commented on KAFKA-9159: --- The proposed change is to use if (metadata.updateRequested() || (future.failed() && future.exception() instanceof InvalidMetadataException)), which is compatible with both cases. > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Reviewer: Guozhang Wang (was: Ismael Juma) > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 3ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: 0.10.2 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after kill -9 broker,but a leader > change ,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-28-19-881.png! > case 2: 2.0 version loop call Consumer.endOffsets Throw TimeoutException: > Failed to get offsets by times in 3ms after a leader change,but kill -9 > broker,loop call Consumer.endOffsets no problem > !image-2019-11-08-10-29-06-282.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9138) Add system test covering Foreign Key joins (KIP-213)
[ https://issues.apache.org/jira/browse/KAFKA-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969772#comment-16969772 ] ASF GitHub Bot commented on KAFKA-9138: --- vvcephei commented on pull request #7664: KAFKA-9138: Add system test for relational joins URL: https://github.com/apache/kafka/pull/7664 Adds a system test to verify the new foreign-key join introduced in KIP-213. Also, adds a unit test and integration test to verify the test logic itself. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add system test covering Foreign Key joins (KIP-213) > > > Key: KAFKA-9138 > URL: https://issues.apache.org/jira/browse/KAFKA-9138 > Project: Kafka > Issue Type: Test > Components: streams >Reporter: John Roesler >Assignee: John Roesler >Priority: Major > > There are unit and integration tests, but we should really have a system test > as well. > I plan to create a new test, since this feature is pretty different than the > existing topology/data set of smoke test. Although, it might be possible for > the new test to subsume smoke test. I'd give the new test a few releases to > burn in before considering a merge, though. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9148) Consider forking RocksDB for Streams
[ https://issues.apache.org/jira/browse/KAFKA-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969778#comment-16969778 ] Sophie Blee-Goldman commented on KAFKA-9148: I guess we'd still have to make sure Flink doesn't pull a rocksdb and change any public APIs out from under the users, or allow the APIs to significantly diverge unless we want to go down a path of maintaining entirely distinct RocksDbStateStore and FRocksDbStateStore classes (rather than just swapping out the engine underneath based on the user choice) – in the end it seems like we can't avoid _some_ amount of extra work, so I guess it's a matter of betting on how much it will be and how much can we take on > Consider forking RocksDB for Streams > - > > Key: KAFKA-9148 > URL: https://issues.apache.org/jira/browse/KAFKA-9148 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > > We recently upgraded our RocksDB dependency to 5.18 for its memory-management > abilities (namely the WriteBufferManager, see KAFKA-8215). Unfortunately, > someone from Flink recently discovered a ~8% [performance > regression|https://github.com/facebook/rocksdb/issues/5774] that exists in > all versions 5.18+ (up through the current newest version, 6.2.2). Flink was > able to react to this by downgrading to 5.17 and [picking the > WriteBufferManage|https://github.com/dataArtisans/frocksdb/pull/4]r to their > fork (fRocksDB). > Due to this and other reasons enumerated below, we should consider also > forking our own RocksDB for Streams. > Pros: > * We can avoid passing sudden breaking changes on to our users, such removal > of methods with no deprecation period (see discussion on KAFKA-8897) > * We can pick whichever version has the best performance for our needs, and > pick over any new features, metrics, etc that we need to use rather than > being forced to upgrade (and breaking user code, introducing regression, etc) > * The Java API seems to be a very low priority to the rocksdb folks. > ** They leave out critical functionality, features, and configuration > options that have been in the c++ API for a very long time > ** Those that do make it over often have random gaps in the API such as > setters but no getters (see [rocksdb PR > #5186|https://github.com/facebook/rocksdb/pull/5186]) > ** Others are poorly designed and require too many trips across the JNI, > making otherwise incredibly useful features prohibitively expensive. > *** [Custom comparator|#issuecomment-83145980]]: a custom comparator could > significantly improve the performance of session windows. This is trivial to > do but given the high performance cost of crossing the jni, it is currently > only practical to use a c++ comparator > *** [Prefix Seek|https://github.com/facebook/rocksdb/issues/6004]: not > currently used by Streams but a commonly requested feature, and may also > allow improved range queries > ** Even when an external contributor develops a solution for poorly > performing Java functionality and helpfully tries to contribute their patch > back to rocksdb, it gets ignored by the rocksdb people ([rocksdb PR > #2283|https://github.com/facebook/rocksdb/pull/2283]) > Cons: > * more work > Given that we rarely upgrade the Rocks dependency, use only some fraction of > its features, and would need or want to make only minimal changes ourselves, > it seems like we could actually get away with very little extra work by > forking rocksdb. 
Note that as of this writing the frocksdb repo has only > needed to open 5 PRs on top of the actual rocksdb (two of them trivial). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9148) Consider forking RocksDB for Streams
[ https://issues.apache.org/jira/browse/KAFKA-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969779#comment-16969779 ] Sophie Blee-Goldman commented on KAFKA-9148: FWIW I actually do think we could gain a lot by forking rocksdb just for the custom comparator alone, which would mean practically no extra work as there's no reason that should conflict with anything else in rocks > Consider forking RocksDB for Streams > - > > Key: KAFKA-9148 > URL: https://issues.apache.org/jira/browse/KAFKA-9148 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > > We recently upgraded our RocksDB dependency to 5.18 for its memory-management > abilities (namely the WriteBufferManager, see KAFKA-8215). Unfortunately, > someone from Flink recently discovered a ~8% [performance > regression|https://github.com/facebook/rocksdb/issues/5774] that exists in > all versions 5.18+ (up through the current newest version, 6.2.2). Flink was > able to react to this by downgrading to 5.17 and [picking the > WriteBufferManage|https://github.com/dataArtisans/frocksdb/pull/4]r to their > fork (fRocksDB). > Due to this and other reasons enumerated below, we should consider also > forking our own RocksDB for Streams. > Pros: > * We can avoid passing sudden breaking changes on to our users, such removal > of methods with no deprecation period (see discussion on KAFKA-8897) > * We can pick whichever version has the best performance for our needs, and > pick over any new features, metrics, etc that we need to use rather than > being forced to upgrade (and breaking user code, introducing regression, etc) > * The Java API seems to be a very low priority to the rocksdb folks. > ** They leave out critical functionality, features, and configuration > options that have been in the c++ API for a very long time > ** Those that do make it over often have random gaps in the API such as > setters but no getters (see [rocksdb PR > #5186|https://github.com/facebook/rocksdb/pull/5186]) > ** Others are poorly designed and require too many trips across the JNI, > making otherwise incredibly useful features prohibitively expensive. > *** [Custom comparator|#issuecomment-83145980]]: a custom comparator could > significantly improve the performance of session windows. This is trivial to > do but given the high performance cost of crossing the jni, it is currently > only practical to use a c++ comparator > *** [Prefix Seek|https://github.com/facebook/rocksdb/issues/6004]: not > currently used by Streams but a commonly requested feature, and may also > allow improved range queries > ** Even when an external contributor develops a solution for poorly > performing Java functionality and helpfully tries to contribute their patch > back to rocksdb, it gets ignored by the rocksdb people ([rocksdb PR > #2283|https://github.com/facebook/rocksdb/pull/2283]) > Cons: > * more work > Given that we rarely upgrade the Rocks dependency, use only some fraction of > its features, and would need or want to make only minimal changes ourselves, > it seems like we could actually get away with very little extra work by > forking rocksdb. Note that as of this writing the frocksdb repo has only > needed to open 5 PRs on top of the actual rocksdb (two of them trivial). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder {code} def get: TimeIndex = \{ if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 3:24 AM: --- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} was (Author: lushilin): [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder {code} def get: TimeIndex = \{ if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 3:27 AM: --- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instabce.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} was (Author: lushilin): [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code, if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969794#comment-16969794 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 3:27 AM: --- [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instance.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} was (Author: lushilin): [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instabce.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code} > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shilin Lu updated KAFKA-9156: - Comment: was deleted (was: [~guozhang] as you suggest use val AtomicReference in this case ,i think it can work.but i want to fix this bug use double check lock.this is the code because in this case i think it is same as new a singleton instance.if you think it has some problem ,i will modify it. {code:java} // code placeholder def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get }{code}) > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9160) Use a common stateMachine to make the state transfer simpler and clearer.
maobaolong created KAFKA-9160: - Summary: Use a common stateMachine to make the state transfer simpler and clearer. Key: KAFKA-9160 URL: https://issues.apache.org/jira/browse/KAFKA-9160 Project: Kafka Issue Type: Improvement Components: controller Affects Versions: 2.5.0 Reporter: maobaolong We can use the repo https://github.com/maobaolong/statemachine_demo as a reference. -- This message was sent by Atlassian Jira (v8.3.4#803005)
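To make the proposal concrete, a common transition-table state machine could look roughly like the sketch below. The state names and edges are made up for illustration; they are not taken from the controller code or from the linked repo.

{code:java}
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum DemoState { NEW, RUNNING, STOPPED, CLOSED }

// A single reusable transition table: every component declares its valid edges once,
// and every transition goes through the same validation and error message.
final class StateMachine {
    private final Map<DemoState, Set<DemoState>> validTransitions = Map.of(
            DemoState.NEW,     EnumSet.of(DemoState.RUNNING),
            DemoState.RUNNING, EnumSet.of(DemoState.STOPPED),
            DemoState.STOPPED, EnumSet.of(DemoState.RUNNING, DemoState.CLOSED),
            DemoState.CLOSED,  EnumSet.noneOf(DemoState.class));

    private DemoState current = DemoState.NEW;

    synchronized void transitionTo(DemoState target) {
        if (!validTransitions.get(current).contains(target)) {
            throw new IllegalStateException("Invalid transition " + current + " -> " + target);
        }
        current = target;
    }

    synchronized DemoState current() {
        return current;
    }
}
{code}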
[jira] [Commented] (KAFKA-6747) kafka-streams Invalid transition attempted from state READY to state ABORTING_TRANSACTION
[ https://issues.apache.org/jira/browse/KAFKA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969814#comment-16969814 ] Leon commented on KAFKA-6747: - What is the valid transition? I got Invalid transition attempted from state COMMITTING_TRANSACTION to state ABORTING_TRANSACTION, when the brokers are down. The code flow is to abort when commit throws exception. > kafka-streams Invalid transition attempted from state READY to state > ABORTING_TRANSACTION > - > > Key: KAFKA-6747 > URL: https://issues.apache.org/jira/browse/KAFKA-6747 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.1.0 >Reporter: Frederic Arno >Assignee: Zhihong Yu >Priority: Major > Fix For: 0.11.0.3, 1.0.2, 1.1.1, 2.0.0 > > > [~frederica] running tests against kafka-streams 1.1 and get the following > stack trace (everything was working alright using kafka-streams 1.0): > {code} > ERROR org.apache.kafka.streams.processor.internals.AssignedStreamsTasks - > stream-thread [feedBuilder-XXX-StreamThread-4] Failed to close stream task, > 0_2 > org.apache.kafka.common.KafkaException: TransactionalId feedBuilder-0_2: > Invalid transition attempted from state READY to state ABORTING_TRANSACTION > at > org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:757) > at > org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:751) > at > org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:230) > at > org.apache.kafka.clients.producer.KafkaProducer.abortTransaction(KafkaProducer.java:660) > at > org.apache.kafka.streams.processor.internals.StreamTask.closeSuspended(StreamTask.java:486) > at > org.apache.kafka.streams.processor.internals.StreamTask.close(StreamTask.java:546) > at > org.apache.kafka.streams.processor.internals.AssignedTasks.closeNonRunningTasks(AssignedTasks.java:166) > at > org.apache.kafka.streams.processor.internals.AssignedTasks.suspend(AssignedTasks.java:151) > at > org.apache.kafka.streams.processor.internals.TaskManager.suspendTasksAndState(TaskManager.java:242) > at > org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsRevoked(StreamThread.java:291) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:414) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:359) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:290) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1149) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115) > at > org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:827) > at > org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:784) > at > org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750) > at > org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720) > {code} > This happens when starting the same stream-processing application on 3 JVMs > all running on the same linux box, JVMs are named JVM-[2-4]. All 3 instances > use separate stream state.dir. 
No record is ever processed because the input > kafka topics are empty at this stage. > JVM-2 starts first, joined shortly after by JVM-4 and JVM-3, find the state > transition logs below. The above stacktrace is from JVM-4 > {code} > [JVM-2] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from REBALANCING to > RUNNING > [JVM-4] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from REBALANCING to > RUNNING > [JVM-3] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > [JVM-2] stream-client [feedBuilder-XXX] State transition from RUNNING to > REBALANCING > JVM-4 crashes here with above stacktrace > [JVM-2] stream-client [feedBuilder-XXX] State transition from REBALANCING to > RUNNING > [JVM-4] stream-client [feedBuilder-XXX] Sta
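For context on which errors the transactional producer is expected to abort versus treat as fatal, the send loop documented on KafkaProducer looks roughly like the sketch below (topic and transactional.id are placeholders). Whether an abort is still legal once a commit has already started failing mid-flight is exactly the grey area the comment above asks about.

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalSendLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-txn-id");      // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
            producer.commitTransaction();
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
            // Fatal errors: the only legal follow-up is to close the producer.
            producer.close();
        } catch (KafkaException e) {
            // Abortable errors raised inside an open transaction: abort and start over.
            producer.abortTransaction();
        }
        producer.close();
    }
}
{code}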
[jira] [Commented] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969820#comment-16969820 ] shilin Lu commented on KAFKA-9156: -- [~guozhang] i think only use val AtomicReference can't perform as well. {code:java} // code placeholder class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) { @volatile private var timeIndex: Option[TimeIndex] = None def file: File = { if (timeIndex.isDefined) timeIndex.get.file else _file } def file_=(f: File) { if (timeIndex.isDefined) timeIndex.get.file = f else _file = f } def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } }{code} the file will rename by def file_=(f: File) this function. we should keep it Multi-thread safe. i think this code can work.i use synchronized because i think this function is lightweight. if you think has some problem, i will modify it. thanks {code:java} // code placeholder class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) { @volatile private var timeIndex: Option[TimeIndex] = None def file: File = { this synchronized { if (timeIndex.isDefined) timeIndex.get.file else _file } } def file_=(f: File) { this synchronized { if (timeIndex.isDefined) timeIndex.get.file = f else _file = f } } def get: TimeIndex = { if (timeIndex.isEmpty) { this synchronized { if (timeIndex.isEmpty) { timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable)) } } } timeIndex.get } }{code} > LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > this timeindex get function is not thread safe ,may cause create some > timeindex. > !image-2019-11-07-17-44-05-357.png! > When create timeindex not exactly one ,may cause mappedbytebuffer position to > end. Then write index entry to this mmap file will cause > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9161) Close gaps in Streams configs documentation
Sophie Blee-Goldman created KAFKA-9161: -- Summary: Close gaps in Streams configs documentation Key: KAFKA-9161 URL: https://issues.apache.org/jira/browse/KAFKA-9161 Project: Kafka Issue Type: Bug Components: documentation, streams Reporter: Sophie Blee-Goldman There are a number of Streams configs that aren't documented in the "Configuring a Streams Application" section of the docs. As of 2.3 the missing configs are: # default.windowed.key.serde.inner * # default.windowed.value.serde.inner * # max.task.idle.ms # rocksdb.config.setter. ** # topology.optimization # upgrade.from * these configs are also missing the corresponding DOC string ** this one actually does appear on that page, but instead of being included in the list of Streams configs it is for some reason under "Consumer and Producer Configuration Parameters" ? There are also a few configs whose documented name is slightly incorrect, ie it is missing the "default" prefix that appears in the actual code. I assume this was not intentional because there are other configs that do have the "default" prefix included, so it doesn't seem to be dropped on purpose. The "missing-default" configs are: # key.serde # value.serde # timestamp.extractor -- This message was sent by Atlassian Jira (v8.3.4#803005)
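A minimal sketch of two of the items on that list, mainly to record where the names live in code (the RocksDB option chosen below is arbitrary): the serde keys carry the "default." prefix via StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG / DEFAULT_VALUE_SERDE_CLASS_CONFIG, and rocksdb.config.setter points at a class implementing RocksDBConfigSetter.

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

import java.util.Map;
import java.util.Properties;

public class StreamsConfigExample {

    // Minimal RocksDB config setter, registered via rocksdb.config.setter below.
    public static class BoundedMemtables implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options,
                              final Map<String, Object> configs) {
            options.setMaxWriteBufferNumber(2); // shrink the per-store memtable count
        }
    }

    public static Properties buildProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "docs-demo");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // The actual config keys carry the "default." prefix the docs sometimes omit.
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemtables.class);
        return props;
    }
}
{code}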
[jira] [Assigned] (KAFKA-9161) Close gaps in Streams configs documentation
[ https://issues.apache.org/jira/browse/KAFKA-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sophie Blee-Goldman reassigned KAFKA-9161: -- Assignee: Sophie Blee-Goldman > Close gaps in Streams configs documentation > --- > > Key: KAFKA-9161 > URL: https://issues.apache.org/jira/browse/KAFKA-9161 > Project: Kafka > Issue Type: Bug > Components: documentation, streams >Reporter: Sophie Blee-Goldman >Assignee: Sophie Blee-Goldman >Priority: Major > > There are a number of Streams configs that aren't documented in the > "Configuring a Streams Application" section of the docs. As of 2.3 the > missing configs are: > # default.windowed.key.serde.inner * > # default.windowed.value.serde.inner * > # max.task.idle.ms > # rocksdb.config.setter. ** > # topology.optimization > # upgrade.from > * these configs are also missing the corresponding DOC string > ** this one actually does appear on that page, but instead of being included > in the list of Streams configs it is for some reason under "Consumer and > Producer Configuration Parameters" ? > There are also a few configs whose documented name is slightly incorrect, ie > it is missing the "default" prefix that appears in the actual code. I assume > this was not intentional because there are other configs that do have the > "default" prefix included, so it doesn't seem to be dropped on purpose. The > "missing-default" configs are: > # key.serde > # value.serde > # timestamp.extractor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969828#comment-16969828 ] Manikumar commented on KAFKA-8933: -- [~mimaison] If this existing bug in previous releases, then it's not a blocker. Any regressions in 2.4 branch or data loss issues considered as blocker. > An unhandled SSL handshake exception in polling event - needed a retry logic > > > Key: KAFKA-8933 > URL: https://issues.apache.org/jira/browse/KAFKA-8933 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.2.1, 2.4.0 > Environment: software platform >Reporter: Remigius >Priority: Critical > > Already client is connected and during polling event, SSL handshake failure > happened. it led to leaving the co-ordinator. Even on SSL handshake failure > which was actually intermittent issue, polling should have some resilient and > retry the polling. Leaving group caused all instances of clients to drop and > left the messages in Kafka for long time until re-subscribe the kafka topic > manually. > > > {noformat} > 2019-09-06 04:03:09,016 ERROR [reactive-kafka-] > org.apache.kafka.clients.NetworkClient [Consumer clientId=aaa, groupId=bbb] > Connection to node 150 (host:port) failed authentication due to: SSL > handshake failed > 2019-09-06 04:03:09,021 ERROR [reactive-kafka-] > reactor.kafka.receiver.internals.DefaultKafkaReceiver Unexpected exception > java.lang.NullPointerException: null > at > org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1012) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:822) > ~[kafka-clients-2.2.1.jar!/:?] > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) > ~[kafka-clients-2.2.1.jar!/:?] 
> at > reactor.kafka.receiver.internals.DefaultKafkaReceiver$PollEvent.run(DefaultKafkaReceiver.java:470) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.doEvent(DefaultKafkaReceiver.java:401) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.lambda$start$14(DefaultKafkaReceiver.java:335) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:130) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:398) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:484) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.kafka.receiver.internals.KafkaSchedulers$EventScheduler.lambda$decorate$1(KafkaSchedulers.java:100) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > org.springframework.cloud.sleuth.instrument.async.TraceCallable.call(TraceCallable.java:70) > ~[spring-cloud-sleuth-core-2.1.1.RELEASE.jar!/:2.1.1.RELEASE] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2019-09-06 04:03:09,023 INFO [reactive-kafka-] > org.apache.kafka.clients.consumer.internals.AbstractCoordinator [Consumer > clientId=aaa, groupId=bbb] Member x_13-081e61ec-1509-4e0e-819e-58063d1ce8f6 > sending LeaveGroup request to coordinator{noformat
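Until the client itself treats this as retriable, one application-level mitigation is to catch authentication failures from poll(), back off, and re-create the consumer rather than giving up. A minimal, illustrative sketch under that assumption (property names, backoff values, and the handler are placeholders, and this does not address the NullPointerException shown in the trace above):
{code:java}
import java.time.Duration;
import java.util.Collection;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.AuthenticationException;

public class ResilientPoller {
    // Poll in a loop; on an (often intermittent) authentication failure such as an
    // SSL handshake error, back off and retry instead of giving up and leaving the group.
    public static void pollWithRetry(Properties consumerProps,
                                     Collection<String> topics,
                                     java.util.function.Consumer<ConsumerRecords<String, String>> handler)
            throws InterruptedException {
        long backoffMs = 1_000L;                 // illustrative starting backoff
        while (!Thread.currentThread().isInterrupted()) {
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(topics);
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    handler.accept(records);
                    backoffMs = 1_000L;          // reset backoff after a successful poll
                }
            } catch (AuthenticationException e) {
                // e.g. an intermittent SSL handshake failure surfaced by poll()
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 60_000L);
            }
        }
    }
}
{code}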
[jira] [Comment Edited] (KAFKA-8933) An unhandled SSL handshake exception in polling event - needed a retry logic
[ https://issues.apache.org/jira/browse/KAFKA-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969828#comment-16969828 ] Manikumar edited comment on KAFKA-8933 at 11/8/19 4:40 AM: --- [~mimaison] If this is existing bug in previous releases, then it's not a blocker. Any regressions in 2.4 branch or data loss issues considered as blocker. was (Author: omkreddy): [~mimaison] If this existing bug in previous releases, then it's not a blocker. Any regressions in 2.4 branch or data loss issues considered as blocker. > An unhandled SSL handshake exception in polling event - needed a retry logic > > > Key: KAFKA-8933 > URL: https://issues.apache.org/jira/browse/KAFKA-8933 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.2.1, 2.4.0 > Environment: software platform >Reporter: Remigius >Priority: Critical > > Already client is connected and during polling event, SSL handshake failure > happened. it led to leaving the co-ordinator. Even on SSL handshake failure > which was actually intermittent issue, polling should have some resilient and > retry the polling. Leaving group caused all instances of clients to drop and > left the messages in Kafka for long time until re-subscribe the kafka topic > manually. > > > {noformat} > 2019-09-06 04:03:09,016 ERROR [reactive-kafka-] > org.apache.kafka.clients.NetworkClient [Consumer clientId=aaa, groupId=bbb] > Connection to node 150 (host:port) failed authentication due to: SSL > handshake failed > 2019-09-06 04:03:09,021 ERROR [reactive-kafka-] > reactor.kafka.receiver.internals.DefaultKafkaReceiver Unexpected exception > java.lang.NullPointerException: null > at > org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1012) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:822) > ~[kafka-clients-2.2.1.jar!/:?] > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1256) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1200) > ~[kafka-clients-2.2.1.jar!/:?] > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1176) > ~[kafka-clients-2.2.1.jar!/:?] 
> at > reactor.kafka.receiver.internals.DefaultKafkaReceiver$PollEvent.run(DefaultKafkaReceiver.java:470) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.doEvent(DefaultKafkaReceiver.java:401) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at > reactor.kafka.receiver.internals.DefaultKafkaReceiver.lambda$start$14(DefaultKafkaReceiver.java:335) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.publisher.LambdaSubscriber.onNext(LambdaSubscriber.java:130) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:398) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:484) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > reactor.kafka.receiver.internals.KafkaSchedulers$EventScheduler.lambda$decorate$1(KafkaSchedulers.java:100) > ~[reactor-kafka-1.1.1.RELEASE.jar!/:1.1.1.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) > ~[reactor-core-3.2.10.RELEASE.jar!/:3.2.10.RELEASE] > at > org.springframework.cloud.sleuth.instrument.async.TraceCallable.call(TraceCallable.java:70) > ~[spring-cloud-sleuth-core-2.1.1.RELEASE.jar!/:2.1.1.RELEASE] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2019-09-06 04:03:09,023 INFO
[jira] [Commented] (KAFKA-9134) Refactor the StreamsPartitionAssignor for more code sharing with the FutureStreamsPartitionAssignor
[ https://issues.apache.org/jira/browse/KAFKA-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969829#comment-16969829 ] John Roesler commented on KAFKA-9134: - Thanks for calling this out [~ableegoldman]. Recently, while chasing down some problems with the version probing test, I actually developed the opinion that the blame is in the opposite direction... Namely, that the FutureStreamsPartitionAssignor shares too much code with the StreamsPartitionAssignor. Reading over the details of your thinking, though, I'm wondering if these are really just two ways of saying the same thing. Actually, I recently made some changes that sound directionally similar to what you propose. Can you take a look at https://github.com/apache/kafka/pull/7649 (and the other changes in that PR) and comment on whether you think it addressed (some of) your concerns in this ticket? > Refactor the StreamsPartitionAssignor for more code sharing with the > FutureStreamsPartitionAssignor > --- > > Key: KAFKA-9134 > URL: https://issues.apache.org/jira/browse/KAFKA-9134 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > > Frequently when fixing bugs in the StreamsPartitionAssignor, version probing, > or other assignor-related matters, we make a change in the > StreamsPartitionAssignor and re-run the version_probing_upgrade system test > only to see that the issue is not fixed because we forgot to mirror that > change in the FutureStreamsPartitionAssignor. Worse yet, we are making a new > change or fixing something that doesn't directly affect version probing, so > we update only the StreamsPartitionAssignor and don't even run the version > probing test, and discover later the version probing system test has started > failing. Then we often waste time digging through old changes just to > discover it was just because we forgot to copy any changes to the > StreamsUpgradeTest classes (includes also it's future version of the > SubscriptionInfo and AssignmentInfo classes) > > We should refactor the StreamsPartitionAssignor so that the future version > can rely more heavily on the original class. This will probably involve > either or all of: > * making the class's latestSupportedVersion configurable in the constructor, > to be used only by the system test. this will probably also require making it > configurable in some other classes such as SubscriptionInfo and AssignmentInfo > * breaking up the methods such as onAssignment and assign so that the future > assignor can call smaller pieces at a time – part of the original problem is > that the normal assignor will throw an exception partway through these > methods if the latest version is a "future" one, so we end up having to call > the method, catch the exception, and recode the remainder of the work in that > method -- This message was sent by Atlassian Jira (v8.3.4#803005)
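As a rough, hypothetical illustration of the first bullet above (this is not the real StreamsPartitionAssignor API; class and method names are invented): making the latest supported version an injectable constructor argument lets a "future" test assignor reuse the production logic instead of duplicating it.
{code:java}
// Hypothetical sketch: inject the version ceiling instead of hard-coding it,
// so a "future" system-test assignor can reuse the production code path.
public class VersionedAssignor {

    public static final int LATEST_SUPPORTED_VERSION = 5;   // placeholder value

    private final int latestSupportedVersion;

    // Production code uses the real constant...
    public VersionedAssignor() {
        this(LATEST_SUPPORTED_VERSION);
    }

    // ...while the system test passes LATEST_SUPPORTED_VERSION + 1 to simulate a future version.
    VersionedAssignor(final int latestSupportedVersion) {
        this.latestSupportedVersion = latestSupportedVersion;
    }

    public int effectiveVersion(final int requestedVersion) {
        // Instead of throwing when requestedVersion exceeds the compiled-in maximum,
        // clamp to the injected ceiling so the same method body serves both assignors.
        return Math.min(requestedVersion, latestSupportedVersion);
    }
}
{code}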
[jira] [Updated] (KAFKA-9161) Close gaps in Streams configs documentation
[ https://issues.apache.org/jira/browse/KAFKA-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias J. Sax updated KAFKA-9161: --- Labels: beginner newbie (was: ) > Close gaps in Streams configs documentation > --- > > Key: KAFKA-9161 > URL: https://issues.apache.org/jira/browse/KAFKA-9161 > Project: Kafka > Issue Type: Bug > Components: documentation, streams >Reporter: Sophie Blee-Goldman >Assignee: Sophie Blee-Goldman >Priority: Major > Labels: beginner, newbie > > There are a number of Streams configs that aren't documented in the > "Configuring a Streams Application" section of the docs. As of 2.3 the > missing configs are: > # default.windowed.key.serde.inner * > # default.windowed.value.serde.inner * > # max.task.idle.ms > # rocksdb.config.setter. ** > # topology.optimization > # upgrade.from > * these configs are also missing the corresponding DOC string > ** this one actually does appear on that page, but instead of being included > in the list of Streams configs it is for some reason under "Consumer and > Producer Configuration Parameters" ? > There are also a few configs whose documented name is slightly incorrect, ie > it is missing the "default" prefix that appears in the actual code. I assume > this was not intentional because there are other configs that do have the > "default" prefix included, so it doesn't seem to be dropped on purpose. The > "missing-default" configs are: > # key.serde > # value.serde > # timestamp.extractor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-9156) LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent state
[ https://issues.apache.org/jira/browse/KAFKA-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969820#comment-16969820 ] shilin Lu edited comment on KAFKA-9156 at 11/8/19 5:07 AM: --- [~guozhang] I think using only a val AtomicReference does not work well here.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    if (timeIndex.isDefined) timeIndex.get.file else _file
  }

  def file_=(f: File) {
    if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
The file is renamed through def file_=(f: File), so the _file and timeIndex fields are not safe under multi-threaded access; we should keep them thread safe. I think the code below works. I use synchronized because these functions are lightweight. If you see any problem with it, I will modify it. Thanks.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file else _file
    }
  }

  def file_=(f: File) {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
    }
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
was (Author: lushilin): [~guozhang] I think using only a val AtomicReference does not work well here.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    if (timeIndex.isDefined) timeIndex.get.file else _file
  }

  def file_=(f: File) {
    if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
The file is renamed through def file_=(f: File); we should keep it thread safe. I think the code below works. I use synchronized because these functions are lightweight. If you see any problem with it, I will modify it. Thanks.
{code:java}
// code placeholder
class LazyTimeIndex(@volatile private var _file: File, baseOffset: Long, maxIndexSize: Int = -1, writable: Boolean = true) {
  @volatile private var timeIndex: Option[TimeIndex] = None

  def file: File = {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file else _file
    }
  }

  def file_=(f: File) {
    this synchronized {
      if (timeIndex.isDefined) timeIndex.get.file = f else _file = f
    }
  }

  def get: TimeIndex = {
    if (timeIndex.isEmpty) {
      this synchronized {
        if (timeIndex.isEmpty) {
          timeIndex = Some(new TimeIndex(_file, baseOffset, maxIndexSize, writable))
        }
      }
    }
    timeIndex.get
  }
}
{code}
> LazyTimeIndex & LazyOffsetIndex may cause niobufferoverflow in concurrent > state > --- > > Key: KAFKA-9156 > URL: https://issues.apache.org/jira/browse/KAFKA-9156 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.3.0 >Reporter: shilin Lu >Priority: Critical > Attachments: image-2019-11-07-17-42-13-852.png, > image-2019-11-07-17-44-05-357.png, image-2019-11-07-17-46-53-650.png > > > !image-2019-11-07-17-42-13-852.png! > The TimeIndex get function is not thread safe, so it may create more than one TimeIndex. > !image-2019-11-07-17-44-05-357.png! > When more than one TimeIndex is created, the MappedByteBuffer position may move to the end; writing an index entry to this mmap file then causes > java.nio.BufferOverflowException. > > !image-2019-11-07-17-46-53-650.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
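For readers less familiar with the pattern under discussion, the proposal above is double-checked locking around the lazy creation, plus synchronizing the accessors that can race with it. A generic Java sketch of that idea, illustrative only and not Kafka's actual LazyTimeIndex/LazyOffsetIndex code:
{code:java}
import java.io.File;

// Generic double-checked lazy holder: renames and reads of the backing file are
// synchronized against the lazy creation so only one underlying index is ever built.
public class LazyIndex<T> {

    public interface Factory<T> { T create(File file); }

    private volatile File file;
    private volatile T index;                 // null until first use
    private final Factory<T> factory;

    public LazyIndex(File file, Factory<T> factory) {
        this.file = file;
        this.factory = factory;
    }

    public synchronized File file() {
        return file;
    }

    public synchronized void renameTo(File newFile) {
        // synchronized so a rename cannot interleave with the creation in get()
        this.file = newFile;
    }

    public T get() {
        T result = index;
        if (result == null) {                 // first check without the lock
            synchronized (this) {
                result = index;
                if (result == null) {         // second check under the lock
                    result = factory.create(file);
                    index = result;
                }
            }
        }
        return result;
    }
}
{code}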
[jira] [Updated] (KAFKA-9158) Make producer fetch metadata only when buffer is full
[ https://issues.apache.org/jira/browse/KAFKA-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Jiang updated KAFKA-9158: - Affects Version/s: (was: 2.2.1) 2.3.0 > Make producer fetch metadata only when buffer is full > - > > Key: KAFKA-9158 > URL: https://issues.apache.org/jira/browse/KAFKA-9158 > Project: Kafka > Issue Type: Improvement > Components: producer >Affects Versions: 2.3.0 >Reporter: Brandon Jiang >Priority: Major > > Currently, based on my understanding of the KafkaProducer.doSend() method, when > triggering the Kafka Java client with > {code:java} > org.apache.kafka.clients.producer.Producer.send(message);{code} > the current behavior is to do the metadata fetch first and then queue the > message for sending? This can result in the Kafka producer hanging for up to > max.block.ms when the Kafka servers are down. > Is it possible to change the Producer.send() behavior to queue all > the messages first and only do the metadata fetch asynchronously when the queue is > full? e.g. > {code:java} > if (msgQueue.batchIsFull ) { > clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), > maxBlockTimeMs); > . > this.sender.wakeup(); > }{code} > I am happy to make this change, but before doing that I just want to check > the current design logic behind producer.send(). > I have seen there are lots of discussions regarding the producer send() behavior, > e.g. https://issues.apache.org/jira/browse/KAFKA-2948, > https://issues.apache.org/jira/browse/KAFKA-5369 > So I am trying to make sure I did not misunderstand the current behavior here. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9158) Make producer fetch metadata only when buffer is full
[ https://issues.apache.org/jira/browse/KAFKA-9158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Jiang updated KAFKA-9158: - Summary: Make producer fetch metadata only when buffer is full (was: producer fetch metadata until buffer is full) > Make producer fetch metadata only when buffer is full > - > > Key: KAFKA-9158 > URL: https://issues.apache.org/jira/browse/KAFKA-9158 > Project: Kafka > Issue Type: Improvement > Components: producer >Affects Versions: 2.2.1 >Reporter: Brandon Jiang >Priority: Major > > Currently, based on my understanding of the KafkaProducer.doSend() method, when > triggering the Kafka Java client with > {code:java} > org.apache.kafka.clients.producer.Producer.send(message);{code} > the current behavior is to do the metadata fetch first and then queue the > message for sending? This can result in the Kafka producer hanging for up to > max.block.ms when the Kafka servers are down. > Is it possible to change the Producer.send() behavior to queue all > the messages first and only do the metadata fetch asynchronously when the queue is > full? e.g. > {code:java} > if (msgQueue.batchIsFull ) { > clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), > maxBlockTimeMs); > . > this.sender.wakeup(); > }{code} > I am happy to make this change, but before doing that I just want to check > the current design logic behind producer.send(). > I have seen there are lots of discussions regarding the producer send() behavior, > e.g. https://issues.apache.org/jira/browse/KAFKA-2948, > https://issues.apache.org/jira/browse/KAFKA-5369 > So I am trying to make sure I did not misunderstand the current behavior here. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
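For context on the current behavior described in this ticket: the time send() can spend blocked on that initial metadata fetch is bounded by max.block.ms, so applications that cannot tolerate the hang usually lower it and handle the resulting timeout. A minimal sketch under those assumptions (the topic name, timeout value, and error handling are illustrative; depending on the client version the timeout may surface as a thrown exception or as a failed Future):
{code:java}
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class BoundedBlockingSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Bound how long send() may block waiting for metadata (the default is 60000 ms).
        props.put("max.block.ms", "5000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                Future<RecordMetadata> future =
                        producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
                future.get();                       // surfaces broker-side or send errors
            } catch (Exception e) {
                // With brokers down, the metadata wait times out after max.block.ms;
                // handle or buffer the record at the application level here.
                System.err.println("send failed: " + e);
            }
        }
    }
}
{code}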
[jira] [Updated] (KAFKA-9159) Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in 30000ms after a leader change
[ https://issues.apache.org/jira/browse/KAFKA-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangzhanchang updated KAFKA-9159: -- Description: case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after kill -9 of a broker, but after a leader change the same loop has no problem !image-2019-11-08-10-28-19-881.png|width=416,height=299! case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a broker the same loop has no problem !image-2019-11-08-10-29-06-282.png|width=412,height=314! was: case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after kill -9 of a broker, but after a leader change the same loop has no problem !image-2019-11-08-10-28-19-881.png! case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a broker the same loop has no problem !image-2019-11-08-10-29-06-282.png! > Consumer.endOffsets Throw TimeoutException: Failed to get offsets by times in > 30000ms after a leader change > --- > > Key: KAFKA-9159 > URL: https://issues.apache.org/jira/browse/KAFKA-9159 > Project: Kafka > Issue Type: Bug > Components: consumer >Affects Versions: 0.10.2.0, 2.0.0 >Reporter: zhangzhanchang >Priority: Major > Attachments: image-2019-11-08-10-28-19-881.png, > image-2019-11-08-10-29-06-282.png > > > case 1: on version 0.10.2, calling Consumer.endOffsets in a loop throws TimeoutException: > Failed to get offsets by times in 30000ms after kill -9 of a broker, but after a leader > change the same loop has no problem > !image-2019-11-08-10-28-19-881.png|width=416,height=299! > case 2: on version 2.0, calling Consumer.endOffsets in a loop throws TimeoutException: > Failed to get offsets by times in 30000ms after a leader change, but after kill -9 of a > broker the same loop has no problem > !image-2019-11-08-10-29-06-282.png|width=412,height=314! -- This message was sent by Atlassian Jira (v8.3.4#803005)
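Until the underlying issue is understood, callers on 2.0+ clients can at least bound and retry the lookup themselves, since endOffsets accepts an explicit timeout. A minimal, illustrative sketch of that workaround (the retry count and backoff are assumptions):
{code:java}
import java.time.Duration;
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.TimeoutException;

public class EndOffsetsWithRetry {
    // Retry endOffsets a few times, e.g. to ride out a leader change,
    // instead of failing on the first TimeoutException.
    public static Map<TopicPartition, Long> endOffsets(Consumer<?, ?> consumer,
                                                       Collection<TopicPartition> partitions)
            throws InterruptedException {
        final int maxAttempts = 5;                       // illustrative
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return consumer.endOffsets(partitions, Duration.ofSeconds(30));
            } catch (TimeoutException e) {
                last = e;                                // metadata may still point at the old leader
                Thread.sleep(1_000L);                    // illustrative backoff
            }
        }
        throw last;
    }
}
{code}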
[jira] [Assigned] (KAFKA-8677) Flakey test GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl
[ https://issues.apache.org/jira/browse/KAFKA-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Bradstreet reassigned KAFKA-8677: --- Assignee: (was: Lucas Bradstreet) > Flakey test > GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > > > Key: KAFKA-8677 > URL: https://issues.apache.org/jira/browse/KAFKA-8677 > Project: Kafka > Issue Type: Bug > Components: core, security, unit tests >Affects Versions: 2.4.0 >Reporter: Boyang Chen >Priority: Blocker > Labels: flaky-test > Fix For: 2.4.0 > > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/6325/console] > > *18:43:39* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl STARTED*18:44:00* > kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > failed, log available in > /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.12/core/build/reports/testOutput/kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl.test.stdout*18:44:00* > *18:44:00* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl FAILED*18:44:00* > org.scalatest.exceptions.TestFailedException: Consumed 0 records before > timeout instead of the expected 1 records > --- > I found this flaky test is actually exposing a real bug in consumer: within > {{KafkaConsumer.poll}}, we have an optimization to try to send the next fetch > request before returning the data in order to pipelining the fetch requests: > {code} > if (!records.isEmpty()) { > // before returning the fetched records, we can send off > the next round of fetches > // and avoid block waiting for their responses to enable > pipelining while the user > // is handling the fetched records. > // > // NOTE: since the consumed position has already been > updated, we must not allow > // wakeups or any other errors to be triggered prior to > returning the fetched records. > if (fetcher.sendFetches() > 0 || > client.hasPendingRequests()) { > client.pollNoWakeup(); > } > return this.interceptors.onConsume(new > ConsumerRecords<>(records)); > } > {code} > As the NOTE mentioned, this pollNoWakeup should NOT throw any exceptions, > since at this point the fetch position has been updated. If an exception is > thrown here, and the callers decides to capture and continue, those records > would never be returned again, causing data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-8677) Flakey test GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl
[ https://issues.apache.org/jira/browse/KAFKA-8677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lucas Bradstreet reassigned KAFKA-8677: --- Assignee: Lucas Bradstreet (was: Guozhang Wang) > Flakey test > GroupEndToEndAuthorizationTest#testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > > > Key: KAFKA-8677 > URL: https://issues.apache.org/jira/browse/KAFKA-8677 > Project: Kafka > Issue Type: Bug > Components: core, security, unit tests >Affects Versions: 2.4.0 >Reporter: Boyang Chen >Assignee: Lucas Bradstreet >Priority: Blocker > Labels: flaky-test > Fix For: 2.4.0 > > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/6325/console] > > *18:43:39* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl STARTED*18:44:00* > kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl > failed, log available in > /home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.12/core/build/reports/testOutput/kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl.test.stdout*18:44:00* > *18:44:00* kafka.api.GroupEndToEndAuthorizationTest > > testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl FAILED*18:44:00* > org.scalatest.exceptions.TestFailedException: Consumed 0 records before > timeout instead of the expected 1 records > --- > I found this flaky test is actually exposing a real bug in consumer: within > {{KafkaConsumer.poll}}, we have an optimization to try to send the next fetch > request before returning the data in order to pipelining the fetch requests: > {code} > if (!records.isEmpty()) { > // before returning the fetched records, we can send off > the next round of fetches > // and avoid block waiting for their responses to enable > pipelining while the user > // is handling the fetched records. > // > // NOTE: since the consumed position has already been > updated, we must not allow > // wakeups or any other errors to be triggered prior to > returning the fetched records. > if (fetcher.sendFetches() > 0 || > client.hasPendingRequests()) { > client.pollNoWakeup(); > } > return this.interceptors.onConsume(new > ConsumerRecords<>(records)); > } > {code} > As the NOTE mentioned, this pollNoWakeup should NOT throw any exceptions, > since at this point the fetch position has been updated. If an exception is > thrown here, and the callers decides to capture and continue, those records > would never be returned again, causing data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
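The NOTE quoted above also has a practical corollary for applications: catching an unexpected exception from poll() and simply continuing can skip records whose positions were already advanced. A small illustrative sketch of the safer shape (the topic name and handling are assumptions, and this is not the internal fix being discussed):
{code:java}
import java.time.Duration;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class PollErrorHandling {
    public static void consume(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("demo-topic"));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
                consumer.commitSync();
            }
        } catch (WakeupException e) {
            // expected on shutdown, nothing to do
        } catch (RuntimeException e) {
            // Do NOT swallow and keep polling: if positions were already advanced when the
            // exception was raised, continuing from here could silently skip fetched records.
            throw e;
        } finally {
            consumer.close();
        }
    }
}
{code}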
[jira] [Created] (KAFKA-9162) Re-assignment of Partition gets infinitely stuck "Still in progress" state
vikash kumar created KAFKA-9162: --- Summary: Re-assignment of Partition gets infinitely stuck "Still in progress" state Key: KAFKA-9162 URL: https://issues.apache.org/jira/browse/KAFKA-9162 Project: Kafka Issue Type: Bug Affects Versions: 0.10.1.1 Environment: Amazon Linux 1 x86_64 GNU/Linux Reporter: vikash kumar We have a 6 node kafka cluster. Out of the 6 machines, 3 of them have both Kafka + zookeeper, and the remaining 3 of them have just kafka. Recently, we added one more kafka node. While re-assigning the partitions to all the nodes (including the newer one), we executed the below command: /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --execute --zookeeper localhost:2181 However, when we verified the status by using the below command, /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 we get the below output; for some of the partitions, re-assignment is still in progress. /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 | grep 'progress' Reassignment of partition [topic-name,854] is still in progress Reassignment of partition [topic-name,674] is still in progress Reassignment of partition [topic-name,944] is still in progress Reassignment of partition [topic-name,404] is still in progress Reassignment of partition [topic-name,314] is still in progress Reassignment of partition [topic-name,853] is still in progress Reassignment of partition [prom-metrics,403] is still in progress Reassignment of partition [prom-metrics,134] is still in progress There is no way to either: 1. Cancel the on-going partition re-assignment, or 2. Roll it back (when we try doing that, it says "There is an existing assignment running."). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9163) Re-assignment of Partition gets infinitely stuck "Still in progress" state
vikash kumar created KAFKA-9163: --- Summary: Re-assignment of Partition gets infinitely stuck "Still in progress" state Key: KAFKA-9163 URL: https://issues.apache.org/jira/browse/KAFKA-9163 Project: Kafka Issue Type: Bug Affects Versions: 0.10.1.1 Environment: Amazon Linux 1 x86_64 GNU/Linux Reporter: vikash kumar We have a 6 node kafka cluster. Out of the 6 machines, 3 of them have both Kafka + zookeeper, and the remaining 3 of them have just kafka. Recently, we added one more kafka node. While re-assigning the partitions to all the nodes (including the newer one), we executed the below command: /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --execute --zookeeper localhost:2181 However, when we verified the status by using the below command, /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 we get the below output; for some of the partitions, re-assignment is still in progress. /opt/kafka/bin/kafka-reassign-partitions.sh --reassignment-json-file new_assignment_details.json --verify --zookeeper localhost:2181 | grep 'progress' Reassignment of partition [topic-name,854] is still in progress Reassignment of partition [topic-name,674] is still in progress Reassignment of partition [topic-name,944] is still in progress Reassignment of partition [topic-name,404] is still in progress Reassignment of partition [topic-name,314] is still in progress Reassignment of partition [topic-name,853] is still in progress Reassignment of partition [prom-metrics,403] is still in progress Reassignment of partition [prom-metrics,134] is still in progress There is no way to either: 1. Cancel the on-going partition re-assignment, or 2. Roll it back (when we try doing that, it says "There is an existing assignment running."). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9133) LogCleaner thread dies with: currentLog cannot be empty on an unexpected exception
[ https://issues.apache.org/jira/browse/KAFKA-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969917#comment-16969917 ] ASF GitHub Bot commented on KAFKA-9133: --- timvlaer commented on pull request #7639: KAFKA-9133: add extra bounds check for dirty offset before cleaning URL: https://github.com/apache/kafka/pull/7639 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > LogCleaner thread dies with: currentLog cannot be empty on an unexpected > exception > -- > > Key: KAFKA-9133 > URL: https://issues.apache.org/jira/browse/KAFKA-9133 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.3.1 >Reporter: Karolis Pocius >Assignee: Jason Gustafson >Priority: Blocker > Fix For: 2.4.0, 2.3.2 > > > Log cleaner thread dies without a clear reference to which log is causing it: > {code} > [2019-11-02 11:59:59,078] INFO Starting the log cleaner (kafka.log.LogCleaner) > [2019-11-02 11:59:59,144] INFO [kafka-log-cleaner-thread-0]: Starting > (kafka.log.LogCleaner) > [2019-11-02 11:59:59,199] ERROR [kafka-log-cleaner-thread-0]: Error due to > (kafka.log.LogCleaner) > java.lang.IllegalStateException: currentLog cannot be empty on an unexpected > exception > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:346) > at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:307) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89) > Caused by: java.lang.IllegalArgumentException: Illegal request for non-active > segments beginning at offset 5033130, which is larger than the active > segment's base offset 5019648 > at kafka.log.Log.nonActiveLogSegmentsFrom(Log.scala:1933) > at > kafka.log.LogCleanerManager$.maxCompactionDelay(LogCleanerManager.scala:491) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$4(LogCleanerManager.scala:184) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$1(LogCleanerManager.scala:181) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253) > at > kafka.log.LogCleanerManager.grabFilthiestCompactedLog(LogCleanerManager.scala:171) > at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:321) > ... 2 more > [2019-11-02 11:59:59,200] INFO [kafka-log-cleaner-thread-0]: Stopped > (kafka.log.LogCleaner) > {code} > If I try to ressurect it by dynamically bumping {{log.cleaner.threads}} it > instantly dies with the exact same error. > Not sure if this is something KAFKA-8725 is supposed to address. -- This message was sent by Atlassian Jira (v8.3.4#803005)