[jira] [Created] (KAFKA-8375) Offset jumps back after commit

2019-05-16 Thread Markus Dybeck (JIRA)
Markus Dybeck created KAFKA-8375:


 Summary: Offset jumps back after commit
 Key: KAFKA-8375
 URL: https://issues.apache.org/jira/browse/KAFKA-8375
 Project: Kafka
  Issue Type: Bug
  Components: offset manager
Affects Versions: 1.1.1
Reporter: Markus Dybeck
 Attachments: Skärmavbild 2019-05-16 kl. 08.41.53.png

*Setup*

Kafka: 1.1.1
Kafka-client: 1.1.1
Zookeeper: 3.4.11
Akka streams: 0.20

*Topic config*

DELETE_RETENTION_MS_CONFIG: "5000"
CLEANUP_POLICY_CONFIG: "compact,delete"
RETENTION_BYTES_CONFIG: 2000L
RETENTION_MS_CONFIG: 3600


*Behavior*
We have 7 consumers consuming from 7 partitions, and the lag of some of the 
consumers jumped back somewhat randomly. No new messages were pushed to the 
topic during that time. We didn't see any strange logs, and the brokers did 
not restart either.

Even if a restart or rebalance had been going on, we cannot understand why the 
offset would jump back after it was committed.

We observed this both in logs and by watching lag metrics. Our logs showed 
that around 30-35 seconds after we committed an offset, we consumed an earlier 
committed message again, and then the loop began. The behavior was the same 
after a restart of all the consumers. The behavior then stopped after a while 
all by itself.

We have no clue how to proceed, or whether this might be an issue with Akka. 
But is there any known issue that might cause this?

Attaching a screenshot with metrics showing the lag for one partition.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8375) Offset jumps back after commit

2019-05-16 Thread Markus Dybeck (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Dybeck updated KAFKA-8375:
-
Attachment: partition_lag_metrics.png

> Offset jumps back after commit
> --
>
> Key: KAFKA-8375
> URL: https://issues.apache.org/jira/browse/KAFKA-8375
> Project: Kafka
>  Issue Type: Bug
>  Components: offset manager
>Affects Versions: 1.1.1
>Reporter: Markus Dybeck
>Priority: Major
> Attachments: partition_lag_metrics.png
>
>
> *Setup*
> Kafka: 1.1.1
> Kafka-client: 1.1.1
> Zookeeper: 3.4.11
> Akka streams: 0.20
> *Topic config*
> DELETE_RETENTION_MS_CONFIG: "5000"
> CLEANUP_POLICY_CONFIG: "compact,delete"
> RETENTION_BYTES_CONFIG: 2000L
> RETENTION_MS_CONFIG: 3600
> *Behavior*
> We have 7 consumers consuming from 7 partitions, and the lag of some of the 
> consumers jumped back somewhat randomly. No new messages were pushed to the 
> topic during that time. We didn't see any strange logs, and the brokers did 
> not restart either.
> Even if a restart or rebalance had been going on, we cannot understand why 
> the offset would jump back after it was committed.
> We observed this both in logs and by watching lag metrics. Our logs showed 
> that around 30-35 seconds after we committed an offset, we consumed an 
> earlier committed message again, and then the loop began. The behavior was 
> the same after a restart of all the consumers. The behavior then stopped 
> after a while all by itself.
> We have no clue how to proceed, or whether this might be an issue with Akka. 
> But is there any known issue that might cause this?
> Attaching a screenshot with metrics showing the lag for one partition.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8375) Offset jumps back after commit

2019-05-16 Thread Markus Dybeck (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Dybeck updated KAFKA-8375:
-
Attachment: (was: Skärmavbild 2019-05-16 kl. 08.41.53.png)

> Offset jumps back after commit
> --
>
> Key: KAFKA-8375
> URL: https://issues.apache.org/jira/browse/KAFKA-8375
> Project: Kafka
>  Issue Type: Bug
>  Components: offset manager
>Affects Versions: 1.1.1
>Reporter: Markus Dybeck
>Priority: Major
> Attachments: partition_lag_metrics.png
>
>
> *Setup*
> Kafka: 1.1.1
> Kafka-client: 1.1.1
> Zookeeper: 3.4.11
> Akka streams: 0.20
> *Topic config*
> DELETE_RETENTION_MS_CONFIG: "5000"
> CLEANUP_POLICY_CONFIG: "compact,delete"
> RETENTION_BYTES_CONFIG: 2000L
> RETENTION_MS_CONFIG: 3600
> *Behavior*
> We have 7 consumers consuming from 7 partitions, and the lag of some of the 
> consumers jumped back somewhat randomly. No new messages were pushed to the 
> topic during that time. We didn't see any strange logs, and the brokers did 
> not restart either.
> Even if a restart or rebalance had been going on, we cannot understand why 
> the offset would jump back after it was committed.
> We observed this both in logs and by watching lag metrics. Our logs showed 
> that around 30-35 seconds after we committed an offset, we consumed an 
> earlier committed message again, and then the loop began. The behavior was 
> the same after a restart of all the consumers. The behavior then stopped 
> after a while all by itself.
> We have no clue how to proceed, or whether this might be an issue with Akka. 
> But is there any known issue that might cause this?
> Attaching a screenshot with metrics showing the lag for one partition.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8375) Offset jumps back after commit

2019-05-16 Thread Markus Dybeck (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Dybeck updated KAFKA-8375:
-
Description: 
*Setup*

Kafka: 1.1.1
 Kafka-client: 1.1.1
 Zookeeper: 3.4.11
 Akka streams: 0.20

*Topic config*

DELETE_RETENTION_MS_CONFIG: "5000"
 CLEANUP_POLICY_CONFIG: "compact,delete"
 RETENTION_BYTES_CONFIG: 2000L
 RETENTION_MS_CONFIG: 3600

*Consumer config*
AUTO_OFFSET_RESET_CONFIG: "earliest"
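
For reference, a rough sketch of how the settings above map onto the Java 
client API (the group id and bootstrap address below are made up, not our 
actual values):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.config.TopicConfig;

// Sketch only: reproduces the settings listed above; names/addresses are assumptions.
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // assumed
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

Map<String, String> topicConfig = new HashMap<>();
topicConfig.put(TopicConfig.DELETE_RETENTION_MS_CONFIG, "5000");
topicConfig.put(TopicConfig.CLEANUP_POLICY_CONFIG, "compact,delete");
topicConfig.put(TopicConfig.RETENTION_BYTES_CONFIG, "2000");
topicConfig.put(TopicConfig.RETENTION_MS_CONFIG, "3600");
{code}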

*Behavior*
We have 7 consumers consuming from 7 partitions, and the lag of some of the 
consumers jumped back somewhat randomly. No new messages were pushed to the 
topic during that time. We didn't see any strange logs, and the brokers did 
not restart either.

Even if a restart or rebalance had been going on, we cannot understand why the 
offset would jump back after it was committed.

We observed this both in logs and by watching lag metrics. Our logs showed 
that around 30-35 seconds after we committed an offset, we consumed an earlier 
committed message again, and then the loop began. The behavior was the same 
after a restart of all the consumers. The behavior then stopped after a while 
all by itself.

We have no clue how to proceed, or whether this might be an issue with Akka. 
But is there any known issue that might cause this?

Attaching a screenshot with metrics showing the lag for one partition.

 

  was:
*Setup*

Kafka: 1.1.1
Kafka-client: 1.1.1
Zookeeper: 3.4.11
Akka streams: 0.20

*Topic config*

DELETE_RETENTION_MS_CONFIG: "5000"
CLEANUP_POLICY_CONFIG: "compact,delete"
RETENTION_BYTES_CONFIG: 2000L
RETENTION_MS_CONFIG: 3600


*Behavior*
We have 7 consumers consuming from 7 partitions, and the lag of some of the 
consumers jumped back somewhat randomly. No new messages were pushed to the 
topic during that time. We didn't see any strange logs, and the brokers did 
not restart either.

Even if a restart or rebalance had been going on, we cannot understand why the 
offset would jump back after it was committed.

We observed this both in logs and by watching lag metrics. Our logs showed 
that around 30-35 seconds after we committed an offset, we consumed an earlier 
committed message again, and then the loop began. The behavior was the same 
after a restart of all the consumers. The behavior then stopped after a while 
all by itself.

We have no clue how to proceed, or whether this might be an issue with Akka. 
But is there any known issue that might cause this?

Attaching a screenshot with metrics showing the lag for one partition.

 


> Offset jumps back after commit
> --
>
> Key: KAFKA-8375
> URL: https://issues.apache.org/jira/browse/KAFKA-8375
> Project: Kafka
>  Issue Type: Bug
>  Components: offset manager
>Affects Versions: 1.1.1
>Reporter: Markus Dybeck
>Priority: Major
> Attachments: partition_lag_metrics.png
>
>
> *Setup*
> Kafka: 1.1.1
>  Kafka-client: 1.1.1
>  Zookeeper: 3.4.11
>  Akka streams: 0.20
> *Topic config*
> DELETE_RETENTION_MS_CONFIG: "5000"
>  CLEANUP_POLICY_CONFIG: "compact,delete"
>  RETENTION_BYTES_CONFIG: 2000L
>  RETENTION_MS_CONFIG: 3600
> *Consumer config*
> AUTO_OFFSET_RESET_CONFIG: "earliest"
> *Behavior*
> We have 7 consumers consuming from 7 partitions, and the lag of some of the 
> consumers jumped back somewhat randomly. No new messages were pushed to the 
> topic during that time. We didn't see any strange logs, and the brokers did 
> not restart either.
> Even if a restart or rebalance had been going on, we cannot understand why 
> the offset would jump back after it was committed.
> We observed this both in logs and by watching lag metrics. Our logs showed 
> that around 30-35 seconds after we committed an offset, we consumed an 
> earlier committed message again, and then the loop began. The behavior was 
> the same after a restart of all the consumers. The behavior then stopped 
> after a while all by itself.
> We have no clue how to proceed, or whether this might be an issue with Akka. 
> But is there any known issue that might cause this?
> Attaching a screenshot with metrics showing the lag for one partition.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread Manikumar (JIRA)
Manikumar created KAFKA-8376:


 Summary: Flaky test 
ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials 
test.
 Key: KAFKA-8376
 URL: https://issues.apache.org/jira/browse/KAFKA-8376
 Project: Kafka
  Issue Type: Bug
Reporter: Manikumar
 Fix For: 2.3.0
 Attachments: t1.txt

The test hangs and the test run does not complete. I think the flakiness is 
due to timing issues and https://github.com/apache/kafka/pull/5971

Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread Manikumar (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841208#comment-16841208
 ] 

Manikumar commented on KAFKA-8376:
--

[~viktorsomogyi] Can you check if this is related to 
https://github.com/apache/kafka/pull/5971?

cc [~hachikuji]

> Flaky test 
> ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials
>  test.
> 
>
> Key: KAFKA-8376
> URL: https://issues.apache.org/jira/browse/KAFKA-8376
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: t1.txt
>
>
> The test hangs and the test run does not complete. I think the flakiness is 
> due to timing issues and 
> https://github.com/apache/kafka/pull/5971
> Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread Manikumar (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841208#comment-16841208
 ] 

Manikumar edited comment on KAFKA-8376 at 5/16/19 10:44 AM:


[~viktorsomogyi] Can you check if this is related to 
https://github.com/apache/kafka/pull/5971?
cc [~hachikuji]


was (Author: omkreddy):
[~viktorsomogyi] Can you check if this is related to 
https://github.com/apache/kafka/pull/5971.

cc [~hachikuji]

> Flaky test 
> ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials
>  test.
> 
>
> Key: KAFKA-8376
> URL: https://issues.apache.org/jira/browse/KAFKA-8376
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: t1.txt
>
>
> The test hangs and the test run does not complete. I think the flakiness is 
> due to timing issues and 
> https://github.com/apache/kafka/pull/5971
> Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841213#comment-16841213
 ] 

Andrew commented on KAFKA-8315:
---

[~vvcephei] [~ableegoldman] Using the integration test I think I now understand 
what is going on.

The key bit of code is here: 
[https://github.com/the4thamigo-uk/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L332]

What appears to be happening is this:

1) Since the topics are already full of data, the left topic has sufficient 
data (1000 records) to trigger leaving this loop: 
[https://github.com/the4thamigo-uk/kafka/blob/debugging/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L485]. 
So, no right records are fetched.

2) All the fetched left stream records are added to PartitionGroup, and 
PartitionGroup.allBuffered = false, since the right stream RecordQueue is still 
empty.

3) The code drops into here: 
[https://github.com/the4thamigo-uk/kafka/blob/debugging/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L330], 
since maxTaskIdleMs == 0!

4) The 1000 left stream records are processed (thereby almost immediately 
expiring their join windows!).

5) The right stream records are fetched and processed, but there are no 
unexpired left stream join windows to join with, except for the latest records 
in the left stream whose windows have not yet expired.

And the workaround/fix is a change of configuration setting: 
[https://github.com/the4thamigo-uk/kafka/commit/189aa764aef06643a8a3c30b2aee3c4a29b82ae6]

Perhaps this value should default to non-zero to enable historical joins by 
default?
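
For anyone else hitting this, a minimal sketch of the workaround as I 
understand it (the application id, bootstrap address, and idle value below are 
placeholders, not the values from join-example):

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

// Sketch only: give tasks time to buffer records from both sides before
// processing, instead of the default max.task.idle.ms of 0.
Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-example");       // placeholder
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
streamsProps.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 10000L);             // illustrative value
{code}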

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated Kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
> An example of the issue is provided in this repository:
> [https://github.com/the4thamigo-uk/join-example]
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period, e.g. 2 years of minute-by-minute records.
> Related Slack conversations:
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
> Research on this issue has gone through a few phases:
> 1) The issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized}}: i.e. the 
> documentation says to use `Materialized`, not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
> but there is nowhere to pass a `Materialized` instance to the join 
> operation; it seems only the group operation supports it.
> This was considered to be a problem with the documentation, not with the API, 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in timestamp order in both topics, so this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
> 3) Further investigation using a crafted unit test seems to show that the 
> partition selection and ordering (PartitionGroup/RecordQueue) work OK:
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841219#comment-16841219
 ] 

Matthias J. Sax commented on KAFKA-8315:


John mentioned the max task idle config earlier in this discussion: 
https://issues.apache.org/jira/browse/KAFKA-8315?focusedCommentId=16835832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16835832
 
{quote}Perhaps this value should default to non-zero to enable historical joins 
by default?
{quote}
 
 It's not easy to change the default, because of backward compatibility.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated Kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
> An example of the issue is provided in this repository:
> [https://github.com/the4thamigo-uk/join-example]
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period, e.g. 2 years of minute-by-minute records.
> Related Slack conversations:
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
> Research on this issue has gone through a few phases:
> 1) The issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized}}: i.e. the 
> documentation says to use `Materialized`, not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
> but there is nowhere to pass a `Materialized` instance to the join 
> operation; it seems only the group operation supports it.
> This was considered to be a problem with the documentation, not with the API, 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in timestamp order in both topics, so this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
> 3) Further investigation using a crafted unit test seems to show that the 
> partition selection and ordering (PartitionGroup/RecordQueue) work OK:
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7941) Connect KafkaBasedLog work thread terminates when getting offsets fails because broker is unavailable

2019-05-16 Thread Randall Hauch (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841226#comment-16841226
 ] 

Randall Hauch commented on KAFKA-7941:
--

It looks like KAFKA-6608 changed the `consumer.endOffsets(...)` methods in AK 
2.0 to add an optional timeout parameter, with the existing method defaulting 
to the value set by the `request.timeout.ms` consumer property, which itself 
defaults to 30 seconds. This added the possibility of a TimeoutException on 
these methods, which did not exist before AK 2.0.

So, one workaround is to set the `request.timeout.ms` property in the Connect 
worker's configuration, which is used by the worker's consumer for offsets and 
other internal topics. Note that doing this will also affect the producer used 
for internal components, and unless it is overridden via the worker's 
`producer.request.timeout.ms` or `consumer.request.timeout.ms`, it will also 
apply to the producers and consumers used for connectors.
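
For example, a sketch of what that might look like in the worker properties 
file (the values below are illustrative, not recommendations):

{code}
# Sketch only: applies to the worker's internal consumer/producer for the
# offsets, config, and status topics, and (unless overridden below) to the
# clients used for connectors as well.
request.timeout.ms=120000
# Optional overrides for connector-level producers/consumers:
producer.request.timeout.ms=30000
consumer.request.timeout.ms=30000
{code}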

> Connect KafkaBasedLog work thread terminates when getting offsets fails 
> because broker is unavailable
> -
>
> Key: KAFKA-7941
> URL: https://issues.apache.org/jira/browse/KAFKA-7941
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Paul Whalen
>Assignee: Paul Whalen
>Priority: Minor
>
> My team has run into this Connect bug regularly in the last six months while 
> doing infrastructure maintenance that causes intermittent broker availability 
> issues.  I'm a little surprised it exists given how routinely it affects us, 
> so perhaps someone in the know can point out if our setup is somehow just 
> incorrect.  My team is running 2.0.0 on both the broker and client, though 
> from what I can tell from reading the code, the issue continues to exist 
> through 2.2; at least, I was able to write a failing unit test that I believe 
> reproduces it.
> When a {{KafkaBasedLog}} worker thread in the Connect runtime calls 
> {{readLogToEnd}} and brokers are unavailable, the {{TimeoutException}} from 
> the consumer {{endOffsets}} call is uncaught all the way up to the top level 
> {{catch (Throwable t)}}, effectively killing the thread until restarting 
> Connect.  The result is Connect stops functioning entirely, with no 
> indication except for that log line - tasks still show as running.
> The proposed fix is to simply catch and log the {{TimeoutException}}, 
> allowing the worker thread to retry forever.
> Alternatively, perhaps there is not an expectation that Connect should be 
> able to recover following broker unavailability, though that would be 
> disappointing. I would at least hope for a louder failure than the 
> single {{ERROR}} log.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-3816) Provide more context in Kafka Connect log messages using MDC

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841234#comment-16841234
 ] 

ASF GitHub Bot commented on KAFKA-3816:
---

guozhangwang commented on pull request #5743: KAFKA-3816: Add MDC logging to 
Connect runtime
URL: https://github.com/apache/kafka/pull/5743
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Provide more context in Kafka Connect log messages using MDC
> 
>
> Key: KAFKA-3816
> URL: https://issues.apache.org/jira/browse/KAFKA-3816
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.9.0.1
>Reporter: Randall Hauch
>Priority: Critical
>
> Currently it is relatively difficult to correlate individual log messages 
> with the various threads and activities that are going on within a Kafka 
> Connect worker, let alone a cluster of workers. Log messages should provide 
> more context to make this correlation easier and to allow log scraping tools 
> to coalesce related log messages.
> One simple way to do this is by using _mapped diagnostic contexts_, or MDC. 
> This is supported by the SLF4J API, and by the Logback and Log4J logging 
> frameworks.
> Basically, the framework would be changed so that each thread is configured 
> with one or more MDC parameters using the 
> {{org.slf4j.MDC.put(String,String)}} method in SLF4J. Once that thread is 
> configured, all log messages made using that thread have that context. The 
> logs can then be configured to use those parameters.
> It would be ideal to define a convention for connectors and the Kafka Connect 
> framework. A single set of MDC parameters means that the logging framework 
> can use the specific parameters on its message formats.
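
A minimal, generic SLF4J sketch of the idea (the parameter names below are 
illustrative, not the convention the ticket would define):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class MdcExample {
    private static final Logger log = LoggerFactory.getLogger(MdcExample.class);

    public static void main(String[] args) {
        // Configure the current thread's context once; every subsequent log
        // call on this thread carries these values.
        MDC.put("connector", "my-connector"); // illustrative key/value
        MDC.put("task", "0");                 // illustrative key/value
        try {
            log.info("Starting task"); // a pattern such as [%X{connector}|%X{task}] renders the context
        } finally {
            MDC.remove("connector");
            MDC.remove("task");
        }
    }
}
{code}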



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8078) Flaky Test TableTableJoinIntegrationTest#testInnerInner

2019-05-16 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841269#comment-16841269
 ] 

Matthias J. Sax commented on KAFKA-8078:


Different error message: 
[https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/21822/testReport/junit/org.apache.kafka.streams.integration/TableTableJoinIntegrationTest/testInner_caching_enabled___false_/]

 
{quote}java.lang.AssertionError: Expected: is <[D-d]> but: was <[null, D-d]> at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at 
org.apache.kafka.streams.integration.AbstractJoinIntegrationTest.checkResult(AbstractJoinIntegrationTest.java:168)
 at 
org.apache.kafka.streams.integration.AbstractJoinIntegrationTest.runTest(AbstractJoinIntegrationTest.java:209)
 at 
org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInner(TableTableJoinIntegrationTest.java:120){quote}

> Flaky Test TableTableJoinIntegrationTest#testInnerInner
> ---
>
> Key: KAFKA-8078
> URL: https://issues.apache.org/jira/browse/KAFKA-8078
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.3.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> [https://builds.apache.org/blue/organizations/jenkins/kafka-trunk-jdk8/detail/kafka-trunk-jdk8/3445/tests]
> {quote}java.lang.AssertionError: Condition not met within timeout 15000. 
> Never received expected final result.
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:365)
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:325)
> at 
> org.apache.kafka.streams.integration.AbstractJoinIntegrationTest.runTest(AbstractJoinIntegrationTest.java:246)
> at 
> org.apache.kafka.streams.integration.TableTableJoinIntegrationTest.testInnerInner(TableTableJoinIntegrationTest.java:196){quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841273#comment-16841273
 ] 

Andrew commented on KAFKA-8315:
---

Ha! You are absolutely right. Maybe I didn't understand the significance of 
this, or more likely I misinterpreted it, i.e. I thought I needed to set it to 
zero, not non-zero. Either way, we had a lot of other challenges last week 
that side-tracked me from this investigation. Anyway, the deep-dive into Kafka 
internals was well and truly worth it.

Two remaining points:

1) Do you think [https://github.com/apache/kafka/pull/6719] will make it into 
the next release, as we also have a lot of out-of-order data in our production 
feeds?

2) It doesn't seem obvious that, in order to do a historical join, the 
developer has to set a value called 'max task idle milliseconds'. If data is 
in the topic, why should I have to have an 'idle time'? Is there anything that 
can be done to make this more intuitive?

 

 

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated Kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
> An example of the issue is provided in this repository:
> [https://github.com/the4thamigo-uk/join-example]
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period, e.g. 2 years of minute-by-minute records.
> Related Slack conversations:
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
> Research on this issue has gone through a few phases:
> 1) The issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized}}: i.e. the 
> documentation says to use `Materialized`, not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
> but there is nowhere to pass a `Materialized` instance to the join 
> operation; it seems only the group operation supports it.
> This was considered to be a problem with the documentation, not with the API, 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in timestamp order in both topics, so this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
> 3) Further investigation using a crafted unit test seems to show that the 
> partition selection and ordering (PartitionGroup/RecordQueue) work OK:
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-8220) Avoid kicking out members through rebalance timeout

2019-05-16 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-8220.

Resolution: Fixed

> Avoid kicking out members through rebalance timeout
> ---
>
> Key: KAFKA-8220
> URL: https://issues.apache.org/jira/browse/KAFKA-8220
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> As stated in KIP-345, we will no longer evict unjoined members out of the 
> group. We need to take care of the edge case where the leader fails to 
> rejoin, and switch to a new leader in that case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8220) Avoid kicking out members through rebalance timeout

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841319#comment-16841319
 ] 

ASF GitHub Bot commented on KAFKA-8220:
---

hachikuji commented on pull request #: KAFKA-8220 & KIP-345 part-3: Avoid 
kicking out members through rebalance timeout
URL: https://github.com/apache/kafka/pull/
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Avoid kicking out members through rebalance timeout
> ---
>
> Key: KAFKA-8220
> URL: https://issues.apache.org/jira/browse/KAFKA-8220
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> As stated in KIP-345, we will no longer evict unjoined members out of the 
> group. We need to take care of the edge case where the leader fails to 
> rejoin, and switch to a new leader in that case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8347) Choose next record to process by timestamp

2019-05-16 Thread Bill Bejeck (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck updated KAFKA-8347:
---
Fix Version/s: 2.3.0

> Choose next record to process by timestamp
> --
>
> Key: KAFKA-8347
> URL: https://issues.apache.org/jira/browse/KAFKA-8347
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.3.0
>Reporter: Sophie Blee-Goldman
>Assignee: Sophie Blee-Goldman
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently PartitionGroup will determine the next record to process by 
> choosing the partition with the lowest stream time. However if a partition 
> contains out of order data its stream time may be significantly larger than 
> the timestamp of the next record. The next record should instead be chosen as 
> the record with the lowest timestamp across all partitions, regardless of 
> which partition it comes from or what its partition time is.
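
To illustrate the proposed rule, a rough sketch (with simplified stand-ins, 
not the actual PartitionGroup/RecordQueue code):

{code:java}
import java.util.Collection;
import java.util.Deque;

// Simplified stand-ins for PartitionGroup/RecordQueue; illustration only.
class StampedRecord {
    final long timestamp;
    StampedRecord(long timestamp) { this.timestamp = timestamp; }
}

class NextRecordChooser {
    // Pick the partition queue whose head record has the lowest timestamp,
    // rather than the queue with the lowest overall stream time.
    static Deque<StampedRecord> next(Collection<Deque<StampedRecord>> queues) {
        Deque<StampedRecord> chosen = null;
        for (Deque<StampedRecord> q : queues) {
            if (q.isEmpty()) {
                continue;
            }
            if (chosen == null || q.peek().timestamp < chosen.peek().timestamp) {
                chosen = q;
            }
        }
        return chosen; // null if all queues are empty
    }
}
{code}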



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8347) Choose next record to process by timestamp

2019-05-16 Thread Bill Bejeck (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck updated KAFKA-8347:
---
Affects Version/s: 2.3.0

> Choose next record to process by timestamp
> --
>
> Key: KAFKA-8347
> URL: https://issues.apache.org/jira/browse/KAFKA-8347
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.3.0
>Reporter: Sophie Blee-Goldman
>Assignee: Sophie Blee-Goldman
>Priority: Major
>
> Currently PartitionGroup will determine the next record to process by 
> choosing the partition with the lowest stream time. However if a partition 
> contains out of order data its stream time may be significantly larger than 
> the timestamp of the next record. The next record should instead be chosen as 
> the record with the lowest timestamp across all partitions, regardless of 
> which partition it comes from or what its partition time is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8347) Choose next record to process by timestamp

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841383#comment-16841383
 ] 

ASF GitHub Bot commented on KAFKA-8347:
---

bbejeck commented on pull request #6719: KAFKA-8347: Choose next record to 
process by timestamp
URL: https://github.com/apache/kafka/pull/6719
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Choose next record to process by timestamp
> --
>
> Key: KAFKA-8347
> URL: https://issues.apache.org/jira/browse/KAFKA-8347
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Assignee: Sophie Blee-Goldman
>Priority: Major
>
> Currently PartitionGroup will determine the next record to process by 
> choosing the partition with the lowest stream time. However if a partition 
> contains out of order data its stream time may be significantly larger than 
> the timestamp of the next record. The next record should instead be chosen as 
> the record with the lowest timestamp across all partitions, regardless of 
> which partition it comes from or what its partition time is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841378#comment-16841378
 ] 

Matthias J. Sax commented on KAFKA-8315:


1) I think 6719 should make it into the 2.3 release.

2) It's actually a known issue: https://issues.apache.org/jira/browse/KAFKA-7458

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated Kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
> An example of the issue is provided in this repository:
> [https://github.com/the4thamigo-uk/join-example]
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period, e.g. 2 years of minute-by-minute records.
> Related Slack conversations:
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
> Research on this issue has gone through a few phases:
> 1) The issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized}}: i.e. the 
> documentation says to use `Materialized`, not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
> but there is nowhere to pass a `Materialized` instance to the join 
> operation; it seems only the group operation supports it.
> This was considered to be a problem with the documentation, not with the API, 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in timestamp order in both topics, so this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
> 3) Further investigation using a crafted unit test seems to show that the 
> partition selection and ordering (PartitionGroup/RecordQueue) work OK:
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841449#comment-16841449
 ] 

Andrew commented on KAFKA-8315:
---

[~mjsax] [~vvcephei] [~ableegoldman] Thanks for your help on this; I think we 
are good now. Just doing a final test here, but it looks promising.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated Kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
> An example of the issue is provided in this repository:
> [https://github.com/the4thamigo-uk/join-example]
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period, e.g. 2 years of minute-by-minute records.
> Related Slack conversations:
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
> Research on this issue has gone through a few phases:
> 1) The issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized}}: i.e. the 
> documentation says to use `Materialized`, not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
> but there is nowhere to pass a `Materialized` instance to the join 
> operation; it seems only the group operation supports it.
> This was considered to be a problem with the documentation, not with the API, 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in timestamp order in both topics, so this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
> 3) Further investigation using a crafted unit test seems to show that the 
> partition selection and ordering (PartitionGroup/RecordQueue) work OK:
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8367) Non-heap memory leak in Kafka Streams

2019-05-16 Thread Pavel Savov (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841452#comment-16841452
 ] 

Pavel Savov commented on KAFKA-8367:


Hi [~vvcephei] and [~ableegoldman],

Thank you for your suggestions.

Yes, we are using a RocksDB Config Setter (and had used it before upgrading to 
2.2.0 too). The only object we are creating in that setter is an 
org.rocksdb.BlockBasedTableConfig instance:

 
{code:java}
val tableConfig = new org.rocksdb.BlockBasedTableConfig()
tableConfig.setBlockCacheSize(blockCacheSize) // block_cache_size (fetch cache)
tableConfig.setBlockSize(DefaultBlockSize)
tableConfig.setCacheIndexAndFilterBlocks(DefaultCacheIndexAndFilterBlocks)

options.setTableFormatConfig(tableConfig)
{code}
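
For context, a Java sketch of a complete setter along the same lines, using 
the sizes listed in the issue description (an illustration, not our exact 
production class):

{code:java}
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

// Illustration only: mirrors the Scala snippet above with the sizes from the
// issue description (4 MiB block cache, 16 KiB block size, cached index/filter blocks).
public class ExampleRocksDBConfigSetter implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCacheSize(4 * 1024 * 1024L);
        tableConfig.setBlockSize(16 * 1024L);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setTableFormatConfig(tableConfig);
    }
}
{code}

Such a class is registered via {{StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG}}.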
 

I tried building from the latest trunk but I'm afraid it didn't fix the leak.

Please let me know if there is any info I could provide you with that could 
help narrow down the issue.

 

Thanks!

 

> Non-heap memory leak in Kafka Streams
> -
>
> Key: KAFKA-8367
> URL: https://issues.apache.org/jira/browse/KAFKA-8367
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.2.0
>Reporter: Pavel Savov
>Priority: Major
> Attachments: memory-prod.png, memory-test.png
>
>
> We have been observing a non-heap memory leak after upgrading to Kafka 
> Streams 2.2.0 from 2.0.1. We suspect the source to be around RocksDB as the 
> leak only happens when we enable stateful stream operations (utilizing 
> stores). We are aware of *KAFKA-8323* and have created our own fork of 2.2.0 
> and ported the fix scheduled for release in 2.2.1 to our fork. It did not 
> stop the leak, however.
> We are having this memory leak in our production environment where the 
> consumer group is auto-scaled in and out in response to changes in traffic 
> volume, and in our test environment where we have two consumers, no 
> autoscaling and relatively constant traffic.
> Below is some information I'm hoping will be of help:
>  * RocksDB Config:
> Block cache size: 4 MiB
> Write buffer size: 2 MiB
> Block size: 16 KiB
> Cache index and filter blocks: true
> Manifest preallocation size: 64 KiB
> Max write buffer number: 3
> Max open files: 6144
>  
>  * Memory usage in production
> The attached graph (memory-prod.png) shows memory consumption for each 
> instance as a separate line. The horizontal red line at 6 GiB is the memory 
> limit.
> As illustrated on the attached graph from production, memory consumption in 
> running instances goes up around autoscaling events (scaling the consumer 
> group either in or out) and associated rebalancing. It stabilizes until the 
> next autoscaling event but it never goes back down.
> An example of scaling out can be seen from around 21:00 hrs where three new 
> instances are started in response to a traffic spike.
> Just after midnight traffic drops and some instances are shut down. Memory 
> consumption in the remaining running instances goes up.
> Memory consumption climbs again from around 6:00AM due to increased traffic 
> and new instances are being started until around 10:30AM. Memory consumption 
> never drops until the cluster is restarted around 12:30.
>  
>  * Memory usage in test
> As illustrated by the attached graph (memory-test.png) we have a fixed number 
> of two instances in our test environment and no autoscaling. Memory 
> consumption rises linearly until it reaches the limit (around 2:00 AM on 
> 5/13) and Mesos restarts the offending instances, or we restart the cluster 
> manually.
>  
>  * No heap leaks observed
>  * Window retention: 2 or 11 minutes (depending on operation type)
>  * Issue not present in Kafka Streams 2.0.1
>  * No memory leak for stateless stream operations (when no RocksDB stores are 
> used)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-6817) UnknownProducerIdException when writing messages with old timestamps

2019-05-16 Thread Raman Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841464#comment-16841464
 ] 

Raman Gupta edited comment on KAFKA-6817 at 5/16/19 3:30 PM:
-

Same problem here, with a stream that is intended to migrate data from a topic 
to another topic. The stream was just created -- minutes after it was created 
we get this exception, which, from a user's perspective, makes no sense given 
that the producer was just created. Retention time on the topic is infinite 
(-1). Can the priority of this be increased, given there is no clear workaround?


was (Author: rocketraman):
Same problem here, with a stream that is intended to migrate data from a topic 
to another topic. The stream was just created -- minutes after it was created 
we get this exception, which makes no sense given that the producer was just 
created. Retention time on the topic is infinite (-1). Can the priority of this 
be increased, given there is no clear workaround?

> UnknownProducerIdException when writing messages with old timestamps
> 
>
> Key: KAFKA-6817
> URL: https://issues.apache.org/jira/browse/KAFKA-6817
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Odin Standal
>Priority: Major
>
> We are seeing the following exception in our Kafka application: 
> {code:java}
> ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer 
> due to the following error: org.apache.kafka.streams.errors.StreamsException: 
> task [0_0] Abort sending since an error caught with a previous record (key 
> 22 value some-value timestamp 1519200902670) to topic 
> exactly-once-test-topic- v2 due to This exception is raised by the broker if 
> it could not locate the producer metadata associated with the producerId in 
> question. This could happen if, for instance, the producer's records were 
> deleted because their retention time had elapsed. Once the last records of 
> the producerId are removed, the producer's metadata is removed from the 
> broker, and future appends by the producer will return this exception. at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
>  at 
> org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
>  at 
> org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
>  at 
> org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627) 
> at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596) 
> at 
> org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74) 
> at 
> org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
>  at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101) 
> at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
>  at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474) at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239) at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.kafka.common.errors.UnknownProducerIdException
> {code}
> We discovered this error when we had the need to reprocess old messages. See 
> more details on 
> [Stackoverflow|https://stackoverflow.com/questions/49872827/unknownproduceridexception-in-kafka-streams-when-enabling-exactly-once?noredirect=1#comment86901917_49872827]
> We have reproduced the error with a smaller example application. The error 
> occurs after 10 minutes of producing messages that have old timestamps 
> (typically 1 year old). The topic we are writing to has retention.ms set to 1 year, so 
> we are expecting the messages to stay there.
> After digging through the ProducerStateManager-code in the Kafka source code 
> we have a theory of what might be wrong.
> The ProducerStateManager.removeExpiredProducers() seems to remove producers 
> from memory erroneously when processing records which are older than the 
> maxProducerIdExpirationMs (coming from 

[jira] [Commented] (KAFKA-6817) UnknownProducerIdException when writing messages with old timestamps

2019-05-16 Thread Raman Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841464#comment-16841464
 ] 

Raman Gupta commented on KAFKA-6817:


Same problem here, with a stream that is intended to migrate data from a topic 
to another topic. The stream was just created -- minutes after it was created 
we get this exception, which makes no sense given that the producer was just 
created. Retention time on the topic is infinite (-1). Can the priority of this 
be increased, given there is no clear workaround?

> UnknownProducerIdException when writing messages with old timestamps
> 
>
> Key: KAFKA-6817
> URL: https://issues.apache.org/jira/browse/KAFKA-6817
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Odin Standal
>Priority: Major
>
> We are seeing the following exception in our Kafka application: 
> {code:java}
> ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
> due to the following error:
> org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort sending
> since an error caught with a previous record (key 22 value some-value
> timestamp 1519200902670) to topic exactly-once-test-topic-v2 due to
> This exception is raised by the broker if it could not locate the producer
> metadata associated with the producerId in question. This could happen if,
> for instance, the producer's records were deleted because their retention
> time had elapsed. Once the last records of the producerId are removed, the
> producer's metadata is removed from the broker, and future appends by the
> producer will return this exception.
> at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
> at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
> at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
> at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
> at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
> at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
> at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
> at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
> at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
> at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
> at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
> at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
> at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
> at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
> {code}
> We discovered this error when we had the need to reprocess old messages. See 
> more details on 
> [Stackoverflow|https://stackoverflow.com/questions/49872827/unknownproduceridexception-in-kafka-streams-when-enabling-exactly-once?noredirect=1#comment86901917_49872827]
> We have reproduced the error with a smaller example application. The error 
> occurs after 10 minutes of producing messages that have old timestamps 
> (typically 1 year old). The topic we are writing to has a retention.ms set to 
> 1 year so we are expecting the messages to stay there.
> After digging through the ProducerStateManager-code in the Kafka source code 
> we have a theory of what might be wrong.
> The ProducerStateManager.removeExpiredProducers() seems to remove producers 
> from memory erroneously when processing records which are older than the 
> maxProducerIdExpirationMs (coming from the `transactional.id.expiration.ms` 
> configuration), which is set by default to 7 days. 
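Since the broker-side setting in question is transactional.id.expiration.ms, it can 
be worth confirming what a broker is actually running with before reasoning about 
expiry. Below is a minimal sketch that reads the value back with the admin client; 
the bootstrap address and the broker id "0" are placeholders, not values taken from 
this report.

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckProducerIdExpiration {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker-level configuration; "0" is the id of the broker to inspect.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            // Defaults to 7 days; raising it in server.properties delays removal of
            // producer metadata and so delays UnknownProducerIdException on old data.
            System.out.println(config.get("transactional.id.expiration.ms").value());
        }
    }
}
{code}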



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-8376:
--

Assignee: Jason Gustafson

> Flaky test 
> ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials
>  test.
> 
>
> Key: KAFKA-8376
> URL: https://issues.apache.org/jira/browse/KAFKA-8376
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: t1.txt
>
>
> The test is going into hang state and test run was not completing. I think 
> the flakiness is due to timing issues and 
> https://github.com/apache/kafka/pull/5971
> Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread Jason Gustafson (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841478#comment-16841478
 ] 

Jason Gustafson commented on KAFKA-8376:


This could be caused by KAFKA-8275. I will have a patch ready soon.

> Flaky test 
> ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials
>  test.
> 
>
> Key: KAFKA-8376
> URL: https://issues.apache.org/jira/browse/KAFKA-8376
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: t1.txt
>
>
> The test is going into hang state and test run was not completing. I think 
> the flakiness is due to timing issues and 
> https://github.com/apache/kafka/pull/5971
> Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841496#comment-16841496
 ] 

ASF GitHub Bot commented on KAFKA-8376:
---

hachikuji commented on pull request #6746: KAFKA-8376; Least loaded node should 
consider connections which are being prepared
URL: https://github.com/apache/kafka/pull/6746
 
 
   This fixes a regression caused by KAFKA-8275. The least loaded node 
selection should take into account nodes which are currently being connected to. 
This includes both the CONNECTING and CHECKING_API_VERSIONS states, since 
`canSendRequest` would return false in either case.
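   For illustration only, a rough sketch of the selection rule described above; 
the ClientNode type and its states are invented stand-ins for the client's view of 
a broker, not the actual NetworkClient code.

{code:java}
import java.util.List;

public class LeastLoadedSketch {

    enum ConnState { DISCONNECTED, CONNECTING, CHECKING_API_VERSIONS, READY }

    // Hypothetical stand-in for how the client sees a broker.
    static final class ClientNode {
        final String id;
        final ConnState state;
        final int inFlightRequests;

        ClientNode(String id, ConnState state, int inFlightRequests) {
            this.id = id;
            this.state = state;
            this.inFlightRequests = inFlightRequests;
        }
    }

    static ClientNode leastLoaded(List<ClientNode> nodes) {
        ClientNode bestReady = null;   // ready node with the fewest in-flight requests
        ClientNode inProgress = null;  // a connection attempt that is already underway

        for (ClientNode n : nodes) {
            if (n.state == ConnState.READY
                    && (bestReady == null || n.inFlightRequests < bestReady.inFlightRequests)) {
                bestReady = n;
            } else if (n.state == ConnState.CONNECTING || n.state == ConnState.CHECKING_API_VERSIONS) {
                // Treat an in-flight connection attempt as progress instead of ignoring it.
                inProgress = n;
            }
        }
        if (bestReady != null) return bestReady;
        if (inProgress != null) return inProgress; // avoid opening yet another connection
        return nodes.isEmpty() ? null : nodes.get(0);
    }
}
{code}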
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky test 
> ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials
>  test.
> 
>
> Key: KAFKA-8376
> URL: https://issues.apache.org/jira/browse/KAFKA-8376
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: t1.txt
>
>
> The test is going into hang state and test run was not completing. I think 
> the flakiness is due to timing issues and 
> https://github.com/apache/kafka/pull/5971
> Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6852) Allow Log Levels to be dynamically configured

2019-05-16 Thread Addison (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841555#comment-16841555
 ] 

Addison commented on KAFKA-6852:


[~yevabyzek] I think this is a duplicate of 
https://issues.apache.org/jira/browse/KAFKA-7800

> Allow Log Levels to be dynamically configured
> -
>
> Key: KAFKA-6852
> URL: https://issues.apache.org/jira/browse/KAFKA-6852
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.0.1
>Reporter: Yeva Byzek
>Priority: Major
>  Labels: supportability, usability
>
> For operational workflows like troubleshooting, it is useful to change log 
> levels to get more information.
> The current two ways to change log levels is painful:
> 1. Changing logging level configuration requires restarting processes.  
> Challenge: service disruption
> 2. Can be also done through JMX MBean.  Challenge: for the most part 
> customers don't have JMX exposed to the operators.  It exists as PURELY a way 
> to get stats.
> This jira is to make logging level changes dynamic without service restart 
> and without JMX interface.
> This needs to be done in all Kafka components
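As a stop-gap until this is implemented, the JMX route mentioned above can at least 
be scripted rather than done by hand. A minimal sketch is below; it assumes JMX is 
enabled on the broker (e.g. JMX_PORT=9999) and that the broker registers a 
kafka.Log4jController MBean with a setLogLevel operation, which may differ across 
versions, so verify the object name with jconsole first. The host, port and logger 
name are placeholders.

{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SetBrokerLogLevel {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for the broker.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Assumed MBean name; check with jconsole on your broker version.
            ObjectName logController = new ObjectName("kafka:type=kafka.Log4jController");
            conn.invoke(logController,
                        "setLogLevel",
                        new Object[] {"kafka.request.logger", "DEBUG"},   // placeholder logger
                        new String[] {"java.lang.String", "java.lang.String"});
        }
    }
}
{code}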



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8163) SetSchemaMetadata SMT does not apply to nested types

2019-05-16 Thread Marc Löhe (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841564#comment-16841564
 ] 

Marc Löhe commented on KAFKA-8163:
--

I experience the same problem with Connect 2.2 and Debezium 0.9.5.

Not sure if this is implemented in Connect or in Debezium, though?

> SetSchemaMetadata SMT does not apply to nested types
> 
>
> Key: KAFKA-8163
> URL: https://issues.apache.org/jira/browse/KAFKA-8163
> Project: Kafka
>  Issue Type: Bug
>Reporter: pierre bouvret
>Priority: Minor
>
> In a schema, I want to replace the pg.public.foufou namespace by the 
> pg.other_public.foufou namespace.
> The schema (Envelope from Debezium) has an inner definition for Value also 
> belonging to the pg.public.foufou namespace
> Using a SetSchemaMetadata SMT, the inner namespace is not updated.
> {quote}{
>     "type": "record",
>     "name": "Envelope",
>     "namespace": "pg.other_public.foufou",
>     "fields": [
>         {
>             "name": "before",
>             "type": [
>                 "null",
>                 {
>                     "type": "record",
>                     "name": "Value",
>                     "namespace": "pg.public.foufou",
>                     "fields": [
>                         {
>                             "name": "id",
>                             "type": "int"
>                         },
>                         {
>                             "name": "lib",
>                             "type": [
>                                 "null",
>                                 "string"
>                             ],
>                             "default": null
>                         }
>                     ],
>                     "connect.name": "pg.public.foufou.Value"
>                 }
>             ],
>             "default": null
>         },{quote}
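The behaviour is easy to reproduce outside of Debezium with a hand-built record. The 
sketch below is illustrative only: the topic name and field layout are invented and 
only the nested-struct shape matters; the expected output in the comments reflects 
what is described in this report, i.e. the inner schema name is left untouched.

{code:java}
import java.util.Collections;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.SetSchemaMetadata;

public class NestedNamespaceDemo {
    public static void main(String[] args) {
        // Inner record, as Debezium would name it.
        Schema inner = SchemaBuilder.struct().name("pg.public.foufou.Value")
                .field("id", Schema.INT32_SCHEMA)
                .build();
        // Outer envelope containing the inner record.
        Schema envelope = SchemaBuilder.struct().name("pg.public.foufou.Envelope")
                .field("before", inner)
                .build();
        Struct value = new Struct(envelope).put("before", new Struct(inner).put("id", 1));

        SetSchemaMetadata.Value<SourceRecord> smt = new SetSchemaMetadata.Value<>();
        smt.configure(Collections.singletonMap("schema.name", "pg.other_public.foufou.Envelope"));

        SourceRecord in = new SourceRecord(Collections.emptyMap(), Collections.emptyMap(),
                                           "foufou", envelope, value);          // placeholder topic
        SourceRecord out = smt.apply(in);

        // Expected per this report: only the top-level name is rewritten.
        System.out.println(out.valueSchema().name());                           // pg.other_public.foufou.Envelope
        System.out.println(out.valueSchema().field("before").schema().name());  // pg.public.foufou.Value
    }
}
{code}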



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6852) Allow Log Levels to be dynamically configured

2019-05-16 Thread Yeva Byzek (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841560#comment-16841560
 ] 

Yeva Byzek commented on KAFKA-6852:
---

[~ahuddy] Technically speaking, KAFKA-7800 is a dupe of this one, but as long 
as it gets fixed :)

> Allow Log Levels to be dynamically configured
> -
>
> Key: KAFKA-6852
> URL: https://issues.apache.org/jira/browse/KAFKA-6852
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.0.1
>Reporter: Yeva Byzek
>Priority: Major
>  Labels: supportability, usability
>
> For operational workflows like troubleshooting, it is useful to change log 
> levels to get more information.
> The current two ways to change log levels is painful:
> 1. Changing logging level configuration requires restarting processes.  
> Challenge: service disruption
> 2. Can be also done through JMX MBean.  Challenge: for the most part 
> customers don't have JMX exposed to the operators.  It exists as PURELY a way 
> to get stats.
> This jira is to make logging level changes dynamic without service restart 
> and without JMX interface.
> This needs to be done in all Kafka components



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841571#comment-16841571
 ] 

John Roesler commented on KAFKA-8315:
-

Oh, man. I'm sorry that I didn't convey the significance of that configuration 
enough. I just assumed you tried it. I feel bad for all the time you spent.

It's called "max idle time" and defaults to 0 because Streams actually has to 
resist processing data that it already has (aka, it has to idle) in order to 
wait for extra data. Streams would never idle before we added the config, and 
idling could have a severe impact on throughput for high-volume applications, 
so we basically can't default greater than 0.

Still, it seems like in a replay case like yours, it should at least wait until 
it polls all inputs at least once before starting, so I agree there's room for 
improvement here. I'm wondering if we should just take more control over the 
consumer and explicitly poll each topic, instead of just assigning them all and 
letting the consumer/broker decide which ones to give data back from first.
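For reference, opting in to idling is a one-line change on the application side. A 
minimal sketch, assuming a client version that exposes 
StreamsConfig.MAX_TASK_IDLE_MS_CONFIG; the application id, bootstrap address and the 
5-second value are placeholders.

{code:java}
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class MaxIdleConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-example");       // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        // "Max idle time": how long a task may wait for data on its other input
        // partitions before processing what it already has. The default is 0 (never wait).
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 5000L);
        // Pass props to new KafkaStreams(topology, props) as usual.
    }
}
{code}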

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) the current assumption is that the issue is rooted in the way records are 
> consumed from the topics :
> We have tried to set various options to suppress reads form the source topics 
> but it doesnt seem to make any difference : 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]
>  
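To make point 1) above concrete, below is a minimal sketch of a stream-stream join 
where the window size and grace period are set explicitly on JoinWindows; the topic 
names, serdes and durations are placeholders chosen to mirror the 
two-years-of-minute-data scenario.

{code:java}
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;

public class HistoricalJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> left =
            builder.stream("left-topic", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> right =
            builder.stream("right-topic", Consumed.with(Serdes.String(), Serdes.String()));

        left.join(right,
                  (l, r) -> l + "," + r,
                  // 1-minute join window; grace() bounds how late a record may arrive and
                  // still join, and with it how long the window stores are retained.
                  JoinWindows.of(Duration.ofMinutes(1)).grace(Duration.ofDays(730)),
                  Joined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        // builder.build() and start a KafkaStreams instance as usual.
    }
}
{code}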



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841571#comment-16841571
 ] 

John Roesler edited comment on KAFKA-8315 at 5/16/19 5:42 PM:
--

Oh, man. I'm sorry that I didn't convey the significance of that configuration 
enough. I just assumed you tried it. I feel bad for all the time you spent.

It's called "max idle time" and defaults to 0 because Streams actually has to 
resist processing data that it already has (aka, it has to idle) in order to 
wait for extra data. Streams would never idle before we added the config, and 
idling could have a severe impact on throughput for high-volume applications, 
so we basically can't default greater than 0.

Still, it seems like in a replay case like yours, it should at least wait until 
it polls all inputs at least once before starting, so I agree there's room for 
improvement here. I'm wondering if we should just take more control over the 
consumer and explicitly poll each topic, instead of just assigning them all and 
letting the consumer/broker decide which ones to give data back from first.


was (Author: vvcephei):
Oh, man. I'm sorry that that I didn't convey the significance of that 
configuration enough. I just assumed you tried it. I feel bad for all the time 
you spent.

It's called "max idle time" and defaults to 0 because Streams actually has to 
resist processing data that it already has (aka, it has to idle) in order to 
wait for extra data. Streams would never idle before we added the config, and 
idling could have a severe impact on throughput for high-volume applications, 
so we basically can't default greater than 0.

Still, it seems like in a replay case like yours, it should at least wait until 
it polls all inputs at least once before starting, so I agree there's room for 
improvement here. I'm wondering if we should just take more control over the 
consumer and explicitly poll each topic, instead of just assigning them all and 
letting the consumer/broker decide which ones to give data back from first.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) the current assumption is that the issue is rooted in the way records are 
> consumed from the topics :
> We have tried to set various options to suppress reads form the source topics 
> but it doesnt seem to make any difference : 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277a

[jira] [Commented] (KAFKA-6852) Allow Log Levels to be dynamically configured

2019-05-16 Thread Addison (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841574#comment-16841574
 ] 

Addison commented on KAFKA-6852:


Very true.  Because KAFKA-7800 is linked to the KIP can we close this one out 
as a duplicate?

> Allow Log Levels to be dynamically configured
> -
>
> Key: KAFKA-6852
> URL: https://issues.apache.org/jira/browse/KAFKA-6852
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.0.1
>Reporter: Yeva Byzek
>Priority: Major
>  Labels: supportability, usability
>
> For operational workflows like troubleshooting, it is useful to change log 
> levels to get more information.
> The current two ways to change log levels is painful:
> 1. Changing logging level configuration requires restarting processes.  
> Challenge: service disruption
> 2. Can be also done through JMX MBean.  Challenge: for the most part 
> customers don't have JMX exposed to the operators.  It exists as PURELY a way 
> to get stats.
> This jira is to make logging level changes dynamic without service restart 
> and without JMX interface.
> This needs to be done in all Kafka components



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-6852) Allow Log Levels to be dynamically configured

2019-05-16 Thread Yeva Byzek (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeva Byzek resolved KAFKA-6852.
---
Resolution: Duplicate

This is getting addressed in KAFKA-7800

> Allow Log Levels to be dynamically configured
> -
>
> Key: KAFKA-6852
> URL: https://issues.apache.org/jira/browse/KAFKA-6852
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.0.1
>Reporter: Yeva Byzek
>Priority: Major
>  Labels: supportability, usability
>
> For operational workflows like troubleshooting, it is useful to change log 
> levels to get more information.
> The current two ways to change log levels is painful:
> 1. Changing logging level configuration requires restarting processes.  
> Challenge: service disruption
> 2. Can be also done through JMX MBean.  Challenge: for the most part 
> customers don't have JMX exposed to the operators.  It exists as PURELY a way 
> to get stats.
> This jira is to make logging level changes dynamic without service restart 
> and without JMX interface.
> This needs to be done in all Kafka components



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841571#comment-16841571
 ] 

John Roesler edited comment on KAFKA-8315 at 5/16/19 5:50 PM:
--

Oh, man. I'm sorry that I didn't convey the significance of that configuration 
enough. I just assumed you tried it. I feel bad for all the time you spent.

It's called "max idle time" and defaults to 0 because Streams actually has to 
resist processing data that it already has (aka, it has to idle) in order to 
wait for extra data. Streams would never idle before we added the config, and 
idling could have a severe impact on throughput for high-volume applications, 
so we basically can't default greater than 0.

Still, it seems like in a replay case like yours, it should at least wait until 
it polls all inputs at least once before starting, so I agree there's room for 
improvement here. I'm wondering if we should just take more control over the 
consumer and explicitly poll each topic, instead of just assigning them all and 
letting the consumer/broker decide which ones to give data back from first.

Since you dug in deep enough to figure this out, do you have any ideas, 
[~the4thamigo_uk]?


was (Author: vvcephei):
Oh, man. I'm sorry that I didn't convey the significance of that configuration 
enough. I just assumed you tried it. I feel bad for all the time you spent.

It's called "max idle time" and defaults to 0 because Streams actually has to 
resist processing data that it already has (aka, it has to idle) in order to 
wait for extra data. Streams would never idle before we added the config, and 
idling could have a severe impact on throughput for high-volume applications, 
so we basically can't default greater than 0.

Still, it seems like in a replay case like yours, it should at least wait until 
it polls all inputs at least once before starting, so I agree there's room for 
improvement here. I'm wondering if we should just take more control over the 
consumer and explicitly poll each topic, instead of just assigning them all and 
letting the consumer/broker decide which ones to give data back from first.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) the current assumption is that the issue is rooted in the way records are 
> consumed from the topics :
> We have tried to set various options to suppress reads form the source topics 
> but it doesnt seem to make any difference : 
>

[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew commented on KAFKA-8315:
---

Well, I'm not sure that just fixing the start is sufficient. I suspect that if 
you run over a long enough period, then through sheer bad luck you will encounter 
an issue where a queue is drained and yet also fails to fetch any records on the 
next fetch cycle. In this case you will end up with the same situation, I think, 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too far 
ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you could 
better control how many records you grab to push into the queues. For example, 
I'm not sure, but I have a suspicion that the queues can grow quite large, as the 
limit is on the fetch and the queue can't put back-pressure on the fetch. But if 
you had control over the consumer and you wanted to limit the size of the queue, 
then you would know when one side is full and not fetch more records on that 
side, etc. You would also potentially be able to use streamTime to make such 
decisions, I suppose.
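The kind of back-pressure described above can be expressed with the plain consumer 
API. A rough sketch follows; the topic, partition, group id, buffer bound and the 
in-memory queue are invented for illustration and are not how Streams manages its 
record queues internally.

{code:java}
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PausingFetchSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                    // placeholder
        props.put("group.id", "join-example");                               // placeholder
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        int maxBuffered = 1000;                                              // invented bound
        Deque<ConsumerRecord<String, String>> leftQueue = new ArrayDeque<>();
        TopicPartition leftPartition = new TopicPartition("left-topic", 0);  // placeholder

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singleton(leftPartition));
            while (true) {
                // Stop fetching a side whose buffer is full; resume once it drains.
                if (leftQueue.size() >= maxBuffered) {
                    consumer.pause(Collections.singleton(leftPartition));
                } else {
                    consumer.resume(Collections.singleton(leftPartition));
                }
                consumer.poll(Duration.ofMillis(100)).forEach(leftQueue::add);
                leftQueue.poll(); // placeholder for draining by stream time against the other side
            }
        }
    }
}
{code}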

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) the current assumption is that the issue is rooted in the way records are 
> consumed from the topics :
> We have tried to set various options to suppress reads form the source topics 
> but it doesnt seem to make any difference : 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:16 PM:


Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.


was (Author: the4thamigo_uk):
Well, Im not sure that just fixing the start is sufficient. I suspect that if 
you run over a long enough period, that through sheer bad luck you will 
encounter an issue where a queue is drained, and yet also fails to fetch any 
records on the next fetch cycle, in this case you will end up with the same 
situation I think i.e. an empty queue that will still be processed because of 
the zero max idle time. This can happen on either side of the join, so if one 
side travels too far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pul

[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:28 PM:


Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than right(left) streamTime -right(left) grace - windowSize.


was (Author: the4thamigo_uk):
Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-ex

[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:29 PM:


Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than right(left) streamTime -right(left) grace - windowSize. i.e. not to 
go beyond the first active window on the right(left) stream.


was (Author: the4thamigo_uk):
Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than right(left) streamTime -right(left) grace - windowSize.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supported it seems.
> This was considered to be a problem with the documentation n

[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:30 PM:


Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than the right(left) streamTime -right(left) grace - windowSize. i.e. not 
to go beyond the first active window on the right(left) stream.


was (Author: the4thamigo_uk):
Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than right(left) streamTime -right(left) grace - windowSize. i.e. not to 
go beyond the first active window on the right(left) stream.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to the group operation is supp

[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:32 PM:


Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than the right(left) streamTime -right(left) grace - windowSize. i.e. not 
to go beyond the start of the first currently active window on the right(left) 
stream.


was (Author: the4thamigo_uk):
Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left records), provided the left(right) streamTime stays 
less than the right(left) streamTime -right(left) grace - windowSize. i.e. not 
to go beyond the first active window on the right(left) stream.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals that the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is no where to pass a `Materialized` instance to the join 
> operation, only to

[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841611#comment-16841611
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:40 PM:


Well, Im not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained, and yet also fails to fetch any records on the 
next fetch cycle, in this case you will end up with the same situation I think 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you can 
control better how many records you grab to push into the queues. For example, 
I'm not sure, but I kinda have a suspicion that the queues can grow quite 
large, as the limit is on the fetch, and the queue cant put back-pressure on 
the fetch. But, if you had control over the consumer, and you wanted to limit 
the size of the queue, then you would know if one side is full then you don't 
need to not fetch more records on that side etc. You would also potentially be 
able to use streamTime to make such decisions I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left) records), provided the left(right) streamTime stays 
less than the right(left) streamTime - right(left) grace - windowSize, i.e. not 
to go beyond the start of the first currently active window on the right(left) 
stream.
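
For illustration, a minimal sketch of that condition (names are illustrative, not Streams internals):

{code:java}
// Hedged sketch: the left side may advance without right-side records only while
// its stream time stays below the start of the earliest window that could still
// be active (i.e. not yet closed) on the right side.
static boolean leftMayAdvance(final long leftStreamTime,
                              final long rightStreamTime,
                              final long rightGraceMs,
                              final long windowSizeMs) {
    return leftStreamTime < rightStreamTime - rightGraceMs - windowSizeMs;
}
{code}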

Lastly, it may also be worth waiting for at least one fetch that returned 0 
records from a topic before moving on. I think that currently simple latency, or 
the fetch limit, can trigger this, no?


was (Author: the4thamigo_uk):
Well, I'm not sure that just fixing the start is sufficient. I suspect that, if 
you run over a long enough period, through sheer bad luck you will encounter an 
issue where a queue is drained and yet also fails to fetch any records on the 
next fetch cycle; in this case you will end up with the same situation, I think, 
i.e. an empty queue that will still be processed because of the zero max idle 
time. This can happen on either side of the join, so if one side travels too 
far ahead of the other you would lose the join windows.

As I was rummaging through the code, I also wondered whether a closer 
relationship with the consumer would make things a bit easier. Then you could 
better control how many records you grab to push into the queues. For example, 
I'm not sure, but I have a suspicion that the queues can grow quite 
large, as the limit is on the fetch and the queue can't put back-pressure on 
the fetch. But if you had control over the consumer and you wanted to limit 
the size of the queue, then you would know when one side is full and could 
avoid fetching more records on that side, etc. You would also potentially be 
able to use streamTime to make such decisions, I suppose.

I also wondered whether you could allow the left(right) stream to process (in 
the absence of right(left) records), provided the left(right) streamTime stays 
less than the right(left) streamTime - right(left) grace - windowSize, i.e. not 
to go beyond the start of the first currently active window on the right(left) 
stream.

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `Jo

[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841635#comment-16841635
 ] 

Matthias J. Sax commented on KAFKA-8315:


{quote}as the limit is on the fetch, and the queue can't put back-pressure on 
the fetch.
{quote}
Queues have a configurable size and partitions are paused if a queue fills up. 
Hence, setting `max.task.idle` large enough should be sufficient, imho.
{quote}I also wondered whether you could allow the left(right) stream to 
process (in the absence of right(left) records), provided the left(right) 
streamTime stays less than the right(left) streamTime - right(left) grace - 
windowSize, i.e. not to go beyond the start of the first currently active 
window on the right(left) stream.
{quote}
Not easily possible. At runtime we don't know anything about the semantics of 
the program. We don't know that a join is being executed, and retention time is 
also not a runtime concept.
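
For reference, a minimal sketch of the settings mentioned above (assuming the Streams config names `buffered.records.per.partition` and `max.task.idle.ms`; the values and application id are illustrative):

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class JoinBackPressureConfig {
    // Hedged sketch: bound the per-partition record queues (a full queue pauses the
    // partition and back-pressures the fetch) and let tasks idle briefly when one
    // join input has no buffered records, instead of advancing on the other side.
    public static Properties build() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-example");        // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
        props.put(StreamsConfig.BUFFERED_RECORDS_PER_PARTITION_CONFIG, 1000);  // queue size per partition
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 10_000L);             // wait up to 10s for the empty side
        return props;
    }
}
{code}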

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is nowhere to pass a `Materialized` instance to the join 
> operation; only the group operation supports it, it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841640#comment-16841640
 ] 

Andrew commented on KAFKA-8315:
---

Also, I had a thought about whether you could modify 
[https://github.com/the4thamigo-uk/kafka/blob/debugging/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L485]
 so that at least some records are fetched for a topic if there is a completed 
fetch for it. I think it drains the first one and then the next, but you could 
imagine round-robin, or timestamp-based, extraction from more than one completed 
fetch, in order to balance the delivery across streams?

All, just vague hand-wavey thoughts, I appreciate that there is much more to it 
than I comprehend... :oD
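
A stand-alone sketch of the round-robin idea (illustrative only, not the actual Fetcher code):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class RoundRobinDrain {
    // Hedged sketch: instead of exhausting one completed fetch before touching the
    // next, take a bounded slice from each non-empty buffer in turn, so delivery is
    // balanced across the input streams.
    public static <R> List<R> drain(final Map<String, Queue<R>> buffersByTopic,
                                    final int maxPerBuffer) {
        final List<R> out = new ArrayList<>();
        for (final Queue<R> buffer : buffersByTopic.values()) {
            for (int i = 0; i < maxPerBuffer && !buffer.isEmpty(); i++) {
                out.add(buffer.poll());
            }
        }
        return out;
    }
}
{code}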

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is nowhere to pass a `Materialized` instance to the join 
> operation; only the group operation supports it, it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8315) Historical join issues

2019-05-16 Thread Andrew (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841640#comment-16841640
 ] 

Andrew edited comment on KAFKA-8315 at 5/16/19 6:55 PM:


Also, I had a thought about whether you could modify 
[https://github.com/the4thamigo-uk/kafka/blob/debugging/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L485]
 so that at least some records are fetched for a topic, provided there is a 
completed fetch for it. I think it drains the first one and then the next, but you 
could imagine round-robin, or timestamp-based, extraction from more than one 
completed fetch, in order to balance the delivery across streams?

All, just vague hand-wavey thoughts, I appreciate that there is much more to it 
than I comprehend... :oD


was (Author: the4thamigo_uk):
Also, I had a thought about whether you could modify 
[https://github.com/the4thamigo-uk/kafka/blob/debugging/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L485]
 so that at least some records are fetched for a topic if there is a completed 
fetch for it. I think it drains the first one and then the next, but you could 
imagine round-robin, or timestamp-based, extraction from more than one completed 
fetch, in order to balance the delivery across streams?

All, just vague hand-wavey thoughts, I appreciate that there is much more to it 
than I comprehend... :oD

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is nowhere to pass a `Materialized` instance to the join 
> operation; only the group operation supports it, it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8256) Replace Heartbeat request/response with automated protocol

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841686#comment-16841686
 ] 

ASF GitHub Bot commented on KAFKA-8256:
---

hachikuji commented on pull request #6691: KAFKA-8256: Replace Heartbeat 
request/response with automated protocol
URL: https://github.com/apache/kafka/pull/6691
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Replace Heartbeat request/response with automated protocol
> --
>
> Key: KAFKA-8256
> URL: https://issues.apache.org/jira/browse/KAFKA-8256
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Mickael Maison
>Assignee: Mickael Maison
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-8256) Replace Heartbeat request/response with automated protocol

2019-05-16 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-8256.

Resolution: Fixed

> Replace Heartbeat request/response with automated protocol
> --
>
> Key: KAFKA-8256
> URL: https://issues.apache.org/jira/browse/KAFKA-8256
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Mickael Maison
>Assignee: Mickael Maison
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8377) KTable#transformValue might lead to incorrect result in joins

2019-05-16 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created KAFKA-8377:
--

 Summary: KTable#transformValue might lead to incorrect result in 
joins
 Key: KAFKA-8377
 URL: https://issues.apache.org/jira/browse/KAFKA-8377
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 2.0.0
Reporter: Matthias J. Sax


Kafka Streams uses an optimization to not materialize every result KTable. If a 
non-materialized KTable is input to a join, the lookup into the table results 
in a lookup of the parent's table plus a call to the operator. For example,
{code:java}
KTable nonMaterialized = materializedTable.filter(...);
KTable table2 = ...

table2.join(nonMaterialized,...){code}
If there is a table2 input record, the lookup to the other side is performed as 
a lookup into materializedTable plus applying the filter().

For stateless operations like filter(), this is safe. However, #transformValues() 
might have an attached state store. Hence, when an input record r is processed 
by #transformValues() with current state S, it might produce an output record 
r' (that is not materialized). When the join later does a lookup to get r from 
the parent table, there is no guarantee that #transformValues() again produces 
r', because its state might not be the same any longer.

Hence, it seems to be required to always materialize the result of a 
KTable#transformValues() operation if there is state. Note that if there were 
a consecutive filter() after transformValues(), it would also be ok to 
materialize the filter() result. Furthermore, if there is no downstream join(), 
materialization is also not required.

Basically, it seems to be unsafe to apply `KTableValueGetter` on a stateful 
`#transformValues()` operator.
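
For illustration, a hedged sketch of forcing materialization of a stateful transformValues() result before the join, extending the snippet above (the transformer, store names, and value types are hypothetical):

{code:java}
// Hedged sketch (hypothetical types and names): materializing the stateful
// transformValues() result means the join's value getter reads the stored r'
// instead of re-invoking the transformer with possibly different state.
KTable<String, Long> transformed = materializedTable.transformValues(
    MyStatefulTransformer::new,           // hypothetical ValueTransformerWithKeySupplier
    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("transformed-result-store"),
    "transformer-state-store");           // hypothetical state store used by the transformer

KTable<String, String> joined = table2.join(transformed, (v2, t) -> v2 + ":" + t);
{code}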



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8376) Flaky test ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials test.

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841701#comment-16841701
 ] 

ASF GitHub Bot commented on KAFKA-8376:
---

ijuma commented on pull request #6746: KAFKA-8376; Least loaded node should 
consider connections which are being prepared
URL: https://github.com/apache/kafka/pull/6746
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky test 
> ClientAuthenticationFailureTest.testTransactionalProducerWithInvalidCredentials
>  test.
> 
>
> Key: KAFKA-8376
> URL: https://issues.apache.org/jira/browse/KAFKA-8376
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: t1.txt
>
>
> The test goes into a hung state and the test run does not complete. I think 
> the flakiness is due to timing issues and 
> https://github.com/apache/kafka/pull/5971
> Attaching the thread dump.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8315) Historical join issues

2019-05-16 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841710#comment-16841710
 ] 

Matthias J. Sax commented on KAFKA-8315:


I don't think that should be required. The way Kafka Streams' record queues work 
should actually work around this behavior: if a partition is paused, `poll()` 
won't return any data for this partition, even if the consumer has buffered 
data.
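
A minimal stand-alone sketch of that behavior with the plain consumer (broker address, group id, and topic name are illustrative):

{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PauseDemo {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker
        props.put("group.id", "pause-demo");               // illustrative group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            final TopicPartition tp = new TopicPartition("left-topic", 0); // illustrative topic
            consumer.assign(Collections.singletonList(tp));
            consumer.pause(Collections.singletonList(tp));  // queue "full": stop delivering
            final ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            System.out.println("records while paused: " + records.count()); // always 0
            consumer.resume(Collections.singletonList(tp)); // queue drained: deliver again
        }
    }
}
{code}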

> Historical join issues
> --
>
> Key: KAFKA-8315
> URL: https://issues.apache.org/jira/browse/KAFKA-8315
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Andrew
>Assignee: John Roesler
>Priority: Major
> Attachments: code.java
>
>
> The problem we are experiencing is that we cannot reliably perform simple 
> joins over pre-populated kafka topics. This seems more apparent where one 
> topic has records at less frequent record timestamp intervals than the other.
>  An example of the issue is provided in this repository :
> [https://github.com/the4thamigo-uk/join-example]
>  
> The only way to increase the period of historically joined records is to 
> increase the grace period for the join windows, and this has repercussions 
> when you extend it to a large period e.g. 2 years of minute-by-minute records.
> Related slack conversations : 
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1556799561287300]
> [https://confluentcommunity.slack.com/archives/C48AHTCUQ/p1557733979453900]
>  
>  Research on this issue has gone through a few phases :
> 1) This issue was initially thought to be due to the inability to set the 
> retention period for a join window via {{Materialized: i.e.}}
> The documentation says to use `Materialized` not `JoinWindows.until()` 
> ([https://kafka.apache.org/22/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html#until-long-]),
>  but there is nowhere to pass a `Materialized` instance to the join 
> operation; only the group operation supports it, it seems.
> This was considered to be a problem with the documentation not with the API 
> and is addressed in [https://github.com/apache/kafka/pull/6664]
> 2) We then found an apparent issue in the code which would affect the 
> partition that is selected to deliver the next record to the join. This would 
> only be a problem for data that is out-of-order, and join-example uses data 
> that is in order of timestamp in both topics. So this fix is thought not to 
> affect join-example.
> This was considered to be an issue and is being addressed in 
> [https://github.com/apache/kafka/pull/6719]
>  3) Further investigation using a crafted unit test seems to show that the 
> partition-selection and ordering (PartitionGroup/RecordQueue) seems to work ok
> [https://github.com/the4thamigo-uk/kafka/commit/5121851491f2fd0471d8f3c49940175e23a26f1b]
> 4) The current assumption is that the issue is rooted in the way records are 
> consumed from the topics:
> We have tried to set various options to suppress reads from the source topics 
> but it doesn't seem to make any difference: 
> [https://github.com/the4thamigo-uk/join-example/commit/c674977fd0fdc689152695065d9277abea6bef63]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8309) KIP-465: Add Consolidated Connector Endpoint to Connect REST API

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841739#comment-16841739
 ] 

ASF GitHub Bot commented on KAFKA-8309:
---

rhauch commented on pull request #6658: KAFKA-8309: Add Consolidated Connector 
Endpoint to Connect REST API
URL: https://github.com/apache/kafka/pull/6658
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KIP-465: Add Consolidated Connector Endpoint to Connect REST API
> 
>
> Key: KAFKA-8309
> URL: https://issues.apache.org/jira/browse/KAFKA-8309
> Project: Kafka
>  Issue Type: Improvement
>Reporter: dan norwood
>Priority: Major
>
> {color:#33}https://cwiki.apache.org/confluence/display/KAFKA/KIP-465%3A+Add+Consolidated+Connector+Endpoint+to+Connect+REST+API{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8367) Non-heap memory leak in Kafka Streams

2019-05-16 Thread Sophie Blee-Goldman (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841748#comment-16841748
 ] 

Sophie Blee-Goldman edited comment on KAFKA-8367 at 5/16/19 9:46 PM:
-

Hm. I notice that in going from 2.0 to 2.1 we upgraded Rocks from 5.7.3 to 5.15.10, 
and I would like to rule out the possibility that the leak is in Rocks itself. If 
we can test the older version of Rocks with the newer version of Streams, that 
should help us isolate the problem. I opened a quick branch off 2.2 with Rocks 
downgraded to v5.7 – can you build from this 
[https://github.com/ableegoldman/kafka/tree/testRocksDBleak] and see if the 
leak is still present?

Did your RocksDBConfigSetter before 2.2.0 use/set the same configs, or did any 
of those change? I agree your ConfigSetter shouldn't be leaking; I'm just trying 
to get all the details. It might also be worth investigating whether the leak is 
present since 2.1 or just in 2.2.


was (Author: ableegoldman):
Hm. I notice that in going from 2.0 to 2.1 we upgraded Rocks from 5.7.3 to 5.15.10, 
and I would like to rule out the possibility that the leak is in Rocks itself. If 
we can test the older version of Rocks with the newer version of Streams, that 
should help us isolate the problem. I opened a quick branch off 2.2 with Rocks 
downgraded to v5.7 – can you build from 
[this|[https://github.com/ableegoldman/kafka/tree/testRocksDBleak]] and see if 
the leak is still present?

Did your RocksDBConfigSetter before 2.2.0 use/set the same configs, or did any 
of those change? I agree your ConfigSetter shouldn't be leaking; I'm just trying 
to get all the details. It might also be worth investigating whether the leak is 
present since 2.1 or just in 2.2.

> Non-heap memory leak in Kafka Streams
> -
>
> Key: KAFKA-8367
> URL: https://issues.apache.org/jira/browse/KAFKA-8367
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.2.0
>Reporter: Pavel Savov
>Priority: Major
> Attachments: memory-prod.png, memory-test.png
>
>
> We have been observing a non-heap memory leak after upgrading to Kafka 
> Streams 2.2.0 from 2.0.1. We suspect the source to be around RocksDB as the 
> leak only happens when we enable stateful stream operations (utilizing 
> stores). We are aware of *KAFKA-8323* and have created our own fork of 2.2.0 
> and ported the fix scheduled for release in 2.2.1 to our fork. It did not 
> stop the leak, however.
> We are having this memory leak in our production environment where the 
> consumer group is auto-scaled in and out in response to changes in traffic 
> volume, and in our test environment where we have two consumers, no 
> autoscaling and relatively constant traffic.
> Below is some information I'm hoping will be of help:
>  * RocksDB Config:
> Block cache size: 4 MiB
> Write buffer size: 2 MiB
> Block size: 16 KiB
> Cache index and filter blocks: true
> Manifest preallocation size: 64 KiB
> Max write buffer number: 3
> Max open files: 6144
>  
>  * Memory usage in production
> The attached graph (memory-prod.png) shows memory consumption for each 
> instance as a separate line. The horizontal red line at 6 GiB is the memory 
> limit.
> As illustrated on the attached graph from production, memory consumption in 
> running instances goes up around autoscaling events (scaling the consumer 
> group either in or out) and associated rebalancing. It stabilizes until the 
> next autoscaling event but it never goes back down.
> An example of scaling out can be seen from around 21:00 hrs where three new 
> instances are started in response to a traffic spike.
> Just after midnight traffic drops and some instances are shut down. Memory 
> consumption in the remaining running instances goes up.
> Memory consumption climbs again from around 6:00AM due to increased traffic 
> and new instances are being started until around 10:30AM. Memory consumption 
> never drops until the cluster is restarted around 12:30.
>  
>  * Memory usage in test
> As illustrated by the attached graph (memory-test.png) we have a fixed number 
> of two instances in our test environment and no autoscaling. Memory 
> consumption rises linearly until it reaches the limit (around 2:00 AM on 
> 5/13) and Mesos restarts the offending instances, or we restart the cluster 
> manually.
>  
>  * No heap leaks observed
>  * Window retention: 2 or 11 minutes (depending on operation type)
>  * Issue not present in Kafka Streams 2.0.1
>  * No memory leak for stateless stream operations (when no RocksDB stores are 
> used)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8367) Non-heap memory leak in Kafka Streams

2019-05-16 Thread Sophie Blee-Goldman (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841748#comment-16841748
 ] 

Sophie Blee-Goldman commented on KAFKA-8367:


Hm. I notice that in going from 2.0 to 2.1 we upgraded Rocks from 5.7.3 to 5.15.10, 
and I would like to rule out the possibility that the leak is in Rocks itself. If 
we can test the older version of Rocks with the newer version of Streams, that 
should help us isolate the problem. I opened a quick branch off 2.2 with Rocks 
downgraded to v5.7 – can you build from 
[this|[https://github.com/ableegoldman/kafka/tree/testRocksDBleak]] and see if 
the leak is still present?

Did your RocksDBConfigSetter before 2.2.0 use/set the same configs, or did any 
of those change? I agree your ConfigSetter shouldn't be leaking; I'm just trying 
to get all the details. It might also be worth investigating whether the leak is 
present since 2.1 or just in 2.2.
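
For reference, the RocksDB settings reported below correspond roughly to a config setter along these lines (a hedged sketch against the RocksDB 5.x Java API; the class name is illustrative):

{code:java}
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class ReportedRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCacheSize(4 * 1024 * 1024L);   // block cache size: 4 MiB
        tableConfig.setBlockSize(16 * 1024L);              // block size: 16 KiB
        tableConfig.setCacheIndexAndFilterBlocks(true);    // cache index and filter blocks
        options.setTableFormatConfig(tableConfig);
        options.setWriteBufferSize(2 * 1024 * 1024L);      // write buffer size: 2 MiB
        options.setMaxWriteBufferNumber(3);                // max write buffer number: 3
        options.setMaxOpenFiles(6144);                     // max open files: 6144
        options.setManifestPreallocationSize(64 * 1024L);  // manifest preallocation: 64 KiB
    }
}
{code}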

> Non-heap memory leak in Kafka Streams
> -
>
> Key: KAFKA-8367
> URL: https://issues.apache.org/jira/browse/KAFKA-8367
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.2.0
>Reporter: Pavel Savov
>Priority: Major
> Attachments: memory-prod.png, memory-test.png
>
>
> We have been observing a non-heap memory leak after upgrading to Kafka 
> Streams 2.2.0 from 2.0.1. We suspect the source to be around RocksDB as the 
> leak only happens when we enable stateful stream operations (utilizing 
> stores). We are aware of *KAFKA-8323* and have created our own fork of 2.2.0 
> and ported the fix scheduled for release in 2.2.1 to our fork. It did not 
> stop the leak, however.
> We are having this memory leak in our production environment where the 
> consumer group is auto-scaled in and out in response to changes in traffic 
> volume, and in our test environment where we have two consumers, no 
> autoscaling and relatively constant traffic.
> Below is some information I'm hoping will be of help:
>  * RocksDB Config:
> Block cache size: 4 MiB
> Write buffer size: 2 MiB
> Block size: 16 KiB
> Cache index and filter blocks: true
> Manifest preallocation size: 64 KiB
> Max write buffer number: 3
> Max open files: 6144
>  
>  * Memory usage in production
> The attached graph (memory-prod.png) shows memory consumption for each 
> instance as a separate line. The horizontal red line at 6 GiB is the memory 
> limit.
> As illustrated on the attached graph from production, memory consumption in 
> running instances goes up around autoscaling events (scaling the consumer 
> group either in or out) and associated rebalancing. It stabilizes until the 
> next autoscaling event but it never goes back down.
> An example of scaling out can be seen from around 21:00 hrs where three new 
> instances are started in response to a traffic spike.
> Just after midnight traffic drops and some instances are shut down. Memory 
> consumption in the remaining running instances goes up.
> Memory consumption climbs again from around 6:00AM due to increased traffic 
> and new instances are being started until around 10:30AM. Memory consumption 
> never drops until the cluster is restarted around 12:30.
>  
>  * Memory usage in test
> As illustrated by the attached graph (memory-test.png) we have a fixed number 
> of two instances in our test environment and no autoscaling. Memory 
> consumption rises linearly until it reaches the limit (around 2:00 AM on 
> 5/13) and Mesos restarts the offending instances, or we restart the cluster 
> manually.
>  
>  * No heap leaks observed
>  * Window retention: 2 or 11 minutes (depending on operation type)
>  * Issue not present in Kafka Streams 2.0.1
>  * No memory leak for stateless stream operations (when no RocksDB stores are 
> used)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-3522) Consider adding version information into rocksDB storage format

2019-05-16 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax resolved KAFKA-3522.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Consider adding version information into rocksDB storage format
> ---
>
> Key: KAFKA-3522
> URL: https://issues.apache.org/jira/browse/KAFKA-3522
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Matthias J. Sax
>Priority: Major
>  Labels: architecture
> Fix For: 2.3.0
>
>
> Kafka Streams does not introduce any modifications to the data format in the 
> underlying Kafka protocol, but it does use RocksDB for persistent state 
> storage, and currently its data format is fixed and hard-coded. We want to 
> consider the evolution path in the future when we change the data format, and 
> hence having some version info stored along with the storage file / directory 
> would be useful.
> And this information could even be outside of the storage file; for example, we 
> can just use a small "version indicator" file in the rocksdb directory for 
> this purpose. Thoughts? [~enothereska] [~jkreps]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6455) Improve timestamp propagation at DSL level

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841847#comment-16841847
 ] 

ASF GitHub Bot commented on KAFKA-6455:
---

mjsax commented on pull request #6751: KAFKA-6455: Update integration tests to 
verify result timestamps
URL: https://github.com/apache/kafka/pull/6751
 
 
   This PR contains the changes from #6725 and must be rebased after #6725 is 
merged.
   
   For review, only consider the second commit of this PR.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve timestamp propagation at DSL level
> --
>
> Key: KAFKA-6455
> URL: https://issues.apache.org/jira/browse/KAFKA-6455
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>Priority: Major
>  Labels: needs-kip
>
> At DSL level, we inherit the timestamp propagation "contract" from the 
> Processor API. This contract is not optimal at DSL level, and we should 
> define a DSL level contract that matches the semantics of the corresponding 
> DSL operator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7895) Ktable supress operator emitting more than one record for the same key per window

2019-05-16 Thread Kun Song (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841851#comment-16841851
 ] 

Kun Song commented on KAFKA-7895:
-

Hi [~vvcephei],

Great work!

I want to test it in my env, but I can't figure out how to find this version 
since it's not released yet. Could you please give me some tips on where I can 
download it? Thanks!

> Ktable supress operator emitting more than one record for the same key per 
> window
> -
>
> Key: KAFKA-7895
> URL: https://issues.apache.org/jira/browse/KAFKA-7895
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.1.0, 2.2.0, 2.1.1
>Reporter: prasanthi
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.1.2, 2.2.1
>
>
> Hi, We are using kstreams to get the aggregated counts per vendor(key) within 
> a specified window.
> Here's how we configured the suppress operator to emit one final record per 
> key/window.
> {code:java}
> KTable<Windowed<Integer>, Long> windowedCount = groupedStream
>  .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(ofMillis(5L)))
>  .count(Materialized.with(Serdes.Integer(),Serdes.Long()))
>  .suppress(Suppressed.untilWindowCloses(unbounded()));
> {code}
> But we are getting more than one record for the same key/window as shown 
> below.
> {code:java}
> [KTABLE-TOSTREAM-10]: [131@154906704/154906710], 1039
> [KTABLE-TOSTREAM-10]: [131@154906704/154906710], 1162
> [KTABLE-TOSTREAM-10]: [9@154906704/154906710], 6584
> [KTABLE-TOSTREAM-10]: [88@154906704/154906710], 107
> [KTABLE-TOSTREAM-10]: [108@154906704/154906710], 315
> [KTABLE-TOSTREAM-10]: [119@154906704/154906710], 119
> [KTABLE-TOSTREAM-10]: [154@154906704/154906710], 746
> [KTABLE-TOSTREAM-10]: [154@154906704/154906710], 809{code}
> Could you please take a look?
> Thanks
>  
>  
> Added by John:
> Acceptance Criteria:
>  * add suppress to system tests, such that it's exercised with crash/shutdown 
> recovery, rebalance, etc.
>  ** [https://github.com/apache/kafka/pull/6278]
>  * make sure that there's some system test coverage with caching disabled.
>  ** Follow-on ticket: https://issues.apache.org/jira/browse/KAFKA-7943
>  * test with tighter time bounds with windows of say 30 seconds and use 
> system time without adding any extra time for verification
>  ** Follow-on ticket: https://issues.apache.org/jira/browse/KAFKA-7944



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7895) Ktable supress operator emitting more than one record for the same key per window

2019-05-16 Thread Kun Song (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841851#comment-16841851
 ] 

Kun Song edited comment on KAFKA-7895 at 5/17/19 2:03 AM:
--

Hi [~vvcephei],

Great work!

I want to test it in my env, but I can't figure out how to find this version 
since it's not released yet. Could you please give me some tips on where I can 
download it? Thanks!

Edit: I just found it now :)


was (Author: songkun):
Hi [~vvcephei],

Great work!

I want to test it in my env, but I can't figure out how to find this version 
since it's not released yet. Could you please give me some tips on where I can 
download it? Thanks!

> Ktable supress operator emitting more than one record for the same key per 
> window
> -
>
> Key: KAFKA-7895
> URL: https://issues.apache.org/jira/browse/KAFKA-7895
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.1.0, 2.2.0, 2.1.1
>Reporter: prasanthi
>Assignee: John Roesler
>Priority: Blocker
> Fix For: 2.1.2, 2.2.1
>
>
> Hi, We are using kstreams to get the aggregated counts per vendor(key) within 
> a specified window.
> Here's how we configured the suppress operator to emit one final record per 
> key/window.
> {code:java}
> KTable<Windowed<Integer>, Long> windowedCount = groupedStream
>  .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(ofMillis(5L)))
>  .count(Materialized.with(Serdes.Integer(),Serdes.Long()))
>  .suppress(Suppressed.untilWindowCloses(unbounded()));
> {code}
> But we are getting more than one record for the same key/window as shown 
> below.
> {code:java}
> [KTABLE-TOSTREAM-10]: [131@154906704/154906710], 1039
> [KTABLE-TOSTREAM-10]: [131@154906704/154906710], 1162
> [KTABLE-TOSTREAM-10]: [9@154906704/154906710], 6584
> [KTABLE-TOSTREAM-10]: [88@154906704/154906710], 107
> [KTABLE-TOSTREAM-10]: [108@154906704/154906710], 315
> [KTABLE-TOSTREAM-10]: [119@154906704/154906710], 119
> [KTABLE-TOSTREAM-10]: [154@154906704/154906710], 746
> [KTABLE-TOSTREAM-10]: [154@154906704/154906710], 809{code}
> Could you please take a look?
> Thanks
>  
>  
> Added by John:
> Acceptance Criteria:
>  * add suppress to system tests, such that it's exercised with crash/shutdown 
> recovery, rebalance, etc.
>  ** [https://github.com/apache/kafka/pull/6278]
>  * make sure that there's some system test coverage with caching disabled.
>  ** Follow-on ticket: https://issues.apache.org/jira/browse/KAFKA-7943
>  * test with tighter time bounds with windows of say 30 seconds and use 
> system time without adding any extra time for verification
>  ** Follow-on ticket: https://issues.apache.org/jira/browse/KAFKA-7944



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5505) Connect: Do not restart connector and existing tasks on task-set change

2019-05-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841884#comment-16841884
 ] 

ASF GitHub Bot commented on KAFKA-5505:
---

rhauch commented on pull request #6363: KAFKA-5505: Incremental cooperative 
rebalancing in Connect (KIP-415)
URL: https://github.com/apache/kafka/pull/6363
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Connect: Do not restart connector and existing tasks on task-set change
> ---
>
> Key: KAFKA-5505
> URL: https://issues.apache.org/jira/browse/KAFKA-5505
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.10.2.1
>Reporter: Per Steffensen
>Assignee: Konstantine Karantasis
>Priority: Major
>
> I am writing a connector with a frequently changing task-set. It is really 
> not working very well, because the connector and all existing tasks are 
> restarted when the set of tasks changes. E.g. if the connector is running 
> with 10 tasks, and an additional task is needed, the connector itself and all 
> 10 existing tasks are restarted, just to make the 11th task run also. My 
> tasks have a fairly heavy initialization, making it extra annoying. I would 
> like to see a change, introducing a "mode", where only new/deleted tasks are 
> started/stopped when notifying the system that the set of tasks changed 
> (calling context.requestTaskReconfiguration() - or something similar).
> Discussed this issue a little on d...@kafka.apache.org in the thread "Kafka 
> Connect: To much restarting with a SourceConnector with dynamic set of tasks"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5505) Connect: Do not restart connector and existing tasks on task-set change

2019-05-16 Thread Randall Hauch (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch updated KAFKA-5505:
-
Fix Version/s: 2.3.0

Merged onto `trunk` for AK 2.3.0.

> Connect: Do not restart connector and existing tasks on task-set change
> ---
>
> Key: KAFKA-5505
> URL: https://issues.apache.org/jira/browse/KAFKA-5505
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.10.2.1
>Reporter: Per Steffensen
>Assignee: Konstantine Karantasis
>Priority: Major
> Fix For: 2.3.0
>
>
> I am writing a connector with a frequently changing task-set. It is really 
> not working very well, because the connector and all existing tasks are 
> restarted when the set of tasks changes. E.g. if the connector is running 
> with 10 tasks, and an additional task is needed, the connector itself and all 
> 10 existing tasks are restarted, just to make the 11th task run also. My 
> tasks have a fairly heavy initialization, making it extra annoying. I would 
> like to see a change, introducing a "mode", where only new/deleted tasks are 
> started/stopped when notifying the system that the set of tasks changed 
> (calling context.requestTaskReconfiguration() - or something similar).
> Discussed this issue a little on d...@kafka.apache.org in the thread "Kafka 
> Connect: To much restarting with a SourceConnector with dynamic set of tasks"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-5505) Connect: Do not restart connector and existing tasks on task-set change

2019-05-16 Thread Randall Hauch (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch resolved KAFKA-5505.
--
Resolution: Fixed
  Reviewer: Randall Hauch

> Connect: Do not restart connector and existing tasks on task-set change
> ---
>
> Key: KAFKA-5505
> URL: https://issues.apache.org/jira/browse/KAFKA-5505
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.10.2.1
>Reporter: Per Steffensen
>Assignee: Konstantine Karantasis
>Priority: Major
> Fix For: 2.3.0
>
>
> I am writing a connector with a frequently changing task-set. It is really 
> not working very well, because the connector and all existing tasks are 
> restarted when the set of tasks changes. E.g. if the connector is running 
> with 10 tasks, and an additional task is needed, the connector itself and all 
> 10 existing tasks are restarted, just to make the 11th task run also. My 
> tasks have a fairly heavy initialization, making it extra annoying. I would 
> like to see a change, introducing a "mode", where only new/deleted tasks are 
> started/stopped when notifying the system that the set of tasks changed 
> (calling context.requestTaskReconfiguration() - or something similar).
> Discussed this issue a little on d...@kafka.apache.org in the thread "Kafka 
> Connect: To much restarting with a SourceConnector with dynamic set of tasks"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)