Re: [DISCUSS] KIP-98: Exactly Once Delivery and Transactional Messaging

2016-12-13 Thread Sriram Subramanian
I am not sure if it is a good idea to support both init() and lazy
initialization. The ideal state would have been to implement init as a
non-blocking API and have the rest of the methods throw an uninitialized
exception if init was not called. This would ensure that init can still be
used by other non-blocking frameworks while at the same time enforcing that
other APIs are not called until init is complete. The problem with
supporting both is that it is going to confuse users. Why should I call
init over lazy initialization? The lazy initialization of transactions will
not work if you want to recover transactions, read offsets and then start a
new transaction. In such cases, you would have to resort to the init API.
Client developers would need to provide support for both lazy
initialization and the init method. This problem might be solved if we just
renamed the initTransactions API (e.g. recoverTransactions).
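
For readers following along, here is a minimal sketch of the two call
patterns being debated. The TxProducer interface below is hypothetical;
only the init()/beginTransaction()/commitTransaction() names follow this
thread, and nothing here is the actual KIP-98 API.

class InitStylesSketch {
    // Hypothetical interface; only the method names follow the thread above.
    interface TxProducer<K, V> {
        void init();                 // explicit, possibly blocking, recovery/initialization
        void beginTransaction();     // with lazy init, the first call would also initialize
        void send(String topic, K key, V value);
        void commitTransaction();
        void close();
    }

    // Explicit style: recover transactions and read offsets before starting new work.
    static <K, V> void explicitInit(TxProducer<K, V> producer) {
        producer.init();
        producer.beginTransaction();
        producer.send("my-topic", null, null);
        producer.commitTransaction();
    }

    // Lazy style: initialization happens implicitly on the first beginTransaction().
    static <K, V> void lazyInit(TxProducer<K, V> producer) {
        producer.beginTransaction();
        producer.send("my-topic", null, null);
        producer.commitTransaction();
    }
}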

On Mon, Dec 12, 2016 at 11:36 PM, Ismael Juma  wrote:

> Hi Jay,
>
> I like the idea of having a single `init`, but I am not sure about the
> specifics of the metadata initialisation (as Jason alluded to). More
> inline.
>
> On Tue, Dec 13, 2016 at 6:01 AM, Jay Kreps  wrote:
>
> >1. Add a generic init() call which initializes both transactions and
> >metadata
> >
>
> Would this initialise metadata for all topics? One advantage of doing the
> metadata call during `send` is that we only retrieve metadata for the
> subset of topics that you are producing to. For large clusters, retrieving
> the metadata for all the topics is relatively expensive and I think users
> would prefer to avoid that unless there are some concrete benefits. We
> could pass the topics to `init`, but that seems a bit clunky.
>
>
> >2. If you don't call init(), metadata is initialized on the first send
> >(as now)
>
>
> We need to maintain the logic to refresh the metadata on `send` anyway if
> you try to send to a topic that is missing from the metadata (e.g. if it's
> added after the `init` method is called, assuming that we don't expect
> people to call `init` more than once) so that seems fine.
>
>
> > and transactions are lazily initialized at the first beginTransaction()
> > call.
>
>
> I'll leave it to Jason to say if this is feasible. However, if it is, it
> seems like we can just do this and avoid the `init` method altogether?
>
> Ismael
>
> On Tue, Dec 13, 2016 at 6:01 AM, Jay Kreps  wrote:
>
> > Hey Jason/Neha,
> >
> > Yeah, clearly having a mandatory, generic init() method that initializes
> > both transactions and topic metadata would be the ideal solution. This
> > would solve the occasional complaint about blocking behavior during
> > initialization of metadata (or at least shift it to a new complaint about
> > an inability to initialize when the cluster isn't up in test
> environments).
> > The challenge is that we can't do this because it isn't backwards
> > compatible with existing apps that don't call init.
> >
> > The alternative of having an optional generic init() call is a bit odd
> > because to figure out if you need to call it you need to discover what it
> > does, which is not generic, it initializes transactions. We can't really
> > add more logic to init because it only gets invoked by transaction users
> so
> > it doesn't really function as a generic init.
> >
> > What do you think of this solution:
> >
> >1. Add a generic init() call which initializes both transactions and
> >metadata
> >2. If you don't call init(), metadata is initialized on the first send
> >(as now) and transactions are lazily initialized at the first
> >beginTransaction() call.
> >
> > -Jay
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Dec 12, 2016 at 9:17 PM, Jason Gustafson 
> > wrote:
> >
> > > @Neha
> > >
> > >
> > > 1. I think we should consider renaming initTransactions to just init()
> > and
> > > > moving the metadata initialization there. Let's make sure we don't
> add
> > > APIs
> > > > that are relevant to this proposal only. Instead, try to think what
> > we'd
> > > > propose if we were writing the producer from scratch today. I suspect
> > we
> > > > would end up with an init() API that would do the metadata
> > initialization
> > > > as well as the transaction stuff lazily. If so, let's make that
> change
> > > now.
> > >
> > >
> > > I think the only awkwardness with `init()` is that it would probably
> have
> > > to be an optional API for non-transactional usage to support existing
> > code.
> > > I'm also not sure what metadata we can actually initialize at that
> point
> > > since we don't know which topics will be produced to. That said, I'm
> also
> > > not fond of the `initTransactions` name, and we may find other uses
> for a
> > > generic `init()` in the future, so I'm in favor of this renaming.
> > >
> > >
> > > > 2. Along the same lines, let's think about the role of each id that
> the
> > > > producer will have and see if everything still makes sense. For
> > instance,
> > > > we have quite a

Re: [VOTE] KIP-100 - Relax Type constraints in Kafka Streams API

2016-12-13 Thread Ismael Juma
Hi Xavier,

Thanks for the KIP. If Java had declaration-site variance (proposed for a
future Java version[1]), we'd mark function parameters as contravariant
(i.e. "super") and the result as covariant (i.e. "extends"). In the
meantime, we have to use wildcards at the use site as per your proposal.
However, it seems that only the former (contravariant function parameters)
is covered by your proposal. This is an improvement, but is there any
reason not to do the latter (covariant results) as well? It would be good
to get it completely right this time.

Ismael

[1] http://openjdk.java.net/jeps/300
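
To make the distinction concrete, here is a small, self-contained Java
sketch of the two use-site wildcards. It uses java.util.function.Function
rather than the actual Kafka Streams interfaces such as ValueMapper, but
the same pattern would apply to those signatures.

import java.util.function.Function;

public class VarianceSketch {
    // Strict signature: the mapper must be declared exactly as Function<V, VR>.
    static <V, VR> VR mapStrict(V value, Function<V, VR> mapper) {
        return mapper.apply(value);
    }

    // Relaxed signature: parameters are contravariant ("super"), the result is
    // covariant ("extends"), so a more general mapper can be reused.
    static <V, VR> VR mapRelaxed(V value, Function<? super V, ? extends VR> mapper) {
        return mapper.apply(value);
    }

    public static void main(String[] args) {
        Function<Object, Integer> hash = Object::hashCode;
        // VarianceSketch.<String, Number>mapStrict("hello", hash) does not compile:
        // a Function<Object, Integer> is not a Function<String, Number>.
        Number n = VarianceSketch.<String, Number>mapRelaxed("hello", hash);
        System.out.println(n);
    }
}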

On Fri, Dec 9, 2016 at 6:27 PM, Xavier Léauté  wrote:

> Hi everyone,
>
> I would like to start the vote for KIP-100 unless there are any more
> comments.
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-100+-+
> Relax+Type+constraints+in+Kafka+Streams+API
>
> corresponding PR here https://github.com/apache/kafka/pull/2205
>
> Thanks,
> Xavier
>


Build failed in Jenkins: kafka-trunk-jdk8 #1098

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] KAFKA-4516: When a CachingStateStore is closed it should clear its

--
[...truncated 26860 lines...]

org.apache.kafka.streams.StreamsConfigTest > 
shouldBeSupportNonPrefixedRestoreConsumerConfigs STARTED

org.apache.kafka.streams.StreamsConfigTest > 
shouldBeSupportNonPrefixedRestoreConsumerConfigs PASSED

org.apache.kafka.streams.StreamsConfigTest > 
shouldSupportPrefixedRestoreConsumerConfigs STARTED

org.apache.kafka.streams.StreamsConfigTest > 
shouldSupportPrefixedRestoreConsumerConfigs PASSED

org.apache.kafka.streams.KafkaStreamsTest > shouldNotGetAllTasksWhenNotRunning 
STARTED

org.apache.kafka.streams.KafkaStreamsTest > shouldNotGetAllTasksWhenNotRunning 
PASSED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetTaskWithKeyAndPartitionerWhenNotRunning STARTED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetTaskWithKeyAndPartitionerWhenNotRunning PASSED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetTaskWithKeyAndSerializerWhenNotRunning STARTED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetTaskWithKeyAndSerializerWhenNotRunning PASSED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldReturnFalseOnCloseWhenThreadsHaventTerminated STARTED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldReturnFalseOnCloseWhenThreadsHaventTerminated PASSED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetAllTasksWithStoreWhenNotRunning STARTED

org.apache.kafka.streams.KafkaStreamsTest > 
shouldNotGetAllTasksWithStoreWhenNotRunning PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartOnceClosed STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartOnceClosed PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCleanup STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCleanup PASSED

org.apache.kafka.streams.KafkaStreamsTest > testStartAndClose STARTED

org.apache.kafka.streams.KafkaStreamsTest > testStartAndClose PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCloseIsIdempotent STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCloseIsIdempotent PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCannotCleanupWhileRunning 
STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCannotCleanupWhileRunning PASSED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartTwice STARTED

org.apache.kafka.streams.KafkaStreamsTest > testCannotStartTwice PASSED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[0] STARTED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[0] PASSED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[1] STARTED

org.apache.kafka.streams.integration.KStreamKTableJoinIntegrationTest > 
shouldCountClicksPerRegion[1] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[0] FAILED
java.lang.AssertionError: Condition not met within timeout 3. waiting 
for store count-by-key
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:279)
at 
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldNotMakeStoreAvailableUntilAllStoresAvailable(QueryableStateIntegrationTest.java:491)

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
queryOnRebalance[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
queryOnRebalance[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
concurrentAccesses[0] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
concurrentAccesses[0] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[1] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldBeAbleToQueryState[1] PASSED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[1] STARTED

org.apache.kafka.streams.integration.QueryableStateIntegrationTest > 
shouldNotMakeStoreAvailableUntilAllStoresAvailable[1] FAILED
java.lang.AssertionError: Condition not met within timeout 3. waiting 
for store count-by-key
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:279)
at 
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldNotMakeStoreAvailableUntilAllStoresAvailable(QueryableS

[jira] [Commented] (KAFKA-4529) tombstone may be removed earlier than it should

2016-12-13 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744529#comment-15744529
 ] 

Jiangjie Qin commented on KAFKA-4529:
-

Sure, I'll submit a patch.

> tombstone may be removed earlier than it should
> ---
>
> Key: KAFKA-4529
> URL: https://issues.apache.org/jira/browse/KAFKA-4529
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Jun Rao
> Fix For: 0.10.1.1
>
>
> As part of KIP-33, we introduced a regression on how tombstone is removed in 
> a compacted topic. We want to delay the removal of a tombstone to avoid the 
> case that a reader first reads a non-tombstone message on a key and then 
> doesn't see the tombstone for the key because it's deleted too quickly. So, a 
> tombstone is supposed to only be removed from a compacted topic after the 
> tombstone is part of the cleaned portion of the log after delete.retention.ms.
> Before KIP-33, deleteHorizonMs in LogCleaner is calculated based on the last 
> modified time, which is monotonically increasing from old to new segments. 
> With KIP-33, deleteHorizonMs is calculated based on the message timestamp, 
> which is not necessarily monotonically increasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
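
For readers unfamiliar with the cleaner, a purely illustrative sketch of
the two horizon computations described above. This is not LogCleaner code;
the Segment fields and the lastCleanableSegment parameter are assumptions
made only to restate the monotonicity difference from the issue
description.

public class DeleteHorizonSketch {
    static final class Segment {
        final long lastModifiedMs;   // file mtime: monotonically non-decreasing from old to new segments
        final long maxTimestampMs;   // largest message timestamp: not necessarily monotonic
        Segment(long lastModifiedMs, long maxTimestampMs) {
            this.lastModifiedMs = lastModifiedMs;
            this.maxTimestampMs = maxTimestampMs;
        }
    }

    // Pre-KIP-33 style: horizon derived from the segment's last modified time.
    static long horizonFromLastModified(Segment lastCleanableSegment, long deleteRetentionMs) {
        return lastCleanableSegment.lastModifiedMs - deleteRetentionMs;
    }

    // KIP-33 style: horizon derived from the message timestamp, which is not
    // necessarily monotonic across segments and so may not reflect when the
    // tombstone actually entered the cleaned portion of the log.
    static long horizonFromMessageTimestamp(Segment lastCleanableSegment, long deleteRetentionMs) {
        return lastCleanableSegment.maxTimestampMs - deleteRetentionMs;
    }
}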


[jira] [Assigned] (KAFKA-4529) tombstone may be removed earlier than it should

2016-12-13 Thread Jiangjie Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiangjie Qin reassigned KAFKA-4529:
---

Assignee: Jiangjie Qin

> tombstone may be removed earlier than it should
> ---
>
> Key: KAFKA-4529
> URL: https://issues.apache.org/jira/browse/KAFKA-4529
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Jun Rao
>Assignee: Jiangjie Qin
> Fix For: 0.10.1.1
>
>
> As part of KIP-33, we introduced a regression on how tombstone is removed in 
> a compacted topic. We want to delay the removal of a tombstone to avoid the 
> case that a reader first reads a non-tombstone message on a key and then 
> doesn't see the tombstone for the key because it's deleted too quickly. So, a 
> tombstone is supposed to only be removed from a compacted topic after the 
> tombstone is part of the cleaned portion of the log after delete.retention.ms.
> Before KIP-33, deleteHorizonMs in LogCleaner is calculated based on the last 
> modified time, which is monotonically increasing from old to new segments. 
> With KIP-33, deleteHorizonMs is calculated based on the message timestamp, 
> which is not necessarily monotonically increasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (KAFKA-4529) tombstone may be removed earlier than it should

2016-12-13 Thread Jiangjie Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4529 started by Jiangjie Qin.
---
> tombstone may be removed earlier than it should
> ---
>
> Key: KAFKA-4529
> URL: https://issues.apache.org/jira/browse/KAFKA-4529
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Jun Rao
>Assignee: Jiangjie Qin
> Fix For: 0.10.1.1
>
>
> As part of KIP-33, we introduced a regression on how tombstone is removed in 
> a compacted topic. We want to delay the removal of a tombstone to avoid the 
> case that a reader first reads a non-tombstone message on a key and then 
> doesn't see the tombstone for the key because it's deleted too quickly. So, a 
> tombstone is supposed to only be removed from a compacted topic after the 
> tombstone is part of the cleaned portion of the log after delete.retention.ms.
> Before KIP-33, deleteHorizonMs in LogCleaner is calculated based on the last 
> modified time, which is monotonically increasing from old to new segments. 
> With KIP-33, deleteHorizonMs is calculated based on the message timestamp, 
> which is not necessarily monotonically increasing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-98: Exactly Once Delivery and Transactional Messaging

2016-12-13 Thread Jay Kreps
Hey Ismael,

Yes, you are correct, I remember now why we didn't do that. I rescind that
suggestion. I still think lazy initialization is more in keeping with what
we've done, if feasible.

-Jay

On Mon, Dec 12, 2016 at 11:36 PM, Ismael Juma  wrote:

> Hi Jay,
>
> I like the idea of having a single `init`, but I am not sure about the
> specifics of the metadata initialisation (as Jason alluded to). More
> inline.
>
> On Tue, Dec 13, 2016 at 6:01 AM, Jay Kreps  wrote:
>
> >1. Add a generic init() call which initializes both transactions and
> >metadata
> >
>
> Would this initialise metadata for all topics? One advantage of doing the
> metadata call during `send` is that we only retrieve metadata for the
> subset of topics that you are producing to. For large clusters, retrieving
> the metadata for all the topics is relatively expensive and I think users
> would prefer to avoid that unless there are some concrete benefits. We
> could pass the topics to `init`, but that seems a bit clunky.
>
>
> >2. If you don't call init(), metadata is initialized on the first send
> >(as now)
>
>
> We need to maintain the logic to refresh the metadata on `send` anyway if
> you try to send to a topic that is missing from the metadata (e.g. if it's
> added after the `init` method is called, assuming that we don't expect
> people to call `init` more than once) so that seems fine.
>
>
> > and transactions are lazily initialized at the first beginTransaction()
> > call.
>
>
> I'll leave it to Jason to say if this is feasible. However, if it is, it
> seems like we can just do this and avoid the `init` method altogether?
>
> Ismael
>
> On Tue, Dec 13, 2016 at 6:01 AM, Jay Kreps  wrote:
>
> > Hey Jason/Neha,
> >
> > Yeah, clearly having a mandatory, generic init() method that initializes
> > both transactions and topic metadata would be the ideal solution. This
> > would solve the occasional complaint about blocking behavior during
> > initialization of metadata (or at least shift it to a new complaint about
> > an inability to initialize when the cluster isn't up in test
> environments).
> > The challenge is that we can't do this because it isn't backwards
> > compatible with existing apps that don't call init.
> >
> > The alternative of having an optional generic init() call is a bit odd
> > because to figure out if you need to call it you need to discover what it
> > does, which is not generic, it initializes transactions. We can't really
> > add more logic to init because it only gets invoked by transaction users
> so
> > it doesn't really function as a generic init.
> >
> > What do you think of this solution:
> >
> >1. Add a generic init() call which initializes both transactions and
> >metadata
> >2. If you don't call init(), metadata is initialized on the first send
> >(as now) and transactions are lazily initialized at the first
> >beginTransaction() call.
> >
> > -Jay
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Dec 12, 2016 at 9:17 PM, Jason Gustafson 
> > wrote:
> >
> > > @Neha
> > >
> > >
> > > 1. I think we should consider renaming initTransactions to just init()
> > and
> > > > moving the metadata initialization there. Let's make sure we don't
> add
> > > APIs
> > > > that are relevant to this proposal only. Instead, try to think what
> > we'd
> > > > propose if we were writing the producer from scratch today. I suspect
> > we
> > > > would end up with an init() API that would do the metadata
> > initialization
> > > > as well as the transaction stuff lazily. If so, let's make that
> change
> > > now.
> > >
> > >
> > > I think the only awkwardness with `init()` is that it would probably
> have
> > > to be an optional API for non-transactional usage to support existing
> > code.
> > > I'm also not sure what metadata we can actually initialize at that
> point
> > > since we don't know which topics will be produced to. That said, I'm
> also
> > > not fond of the `initTransactions` name, and we may find other uses
> for a
> > > generic `init()` in the future, so I'm in favor of this renaming.
> > >
> > >
> > > > 2. Along the same lines, let's think about the role of each id that
> the
> > > > producer will have and see if everything still makes sense. For
> > instance,
> > > > we have quite a few per-producer-instance notions -- client.id, a
> > > producer
> > > > id and a transaction.app.id, some set via config and some generated
> > > > on-the-fly. What role does each play, how do they relate to each
> other
> > > and
> > > > is there an opportunity to get rid of any.
> > >
> > >
> > > The abundance of ids is super annoying. The producer ID is not actually
> > > exposed in either the producer or consumer, but I'm not sure how
> > successful
> > > we'll be in hiding its existence from the user (you probably need to
> know
> > > about it for debugging and administrative purposes at least). This
> issue
> > > has been a continual thorn and I'm not sure I have a great answer. We
> > have
> > > been te

Re: [VOTE] KIP-84: Support SASL SCRAM mechanisms

2016-12-13 Thread Rajini Sivaram
Jun,

Any thoughts on reducing the number of mechanisms and supporting only
SCRAM-SHA-256 and SCRAM-SHA-512?

Thank you,

Rajini

On Fri, Dec 2, 2016 at 2:44 PM, Ismael Juma  wrote:

> Thanks Rajini. Let's see what Jun says about limiting the number of SHA
> variants. Either way, +1 from me.
>
> Ismael
>
> On Fri, Dec 2, 2016 at 2:40 PM, Rajini Sivaram <
> rajinisiva...@googlemail.com
> > wrote:
>
> > Ismael,
> >
> > 1. Jun had suggested adding the full list of SHA-nnn in the [DISCUSS]
> > thread. I am ok with limiting it to a smaller number if required.
> >
> > 3. Added a section on security considerations to the KIP.
> >
> > Thank you,
> >
> > Rajini
> >
> > On Thu, Dec 1, 2016 at 4:22 PM, Ismael Juma  wrote:
> >
> > > Hi Rajini,
> > >
> > > Sorry for the delay. For some reason, both of your replies (for this
> and
> > > KIP-85) were marked as spam by Gmail. Comments inline.
> > >
> > > On Mon, Nov 28, 2016 at 3:47 PM, Rajini Sivaram <
> > > rajinisiva...@googlemail.com> wrote:
> > > >
> > > > 1. I think you had asked earlier for SCRAM-SHA-1 to be removed since
> it
> > > is
> > > > not secure :-) I am happy to add that back in so that clients which
> > don't
> > > > have access to a more secure algorithm can use it. But it would be a
> > > shame
> > > > to prevent users who only need Java clients from using more secure
> > > > mechanisms. Since SHA-1 is not secure, you need a secure Zookeeper
> > > > installation (or store your credentials in an alternative secure
> > > > store).
> > > > By supporting multiple algorithms, we are giving the choice to users.
> > It
> > > > doesn't add much additional code, just the additional tests (one
> > > > integration test per mechanism). As more clients support new
> > mechanisms,
> > > > users can enable these without any changes to Kafka.
> > > >
> > >
> > > Yes, I remember that I asked for SCRAM-SHA-1 to be removed. I probably
> > > wasn't clear. My suggestion was not to add that back, but whether we
> > needed
> > > so many variants. For example, we could support SCRAM-SHA-256 and
> > > SCRAM-SHA-512.
> > > Would that be sufficient? It's true that the cost is not that large for
> > > us, but every other client also has to pay that additional cost and I
> > > am not sure about the benefit of some of the options.
> > >
> > > 3. I am assuming that ZK authentication will be enabled and ZK
> > > > configuration will be done directly using ZK commands. This is true
> for
> > > > ACLs, quotas etc. as well?
> > > >
> > >
> > > Right, I also thought that ACLs was the closest example. However, it
> > seems
> > > like getting read access to the SCRAM DB has potentially worse
> > > consequences:
> > >
> > > "For a specific secret compromised, if an exchange is obtained from the
> > > wire by some mechanism, this gives sufficient information for an
> imposter
> > > to pose as the client for that server (but not another one using the
> same
> > > password). Note that this interception is only useful if the database
> has
> > > been compromised – SCRAM is safe against replay attack. This is the
> > primary
> > > SCRAM weakness, and why it is important to protect the secret database
> > > carefully and to use TLS."[1]
> > >
> > > Also, because we are using fast hashes (instead of slow ones like
> bcrypt,
> > > scrypt, etc.), we are more susceptible to dictionary attacks
> (potentially
> > > mitigated by a reasonably large iteration count combined with good
> > quality
> > > passwords).
> > >
> > > If nothing else, it may be worth mentioning some of this in the KIP
> > and/or
> > > documentation.
> > >
> > > Ismael
> > >
> > > [1] http://www.isode.com/whitepapers/scram.html
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Rajini
> >
>



-- 
Regards,

Rajini
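
As an aside on the iteration-count point raised above: SCRAM's
SaltedPassword is an iterated HMAC (the Hi() function, effectively PBKDF2),
so the iteration count directly scales the cost of an offline dictionary
attack against a leaked credential database. Below is a standalone JDK
sketch of that derivation, not Kafka's SCRAM implementation; the salt and
password values are placeholders.

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.nio.charset.StandardCharsets;

public class SaltedPasswordSketch {
    static byte[] saltedPassword(char[] password, byte[] salt, int iterations) throws Exception {
        // 256-bit output, matching SCRAM-SHA-256's hash size.
        PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, 256);
        return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = "per-user-random-salt".getBytes(StandardCharsets.UTF_8);
        // RFC 5802 recommends an iteration count of at least 4096; higher values
        // make each password guess proportionally more expensive for an attacker.
        byte[] key = saltedPassword("password".toCharArray(), salt, 4096);
        System.out.println("salted password: " + key.length + " bytes");
    }
}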


[jira] [Created] (KAFKA-4530) cant stop kafka server

2016-12-13 Thread xin (JIRA)
xin created KAFKA-4530:
--

 Summary: cant stop kafka server
 Key: KAFKA-4530
 URL: https://issues.apache.org/jira/browse/KAFKA-4530
 Project: Kafka
  Issue Type: Bug
  Components: admin
Affects Versions: 0.10.0.1, 0.10.1.0
Reporter: xin


The ps command can't find kafka.Kafka.
Is the server start classpath too long for "kafka.Kafka" to be found?

183-manager:/home/xxx/kafka_2.10-0.10.0.1/bin # ./kafka-server-stop.sh 
No kafka server to stop
183-manager:/home/xxx/kafka_2.10-0.10.0.1/bin # ps -ef|grep kafka
root 28517 77538  7 16:07 pts/600:00:03 /usr/java/jdk/bin/java -Xmx1G 
-Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 
-XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC 
-Djava.awt.headless=true 
-Xloggc:/home/xxx/kafka_2.10-0.10.0.1/bin/../logs/kafkaServer-gc.log 
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dkafka.logs.dir=/home/xxx/kafka_2.10-0.10.0.1/bin/../logs 
-Dlog4j.configuration=file:./../config/log4j.properties -cp 
.:/usr/java/jdk/lib/dt.jar:/usr/java/jdk/lib/tools.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/aopalliance-repackaged-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/argparse4j-0.5.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-api-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-file-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-json-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-runtime-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/guava-18.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-api-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-locator-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-utils-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-annotations-2.6.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-core-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-databind-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-jaxrs-base-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-jaxrs-json-provider-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-module-jaxb-annotations-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javassist-3.18.2-GA.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.annotation-api-1.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.inject-1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.inject-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.servlet-api-3.1.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.ws.rs-api-2.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-client-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-common-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-container-servlet-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-container-servlet-core-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-guava-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-media-jaxb-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-server-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-continuation-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-http-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-io-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-security-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-server-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-servlet-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-servlets-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-util-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jopt-simple-4.9.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-clients-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-log4j-appender-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-streams-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-streams-examples-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-tools-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1-sources.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1-test-sources.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/log4j-1.2.17.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/lz4-1.3.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/metrics-core-2.2.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../l



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-81: Max in-flight fetches

2016-12-13 Thread Jay Kreps
Hey Ismael,

Yeah I think we are both saying the same thing---removing only works if you
have a truly optimal strategy. Actually even dynamically computing a
reasonable default isn't totally obvious (do you set fetch.max.bytes to
equal buffer.memory to try to queue up as much data in the network buffers?
Do you try to limit it to your socket.receive.buffer size so that you can
read it in a single shot?).

Regarding what is being measured, my interpretation was the same as yours.
I was just adding to the previous point that buffer.memory setting would
not be a very close proxy for memory usage. Someone was pointing out that
compression would make this true, and I was just adding that even without
compression the object overhead would lead to a high expansion factor.

-Jay

On Mon, Dec 12, 2016 at 11:53 PM, Ismael Juma  wrote:

> Hi Jay,
>
> About `max.partition.fetch.bytes`, yes it was an oversight not to lower its
> priority as part of KIP-74 given the existence of `fetch.max.bytes` and the
> fact that we can now make progress in the presence of oversized messages
> independently of either of those settings.
>
> I agree that we should try to set those values automatically based on
> `buffer.memory`, but I am not sure if we can have a truly optimal strategy.
> So, I'd go with reducing the priority to "low" instead of removing
> `fetch.max.bytes` and `max.partition.fetch.bytes` altogether for now. If
> experience in the field tells us that the auto strategy is good enough, we
> can consider removing them (yes, I know, it's unlikely to happen as there
> won't be that much motivation then).
>
> Regarding the "conversion from packed bytes to java objects" comment, that
> raises the question: what are we actually measuring here? From the KIP,
> it's not too clear. My interpretation was that we were not measuring the
> memory usage of the Java objects. In that case, `buffer.memory` seems like
> a reasonable name although perhaps the user's expectation is that we would
> measure the memory usage of the Java objects?
>
> Ismael
>
> On Tue, Dec 13, 2016 at 6:21 AM, Jay Kreps  wrote:
>
> > I think the question is whether we have a truly optimal strategy for
> > deriving the partition- and fetch-level configs from the global setting.
> If
> > we do then we should just get rid of them. If not, then if we can at
> least
> > derive usually good and never terrible settings from the global limit at
> > initialization time maybe we can set them automatically unless the user
> > overrides with an explicit conifg. Even the latter would let us mark it
> low
> > priority which at least takes it off the list of things you have to grok
> to
> > use the consumer which I suspect would be much appreciated by our poor
> > users.
> >
> > Regardless it'd be nice to make sure we get an explanation of the
> > relationships between the remaining memory configs in the KIP and in the
> > docs.
> >
> > I agree that buffer.memory isn't bad.
> >
> > -Jay
> >
> >
> > On Mon, Dec 12, 2016 at 2:56 PM, Jason Gustafson 
> > wrote:
> >
> > > Yeah, that's a good point. Perhaps in retrospect, it would have been
> > better
> > > to define `buffer.memory` first and let `fetch.max.bytes` be based off
> of
> > > it. I like `buffer.memory` since it gives the consumer nice symmetry
> with
> > > the producer and its generic naming gives us some flexibility
> internally
> > > with how we use it. We could still do that I guess, if we're willing to
> > > deprecate `fetch.max.bytes` (one release after adding it!).
> > >
> > > As for `max.partition.fetch.bytes`, it's noted in KIP-74 that it is
> still
> > > useful in Kafka Streams, but I agree it makes sense to lower its
> priority
> > > in favor of `fetch.max.bytes`.
> > >
> > > -Jason
> > >
> > > On Sat, Dec 10, 2016 at 2:27 PM, Jay Kreps  wrote:
> > >
> > > > Jason, it's not just decompression but also the conversion from
> packed
> > > > bytes to java objects, right? That can be even larger than the
> > > > decompression blow up. I think this may be okay, the problem may just
> > be
> > > > that the naming is a bit misleading. In the producer you are
> literally
> > > > allocating a buffer of that size, so the name buffer.memory makes
> > sense.
> > > In
> > > > this case it is something more like max.bytes.read.per.poll.call
> > > (terrible
> > > > name, but maybe something like that?).
> > > >
> > > > Mickael, I'd second Jason's request for the default and expand on it.
> > We
> > > > currently have several consumer-related memory
> > > > settings--max.partition.fetch.bytes, fetch.max.bytes. I don't think
> it
> > > is
> > > > clear today how to set these. For example we mark
> > > max.partition.fetch.bytes
> > > > as high importance and fetch.max.bytes as medium, but it seems like
> it
> > > > would be the other way around. Can we think this through from the
> point
> > > of
> > > > view of a lazy user? I.e. I have 64MB of space to use for my
> consumer,
> > in
> > > > an ideal world I'd say, "hey consumer

[jira] [Commented] (KAFKA-4530) cant stop kafka server

2016-12-13 Thread huxi (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744578#comment-15744578
 ] 

huxi commented on KAFKA-4530:
-

Is it a duplicate of 
[kafka-4297|https://issues.apache.org/jira/browse/KAFKA-4297]?


> cant stop kafka server
> --
>
> Key: KAFKA-4530
> URL: https://issues.apache.org/jira/browse/KAFKA-4530
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 0.10.1.0, 0.10.0.1
>Reporter: xin
>
> The ps command can't find kafka.Kafka.
> Is the server start classpath too long for "kafka.Kafka" to be found?
> 183-manager:/home/xxx/kafka_2.10-0.10.0.1/bin # ./kafka-server-stop.sh 
> No kafka server to stop
> 183-manager:/home/xxx/kafka_2.10-0.10.0.1/bin # ps -ef|grep kafka
> root 28517 77538  7 16:07 pts/600:00:03 /usr/java/jdk/bin/java -Xmx1G 
> -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 
> -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC 
> -Djava.awt.headless=true 
> -Xloggc:/home/xxx/kafka_2.10-0.10.0.1/bin/../logs/kafkaServer-gc.log 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dkafka.logs.dir=/home/xxx/kafka_2.10-0.10.0.1/bin/../logs 
> -Dlog4j.configuration=file:./../config/log4j.properties -cp 
> .:/usr/java/jdk/lib/dt.jar:/usr/java/jdk/lib/tools.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/aopalliance-repackaged-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/argparse4j-0.5.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-api-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-file-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-json-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-runtime-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/guava-18.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-api-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-locator-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-utils-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-annotations-2.6.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-core-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-databind-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-jaxrs-base-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-jaxrs-json-provider-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-module-jaxb-annotations-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javassist-3.18.2-GA.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.annotation-api-1.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.inject-1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.inject-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.servlet-api-3.1.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.ws.rs-api-2.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-client-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-common-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-container-servlet-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-container-servlet-core-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-guava-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-media-jaxb-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-server-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-continuation-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-http-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-io-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-security-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-server-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-servlet-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-servlets-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-util-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jopt-simple-4.9.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-clients-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-log4j-appender-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-streams-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-streams-examples-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-tools-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1-sources.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1-test-sources.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/log4j-1.2.17.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/lz4-1.3.0.jar:/home/xxx/kafka_2.

Re: [DISCUSS] KIP-48 Support for delegation tokens as an authentication mechanism

2016-12-13 Thread Manikumar
Ashish,

Thank you for reviewing the KIP.  Please see the replies inline.


> 1. How to disable delegation token authentication?
>
> This can be achieved in various ways, however I think reusing delegation
> token secret config for this makes sense here. Avoids creating yet another
> config and forces delegation token users to consciously set the secret. If
> the secret is not set or set to empty string, brokers should turn off
> delegation token support. This will however require a new error code to
> indicate delegation token support is turned off on broker.
>

  Thanks for the suggestion. An option to turn off delegation token
authentication will be useful.
  I'll update the KIP.


>
> 2. ACLs on delegation token?
>
> Do we need to have ACLs defined for tokens? I do not think it buys us
> anything, as delegation token can be treated as impersonation of the owner.
> Any thing the owner has permission to do, delegation tokens should be
> allowed to do as well. If so, we probably won't need to return
> authorization exception error code while creating delegation token. It
> however would make sense to check renew and expire requests are coming from
> owner or renewers of the token, but that does not require explicit acls.
>


Yes, we agreed not to add a new ACL for who can request delegation tokens.
 I'll update the KIP.


>
> 3. How to restrict max life time of a token?
>
> Admins might want to restrict max life time of tokens created on a cluster,
> and this can very from cluster to cluster based on use-cases. This might
> warrant a separate broker config.
>
>
Currently we have the "delegation.token.max.lifetime.sec" server config
property.
Maybe we can take min(user-supplied max time, server max time) as the max
lifetime.
I am open to adding a new config property.
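
A tiny, purely illustrative sketch of that arithmetic (not the KIP-48
implementation; the only name taken from the thread is the server-side max
lifetime config, and the renewal rule is an assumption):

public class TokenLifetimeSketch {
    // Effective hard cap: min(user-supplied max, server max from
    // delegation.token.max.lifetime.sec), treating a non-positive request
    // as "use the server maximum".
    static long maxLifetimeMs(long requestedMaxMs, long serverMaxMs) {
        return requestedMaxMs <= 0 ? serverMaxMs : Math.min(requestedMaxMs, serverMaxMs);
    }

    // Expiry after a renewal: extend by the renew period, but never past
    // issueTime + maxLifetime.
    static long expiryAfterRenewMs(long nowMs, long renewPeriodMs, long issueMs, long maxLifetimeMs) {
        return Math.min(nowMs + renewPeriodMs, issueMs + maxLifetimeMs);
    }
}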

> Few more comments based on recent KIP update.
>
> 1. Do we need a separate {{InvalidateTokenRequest}}? Can't we use
> {{ExpireTokenRequest}} with with expiryDate set to anything before current
> date?
>

Makes sense. We don't need a special request to cancel the token; we can use
ExpireTokenRequest.
I'll update the KIP.


> 2. Can we change time field names to indicate their unit is milliseconds,
> like, IssueDateMs, ExpiryDateMs, etc.?
>
>
  Done.


> 3. Can we allow users to renew a token for a specified amount of time? In
> current version of KIP, renew request does not take time as a param, not
> sure what is expiry time set to after renewal.
>
>
 Yes, we need to specify the renew period. I'll update the KIP.


Thanks,
Manikumar



>
> On Mon, Dec 12, 2016 at 9:08 AM Manikumar 
> wrote:
>
> > Hi,
> >
> >
> >
> > I would like to reinitiate the discussion on Delegation token support for
> >
> > Kafka.
> >
> >
> >
> > Brief summary of the past discussion:
> >
> >
> >
> > 1) Broker stores delegation tokens in zookeeper.  All brokers will have a
> >
> > cache backed by
> >
> >zookeeper so they will all get notified whenever a new token is
> >
> > generated and they will
> >
> >update their local cache whenever token state changes.
> >
> > 2) The current proposal does not support rotation of secret
> >
> > 3) Only allow the renewal by users that authenticated using *non*
> >
> > delegation token mechanism
> >
> > 4) KIP-84 proposes to support  SASL SCRAM mechanisms. Kafka clients can
> >
> > authenticate using
> >
> >SCRAM-SHA-256, providing the delegation token HMAC as password.
> >
> >
> >
> > Updated the KIP with the following:
> >
> > 1. Protocol and Config changes
> >
> > 2. format of the data stored in ZK.
> >
> > 3. Changes to Java Clients/Usage of SASL SCRAM mechanism
> >
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >
> > 48+Delegation+token+support+for+Kafka
> >
> >
> >
> >
> >
> > Jun, Ashish, Gwen,
> >
> >
> >
> > Pl review the updated KIP.
> >
> >
> >
> > Thanks
> >
> > Manikumar
> >
> >
> >
> >
> >
> > On Thu, Sep 29, 2016 at 9:56 PM, Ashish Singh 
> wrote:
> >
> >
> >
> > > Harsha/ Gwen,
> >
> > >
> >
> > > How do we proceed here? I am willing to help out here.
> >
> > >
> >
> > > On Fri, Sep 23, 2016 at 11:41 AM, Gwen Shapira 
> > wrote:
> >
> > >
> >
> > > > Is it updated? are all concerns addressed? do you want to start a
> vote?
> >
> > > >
> >
> > > > Sorry for being pushy, I do appreciate that we are all volunteers and
> >
> > > > finding time is difficult. This feature is important for anything
> that
> >
> > > > integrates with Kafka (stream processors, Flume, NiFi, etc) and I
> >
> > > > don't want to see this getting stuck because we lack coordination
> >
> > > > within the community.
> >
> > > >
> >
> > > > On Thu, Sep 15, 2016 at 6:39 PM, Harsha Chintalapani <
> ka...@harsha.io>
> >
> > > > wrote:
> >
> > > > > The only pending update for the KIP is to write up the protocol
> > changes
> >
> > > > like
> >
> > > > > we've it KIP-4. I'll update the wiki.
> >
> > > > >
> >
> > > > >
> >
> > > > > On Thu, Sep 15, 2016 at 4:27 PM Ashish Singh 
> >
> > > > wrote:
> >
> > > > >>
> >
> > > > >> I think we decid

[jira] [Commented] (KAFKA-4530) cant stop kafka server

2016-12-13 Thread xin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744587#comment-15744587
 ] 

xin commented on KAFKA-4530:


yeah, duplicate 

> cant stop kafka server
> --
>
> Key: KAFKA-4530
> URL: https://issues.apache.org/jira/browse/KAFKA-4530
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 0.10.1.0, 0.10.0.1
>Reporter: xin
>
> The ps command can't find kafka.Kafka.
> Is the server start classpath too long for "kafka.Kafka" to be found?
> 183-manager:/home/xxx/kafka_2.10-0.10.0.1/bin # ./kafka-server-stop.sh 
> No kafka server to stop
> 183-manager:/home/xxx/kafka_2.10-0.10.0.1/bin # ps -ef|grep kafka
> root 28517 77538  7 16:07 pts/600:00:03 /usr/java/jdk/bin/java -Xmx1G 
> -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 
> -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC 
> -Djava.awt.headless=true 
> -Xloggc:/home/xxx/kafka_2.10-0.10.0.1/bin/../logs/kafkaServer-gc.log 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
> -Dcom.sun.management.jmxremote 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dkafka.logs.dir=/home/xxx/kafka_2.10-0.10.0.1/bin/../logs 
> -Dlog4j.configuration=file:./../config/log4j.properties -cp 
> .:/usr/java/jdk/lib/dt.jar:/usr/java/jdk/lib/tools.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/aopalliance-repackaged-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/argparse4j-0.5.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-api-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-file-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-json-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/connect-runtime-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/guava-18.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-api-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-locator-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/hk2-utils-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-annotations-2.6.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-core-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-databind-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-jaxrs-base-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-jaxrs-json-provider-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jackson-module-jaxb-annotations-2.6.3.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javassist-3.18.2-GA.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.annotation-api-1.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.inject-1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.inject-2.4.0-b34.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.servlet-api-3.1.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/javax.ws.rs-api-2.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-client-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-common-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-container-servlet-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-container-servlet-core-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-guava-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-media-jaxb-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jersey-server-2.22.2.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-continuation-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-http-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-io-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-security-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-server-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-servlet-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-servlets-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jetty-util-9.2.15.v20160210.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/jopt-simple-4.9.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-clients-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-log4j-appender-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-streams-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-streams-examples-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka-tools-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1-sources.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1-test-sources.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/kafka_2.10-0.10.0.1.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/log4j-1.2.17.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/lz4-1.3.0.jar:/home/xxx/kafka_2.10-0.10.0.1/bin/../libs/metrics-core-2.2.0.jar:/home/xxx/kafka_2.10-0.1

Re: [DISCUSS] KIP-81: Max in-flight fetches

2016-12-13 Thread Ismael Juma
Makes sense Jay.

Mickael, in addition to how we can compute defaults of the other settings
from `buffer.memory`, it would be good to specify what is allowed and how
we handle the different cases (e.g. what do we do if
`max.partition.fetch.bytes`
is greater than `buffer.memory`, is that simply not allowed?).

To summarise the gap between the ideal scenario (user specifies how much
memory the consumer can use) and what is being proposed:

1. We will decompress and deserialize the data for one or more partitions
in order to return them to the user and we don't account for the increased
memory usage resulting from that. This is likely to be significant on a per
record basis, but we try to do it for the minimal number of records
possible within the constraints of the system. Currently the constraints
are: we decompress and deserialize the data for a partition at a time
(default `max.partition.fetch.bytes` is 1MB, but this is a soft limit in
case there are oversized messages) until we have enough records to
satisfy `max.poll.records`
(default 500) or there are no more completed fetches. It seems like this
may be OK for a lot of cases, but some tuning will still be required in
others.

2. We don't account for bookkeeping data structures or intermediate objects
allocated during the general operation of the consumer. Probably something
we have to live with as the cost/benefit of fixing this doesn't seem worth
it.

Ismael
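
For concreteness, a minimal sketch of the settings involved and one
possible way defaults could be derived. Note that `buffer.memory` here is
the consumer setting proposed in this thread (it does not exist in released
consumers at this point), and the derivation is only an illustration of the
kind of heuristic being discussed, not a decision.

import java.util.Properties;

public class ConsumerMemorySketch {
    static Properties withDerivedDefaults(long bufferMemoryBytes) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "example-group");
        // Proposed global bound on buffered fetch data (KIP-81).
        props.put("buffer.memory", Long.toString(bufferMemoryBytes));
        // One possible heuristic: a single fetch response never exceeds the overall
        // buffer, and the per-partition limit stays at the current 1MB default or below.
        long fetchMaxBytes = bufferMemoryBytes;
        long maxPartitionFetchBytes = Math.min(1024 * 1024, fetchMaxBytes);
        props.put("fetch.max.bytes", Long.toString(fetchMaxBytes));
        props.put("max.partition.fetch.bytes", Long.toString(maxPartitionFetchBytes));
        props.put("max.poll.records", "500");
        return props;
    }
}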

On Tue, Dec 13, 2016 at 8:34 AM, Jay Kreps  wrote:

> Hey Ismael,
>
> Yeah I think we are both saying the same thing---removing only works if you
> have a truly optimal strategy. Actually even dynamically computing a
> reasonable default isn't totally obvious (do you set fetch.max.bytes to
> equal buffer.memory to try to queue up as much data in the network buffers?
> Do you try to limit it to your socket.receive.buffer size so that you can
> read it in a single shot?).
>
> Regarding what is being measured, my interpretation was the same as yours.
> I was just adding to the previous point that buffer.memory setting would
> not be a very close proxy for memory usage. Someone was pointing out that
> compression would make this true, and I was just adding that even without
> compression the object overhead would lead to a high expansion factor.
>
> -Jay
>
> On Mon, Dec 12, 2016 at 11:53 PM, Ismael Juma  wrote:
>
> > Hi Jay,
> >
> > About `max.partition.fetch.bytes`, yes it was an oversight not to lower
> its
> > priority as part of KIP-74 given the existence of `fetch.max.bytes` and
> the
> > fact that we can now make progress in the presence of oversized messages
> > independently of either of those settings.
> >
> > I agree that we should try to set those values automatically based on
> > `buffer.memory`, but I am not sure if we can have a truly optimal
> strategy.
> > So, I'd go with reducing the priority to "low" instead of removing
> > `fetch.max.bytes` and `max.partition.fetch.bytes` altogether for now. If
> > experience in the field tells us that the auto strategy is good enough,
> we
> > can consider removing them (yes, I know, it's unlikely to happen as there
> > won't be that much motivation then).
> >
> > Regarding the "conversion from packed bytes to java objects" comment,
> that
> > raises the question: what are we actually measuring here? From the KIP,
> > it's not too clear. My interpretation was that we were not measuring the
> > memory usage of the Java objects. In that case, `buffer.memory` seems
> like
> > a reasonable name although perhaps the user's expectation is that we
> would
> > measure the memory usage of the Java objects?
> >
> > Ismael
> >
> > On Tue, Dec 13, 2016 at 6:21 AM, Jay Kreps  wrote:
> >
> > > I think the question is whether we have a truly optimal strategy for
> > > deriving the partition- and fetch-level configs from the global
> setting.
> > If
> > > we do then we should just get rid of them. If not, then if we can at
> > least
> > > derive usually good and never terrible settings from the global limit
> at
> > > initialization time maybe we can set them automatically unless the user
> > > overrides with an explicit conifg. Even the latter would let us mark it
> > low
> > > priority which at least takes it off the list of things you have to
> grok
> > to
> > > use the consumer which I suspect would be much appreciated by our poor
> > > users.
> > >
> > > Regardless it'd be nice to make sure we get an explanation of the
> > > relationships between the remaining memory configs in the KIP and in
> the
> > > docs.
> > >
> > > I agree that buffer.memory isn't bad.
> > >
> > > -Jay
> > >
> > >
> > > On Mon, Dec 12, 2016 at 2:56 PM, Jason Gustafson 
> > > wrote:
> > >
> > > > Yeah, that's a good point. Perhaps in retrospect, it would have been
> > > better
> > > > to define `buffer.memory` first and let `fetch.max.bytes` be based
> off
> > of
> > > > it. I like `buffer.memory` since it gives the consumer nice symmetry
> > with
> > > > the producer and its generic n

Improve default Kafka logger settings to prevent extremely high disk space usage issue?

2016-12-13 Thread Jaikiran Pai
We happened to run into a disk space usage issue with Kafka 0.10.0.1 
(the version we are using) on one of our production setups this morning. 
It turns out that (log4j) logging from Kafka ended up using 81G and more 
of disk space. Looking at the files, I see the controller.log itself is 
30G and more (for a single day). Looking at the default log4j.properties 
that's shipped with Kafka, it uses the DailyRollingFileAppender, which is 
one of the things that contributes to this issue. I see that there's 
already a patch and JIRA to fix this: 
https://issues.apache.org/jira/browse/KAFKA-2394. It's been marked for 
0.11 because there wasn't a clear decision on when to ship it.


Given that we have been going through 0.10.x releases these days and the 
0.11 release looks to be a while away, is there any chance this specific 
JIRA can make it into a 0.10.x release? I personally don't see any 
compatibility issues it would introduce when it comes to the 
*functionality/features* of Kafka itself, so I am not sure it's that big a 
change to wait all the way until 0.11. Furthermore, since the default 
shipped setting can cause issues like the one I noted, I think it probably 
would be fine to include it in one of the 0.10.x releases. Of course, we 
ourselves can set up the logging config on our setup to use a size-based 
rolling file appender instead of the one shipped by default (a sketch of 
such a config follows this message), but if this is something that can 
make it into 0.10.x, we would like to avoid doing that customization 
ourselves.


That's one part of the issue. The other is, I see this in the default 
shipped log4j.properties:



log4j.logger.kafka.controller=TRACE, controllerAppender
log4j.additivity.kafka.controller=false

log4j.logger.state.change.logger=TRACE, stateChangeAppender
log4j.additivity.state.change.logger=false


Is it intentional to have this at TRACE level for the default shipped 
config instead of having something like INFO or maybe DEBUG?



-Jaikiran
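
For reference, here is one possible size-based alternative to the shipped
controller appender, as mentioned above. This is illustrative only: the
real change is tracked in KAFKA-2394, and the 100MB/10-file cap is an
arbitrary choice, not a recommendation from that JIRA.

log4j.appender.controllerAppender=org.apache.log4j.RollingFileAppender
log4j.appender.controllerAppender.File=${kafka.logs.dir}/controller.log
log4j.appender.controllerAppender.MaxFileSize=100MB
log4j.appender.controllerAppender.MaxBackupIndex=10
log4j.appender.controllerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.controllerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n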


[jira] [Commented] (KAFKA-4505) Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0

2016-12-13 Thread Romaric Parmentier (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744686#comment-15744686
 ] 

Romaric Parmentier commented on KAFKA-4505:
---

Any idea ?


> Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0
> -
>
> Key: KAFKA-4505
> URL: https://issues.apache.org/jira/browse/KAFKA-4505
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, metrics, offset manager
>Affects Versions: 0.10.1.0
>Reporter: Romaric Parmentier
>Priority: Critical
>
> We were using Kafka 0.8.1.1 and we just migrated to version 0.10.1.0.
> Since we migrated, we have been using the new script kafka-consumer-groups.sh to 
> retrieve topic lags, but it doesn't seem to work anymore. 
> Because the application is using the 0.8 driver, we have added the following 
> conf to each Kafka server:
> log.message.format.version=0.8.2
> inter.broker.protocol.version=0.10.0.0
> When I'm using the option --list with kafka-consumer-groups.sh I can see 
> every consumer group I'm using, but --describe is not working:
> /usr/share/kafka$ bin/kafka-consumer-groups.sh --zookeeper ip:2181 --describe 
> --group group_name
> No topic available for consumer group provided
> GROUP  TOPIC  PARTITION  
> CURRENT-OFFSET  LOG-END-OFFSET  LAG OWNER
> When I'm looking into zookeeper I can see the offset increasing for this 
> consumer group.
> Any idea ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Improve default Kafka logger settings to prevent extremely high disk space usage issue?

2016-12-13 Thread Jaikiran Pai



On Tuesday 13 December 2016 03:02 PM, Jaikiran Pai wrote:


log4j.logger.kafka.controller=TRACE, controllerAppender
log4j.additivity.kafka.controller=false

log4j.logger.state.change.logger=TRACE, stateChangeAppender
log4j.additivity.state.change.logger=false


Is it intentional to have this at TRACE level for the default shipped 
config instead of having something like INFO or maybe DEBUG?



-Jaikiran
Having now looked at one of the log messages in that controller.log, I 
don't think DEBUG level would be a good default either, since at DEBUG level 
it very often logs messages which dump all the topic names and the broker 
ids associated with them. Something like:


[2016-12-13 09:34:53,869] DEBUG [Controller 0]: preferred replicas by 
broker Map.


(Note: I couldn't paste the exact log message, which was large, because 
there wasn't an easy way to remove the references to our application-specific 
topic names from it.)


That single log message itself is huge every time since we have a large 
number of topics which get created dynamically for our application use 
case.


-Jaikiran


Re: Improve default Kafka logger settings to prevent extremely high disk space usage issue?

2016-12-13 Thread Ismael Juma
Hi Jaikiran,

Thanks for raising this. The compatibility issue is that the file name
pattern for the rolled files is different. Because there's a simple
workaround, it was deemed that it could wait until 0.11.0. This is actually
not far off if you consider that we have 4 active KIPs that are proposing
message format changes (a message format change implies a major version
bump).

The log config settings for the controller and state change logger have
been that way since they were introduced. They're generally useful when
investigating issues with the controller. Looks like this is too noisy in
some scenarios though. It may be worth filing a JIRA with specifics of your
use case to see if something can be done to improve that.

Ismael

On Tue, Dec 13, 2016 at 9:32 AM, Jaikiran Pai 
wrote:

> We happened to run into a disk space usage issue with Kafka 0.10.0.1 (the
> version we are using) on one of our production setups this morning. Turns
> out (log4j) logging from Kafka ended up using 81G and more of disk space.
> Looking at the files, I see the controller.log itself is 30G and more (for
> a single day). Looking at the default log4j.properties that's shipped in
> Kafka, it uses the DailyRollingFileAppender which is one of the things that
> contributes to this issue. I see that there's already a patch and JIRA to
> fix this https://issues.apache.org/jira/browse/KAFKA-2394. It's been
> marked for 0.11 because there wasn't a clear decision when to ship it.
>
> Given that we have been going through 0.10.x releases these days and the
> 0.11 release looking a bit away, is there any chance, this specific JIRA
> can make it to 0.10.x? I personally don't see any compatibility issues that
> it will introduce when it comes to *functionality/features* of Kafka
> itself, so I am not sure if it's that big a change to wait all the way till
> 0.11. Furthermore, since the default shipped setting can cause issues like
> the one I noted, I think it probably would be fine to include it in one of
> the 0.10.x releases. Of course, we ourselves can setup the logging config
> on our setup to use a size based rolling file config instead of the one
> shipped by default, but if this is something that can make it to 0.10.x, we
> would like to avoid doing this customization ourselves.
>
> That's one part of the issue. The other is, I see this in the default
> shipped log4j.properties:
>
>
> log4j.logger.kafka.controller=TRACE, controllerAppender
> log4j.additivity.kafka.controller=false
>
> log4j.logger.state.change.logger=TRACE, stateChangeAppender
> log4j.additivity.state.change.logger=false
>
>
> Is it intentional to have this at TRACE level for the default shipped
> config instead of having something like INFO or maybe DEBUG?
>
>
> -Jaikiran
>


[jira] [Commented] (KAFKA-4522) Using Disruptor instead of Array Blocking queue in Kafka Producer

2016-12-13 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744753#comment-15744753
 ] 

Ismael Juma commented on KAFKA-4522:


JCTools is also an option:

http://psy-lob-saw.blogspot.com/2016/10/linked-array-queues-part-1-spsc.html

> Using Disruptor instead of Array Blocking queue in Kafka Producer
> -
>
> Key: KAFKA-4522
> URL: https://issues.apache.org/jira/browse/KAFKA-4522
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Reporter: Pratik kumar
>
> The Kafka producer currently uses Java's ArrayBlockingQueue to store outbound 
> Kafka messages before batching them in async mode. In case of a high production 
> rate of Kafka messages, this adds lock contention for the user and is 
> generally hidden from the user (speaking from personal experience).
> Usage of the LMAX Disruptor could reduce the lock contention overhead imposed 
> by the Kafka producer.
> LMAX Disruptor -> https://github.com/LMAX-Exchange/disruptor
> Also, can someone help me understand whether the blocking queue gives any 
> guarantees inherent to Kafka's design (and hence is irreplaceable)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2249: KAFKA-4473: RecordCollector should handle retriabl...

2016-12-13 Thread dguy
GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/2249

KAFKA-4473: RecordCollector should handle retriable exceptions more strictly

The `RecordCollectorImpl` currently drops messages on the floor if an 
exception is non-null in the producer callback. This will result in message 
loss and violates at-least-once processing.
Rather than just log an error in the callback, save the exception in a 
field. On subsequent calls to `send`, `flush`, `close`, first check for the 
existence of an exception and throw a `StreamsException` if it is non-null. 
Also, in the callback, if an exception has already occurred, the `offsets` map 
should not be updated.
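
As a rough sketch of the idea (the class, field and method names below are 
illustrative only, not the actual `RecordCollectorImpl` code):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.errors.StreamsException;

class FailFastCollectorSketch {
    private final Map<TopicPartition, Long> offsets = new HashMap<>();
    private volatile StreamsException sendException; // first failure seen in a callback

    Callback callback() {
        return (metadata, exception) -> {
            if (exception == null) {
                if (sendException == null) // once we have failed, stop updating offsets
                    offsets.put(new TopicPartition(metadata.topic(), metadata.partition()),
                                metadata.offset());
            } else if (sendException == null) {
                sendException = new StreamsException("error sending record", exception);
            }
        };
    }

    // called at the start of send(), flush() and close()
    void checkForException() {
        if (sendException != null)
            throw sendException;
    }
}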

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-4473

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2249.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2249


commit 9ceaa67a9967edfa7522bb9daea64c1bc44738c0
Author: Damian Guy 
Date:   2016-12-12T17:47:27Z

throw exception in record collector instead of just dropping messages on 
the floor




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4473) RecordCollector should handle retriable exceptions more strictly

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744756#comment-15744756
 ] 

ASF GitHub Bot commented on KAFKA-4473:
---

GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/2249

KAFKA-4473: RecordCollector should handle retriable exceptions more strictly

The `RecordCollectorImpl` currently drops messages on the floor if an 
exception is non-null in the producer callback. This will result in message 
loss and violates at-least-once processing.
Rather than just log an error in the callback, save the exception in a 
field. On subsequent calls to `send`, `flush`, `close`, first check for the 
existence of an exception and throw a `StreamsException` if it is non-null. 
Also, in the callback, if an exception has already occurred, the `offsets` map 
should not be updated.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-4473

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2249.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2249


commit 9ceaa67a9967edfa7522bb9daea64c1bc44738c0
Author: Damian Guy 
Date:   2016-12-12T17:47:27Z

throw exception in record collector instead of just dropping messages on 
the floor




> RecordCollector should handle retriable exceptions more strictly
> 
>
> Key: KAFKA-4473
> URL: https://issues.apache.org/jira/browse/KAFKA-4473
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Thomas Schulz
>Assignee: Damian Guy
>Priority: Critical
>  Labels: architecture
>
> see: https://groups.google.com/forum/#!topic/confluent-platform/DT5bk1oCVk8
> There is probably a bug in the RecordCollector as described in my detailed 
> Cluster test published in the aforementioned post.
> The class RecordCollector has the following behavior:
> - if there is no exception, add the message offset to a map
> - otherwise, do not add the message offset and instead log the above statement
> Is it possible that this offset map contains the latest offset to commit? If 
> so, a message that fails might be overridden by a successful (later) message 
> and the consumer would commit every message up to the latest offset?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

2016-12-13 Thread Michael Pearce
Hi Ewen, Jay,

We discussed this internally. If this is the reason for not giving a +1 to the 
below, we assume you want a solution where we can support all four combinations on day one.

To do this, our idea is to simply leave the existing compaction policy alone 
(it continues to work based on null values) and instead add a new policy, 
“tombstone-compaction” for want of a better name.

As such, we only need to deal with new clients and flows, which essentially 
does away with any need for a migration or any compromise due to the 
historic compaction behaviour.

Whilst this does deviate from the VOTE (it’s unfortunate these concerns were 
not raised during the lengthy DISCUSS), it’s worth discussing in order to get 
the votes needed, as we’d like to make this for the Jan release.

Thoughts? Would having a new policy and no migration but supporting all four 
combinations be preferred over updating the compaction policy and mechanics?

@Mayuresh, Radai, Becket, Jun: as you’ve all +1 voted on the existing solution 
(doing this by upgrading the existing compaction policy), would any of you 
have concerns with the alternative?

Regards,
Mike

On 11/12/2016, 23:33, "Michael Pearce"  wrote:

Hi Ewen

So this is what got discussed in the discussion thread. We wanted to go for 
supporting all four combinations

Tombstone + non null value
Tombstone + null value
Non tombstone + non null value
Non tombstone + null value

The issues were mostly around compatibility with old consumers.

In the end, the consensus was that, for reasons of upgrade/compatibility, we 
would get to the position of handling the following in this KIP:

Tombstone + null value
Tombstone + non null value
Non tombstone + non null value

Only later, once we can either break client compatibility or enough time has 
passed that we can say consumers should no longer infer deletion from a null 
value but from the isTombstone flag, can we support:

Non tombstone + null value.

That said, this puts in place a lot of the work needed to enable the last piece.

Also, re having the change events that make up the state (aka a compacted 
topic): doing what we're doing would give this for the supported combinations 
mentioned above.

Ideally I'd love to have the ability to support non-tombstone + null value; 
unfortunately this is some of the tech debt in the original compaction design 
that we can't easily undo without a period of time.

Regards
Mike

Sent using OWA for iPhone

From: Ewen Cheslack-Postava 
Sent: Sunday, December 11, 2016 11:15:42 PM
To: dev@kafka.apache.org
Subject: Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

Michael,

It kind of depends on how you want to interpret the tombstone flag. If it's
purely a producer-facing Kafka-level thing that we treat as internal to the
broker and log cleaner once the record is sent, then your approach makes
sense. You're just copying the null-indicates-delete behavior of the
old constructor into the tombstone flag.

However, if you want this change to more generally decouple the idea of
deletion and null values, then you are sometimes converting what might be a
completely valid null value that doesn't indicate deletion into a
tombstone. Downstream applications could potentially handle these cases
differently given the separation of deletion from value.

I guess the question is if we want to try to support the latter even for
topics where we have older produce requests. An example where this could
come up is in something like a CDC Connector. If we try to support the
semantic difference, a connector might write changes to Kafka using the
tombstone flag to indicate when a row was truly deleted (vs an update that
sets it to null but still present; this probably makes more sense for CDC
from document stores or extracting single columns). There are various
reasons we might want to maintain the full log and not turn compaction on
(or just use a time-based retention policy), but downstream applications
might care to know the difference between a delete and a null value. In
fact both versions of the same log (compacted and time-retention) could be
useful and I don't think it'll be uncommon to maintain both or use KIP-71
to maintain a hybrid compacted/retention topic.

-Ewen

On Sun, Dec 11, 2016 at 1:18 PM, Michael Pearce 
wrote:

> Hi Jay,
>
> Why wouldn't that work, the tombstone value is only looked at by the
> broker, on a topic configured for compaction as such is benign on non
> compacted topics. This is as much as sending a null value currently
>
>
> Regards
> Mike
>
>
>
> Sent using OWA for iPhone
> 
> From: Jay Kreps 
> Sent: Sunday, December 11, 2016 8:58:53 PM
> To

[jira] [Updated] (KAFKA-4405) Avoid calling pollNoWakeup unnecessarily

2016-12-13 Thread Eno Thereska (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eno Thereska updated KAFKA-4405:

Summary: Avoid calling pollNoWakeup unnecessarily  (was: Kafka consumer 
improperly send prefetch request)

> Avoid calling pollNoWakeup unnecessarily
> 
>
> Key: KAFKA-4405
> URL: https://issues.apache.org/jira/browse/KAFKA-4405
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.1
>Reporter: ysysberserk
>Assignee: Eno Thereska
> Fix For: 0.10.2.0
>
>
> Now kafka consumer has added max.poll.records to limit the count of messages 
> return by poll().
> According to KIP-41, to implement  max.poll.records, the prefetch request 
> should only be sent when the total number of retained records is less than 
> max.poll.records.
> But in the code of 0.10.0.1 , the consumer will send a prefetch request if it 
> retained any records and never check if total number of retained records is 
> less than max.poll.records..
> If max.poll.records is set to a count much less than the count of message 
> fetched , the poll() loop will send a lot of requests than expected and will 
> have more and more records fetched and stored in memory before they can be 
> consumed.
> So before sending a  prefetch request , the consumer must check if total 
> number of retained records is less than max.poll.records.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4405) Avoid calling pollNoWakeup unnecessarily

2016-12-13 Thread Eno Thereska (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eno Thereska updated KAFKA-4405:

Description: 
In KafkaConsumer:poll, the code always calls "pollNoWakeup", which turns out to 
be expensive. When max.poll.records=1, for example, that call adds about 50% 
performance overhead. The code should avoid calling that function unnecessarily 
when there are no outstanding prefetches.


Old JIRA description (discarded because turned out not to be the case):
---
Now kafka consumer has added max.poll.records to limit the count of messages 
return by poll().

According to KIP-41, to implement  max.poll.records, the prefetch request 
should only be sent when the total number of retained records is less than 
max.poll.records.

But in the code of 0.10.0.1 , the consumer will send a prefetch request if it 
retained any records and never check if total number of retained records is 
less than max.poll.records..

If max.poll.records is set to a count much less than the count of message 
fetched , the poll() loop will send a lot of requests than expected and will 
have more and more records fetched and stored in memory before they can be 
consumed.

So before sending a  prefetch request , the consumer must check if total number 
of retained records is less than max.poll.records.


  was:
Now kafka consumer has added max.poll.records to limit the count of messages 
return by poll().

According to KIP-41, to implement  max.poll.records, the prefetch request 
should only be sent when the total number of retained records is less than 
max.poll.records.

But in the code of 0.10.0.1 , the consumer will send a prefetch request if it 
retained any records and never check if total number of retained records is 
less than max.poll.records..

If max.poll.records is set to a count much less than the count of message 
fetched , the poll() loop will send a lot of requests than expected and will 
have more and more records fetched and stored in memory before they can be 
consumed.

So before sending a  prefetch request , the consumer must check if total number 
of retained records is less than max.poll.records.



> Avoid calling pollNoWakeup unnecessarily
> 
>
> Key: KAFKA-4405
> URL: https://issues.apache.org/jira/browse/KAFKA-4405
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.1
>Reporter: ysysberserk
>Assignee: Eno Thereska
> Fix For: 0.10.2.0
>
>
> In KafkaConsumer:poll, the code always calls "pollNoWakeup", which turns out 
> to be expensive. When max.poll.records=1, for example, that call adds about 
> 50% performance overhead. The code should avoid avoid that function 
> unnecessarily when there are no outstanding prefetches.
> Old JIRA description (discarded because turned out not to be the case):
> ---
> Now kafka consumer has added max.poll.records to limit the count of messages 
> return by poll().
> According to KIP-41, to implement  max.poll.records, the prefetch request 
> should only be sent when the total number of retained records is less than 
> max.poll.records.
> But in the code of 0.10.0.1 , the consumer will send a prefetch request if it 
> retained any records and never check if total number of retained records is 
> less than max.poll.records..
> If max.poll.records is set to a count much less than the count of message 
> fetched , the poll() loop will send a lot of requests than expected and will 
> have more and more records fetched and stored in memory before they can be 
> consumed.
> So before sending a  prefetch request , the consumer must check if total 
> number of retained records is less than max.poll.records.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4405) Avoid calling pollNoWakeup unnecessarily

2016-12-13 Thread Eno Thereska (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744924#comment-15744924
 ] 

Eno Thereska commented on KAFKA-4405:
-

Done, thanks [~ijuma]

> Avoid calling pollNoWakeup unnecessarily
> 
>
> Key: KAFKA-4405
> URL: https://issues.apache.org/jira/browse/KAFKA-4405
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.1
>Reporter: ysysberserk
>Assignee: Eno Thereska
> Fix For: 0.10.2.0
>
>
> In KafkaConsumer:poll, the code always calls "pollNoWakeup", which turns out 
> to be expensive. When max.poll.records=1, for example, that call adds about 
> 50% performance overhead. The code should avoid avoid that function 
> unnecessarily when there are no outstanding prefetches.
> Old JIRA description (discarded because turned out not to be the case):
> ---
> Now kafka consumer has added max.poll.records to limit the count of messages 
> return by poll().
> According to KIP-41, to implement  max.poll.records, the prefetch request 
> should only be sent when the total number of retained records is less than 
> max.poll.records.
> But in the code of 0.10.0.1 , the consumer will send a prefetch request if it 
> retained any records and never check if total number of retained records is 
> less than max.poll.records..
> If max.poll.records is set to a count much less than the count of message 
> fetched , the poll() loop will send a lot of requests than expected and will 
> have more and more records fetched and stored in memory before they can be 
> consumed.
> So before sending a  prefetch request , the consumer must check if total 
> number of retained records is less than max.poll.records.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

2016-12-13 Thread Ismael Juma
Yes, this is actually tricky to do in a way where we both have the desired
semantics and maintain compatibility. When someone creates a
`ProducerRecord` with a `null` value today, the producer doesn't know if
it's meant to be a tombstone or not. For V3 messages, it's easy when the
constructor that takes a tombstone is used. However, if any other
constructor is used, it's not clear. A couple of options I can think of,
none of them particularly nice:

1. Have a third state where tombstone = unknown and the broker would set
the tombstone bit if the value was null and the topic was compacted. People
that wanted to pass a non-null value for the tombstone would have to use
the constructor that takes a tombstone. The drawbacks: third state for
tombstone in message format, message conversion at the broker for a common
case.

2. Extend MetadataResponse to optionally include topic configs, which would
make it possible for the producer to be smarter about setting the
tombstone. It would only do it if a tombstone was not passed explicitly,
the value was null and the topic was compacted. The main drawback is that
the producer would be getting a bit more data for each topic even though it
probably won't use it most of the time. Extending MetadataResponse to
return topic configs would be useful for other reasons as well, so that
part seems OK.

In addition, for both proposals, we could consider adding warnings to the
documentation that the behaviour of the constructors that don't take a
tombstone would change in the next major release so that tombstone = false.
Not sure if this would be worth it though.

Ismael

On Sun, Dec 11, 2016 at 11:15 PM, Ewen Cheslack-Postava 
wrote:

> Michael,
>
> It kind of depends on how you want to interpret the tombstone flag. If it's
> purely a producer-facing Kafka-level thing that we treat as internal to the
> broker and log cleaner once the record is sent, then your approach makes
> sense. You're just moving copying the null-indicates-delete behavior of the
> old constructor into the tombstone flag.
>
> However, if you want this change to more generally decouple the idea of
> deletion and null values, then you are sometimes converting what might be a
> completely valid null value that doesn't indicate deletion into a
> tombstone. Downstream applications could potentially handle these cases
> differently given the separation of deletion from value.
>
> I guess the question is if we want to try to support the latter even for
> topics where we have older produce requests. An example where this could
> come up is in something like a CDC Connector. If we try to support the
> semantic difference, a connector might write changes to Kafka using the
> tombstone flag to indicate when a row was truly deleted (vs an update that
> sets it to null but still present; this probably makes more sense for CDC
> from document stores or extracting single columns). There are various
> reasons we might want to maintain the full log and not turn compaction on
> (or just use a time-based retention policy), but downstream applications
> might care to know the difference between a delete and a null value. In
> fact both versions of the same log (compacted and time-retention) could be
> useful and I don't think it'll be uncommon to maintain both or use KIP-71
> to maintain a hybrid compacted/retention topic.
>
> -Ewen
>
> On Sun, Dec 11, 2016 at 1:18 PM, Michael Pearce 
> wrote:
>
> > Hi Jay,
> >
> > Why wouldn't that work, the tombstone value is only looked at by the
> > broker, on a topic configured for compaction as such is benign on non
> > compacted topics. This is as much as sending a null value currently
> >
> >
> > Regards
> > Mike
> >
> >
> >
> > Sent using OWA for iPhone
> > 
> > From: Jay Kreps 
> > Sent: Sunday, December 11, 2016 8:58:53 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag
> >
> > Hey Michael,
> >
> > I'm not quite sure that works as that would translate ALL null values to
> > tombstones, even for non-compacted topics that use null as an acceptable
> > value sent by the producer and expected by the consumer.
> >
> > -Jay
> >
> > On Sun, Dec 11, 2016 at 3:26 AM, Michael Pearce 
> > wrote:
> >
> > > Hi Ewen,
> > >
> > > I think the easiest way to show this is with code.
> > >
> > > As you can see we keep the existing behaviour for code/binaries calling
> > > the pre-existing constructors, whereby if the value is null the
> tombstone
> > > is set to true.
> > >
> > > Regards
> > > Mike
> > >
> > >
> > >
> > > /**
> > >  * Creates a record with a specified timestamp to be sent to a
> > > specified topic and partition
> > >  *
> > >  * @param topic The topic the record will be appended to
> > >  * @param partition The partition to which the record should be
> sent
> > >  * @param timestamp The timestamp of the record
> > >  * @param tombstone if the record should be treated as a tom

[DISCUSS] Dormant/Inactive KIPs

2016-12-13 Thread Ismael Juma
Hi all,

A while back Grant proposed moving inactive/dormant KIPs to a separate
table in the wiki. I think this is a good idea as it will make it easier
for people to see the KIPs that are actually active. The list that Grant
proposed then was:

> - KIP-6 - New reassignment partition logic for rebalancing (dormant)
> - KIP-14 - Tools standardization (dormant)
> - KIP-17 - Add HighwaterMarkOffset to OffsetFetchResponse (dormant)
> - KIP-18 - JBOD Support (dormant)
> - KIP-23 - Add JSON/CSV output and looping options to ConsumerGroupCommand (dormant)
> - KIP-27 - Conditional Publish (dormant)
> - KIP-30 - Allow for brokers to have plug-able consensus and meta data storage sub systems (dormant)
> - KIP-39 - Pinning controller to broker (dormant)
> - KIP-44 - Allow Kafka to have a customized security protocol (dormant)
> - KIP-46 - Self Healing (dormant)
> - KIP-47 - Add timestamp-based log deletion policy (blocked - by KIP-33)
> - KIP-53 - Add custom policies for reconnect attempts to NetworkClient
> - KIP-58 - Make Log Compaction Point Configurable (blocked - by KIP-33)
> - KIP-61 - Add a log retention parameter for maximum disk space usage percentage (dormant)
> - KIP-68 - Add a consumed log retention before log retention (dormant)


KIP-58 was adopted since then and it probably makes sense to add KIP-10 to
the list.

Are people OK with this? Feel free to suggest KIPs that should or should
not be in the inactive/dormant list.

Ismael


[jira] [Created] (KAFKA-4531) Rationalise client configuration validation

2016-12-13 Thread Edoardo Comar (JIRA)
Edoardo Comar created KAFKA-4531:


 Summary: Rationalise client configuration validation 
 Key: KAFKA-4531
 URL: https://issues.apache.org/jira/browse/KAFKA-4531
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Reporter: Edoardo Comar


The broker-side configuration has a {{validateValues()}} method that could be 
introduced also in the client-side {{ProducerConfig}} and {{ConsumerConfig}} 
classes.

The rationale is to centralise constraints between values, like e.g. this one 
currently in the {{KafkaConsumer}} constructor:
{code}
if (this.requestTimeoutMs <= sessionTimeOutMs || this.requestTimeoutMs <= fetchMaxWaitMs)
    throw new ConfigException(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG
            + " should be greater than " + ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG
            + " and " + ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG);
{code}

or custom validation of the provided values, e.g. this one in the 
{{KafkaProducer}}:
{code}
private static int parseAcks(String acksString) {
    try {
        return acksString.trim().equalsIgnoreCase("all") ? -1 : Integer.parseInt(acksString.trim());
    } catch (NumberFormatException e) {
        throw new ConfigException("Invalid configuration value for 'acks': " + acksString);
    }
}
{code}
Also, some new KIPs (e.g. KIP-81) propose constraints among different values,
so it would be good not to scatter them around.
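
A rough sketch of what such a centralised check might look like on the consumer side 
(the method placement and name below are just an illustration, not an existing Kafka API):

{code}
// Hypothetical hook inside ConsumerConfig, invoked from its constructor,
// mirroring the broker-side validateValues() idea.
private void validateValues() {
    int requestTimeoutMs = getInt(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG);
    int sessionTimeoutMs = getInt(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG);
    int fetchMaxWaitMs = getInt(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG);
    if (requestTimeoutMs <= sessionTimeoutMs || requestTimeoutMs <= fetchMaxWaitMs)
        throw new ConfigException(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG
                + " should be greater than " + ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG
                + " and " + ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG);
}
{code}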




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-83 - Allow multiple SASL PLAIN authenticated Java clients in a single JVM process

2016-12-13 Thread Ismael Juma
Thanks for the KIP. A few comments:

1. The suggestion is to use the JAAS config value as the key to the map in
`LoginManager`. The config value can include passwords, so we could
potentially end up leaking them if we log the keys of `LoginManager`. This
seems a bit dangerous.

2. If someone uses the same JAAS config value in two clients, they'll get
the same `JaasConfig`, which seems fine, but worth mentioning (it means
that the `JaasConfig` has to be thread-safe).

3. How big can a JAAS config get? Is it an issue to use it as a map key?
Probably not given how this is used, but worth covering in the KIP as well.

Ismael

On Tue, Sep 27, 2016 at 10:15 AM, Edoardo Comar  wrote:

> Hi,
> I had a go at a KIP that addresses this JIRA
> https://issues.apache.org/jira/browse/KAFKA-4180
> "Shared authentification with multiple actives Kafka producers/consumers"
>
> which is a limitation of the current Java client that we (IBM MessageHub)
> get asked quite often lately.
>
> We will have a go at a PR soon, just as a proof of concept, but as it
> introduces new public interfaces it needs a KIP.
>
> I'll welcome your input.
>
> Edo
> --
> Edoardo Comar
> MQ Cloud Technologies
> eco...@uk.ibm.com
> +44 (0)1962 81 5576
> IBM UK Ltd, Hursley Park, SO21 2JN
>
> IBM United Kingdom Limited Registered in England and Wales with number
> 741598 Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> 3AU
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>


Re: [DISCUSS] KIP-86: Configurable SASL callback handlers

2016-12-13 Thread Ismael Juma
Hi Rajini,

Thanks for the KIP. I think this is useful and users have asked for
something like that. I like that you have a scenarios section, do you think
you could provide a rough sketch of what a callback handler would look like
for the first 2 scenarios? They seem to be the common ones, so it would
help to see a concrete example.

Ismael

On Tue, Oct 11, 2016 at 7:28 AM, Rajini Sivaram <
rajinisiva...@googlemail.com> wrote:

> Hi all,
>
> I have just created KIP-86 make callback handlers in SASL configurable so
> that credential providers for SASL/PLAIN (and SASL/SCRAM when it is
> implemented) can be used with custom credential callbacks:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 86%3A+Configurable+SASL+callback+handlers
>
> Comments and suggestions are welcome.
>
> Thank you...
>
>
> Regards,
>
> Rajini
>


[jira] [Commented] (KAFKA-4505) Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0

2016-12-13 Thread huxi (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745241#comment-15745241
 ] 

huxi commented on KAFKA-4505:
-

I am assuming you are still using the old consumer (using 'zookeeper.connect' 
not 'bootstrap.servers'). Is the consumer application running when you check 
the znode existence?

> Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0
> -
>
> Key: KAFKA-4505
> URL: https://issues.apache.org/jira/browse/KAFKA-4505
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, metrics, offset manager
>Affects Versions: 0.10.1.0
>Reporter: Romaric Parmentier
>Priority: Critical
>
> We were using Kafka 0.8.1.1 and we just migrated to version 0.10.1.0.
> Since we migrated, we have been using the new script kafka-consumer-groups.sh to 
> retrieve topic lags, but it doesn't seem to work anymore. 
> Because the application is using the 0.8 driver, we have added the following 
> conf to each Kafka server:
> log.message.format.version=0.8.2
> inter.broker.protocol.version=0.10.0.0
> When I'm using the option --list with kafka-consumer-groups.sh I can see 
> every consumer groups I'm using but the --describe is not working:
> /usr/share/kafka$ bin/kafka-consumer-groups.sh --zookeeper ip:2181 --describe 
> --group group_name
> No topic available for consumer group provided
> GROUP  TOPIC  PARTITION  
> CURRENT-OFFSET  LOG-END-OFFSET  LAG OWNER
> When I'm looking into zookeeper I can see the offset increasing for this 
> consumer group.
> Any idea ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4505) Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0

2016-12-13 Thread Romaric Parmentier (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745266#comment-15745266
 ] 

Romaric Parmentier commented on KAFKA-4505:
---

Hi Huxi,

Yes, the application is running while I'm checking the znode existence. I 
just found a solution:
Let's say you have a topic called MyTopic and a group name called MyGroupName, 
and you cannot retrieve the lag for your group name. In this situation, when 
I do an "ls" in ZooKeeper I get:
ls /consumers/MyGroupName
> [offsets]

The solution is to create the owners and topic nodes:
create /consumers/MyGroupName/owners /MyTopic
create /consumers/MyGroupName/owners/MyTopic 0

I'm not sure it's a good solution but the describe command of 
kafka-consumer-groups.sh is now working as before.

Now, we are planning to migrate the code to use the new consumer.

> Cannot get topic lag since kafka upgrade from 0.8.1.0 to 0.10.1.0
> -
>
> Key: KAFKA-4505
> URL: https://issues.apache.org/jira/browse/KAFKA-4505
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, metrics, offset manager
>Affects Versions: 0.10.1.0
>Reporter: Romaric Parmentier
>Priority: Critical
>
> We were using Kafka 0.8.1.1 and we just migrated to version 0.10.1.0.
> Since we migrated, we have been using the new script kafka-consumer-groups.sh to 
> retrieve topic lags, but it doesn't seem to work anymore. 
> Because the application is using the 0.8 driver, we have added the following 
> conf to each Kafka server:
> log.message.format.version=0.8.2
> inter.broker.protocol.version=0.10.0.0
> When I'm using the option --list with kafka-consumer-groups.sh I can see 
> every consumer groups I'm using but the --describe is not working:
> /usr/share/kafka$ bin/kafka-consumer-groups.sh --zookeeper ip:2181 --describe 
> --group group_name
> No topic available for consumer group provided
> GROUP  TOPIC  PARTITION  
> CURRENT-OFFSET  LOG-END-OFFSET  LAG OWNER
> When I'm looking into zookeeper I can see the offset increasing for this 
> consumer group.
> Any idea ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Avi Flax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745331#comment-15745331
 ] 

Avi Flax commented on KAFKA-4437:
-

I have a comment but there doesn’t seem to be a mailing list thread yet… do you 
plan to start one soon?

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an “auto stop” feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (i.e., up to the then-current end-of-log / high watermark). 
> This allows chopping the (infinite) log into finite chunks, where each run of 
> the application processes one chunk. This feature allows for incremental 
> batch-like processing; think "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4532) StateStores can be connected to the wrong source topic

2016-12-13 Thread Damian Guy (JIRA)
Damian Guy created KAFKA-4532:
-

 Summary: StateStores can be connected to the wrong source topic
 Key: KAFKA-4532
 URL: https://issues.apache.org/jira/browse/KAFKA-4532
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
Reporter: Damian Guy
Assignee: Damian Guy
 Fix For: 0.10.2.0


When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping, for a table that is originally created 
with {{builder.table("topic", "table");}} and then subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.
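
A minimal sketch of the guard described above (names are illustrative):

{code}
// In connectProcessorAndStateStores(..): keep the original source topics for a store;
// later connections (e.g. via joins) should not overwrite the existing mapping.
if (!stateStoreNameToSourceTopics.containsKey(storeName)) {
    stateStoreNameToSourceTopics.put(storeName, sourceTopics);
}
{code}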



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-81: Max in-flight fetches

2016-12-13 Thread Mickael Maison
Thanks for all the feedback.

I've updated the KIP with all the details.
Below are a few of the main points:

- Overall memory usage of the consumer:
I made it clear that the memory pool is only used to store the raw bytes
from the network and that the decompressed/deserialized messages are
not stored in it but take up extra memory on the heap. In addition, the
consumer also keeps track of other things (in-flight requests,
subscriptions, etc.) that account for extra memory as well. So this
is not a hard memory bound, but it should still allow users to
roughly size how much memory the consumer can use.

- Relation with the existing settings:
There are already 2 settings that deal with memory usage of the
consumer. I suggest we lower the priority of
`max.partition.fetch.bytes` (I wonder if we should attempt to
deprecate it or increase its default value so it's a constraint less
likely to be hit) and give the new setting `buffer.memory` high priority.
I'm a bit unsure what the best default value for `buffer.memory` is; I
suggested 100MB in the KIP (2 x `fetch.max.bytes`), but I'd appreciate
feedback. It should always be at least equal to `fetch.max.bytes` (see
the example at the end of this list).

- Configuration name `buffer.memory`:
I think it's the name that makes the most sense. It's aligned with the
producer and as mentioned generic enough to allow future changes if
needed.

- Coordination starvation:
Yes this is a potential issue. I'd expect these requests to be small
enough to not be affected too much. If that's the case KAFKA-4137
suggests a possible fix.
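
To make the relation between these settings concrete, a consumer using the
suggested defaults would be configured roughly like this (`buffer.memory` is
the new setting proposed in this KIP; the values are only the suggested
defaults):

# existing setting introduced by KIP-74, default 50MB
fetch.max.bytes=52428800
# existing per-partition limit, default 1MB, proposed to become low priority
max.partition.fetch.bytes=1048576
# new setting proposed in this KIP, suggested default 100MB (2 x fetch.max.bytes)
buffer.memory=104857600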



On Tue, Dec 13, 2016 at 9:31 AM, Ismael Juma  wrote:
> Makes sense Jay.
>
> Mickael, in addition to how we can compute defaults of the other settings
> from `buffer.memory`, it would be good to specify what is allowed and how
> we handle the different cases (e.g. what do we do if
> `max.partition.fetch.bytes`
> is greater than `buffer.memory`, is that simply not allowed?).
>
> To summarise the gap between the ideal scenario (user specifies how much
> memory the consumer can use) and what is being proposed:
>
> 1. We will decompress and deserialize the data for one or more partitions
> in order to return them to the user and we don't account for the increased
> memory usage resulting from that. This is likely to be significant on a per
> record basis, but we try to do it for the minimal number of records
> possible within the constraints of the system. Currently the constraints
> are: we decompress and deserialize the data for a partition at a time
> (default `max.partition.fetch.bytes` is 1MB, but this is a soft limit in
> case there are oversized messages) until we have enough records to
> satisfy `max.poll.records`
> (default 500) or there are no more completed fetches. It seems like this
> may be OK for a lot of cases, but some tuning will still be required in
> others.
>
> 2. We don't account for bookkeeping data structures or intermediate objects
> allocated during the general operation of the consumer. Probably something
> we have to live with as the cost/benefit of fixing this doesn't seem worth
> it.
>
> Ismael
>
> On Tue, Dec 13, 2016 at 8:34 AM, Jay Kreps  wrote:
>
>> Hey Ismael,
>>
>> Yeah I think we are both saying the same thing---removing only works if you
>> have a truly optimal strategy. Actually even dynamically computing a
>> reasonable default isn't totally obvious (do you set fetch.max.bytes to
>> equal buffer.memory to try to queue up as much data in the network buffers?
>> Do you try to limit it to your socket.receive.buffer size so that you can
>> read it in a single shot?).
>>
>> Regarding what is being measured, my interpretation was the same as yours.
>> I was just adding to the previous point that buffer.memory setting would
>> not be a very close proxy for memory usage. Someone was pointing out that
>> compression would make this true, and I was just adding that even without
>> compression the object overhead would lead to a high expansion factor.
>>
>> -Jay
>>
>> On Mon, Dec 12, 2016 at 11:53 PM, Ismael Juma  wrote:
>>
>> > Hi Jay,
>> >
>> > About `max.partition.fetch.bytes`, yes it was an oversight not to lower
>> its
>> > priority as part of KIP-74 given the existence of `fetch.max.bytes` and
>> the
>> > fact that we can now make progress in the presence of oversized messages
>> > independently of either of those settings.
>> >
>> > I agree that we should try to set those values automatically based on
>> > `buffer.memory`, but I am not sure if we can have a truly optimal
>> strategy.
>> > So, I'd go with reducing the priority to "low" instead of removing
>> > `fetch.max.bytes` and `max.partition.fetch.bytes` altogether for now. If
>> > experience in the field tells us that the auto strategy is good enough,
>> we
>> > can consider removing them (yes, I know, it's unlikely to happen as there
>> > won't be that much motivation then).
>> >
>> > Regarding the "conversion from packed bytes to java objects" comment,
>> that
>> > raises the question: what are we actually measuring here? Fr

[jira] [Work started] (KAFKA-4532) StateStores can be connected to the wrong source topic

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4532 started by Damian Guy.
-
> StateStores can be connected to the wrong source topic
> --
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue where by this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-86: Configurable SASL callback handlers

2016-12-13 Thread Rajini Sivaram
Ismael,

Thank you for the review. I will add an example.
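
In the meantime, here is a very rough sketch of the shape such a handler could 
take for a custom SASL/PLAIN credential lookup, using only the standard 
javax.security.auth.callback API (the handler interface proposed in the KIP 
may end up looking different, and the class below is purely illustrative):

import java.io.IOException;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.callback.UnsupportedCallbackException;

// Purely illustrative: fetches the password for the authenticating user from an
// external credential store instead of the statically configured JAAS option.
public class ExternalStorePlainCallbackHandler implements CallbackHandler {

    @Override
    public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
        String username = null;
        for (Callback callback : callbacks) {
            if (callback instanceof NameCallback) {
                username = ((NameCallback) callback).getDefaultName();
            } else if (callback instanceof PasswordCallback) {
                // look the password up externally (LDAP, a database, ...) -- placeholder here
                ((PasswordCallback) callback).setPassword(lookupPassword(username));
            } else {
                throw new UnsupportedCallbackException(callback);
            }
        }
    }

    private char[] lookupPassword(String username) {
        return new char[0]; // placeholder
    }
}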

On Tue, Dec 13, 2016 at 1:07 PM, Ismael Juma  wrote:

> Hi Rajini,
>
> Thanks for the KIP. I think this is useful and users have asked for
> something like that. I like that you have a scenarios section, do you think
> you could provide a rough sketch of what a callback handler would look like
> for the first 2 scenarios? They seem to be the common ones, so it would
> help to see a concrete example.
>
> Ismael
>
> On Tue, Oct 11, 2016 at 7:28 AM, Rajini Sivaram <
> rajinisiva...@googlemail.com> wrote:
>
> > Hi all,
> >
> > I have just created KIP-86 make callback handlers in SASL configurable so
> > that credential providers for SASL/PLAIN (and SASL/SCRAM when it is
> > implemented) can be used with custom credential callbacks:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 86%3A+Configurable+SASL+callback+handlers
> >
> > Comments and suggestions are welcome.
> >
> > Thank you...
> >
> >
> > Regards,
> >
> > Rajini
> >
>



-- 
Regards,

Rajini


Dev list subscribe

2016-12-13 Thread Rajini Sivaram
Please subscribe me to the Kafka dev list.


Thank you,

Rajini


[jira] [Created] (KAFKA-4533) subscribe() then poll() on new topic is very slow when subscribed to many topics

2016-12-13 Thread Sergey Alaev (JIRA)
Sergey Alaev created KAFKA-4533:
---

 Summary: subscribe() then poll() on new topic is very slow when 
subscribed to many topics
 Key: KAFKA-4533
 URL: https://issues.apache.org/jira/browse/KAFKA-4533
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 0.10.1.0
Reporter: Sergey Alaev


Given the following case:

consumer.subscribe(my_new_topic, [249 existing topics])
publisher.send(my_new_topic, key, value)
poll(10) until data from my_new_topic arrives

I see data from `my_new_topic` only after approx. 90 seconds.

If I subscribe only to my_new_topic, I get results within seconds.

Logs contain lots of lines like this:

19:28:07.972 [kafka-thread] DEBUG 
org.apache.kafka.clients.consumer.internals.Fetcher - Resetting offset for 
partition demo.com_recipient-2-0 to earliest offset.
19:28:08.247 [kafka-thread] DEBUG 
org.apache.kafka.clients.consumer.internals.Fetcher - Fetched {timestamp=-1, 
offset=0} for partition demo.com_recipient-2-0

Probably you should do that in batch.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

2016-12-13 Thread Michael Pearce
Hi Ismael

Did you see our email this morning? What are your thoughts on this approach, 
where instead we simply have a brand new policy?

Cheers
Mike


Sent using OWA for iPhone

From: isma...@gmail.com  on behalf of Ismael Juma 

Sent: Tuesday, December 13, 2016 11:30:05 AM
To: dev@kafka.apache.org
Subject: Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag

Yes, this is actually tricky to do in a way where we both have the desired
semantics and maintain compatibility. When someone creates a
`ProducerRecord` with a `null` value today, the producer doesn't know if
it's meant to be a tombstone or not. For V3 messages, it's easy when the
constructor that takes a tombstone is used. However, if any other
constructor is used, it's not clear. A couple of options I can think of,
none of them particularly nice:

1. Have a third state where tombstone = unknown and the broker would set
the tombstone bit if the value was null and the topic was compacted. People
that wanted to pass a non-null value for the tombstone would have to use
the constructor that takes a tombstone. The drawbacks: third state for
tombstone in message format, message conversion at the broker for a common
case.

2. Extend MetadataResponse to optionally include topic configs, which would
make it possible for the producer to be smarter about setting the
tombstone. It would only do it if a tombstone was not passed explicitly,
the value was null and the topic was compacted. The main drawback is that
the producer would be getting a bit more data for each topic even though it
probably won't use it most of the time. Extending MetadataResponse to
return topic configs would be useful for other reasons as well, so that
part seems OK.

In addition, for both proposals, we could consider adding warnings to the
documentation that the behaviour of the constructors that don't take a
tombstone would change in the next major release so that tombstone = false.
Not sure if this would be worth it though.

Ismael

On Sun, Dec 11, 2016 at 11:15 PM, Ewen Cheslack-Postava 
wrote:

> Michael,
>
> It kind of depends on how you want to interpret the tombstone flag. If it's
> purely a producer-facing Kafka-level thing that we treat as internal to the
> broker and log cleaner once the record is sent, then your approach makes
> sense. You're just moving copying the null-indicates-delete behavior of the
> old constructor into the tombstone flag.
>
> However, if you want this change to more generally decouple the idea of
> deletion and null values, then you are sometimes converting what might be a
> completely valid null value that doesn't indicate deletion into a
> tombstone. Downstream applications could potentially handle these cases
> differently given the separation of deletion from value.
>
> I guess the question is if we want to try to support the latter even for
> topics where we have older produce requests. An example where this could
> come up is in something like a CDC Connector. If we try to support the
> semantic difference, a connector might write changes to Kafka using the
> tombstone flag to indicate when a row was truly deleted (vs an update that
> sets it to null but still present; this probably makes more sense for CDC
> from document stores or extracting single columns). There are various
> reasons we might want to maintain the full log and not turn compaction on
> (or just use a time-based retention policy), but downstream applications
> might care to know the difference between a delete and a null value. In
> fact both versions of the same log (compacted and time-retention) could be
> useful and I don't think it'll be uncommon to maintain both or use KIP-71
> to maintain a hybrid compacted/retention topic.
>
> -Ewen
>
> On Sun, Dec 11, 2016 at 1:18 PM, Michael Pearce 
> wrote:
>
> > Hi Jay,
> >
> > Why wouldn't that work, the tombstone value is only looked at by the
> > broker, on a topic configured for compaction as such is benign on non
> > compacted topics. This is as much as sending a null value currently
> >
> >
> > Regards
> > Mike
> >
> >
> >
> > Sent using OWA for iPhone
> > 
> > From: Jay Kreps 
> > Sent: Sunday, December 11, 2016 8:58:53 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [VOTE] KIP-87 - Add Compaction Tombstone Flag
> >
> > Hey Michael,
> >
> > I'm not quite sure that works as that would translate ALL null values to
> > tombstones, even for non-compacted topics that use null as an acceptable
> > value sent by the producer and expected by the consumer.
> >
> > -Jay
> >
> > On Sun, Dec 11, 2016 at 3:26 AM, Michael Pearce 
> > wrote:
> >
> > > Hi Ewen,
> > >
> > > I think the easiest way to show this is with code.
> > >
> > > As you can see we keep the existing behaviour for code/binaries calling
> > > the pre-existing constructors, whereby if the value is null the
> tombstone
> > > is set to true.
> > >
> > > Regards
> > > Mike
> > >
> > >
> > >
> > > /**
> 

[jira] [Updated] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated KAFKA-4532:
--
Summary: StateStores can be connected to the wrong source topic resulting 
in incorrect metadata returned from IQ  (was: StateStores can be connected to 
the wrong source topic)

> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue where by this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated KAFKA-4532:
--
Description: 
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping, for a table that is originally created 
with {{builder.table("topic", "table");}} and then subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.

This also affects Interactive Queries: the metadata returned when trying to 
find the instance for a key from the source table is incorrect, as the store 
name is mapped to the incorrect topic.

  was:
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.


> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.
> This also affects Interactive Queries. The metadata returned when trying 
> to find the instance for a key from the source table is incorrect as the 
> store name is mapped to the incorrect topic.





Re: [DISCUSS] KIP-83 - Allow multiple SASL PLAIN authenticated Java clients in a single JVM process

2016-12-13 Thread Edoardo Comar
Thanks for your review, Ismael.

First, I am no longer sure KIP-83 is worth keeping as a KIP; I created it 
just before Rajini's 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-85%3A+Dynamic+JAAS+configuration+for+Kafka+clients
With KIP-85 as presented, my proposal has become a simple JIRA; there are 
no interface changes on top of KIP-85.
So I'll have no objection if you want to retire it as part of your 
cleanup.
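For context, the KIP-85 approach referenced above moves the JAAS configuration into
a per-client property instead of the single JVM-wide java.security.auth.login.config
file. As an illustration only of the kind of configuration that KIP proposes (the
property name comes from KIP-85; the credentials here are placeholders), a SASL/PLAIN
client could then be configured along these lines:

    sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
        username="alice" \
        password="alice-secret";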

As for your comments :
1) We can change the map to use the Password object as the key in the 
LoginManager cache, so logging its contents won't leak the passwords (a rough 
sketch follows after point 3 below).
Though I can't see why we would log the contents of the cache.

2) If two clients use the same JAAS config value, they will obtain the 
same LoginManager.
No new concurrency issue would arise, as this already happens today with any 
two clients (Producers/Consumers) in the same process.

3) Based on most jaas.config samples I have seen for Kerberos and 
SASL/PLAIN, the text used as the key should be no larger than about 0.5 KB.
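The sketch mentioned in point 1 above, with illustrative names rather than the
actual LoginManager code (it assumes, as the real class does, that Password masks
its value in toString() and compares by value):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.common.config.types.Password;

    class LoginManagerCache {
        // Keyed by the JAAS config wrapped in a Password, so accidentally logging
        // the map keys does not expose credentials.
        private static final Map<Password, Object /* LoginManager */> CACHE = new HashMap<>();

        static synchronized Object acquire(String jaasConfigValue) {
            Password key = new Password(jaasConfigValue);
            Object loginManager = CACHE.get(key);
            if (loginManager == null) {
                loginManager = createLoginManager(key);
                CACHE.put(key, loginManager);
            }
            return loginManager;
        }

        private static Object createLoginManager(Password jaasConfig) {
            return new Object(); // placeholder for the real login initialisation
        }
    }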

Please let us know of any other concerns you may have, as 
IBM Message Hub is very eager to have the issue 
https://issues.apache.org/jira/browse/KAFKA-4180 merged in the next 
release (February timeframe - 0.10.2? 0.11?), 
so we're waiting for Rajini's 
https://issues.apache.org/jira/browse/KAFKA-4259, on which our changes are 
based.

thanks
Edo
--
Edoardo Comar
IBM MessageHub
eco...@uk.ibm.com
IBM UK Ltd, Hursley Park, SO21 2JN

IBM United Kingdom Limited Registered in England and Wales with number 
741598 Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 
3AU



From:   Ismael Juma 
To: dev@kafka.apache.org
Date:   13/12/2016 12:49
Subject:Re: [DISCUSS] KIP-83 - Allow multiple SASL PLAIN 
authenticated Java clients in a single JVM process
Sent by:isma...@gmail.com



Thanks for the KIP. A few comments:

1. The suggestion is to use the JAAS config value as the key to the map in
`LoginManager`. The config value can include passwords, so we could
potentially end up leaking them if we log the keys of `LoginManager`. This
seems a bit dangerous.

2. If someone uses the same JAAS config value in two clients, they'll get
the same `JaasConfig`, which seems fine, but worth mentioning (it means
that the `JaasConfig` has to be thread-safe).

3. How big can a JAAS config get? Is it an issue to use it as a map key?
Probably not given how this is used, but worth covering in the KIP as 
well.

Ismael

On Tue, Sep 27, 2016 at 10:15 AM, Edoardo Comar  wrote:

> Hi,
> I had a go at a KIP that addresses this JIRA
> https://issues.apache.org/jira/browse/KAFKA-4180
> "Shared authentification with multiple actives Kafka 
producers/consumers"
>
> which is a limitation of the current Java client that we (IBM MessageHub)
> get asked about quite often lately.
>
> We will have a go at a PR soon, just as a proof of concept, but as it
> introduces new public interfaces it needs a KIP.
>
> I'll welcome your input.
>
> Edo
> --
> Edoardo Comar
> MQ Cloud Technologies
> eco...@uk.ibm.com
> +44 (0)1962 81 5576
> IBM UK Ltd, Hursley Park, SO21 2JN
>
> IBM United Kingdom Limited Registered in England and Wales with number
> 741598 Registered office: PO Box 41, North Harbour, Portsmouth, Hants. 
PO6
> 3AU
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 
3AU
>





[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745663#comment-15745663
 ] 

Tom DeVoe edited comment on KAFKA-4477 at 12/13/16 5:20 PM:


[~junrao] I respectfully disagree, and this is why I originally was hesitant to 
post the extended logs - the shrinking ISR from 1003, 1001, 1002 happened after 
I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate any sessions expired at the time 
this issue occurred.


was (Author: tdevoe):
[~junrao] I respectfully disagree, and this is why I originally was hesitant to 
post the extended logs - the shrinking ISR from 1003, 1001, 1002 happened after 
I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate *any* sessions expired at the 
time this issue occurred.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka broker's Java process resolves the 
> issues.
> We have had this occur multiple times and, from all network and machine 
> monitoring, the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the log for the regex below and, where new_partitions is just 
> the node itself, restarts the node.
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"
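For illustration only (the actual script and its regex group names are not preserved
in this archive), the check driving that restart workaround could look like this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class IsrShrinkWatcher {
        // Hypothetical reconstruction of the pattern described above.
        private static final Pattern SHRINK = Pattern.compile(
            "INFO Partition \\[.*\\] on broker (\\d+): Shrinking ISR for partition " +
            "\\[.*\\] from (.+) to (.+) \\(kafka\\.cluster\\.Partition\\)");

        // True when the new ISR is only the broker itself, i.e. the condition that
        // triggers the automated restart.
        static boolean shouldRestart(String logLine) {
            Matcher m = SHRINK.matcher(logLine);
            return m.find() && m.group(3).trim().equals(m.group(1));
        }
    }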




[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745663#comment-15745663
 ] 

Tom DeVoe commented on KAFKA-4477:
--

[~junrao] I respectfully disagree, and this is why I originally was hesitant to 
post the extended logs - the shrinking ISR from 1003, 1001, 1002 happened after 
I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate *any* sessions expired at the 
time this issue occurred.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka broker's Java process resolves the 
> issues.
> We have had this occur multiple times and, from all network and machine 
> monitoring, the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the log for the regex below and, where new_partitions is just 
> the node itself, restarts the node.
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"





[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745672#comment-15745672
 ] 

Matthias J. Sax commented on KAFKA-4437:


There is a mailing list thread... Just forgot to update the KIP Wiki page... 
Here it goes: 
http://search-hadoop.com/m/Kafka/uyzND1YI7Uf2hpKcc?subj=+DISCUSS+KIP+95+Incremental+Batch+Processing+for+Kafka+Streams

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an “auto stop” feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (i.e., up to the then-current end of log / high watermark). 
> This allows chopping the (infinite) log into finite chunks, where each run of 
> the application processes one chunk. This feature allows for incremental 
> batch-like processing; think "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
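KIP-95 proposes building this into Streams itself. Purely as an illustration of the
idea (not the proposed API), a plain consumer can already emulate "process up to the
offsets captured at startup, then stop"; the topic name and configuration below are
placeholders:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ProcessUntilCaughtUp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "incremental-batch");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("input-topic"));
                while (consumer.assignment().isEmpty())
                    consumer.poll(100); // wait for the partition assignment

                // Capture the end offsets (high watermarks) at startup.
                Map<TopicPartition, Long> stopOffsets = consumer.endOffsets(consumer.assignment());

                boolean caughtUp = false;
                while (!caughtUp) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        // process(record) goes here
                    }
                    caughtUp = true;
                    for (Map.Entry<TopicPartition, Long> entry : stopOffsets.entrySet()) {
                        if (consumer.position(entry.getKey()) < entry.getValue()) {
                            caughtUp = false;
                            break;
                        }
                    }
                }
            }
        }
    }

A real implementation inside Streams would additionally have to commit offsets and
flush state stores cleanly before stopping, which this sketch ignores.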





Re: [DISCUSS] KIP-81: Max in-flight fetches

2016-12-13 Thread Rajini Sivaram
Coordinator starvation: For an implementation based on KIP-72, there will
be coordinator starvation without KAFKA-4137 since you would stop reading
from sockets when the memory pool is full (the fact that coordinator
messages are small doesn't help). I imagine you can work around this by
treating coordinator connections as special connections but that spills
over to common network code. A separate NetworkClient for the coordinator, as
proposed in KAFKA-4137, would be much better.
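For readers not familiar with the KIP-72 design being referenced, a rough sketch of
such a bounded memory pool (illustrative only, not the actual implementation) shows
why reads have to stop when the pool is exhausted:

    import java.nio.ByteBuffer;

    class SimpleMemoryPool {
        private final long capacity;
        private long available;

        SimpleMemoryPool(long capacity) {
            this.capacity = capacity;
            this.available = capacity;
        }

        // Returns null when the pool is exhausted; the caller must stop reading
        // from (mute) the socket and retry later.
        synchronized ByteBuffer tryAllocate(int size) {
            if (size > available)
                return null;
            available -= size;
            return ByteBuffer.allocate(size);
        }

        synchronized void release(ByteBuffer buffer) {
            available = Math.min(capacity, available + buffer.capacity());
        }
    }

That muting step is exactly where small coordinator responses can be starved unless
the coordinator either bypasses the pool or, as KAFKA-4137 proposes, uses a separate
NetworkClient.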

On Tue, Dec 13, 2016 at 3:47 PM, Mickael Maison 
wrote:

> Thanks for all the feedback.
>
> I've updated the KIP with all the details.
> Below are a few of the main points:
>
> - Overall memory usage of the consumer:
> I made it clear the memory pool is only used to store the raw bytes
> from the network and that the decompressed/deserialized messages are
> not stored in it but as extra memory on the heap. In addition, the
> consumer also keeps track of other things (in flight requests,
> subscriptions, etc..) that account for extra memory as well. So this
> is not a hard memory bound, but it should still allow users to
> roughly size how much memory can be used.
>
> - Relation with the existing settings:
> There are already 2 settings that deal with memory usage of the
> consumer. I suggest we lower the priority of
> `max.partition.fetch.bytes` (I wonder if we should attempt to
> deprecate it or increase its default value so it's a constraint less
> likely to be hit) and have the new setting `buffer.memory` as High.
> I'm a bit unsure what's the best default value for `buffer.memory`, I
> suggested 100MB in the KIP (2 x `fetch.max.bytes`), but I'd appreciate
> feedback. It should always be at least equal to `fetch.max.bytes`.
>
> - Configuration name `buffer.memory`:
> I think it's the name that makes the most sense. It's aligned with the
> producer and as mentioned generic enough to allow future changes if
> needed.
>
> - Coordinator starvation:
> Yes this is a potential issue. I'd expect these requests to be small
> enough to not be affected too much. If that's the case KAFKA-4137
> suggests a possible fix.
>
>
>
> On Tue, Dec 13, 2016 at 9:31 AM, Ismael Juma  wrote:
> > Makes sense Jay.
> >
> > Mickael, in addition to how we can compute defaults of the other settings
> > from `buffer.memory`, it would be good to specify what is allowed and how
> > we handle the different cases (e.g. what do we do if
> > `max.partition.fetch.bytes`
> > is greater than `buffer.memory`, is that simply not allowed?).
> >
> > To summarise the gap between the ideal scenario (user specifies how much
> > memory the consumer can use) and what is being proposed:
> >
> > 1. We will decompress and deserialize the data for one or more partitions
> > in order to return them to the user and we don't account for the
> increased
> > memory usage resulting from that. This is likely to be significant on a
> per
> > record basis, but we try to do it for the minimal number of records
> > possible within the constraints of the system. Currently the constraints
> > are: we decompress and deserialize the data for a partition at a time
> > (default `max.partition.fetch.bytes` is 1MB, but this is a soft limit in
> > case there are oversized messages) until we have enough records to
> > satisfy `max.poll.records`
> > (default 500) or there are no more completed fetches. It seems like this
> > may be OK for a lot of cases, but some tuning will still be required in
> > others.
> >
> > 2. We don't account for bookkeeping data structures or intermediate
> objects
> > allocated during the general operation of the consumer. Probably
> something
> > we have to live with as the cost/benefit of fixing this doesn't seem
> worth
> > it.
> >
> > Ismael
> >
> > On Tue, Dec 13, 2016 at 8:34 AM, Jay Kreps  wrote:
> >
> >> Hey Ismael,
> >>
> >> Yeah I think we are both saying the same thing---removing only works if
> you
> >> have a truly optimal strategy. Actually even dynamically computing a
> >> reasonable default isn't totally obvious (do you set fetch.max.bytes to
> >> equal buffer.memory to try to queue up as much data in the network
> buffers?
> >> Do you try to limit it to your socket.receive.buffer size so that you
> can
> >> read it in a single shot?).
> >>
> >> Regarding what is being measured, my interpretation was the same as
> yours.
> >> I was just adding to the previous point that buffer.memory setting would
> >> not be a very close proxy for memory usage. Someone was pointing out
> that
> >> compression would make this true, and I was just adding that even
> without
> >> compression the object overhead would lead to a high expansion factor.
> >>
> >> -Jay
> >>
> >> On Mon, Dec 12, 2016 at 11:53 PM, Ismael Juma 
> wrote:
> >>
> >> > Hi Jay,
> >> >
> >> > About `max.partition.fetch.bytes`, yes it was an oversight not to
> lower
> >> its
> >> > priority as part of KIP-74 given the existence of `fetch.max.bytes`
> and
> >> the
> >> > fact that we can now make progress in the presen

Re: Dev list subscribe

2016-12-13 Thread Guozhang Wang
Rajini,

It's self-service :)

https://kafka.apache.org/contact

Guozhang

On Tue, Dec 13, 2016 at 5:38 AM, Rajini Sivaram  wrote:

> Please subscribe me to the Kafka dev list.
>
>
> Thank you,
>
> Rajini
>



-- 
-- Guozhang


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745734#comment-15745734
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

Hi Jun,

The stack dump was taken by the automated restart script we've had to put in place, 
before it restarted the nodes; the script picked up the issue 20 seconds after it 
started.

The broker during the period is not under high load. We do not see any GC 
issues, nor do we see any ZK issues.

The logs we are seeing match those of other people. We have had this occur three 
more times, all with very similar logs, i.e. nothing new is showing up.

On a side note, we are looking to upgrade to 0.10.1.1 as soon as it's released 
and we see it released by Confluent as well. We do this as we expect some further 
sanity checks to have occurred, and use this as a measure to check that there are 
no critical issues.

We will aim to push it to UAT quickly (where we also see this issue; weirdly, it 
hasn't occurred in TEST or DEV) to see if this is resolved. What is the expected 
timeline for this? Are we still expecting it to be released today? And when would 
Confluent be likely to complete their testing and release?

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka broker's Java process resolves the 
> issues.
> We have had this occur multiple times and, from all network and machine 
> monitoring, the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the log for the regex below and, where new_partitions is just 
> the node itself, restarts the node.
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"





[GitHub] kafka pull request #2247: MINOR: Fix Streams examples in documentation

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2247




[GitHub] kafka pull request #2243: KAFKA-4497: LogCleaner appended the wrong offset t...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2243




[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745742#comment-15745742
 ] 

ASF GitHub Bot commented on KAFKA-4497:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2243


> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with the latest Kafka 0.10.1 and the log cleaner thread 
> with regard to the timeindex files. From the log of the log cleaner we see 
> after startup that it tries to clean up a topic called xdc_listing-status-v2 
> [1]. The topic is set up with log compaction [2] and the Kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all(?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year; it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean it does not find any log segments which 
> can be cleaned up, or that the last timestamp of the last log segment is 
> somehow broken/missing?
> I’m also wondering why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect it to keep cleaning up 
> other topics, but apparently it doesn’t do that; e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:400)
> at scala.collection.immutable.List.foreach(L

[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) commented on KAFKA-4477:
--

It is worth noting that we see the open file descriptors grow if we leave the 
process in a sick mode (now that we restart quickly, we don't get to observe this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka broker's Java process resolves the 
> issues.
> We have had this occur multiple times and, from all network and machine 
> monitoring, the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the log for the regex below and, where new_partitions is just 
> the node itself, restarts the node.
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"





[jira] [Resolved] (KAFKA-4497) log cleaner breaks on timeindex

2016-12-13 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-4497.

Resolution: Fixed

[~roschumann], I merged Jiangjie's patch to the 0.10.1 branch. Do you think you 
could give it a try and see if it fixes your issue?

The fix for trunk will be included in KAFKA-4390. Closing this JIRA for now.

> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with the latest Kafka 0.10.1 and the log cleaner thread 
> with regard to the timeindex files. From the log of the log cleaner we see 
> after startup that it tries to clean up a topic called xdc_listing-status-v2 
> [1]. The topic is set up with log compaction [2] and the Kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all(?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year; it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean it does not find any log segments which 
> can be cleaned up, or that the last timestamp of the last log segment is 
> somehow broken/missing?
> I’m also wondering why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect it to keep cleaning up 
> other topics, but apparently it doesn’t do that; e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.

Re: Dev list subscribe

2016-12-13 Thread Rajini Sivaram
Sorry, that was a mail sent by mistake.

On Tue, Dec 13, 2016 at 5:39 PM, Guozhang Wang  wrote:

> Rajini,
>
> It's self-service :)
>
> https://kafka.apache.org/contact
>
> Guozhang
>
> On Tue, Dec 13, 2016 at 5:38 AM, Rajini Sivaram 
> wrote:
>
> > Please subscribe me to the Kafka dev list.
> >
> >
> > Thank you,
> >
> > Rajini
> >
>
>
>
> --
> -- Guozhang
>


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/13/16 5:49 PM:


It is worth noting that we see the open file descriptors increase, as mentioned by 
someone else, if we leave the process in a sick mode (now that we restart quickly, 
we don't get to observe this).


was (Author: michael.andre.pearce):
It is worth noting we see the open file descriptors if we leave the process in 
a sick mode (now we restart quickly we don't get to observe this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka broker's Java process resolves the 
> issues.
> We have had this occur multiple times and, from all network and machine 
> monitoring, the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the log for the regex below and, where new_partitions is just 
> the node itself, restarts the node.
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"





[jira] [Updated] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated KAFKA-4532:
--
Description: 
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.

  was:
When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
{{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.

There is an issue whereby this mapping for a table that is originally created 
with {{builder.table("topic", "table");}}, and then is subsequently used in a 
join, is changed to the join topic. This is because the mapping is updated 
during the call to {{topology.connectProcessorAndStateStores(..)}}. 

In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
for the state store name it should not update the Map.

This also affects Interactive Queries. The metadata returned when trying to 
find the instance for a key from the source table is incorrect as the store 
name is mapped to the incorrect topic.


> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.





[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745782#comment-15745782
 ] 

Jun Rao commented on KAFKA-4477:


[~michael.andre.pearce], the 0.10.1.1 release will need an RC1 since we need to 
fix another critical bug.

Note that the cause of the leader dropping all followers out of ISR is 
potentially different from dropping just 1 follower. For your issue, it would 
be useful to also look at the controller/state-change log around that time to 
see if the followers have received any LeaderAndIsrRequests.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on. We do, though, 
> have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to just itself; moments later we see other nodes having disconnects, 
> followed finally by application issues where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka broker's Java process resolves the 
> issues.
> We have had this occur multiple times and, from all network and machine 
> monitoring, the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the log for the regex below and, where new_partitions is just 
> the node itself, restarts the node.
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"





[GitHub] kafka pull request #2250: KAFKA-4532: StateStores can be connected to the wr...

2016-12-13 Thread dguy
GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/2250

KAFKA-4532: StateStores can be connected to the wrong source topic 
resulting in incorrect metadata returned from Interactive Queries

When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
TopologyBuilder.stateStoreNameToSourceTopics() and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.
There is an issue whereby this mapping for a table that is originally 
created with builder.table("topic", "table");, and then is subsequently used in 
a join, is changed to the join topic. This is because the mapping is updated 
during the call to topology.connectProcessorAndStateStores(..).
In the case that the stateStoreNameToSourceTopics Map already has a value 
for the state store name it should not update the Map.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-4532

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2250


commit 45833be3ca1b6f8ac86f516cae4ff1b6571089e8
Author: Damian Guy 
Date:   2016-12-13T18:06:38Z

state store name to topic mapping incorrect






[jira] [Commented] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745809#comment-15745809
 ] 

ASF GitHub Bot commented on KAFKA-4532:
---

GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/2250

KAFKA-4532: StateStores can be connected to the wrong source topic 
resulting in incorrect metadata returned from Interactive Queries

When building a topology with tables and StateStores, the StateStores are 
mapped to the source topic names. This map is retrieved via 
TopologyBuilder.stateStoreNameToSourceTopics() and is used in Interactive 
Queries to find the source topics and partitions when resolving the partitions 
that particular keys will be in.
There is an issue whereby this mapping for a table that is originally 
created with builder.table("topic", "table");, and then is subsequently used in 
a join, is changed to the join topic. This is because the mapping is updated 
during the call to topology.connectProcessorAndStateStores(..).
In the case that the stateStoreNameToSourceTopics Map already has a value 
for the state store name it should not update the Map.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-4532

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2250


commit 45833be3ca1b6f8ac86f516cae4ff1b6571089e8
Author: Damian Guy 
Date:   2016-12-13T18:06:38Z

state store name to topic mapping incorrect




> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then is subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> In the case that the {{stateStoreNameToSourceTopics}} Map already has a value 
> for the state store name it should not update the Map.





[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745813#comment-15745813
 ] 

Jun Rao commented on KAFKA-4477:


[~tdevoe], thanks for the clarification. Then it looks similar to what's 
reported in the JIRA.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, but we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves 
> the issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs 
> and open file descriptors on the sick machine.
> As a workaround to keep the service running, we are putting in place an 
> automated process that tails the logs, matches the regex below, and restarts 
> the node whenever new_partitions is just the node itself:
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2090: KAFKA-4269: Follow up for 0.10.1 branch -update to...

2016-12-13 Thread bbejeck
Github user bbejeck closed the pull request at:

https://github.com/apache/kafka/pull/2090


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4269) Multiple KStream instances with at least one Regex source causes NPE when using multiple consumers

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745818#comment-15745818
 ] 

ASF GitHub Bot commented on KAFKA-4269:
---

Github user bbejeck closed the pull request at:

https://github.com/apache/kafka/pull/2090


> Multiple KStream instances with at least one Regex source causes NPE when 
> using multiple consumers
> --
>
> Key: KAFKA-4269
> URL: https://issues.apache.org/jira/browse/KAFKA-4269
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
> Fix For: 0.10.1.1, 0.10.2.0
>
>
> I discovered this issue while doing testing for KAFKA-4114. 
> KAFKA-4131 fixed the issue of a _single_ KStream with a regex source on 
> partitioned topics across multiple consumers.
> // KAFKA-4131 fixed this case, assuming the "foo*" topics are partitioned
> KStream kstream = builder.stream(Pattern.compile("foo.*"));
> KafkaStreams stream = new KafkaStreams(builder, props);
> stream.start();
> This is a new issue where there are _multiple_ KStream instances (and one 
> has a regex source) within a single KafkaStreams object. When running the 
> second or "following" consumer, NPE errors are generated in the 
> RecordQueue.addRawRecords method when attempting to consume records. 
> For example:
> KStream kstream = builder.stream(Pattern.compile("foo.*"));
> KStream kstream2 = builder.stream(...); // can be a regex or named topic source
> KafkaStreams stream = new KafkaStreams(builder, props);
> stream.start();
> Adding an additional KStream instance like the above (whether a regex or 
> named topic source) causes an NPE when run as the "follower".
> From my initial debugging I can see the TopicPartition assignments being set 
> on the "follower" KafkaStreams instance, but I need to track down why and 
> where all assignments aren't being set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745817#comment-15745817
 ] 

Jun Rao commented on KAFKA-4477:


[~michael.andre.pearce], [~tdevoe], did the following exception in the 
follower occur continuously after it started?
java.io.IOException: Connection to 1002 was disconnected before the response 
was read

Do you know the FollowerFetch request latency (reported by JMX) on the leader 
when the issue happened?
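
For reference, that latency can be read over JMX roughly as sketched below. The 
broker host, JMX port, and the RequestMetrics MBean name are assumptions and 
should be verified against the actual broker setup and version:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Reads the follower-fetch request latency from a broker's JMX endpoint.
// Assumes the broker exposes JMX (e.g. JMX_PORT=9999) and that the
// RequestMetrics MBean name below matches the broker version in use.
public class FollowerFetchLatency {
    public static void main(String[] args) throws Exception {
        String url = "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi"; // placeholder host/port
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName totalTime = new ObjectName(
                "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower");
            System.out.println("Mean total time (ms): " + conn.getAttribute(totalTime, "Mean"));
            System.out.println("99th percentile (ms): " + conn.getAttribute(totalTime, "99thPercentile"));
        } finally {
            connector.close();
        }
    }
}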

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, but we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves 
> the issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs 
> and open file descriptors on the sick machine.
> As a workaround to keep the service running, we are putting in place an 
> automated process that tails the logs, matches the regex below, and restarts 
> the node whenever new_partitions is just the node itself:
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Hojjat Jafarpour
Hi all,

The following is a KIP for removing zkClient dependency from Streams.
Please check out the KIP page:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-90+-+Remove+zkClient+dependency+from+Streams

Thanks,
--Hojjat


[jira] [Resolved] (KAFKA-4390) Replace MessageSet usage with client-side equivalents

2016-12-13 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-4390.

   Resolution: Fixed
Fix Version/s: 0.10.2.0

Issue resolved by pull request 2140
[https://github.com/apache/kafka/pull/2140]

> Replace MessageSet usage with client-side equivalents
> -
>
> Key: KAFKA-4390
> URL: https://issues.apache.org/jira/browse/KAFKA-4390
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
> Fix For: 0.10.2.0
>
>
> Currently we have two separate implementations of Kafka's message format and 
> log structure, one on the client side and one on the server side. Once 
> KAFKA-2066 is merged, we will only be using the client side objects for 
> direct serialization/deserialization in the request APIs, but we will still 
> be using the server-side MessageSet objects everywhere else. Ideally, we can 
> update this code to use the client objects everywhere so that future message 
> format changes only need to be made in one place. This would eliminate the 
> potential for implementation differences and give us a uniform API for 
> accessing the low-level log structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2140: KAFKA-4390: Replace MessageSet usage with client-s...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2140


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4390) Replace MessageSet usage with client-side equivalents

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745892#comment-15745892
 ] 

ASF GitHub Bot commented on KAFKA-4390:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2140


> Replace MessageSet usage with client-side equivalents
> -
>
> Key: KAFKA-4390
> URL: https://issues.apache.org/jira/browse/KAFKA-4390
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
> Fix For: 0.10.2.0
>
>
> Currently we have two separate implementations of Kafka's message format and 
> log structure, one on the client side and one on the server side. Once 
> KAFKA-2066 is merged, we will only be using the client side objects for 
> direct serialization/deserialization in the request APIs, but we will still 
> be using the server-side MessageSet objects everywhere else. Ideally, we can 
> update this code to use the client objects everywhere so that future message 
> format changes only need to be made in one place. This would eliminate the 
> potential for implementation differences and give us a uniform API for 
> accessing the low-level log structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom DeVoe updated KAFKA-4477:
-
Attachment: state_change_controller.tar.gz

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, but we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves 
> the issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs 
> and open file descriptors on the sick machine.
> As a workaround to keep the service running, we are putting in place an 
> automated process that tails the logs, matches the regex below, and restarts 
> the node whenever new_partitions is just the node itself:
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745919#comment-15745919
 ] 

Tom DeVoe commented on KAFKA-4477:
--

Sorry about that [~apurva], I attached the state change and controller logs 
from that period in the state_change_controller.tar.gz tarball.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, but we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems that only restarting the Kafka instance's Java process resolves 
> the issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAITs 
> and open file descriptors on the sick machine.
> As a workaround to keep the service running, we are putting in place an 
> automated process that tails the logs, matches the regex below, and restarts 
> the node whenever new_partitions is just the node itself:
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2251: KAFKA-4529; Fix the issue that tombstone can be de...

2016-12-13 Thread becketqin
GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/2251

KAFKA-4529; Fix the issue that tombstone can be deleted too early.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-4529

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2251


commit 42f284bf0c1897899b4d620d18de9f66f65617f3
Author: Jiangjie Qin 
Date:   2016-12-13T19:02:08Z

KAFKA-4529; Fix the issue that tombstone can be deleted too early.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4529) tombstone may be removed earlier than it should

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745956#comment-15745956
 ] 

ASF GitHub Bot commented on KAFKA-4529:
---

GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/2251

KAFKA-4529; Fix the issue that tombstone can be deleted too early.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-4529

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2251


commit 42f284bf0c1897899b4d620d18de9f66f65617f3
Author: Jiangjie Qin 
Date:   2016-12-13T19:02:08Z

KAFKA-4529; Fix the issue that tombstone can be deleted too early.




> tombstone may be removed earlier than it should
> ---
>
> Key: KAFKA-4529
> URL: https://issues.apache.org/jira/browse/KAFKA-4529
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Jun Rao
>Assignee: Jiangjie Qin
> Fix For: 0.10.1.1
>
>
> As part of KIP-33, we introduced a regression in how tombstones are removed 
> in a compacted topic. We want to delay the removal of a tombstone to avoid 
> the case where a reader first reads a non-tombstone message for a key and 
> then doesn't see the tombstone for that key because it was deleted too 
> quickly. So, a tombstone is supposed to be removed from a compacted topic 
> only after it has been part of the cleaned portion of the log for 
> delete.retention.ms.
> Before KIP-33, deleteHorizonMs in LogCleaner is calculated based on the last 
> modified time, which is monotonically increasing from old to new segments. 
> With KIP-33, deleteHorizonMs is calculated based on the message timestamp, 
> which is not necessarily monotonically increasing.
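
A toy illustration of the difference described above, with made-up class, field, 
and method names rather than the real LogCleaner internals:

// Toy model of the two ways the tombstone delete horizon can be derived; this
// is illustrative only and not the actual LogCleaner code.
public class DeleteHorizonSketch {

    static class Segment {
        final long lastModifiedMs;      // file mtime: monotonic from old to new segments
        final long largestTimestampMs;  // producer-supplied: not necessarily monotonic
        Segment(long lastModifiedMs, long largestTimestampMs) {
            this.lastModifiedMs = lastModifiedMs;
            this.largestTimestampMs = largestTimestampMs;
        }
    }

    // Pre-KIP-33 behaviour (as described above): horizon from the last-modified time.
    static long horizonFromMtime(Segment lastCleanable, long deleteRetentionMs) {
        return lastCleanable.lastModifiedMs - deleteRetentionMs;
    }

    // Post-KIP-33 behaviour: horizon from the largest message timestamp. If a
    // producer wrote a far-future timestamp, the horizon jumps ahead and a
    // tombstone can be discarded before readers had delete.retention.ms to see it.
    static long horizonFromTimestamp(Segment lastCleanable, long deleteRetentionMs) {
        return lastCleanable.largestTimestampMs - deleteRetentionMs;
    }

    public static void main(String[] args) {
        long deleteRetentionMs = 86_400_000L; // delete.retention.ms = 1 day
        Segment lastCleanable = new Segment(1_480_000_000_000L, 1_480_500_000_000L);
        System.out.println("horizon (mtime):     " + horizonFromMtime(lastCleanable, deleteRetentionMs));
        System.out.println("horizon (timestamp): " + horizonFromTimestamp(lastCleanable, deleteRetentionMs));
    }
}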



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: kafka-trunk-jdk8 #1099

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR: Fix Streams examples in documentation

--
[...truncated 3907 lines...]

kafka.utils.CommandLineUtilsTest > testParseEmptyArg STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArg PASSED

kafka.utils.CommandLineUtilsTest > testParseSingleArg STARTED

kafka.utils.CommandLineUtilsTest > testParseSingleArg PASSED

kafka.utils.CommandLineUtilsTest > testParseArgs STARTED

kafka.utils.CommandLineUtilsTest > testParseArgs PASSED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid PASSED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr STARTED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr PASSED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition STARTED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition PASSED

kafka.utils.JsonTest > testJsonEncoding STARTED

kafka.utils.JsonTest > testJsonEncoding PASSED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask PASSED

kafka.utils.SchedulerTest > testNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testRestart STARTED

kafka.utils.SchedulerTest > testRestart PASSED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler STARTED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler PASSED

kafka.utils.SchedulerTest > testPeriodicTask STARTED

kafka.utils.SchedulerTest > testPeriodicTask PASSED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing STARTED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing PASSED

kafka.utils.IteratorTemplateTest > testIterator STARTED

kafka.utils.IteratorTemplateTest > testIterator PASSED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 STARTED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 PASSED

kafka.utils.UtilsTest > testAbs STARTED

kafka.utils.UtilsTest > testAbs PASSED

kafka.utils.UtilsTest > testReplaceSuffix STARTED

kafka.utils.UtilsTest > testReplaceSuffix PASSED

kafka.utils.UtilsTest > testCircularIterator STARTED

kafka.utils.UtilsTest > testCircularIterator PASSED

kafka.utils.UtilsTest > testReadBytes STARTED

kafka.utils.UtilsTest > testReadBytes PASSED

kafka.utils.UtilsTest > testCsvList STARTED

kafka.utils.UtilsTest > testCsvList PASSED

kafka.utils.UtilsTest > testReadInt STARTED

kafka.utils.UtilsTest > testReadInt PASSED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID STARTED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID PASSED

kafka.utils.UtilsTest > testCsvMap STARTED

kafka.utils.UtilsTest > testCsvMap PASSED

kafka.utils.UtilsTest > testInLock STARTED

kafka.utils.UtilsTest > testInLock PASSED

kafka.utils.UtilsTest > testSwallow STARTED

kafka.utils.UtilsTest > testSwallow PASSED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic STARTED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic PASSED

kafka.producer.AsyncProducerTest > testQueueTimeExpired STARTED

kafka.producer.AsyncProducerTest > testQueueTimeExpired PASSED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents STARTED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents PASSED

kafka.producer.AsyncProducerTest > testBatchSize STARTED

kafka.producer.AsyncProducerTest > testBatchSize PASSED

kafka.producer.AsyncProducerTest > testSerializeEvents STARTED

kafka.producer.AsyncProducerTest > testSerializeEvents PASSED

kafka.producer.AsyncProducerTest > testProducerQueueSize STARTED

kafka.producer.AsyncProducerTest > testProducerQueueSize PASSED

kafka.producer.AsyncProducerTest > testRandomPartitioner STARTED

kafka.producer.AsyncProducerTest > testRandomPartitioner PASSED

kafka.producer.AsyncProducerTest > testInvalidConfiguration STARTED

kafka.producer.AsyncProducerTest > testInvalidConfiguration PASSED

kafka.producer.AsyncProducerTest > testInvalidPartition STARTED

kafka.producer.AsyncProducerTest > testInvalidPartition PASSED

kafka.producer.AsyncProducerTest > testNoBroker STARTED

kafka.producer.AsyncProducerTest > testNoBroker PASSED

kafka.producer.AsyncProducerTest > testProduceAfterClosed STARTED

kafka.producer.AsyncProducerTest > testProduceAfterClosed PASSED

kafka.producer.AsyncProducerTest > testJavaProducer STARTED

kafka.producer.AsyncProducerTest > testJavaProducer PASSED

kafka.producer.AsyncProducerTest > testI

[jira] [Commented] (KAFKA-4437) Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Avi Flax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745978#comment-15745978
 ] 

Avi Flax commented on KAFKA-4437:
-

Ah, great, thanks!

> Incremental Batch Processing for Kafka Streams
> --
>
> Key: KAFKA-4437
> URL: https://issues.apache.org/jira/browse/KAFKA-4437
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
>
> We want to add an “auto stop” feature that terminates a stream application 
> when it has processed all the data that was newly available at the time the 
> application started (up to the current end-of-log, i.e., the current high 
> watermark). This allows chopping the (infinite) log into finite chunks, 
> where each run of the application processes one chunk. This feature allows 
> for incremental batch-like processing; think 
> "start-process-stop-restart-process-stop-..."
> For details see KIP-95: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
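
This is not the KIP-95 design itself, but the "process one chunk, then stop" idea 
can be sketched with the plain consumer API as below; the bootstrap servers, topic, 
group id, and deserializers are placeholders:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Sketch of "process the data available at startup, then stop" with a plain
// consumer; KIP-95 proposes doing this inside Kafka Streams itself.
public class IncrementalBatchSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "incremental-batch");         // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic")); // placeholder topic
            consumer.poll(0); // join the group and obtain an assignment
            // Capture the high watermarks at startup; this run only processes up to them.
            Map<TopicPartition, Long> stopOffsets = consumer.endOffsets(consumer.assignment());

            boolean done = false;
            while (!done) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // process(record) would go here
                }
                done = stopOffsets.entrySet().stream()
                        .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
            }
            consumer.commitSync();
        }
    }
}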



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4534) StreamPartitionAssignor only ever updates the partitionsByHostState and metadataWithInternalTopics once.

2016-12-13 Thread Damian Guy (JIRA)
Damian Guy created KAFKA-4534:
-

 Summary: StreamPartitionAssignor only ever updates the 
partitionsByHostState and metadataWithInternalTopics once.
 Key: KAFKA-4534
 URL: https://issues.apache.org/jira/browse/KAFKA-4534
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 0.10.2.0
Reporter: Damian Guy
Assignee: Damian Guy
 Fix For: 0.10.2.0


StreamPartitionAssignor only ever updates the partitionsByHostState and 
metadataWithInternalTopics once. This results in incorrect metadata on 
rebalances.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4509) Task reusage on rebalance fails for threads on same host

2016-12-13 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-4509:
-
   Resolution: Fixed
Fix Version/s: 0.10.2.0
   Status: Resolved  (was: Patch Available)

Issue resolved by pull request 2233
[https://github.com/apache/kafka/pull/2233]

> Task reusage on rebalance fails for threads on same host
> 
>
> Key: KAFKA-4509
> URL: https://issues.apache.org/jira/browse/KAFKA-4509
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
> Fix For: 0.10.2.0
>
>
> In https://issues.apache.org/jira/browse/KAFKA-3559 task reuse on rebalance 
> was introduced as a performance optimization. Instead of closing a task on 
> rebalance (i.e., in {{onPartitionsRevoked()}}), it only gets suspended for 
> potential reuse in {{onPartitionsAssigned()}}. Only if a task cannot be 
> reused will it eventually get closed in {{onPartitionsAssigned()}}.
> This mechanism can fail if multiple {{StreamThreads}} run on the same host 
> (same or different JVM). The scenario is as follows:
>  - assume 2 running threads A and B
>  - assume 3 tasks t1, t2, t3
>  - assignment: A-(t1,t2) and B-(t3)
>  - on the same host, a new single-threaded Streams application (same app-id) 
> gets started (thread C)
>  - on rebalance, t2 (could also be t1 -- it does not matter) will be moved 
> from A to C
>  - as the assignment is only sticky based on a heuristic, t1 can sometimes 
> be assigned to B, too -- and t3 gets assigned to A (there is a race 
> condition whether this "task flipping" happens or not)
>  - on revoke, A will suspend tasks t1 and t2 (not releasing any locks)
>  - on assign
> - A tries to create t3, but as B did not release it yet, A dies with a 
> "cannot get lock" exception
> - B tries to create t1, but as A did not release it yet, B dies with a 
> "cannot get lock" exception
> - as A and B try to create these tasks first, this will always fail if 
> task flipping happened
>- C tries to create t2, but A did not release the t2 lock yet (race 
> condition) and C dies with an exception (this could even happen without 
> "task flipping" between A and B)
> We want to fix this by:
>   # first releasing unassigned suspended tasks in {{onPartitionsAssigned()}}, 
> and afterwards creating new tasks (this fixes the "task flipping" issue)
>   # using a "backoff and retry" mechanism if a task cannot be created (to 
> handle the release-create race condition between different threads)
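
A minimal sketch of the backoff-and-retry idea in point 2; the names are 
illustrative and this is not the actual StreamThread code:

// Keep retrying task creation while another thread still holds the state
// directory lock. Illustrative only; not the actual StreamThread code.
public class TaskCreationRetrySketch {

    interface TaskFactory {
        void createTask() throws LockException;
    }

    static class LockException extends Exception {}

    static void createTaskWithBackoff(TaskFactory factory, long maxWaitMs) throws Exception {
        long backoffMs = 50;
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (true) {
            try {
                factory.createTask();   // may fail with "cannot get lock"
                return;
            } catch (LockException e) {
                if (System.currentTimeMillis() > deadline) {
                    throw e;            // give up after maxWaitMs
                }
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 1000); // exponential backoff, capped
            }
        }
    }
}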



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2233: KAFKA-4509: Task reusage on rebalance fails for th...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2233


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4509) Task reusage on rebalance fails for threads on same host

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746138#comment-15746138
 ] 

ASF GitHub Bot commented on KAFKA-4509:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2233


> Task reusage on rebalance fails for threads on same host
> 
>
> Key: KAFKA-4509
> URL: https://issues.apache.org/jira/browse/KAFKA-4509
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Matthias J. Sax
> Fix For: 0.10.2.0
>
>
> In https://issues.apache.org/jira/browse/KAFKA-3559 task reuse on rebalance 
> was introduced as a performance optimization. Instead of closing a task on 
> rebalance (i.e., in {{onPartitionsRevoked()}}), it only gets suspended for 
> potential reuse in {{onPartitionsAssigned()}}. Only if a task cannot be 
> reused will it eventually get closed in {{onPartitionsAssigned()}}.
> This mechanism can fail if multiple {{StreamThreads}} run on the same host 
> (same or different JVM). The scenario is as follows:
>  - assume 2 running threads A and B
>  - assume 3 tasks t1, t2, t3
>  - assignment: A-(t1,t2) and B-(t3)
>  - on the same host, a new single-threaded Streams application (same app-id) 
> gets started (thread C)
>  - on rebalance, t2 (could also be t1 -- it does not matter) will be moved 
> from A to C
>  - as the assignment is only sticky based on a heuristic, t1 can sometimes 
> be assigned to B, too -- and t3 gets assigned to A (there is a race 
> condition whether this "task flipping" happens or not)
>  - on revoke, A will suspend tasks t1 and t2 (not releasing any locks)
>  - on assign
> - A tries to create t3, but as B did not release it yet, A dies with a 
> "cannot get lock" exception
> - B tries to create t1, but as A did not release it yet, B dies with a 
> "cannot get lock" exception
> - as A and B try to create these tasks first, this will always fail if 
> task flipping happened
>- C tries to create t2, but A did not release the t2 lock yet (race 
> condition) and C dies with an exception (this could even happen without 
> "task flipping" between A and B)
> We want to fix this by:
>   # first releasing unassigned suspended tasks in {{onPartitionsAssigned()}}, 
> and afterwards creating new tasks (this fixes the "task flipping" issue)
>   # using a "backoff and retry" mechanism if a task cannot be created (to 
> handle the release-create race condition between different threads)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread Guozhang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-4532.
--
Resolution: Fixed

Issue resolved by pull request 2250
[https://github.com/apache/kafka/pull/2250]

> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> If the {{stateStoreNameToSourceTopics}} Map already has a value for the 
> state store name, it should not be updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] kafka pull request #2250: KAFKA-4532: StateStores can be connected to the wr...

2016-12-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2250


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4532) StateStores can be connected to the wrong source topic resulting in incorrect metadata returned from IQ

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746201#comment-15746201
 ] 

ASF GitHub Bot commented on KAFKA-4532:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2250


> StateStores can be connected to the wrong source topic resulting in incorrect 
> metadata returned from IQ
> ---
>
> Key: KAFKA-4532
> URL: https://issues.apache.org/jira/browse/KAFKA-4532
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.1.0, 0.10.1.1, 0.10.2.0
>Reporter: Damian Guy
>Assignee: Damian Guy
> Fix For: 0.10.2.0
>
>
> When building a topology with tables and StateStores, the StateStores are 
> mapped to the source topic names. This map is retrieved via 
> {{TopologyBuilder.stateStoreNameToSourceTopics()}} and is used in Interactive 
> Queries to find the source topics and partitions when resolving the 
> partitions that particular keys will be in.
> There is an issue whereby this mapping for a table that is originally 
> created with {{builder.table("topic", "table");}}, and then subsequently 
> used in a join, is changed to the join topic. This is because the mapping is 
> updated during the call to {{topology.connectProcessorAndStateStores(..)}}. 
> If the {{stateStoreNameToSourceTopics}} Map already has a value for the 
> state store name, it should not be updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: kafka-trunk-jdk8 #1100

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[jason] KAFKA-4390; Replace MessageSet usage with client-side alternatives

--
[...truncated 7920 lines...]

kafka.utils.timer.TimerTaskListTest > testAll STARTED

kafka.utils.timer.TimerTaskListTest > testAll PASSED

kafka.utils.CommandLineUtilsTest > testParseEmptyArg STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArg PASSED

kafka.utils.CommandLineUtilsTest > testParseSingleArg STARTED

kafka.utils.CommandLineUtilsTest > testParseSingleArg PASSED

kafka.utils.CommandLineUtilsTest > testParseArgs STARTED

kafka.utils.CommandLineUtilsTest > testParseArgs PASSED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid STARTED

kafka.utils.CommandLineUtilsTest > testParseEmptyArgAsValid PASSED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr STARTED

kafka.utils.ReplicationUtilsTest > testUpdateLeaderAndIsr PASSED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition STARTED

kafka.utils.ReplicationUtilsTest > testGetLeaderIsrAndEpochForPartition PASSED

kafka.utils.JsonTest > testJsonEncoding STARTED

kafka.utils.JsonTest > testJsonEncoding PASSED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask STARTED

kafka.utils.SchedulerTest > testMockSchedulerPeriodicTask PASSED

kafka.utils.SchedulerTest > testNonPeriodicTask STARTED

kafka.utils.SchedulerTest > testNonPeriodicTask PASSED

kafka.utils.SchedulerTest > testRestart STARTED

kafka.utils.SchedulerTest > testRestart PASSED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler STARTED

kafka.utils.SchedulerTest > testReentrantTaskInMockScheduler PASSED

kafka.utils.SchedulerTest > testPeriodicTask STARTED

kafka.utils.SchedulerTest > testPeriodicTask PASSED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testAbortedConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath STARTED

kafka.utils.ZkUtilsTest > testSuccessfulConditionalDeletePath PASSED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing STARTED

kafka.utils.ZkUtilsTest > testClusterIdentifierJsonParsing PASSED

kafka.utils.IteratorTemplateTest > testIterator STARTED

kafka.utils.IteratorTemplateTest > testIterator PASSED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 STARTED

kafka.utils.UtilsTest > testGenerateUuidAsBase64 PASSED

kafka.utils.UtilsTest > testAbs STARTED

kafka.utils.UtilsTest > testAbs PASSED

kafka.utils.UtilsTest > testReplaceSuffix STARTED

kafka.utils.UtilsTest > testReplaceSuffix PASSED

kafka.utils.UtilsTest > testCircularIterator STARTED

kafka.utils.UtilsTest > testCircularIterator PASSED

kafka.utils.UtilsTest > testReadBytes STARTED

kafka.utils.UtilsTest > testReadBytes PASSED

kafka.utils.UtilsTest > testCsvList STARTED

kafka.utils.UtilsTest > testCsvList PASSED

kafka.utils.UtilsTest > testReadInt STARTED

kafka.utils.UtilsTest > testReadInt PASSED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID STARTED

kafka.utils.UtilsTest > testUrlSafeBase64EncodeUUID PASSED

kafka.utils.UtilsTest > testCsvMap STARTED

kafka.utils.UtilsTest > testCsvMap PASSED

kafka.utils.UtilsTest > testInLock STARTED

kafka.utils.UtilsTest > testInLock PASSED

kafka.utils.UtilsTest > testSwallow STARTED

kafka.utils.UtilsTest > testSwallow PASSED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic STARTED

kafka.producer.AsyncProducerTest > testFailedSendRetryLogic PASSED

kafka.producer.AsyncProducerTest > testQueueTimeExpired STARTED

kafka.producer.AsyncProducerTest > testQueueTimeExpired PASSED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents STARTED

kafka.producer.AsyncProducerTest > testPartitionAndCollateEvents PASSED

kafka.producer.AsyncProducerTest > testBatchSize STARTED

kafka.producer.AsyncProducerTest > testBatchSize PASSED

kafka.producer.AsyncProducerTest > testSerializeEvents STARTED

kafka.producer.AsyncProducerTest > testSerializeEvents PASSED

kafka.producer.AsyncProducerTest > testProducerQueueSize STARTED

kafka.producer.AsyncProducerTest > testProducerQueueSize PASSED

kafka.producer.AsyncProducerTest > testRandomPartitioner STARTED

kafka.producer.AsyncProducerTest > testRandomPartitioner PASSED

kafka.producer.AsyncProducerTest > testInvalidConfiguration STARTED

kafka.producer.AsyncProducerTest > testInvalidConfiguration PASSED

kafka.producer.AsyncProducerTest > testInvalidPartition STARTED

kafka.producer.AsyncProducerTest > testInvalidPartition PASSED

kafka.producer.AsyncProducerTest > testNoBroker STARTED

kafka.producer.AsyncProducerTest > testNoBroker PASSED

kafka.producer.AsyncProducerTest > testProduceAfterClosed STARTED

kafka.producer.AsyncProducerTest > testProduceAfterClosed PASSED

kafka.producer.AsyncProducerTest 

[GitHub] kafka pull request #2252: HOTFIX: fix state transition stuck on rebalance

2016-12-13 Thread enothereska
GitHub user enothereska opened a pull request:

https://github.com/apache/kafka/pull/2252

HOTFIX: fix state transition stuck on rebalance

This fixes a problem where the KafkaStreams instance state transition gets 
stuck on rebalance. Also adjusts the test in QueryableStateIntegrationTest.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/enothereska/kafka hotfix_state_never_running

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2252


commit 0476125a19b9e824d8cdd181149a1a038d52b7c5
Author: Eno Thereska 
Date:   2016-12-13T20:50:13Z

Fixed state stuck on rebalance

commit 67afcf855af702b4aedafc8f403b7ac5cb82e080
Author: Eno Thereska 
Date:   2016-12-13T20:51:11Z

Merge remote-tracking branch 'origin/trunk' into hotfix_state_never_running




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Build failed in Jenkins: kafka-trunk-jdk7 #1752

2016-12-13 Thread Apache Jenkins Server
See 

Changes:

[wangguoz] MINOR: Fix Streams examples in documentation

--
[...truncated 6277 lines...]

kafka.api.PlaintextProducerSendTest > testSendOffset STARTED

kafka.api.PlaintextProducerSendTest > testSendOffset PASSED

kafka.api.PlaintextProducerSendTest > testSendCompressedMessageWithCreateTime 
STARTED

kafka.api.PlaintextProducerSendTest > testSendCompressedMessageWithCreateTime 
PASSED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromCallerThread 
STARTED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromCallerThread 
PASSED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromSenderThread 
STARTED

kafka.api.PlaintextProducerSendTest > testCloseWithZeroTimeoutFromSenderThread 
PASSED

kafka.api.PlaintextProducerSendTest > testSendBeforeAndAfterPartitionExpansion 
STARTED

kafka.api.PlaintextProducerSendTest > testSendBeforeAndAfterPartitionExpansion 
PASSED

kafka.api.PlaintextConsumerTest > testEarliestOrLatestOffsets STARTED

kafka.api.PlaintextConsumerTest > testEarliestOrLatestOffsets PASSED

kafka.api.PlaintextConsumerTest > testPartitionsForAutoCreate STARTED

kafka.api.PlaintextConsumerTest > testPartitionsForAutoCreate PASSED

kafka.api.PlaintextConsumerTest > testShrinkingTopicSubscriptions STARTED

kafka.api.PlaintextConsumerTest > testShrinkingTopicSubscriptions PASSED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMs STARTED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMs PASSED

kafka.api.PlaintextConsumerTest > testOffsetsForTimes STARTED

kafka.api.PlaintextConsumerTest > testOffsetsForTimes PASSED

kafka.api.PlaintextConsumerTest > testSubsequentPatternSubscription STARTED

kafka.api.PlaintextConsumerTest > testSubsequentPatternSubscription PASSED

kafka.api.PlaintextConsumerTest > testAsyncCommit STARTED

kafka.api.PlaintextConsumerTest > testAsyncCommit PASSED

kafka.api.PlaintextConsumerTest > testLowMaxFetchSizeForRequestAndPartition 
STARTED

kafka.api.PlaintextConsumerTest > testLowMaxFetchSizeForRequestAndPartition 
PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnStopPolling 
STARTED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnStopPolling 
PASSED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMsDelayInRevocation STARTED

kafka.api.PlaintextConsumerTest > testMaxPollIntervalMsDelayInRevocation PASSED

kafka.api.PlaintextConsumerTest > testPartitionsForInvalidTopic STARTED

kafka.api.PlaintextConsumerTest > testPartitionsForInvalidTopic PASSED

kafka.api.PlaintextConsumerTest > testPauseStateNotPreservedByRebalance STARTED

kafka.api.PlaintextConsumerTest > testPauseStateNotPreservedByRebalance PASSED

kafka.api.PlaintextConsumerTest > 
testFetchHonoursFetchSizeIfLargeRecordNotFirst STARTED

kafka.api.PlaintextConsumerTest > 
testFetchHonoursFetchSizeIfLargeRecordNotFirst PASSED

kafka.api.PlaintextConsumerTest > testSeek STARTED

kafka.api.PlaintextConsumerTest > testSeek PASSED

kafka.api.PlaintextConsumerTest > testPositionAndCommit STARTED

kafka.api.PlaintextConsumerTest > testPositionAndCommit PASSED

kafka.api.PlaintextConsumerTest > 
testFetchRecordLargerThanMaxPartitionFetchBytes STARTED

kafka.api.PlaintextConsumerTest > 
testFetchRecordLargerThanMaxPartitionFetchBytes PASSED

kafka.api.PlaintextConsumerTest > testUnsubscribeTopic STARTED

kafka.api.PlaintextConsumerTest > testUnsubscribeTopic PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnClose STARTED

kafka.api.PlaintextConsumerTest > testMultiConsumerSessionTimeoutOnClose PASSED

kafka.api.PlaintextConsumerTest > testFetchRecordLargerThanFetchMaxBytes STARTED

kafka.api.PlaintextConsumerTest > testFetchRecordLargerThanFetchMaxBytes PASSED

kafka.api.PlaintextConsumerTest > testMultiConsumerDefaultAssignment STARTED

kafka.api.PlaintextConsumerTest > testMultiConsumerDefaultAssignment PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnClose STARTED

kafka.api.PlaintextConsumerTest > testAutoCommitOnClose PASSED

kafka.api.PlaintextConsumerTest > testListTopics STARTED

kafka.api.PlaintextConsumerTest > testListTopics PASSED

kafka.api.PlaintextConsumerTest > testExpandingTopicSubscriptions STARTED

kafka.api.PlaintextConsumerTest > testExpandingTopicSubscriptions PASSED

kafka.api.PlaintextConsumerTest > testInterceptors STARTED

kafka.api.PlaintextConsumerTest > testInterceptors PASSED

kafka.api.PlaintextConsumerTest > testPatternUnsubscription STARTED

kafka.api.PlaintextConsumerTest > testPatternUnsubscription PASSED

kafka.api.PlaintextConsumerTest > testGroupConsumption STARTED

kafka.api.PlaintextConsumerTest > testGroupConsumption PASSED

kafka.api.PlaintextConsumerTest > testPartitionsFor STARTED

kafka.api.PlaintextConsumerTest > testPartitionsFor PASSED

kafka.api.PlaintextConsumerTest > testAutoCommitOnRebalance STARTED

kafka.api.PlaintextCo

[GitHub] kafka pull request #2242: KAFKA-4497: Fix the ByteBufferMessageSet.filterInt...

2016-12-13 Thread becketqin
Github user becketqin closed the pull request at:

https://github.com/apache/kafka/pull/2242


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-4497) log cleaner breaks on timeindex

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746249#comment-15746249
 ] 

ASF GitHub Bot commented on KAFKA-4497:
---

Github user becketqin closed the pull request at:

https://github.com/apache/kafka/pull/2242


> log cleaner breaks on timeindex
> ---
>
> Key: KAFKA-4497
> URL: https://issues.apache.org/jira/browse/KAFKA-4497
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Affects Versions: 0.10.1.0
> Environment: Debian Jessie, Oracle Java 8u92, kafka_2.11-0.10.1.0
>Reporter: Robert Schumann
>Assignee: Jiangjie Qin
>Priority: Critical
> Fix For: 0.10.1.1
>
>
> _created from ML entry by request of [~ijuma]_
> Hi all,
> we are facing an issue with latest kafka 0.10.1 and the log cleaner thread 
> with regards to the timeindex files. From the log of the log-cleaner we see 
> after startup that it tries to cleanup a topic called xdc_listing-status-v2 
> [1]. The topic is setup with log compaction [2] and the kafka cluster 
> configuration has log.cleaner enabled [3]. Looking at the log and the newly 
> created file [4], the cleaner seems to refer to tombstones prior to 
> epoch_time=0 - maybe because it finds messages which don’t have a timestamp 
> at all (?). All producers and consumers are using 0.10.1 and the topics have 
> been created completely new, so I’m not sure where this issue would come 
> from. The original timeindex file [5] seems to show only valid timestamps for 
> the mentioned offsets. I would also like to mention that the issue happened 
> in two independent datacenters at the same time, so I would rather expect an 
> application/producer issue instead of random disk failures. We didn’t have 
> the problem with 0.10.0 for around half a year, it appeared shortly after the 
> upgrade to 0.10.1.
> The puzzling message from the cleaner “cleaning prior to Fri Dec 02 16:35:50 
> CET 2016, discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970” also 
> confuses me a bit. Does that mean, it does not find any log segments which 
> can be cleaned up or the last timestamp of the last log segment is somehow 
> broken/missing?
> I’m also a bit wondering, why the log cleaner thread stops completely after 
> an error with one topic. I would at least expect that it keeps on cleaning up 
> other topics, but apparently it doesn’t do that, e.g. it’s not even cleaning 
> the __consumer_offsets anymore.
> Does anybody have the same issues or can explain, what’s going on? Thanks for 
> any help or suggestions.
> Cheers
> Robert
> [1]
> {noformat}
> [2016-12-06 12:49:17,885] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,895] INFO [kafka-log-cleaner-thread-0], Starting  
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,947] INFO Cleaner 0: Beginning cleaning of log 
> xdc_listing-status-v2-1. (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,948] INFO Cleaner 0: Building offset map for 
> xdc_listing-status-v2-1... (kafka.log.LogCleaner)
> [2016-12-06 12:49:17,989] INFO Cleaner 0: Building offset map for log 
> xdc_listing-status-v2-1 for 1 segments in offset range [0, 194991). 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,572] INFO Cleaner 0: Offset map for log 
> xdc_listing-status-v2-1 complete. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,577] INFO Cleaner 0: Cleaning log 
> xdc_listing-status-v2-1 (cleaning prior to Fri Dec 02 16:35:50 CET 2016, 
> discarding tombstones prior to Thu Jan 01 01:00:00 CET 1970)... 
> (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,580] INFO Cleaner 0: Cleaning segment 0 in log 
> xdc_listing-status-v2-1 (largest timestamp Fri Dec 02 16:35:50 CET 2016) into 
> 0, retaining deletes. (kafka.log.LogCleaner)
> [2016-12-06 12:49:24,968] ERROR [kafka-log-cleaner-thread-0], Error due to  
> (kafka.log.LogCleaner)
> kafka.common.InvalidOffsetException: Attempt to append an offset (-1) to slot 
> 9 no larger than the last offset appended (11832) to 
> /var/lib/kafka/xdc_listing-status-v2-1/.timeindex.cleaned.
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply$mcV$sp(TimeIndex.scala:117)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at 
> kafka.log.TimeIndex$$anonfun$maybeAppend$1.apply(TimeIndex.scala:107)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:107)
> at kafka.log.LogSegment.append(LogSegment.scala:106)
> at kafka.log.Cleaner.cleanInto(LogCleaner.scala:518)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:404)
> at 
> kafka.log.Cleaner$$anonfun$cleanSegments$1.apply(LogCleaner.scala:400)
> at scala.collection.immutable.List.foreac

Re: [DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Gwen Shapira
Great idea, go for it :)

On Tue, Dec 13, 2016 at 10:39 AM, Hojjat Jafarpour 
wrote:

> Hi all,
>
> The following is a KIP for removing zkClient dependency from Streams.
> Please check out the KIP page:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 90+-+Remove+zkClient+dependency+from+Streams
>
> Thanks,
> --Hojjat
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter  | blog



Re: [DISCUSS] KIP-90 Remove zkClient dependency from Streams

2016-12-13 Thread Ismael Juma
Thanks for the KIP, Hojjat. It will be great for Streams apps not to
require ZK access.

Ismael

On Tue, Dec 13, 2016 at 10:39 AM, Hojjat Jafarpour 
wrote:

> Hi all,
>
> The following is a KIP for removing zkClient dependency from Streams.
> Please check out the KIP page:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 90+-+Remove+zkClient+dependency+from+Streams
>
> Thanks,
> --Hojjat
>


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Lorand Peter Kasler (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746368#comment-15746368
 ] 

Lorand Peter Kasler commented on KAFKA-4477:


We encountered the same situation (including the constantly growing file 
handle usage mentioned earlier), and the follower nodes were continuously 
trying to connect (returning the same error that has been posted, every time). 
Also, the (possibly deadlocked) leader node refused to export any metrics, but 
before the incident its request latency wasn't at its peak and wasn't the 
highest in the cluster. 

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it owns 
> down to itself; moments later we see other nodes having disconnects, followed 
> finally by application issues, where producing to these partitions is blocked.
> It seems that only restarting the Kafka java process resolves the issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> number of CLOSE_WAIT connections and open file descriptors.
> As a workaround to keep the service alive, we are putting in place an automated 
> process that tails the logs and matches the regex below; where the new 
> partition ISR is just the node itself, we restart the node. 
> "\[(?P.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P.+) to (?P.+) 
> \(kafka.cluster.Partition\)"
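
A minimal sketch of the tail-and-regex watchdog described in the quoted report
might look like the following; the broker id, log path, and restart command are
placeholders rather than the reporters' actual script, and the pattern is a
simplified version of the regex quoted above.

    #!/bin/sh
    # Restart the local broker when it shrinks a partition's ISR to just itself.
    BROKER_ID=7                              # placeholder: this broker's id
    LOG=/var/lib/kafka/logs/server.log       # placeholder: broker log file

    tail -F "$LOG" | while read -r line; do
      # Extract the new ISR from lines such as:
      #   ... on broker 7: Shrinking ISR for partition [topic,10] from 1,2,7 to 7 (kafka.cluster.Partition)
      new_isr=$(printf '%s\n' "$line" | \
        sed -n 's/.*Shrinking ISR for partition \[.*\] from .* to \(.*\) (kafka.cluster.Partition).*/\1/p')
      if [ "$new_isr" = "$BROKER_ID" ]; then
        echo "ISR shrank to broker $BROKER_ID only; restarting Kafka" >&2
        systemctl restart kafka              # placeholder restart command
      fi
    done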





[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746393#comment-15746393
 ] 

Jun Rao commented on KAFKA-4477:


[~tdevoe], from the controller log, starting from 19:59:13, the controller 
keeps failing to connect to broker 1002. Was 1002 up at that point?

[2016-11-28 19:59:13,140] WARN [Controller-1003-to-broker-1002-send-thread], 
Controller 1003 epoch 23 fails to send request 
{controller_id=1003,controller_epoch=23,partition_states=[{topic=__consumer_offsets,partition=18,controller_epoch=22,leader=1002,leader_epoch=25,isr=[1002],zk_version=73,replicas=[1002,1001,1003]},{topic=__consumer_offsets,partition=45,controller_epoch=22,leader=1002,leader_epoch=25,isr=[1002],zk_version=68,replicas=[1002,1003,1001]},{topic=topic_23,partition=0,controller_epoch=22,leader=1002,leader_epoch=10,isr=[1002],zk_version=30,replicas=[1002,1003,1001]},{topic=topic_22,partition=2,controller_epoch=22,leader=1002,leader_epoch=10,isr=[1002],zk_version=26,replicas=[1002,1003,1001]},{topic=topic_3,partition=0,controller_epoch=22,leader=1002,leader_epoch=12,isr=[1002],zk_version=33,replicas=[1002,1001,1003]},{topic=__consumer_offsets,partition=36,controller_epoch=22,leader=1002,leader_epoch=25,isr=[1002],zk_version=72,replicas=[1002,1001,1003]},{topic=connect-offsets,partition=13,controller_epoch=22,leader=1002,leader_epoch=12,isr=[1002],zk_version=37,replicas=[1002,1001,1003]}],live_brokers=[{id=1003,end_points=[{port=9092,host=node_1003,security_protocol_type=2}],rack=null},{id=1002,end_points=[{port=9092,host=node_1002,security_protocol_type=2}],rack=null},{id=1001,end_points=[{port=9092,host=node_1001,security_protocol_type=2}],rack=null}]}
 to broker node_1002:9092 (id: 1002 rack: null). Reconnecting to broker. 
(kafka.controller.RequestSendThread)
java.io.IOException: Connection to 1002 was disconnected before the response 
was read
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at 
kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at 
kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at 
kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at 
kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at 
kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:190)
at 
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)



[jira] [Issue Comment Deleted] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom DeVoe updated KAFKA-4477:
-
Comment: was deleted

(was: I was able to telnet from the controller node to the kafka port on node 
1002 while it was saying it was unable to connect, so the network seemed to be 
fine at the time.

From the logs it seems node 1002 was moving along as though it was behaving 
normally (except with all of its ISRs shrunk))



[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746402#comment-15746402
 ] 

Tom DeVoe commented on KAFKA-4477:
--

I was able to telnet from the controller node to the kafka port on node 1002 
while it was saying it was unable to connect, so the network seemed to be fine 
at the time.

From the logs it seems node 1002 was moving along as though it was behaving 
normally (except with all of its ISRs shrunk).




Jenkins build is back to normal : kafka-trunk-jdk7 #1753

2016-12-13 Thread Apache Jenkins Server
See 



[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746466#comment-15746466
 ] 

Ismael Juma commented on KAFKA-4477:


One question for the others who have reported this issue: have you upgraded to 
0.10.1.0 and then started seeing these issues (like Michael)? And if so, which 
version of Kafka were you running before the upgrade?



[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Lorand Peter Kasler (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746479#comment-15746479
 ] 

Lorand Peter Kasler commented on KAFKA-4477:


Yes, we recently upgraded and started getting these weird issues, approximately 
once or twice a week. 
The previous version was: 0.10.0.0



Jenkins build is back to normal : kafka-trunk-jdk8 #1101

2016-12-13 Thread Apache Jenkins Server
See 



Re: [DISCUSS] KIP-95: Incremental Batch Processing for Kafka Streams

2016-12-13 Thread Avi Flax
On 2016-11-28 13:47 (-0500), "Matthias J. Sax"  wrote: 
>
> I want to start a discussion about KIP-95:
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
> 
> Looking forward to your feedback.

Hi Matthias,

I’d just like to share some feedback on this proposal, from the perspective
of a Kafka user rather than developer:

* Overall, makes a ton of sense and I think it’ll be very useful and will
  open up a bunch of additional use cases to Kafka Streams

* Two use cases that are of particular interest to me:

* Just last week I created a simple Kafka Streams app (I called it a
  “script”) to copy certain records from one topic over to another,
  with filtering. I ran the app/script until it reached the end of
  the topic, then I manually shut it down with ctrl-c. This worked 
  fine, but it would have been an even better UX to have specified
  the config value `autostop.at` as `eol` and have the process stop
  itself at the desired point. That would have required less manual
  monitoring on my part.

* With this new mode I might be able to run Streams apps on AWS Lambda.
  I’ve been super-excited about Lambda and similar FaaS services since
  their early days, and I’ve been itching to run Kafka Streams apps on
  Lambda since I started using Streams in April or May. Unfortunately,
  Lambda functions are limited to 5 minutes per invocation — after 5
  minutes they’re killed. I’m not sure, but I wonder if perhaps this new
  autostop.at feature could make it more practical to run a Streams
  app on Lambda - the feature seems like it could potentially be adapted
  to enable a Streams app to be more generally resilient to being
  frequently stopped and started.

I look forward to seeing progress on this enhancement!

Thanks,
Avi


Software Architect @ Park Assist
We’re hiring! http://tech.parkassist.com/jobs/

Kafka ACL's with SSL Protocol is not working

2016-12-13 Thread Raghu B
Hi All,

I am trying to enable ACLs in my Kafka cluster along with the SSL
protocol.

I have tried each and every parameter but with no luck, so I need help
enabling ACLs with SSL (without Kerberos). I am attaching all of the
configuration details below.

Kindly help me.


I tested SSL without ACL, it worked fine
(listeners=SSL://10.247.195.122:9093).


This is my Kafka server properties file:

# ACL SETTINGS #

auto.create.topics.enable=true

authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer

security.inter.broker.protocol=SSL

#allow.everyone.if.no.acl.found=true

#principal.builder.class=CustomizedPrincipalBuilderClass

#super.users=User:"CN=writeuser,OU=Unknown,O=Unknown,L=Unknown,ST=Unknown,C=Unknown"

#super.users=User:Raghu;User:Admin

#offsets.storage=kafka

#dual.commit.enabled=true

listeners=SSL://10.247.195.122:9093

#listeners=PLAINTEXT://10.247.195.122:9092

#listeners=PLAINTEXT://10.247.195.122:9092,SSL://10.247.195.122:9093

#advertised.listeners=PLAINTEXT://10.247.195.122:9092


ssl.keystore.location=/home/raghu/kafka/security/server.keystore.jks

ssl.keystore.password=123456

ssl.key.password=123456

ssl.truststore.location=/home/raghu/kafka/security/server.truststore.jks

ssl.truststore.password=123456



Set the ACL from Authorizer CLI:

> bin/kafka-acls.sh --authorizer-properties
zookeeper.connect=10.247.195.122:2181 --list --topic ssltopic

Current ACLs for resource `Topic:ssltopic`:

  User:CN=writeuser, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown,
C=Unknown has Allow permission for operations: Write from hosts: *


XXXWMXXX-7:kafka_2.11-0.10.1.0 rbaddam$ bin/kafka-console-producer.sh
--broker-list 10.247.195.122:9093 --topic ssltopic
--producer.config client-ssl.properties


[2016-12-13 14:53:45,839] WARN Error while fetching metadata with
correlation id 0 : {ssltopic=UNKNOWN_TOPIC_OR_PARTITION}
(org.apache.kafka.clients.NetworkClient)

[2016-12-13 14:53:45,984] WARN Error while fetching metadata with
correlation id 1 : {ssltopic=UNKNOWN_TOPIC_OR_PARTITION}
(org.apache.kafka.clients.NetworkClient)


XXXWMXXX-7:kafka_2.11-0.10.1.0 rbaddam$ cat client-ssl.properties

#group.id=sslgroup

security.protocol=SSL

ssl.truststore.location=/Users/rbaddam/Desktop/Dev/kafka_2.11-0.10.1.0/ssl/client.truststore.jks

ssl.truststore.password=123456

# Configure below if you use client auth

ssl.keystore.location=/Users/rbaddam/Desktop/Dev/kafka_2.11-0.10.1.0/ssl/client.keystore.jks

ssl.keystore.password=123456

ssl.key.password=123456


XXXWMXXX-7:kafka_2.11-0.10.1.0 rbaddam$ bin/kafka-console-consumer.sh
--bootstrap-server 10.247.195.122:9093 --new-consumer
--consumer.config client-ssl.properties --topic ssltopic --from-beginning

[2016-12-13 14:53:28,817] WARN Error while fetching metadata with
correlation id 1 : {ssltopic=UNKNOWN_TOPIC_OR_PARTITION}
(org.apache.kafka.clients.NetworkClient)

[2016-12-13 14:53:28,819] ERROR Unknown error when running consumer:
(kafka.tools.ConsoleConsumer$)

org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized
to access group: console-consumer-52826


Thanks in advance,

Raghu - raghu98...@gmail.com
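
The GroupAuthorizationException at the end typically means the SSL principal
only has the Write ACL shown above, so the console consumer is denied access
to its consumer group. A minimal sketch of the additional grants such a setup
usually needs is below; it reuses the principal, ZooKeeper address, and topic
from the mail above, assumes group.id=sslgroup is uncommented in
client-ssl.properties, and the principal string must match the DN exactly as
it appears in the --list output.

    # Allow the SSL principal to read the topic.
    bin/kafka-acls.sh --authorizer-properties zookeeper.connect=10.247.195.122:2181 \
      --add --allow-principal "User:CN=writeuser, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown" \
      --operation Read --topic ssltopic

    # Allow the same principal to use the consumer group configured in
    # client-ssl.properties (a fixed group is easier to authorize than the
    # auto-generated console-consumer-* groups).
    bin/kafka-acls.sh --authorizer-properties zookeeper.connect=10.247.195.122:2181 \
      --add --allow-principal "User:CN=writeuser, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown" \
      --operation Read --group sslgroup

The existing Write ACL is typically enough for producing once the topic
exists; if the topic has to be auto-created, the principal usually also needs
Create permission (granted with --operation Create --cluster).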


[jira] [Commented] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746799#comment-15746799
 ] 

Apurva Mehta commented on KAFKA-4477:
-

[~tdevoe], thanks for sharing all your extended broker logs, as well as the 
controller and state change logs. 

I have a few questions: 

# The original description in the ticket states that the problem node reduces 
the ISR to itself and then doesn't recover. In the logs you shared, the 
problem node 1002 does shrink its ISRs to itself, but then the ISR begins to 
expand back to the original set only 2 seconds later. The broker log for node 
1002 also shows connections from the other replicas coming in; we can tell 
since the SASL handshake is being logged. The strange bit is that nodes 1001 
and 1003, however, can't seem to connect until 2130, which brings me to 
my next point.
# Did you bounce the hosts at 2130? If not, when were the hosts bounced?
# We have fixed some deadlock bugs where the ISR shrinks to a single node but 
expands back again. Given the observation in point 1, it may be worth trying 
the 0.10.1.1 RC to see if you can reproduce this problem when using that code. 
If it reproduces, then we know for certain that the existing deadlocks are not 
the issue.
# Another suspicion we have is the changes to the `NetworkClientBlockingOps` 
code. However, this code does not have any logging. If you try the RC and 
still hit the issue, would you be willing to deploy a version of 0.10.1 with 
some instrumentation around the network client code? This would enable us to 
validate or disprove our hypothesis.

Thanks,
Apurva


