Re: [DISCUSSION] KIP-15 close(timeout) for producer

2015-03-15 Thread Jay Kreps
Cool.

I think blocking is good or alternately throwing an exception directly from
close(). Basically I would just worry about subtly doing something slightly
different from what the user asked for as it will be hard to notice that
behavior difference.

-Jay

On Sat, Mar 14, 2015 at 5:48 PM, Jiangjie Qin 
wrote:

> Hi Jay,
>
> I have modified the KIP as you suggested. I thinks as long as we have
> consistent define for timeout across Kafka interface, there would be no
> problem. And I also agree it is better if we can make producer block when
> close() is called from sender thread so user will notice something went
> wrong.
>
> Thanks.
>
> Jiangjie (Becket) Qin
>
> On 3/14/15, 11:37 AM, "Jay Kreps"  wrote:
>
> >Hey Jiangjie,
> >
> >I think this is going to be very confusing that
> >  close(0) waits indefinitely and
> >  close(-1) waits for 0.
> >I understand this appears in other apis, but it is a constant cause of
> >bugs. Let's not repeat that mistake.
> >
> >Let's make close(0) wait for 0. We don't need a way to wait indefinitely
> >as
> >we already have close() so having a magical constant for that is
> >redundant.
> >
> >Calling close() from the I/O thread was already possible and would block
> >indefinitely. I think trying to silently change the behavior is probably
> >not right. I.e. if the user calls close() in the callback there is
> >actually
> >some misunderstanding and they need to think more, silently making this
> >not
> >block will hide the problem from them which is the opposite of what we
> >want.
> >
> >-Jay
> >
> >On Thu, Mar 12, 2015 at 1:49 AM, Jiangjie Qin 
> >wrote:
> >
> >> Hey Joe & Jay,
> >>
> >> Thanks for the comments on the voting thread. Since it seems we probably
> >> will have more discussion on this, I am just replying from the
> >>discussion
> >> thread here.
> >> I’ve updated the KIP page to make it less like half-baked, apologize for
> >> the rush...
> >>
> >> The contract in current KIP is:
> >>   1. close() - wait until all requests either are sent or reach request
> >> timeout.
> >>   2. close(-1, TimeUnit.MILLISECONDS) - close immediately
> >>   3. close(0, TimeUnit.MILLISECONDS) - equivalent to close(), i.e. Wait
> >> until all requests are sent or reach request timeout
> >>   4. close(5, TimeUnit.MILLISECONDS) - try the best to finish sending
> >>in 5
> >> milliseconds, if something went wrong, just shutdown the producer
> >>anyway,
> >> my callback will handle the failures.
> >>
> >> About how we define what timeout value stands for, I actually struggled
> >>a
> >> little bit when wrote the patch. Intuitively, close(0) should mean
> >> immediately, however it seems that all the existing java class have this
> >> convention of timeout=0 means no timeout or never timeout
> >>(Thread.join(0),
> >> Object.wait(0), etc.) So here the dilemma is either we follow the
> >> intuition or we follow the convention. What I chose is to follow the
> >> convention but document the interface to let user be aware of the usage.
> >> The reason is that I think producer.close() is a public interface so it
> >> might be better to follow java convention. Whereas selector is not a
> >> public interface that used by user, so as long as it makes sense to us,
> >>it
> >> is less a problem to be different from java convention. That said since
> >> consumer.poll(timeout) is also a public interface, I think it also makes
> >> sense to make producer.close() to have the same definition of
> >> consumer.poll(timeout).
> >>
> >> The main argument for keeping a timeout in close would be separating the
> >> close timeout from request timeout, which probably makes sense. I would
> >> guess typically the request timeout would be long (e.g. 60 seconds)
> >> because we might want to consider retries with back off time. If we have
> >> multiple batches in accumulator, in worst case that could take up to
> >> several minutes to complete all the requests. But when we close a
> >> producer, we might not want to wait for that long as it might cause some
> >> other problem like deployment tool timeout.
> >>
> >> There is also a subtle difference between close(timeout) and
> >> flush(timeout). The only purpose for flush() is to write data to the
> >> broker, so it makes perfect sense to wait until request timeout. I think
> >> that is why flush(timeout) looks strange. On the other hand, the top
> >> priority for close() is to close the producer rather than flush() data,
> >>so
> >> close(timeout) gives guarantee on bounded waiting for its main job.
> >>
> >> Sorry for the confusion about forceClose flag. It is not a public
> >> interface. I mentioned it in Proposed Changes section which I thought
> >>was
> >> supposed to provide implementation details.
> >>
> >> Thanks again for all the comments and suggestions!
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On 3/10/15, 8:57 PM, "Jiangjie Qin"  wrote:
> >>
> >> >The KIP page has been updated per Jay¹s comments.
> >> >I¹d like to initiate the voting process if no fur

Re: Review Request 31967: Patch for KAFKA-1546

2015-03-15 Thread Neha Narkhede

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31967/#review76511
---



core/src/main/scala/kafka/cluster/Partition.scala


General comment for changes to this file: The lines are longer than 80 in 
many cases. Can you please fix that?



core/src/main/scala/kafka/cluster/Partition.scala


scala.Some is unused.



core/src/main/scala/kafka/cluster/Partition.scala


Let's rename FetchDataInfo.fetchOffset to fetchOffsetMetadata. It is 
confusing to read ...fetchOffset.messageOffset.



core/src/main/scala/kafka/cluster/Partition.scala


Remove space after ! in all relevant places in this patch?



core/src/main/scala/kafka/cluster/Partition.scala


This comment needs an update to match the logic in your patch.



core/src/main/scala/kafka/cluster/Replica.scala


Easier to express clearly the case for UnknownLogReadResult. It is harder 
to read and convince oneself that the case where the broker becomes a leader 
doesn't set the lagBeginValue to current time. In addition to adding that 
condition explicitly, can you also add a comment explaining the significance of 
the check?



core/src/main/scala/kafka/server/KafkaConfig.scala


minor nit: hasnt => hasn't
Also, how about "hasn't consumed up to the leader's log end offset for at 
least this time"



core/src/main/scala/kafka/server/ReplicaManager.scala


Please break the long lines. Makes it harder to read and maintain the code



core/src/main/scala/kafka/server/ReplicaManager.scala


ditto



core/src/main/scala/kafka/server/ReplicaManager.scala


ditto


- Neha Narkhede


On March 12, 2015, 8:42 p.m., Aditya Auradkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31967/
> ---
> 
> (Updated March 12, 2015, 8:42 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1546
> https://issues.apache.org/jira/browse/KAFKA-1546
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> PATCH for KAFKA-1546
> 
> Brief summary of changes:
> - Added a lagBegin metric inside Replica to track the lag in terms of time 
> since the replica did not read from the LEO
> - Using lag begin value in the check for ISR expand and shrink
> - Removed the max lag messages config since it is no longer necessary
> - Returning the initialLogEndOffset in LogReadResult corresponding to the the 
> LEO before actually reading from the log.
> - Unit test cases to test ISR shrinkage and expansion
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/cluster/Partition.scala 
> c4bf48a801007ebe7497077d2018d6dffe1677d4 
>   core/src/main/scala/kafka/cluster/Replica.scala 
> bd13c20338ce3d73113224440e858a12814e5adb 
>   core/src/main/scala/kafka/server/KafkaConfig.scala 
> 48e33626695ad8a28b0018362ac225f11df94973 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> c5274822c57bf3c1f9e4135c0bdcaa87ee50ce20 
>   core/src/test/scala/unit/kafka/server/ISRExpirationTest.scala 
> 92152358c95fa9178d71bd1c079af0a0bd8f1da8 
>   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
> c124c8df5b5079e5ffbd0c4ea359562a66aaf317 
>   core/src/test/scala/unit/kafka/server/LogRecoveryTest.scala 
> 92d6b2c672f74cdd526f2e98da8f7fb3696a88e3 
>   core/src/test/scala/unit/kafka/server/SimpleFetchTest.scala 
> efb457334bd87e4c5b0dd2c66ae1995993cd0bc1 
> 
> Diff: https://reviews.apache.org/r/31967/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Aditya Auradkar
> 
>



Re: [VOTE] KIP-16: Replica Lag Tuning

2015-03-15 Thread Neha Narkhede
+1 on the KIP. Minor nit: "it is deemed to not be in ISR because it is not
caught up" => "it is deemed to not be in the ISR because it has fallen
behind for more than a certain amount of time as controlled by this config"

Also took a look at the patch. Looks correct, left review comments. Thanks
for sharing the test results. This change is going to be great for users!

On Sat, Mar 14, 2015 at 9:01 AM, Jay Kreps  wrote:

> +1
>
> -Jay
>
> On Fri, Mar 13, 2015 at 9:54 AM, Aditya Auradkar <
> aaurad...@linkedin.com.invalid> wrote:
>
> > Details in the KIP, Jira and RB.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> > https://issues.apache.org/jira/browse/KAFKA-1546
> > https://reviews.apache.org/r/31967/
> >
> > Aditya
> >
> >
>



-- 
Thanks,
Neha


Re: [VOTE] KIP-16: Replica Lag Tuning

2015-03-15 Thread Joe Stein
+1

one more minor nit, please update the KIP with the link to the discuss
thread too.

~ Joe Stein
- - - - - - - - - - - - - - - - -

  http://www.stealth.ly
- - - - - - - - - - - - - - - - -

On Sun, Mar 15, 2015 at 5:27 PM, Neha Narkhede  wrote:

> +1 on the KIP. Minor nit: "it is deemed to not be in ISR because it is not
> caught up" => "it is deemed to not be in the ISR because it has fallen
> behind for more than a certain amount of time as controlled by this config"
>
> Also took a look at the patch. Looks correct, left review comments. Thanks
> for sharing the test results. This change is going to be great for users!
>
> On Sat, Mar 14, 2015 at 9:01 AM, Jay Kreps  wrote:
>
> > +1
> >
> > -Jay
> >
> > On Fri, Mar 13, 2015 at 9:54 AM, Aditya Auradkar <
> > aaurad...@linkedin.com.invalid> wrote:
> >
> > > Details in the KIP, Jira and RB.
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> > > https://issues.apache.org/jira/browse/KAFKA-1546
> > > https://reviews.apache.org/r/31967/
> > >
> > > Aditya
> > >
> > >
> >
>
>
>
> --
> Thanks,
> Neha
>


[jira] [Commented] (KAFKA-1546) Automate replica lag tuning

2015-03-15 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362581#comment-14362581
 ] 

Neha Narkhede commented on KAFKA-1546:
--

[~aauradkar] Thanks for the patch. Overall, the changes look correct. I left a 
few review comments. And thanks for sharing the test results. Look forward to 
the updated patch.

> Automate replica lag tuning
> ---
>
> Key: KAFKA-1546
> URL: https://issues.apache.org/jira/browse/KAFKA-1546
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
>Reporter: Neha Narkhede
>Assignee: Aditya Auradkar
>  Labels: newbie++
> Fix For: 0.8.3
>
> Attachments: KAFKA-1546.patch, KAFKA-1546_2015-03-11_18:48:09.patch, 
> KAFKA-1546_2015-03-12_13:42:01.patch
>
>
> Currently, there is no good way to tune the replica lag configs to 
> automatically account for high and low volume topics on the same cluster. 
> For the low-volume topic it will take a very long time to detect a lagging
> replica, and for the high-volume topic it will have false-positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSSION] KIP-15 close(timeout) for producer

2015-03-15 Thread Neha Narkhede
>
> And I also agree it is better if we can make producer block when
> close() is called from sender thread so user will notice something went
> wrong.


This isn't a great experience either. Why can't we just throw an exception
for a behavior we know is incorrect and we'd like the user to know.
Blocking as a means of doing that seems wrong and annoying.

On Sun, Mar 15, 2015 at 11:56 AM, Jay Kreps  wrote:

> Cool.
>
> I think blocking is good or alternately throwing an exception directly from
> close(). Basically I would just worry about subtly doing something slightly
> different from what the user asked for as it will be hard to notice that
> behavior difference.
>
> -Jay
>
> On Sat, Mar 14, 2015 at 5:48 PM, Jiangjie Qin 
> wrote:
>
> > Hi Jay,
> >
> > I have modified the KIP as you suggested. I thinks as long as we have
> > consistent define for timeout across Kafka interface, there would be no
> > problem. And I also agree it is better if we can make producer block when
> > close() is called from sender thread so user will notice something went
> > wrong.
> >
> > Thanks.
> >
> > Jiangjie (Becket) Qin
> >
> > On 3/14/15, 11:37 AM, "Jay Kreps"  wrote:
> >
> > >Hey Jiangjie,
> > >
> > >I think this is going to be very confusing that
> > >  close(0) waits indefinitely and
> > >  close(-1) waits for 0.
> > >I understand this appears in other apis, but it is a constant cause of
> > >bugs. Let's not repeat that mistake.
> > >
> > >Let's make close(0) wait for 0. We don't need a way to wait indefinitely
> > >as
> > >we already have close() so having a magical constant for that is
> > >redundant.
> > >
> > >Calling close() from the I/O thread was already possible and would block
> > >indefinitely. I think trying to silently change the behavior is probably
> > >not right. I.e. if the user calls close() in the callback there is
> > >actually
> > >some misunderstanding and they need to think more, silently making this
> > >not
> > >block will hide the problem from them which is the opposite of what we
> > >want.
> > >
> > >-Jay
> > >
> > >On Thu, Mar 12, 2015 at 1:49 AM, Jiangjie Qin  >
> > >wrote:
> > >
> > >> Hey Joe & Jay,
> > >>
> > >> Thanks for the comments on the voting thread. Since it seems we
> probably
> > >> will have more discussion on this, I am just replying from the
> > >>discussion
> > >> thread here.
> > >> I’ve updated the KIP page to make it less like half-baked, apologize
> for
> > >> the rush...
> > >>
> > >> The contract in current KIP is:
> > >>   1. close() - wait until all requests either are sent or reach
> request
> > >> timeout.
> > >>   2. close(-1, TimeUnit.MILLISECONDS) - close immediately
> > >>   3. close(0, TimeUnit.MILLISECONDS) - equivalent to close(), i.e.
> Wait
> > >> until all requests are sent or reach request timeout
> > >>   4. close(5, TimeUnit.MILLISECONDS) - try the best to finish sending
> > >>in 5
> > >> milliseconds, if something went wrong, just shutdown the producer
> > >>anyway,
> > >> my callback will handle the failures.
> > >>
> > >> About how we define what timeout value stands for, I actually
> struggled
> > >>a
> > >> little bit when wrote the patch. Intuitively, close(0) should mean
> > >> immediately, however it seems that all the existing java class have
> this
> > >> convention of timeout=0 means no timeout or never timeout
> > >>(Thread.join(0),
> > >> Object.wait(0), etc.) So here the dilemma is either we follow the
> > >> intuition or we follow the convention. What I chose is to follow the
> > >> convention but document the interface to let user be aware of the
> usage.
> > >> The reason is that I think producer.close() is a public interface so
> it
> > >> might be better to follow java convention. Whereas selector is not a
> > >> public interface that used by user, so as long as it makes sense to
> us,
> > >>it
> > >> is less a problem to be different from java convention. That said
> since
> > >> consumer.poll(timeout) is also a public interface, I think it also
> makes
> > >> sense to make producer.close() to have the same definition of
> > >> consumer.poll(timeout).
> > >>
> > >> The main argument for keeping a timeout in close would be separating
> the
> > >> close timeout from request timeout, which probably makes sense. I
> would
> > >> guess typically the request timeout would be long (e.g. 60 seconds)
> > >> because we might want to consider retries with back off time. If we
> have
> > >> multiple batches in accumulator, in worst case that could take up to
> > >> several minutes to complete all the requests. But when we close a
> > >> producer, we might not want to wait for that long as it might cause
> some
> > >> other problem like deployment tool timeout.
> > >>
> > >> There is also a subtle difference between close(timeout) and
> > >> flush(timeout). The only purpose for flush() is to write data to the
> > >> broker, so it makes perfect sense to wait until request timeout. I
> think
> > >> that is why flush(timeout) looks strange. On t

Re: [DISCUSSION] KIP-15 close(timeout) for producer

2015-03-15 Thread Guozhang Wang
Yeah I agree we should not silently change the behavior of the function
with the given parameters; and I would prefer error-logging-and-shutdown
over blocking when close(>0) is used, since as Neha suggested blocking
would also not proceed with sending any data, bu will just let users to
realize the issue later than sooner.

On Sun, Mar 15, 2015 at 3:25 PM, Neha Narkhede  wrote:

> >
> > And I also agree it is better if we can make producer block when
> > close() is called from sender thread so user will notice something went
> > wrong.
>
>
> This isn't a great experience either. Why can't we just throw an exception
> for a behavior we know is incorrect and we'd like the user to know.
> Blocking as a means of doing that seems wrong and annoying.
>
> On Sun, Mar 15, 2015 at 11:56 AM, Jay Kreps  wrote:
>
> > Cool.
> >
> > I think blocking is good or alternately throwing an exception directly
> from
> > close(). Basically I would just worry about subtly doing something
> slightly
> > different from what the user asked for as it will be hard to notice that
> > behavior difference.
> >
> > -Jay
> >
> > On Sat, Mar 14, 2015 at 5:48 PM, Jiangjie Qin  >
> > wrote:
> >
> > > Hi Jay,
> > >
> > > I have modified the KIP as you suggested. I thinks as long as we have
> > > consistent define for timeout across Kafka interface, there would be no
> > > problem. And I also agree it is better if we can make producer block
> when
> > > close() is called from sender thread so user will notice something went
> > > wrong.
> > >
> > > Thanks.
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On 3/14/15, 11:37 AM, "Jay Kreps"  wrote:
> > >
> > > >Hey Jiangjie,
> > > >
> > > >I think this is going to be very confusing that
> > > >  close(0) waits indefinitely and
> > > >  close(-1) waits for 0.
> > > >I understand this appears in other apis, but it is a constant cause of
> > > >bugs. Let's not repeat that mistake.
> > > >
> > > >Let's make close(0) wait for 0. We don't need a way to wait
> indefinitely
> > > >as
> > > >we already have close() so having a magical constant for that is
> > > >redundant.
> > > >
> > > >Calling close() from the I/O thread was already possible and would
> block
> > > >indefinitely. I think trying to silently change the behavior is
> probably
> > > >not right. I.e. if the user calls close() in the callback there is
> > > >actually
> > > >some misunderstanding and they need to think more, silently making
> this
> > > >not
> > > >block will hide the problem from them which is the opposite of what we
> > > >want.
> > > >
> > > >-Jay
> > > >
> > > >On Thu, Mar 12, 2015 at 1:49 AM, Jiangjie Qin
>  > >
> > > >wrote:
> > > >
> > > >> Hey Joe & Jay,
> > > >>
> > > >> Thanks for the comments on the voting thread. Since it seems we
> > probably
> > > >> will have more discussion on this, I am just replying from the
> > > >>discussion
> > > >> thread here.
> > > >> I’ve updated the KIP page to make it less like half-baked, apologize
> > for
> > > >> the rush...
> > > >>
> > > >> The contract in current KIP is:
> > > >>   1. close() - wait until all requests either are sent or reach
> > request
> > > >> timeout.
> > > >>   2. close(-1, TimeUnit.MILLISECONDS) - close immediately
> > > >>   3. close(0, TimeUnit.MILLISECONDS) - equivalent to close(), i.e.
> > Wait
> > > >> until all requests are sent or reach request timeout
> > > >>   4. close(5, TimeUnit.MILLISECONDS) - try the best to finish
> sending
> > > >>in 5
> > > >> milliseconds, if something went wrong, just shutdown the producer
> > > >>anyway,
> > > >> my callback will handle the failures.
> > > >>
> > > >> About how we define what timeout value stands for, I actually
> > struggled
> > > >>a
> > > >> little bit when wrote the patch. Intuitively, close(0) should mean
> > > >> immediately, however it seems that all the existing java class have
> > this
> > > >> convention of timeout=0 means no timeout or never timeout
> > > >>(Thread.join(0),
> > > >> Object.wait(0), etc.) So here the dilemma is either we follow the
> > > >> intuition or we follow the convention. What I chose is to follow the
> > > >> convention but document the interface to let user be aware of the
> > usage.
> > > >> The reason is that I think producer.close() is a public interface so
> > it
> > > >> might be better to follow java convention. Whereas selector is not a
> > > >> public interface that used by user, so as long as it makes sense to
> > us,
> > > >>it
> > > >> is less a problem to be different from java convention. That said
> > since
> > > >> consumer.poll(timeout) is also a public interface, I think it also
> > makes
> > > >> sense to make producer.close() to have the same definition of
> > > >> consumer.poll(timeout).
> > > >>
> > > >> The main argument for keeping a timeout in close would be separating
> > the
> > > >> close timeout from request timeout, which probably makes sense. I
> > would
> > > >> guess typically the request timeout would be long (e.g. 60 seconds)
> > > >> b

Re: [DISCUSS] KIP 16 - Replica lag tuning

2015-03-15 Thread Guozhang Wang
The wiki page and the patch LGTM. Just a couple of minor comments:


On the wiki page, definition of* replica.lag.time.max.ms
*: just mention that "If a follower hasn't
sent..." is its original definition, and now we are adding another aspect
in addition to this.

I think it is better to add the server metrics of replica-inactive-time as
"time.millisecond - r.logEndOffsetUpdateTimeMs", and replica-lag-time for
each replica.



Guozhang

On Fri, Mar 13, 2015 at 5:25 PM, Joel Koshy  wrote:

> Thanks for the write-up. Looks good to me. Minor comments:
>
> Under proposed changes:
>
> """The proposal is to calculate replica lag as the amount of time not
> caught up to the leader. A replica is only in ISR if it is caught
> up."""
>
> It may be worth clarifying this on the KIP a bit more - i.e., we
> currently have two forms of replica lag (time-based and message-count
> based). The latter is hard to do correctly across all topics and the
> former is in fact sufficient. i.e., we now have only one concept: "For
> how long can a replica be out-of-sync before it is explicitly removed
> from the ISR?  This is the new interpretation of replica.lag.max.ms. A
> replica that is currently in ISR and happens to fall behind due to a
> burst is given this grace period to catch up to the leader without
> being removed from ISR."
>
> Also, may want to warn if replica.max.messages is provided (doesn't seem
> to be there in the current patch).
>
> Joel
>
> On Thu, Mar 12, 2015 at 09:35:24PM +, Aditya Auradkar wrote:
> > I will change the wording to reflect this. But yes, a broker follower
> should only enter the ISR once it is fully caught up.
> >
> > Caught up means that the follower has read from the log end offset from
> the broker. I'm using the log end offset from before the actual read
> operation to avoid these off by one errors. In any case, I plan to run this
> locally with a small cluster and see how it performs.
> >
> > Aditya
> >
> > 
> > From: Joe Stein [joe.st...@stealth.ly]
> > Sent: Thursday, March 12, 2015 1:54 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP 16 - Replica lag tuning
> >
> > Hi Aditya, thanks for the writeup.
> >
> > Lets say a broker follower goes down. And it is down for an hour or
> two
> >
> > When the broker follower comes back up it will start sending fetch
> requests
> > (lets say every 2ms which would be under a configured lets say 100ms
> > (whatever)). Then right away the brokers gets added back to the ISR?
> >
> > Maybe it is just the wording or how I am reading it... I think/thought
> that
> > once the replica is caught up THEN the setting goes into action and as
> long
> > as (every 100ms ... whatever) the broker leader is seeing the broker
> > follower as "caught up" then it is in the ISR.
> >
> > Also, what is the definition of "caught up" now without the number of
> > messages? If it is === i worry about that not happening in some networks
> > where it is always off by one or something maybe?
> >
> > ~ Joe Stein
> > - - - - - - - - - - - - - - - - -
> >
> >   http://www.stealth.ly
> > - - - - - - - - - - - - - - - - -
> >
> > On Thu, Mar 12, 2015 at 4:36 PM, Aditya Auradkar <
> > aaurad...@linkedin.com.invalid> wrote:
> >
> > > I wrote a KIP for this after some discussion on KAFKA-1546.
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> > >
> > > The RB is here: https://reviews.apache.org/r/31967/
> > >
> > > Thanks,
> > > Aditya
> > >
>
>


-- 
-- Guozhang


Re: [VOTE] KIP-16: Replica Lag Tuning

2015-03-15 Thread Guozhang Wang
+1 on the KIP.

On Sun, Mar 15, 2015 at 2:56 PM, Joe Stein  wrote:

> +1
>
> one more minor nit, please update the KIP with the link to the discuss
> thread too.
>
> ~ Joe Stein
> - - - - - - - - - - - - - - - - -
>
>   http://www.stealth.ly
> - - - - - - - - - - - - - - - - -
>
> On Sun, Mar 15, 2015 at 5:27 PM, Neha Narkhede  wrote:
>
> > +1 on the KIP. Minor nit: "it is deemed to not be in ISR because it is
> not
> > caught up" => "it is deemed to not be in the ISR because it has fallen
> > behind for more than a certain amount of time as controlled by this
> config"
> >
> > Also took a look at the patch. Looks correct, left review comments.
> Thanks
> > for sharing the test results. This change is going to be great for users!
> >
> > On Sat, Mar 14, 2015 at 9:01 AM, Jay Kreps  wrote:
> >
> > > +1
> > >
> > > -Jay
> > >
> > > On Fri, Mar 13, 2015 at 9:54 AM, Aditya Auradkar <
> > > aaurad...@linkedin.com.invalid> wrote:
> > >
> > > > Details in the KIP, Jira and RB.
> > > >
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> > > > https://issues.apache.org/jira/browse/KAFKA-1546
> > > > https://reviews.apache.org/r/31967/
> > > >
> > > > Aditya
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Neha
> >
>



-- 
-- Guozhang


[jira] [Commented] (KAFKA-1682) Security for Kafka

2015-03-15 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362638#comment-14362638
 ] 

Jun Rao commented on KAFKA-1682:


Just so that everyone knows the sequencing and who is working on what. The 
following is a summary of the jiras that are being actively worked on.

1. KAFKA-1809 (multi-port), by [~gwenshap].
2. KAFKA-1928 (reuse network code in o.a.k.c.n in server), by [~gwenshap], 
depending on KAFKA-1809.
3. KAFKA-1684 (SSL), by [~sriharsha], depending on KAFKA-1928.
4. KAFKA-1686 (SASL), by [~sriharsha], depending on KAFKA-1928.

> Security for Kafka
> --
>
> Key: KAFKA-1682
> URL: https://issues.apache.org/jira/browse/KAFKA-1682
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Affects Versions: 0.9.0
>Reporter: Jay Kreps
>
> Parent ticket for security. Wiki and discussion is here:
> https://cwiki.apache.org/confluence/display/KAFKA/Security



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-1685) Implement TLS/SSL tests

2015-03-15 Thread Sriharsha Chintalapani (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriharsha Chintalapani reassigned KAFKA-1685:
-

Assignee: Sriharsha Chintalapani

> Implement TLS/SSL tests
> ---
>
> Key: KAFKA-1685
> URL: https://issues.apache.org/jira/browse/KAFKA-1685
> Project: Kafka
>  Issue Type: Sub-task
>  Components: security
>Affects Versions: 0.9.0
>Reporter: Jay Kreps
>Assignee: Sriharsha Chintalapani
> Fix For: 0.9.0
>
>
> We need to write a suite of unit tests for TLS authentication. This should be 
> doable with a junit integration test. We can use the simple authorization 
> plugin with only a single user whitelisted. The test can start the server and 
> then connects with and without TLS and validates that access is only possible 
> when authenticated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-1682) Security for Kafka

2015-03-15 Thread Sriharsha Chintalapani (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362644#comment-14362644
 ] 

Sriharsha Chintalapani commented on KAFKA-1682:
---

Also
5. KAFKA-1688(Authorization), by [~parthrbhatt] , depending on KAFKA-1684 & 
KAFKA-1686

> Security for Kafka
> --
>
> Key: KAFKA-1682
> URL: https://issues.apache.org/jira/browse/KAFKA-1682
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Affects Versions: 0.9.0
>Reporter: Jay Kreps
>
> Parent ticket for security. Wiki and discussion is here:
> https://cwiki.apache.org/confluence/display/KAFKA/Security



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-1682) Security for Kafka

2015-03-15 Thread Sriharsha Chintalapani (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362644#comment-14362644
 ] 

Sriharsha Chintalapani edited comment on KAFKA-1682 at 3/16/15 1:14 AM:


Also
5. KAFKA-1688(Authorization), by [~parth.brahmbhatt] , depending on KAFKA-1684 
& KAFKA-1686


was (Author: sriharsha):
Also
5. KAFKA-1688(Authorization), by [~parthrbhatt] , depending on KAFKA-1684 & 
KAFKA-1686

> Security for Kafka
> --
>
> Key: KAFKA-1682
> URL: https://issues.apache.org/jira/browse/KAFKA-1682
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Affects Versions: 0.9.0
>Reporter: Jay Kreps
>
> Parent ticket for security. Wiki and discussion is here:
> https://cwiki.apache.org/confluence/display/KAFKA/Security



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 31967: Patch for KAFKA-1546

2015-03-15 Thread Jun Rao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31967/#review76517
---


Thanks for the patch. A few comments below.


core/src/main/scala/kafka/cluster/Partition.scala


This is actualy not a reliable check for isr expansion. Consider a case 
where there are 3 replicas a, b, and c. Suppose that a and b are in isr and c 
is not. At some point, c fully catches up to LEO. Its lagBeginTimeMs is set to 
-1 and we are about to call maybeExpandIsr(). Before that happens, a and b both 
advance its LEO and HW also advances. Now, we do the check in maybeExpandIsr() 
and add c to isr. However, c now misses messages btw the LEO it sees and the 
new HW.

The original check is more reliable for isr expansion.



core/src/test/scala/unit/kafka/server/ISRExpirationTest.scala


The comment says 10, which is inconsistent with the code change below.



core/src/test/scala/unit/kafka/server/ISRExpirationTest.scala


"catch up to only 10" is no longer accurate.


- Jun Rao


On March 12, 2015, 8:42 p.m., Aditya Auradkar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31967/
> ---
> 
> (Updated March 12, 2015, 8:42 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-1546
> https://issues.apache.org/jira/browse/KAFKA-1546
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> PATCH for KAFKA-1546
> 
> Brief summary of changes:
> - Added a lagBegin metric inside Replica to track the lag in terms of time 
> since the replica did not read from the LEO
> - Using lag begin value in the check for ISR expand and shrink
> - Removed the max lag messages config since it is no longer necessary
> - Returning the initialLogEndOffset in LogReadResult corresponding to the the 
> LEO before actually reading from the log.
> - Unit test cases to test ISR shrinkage and expansion
> 
> 
> Diffs
> -
> 
>   core/src/main/scala/kafka/cluster/Partition.scala 
> c4bf48a801007ebe7497077d2018d6dffe1677d4 
>   core/src/main/scala/kafka/cluster/Replica.scala 
> bd13c20338ce3d73113224440e858a12814e5adb 
>   core/src/main/scala/kafka/server/KafkaConfig.scala 
> 48e33626695ad8a28b0018362ac225f11df94973 
>   core/src/main/scala/kafka/server/ReplicaManager.scala 
> c5274822c57bf3c1f9e4135c0bdcaa87ee50ce20 
>   core/src/test/scala/unit/kafka/server/ISRExpirationTest.scala 
> 92152358c95fa9178d71bd1c079af0a0bd8f1da8 
>   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
> c124c8df5b5079e5ffbd0c4ea359562a66aaf317 
>   core/src/test/scala/unit/kafka/server/LogRecoveryTest.scala 
> 92d6b2c672f74cdd526f2e98da8f7fb3696a88e3 
>   core/src/test/scala/unit/kafka/server/SimpleFetchTest.scala 
> efb457334bd87e4c5b0dd2c66ae1995993cd0bc1 
> 
> Diff: https://reviews.apache.org/r/31967/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Aditya Auradkar
> 
>



Re: [VOTE] KIP-16: Replica Lag Tuning

2015-03-15 Thread Jun Rao
+1 on the approach. Much better than before.

Thanks,

Jun

On Fri, Mar 13, 2015 at 9:54 AM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> Details in the KIP, Jira and RB.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> https://issues.apache.org/jira/browse/KAFKA-1546
> https://reviews.apache.org/r/31967/
>
> Aditya
>
>


RE: [VOTE] KIP-16: Replica Lag Tuning

2015-03-15 Thread Aditya Auradkar
Thanks. We have 5 binding +1s. I'll update the KIP with your comments and mark 
as accepted.

Aditya


From: Jun Rao [j...@confluent.io]
Sent: Sunday, March 15, 2015 8:41 PM
To: dev@kafka.apache.org
Subject: Re: [VOTE] KIP-16: Replica Lag Tuning

+1 on the approach. Much better than before.

Thanks,

Jun

On Fri, Mar 13, 2015 at 9:54 AM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> Details in the KIP, Jira and RB.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> https://issues.apache.org/jira/browse/KAFKA-1546
> https://reviews.apache.org/r/31967/
>
> Aditya
>
>


RE: [VOTE] KIP-16: Replica Lag Tuning

2015-03-15 Thread Aditya Auradkar
Rather, 6 binding +1's.

Aditya


From: Aditya Auradkar
Sent: Sunday, March 15, 2015 10:39 PM
To: dev@kafka.apache.org
Subject: RE: [VOTE] KIP-16: Replica Lag Tuning

Thanks. We have 5 binding +1s. I'll update the KIP with your comments and mark 
as accepted.

Aditya


From: Jun Rao [j...@confluent.io]
Sent: Sunday, March 15, 2015 8:41 PM
To: dev@kafka.apache.org
Subject: Re: [VOTE] KIP-16: Replica Lag Tuning

+1 on the approach. Much better than before.

Thanks,

Jun

On Fri, Mar 13, 2015 at 9:54 AM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> Details in the KIP, Jira and RB.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> https://issues.apache.org/jira/browse/KAFKA-1546
> https://reviews.apache.org/r/31967/
>
> Aditya
>
>