Hi all,

First of all, thanks Onur for fixing this.

With regards to where we merge the code, my personal preference would be to
merge it to trunk ASAP ((which means that it would be part of 0.10.3.0).
Generally, I think we should be conservative when it comes to code merged
right before the RC gets cut, i.e. we should limit it to documentation,
tests, fixes for regressions and critical fixes that affect new
functionality (to avoid shipping broken new features). As we now have
frequent time-based releases, the next release is never far away and it
reduces/eliminates the need to fast-track KIPs in my opinion. Also we have
to be particularly careful to avoid a situation where we have frequent, but
slightly broken time-based releases making it difficult for users to choose
a good release.

For this particular KIP, I agree with Onur that people wouldn't want to
rely on a non-deterministic RF, so the code change would only affect people
setting up new clusters with 1 or 2 Kafka nodes using an existing config
(i.e. not the updated server.properties) and with no
offsets.topic.replication.factor set in the config. Hopefully people are
not doing this outside of test clusters and the fix is easy.

The second point is the new config in `server.properties`. The fact that
`server.properties` should only be used for quickstarts is obvious to us,
but seemingly not for other people. We have a couple of JIRAs[1][2] and a
number of mailing list threads about people who are surprised by the fact
that `log.dir` is set to `/tmp` in `server.properties`. I actually think
that we should rename `server.properties` to `server-quickstart.properties`
(or something similar) and have a more reasonable `server.properties`
(maybe with a different name). In any case, that's out of scope for this
KIP, but the point seemed relevant.

Having said all that, if I'm the only one who feels this way about
including this KIP in 0.10.2, then I'm OK with it being included.

Ewen, your suggestions for the post-mortem thread sound reasonable to me
(although I'd like to understand better what we would classify as bug fix
KIPs).

Ismael

[1] https://issues.apache.org/jira/browse/KAFKA-1906
[2] https://issues.apache.org/jira/browse/KAFKA-3925

On Tue, Jan 31, 2017 at 7:53 AM, Onur Karaman <okara...@linkedin.com.invalid
> wrote:

> I get where Ewen's coming from but personally, I have trouble understanding
> a use case where end users would specifically rely on existing behavior of
> group coordination and offset commits succeeding with nondeterministic RF
> (which is only later to be manually fixed) during the window of time when a
> legitimate non-quickstart cluster is in the process of being setup.
>
> As Ewen mentioned, server.properties is already blatantly not meant for
> production use cases but actually for quickstart. It chooses /tmp for its
> log.dirs, localhost for its zookeeper.connect, and references
> kafka.server.KafkaConfig for additional details and defaults. It seems that
> setting offsets.topic.replication.factor to 1 in server.properties is the
> right way to go.
>
> That being said, if there actually is opposition, then it would be fair to
> bump it to 0.10.3.0.
>
> On Mon, Jan 30, 2017 at 6:12 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > So, we have one other blocker bug in system tests that we're trying to
> make
> > sure can safely be removed, so we've had a bit of slack time with this.
> > Obviously having this all happen very last minute isn't really ideal
> since
> > it didn't allow enough time to address the feedback -- Stevo's questions
> > didn't get a response until today.
> >
> > I'd really like this fix in because I think it actually seems pretty low
> > risk, although it would potentially make some consumers on a new cluster
> be
> > delayed indefinitely if the user is relying on having fewer than the
> > default offset.topic.replication.factor. Many users will have there
> > server.properties files stored elsewhere and not rely on the version we
> > ship (and they shouldn't -- it contains other dangerous settings like
> > log.dirs in /tmp). This does mean some users could be negatively
> affected,
> > although I have to admit it seems like it'd be an extremely small
> > population of our users.
> >
> > All that said, given the late voting and discussion continuing while the
> > VOTE thread proceeded, I'd like to suggest that if anyone objects to
> > including in 0.10.2.0, we bump it to 0.10.3.0 and also follow up on
> > Ismael's question (re: does this need to be in a major version, which
> > afaict was only followed up by Jeff's comment that he'd rather not wait).
> > The patch is there, so those that desperately want it can easily
> > cherry-pick it and build a version themselves if it's that big a problem
> > for them, else they would see the fix in the next release.
> >
> > More generally, I think this KIP (and my experience dealing w/ the KIPs +
> > feature freeze for this release) suggest we need a little more clarity on
> > how KIPs fit into the release process. I'm going to follow up in a new
> > release post-mortem thread further but I think we should a) start to
> > distinguish between feature KIPs and bug fix KIPs and possibly apply
> > different rules for them (this one is admittedly ambiguous) and b) push
> the
> > deadline for feature KIPs to a week before feature freeze (giving a more
> > realistic review period for all that code) and make bugfix KIPs
> acceptable
> > up to a week before code freeze (again, providing a realistic review
> > period), and c) move our freeze dates off of Fridays since having them
> > there implicitly gives people 2 extra days to squeeze things in.
> >
> > -Ewen
> >
> > On Mon, Jan 30, 2017 at 1:46 PM, Onur Karaman
> > <okara...@linkedin.com.invalid
> > > wrote:
> >
> > > I've updated the KIP title to reflect this distinction:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 115%3A+Enforce+offsets.topic.replication.factor+upon+__
> > > consumer_offsets+auto+topic+creation
> > >
> > > On Mon, Jan 30, 2017 at 12:52 AM, Onur Karaman <
> > > onurkaraman.apa...@gmail.com
> > > > wrote:
> > >
> > > > Regarding Joel's comment:
> > > > > On Jan 25, 2017, at 9:26 PM, Joel Koshy <jjkosh...@gmail.com>
> wrote:
> > > > >
> > > > > already voted, but one thing worth considering (since this KIP
> speaks
> > > of
> > > > > *enforcement*) is desired behavior if the topic already exists and
> > the
> > > > > config != existing RF.
> > > > >
> > > >
> > > > The short answer: The first line of the KIP specifically states that
> it
> > > > attempts enforcement only upon topic creation and I think anything
> else
> > > > should be beyond the scope of this KIP.
> > > >
> > > > The long answer:
> > > > Mismatch between existing RF and the "offsets.topic.replication.
> > factor"
> > > > config happens with:
> > > > a. topic creation paths 3-5 as defined in the KIP if the size of the
> > > > replicas set resulting from AdminUtils != "offsets.topic.replication.
> > > > factor"
> > > > b. topic creation path 6
> > > > c. a config change to the broker's "offsets.topic.replication.
> factor"
> > > > d. partition reassignments that expand the RF
> > > >
> > > > For all of these scenarios, I believe it all boils down to the intent
> > of
> > > > the imperfectly named "offsets.topic.replication.factor" and
> > > > "default.replication.factor" configs. These configs really only
> > represent
> > > > the RF to be used upon auto topic creation by the broker and are
> never
> > > > referenced anywhere else, whether it's "offsets.topic.replication.
> > > factor"
> > > > for the __consumer_offsets topic or "default.replication.factor" for
> > any
> > > > other topic.
> > > >
> > > > I think any RF mismatch after topic creation is beyond the scope of
> > this
> > > > discussion since the configs anyways were not intended to enforce RF
> > > > anywhere other than upon auto topic creation by the broker.
> > > >
> > > > Regarding Stevo's comment:
> > > > > On Thu, Jan 26, 2017 at 4:26 AM, Stevo Slavić <ssla...@gmail.com>
> > > wrote:
> > > > > test cluster. If this problem of offsets.topic.replication.factor
> > not
> > > > being
> > > > > enforced others also observed only in their tests only, than I
> don't
> > > like
> > > > > the KIP proposed change, of setting offsets.topic.replication.
> factor
> > > to
> > > > 1
> > > > > by default. I understand backward compatibility goals of this, but
> I
> > > can
> > > > > imagine late discovered production issues as consequences of this
> > > change.
> > > >
> > > > It's worth noting that the KafkaConfig default for
> > > > "offsets.topic.replication.factor" is still 3. This KIP merely
> changes
> > > the
> > > > config/server.properties default to 1 so that the quickstart
> > > > "./bin/kafka-server-start.sh config/server.properties" continues to
> run
> > > > smoothly.
> > > >
> > > > On Thu, Jan 26, 2017 at 9:34 AM, Colin McCabe <cmcc...@apache.org>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > +1 (non-binding) for KIP-115.
> > > > >
> > > > > On Thu, Jan 26, 2017, at 04:26, Stevo Slavić wrote:
> > > > > > If I understood well, this KIP is trying to solve for the problem
> > of
> > > > > > offsets.topic.replication.factor not being enforced,
> particularly
> > in
> > > > > > context of  "when you have clients or tooling running as the
> > cluster
> > > is
> > > > > > getting setup". Assuming that this problem was observed in
> > > production,
> > > > so
> > > > > > in non-testing only conditions, would it make sense to introduce
> > > > > > additional
> > > > > > property - min number of alive brokers before offsets topic is
> > > allowed
> > > > to
> > > > > > be created?
> > > > >
> > > > > It's an interesting idea, but... is there a user use-case for a
> > > property
> > > > > like this?  I'm having a hard time thinking of one, but maybe I
> > missed
> > > > > something.
> > > > >
> > > > > cheers,
> > > > > Colin
> > > > >
> > > > > >
> > > > > > Currently offsets.topic.replication.factor is used for that
> > purpose,
> > > > so
> > > > > > with offsets.topic.replication.factor set to 3 it's enough to
> have
> > > > just
> > > > > 3
> > > > > > brokers up for offsets topic to be created. Then all replicas of
> > all
> > > > (by
> > > > > > default 50) partitions of this topic would be spread out over
> just
> > > > these
> > > > > > 3
> > > > > > brokers, while eventually entire cluster might be much larger in
> > size
> > > > and
> > > > > > would benefit from wider spread of consumer offsets topic
> > partitions
> > > > > > leadership.
> > > > > >
> > > > > > One can achieve wider spread later, manually. But that would
> first
> > > have
> > > > > > to
> > > > > > be detected, and then use provided CLI/scripts to change replica
> > > > > > assignment. IMO it would be better if it was possible to
> configure
> > > > > > desired
> > > > > > spread, even if just indirectly through configuring min number of
> > > alive
> > > > > > brokers. If not overriden in server.properties, this new property
> > can
> > > > > > default to offsets.topic.replication.factor
> > > > > >
> > > > > > I've been bitten by problem of offsets.topic.replication.factor
> > not
> > > > > being
> > > > > > enforced but only in testing, integration tests, it was almost
> > > > > > unpredictable when offsets topic is ready, test cluster
> > initialized,
> > > > > > would
> > > > > > get lots of false failures, unstable tests, but eventually got to
> > > > > > predictable deterministic test behavior, found ways to fully
> > > initialize
> > > > > > test cluster. If this problem of offsets.topic.replication.
> factor
> > > not
> > > > > > being
> > > > > > enforced others also observed only in their tests only, than I
> > don't
> > > > like
> > > > > > the KIP proposed change, of setting offsets.topic.replication.
> > factor
> > > > to
> > > > > 1
> > > > > > by default. I understand backward compatibility goals of this,
> but
> > I
> > > > can
> > > > > > imagine late discovered production issues as consequences of this
> > > > change.
> > > > > > So I wouldn't like to trade off production issues probability for
> > > > testing
> > > > > > convenience.
> > > > > >
> > > > > > Current Kafka documentation has nice note about
> > > > > > offsets.topic.replication.factor and related behavior. New note
> > > about
> > > > > new
> > > > > > default would have to be a warning in bold and red in docs, and
> > every
> > > > > > broker should output proper warning in log if configuration for
> > > > > > offsets.topic.replication.factor is on new proposed default of
> 1.
> > > > > >
> > > > > > Kind regards,
> > > > > > Stevo Slavic.
> > > > > >
> > > > > > On Thu, Jan 26, 2017 at 8:43 AM, James Cheng <
> wushuja...@gmail.com
> > >
> > > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > > On Jan 25, 2017, at 9:26 PM, Joel Koshy <jjkosh...@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > already voted, but one thing worth considering (since this
> KIP
> > > > > speaks of
> > > > > > > > *enforcement*) is desired behavior if the topic already
> exists
> > > and
> > > > > the
> > > > > > > > config != existing RF.
> > > > > > > >
> > > > > > >
> > > > > > > Yeah, I'm curious about this too.
> > > > > > >
> > > > > > > -James
> > > > > > >
> > > > > > > > On Wed, Jan 25, 2017 at 4:30 PM, Dong Lin <
> lindon...@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > >> +1
> > > > > > > >>
> > > > > > > >> On Wed, Jan 25, 2017 at 4:22 PM, Ismael Juma <
> > ism...@juma.me.uk
> > > >
> > > > > wrote:
> > > > > > > >>
> > > > > > > >>> An important question is if this needs to wait for a major
> > > > release
> > > > > or
> > > > > > > >> not.
> > > > > > > >>>
> > > > > > > >>> Ismael
> > > > > > > >>>
> > > > > > > >>> On Thu, Jan 26, 2017 at 12:19 AM, Ismael Juma <
> > > ism...@juma.me.uk
> > > > >
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> +1 from me too.
> > > > > > > >>>>
> > > > > > > >>>> Ismael
> > > > > > > >>>>
> > > > > > > >>>> On Thu, Jan 26, 2017 at 12:07 AM, Ewen Cheslack-Postava <
> > > > > > > >>> e...@confluent.io
> > > > > > > >>>>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> +1
> > > > > > > >>>>>
> > > > > > > >>>>> Since this is an unusual one, I think it's worth pointing
> > out
> > > > > that
> > > > > > > the
> > > > > > > >>> KIP
> > > > > > > >>>>> notes it is really a bug fix, but since it has
> > compatibility
> > > > > > > >>> implications
> > > > > > > >>>>> the KIP was worth it. It was a sort of intentional bug,
> but
> > > > > confusing
> > > > > > > >>> and
> > > > > > > >>>>> dangerous.
> > > > > > > >>>>>
> > > > > > > >>>>> Seems important to fix this ASAP since people are hitting
> > > this
> > > > in
> > > > > > > >>> practice
> > > > > > > >>>>> and would have to go out of their way to set up
> monitoring
> > to
> > > > > catch
> > > > > > > >> the
> > > > > > > >>>>> issue.
> > > > > > > >>>>>
> > > > > > > >>>>> -Ewen
> > > > > > > >>>>>
> > > > > > > >>>>> On Wed, Jan 25, 2017 at 4:02 PM, Jason Gustafson <
> > > > > ja...@confluent.io
> > > > > > > >
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> +1 from me. The current behavior seems both surprising
> and
> > > > > > > >> dangerous.
> > > > > > > >>>>>>
> > > > > > > >>>>>> -Jason
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Wed, Jan 25, 2017 at 3:58 PM, Onur Karaman <
> > > > > > > >>>>>> onurkaraman.apa...@gmail.com>
> > > > > > > >>>>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Hey everyone.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I made a bug-fix KIP-115 to enforce
> > > > offsets.topic.replication.
> > > > > > > >>> factor:
> > > > > > > >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > > > >>>>>>> 115%3A+Enforce+offsets.topic.replication.factor
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Comments are welcome.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> - Onur
> > > > > > > >>>>>>>
> > > > > > > >>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to