Yeah, I think I gave a scenario, but that is not the same as a concrete use case. I think the question you have is: how common is it that people care about this, and what concrete things would you build where you had this requirement? I think that would be good to figure out.
I think the issue with the current state is that it really gives no SLA at all: the last write to a segment is potentially compacted immediately, so even a few seconds of lag (if your segment size is small) would cause this.

-Jay

On Mon, May 16, 2016 at 9:05 PM, Gwen Shapira <g...@confluent.io> wrote:
> I agree that log.cleaner.min.compaction.lag.ms gives slightly more
> flexibility for potentially-lagging consumers than tuning
> segment.roll.ms for the exact same scenario.
>
> If more people think that the use case of "a consumer which must see
> every single record, is running on a compacted topic, and is lagging
> enough that tuning segment.roll.ms won't help" is important enough
> that we need to address it, I won't object to proceeding with the KIP
> (i.e. I'm probably -0 on this). It is easy to come up with a scenario
> in which a feature is helpful (heck, I do it all the time); I'm just
> not sure there is a real problem that cannot be addressed using
> Kafka's existing behavior.
>
> I do think that it would be an excellent idea to revisit the log
> compaction configurations and see whether they make sense to users.
> For example, if "log.cleaner.min.compaction.lag.X" can replace
> "log.cleaner.min.cleanable.ratio" as an easier-to-tune alternative,
> I'll be more excited about the replacement, even without a strong
> use case for a specific compaction lag.
>
> Gwen
>
> On Mon, May 16, 2016 at 7:46 PM, Jay Kreps <j...@confluent.io> wrote:
> > I think it would be good to hammer out some of the practical use
> > cases -- I definitely share your disdain for adding more configs.
> > Here is my sort-of-theoretical understanding of why you might want
> > this.
> >
> > As you say, a consumer bootstrapping itself in the compacted part
> > of the log isn't actually traversing through valid states globally, i.e.
if you have > > written the following: > > offset, key, value > > 0, k0, v0 > > 1, k1, v1 > > 2, k0, v2 > > it could be compacted to > > 1, k1, v1 > > 2, k0, v2 > > Thus at offset 1 in the compacted log, you would have applied k1, but not > > k0. So even though k0 and k1 both have valid values they get applied out > of > > order. This is totally normal, there is obviously no way to both compact > > and retain every valid state. > > > > For many things this is a non-issue since they treat items only on a > > per-key basis without any global notion of consistency. > > > > But let's say you want to guarantee you only traverse valid states in a > > caught-up real-time consumer, how can you do this? It's actually a bit > > tough. Generally speaking since we don't compact the active segment a > > real-time consumer should have this property but this doesn't really > give a > > hard SLA. With a small segment size and a lagging consumer you could > > imagine the compactor potentially getting ahead of the consumer. > > > > So effectively what this config would establish is a guarantee that as > long > > as you consume all messages in log.cleaner.min.compaction.lag.ms you > will > > get every single produced record. > > > > -Jay > > > > > > > > > > > > On Mon, May 16, 2016 at 6:42 PM, Gwen Shapira <g...@confluent.io> wrote: > > > >> Hi Eric, > >> > >> Thank you for submitting this improvement suggestion. > >> > >> Do you mind clarifying the use-case for me? > >> > >> Looking at your gist: > >> https://gist.github.com/ewasserman/f8c892c2e7a9cf26ee46 > >> > >> If my consumer started reading all the CDC topics from the very > >> beginning in which they were created, without ever stopping, it is > >> obviously guaranteed to see every single consistent state of the > >> database. 
> >> If my consumer joined late (let's say after Tq got clobbered by Tr),
> >> it will get a mixed state; but if it continues listening on those
> >> topics, always following the logs to their end, it is guaranteed to
> >> see a consistent state as soon as a new transaction commits. Am I
> >> missing anything?
> >>
> >> Basically, I do not understand why you claim: "However, to recover all
> >> the tables at the same checkpoint, with each independently compacting,
> >> one may need to move to an even more recent checkpoint when a
> >> different table had the same read issue with the new checkpoint. Thus
> >> one could never be assured of this process terminating."
> >>
> >> I mean, it is true that you need to continuously read forward in order
> >> to get to a consistent state, but why can't you be assured of getting
> >> there?
> >>
> >> We are doing something very similar in Kafka Connect, where we need a
> >> consistent view of our configuration. We make sure that if the current
> >> state is inconsistent (i.e. there is data that is not yet "committed"),
> >> we continue reading the log to its end until we get to a consistent
> >> state.
> >>
> >> I am not convinced the new functionality is necessary, or even helpful.
> >>
> >> Gwen
> >>
> >> On Mon, May 16, 2016 at 4:07 PM, Eric Wasserman
> >> <eric.wasser...@gmail.com> wrote:
> >> > I would like to begin discussion on KIP-58.
> >> >
> >> > The KIP is here:
> >> >
> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable
> >> >
> >> > Jira: https://issues.apache.org/jira/browse/KAFKA-1981
> >> >
> >> > Pull Request: https://github.com/apache/kafka/pull/1168
> >> >
> >> > Thanks,
> >> >
> >> > Eric
> >> >
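[Editor's note: Jay's offset example above can be made concrete with a short sketch. This is hypothetical illustration code, not Kafka source: it simulates compaction (keep only the latest record per key) and then replays the compacted log, showing the consumer pass through a state that never existed in the original log -- k1 applied before k0.]

```python
def compact(log):
    """Keep only the last record for each key, preserving offset order.

    This models Kafka log compaction on a list of (offset, key, value)
    tuples; it is a simplified sketch, not the broker's actual cleaner.
    """
    last_offset = {}
    for offset, key, _ in log:
        last_offset[key] = offset
    return [(o, k, v) for (o, k, v) in log if last_offset[k] == o]


# Jay's example log: k0 is written at offset 0 and overwritten at offset 2.
log = [(0, "k0", "v0"), (1, "k1", "v1"), (2, "k0", "v2")]
compacted = compact(log)
# compacted == [(1, "k1", "v1"), (2, "k0", "v2")]

# Replay the compacted log the way a bootstrapping consumer would,
# recording the materialized state after each record.
state = {}
states = []
for _, k, v in compacted:
    state[k] = v
    states.append(dict(state))
# states[0] == {"k1": "v1"}   -- k1 without any value for k0: this global
#                                state never existed in the original log.
# states[1] == {"k1": "v1", "k0": "v2"}
```

The point of log.cleaner.min.compaction.lag.ms in the KIP is to bound when this can happen: records newer than the lag are never compacted, so a consumer that stays within the lag window sees every produced record in order.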