I think it would be good to hammer out some of the practical use cases--I
definitely share your disdain for adding more configs. Here is my
theoretical understanding of why you might want this.

As you say, a consumer bootstrapping itself in the compacted part of the
log isn't actually traversing through globally valid states. I.e., if you
have written the following:
  offset, key, value
  0, k0, v0
  1, k1, v1
  2, k0, v2
it could be compacted to
  1, k1, v1
  2, k0, v2
Thus at offset 1 in the compacted log, you would have applied k1, but not
k0. So even though k0 and k1 both have valid values, they get applied out
of order. This is totally normal; there is obviously no way to both
compact and retain every valid state.
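
To make that concrete, here is a toy sketch (plain Java, nothing
Kafka-specific; the data matches the example above) that replays both
versions of the log into a map and prints each intermediate state:

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class CompactionReplay {
      // Replay records in order into a key/value map, printing the
      // state after each record is applied.
      static void replay(String name, String[][] records) {
          Map<String, String> state = new LinkedHashMap<>();
          System.out.println(name + ":");
          for (String[] r : records) {
              state.put(r[0], r[1]);
              System.out.println("  after " + r[0] + "=" + r[1] + " -> " + state);
          }
      }

      public static void main(String[] args) {
          // Full log: every prefix is a state the producer actually
          // passed through.
          replay("full log", new String[][]
              {{"k0", "v0"}, {"k1", "v1"}, {"k0", "v2"}});
          // Compacted log: the state after its first record ({k1=v1},
          // no k0) never existed, because k0 was written before k1.
          replay("compacted log", new String[][]
              {{"k1", "v1"}, {"k0", "v2"}});
      }
  }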

For many applications this is a non-issue, since they treat items only on
a per-key basis without any global notion of consistency.

But let's say you want to guarantee that a caught-up real-time consumer
only traverses valid states--how can you do this? It's actually a bit
tough. Generally speaking, since we don't compact the active segment, a
real-time consumer should have this property, but this doesn't really give
a hard SLA. With a small segment size and a lagging consumer you could
imagine the compactor getting ahead of the consumer.

So effectively what this config would establish is a guarantee that as
long as you consume all messages within log.cleaner.min.compaction.lag.ms
of their being written, you will see every single produced record.
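
For example, assuming the config lands as proposed in the KIP (a
broker-level setting in server.properties):

  # Records are safe from compaction for 24 hours after being written,
  # so any consumer that stays within 24 hours of the log head is
  # guaranteed to see every record.
  log.cleaner.min.compaction.lag.ms=86400000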

-Jay

On Mon, May 16, 2016 at 6:42 PM, Gwen Shapira <g...@confluent.io> wrote:

> Hi Eric,
>
> Thank you for submitting this improvement suggestion.
>
> Do you mind clarifying the use-case for me?
>
> Looking at your gist:
> https://gist.github.com/ewasserman/f8c892c2e7a9cf26ee46
>
> If my consumer started reading all the CDC topics from the very
> beginning, when they were created, without ever stopping, it is
> obviously guaranteed to see every single consistent state of the
> database.
> If my consumer joined late (let's say after Tq got clobbered by Tr) it
> will get a mixed state, but if it continues listening on those
> topics, always following the logs to their end, it is guaranteed to
> see a consistent state as soon as a new transaction commits. Am I
> missing anything?
>
> Basically, I do not understand why you claim: "However, to recover all
> the tables at the same checkpoint, with each independently compacting,
> one may need to move to an even more recent checkpoint when a
> different table had the same read issue with the new checkpoint. Thus
> one could never be assured of this process terminating."
>
> I mean, it is true that you need to continuously read forward in order
> to get to a consistent state, but why can't you be assured of getting
> there?
>
> We are doing something very similar in KafkaConnect, where we need a
> consistent view of our configuration. We make sure that if the current
> state is inconsistent (i.e. there is data that is not "committed"
> yet), we continue reading to the log end until we get to a consistent
> state.
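>
> Roughly, the pattern looks like this (just a sketch against the Java
> consumer API, not our actual code; apply() is a placeholder for
> whatever rebuilds the in-memory state):
>
>   import java.util.Collections;
>   import org.apache.kafka.clients.consumer.ConsumerRecord;
>   import org.apache.kafka.clients.consumer.KafkaConsumer;
>   import org.apache.kafka.common.TopicPartition;
>
>   public class ReadToEnd {
>       // Placeholder: whatever rebuilds the in-memory state.
>       static void apply(ConsumerRecord<byte[], byte[]> record) { }
>
>       static void readToEnd(KafkaConsumer<byte[], byte[]> consumer,
>                             TopicPartition tp) {
>           consumer.assign(Collections.singletonList(tp));
>           // Capture the current end of the log...
>           consumer.seekToEnd(Collections.singletonList(tp));
>           long end = consumer.position(tp);
>           // ...then replay everything up to it.
>           consumer.seekToBeginning(Collections.singletonList(tp));
>           while (consumer.position(tp) < end)
>               for (ConsumerRecord<byte[], byte[]> record : consumer.poll(100))
>                   apply(record);
>           // If the state is still not consistent here, keep polling:
>           // the missing "commit" will arrive with new records.
>       }
>   }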
>
> I am not convinced the new functionality is necessary, or even helpful.
>
> Gwen
>
> On Mon, May 16, 2016 at 4:07 PM, Eric Wasserman
> <eric.wasser...@gmail.com> wrote:
> > I would like to begin discussion on KIP-58
> >
> > The KIP is here:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable
> >
> > Jira: https://issues.apache.org/jira/browse/KAFKA-1981
> >
> > Pull Request: https://github.com/apache/kafka/pull/1168
> >
> > Thanks,
> >
> > Eric
>
