Yeah, I think I gave a scenario, but that is not the same as a concrete use case. I think the question you have is: how common is it that people care about this, and what concrete things would you build where you had this requirement? I think that would be good to figure out.
I think the issue with the current state is that it really gives no SLA at all: the last write to a segment is potentially compacted immediately, so even a few seconds of lag (if your segment size is small) would cause this.

-Jay

On Mon, May 16, 2016 at 9:05 PM, Gwen Shapira <g...@confluent.io> wrote:
> I agree that log.cleaner.min.compaction.lag.ms gives slightly more
> flexibility for potentially-lagging consumers than tuning
> segment.roll.ms for the exact same scenario.
>
> If more people think that the use case of "a consumer which must see
> every single record, is running on a compacted topic, and is lagging
> enough that tuning segment.roll.ms won't help" is important enough
> that we need to address it, I won't object to proceeding with the KIP
> (i.e. I'm probably -0 on this). It is easy to come up with a scenario
> in which a feature is helpful (heck, I do it all the time); I'm just
> not sure there is a real problem that cannot be addressed using
> Kafka's existing behavior.
>
> I do think that it would be an excellent idea to revisit the log
> compaction configurations and see whether they make sense to users.
> For example, if "log.cleaner.min.compaction.lag.X" can replace
> "log.cleaner.min.cleanable.ratio" as an easier-to-tune alternative,
> I'll be more excited about the replacement, even without a strong
> use case for a specific compaction lag.
>
> Gwen
>
> On Mon, May 16, 2016 at 7:46 PM, Jay Kreps <j...@confluent.io> wrote:
> > I think it would be good to hammer out some of the practical use
> > cases -- I definitely share your disdain for adding more configs.
> > Here is my sort-of-theoretical understanding of why you might want
> > this.
> >
> > As you say, a consumer bootstrapping itself in the compacted part
> > of the log isn't actually traversing through valid states globally, i.e.
if you have > > written the following: > > offset, key, value > > 0, k0, v0 > > 1, k1, v1 > > 2, k0, v2 > > it could be compacted to > > 1, k1, v1 > > 2, k0, v2 > > Thus at offset 1 in the compacted log, you would have applied k1, but not > > k0. So even though k0 and k1 both have valid values they get applied out > of > > order. This is totally normal, there is obviously no way to both compact > > and retain every valid state. > > > > For many things this is a non-issue since they treat items only on a > > per-key basis without any global notion of consistency. > > > > But let's say you want to guarantee you only traverse valid states in a > > caught-up real-time consumer, how can you do this? It's actually a bit > > tough. Generally speaking since we don't compact the active segment a > > real-time consumer should have this property but this doesn't really > give a > > hard SLA. With a small segment size and a lagging consumer you could > > imagine the compactor potentially getting ahead of the consumer. > > > > So effectively what this config would establish is a guarantee that as > long > > as you consume all messages in log.cleaner.min.compaction.lag.ms you > will > > get every single produced record. > > > > -Jay > > > > > > > > > > > > On Mon, May 16, 2016 at 6:42 PM, Gwen Shapira <g...@confluent.io> wrote: > > > >> Hi Eric, > >> > >> Thank you for submitting this improvement suggestion. > >> > >> Do you mind clarifying the use-case for me? > >> > >> Looking at your gist: > >> https://gist.github.com/ewasserman/f8c892c2e7a9cf26ee46 > >> > >> If my consumer started reading all the CDC topics from the very > >> beginning in which they were created, without ever stopping, it is > >> obviously guaranteed to see every single consistent state of the > >> database. 
> >> If my consumer joined late (let's say after Tq got clobbered by Tr),
> >> it will get a mixed state; but if it continues listening on those
> >> topics, always following the logs to their end, it is guaranteed to
> >> see a consistent state as soon as a new transaction commits. Am I
> >> missing anything?
> >>
> >> Basically, I do not understand why you claim: "However, to recover all
> >> the tables at the same checkpoint, with each independently compacting,
> >> one may need to move to an even more recent checkpoint when a
> >> different table had the same read issue with the new checkpoint. Thus
> >> one could never be assured of this process terminating."
> >>
> >> I mean, it is true that you need to continuously read forward in order
> >> to get to a consistent state, but why can't you be assured of getting
> >> there?
> >>
> >> We are doing something very similar in Kafka Connect, where we need a
> >> consistent view of our configuration. We make sure that if the current
> >> state is inconsistent (i.e. there is data that is not yet "committed"),
> >> we continue reading the log to its end until we get to a consistent
> >> state.
> >>
> >> I am not convinced the new functionality is necessary, or even helpful.
> >>
> >> Gwen
> >>
> >> On Mon, May 16, 2016 at 4:07 PM, Eric Wasserman
> >> <eric.wasser...@gmail.com> wrote:
> >> > I would like to begin discussion on KIP-58.
> >> >
> >> > The KIP is here:
> >> >
> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable
> >> >
> >> > Jira: https://issues.apache.org/jira/browse/KAFKA-1981
> >> >
> >> > Pull Request: https://github.com/apache/kafka/pull/1168
> >> >
> >> > Thanks,
> >> >
> >> > Eric
> >> >
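[Editor's note: Jay's offset example above can be made concrete with a short sketch. This is hypothetical illustration code, not Kafka source: it simulates compaction (keep only the latest record per key) and then replays the compacted log, showing the consumer pass through a state that never existed in the original log -- k1 applied before k0.]

```python
def compact(log):
    """Keep only the last record for each key, preserving offset order.

    This models Kafka log compaction on a list of (offset, key, value)
    tuples; it is a simplified sketch, not the broker's actual cleaner.
    """
    last_offset = {}
    for offset, key, _ in log:
        last_offset[key] = offset
    return [(o, k, v) for (o, k, v) in log if last_offset[k] == o]


# Jay's example log: k0 is written at offset 0 and overwritten at offset 2.
log = [(0, "k0", "v0"), (1, "k1", "v1"), (2, "k0", "v2")]
compacted = compact(log)
# compacted == [(1, "k1", "v1"), (2, "k0", "v2")]

# Replay the compacted log the way a bootstrapping consumer would,
# recording the materialized state after each record.
state = {}
states = []
for _, k, v in compacted:
    state[k] = v
    states.append(dict(state))
# states[0] == {"k1": "v1"}   -- k1 without any value for k0: this global
#                                state never existed in the original log.
# states[1] == {"k1": "v1", "k0": "v2"}
```

The point of log.cleaner.min.compaction.lag.ms in the KIP is to bound when this can happen: records newer than the lag are never compacted, so a consumer that stays within the lag window sees every produced record in order.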