> On May 16, 2016, at 9:21 PM, Eric Wasserman <eric.wasser...@gmail.com> wrote:
>
> Gwen,
>
> For simplicity, the example I gave in the gist is for a single table with a single partition. The salient point is that even for a single topic with one partition there is no guarantee, without the feature, that one will be able to restore some particular checkpoint, as the offset indicated by that checkpoint may have been compacted away.
>
> The practical reality is that we are trying to restore the state of a database with nearly 1000 tables, each of which has 8 partitions. In this real case there are 8000 offsets indicated in each checkpoint. If even a single one of those 8000 is compacted, the checkpointed state cannot be reconstructed.
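For concreteness, the restore Eric describes, replaying each CDC topic-partition from the beginning up to its checkpointed offset with last-write-wins per key, might look roughly like the sketch below. This is only an illustration against the newer Java consumer API; the class and method names are made up and do not come from the KIP or the pull request, and a single flat map stands in for the database state.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class CheckpointRestore {

    // Rebuild keyed state by replaying every topic-partition from the start of
    // the log up to the offset recorded for it in the checkpoint.
    static Map<String, String> restore(KafkaConsumer<String, String> consumer,
                                       Map<TopicPartition, Long> checkpoint) {
        Map<String, String> state = new HashMap<>();
        Map<TopicPartition, Long> remaining = new HashMap<>(checkpoint);

        consumer.assign(checkpoint.keySet());
        consumer.seekToBeginning(checkpoint.keySet());

        while (!remaining.isEmpty()) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> r : records) {
                TopicPartition tp = new TopicPartition(r.topic(), r.partition());
                Long target = remaining.get(tp);
                if (target == null) {
                    continue; // this partition already reached its checkpoint offset
                }
                if (r.offset() <= target) {
                    if (r.value() == null) {
                        state.remove(r.key());         // tombstone: the row was deleted
                    } else {
                        state.put(r.key(), r.value()); // last write wins per key
                    }
                }
                if (r.offset() >= target) {
                    remaining.remove(tp);
                    consumer.pause(Collections.singleton(tp));
                }
            }
        }
        return state;
    }
}

The failure mode in question is silent: if a pre-checkpoint value for some key was compacted away in favor of a value that lies beyond the checkpoint, this loop still terminates, but the map it returns is not the state that actually existed at the checkpoint.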
Eric, I believe you're talking about something like the following: http://imgur.com/3K4looF

That picture shows 3 topic-partitions. The blue lines show where transactions T1, T2, and T3 fall amongst the 3 topic-partitions. So if your consumers read the 3 topic-partitions up to any of the 3 blue lines, they will receive a consistent snapshot across all 3 topics.

Now, look at the next image: http://imgur.com/5KuGWNv

This one shows what happens after compaction. The blacked-out boxes show items that were compacted out. The red lines show the compaction points. T1 obviously cannot be reconstructed, because its offsets have been compacted out. At first glance, it looks like you would be able to read from the beginning up to T2 in order to construct a consistent snapshot, but that is actually not the case. The black box in the 2nd row has been compacted out, superseded by, let's say, the 5th box in row 2. Reading up to T2 in the 2nd row means you would be missing that value. So the only consistent snapshot you can reconstruct is T3. The general rule is: to obtain a consistent snapshot, you must read up to the first transaction that occurs after the compaction points across all 3 topic-partitions.

> Additionally, we don't really intend to have the consumers of the table topics try to keep current. Rather, they will occasionally (say at 1 AM each day) try to build the state of the database at a recent checkpoint (say from midnight). Suppose it takes a bit of time (tens of minutes to hours) to read all the partitions of all the table topics, each up to its target offset indicated in the midnight checkpoint. By the time all the consumers have arrived at the designated offsets, perhaps one of them will have had its target offset compacted away. We would then need to select a new target checkpoint, with its offsets for each topic and partition, that is a bit later. How much later? It might well be around the tens of minutes to hours it took to read through to the offsets of the original target checkpoint, as the compaction that foiled us may have occurred just before we reached the goal.

I'm not sure why you say you need an *additional* tens of minutes to hours in order to reach the next target checkpoint. You have already read up to the first checkpoint (which is no longer good enough). Wouldn't you just have to read a couple of messages past that checkpoint in order to reach the next checkpoint? In my pictures, if you reached T2 and decided it was not good enough, you would simply have to read a couple more messages to get from T2 to T3. You would not have to start over from the beginning in order to read to T3.

-James
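To put the two points above in code form: a checkpoint is reconstructable only if every one of its offsets lies at or beyond the compaction point of its partition, and falling back to a newer checkpoint only means scanning a little further forward, never starting over. The sketch below uses made-up types (a checkpoint is modeled as a map from partition to offset) and assumes the per-partition compaction points are somehow known; as Eric points out, without the proposed feature Kafka gives you no way to bound where those points are.

import org.apache.kafka.common.TopicPartition;

import java.util.List;
import java.util.Map;

public class CheckpointSelector {

    // Returns the oldest checkpoint whose offsets all lie at or beyond the
    // compaction point of their partition, or null if none does (in which
    // case only the current log end is known to be consistent).
    // In the pictures above, T1 and T2 would be rejected and T3 returned.
    static Map<TopicPartition, Long> firstReconstructable(
            List<Map<TopicPartition, Long>> checkpointsOldestFirst,
            Map<TopicPartition, Long> compactionPoints) {

        for (Map<TopicPartition, Long> checkpoint : checkpointsOldestFirst) {
            boolean usable = true;
            for (Map.Entry<TopicPartition, Long> entry : checkpoint.entrySet()) {
                long compactedUpTo = compactionPoints.getOrDefault(entry.getKey(), 0L);
                if (entry.getValue() < compactedUpTo) {
                    // Some record at or below this offset may have been superseded by a
                    // record between the checkpoint and the compaction point, which a
                    // replay up to the checkpoint would never see.
                    usable = false;
                    break;
                }
            }
            if (usable) {
                return checkpoint;
            }
        }
        return null;
    }
}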
> Really the issue is that, without the feature, while we could eventually restore _some_ consistent state, we couldn't be assured of being able to restore any particular (recent) one. My comment about never being assured of the process terminating is just acknowledging the perhaps small but nonetheless finite possibility that the process of chasing checkpoints, looking for one for which no partition has yet had its target offset compacted away, could continue indefinitely. There is really no condition under which one could be absolutely guaranteed this process would terminate.
>
> The feature addresses this by providing a guarantee that _any_ checkpoint can be reconstructed as long as it is within the compaction lag. I would love to be convinced that I am in error, but short of that I frankly would never turn on compaction for a CDC use case without it.
>
> As to reducing the number of parameters: I personally see only min.compaction.lag.ms as truly essential. Even the existing ratio setting is secondary in my mind.
>
> Eric
>
>> On May 16, 2016, at 6:42 PM, Gwen Shapira <g...@confluent.io> wrote:
>>
>> Hi Eric,
>>
>> Thank you for submitting this improvement suggestion.
>>
>> Do you mind clarifying the use case for me?
>>
>> Looking at your gist: https://gist.github.com/ewasserman/f8c892c2e7a9cf26ee46
>>
>> If my consumer started reading all the CDC topics from the very beginning, when they were created, without ever stopping, it is obviously guaranteed to see every single consistent state of the database. If my consumer joined late (let's say after Tq got clobbered by Tr), it will get a mixed state, but if it continues listening on those topics, always following the logs to their end, it is guaranteed to see a consistent state as soon as a new transaction commits. Am I missing anything?
>>
>> Basically, I do not understand why you claim: "However, to recover all the tables at the same checkpoint, with each independently compacting, one may need to move to an even more recent checkpoint when a different table had the same read issue with the new checkpoint. Thus one could never be assured of this process terminating."
>>
>> I mean, it is true that you need to continuously read forward in order to get to a consistent state, but why can't you be assured of getting there?
>>
>> We are doing something very similar in Kafka Connect, where we need a consistent view of our configuration. We make sure that if the current state is inconsistent (i.e. there is data that is not "committed" yet), we continue reading to the log end until we get to a consistent state.
>>
>> I am not convinced the new functionality is necessary, or even helpful.
>>
>> Gwen
>>
>> On Mon, May 16, 2016 at 4:07 PM, Eric Wasserman <eric.wasser...@gmail.com> wrote:
>>> I would like to begin discussion on KIP-58.
>>>
>>> The KIP is here:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable
>>>
>>> Jira: https://issues.apache.org/jira/browse/KAFKA-1981
>>>
>>> Pull Request: https://github.com/apache/kafka/pull/1168
>>>
>>> Thanks,
>>>
>>> Eric
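A closing illustration of what adopting the KIP would mean operationally for this CDC use case: assuming the topic-level min.compaction.lag.ms setting lands as proposed (it did ultimately ship under that name), guaranteeing, say, a 24-hour window in which no checkpoint can be invalidated by compaction would look roughly like the snippet below. It uses the Java AdminClient, which postdates this thread; the bootstrap address, topic name, and lag value are placeholders.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class CdcTopicCompactionLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // "cdc.accounts" is a placeholder for one of the ~1000 table topics.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "cdc.accounts");

            // Records younger than 24 hours are never eligible for compaction, so any
            // checkpoint taken within the last day remains reconstructable.
            Config config = new Config(Collections.singletonList(
                    new ConfigEntry("min.compaction.lag.ms", String.valueOf(24L * 60 * 60 * 1000))));

            // Note: alterConfigs replaces the topic's dynamic config overrides wholesale,
            // so in practice any other per-topic settings (e.g. cleanup.policy=compact)
            // would need to be included here as well.
            admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
        }
    }
}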