Hi, Luis,

1. The cleaning policy is configurable at both the global and the topic level. The global one is named log.cleanup.policy and the topic-level one is named cleanup.policy, i.e. the same name with the "log." prefix stripped. We can probably do the same for the new configs.
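To make the naming convention concrete, a topic-level override would then look something like this through the AdminClient ("compaction.strategy" below is just a placeholder for whatever name the KIP settles on, following the same prefix-stripping convention):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class CompactionConfigExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                // "cleanup.policy" is the existing topic-level mirror of the
                // broker's "log.cleanup.policy"; "compaction.strategy" is a
                // PLACEHOLDER name for the new KIP-280 config.
                Config config = new Config(Arrays.asList(
                    new ConfigEntry("cleanup.policy", "compact"),
                    new ConfigEntry("compaction.strategy", "timestamp")));
                admin.alterConfigs(Collections.singletonMap(topic, config))
                     .all().get();
            }
        }
    }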
2. Since this KIP may require an admin to configure a larger dedup buffer size, it would be useful to document this impact in the wiki and the release notes. For reference, each entry in the de-dup map currently takes 24 bytes (a 16-byte key hash plus an 8-byte offset); adding another 8-byte long grows that to 32 bytes, so the same log.cleaner.dedupe.buffer.size holds 25% fewer keys.

3. Yes, it's unlikely for the last message to be removed in the current implementation, since we never clean the active segment. However, in theory this can happen, so it would be useful to guard against it explicitly.

4. Just thought about another issue: we probably want to be a bit careful with key deletion. Currently, one can delete a key by sending a message with a delete tombstone (a null payload). To prevent a reader from missing a deletion if it's removed too quickly, we depend on the configuration log.cleaner.delete.retention.ms (defaults to 1 day). The delete tombstone is only physically removed from the log after that amount of time. The expectation is that, after consuming a message, a reader should finish reading to the end of the log within that configured time.

With the new strategy, we have similar, but slightly different, problems. The first problem is that a delete tombstone may be delivered to a consumer earlier, in offset order, than an outdated record for the same key. For the consumer not to take the outdated record, it should cache the deletion tombstone for some configured amount of time. We can probably piggyback this on log.cleaner.delete.retention.ms, but we need to document it (a sketch of such a consumer-side cache follows below).

The second problem is that once the delete tombstone is physically removed from the log, how can we prevent outdated records from being added (otherwise, they will never be garbage collected)? I'm not sure what the best way to do this is. One possible way is to push this back to the application and require the user not to publish outdated records after log.cleaner.delete.retention.ms.
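Something along these lines is what I mean by caching the tombstone on the consumer side. This is only a rough sketch, assuming the timestamp-based strategy; the class and field names are made up for illustration and are not part of the KIP:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    /**
     * Sketch of the consumer-side guard for problem 1: remember each delete
     * tombstone for the retention window, and drop any record for that key
     * that carries an older "version" (here, timestamp) than the tombstone.
     */
    public class TombstoneCache {
        private static class Entry {
            final long versionTs; // the tombstone's version (its timestamp)
            final long seenAtMs;  // wall-clock time we consumed the tombstone
            Entry(long versionTs, long seenAtMs) {
                this.versionTs = versionTs;
                this.seenAtMs = seenAtMs;
            }
        }

        private final long retentionMs; // align with log.cleaner.delete.retention.ms
        private final Map<String, Entry> tombstones = new HashMap<>();

        public TombstoneCache(long retentionMs) {
            this.retentionMs = retentionMs;
        }

        /** Returns true if the record should be applied to local state. */
        public boolean shouldApply(ConsumerRecord<String, byte[]> record, long nowMs) {
            // Expire tombstones cached for longer than the retention window.
            Iterator<Entry> it = tombstones.values().iterator();
            while (it.hasNext()) {
                if (nowMs - it.next().seenAtMs > retentionMs) {
                    it.remove();
                }
            }
            if (record.value() == null) { // a delete tombstone: cache and apply it
                tombstones.put(record.key(), new Entry(record.timestamp(), nowMs));
                return true;
            }
            // Reject an outdated record that arrives after its key's tombstone
            // in offset order but carries an older (or equal) version.
            Entry tombstone = tombstones.get(record.key());
            return tombstone == null || record.timestamp() > tombstone.versionTs;
        }
    }

The exact shape of the cache doesn't matter; the point is just that the consumer needs its own window tied to log.cleaner.delete.retention.ms, which is why we should document it.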
Thanks,

Jun

On Wed, Jul 4, 2018 at 11:11 AM, Luís Cabral <luis_cab...@yahoo.com.invalid> wrote:

> Hi Jun,
>
> -: 1. I guess both new configurations will be at the topic level?
>
> They will exist in the global configuration, at the very least.
> I would like to have them on the topic level as well, but there is an
> inconsistency between the cleanup/compaction properties that exist “only
> globally” vs “globally + per topic”.
> I haven’t gotten around to investigating why, and whether that reason
> would then also impact the properties I’m suggesting. At first glance
> they seem to belong with the properties that are “only globally”
> configured, but Guozhang has written earlier with a suggestion of a
> compaction property that works for both (though I haven’t had the time
> to look into it yet, unfortunately).
>
> -: 2. Since the log cleaner now needs to keep both the offset and another
> long (say timestamp) in the de-dup map, it reduces the number of keys that
> we can keep in the map and thus may require more rounds of cleaning. This
> is probably not a big issue, but it will be useful to document this impact
> in the KIP.
>
> As a reader, I tend to prefer brief documentation on new features (they
> tend to be too many for me to find the willpower to read a 200-page essay
> about each one), so that influences me to avoid writing about every
> micro-impact that may exist and to simply leave it inferred (as you have
> just done).
> But I also don’t feel strongly enough about it to argue either way. So,
> after reading my argument, if you still insist, I’ll happily add this
> there.
>
> -: 3. With the new cleaning strategy, we want to be a bit careful with
> removing the last message in a partition (which is possible now). We need
> to preserve the offset of the last message so that we don't reuse the
> offset for a different message. One way is to simply never remove the
> last message. Another way (suggested by Jason) is to create an empty
> message batch.
>
> That is a good point, but isn’t the last message always kept regardless?
> In all of my tests with this approach, never have I seen it being removed.
> This is not because I made it so while changing the code; it was simply
> like this before, which made me smile!
> Given these results, I just *assumed* (oops) that these scenarios
> represented the reality, so the compaction would only affect the “tail”,
> while the “head” remained untouched. Now that you say it’s possible that
> the last message actually gets overwritten somehow, I guess a new bullet
> point will have to be added to the KIP for this (after I’ve found the
> time to review the portion of the code that enacts this behaviour).
>
> Kind Regards,
> Luís Cabral
>
> From: Jun Rao
> Sent: 03 July 2018 23:58
> To: dev
> Subject: Re: [VOTE] KIP-280: Enhanced log compaction
>
> Hi, Luis,
>
> Thanks for the KIP. Overall, this seems a useful KIP. A few comments
> below.
>
> 1. I guess both new configurations will be at the topic level?
> 2. Since the log cleaner now needs to keep both the offset and another
> long (say timestamp) in the de-dup map, it reduces the number of keys that
> we can keep in the map and thus may require more rounds of cleaning. This
> is probably not a big issue, but it will be useful to document this impact
> in the KIP.
> 3. With the new cleaning strategy, we want to be a bit careful with
> removing the last message in a partition (which is possible now). We need
> to preserve the offset of the last message so that we don't reuse the
> offset for a different message. One way is to simply never remove the
> last message. Another way (suggested by Jason) is to create an empty
> message batch.
>
> Jun
>
> On Sat, Jun 9, 2018 at 12:39 AM, Luís Cabral
> <luis_cab...@yahoo.com.invalid> wrote:
>
> > Hi all,
> >
> > Any takers on having a look at this KIP and voting on it?
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> >
> > Cheers,
> > Luis
> >