[ https://issues.apache.org/jira/browse/KAFKA-13239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406115#comment-17406115 ]

A. Sophie Blee-Goldman commented on KAFKA-13239:
------------------------------------------------

Thanks for looping back on this idea we've been tossing around for a long time 
– maybe this time we'll actually get to doing it :P A few things:
{quote}The {{IngestExternalFileOptions}} would be specifically configured to 
allow key range overlapping with mem-table
{quote}
What does this mean/how do memtables factor into this? Isn't the whole point of 
ingesting entire SST files that we are skipping the memtables and just dumping 
data into the db backend? Or does this have to do with (4), and normal 
processing (which writes to memtables) may actually begin in parallel? 

On that note, are we assuming that restoration will already be performed in a 
dedicated thread(s)? It sounds like that's the case, but I know when we last 
discussed this optimization it was as an alternative/precursor to the "restore 
in a separate thread" work. 

Either way, I'm not convinced it will be so simple to avoid issues due to 
compaction/write stalls. The blocking RocksDB.compactRange() call was removed a 
few versions ago because it caused threads to drop out of the consumer group – 
and how is blocking on it better than a write stall? Either way we're blocking 
until compaction is complete. Does RocksDB.compactRange() maybe provide some 
control over how to define "complete", i.e. could we use it to block for 
smaller intervals so we can poll in between, or something like that?
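For what it's worth, RocksDB.compactRange() in RocksJava does accept explicit 
key bounds, so one conceivable (untested) shape for "blocking in smaller 
intervals" is to manually compact the keyspace one slice at a time and run a 
callback in between – a sketch only, where the slice boundaries and the 
between-slices callback are assumptions, not anything Streams does today:

```java
import java.util.List;

import org.rocksdb.CompactRangeOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class SlicedCompaction {
  static { RocksDB.loadLibrary(); }

  /**
   * Compact the default column family one [begin, end] key slice at a time,
   * invoking a callback (e.g. a consumer poll to stay in the group) between
   * slices instead of blocking once for the whole keyspace.
   */
  public static void compactInSlices(final RocksDB db,
                                     final List<byte[][]> slices,
                                     final Runnable betweenSlices)
      throws RocksDBException {
    try (CompactRangeOptions options =
             new CompactRangeOptions().setExclusiveManualCompaction(false)) {
      for (final byte[][] slice : slices) {
        // blocks only until this slice's compaction is done
        db.compactRange(db.getDefaultColumnFamily(), slice[0], slice[1], options);
        betweenSlices.run(); // hypothetical, e.g. restoreConsumer.poll(Duration.ZERO)
      }
    }
  }
}
```

Whether per-slice blocking is actually short enough to poll within 
max.poll.interval.ms would of course need measuring.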

> Use RocksDB.ingestExternalFile for restoration
> ----------------------------------------------
>
>                 Key: KAFKA-13239
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13239
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Guozhang Wang
>            Priority: Major
>
> Now that we are on a newer version of RocksDB, we can consider using the new
> {code}
> ingestExternalFile(final ColumnFamilyHandle columnFamilyHandle,
>       final List<String> filePathList,
>       final IngestExternalFileOptions ingestExternalFileOptions)
> {code}
> for restoring changelog into state stores. More specifically:
> 1) Use a larger default batch size in the restore consumer's polling behavior 
> so that each poll returns as many records as possible.
> 2) For a single batch of records returned from a restore consumer poll call, 
> first write them as a single SST File using the {{SstFileWriter}}. The 
> existing {{DBOptions}} could be used to construct the {{EnvOptions}} and 
> {{Options}} for the writer. Do not ingest the written file into the db within 
> each iteration.
> 3) At the end of the restoration, call {{RocksDB.ingestExternalFile}} given 
> all the written files' paths as the parameter. The 
> {{IngestExternalFileOptions}} would be specifically configured to allow key 
> range overlapping with mem-table.
> 4) A specific note is that after the call in 3), RocksDB may execute heavy 
> compaction in the background; if normal processing starts immediately, before 
> the compaction cools down, the {{put}} calls for new records may see high 
> write stalls. To work around this we could consider using 
> {{RocksDB.compactRange()}}, which blocks until the compaction is completed.
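For concreteness, steps 2) and 3) quoted above might look roughly like the 
following sketch against the RocksJava API – the batch shape, file naming, and 
option choices here are illustrative assumptions, not the actual Streams 
restore path:

```java
import java.util.List;
import java.util.Map;

import org.rocksdb.EnvOptions;
import org.rocksdb.IngestExternalFileOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileWriter;

public class SstRestoreSketch {
  static { RocksDB.loadLibrary(); }

  /**
   * Step 2: write one polled batch as a single SST file without ingesting it.
   * SstFileWriter requires keys in the column family's comparator order, so
   * the caller must pre-sort the batch (bytewise for the default comparator).
   */
  public static void writeBatch(final Options options,
                                final Map<byte[], byte[]> sortedBatch,
                                final String sstPath) throws RocksDBException {
    try (EnvOptions envOptions = new EnvOptions();
         SstFileWriter writer = new SstFileWriter(envOptions, options)) {
      writer.open(sstPath);
      for (final Map.Entry<byte[], byte[]> entry : sortedBatch.entrySet()) {
        writer.put(entry.getKey(), entry.getValue());
      }
      writer.finish(); // seals the file on disk; nothing reaches the db yet
    }
  }

  /** Step 3: ingest every written file in one call at the end of restoration. */
  public static void ingestAll(final RocksDB db,
                               final List<String> sstPaths) throws RocksDBException {
    try (IngestExternalFileOptions ingestOptions = new IngestExternalFileOptions()) {
      ingestOptions.setMoveFiles(true);          // link/move files instead of copying
      ingestOptions.setAllowBlockingFlush(true); // flush memtable on key-range overlap
      db.ingestExternalFile(sstPaths, ingestOptions);
    }
  }
}
```

(On the memtable question above: as far as I can tell the relevant knob is 
{{setAllowBlockingFlush}}, which lets the ingestion flush the memtable when key 
ranges overlap instead of failing the call.)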



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
