[ https://issues.apache.org/jira/browse/KAFKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias J. Sax updated KAFKA-4317:
-----------------------------------
Description:
Right now, the checkpoint files for logged RocksDB stores are written during a
graceful shutdown and removed when restoration begins. Unfortunately, this means
that if the process is forcibly killed, the checkpoint files are missing, so all
RocksDB stores are rematerialized from scratch on the next launch.
In a way, this is good, because it simulates bootstrapping a new node (for
example, it's a good way to see how much I/O is needed to rematerialize the
stores); however, it leads to longer recovery times when a non-graceful shutdown
occurs and we want to get the job up and running again.
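For context, the checkpoint file is essentially a small map from each store's
changelog partition to the last offset known to be flushed into RocksDB; its
presence is what lets restoration resume from those offsets instead of replaying
the changelog from the beginning. A minimal sketch of that idea follows (the
class name and file format here are hypothetical, not the actual Streams
internals):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Streams-internal checkpoint file: one
// "changelogPartition offset" pair per line, e.g. "my-store-changelog-0 42".
public class SimpleOffsetCheckpoint {

    private final Path file;

    public SimpleOffsetCheckpoint(final Path file) {
        this.file = file;
    }

    // Today this is only written during a graceful shutdown; a kill -9
    // skips it entirely, so the next launch finds no checkpoint.
    public void write(final Map<String, Long> offsets) throws IOException {
        final List<String> lines = new ArrayList<>();
        for (final Map.Entry<String, Long> entry : offsets.entrySet()) {
            lines.add(entry.getKey() + " " + entry.getValue());
        }
        Files.write(file, lines);
    }

    // Read on startup: if the file exists, restoration can resume from the
    // recorded offsets instead of replaying the changelog from the beginning.
    public Map<String, Long> read() throws IOException {
        final Map<String, Long> offsets = new HashMap<>();
        if (Files.exists(file)) {
            for (final String line : Files.readAllLines(file)) {
                final String[] parts = line.split(" ");
                offsets.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return offsets;
    }
}
{code}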
There seem to be two possible approaches to consider:
- Simply do not remove checkpoint files on restoring. This way, a kill -9 would
only require re-restoring the data written to the source topics since the last
graceful shutdown.
- Continually update the checkpoint files (perhaps on commit; see the sketch
after the KIP link below) -- this would result in the least restart
overhead/latency, but the additional complexity may not be worth it.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-116%3A+Add+State+Store+Checkpoint+Interval+Configuration
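A minimal sketch of the second option, assuming the commit path knows the
flushed offset per changelog partition (all names here are hypothetical):
rewrite the checkpoint on every commit via a temp file plus atomic rename, so a
kill -9 at worst loses the offsets of the current commit interval and never
leaves a truncated checkpoint behind.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.stream.Collectors;

public final class CommitTimeCheckpointer {

    private CommitTimeCheckpointer() { }

    // Called from the commit path with the offsets known to be flushed to
    // RocksDB. Writing to a temp file and then renaming means a crash
    // mid-write leaves the previous checkpoint intact rather than a
    // truncated file.
    public static void checkpointOnCommit(final Path checkpointFile,
                                          final Map<String, Long> flushedOffsets)
            throws IOException {
        final Path tmp = checkpointFile.resolveSibling(
            checkpointFile.getFileName() + ".tmp");
        final String contents = flushedOffsets.entrySet().stream()
            .map(e -> e.getKey() + " " + e.getValue())
            .collect(Collectors.joining("\n"));
        Files.write(tmp, contents.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE is a single rename on typical POSIX filesystems.
        Files.move(tmp, checkpointFile,
                   StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}
The per-commit file write is the overhead in question; KIP-116 (linked above)
proposes making the checkpoint interval configurable so that cost can be tuned.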
> RocksDB checkpoint files lost on kill -9
> ----------------------------------------
>
> Key: KAFKA-4317
> URL: https://issues.apache.org/jira/browse/KAFKA-4317
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Affects Versions: 0.10.0.1
> Reporter: Greg Fodor
> Assignee: Damian Guy
> Priority: Critical
> Labels: architecture, needs-kip, user-experience
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)