[ 
https://issues.apache.org/jira/browse/IGNITE-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Polovtsev updated IGNITE-24811:
-----------------------------------------
    Description: 
Inside the colocation track, the following problem exists: 

When a new table processor is added to the ZonePartitionRaftListener, its 
storage gets initialized with some information, such as the last applied index 
and the Raft group configuration. However, a node can die or be restarted 
before this information is flushed to persistent storage, which means that on 
the next startup this storage will return 0 as its last applied index. Since 
on startup we use the minimum last applied index across all storages during 
Raft recovery, this value will also be 0, and JRaft will think that it needs to 
replay the log from the very beginning, even though the 0 came from the storage 
of an empty table, whose applied index shouldn't be taken into account at all. 
An even bigger problem is that the log might have been truncated, so it cannot 
be replayed from index 0 and the node won't be able to start at all.
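
To make the failure mode concrete, here is a minimal sketch of the 
"minimum across all storages" rule. The class and method names are 
hypothetical and are not taken from the Ignite code base; the index values are 
made up for illustration.

```java
import java.util.List;

final class MinAppliedIndexDemo {
    /**
     * Hypothetical helper mirroring the rule described above: the recovery
     * index is the minimum last applied index across all table storages.
     */
    static long recoveryIndex(List<Long> lastAppliedIndexes) {
        return lastAppliedIndexes.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);
    }

    public static void main(String[] args) {
        // Two tables that applied the log up to 120 and 118, plus a freshly
        // added table whose initial state was never flushed before the restart.
        List<Long> indexes = List.of(120L, 118L, 0L);

        // Prints 0: the empty storage alone forces a replay from the start.
        System.out.println(recoveryIndex(indexes));
    }
}
```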

As a solution, the following algorithm is proposed:

# When a Raft snapshot is taken, save the current set of table IDs inside the 
TX state storage. This gives us the set of table IDs that participated in the 
most recent snapshot of this partition;
# During recovery, for every table partition storage, check the following:
## If this storage contains an applied index (i.e. is not empty), use the 
current recovery mechanism of choosing the minimum applied index across all 
storages;
## If this storage is empty and *is* present in the set of table IDs from the 
TX state storage, then this storage must have participated in the snapshot but 
has somehow lost all of its persistent data. In this case, tell JRaft to start 
recovery from the very beginning of the log: this either succeeds, if the Raft 
log is still available starting from index 0, or throws an error, if the log 
has been truncated;
## If this storage is empty and *is not* present in the set of table IDs from 
the TX state storage, then this storage is guaranteed to have received no 
writes before the most recent snapshot, and we can start recovery from the 
position saved in that snapshot.
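
The three recovery cases above could be sketched roughly as follows. This is 
only an illustration under stated assumptions, not the actual implementation: 
the class, method, and parameter names are hypothetical, an "empty" storage is 
modelled as one whose last applied index is 0, and combining the per-storage 
results by taking their minimum is an assumption based on the existing 
recovery mechanism.

```java
import java.util.Map;
import java.util.Set;

final class ZoneRecoverySketch {
    /**
     * Hypothetical helper: picks the log index from which recovery starts.
     *
     * @param lastAppliedIndexByTableId last applied index of every table
     *        partition storage (0 means the storage is empty)
     * @param snapshotTableIds table IDs saved with the most recent snapshot
     * @param snapshotIndex log position saved in the most recent snapshot
     */
    static long recoveryStartIndex(
            Map<Integer, Long> lastAppliedIndexByTableId,
            Set<Integer> snapshotTableIds,
            long snapshotIndex
    ) {
        long startIndex = Long.MAX_VALUE;

        for (Map.Entry<Integer, Long> entry : lastAppliedIndexByTableId.entrySet()) {
            long applied = entry.getValue();
            long candidate;

            if (applied > 0) {
                // Case 1: non-empty storage, use its own applied index.
                candidate = applied;
            } else if (snapshotTableIds.contains(entry.getKey())) {
                // Case 2: took part in the snapshot but lost its data, so
                // replay from the very beginning (this fails later if the
                // log has been truncated).
                return 0L;
            } else {
                // Case 3: no writes before the snapshot, so the snapshot
                // position is a safe starting point for this storage.
                candidate = snapshotIndex;
            }

            startIndex = Math.min(startIndex, candidate);
        }

        // Assumption: with no table storages, fall back to the snapshot position.
        return lastAppliedIndexByTableId.isEmpty() ? snapshotIndex : startIndex;
    }

    public static void main(String[] args) {
        Set<Integer> snapshotTableIds = Set.of(1, 2);

        // Table 3 was added after the snapshot at index 100 and never flushed.
        System.out.println(recoveryStartIndex(
                Map.of(1, 120L, 2, 118L, 3, 0L), snapshotTableIds, 100L)); // 100

        // Table 2 took part in the snapshot but is empty now.
        System.out.println(recoveryStartIndex(
                Map.of(1, 120L, 2, 0L), snapshotTableIds, 100L)); // 0
    }
}
```

In the first call the empty storage of table 3 no longer drags the start index 
down to 0; in the second call the replay from 0 is intentional, because data 
that was part of a snapshot has been lost.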


  was:Inside the colocation track, the following problem exists: 


> Handle the case when a table storage is empty on Zone Raft group recovery
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-24811
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24811
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Aleksandr Polovtsev
>            Assignee: Aleksandr Polovtsev
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
