[ 
https://issues.apache.org/jira/browse/IGNITE-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-25079:
-----------------------------------
    Description: 
h3. Scenario
 # Begin a "long" explicit transaction that spans several partitions.
 # Insert a few entries.
 # For some reason, nodes in the cluster initiate a flush (checkpoint) on the table. This might happen in a real environment.
 # Insert more entries in the same transaction.
 # Commit.
 # Wait for the flush to complete, then stop the cluster.
 # Start the cluster.
 # Wait for about a minute. In my test, I waited long enough for two TX state vacuum cycles to complete.
 # Read the data.

h3. Expected result

You see the data of the entire transaction.
h3. Actual result

Data inserted before the checkpoint suddenly disappeared.
h3. Test

The following test should be inserted into {{{}ItInternalTableTest{}}}. It should be greatly improved before committing, because it is long (1m+) and ugly.
{code:java}
@Test
public void testIgnite25079() throws Exception {
    IgniteImpl node = node();

    KeyValueView<Tuple, Tuple> keyValueView = table.keyValueView();

    node.transactions().runInTransaction(tx -> {
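        // First half of the transaction: these updates are covered by the
        // checkpoint below and are not replayed from the RAFT log on restart.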
        for (int i = 0; i < 15; i++) {
            putValue(keyValueView, i, tx);
        }

        // Force a flush (checkpoint) of partition 0 in the middle of the transaction.
        CompletableFuture<Void> flushFuture = unwrapTableViewInternal(table).internalTable()
                .storage().getMvPartition(0).flush(true);
        assertThat(flushFuture, willCompleteSuccessfully());

        // Second half of the transaction: these updates will be replayed from the RAFT log on restart.
        for (int i = 15; i < 30; i++) {
            putValue(keyValueView, i, tx);
        }
    });

    // Restart the node. Local recovery replays the RAFT log from the middle of the
    // transaction, so pendingRows only picks up the second half of the updates.
    CLUSTER.stopNode(0);
    node = unwrapIgniteImpl(CLUSTER.startNode(0));
    table = node.tables().table(TABLE_NAME);

    // Wait long enough for two TX state vacuum cycles to complete; after that the
    // transaction state is gone and unresolved write intents are treated as ABORTED.
    Thread.sleep(61_000);

    // RO read of all 30 keys; every row of the committed transaction should be visible.
    InternalTable internalTable = unwrapTableViewInternal(table).internalTable();
    CompletableFuture<List<BinaryRow>> getAllFuture = internalTable.getAll(
            LongStream.range(0, 30)
                    .mapToObj(ItInternalTableTest::createKeyRow)
                    .collect(Collectors.toList()),
            node.clock().now(),
            node.node()
    );

    assertThat(getAllFuture, willCompleteSuccessfully());
    List<BinaryRow> res = getAllFuture.get();
    assertEquals(30, res.size());
    assertEquals(30, res.stream().filter(Objects::nonNull).count());
} {code}
h3. Why this happens

{{StorageUpdateHandler#pendingRows}} is to blame. When we run a cleanup process on a transaction, we read a list of row IDs from this field, assuming it contains everything we need. In the provided test, local RAFT log reapplication starts from the middle of the transaction. During that reapplication, we meet 15 inserted records, put them into {{{}pendingRows{}}}, and then execute a cleanup request with only that information on hand.

In other words, the cleanup command resolves only 15 write intents out of 30.
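
A minimal sketch of the mechanism, assuming {{pendingRows}} is a plain in-memory map keyed by transaction ID (all names other than {{pendingRows}} are made up for illustration; this is not the real {{StorageUpdateHandler}} code):
{code:java}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the pendingRows bookkeeping, for illustration only.
class PendingRowsSketch {
    // In-memory only: txId -> row IDs touched by that transaction. After a restart
    // it is repopulated solely from the commands replayed from the RAFT log,
    // i.e. only from the entries written after the last checkpoint.
    private final Map<UUID, Set<Long>> pendingRows = new ConcurrentHashMap<>();

    // Invoked for every applied update command (row IDs modeled as plain longs).
    void handleUpdate(UUID txId, long rowId) {
        pendingRows.computeIfAbsent(txId, id -> new HashSet<>()).add(rowId);
    }

    // Invoked for the transaction cleanup command.
    void handleCleanup(UUID txId, boolean commit) {
        // Only the rows recorded in memory get their write intents resolved.
        // Rows whose updates were already covered by the checkpoint are not
        // replayed on recovery, so they are missing here and stay unresolved.
        Set<Long> rows = pendingRows.remove(txId);
        if (rows == null) {
            return;
        }
        for (long rowId : rows) {
            resolveWriteIntent(rowId, commit);
        }
    }

    private void resolveWriteIntent(long rowId, boolean commit) {
        // commit == true: convert the write intent into a committed version;
        // commit == false: remove the write intent.
    }
}
{code}
In the test above, only the 15 update commands replayed from the RAFT log repopulate this map on recovery, so the subsequent cleanup never touches the other 15 write intents.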

If we wait long enough, the TX state storage will delete the state of our transaction: all cleanup commands have been replicated, so there is no reason not to. However, 15 write intents remain unresolved, and from that point on their state will be determined as ABORTED.

ABORTED write intents are rolled back when they are encountered, which is why the test can only read 15 out of 30 records. This is clearly data loss.

It might also introduce {*}data inconsistency between replicas{*}. The reason is that checkpoints happen at different moments in time on different nodes, so the {{pendingRows}} field will have different contents on different nodes during their local recovery phase, and as a result some RO queries will yield different results depending on where they are run.

> Partial data loss after cluster restart
> ---------------------------------------
>
>                 Key: IGNITE-25079
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25079
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Priority: Blocker
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
