Denis Chudov created IGNITE-25418:
-------------------------------------

             Summary: Long write of 1000-rows batches into storage
                 Key: IGNITE-25418
                 URL: https://issues.apache.org/jira/browse/IGNITE-25418
             Project: Ignite
          Issue Type: Bug
            Reporter: Denis Chudov


While investigating "Send with retry timeout" exceptions occurring under load 
from the data streamer, I logged the duration of this call in 
PartitionListener#handleUpdateAllCommand:
{code:java}
storageUpdateHandler.handleUpdateAll(
        txId,
        cmd.rowsToUpdate(),
        cmd.commitPartitionId().asReplicationGroupId(),
        !cmd.full(),
        () -> storage.lastApplied(commandIndex, commandTerm),
        cmd.full() ? safeTimestamp : null,
        indexIdsAtRwTxBeginTs(catalogService, txId, storage.tableId())
); {code}
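A minimal sketch of how such a measurement can look (illustrative only, not the actual patch; the `WriteTimer` and `timeMicros` names are hypothetical). It wraps the storage write in `System.nanoTime()` and reports microseconds, matching the `storageWriteMcs` unit in the log line below:

```java
import java.util.concurrent.TimeUnit;

public class WriteTimer {
    // Runs the given write and returns its duration in microseconds.
    static long timeMicros(Runnable write) {
        long start = System.nanoTime();
        write.run();
        return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // In the real code the Runnable would be the
        // storageUpdateHandler.handleUpdateAll(...) call shown above.
        long mcs = timeMicros(() -> { /* storage write goes here */ });
        System.out.println("storageWriteMcs=" + mcs);
    }
}
```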
At times, the call took several hundred milliseconds:
{code:java}
[2025-05-19T11:33:22,849][INFO 
][%istod_sdt_3345%JRaft-FSMCaller-Disruptor_stripe_3-0][PartitionListener] qqq 
groupId=19_part_9, storageWriteMcs=341377, hash=-1115824889 {code}
In this line, storageWriteMcs is in microseconds, i.e. the write took about 341 ms.

Also, I see that the delta files' compaction time is constantly growing (compare 
the two pairs of lines below: the page counts are almost the same, but the 
compaction time grew dramatically, as did the checkpoint fsync time):
{code:java}
2025-03-26 14:46:23:999 +0000 [INFO][%node1%checkpoint-thread][Checkpointer] Checkpoint finished [checkpointId=67f35966-b0fd-4377-8e5b-7a3af7d73aa2, pages=140807, pagesWriteTime=1132ms, fsyncTime=2200ms, replicatorLogSyncTime=126ms, totalTime=3591ms, avgWriteSpeed=613MB/s]
2025-03-26 14:46:26:609 +0000 [INFO][%node1%compaction-thread][Compactor] Compaction round finished [compactionId=8fc11431-783f-4111-8e5b-017870110cf5, pages=140807, duration=2610ms, avgWriteSpeed=843MB/s]


2025-03-26 15:11:02:160 +0000 [INFO][%node1%checkpoint-thread][Checkpointer] Checkpoint finished [checkpointId=2b4d735f-1b09-471a-a47e-3d5c5c8c7f5d, pages=165589, pagesWriteTime=1476ms, fsyncTime=6027ms, replicatorLogSyncTime=230ms, totalTime=7798ms, avgWriteSpeed=332MB/s]
2025-03-26 15:11:07:276 +0000 [INFO][%node1%compaction-thread][Compactor] Compaction round finished [compactionId=6984cff7-e1e1-4ec7-ba5a-5e5d7037898a, pages=170116, duration=10213ms, avgWriteSpeed=260MB/s] {code}
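As a sanity check on these figures: the reported avgWriteSpeed values are consistent with speed = pages × 16 KiB / totalTime (the 16 KiB page size is an assumption inferred from the numbers themselves, not taken from the configuration). The sketch below recomputes all four values and makes the degradation between the two pairs explicit:

```java
public class CheckpointSpeed {
    // Recomputes avgWriteSpeed in MB/s from a page count and a duration in
    // seconds, assuming a 16 KiB page size (an assumption, see lead-in).
    static long mbPerSec(long pages, double seconds) {
        double mb = pages * 16384.0 / (1024 * 1024);
        return Math.round(mb / seconds);
    }

    public static void main(String[] args) {
        // First pair: healthy speeds.
        System.out.println(mbPerSec(140807, 3.591));  // checkpoint: 613 MB/s
        System.out.println(mbPerSec(140807, 2.610));  // compaction: 843 MB/s
        // Second pair, ~25 minutes later: roughly the same page count,
        // but both checkpoint and compaction are several times slower.
        System.out.println(mbPerSec(165589, 7.798));  // checkpoint: 332 MB/s
        System.out.println(mbPerSec(170116, 10.213)); // compaction: 260 MB/s
    }
}
```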
I have also seen plenty of lines like this:
{code:java}
[2025-05-19T11:32:41,268][INFO ][checkpoint-runner-io0][CheckpointPagesWriter] 
Checkpoint pages were not written yet due to unsuccessful page write lock 
acquisition and will be retried [pageCount=1] {code}
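That message suggests the checkpoint writer try-locks each page and re-queues the pages whose write lock is held by a concurrent updater. A rough illustration of that pattern (hypothetical code, not Ignite's actual CheckpointPagesWriter):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RetryingPageWriter {
    static class Page {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    }

    // Writes every page whose write lock can be acquired immediately;
    // returns the number of pages that were re-queued for a retry pass.
    static int writePages(Queue<Page> pages) {
        Queue<Page> retry = new ArrayDeque<>();
        while (!pages.isEmpty()) {
            Page p = pages.poll();
            if (p.lock.writeLock().tryLock()) {
                try {
                    // write the page to the checkpoint file here
                } finally {
                    p.lock.writeLock().unlock();
                }
            } else {
                retry.add(p); // "will be retried"
            }
        }
        pages.addAll(retry);
        return retry.size();
    }

    public static void main(String[] args) {
        Queue<Page> pages = new ArrayDeque<>();
        Page busy = new Page();
        busy.lock.readLock().lock(); // simulate a concurrent reader holding the page
        pages.add(new Page());
        pages.add(busy);
        System.out.println("pageCount=" + writePages(pages)); // pageCount=1
    }
}
```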
*Definition of done*

The reason for the long write time and for the degradation of compaction time 
should be investigated and fixed, where applicable.

*Notes*

Reproducer is attached.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)