[ 
https://issues.apache.org/jira/browse/IGNITE-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-25418:
----------------------------------
    Attachment: ItStreamTonsOfData.java

> Long write of 1000-rows batches into storage
> --------------------------------------------
>
>                 Key: IGNITE-25418
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25418
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: ItStreamTonsOfData.java
>
>
> While investigating "Send with retry timeout" exceptions occurring under data 
> streamer load, I logged the duration of the following call in 
> PartitionListener#handleUpdateAllCommand:
>  
> {code:java}
> storageUpdateHandler.handleUpdateAll(
>         txId,
>         cmd.rowsToUpdate(),
>         cmd.commitPartitionId().asReplicationGroupId(),
>         !cmd.full(),
>         () -> storage.lastApplied(commandIndex, commandTerm),
>         cmd.full() ? safeTimestamp : null,
>         indexIdsAtRwTxBeginTs(catalogService, txId, storage.tableId())
> ); {code}
> At times, the call took hundreds of milliseconds:
>  
> {code:java}
> [2025-05-19T11:33:22,849][INFO ][%istod_sdt_3345%JRaft-FSMCaller-Disruptor_stripe_3-0][PartitionListener] qqq groupId=19_part_9, storageWriteMcs=341377, hash=-1115824889
> {code}
> In this line, storageWriteMcs is given in microseconds, i.e. this single write took about 341 ms.
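> The duration was captured with ad-hoc instrumentation around the call shown 
> above. A minimal sketch of what such a measurement could look like (LOG, 
> groupId and commandHash are illustrative placeholders, not the actual 
> instrumentation used):
> {code:java}
> // Illustrative sketch: time the handleUpdateAll(...) call shown above.
> long startNanos = System.nanoTime();
> 
> storageUpdateHandler.handleUpdateAll(
>         txId,
>         cmd.rowsToUpdate(),
>         cmd.commitPartitionId().asReplicationGroupId(),
>         !cmd.full(),
>         () -> storage.lastApplied(commandIndex, commandTerm),
>         cmd.full() ? safeTimestamp : null,
>         indexIdsAtRwTxBeginTs(catalogService, txId, storage.tableId())
> );
> 
> long storageWriteMcs = (System.nanoTime() - startNanos) / 1_000;
> 
> // Produces lines like the one above: partition group id, write duration in
> // microseconds, and a hash identifying the batch.
> LOG.info("qqq groupId={}, storageWriteMcs={}, hash={}", groupId, storageWriteMcs, commandHash);
> {code}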
> Also, I see that the compaction time of delta files is constantly growing 
> (compare the two pairs of lines below: the page counts are comparable, but the 
> compaction time grew dramatically, as did the checkpoint fsync time):
> {code:java}
> 2025-03-26 14:46:23:999 +0000 [INFO][%node1%checkpoint-thread][Checkpointer] Checkpoint finished [checkpointId=67f35966-b0fd-4377-8e5b-7a3af7d73aa2, pages=140807, pagesWriteTime=1132ms, fsyncTime=2200ms, replicatorLogSyncTime=126ms, totalTime=3591ms, avgWriteSpeed=613MB/s]
> 2025-03-26 14:46:26:609 +0000 [INFO][%node1%compaction-thread][Compactor] Compaction round finished [compactionId=8fc11431-783f-4111-8e5b-017870110cf5, pages=140807, duration=2610ms, avgWriteSpeed=843MB/s]
> 2025-03-26 15:11:02:160 +0000 [INFO][%node1%checkpoint-thread][Checkpointer] Checkpoint finished [checkpointId=2b4d735f-1b09-471a-a47e-3d5c5c8c7f5d, pages=165589, pagesWriteTime=1476ms, fsyncTime=6027ms, replicatorLogSyncTime=230ms, totalTime=7798ms, avgWriteSpeed=332MB/s]
> 2025-03-26 15:11:07:276 +0000 [INFO][%node1%compaction-thread][Compactor] Compaction round finished [compactionId=6984cff7-e1e1-4ec7-ba5a-5e5d7037898a, pages=170116, duration=10213ms, avgWriteSpeed=260MB/s]
> {code}
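> For reference, the avgWriteSpeed values are consistent with a 16 KiB page 
> size (an assumption here), and the degradation is easy to quantify: the page 
> count grew by only ~21% between the two rounds, while the compaction duration 
> grew ~3.9x and the checkpoint fsync time ~2.7x. A quick check of the numbers:
> {code:java}
> // Rough check of the figures in the log above; 16 KiB per page is an assumption.
> double pageMib = 16 * 1024 / (1024.0 * 1024.0);  // 0.015625 MiB per page
> 
> System.out.println(140_807 * pageMib / 2.610);   // ~843 -> matches avgWriteSpeed=843MB/s
> System.out.println(170_116 * pageMib / 10.213);  // ~260 -> matches avgWriteSpeed=260MB/s
> 
> System.out.println(170_116.0 / 140_807);         // ~1.21 -> ~21% more pages
> System.out.println(10_213.0 / 2_610);            // ~3.9  -> compaction duration growth
> System.out.println(6_027.0 / 2_200);             // ~2.7  -> checkpoint fsync growth
> {code}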
> I have also seen plenty of lines like this:
> {code:java}
> [2025-05-19T11:32:41,268][INFO ][checkpoint-runner-io0][CheckpointPagesWriter] Checkpoint pages were not written yet due to unsuccessful page write lock acquisition and will be retried [pageCount=1]
> {code}
> *Definition of done*
> The reason for the long write time and the degradation of compaction time should 
> be investigated and fixed, if applicable.
> *Notes*
> A reproducer is attached; put it into ignite-runner integrationTests.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
