[jira] [Updated] (HDDS-12979) Ratis Write Pipeline Granularity Exploration

Ivan Andika (Jira) Tue, 06 May 2025 06:02:20 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ivan Andika updated HDDS-12979:
-------------------------------
    Description: 
Currently, a RATIS/THREE write pipeline consists of a single Raft group of 
three datanodes.

In previous write tests, we found that the sequential nature of the Raft 
consensus algorithm, rather than the I/O of the datanode volumes, is a write 
bottleneck, even for writes that are unrelated to each other (e.g. writes to 
different blocks of the same container might interfere with each other). When 
we increased the number of pipelines per datanode 
(ozone.scm.datanode.pipeline.limit), we saw a significant increase in overall 
write throughput.

We can consider making the write pipeline finer-grained. For example, one 
write pipeline could consist of multiple Raft groups over mutually exclusive 
datanode volumes. This way, writes can be parallelized across volumes and the 
overall throughput can be increased. Additionally, we can ensure volume 
isolation: once a volume is chosen for an active write pipeline, it will not 
be chosen for another write pipeline. Therefore, writes in one write pipeline 
will not interfere with writes in another.
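
The multi-group layout described above can be sketched as follows. This is a 
minimal illustration, not Ozone or Ratis code: the class VolumeGroupSketch, 
the method partition, and the datanode and volume names are all hypothetical. 
It assumes each Raft group takes exactly one volume from every datanode in the 
pipeline, so that no volume is shared between groups.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Ozone or Ratis.
final class VolumeGroupSketch {

  /**
   * Builds one Raft group per volume index. Group g takes volume g from every
   * datanode, so no volume appears in more than one group. Each member is
   * rendered as "datanode:volume". Volumes beyond the smallest per-datanode
   * count are left unused.
   */
  static List<List<String>> partition(List<String> datanodes,
                                      List<List<String>> volumesPerDatanode) {
    if (datanodes.size() != volumesPerDatanode.size()) {
      throw new IllegalArgumentException("need one volume list per datanode");
    }
    // The number of groups is limited by the datanode with the fewest volumes.
    int groupCount = datanodes.isEmpty() ? 0 : Integer.MAX_VALUE;
    for (List<String> volumes : volumesPerDatanode) {
      groupCount = Math.min(groupCount, volumes.size());
    }
    List<List<String>> groups = new ArrayList<>();
    for (int g = 0; g < groupCount; g++) {
      List<String> members = new ArrayList<>();
      for (int d = 0; d < datanodes.size(); d++) {
        members.add(datanodes.get(d) + ":" + volumesPerDatanode.get(d).get(g));
      }
      groups.add(members);
    }
    return groups;
  }

  public static void main(String[] args) {
    // Illustrative input: three datanodes with two volumes each.
    List<String> datanodes = List.of("dn1", "dn2", "dn3");
    List<List<String>> volumes = List.of(
        List.of("/data1", "/data2"),
        List.of("/data1", "/data2"),
        List.of("/data1", "/data2"));
    List<List<String>> groups = partition(datanodes, volumes);
    for (int g = 0; g < groups.size(); g++) {
      System.out.println("raft-group-" + g + " -> " + groups.get(g));
    }
  }
}
```

With this layout, writes routed to different groups go through independent 
Raft logs and land on different volumes. How a real pipeline would choose and 
reserve volumes is left open here.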

We could go even more granular and make each open container its own Raft group.

Increasing the number of Raft groups definitely has a cost, since each Raft 
group carries some associated overhead (e.g. direct buffer allocation for 
"raft.server.log.write.buffer.size"). We need to test whether the performance 
improvement is worth the overhead and the overall complexity.
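
To get a rough feel for the buffer overhead mentioned above, the arithmetic 
can be sketched as below. This is a back-of-the-envelope estimate under stated 
assumptions, not a measurement: it assumes one direct write buffer of 
"raft.server.log.write.buffer.size" bytes per Raft group, ignores every other 
per-group cost, and uses a hypothetical 8 MiB buffer size that is not claimed 
to be any default. The class and method names are hypothetical as well.

```java
// Hypothetical sketch: estimates only the log write buffer cost per datanode.
final class RaftGroupOverheadSketch {

  /**
   * Returns groups * bufferSize, assuming one direct write buffer per Raft
   * group. Throws ArithmeticException if the product overflows a long.
   */
  static long estimateWriteBufferBytes(int raftGroupsPerDatanode,
                                       long writeBufferSizeBytes) {
    if (raftGroupsPerDatanode < 0 || writeBufferSizeBytes < 0) {
      throw new IllegalArgumentException("arguments must be non-negative");
    }
    return Math.multiplyExact((long) raftGroupsPerDatanode, writeBufferSizeBytes);
  }

  public static void main(String[] args) {
    // Assumed buffer size for illustration only; check the actual setting.
    long bufferSizeBytes = 8L * 1024 * 1024;
    // Illustrative group counts, e.g. one group per volume or per container.
    for (int groups : new int[] {1, 12, 100}) {
      long bytes = estimateWriteBufferBytes(groups, bufferSizeBytes);
      System.out.println(groups + " group(s) -> " + (bytes / (1024 * 1024)) + " MiB");
    }
  }
}
```

The estimate grows linearly with the number of groups, which is why a 
per-container Raft group would need more careful testing than a per-volume 
one.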
 

  was:
Currently, RATIS/THREE write pipeline consists of a single Raft group of three 
datanodes.

We found in the previous write tests that the sequential nature of Raft 
consensus algorithm is a write bottleneck (instead of the I/O of the datanode 
volumes) even for writes that are not related to each other (e.g. unrelated 
writes for different blocks for the same container might interfere with each 
other). When we increased the number of pipelines per datanode 
(ozone.scm.datanode.pipeline.limit), we saw quite a significant increase of the 
overall write throughput.

We can think on increase the granularity of the write pipeline. For example, we 
can have such one write pipeline consists of multiple Raft groups of mutually 
exclusive datanode volumes. This way, writes can be parallelized across volumes 
and the overall throughput can be increased. Additionally, we can also ensure 
volume isolation, that is, once a volume is chosen in an active write pipeline, 
it will not be chosen for another write pipeline. Therefore, there won't be 
issues where one write pipeline write interfere with another. 

We can go even more granular where each open container is a Raft group.

There are overhead when increasing the number of Ratis group since each Raft 
group carries some overhead (e.g. direct buffer allocation for 
"raft.server.log.write.buffer.size"). We need to do some testing whether the 
performance improvements is worth the overhead.


> Ratis Write Pipeline Granularity Exploration
> --------------------------------------------
>
>                 Key: HDDS-12979
>                 URL: https://issues.apache.org/jira/browse/HDDS-12979
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Currently, a RATIS/THREE write pipeline consists of a single Raft group of 
> three datanodes.
> In previous write tests, we found that the sequential nature of the Raft 
> consensus algorithm, rather than the I/O of the datanode volumes, is a write 
> bottleneck, even for writes that are unrelated to each other (e.g. writes to 
> different blocks of the same container might interfere with each other). 
> When we increased the number of pipelines per datanode 
> (ozone.scm.datanode.pipeline.limit), we saw a significant increase in 
> overall write throughput.
> We can consider making the write pipeline finer-grained. For example, one 
> write pipeline could consist of multiple Raft groups over mutually exclusive 
> datanode volumes. This way, writes can be parallelized across volumes and 
> the overall throughput can be increased. Additionally, we can ensure volume 
> isolation: once a volume is chosen for an active write pipeline, it will not 
> be chosen for another write pipeline. Therefore, writes in one write 
> pipeline will not interfere with writes in another.
> We could go even more granular and make each open container its own Raft 
> group.
> Increasing the number of Raft groups definitely has a cost, since each Raft 
> group carries some associated overhead (e.g. direct buffer allocation for 
> "raft.server.log.write.buffer.size"). We need to test whether the 
> performance improvement is worth the overhead and the overall complexity.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org
