[ https://issues.apache.org/jira/browse/HDDS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Andika updated HDDS-12979:
-------------------------------
    Description: 
Currently, a RATIS/THREE write pipeline consists of a single Raft group of three datanodes.

In previous write tests we found that the sequential nature of the Raft consensus algorithm, rather than the I/O of the datanode volumes, is a write bottleneck, even for writes that are unrelated to each other (e.g. writes for different containers in the same pipeline, or for different blocks in the same container, might interfere with each other). When we increased the number of pipelines per datanode (ozone.scm.datanode.pipeline.limit), we saw a significant increase in overall write throughput.

We can think about increasing the granularity of the Ratis write pipeline. For example, one write pipeline could consist of multiple Raft groups over mutually exclusive datanode volumes. This way, writes can be parallelized across volumes and overall throughput can increase. Additionally, we can ensure volume isolation: once a volume is chosen for an active write pipeline, it will not be chosen for another write pipeline. Therefore, there won't be issues where a write in one pipeline interferes with another.

We can go even more granular, where each open container is a Raft group and closing a container means closing the Ratis group as well.

To decide the correct level of granularity, we need to decide in which cases concurrent writes are acceptable (i.e. the consistency guarantees of writes). From what I see, since we have one file per block, technically the only thing that needs to be serialized is the order of WriteChunk and the final PutBlock.

There are definitely overheads when increasing the number of Ratis groups, such as the lifecycle of each Ratis group and the overheads associated with it (e.g. direct buffer allocation for "raft.server.log.write.buffer.size").

Also, we might need to increase the number of pipelines being tracked by SCM.
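As a rough illustration of the "multiple Raft groups over mutually exclusive datanode volumes" idea above, here is a minimal Python sketch. It is not Ozone or Ratis code: RaftGroup, plan_raft_groups, pick_group, the datanode names and the volume paths are all hypothetical, and routing containers by modulo is only a placeholder policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RaftGroup:
    """Hypothetical Raft group: one volume on each datanode of the pipeline."""
    group_id: int
    members: Tuple[Tuple[str, str], ...]  # (datanode, volume) pairs


def plan_raft_groups(volumes_by_datanode: Dict[str, List[str]]) -> List[RaftGroup]:
    """Pair the i-th volume of every datanode into Raft group i.

    Every volume is used by at most one group (volume isolation), so the
    number of groups is bounded by the datanode with the fewest volumes.
    """
    if len(volumes_by_datanode) != 3:
        raise ValueError("a RATIS/THREE pipeline needs exactly three datanodes")
    datanodes = sorted(volumes_by_datanode)
    group_count = min(len(volumes_by_datanode[dn]) for dn in datanodes)
    return [
        RaftGroup(i, tuple((dn, volumes_by_datanode[dn][i]) for dn in datanodes))
        for i in range(group_count)
    ]


def pick_group(groups: List[RaftGroup], container_id: int) -> RaftGroup:
    """Route a container to a group (placeholder policy: modulo)."""
    return groups[container_id % len(groups)]


# Assumed example layout: dn2 has a spare volume that stays unused.
pipeline = {
    "dn1": ["/data1", "/data2"],
    "dn2": ["/data1", "/data2", "/data3"],
    "dn3": ["/data1", "/data2"],
}
groups = plan_raft_groups(pipeline)
for group in groups:
    print(group.group_id, group.members)
```

In this sketch, containers routed to different groups never share a Raft log or a volume, which is the isolation property the proposal is after. The cost is one extra Raft group per volume set, which is where the lifecycle and buffer overheads mentioned above would show up.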
We need to do some testing to see whether the performance improvement is worth the overhead and overall complexity.

Related GitHub discussion: [https://github.com/apache/ozone/discussions/7505]

  was:
Currently, a RATIS/THREE write pipeline consists of a single Raft group of three datanodes.

In previous write tests we found that the sequential nature of the Raft consensus algorithm, rather than the I/O of the datanode volumes, is a write bottleneck, even for writes that are unrelated to each other (e.g. writes for different containers in the same pipeline, or for different blocks in the same container, might interfere with each other). When we increased the number of pipelines per datanode (ozone.scm.datanode.pipeline.limit), we saw a significant increase in overall write throughput.

We can think about increasing the granularity of the Ratis write pipeline. For example, one write pipeline could consist of multiple Raft groups over mutually exclusive datanode volumes. This way, writes can be parallelized across volumes and overall throughput can increase. Additionally, we can ensure volume isolation: once a volume is chosen for an active write pipeline, it will not be chosen for another write pipeline. Therefore, there won't be issues where a write in one pipeline interferes with another.

We can go even more granular, where each open container is a Raft group and closing a container means closing the Ratis group as well.

To decide the correct level of granularity, we need to decide in which cases concurrent writes are acceptable (i.e. the consistency guarantees of writes). From what I see, since we have one file per block, technically the only thing that needs to be serialized is the order of WriteChunk and the final PutBlock.

There are definitely overheads when increasing the number of Ratis groups, since each Raft group carries some associated overhead (e.g. direct buffer allocation for "raft.server.log.write.buffer.size").
Also, we might need to increase the number of pipelines being tracked by SCM.

We need to do some testing to see whether the performance improvement is worth the overhead and overall complexity.

Related GitHub discussion: [https://github.com/apache/ozone/discussions/7505]


> Increasing Ratis Write Pipeline Granularity
> -------------------------------------------
>
>                 Key: HDDS-12979
>                 URL: https://issues.apache.org/jira/browse/HDDS-12979
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Currently, a RATIS/THREE write pipeline consists of a single Raft group of
> three datanodes.
> In previous write tests we found that the sequential nature of the Raft
> consensus algorithm, rather than the I/O of the datanode volumes, is a write
> bottleneck, even for writes that are unrelated to each other (e.g. writes for
> different containers in the same pipeline, or for different blocks in the
> same container, might interfere with each other). When we increased the
> number of pipelines per datanode (ozone.scm.datanode.pipeline.limit), we saw
> a significant increase in overall write throughput.
> We can think about increasing the granularity of the Ratis write pipeline.
> For example, one write pipeline could consist of multiple Raft groups over
> mutually exclusive datanode volumes. This way, writes can be parallelized
> across volumes and overall throughput can increase.
> Additionally, we can ensure volume isolation: once a volume is chosen for an
> active write pipeline, it will not be chosen for another write pipeline.
> Therefore, there won't be issues where a write in one pipeline interferes
> with another.
> We can go even more granular, where each open container is a Raft group and
> closing a container means closing the Ratis group as well.
> To decide the correct level of granularity, we need to decide in which cases
> concurrent writes are acceptable (i.e. the consistency guarantees of
> writes). From what I see, since we have one file per block, technically the
> only thing that needs to be serialized is the order of WriteChunk and the
> final PutBlock.
> There are definitely overheads when increasing the number of Ratis groups,
> such as the lifecycle of each Ratis group and the overheads associated with
> it (e.g. direct buffer allocation for "raft.server.log.write.buffer.size").
> Also, we might need to increase the number of pipelines being tracked by SCM.
> We need to do some testing to see whether the performance improvement is
> worth the overhead and overall complexity.
> Related GitHub discussion: [https://github.com/apache/ozone/discussions/7505]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org