[ 
https://issues.apache.org/jira/browse/HDDS-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-12659:
-------------------------------
    Description: 
In the future, we could support a "one file per container" storage layout. 
Currently, we support FilePerBlock and FilePerChunk (deprecated).

The current FilePerBlock storage layout has the following benefits:
 * No write contention when writing blocks belonging to the same container
 ** However, for a Ratis pipeline, this is also guaranteed by the sequential 
nature of the Raft algorithm
 * A block file can be deleted as soon as the datanode receives the deletion 
command

However, the FilePerBlock layout is not well suited to handling many small 
files, since each block is stored in a separate file. This increases the inode 
tree size on the datanodes and causes memory issues when we need to check all 
the block files (e.g. the scanner, or computing volume size using "du").

An alternative storage layout is one file per container. This is 
implemented in some existing distributed object stores / file systems, such as 
SeaweedFS's volumes (similar to Facebook's Haystack).

This has the benefit of reducing the number of small files on the datanode. 
One container file can contain thousands of Ozone files. 

However, this also comes with some drawbacks:
 * Bookkeeping is required
 ** We need to keep track of the offset at which each block is stored in the 
container file
 * Deletion is not direct
 ** Deleting a block only marks that block as deleted
 ** A separate background task runs compaction, creating a new container file 
with the deleted blocks removed
 *** This can temporarily increase the datanode space usage, since a new file 
needs to be created
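
The bookkeeping, mark-as-deleted and compaction steps above can be sketched 
roughly as follows. This is only a minimal illustration, not a proposed Ozone 
implementation: the class ContainerFileSketch, its methods and the ".compact" 
temporary file name are all hypothetical, the index is kept in memory only, and 
concurrency, checksums and crash recovery are ignored.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one append-only file per container plus a block index. */
final class ContainerFileSketch {
    /** Where a block lives inside the container file (illustrative only). */
    static final class Extent {
        final long offset;
        final int length;
        final boolean deleted;
        Extent(long offset, int length, boolean deleted) {
            this.offset = offset;
            this.length = length;
            this.deleted = deleted;
        }
    }

    private final Path file;
    // Bookkeeping: block ID -> location of the block in the container file.
    private final Map<Long, Extent> index = new LinkedHashMap<>();

    ContainerFileSketch(Path file) {
        this.file = file;
    }

    /** Appends the block at the end of the file and records its offset. */
    void putBlock(long blockId, byte[] data) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long offset = raf.length();
            raf.seek(offset);
            raf.write(data);
            index.put(blockId, new Extent(offset, data.length, false));
        }
    }

    /** Returns the block bytes, or null if the block is unknown or deleted. */
    byte[] getBlock(long blockId) throws IOException {
        Extent e = index.get(blockId);
        if (e == null || e.deleted) {
            return null;
        }
        byte[] buf = new byte[e.length];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(e.offset);
            raf.readFully(buf);
        }
        return buf;
    }

    /** Deletion only marks the block; its bytes stay on disk until compaction. */
    void deleteBlock(long blockId) {
        index.computeIfPresent(blockId,
            (id, e) -> new Extent(e.offset, e.length, true));
    }

    /** Copies live blocks into a new file, then replaces the old file with it. */
    void compact() throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".compact");
        Map<Long, Extent> fresh = new LinkedHashMap<>();
        try (RandomAccessFile out = new RandomAccessFile(tmp.toFile(), "rw")) {
            out.setLength(0);
            for (Map.Entry<Long, Extent> en : index.entrySet()) {
                if (en.getValue().deleted) {
                    continue; // deleted blocks are dropped here
                }
                byte[] data = getBlock(en.getKey());
                fresh.put(en.getKey(),
                    new Extent(out.getFilePointer(), data.length, false));
                out.write(data);
            }
        }
        // Old and new files coexist until this point (extra space usage).
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        index.clear();
        index.putAll(fresh);
    }

    long fileSize() throws IOException {
        return Files.size(file);
    }
}
```

In this sketch, deleteBlock() does not shrink the container file; the space is 
only reclaimed by compact(), and until the final move both the old and the new 
file exist on disk, which is the temporary space increase mentioned above.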

We might also store the small files directly in RocksDB (e.g. using BlobDB: 
https://github.com/facebook/rocksdb/wiki/BlobDB).

This is a long-term wish, intended to kickstart discussion of the feasibility 
of this storage layout in Ozone in the future.

  was:
In the future, we could support a "one file per container" storage layout. 
Currently, we support FilePerBlock and FilePerChunk (deprecated).

The current FilePerBlock storage layout has the following benefits:
 * No write contention when writing blocks belonging to the same container
 ** However, for a Ratis pipeline, this is also guaranteed by the sequential 
nature of the Raft algorithm
 * A block file can be deleted as soon as the datanode receives the deletion 
command

However, the FilePerBlock layout is not well suited to handling many small 
files, since each block is stored in a separate file. This increases the inode 
tree size on the datanodes and causes memory issues when we need to check all 
the block files (e.g. the scanner, or computing volume size using "du").

An alternative storage layout is one file per container. This is 
implemented in some existing distributed object stores / file systems, such as 
SeaweedFS's volumes (similar to Facebook's Haystack).

This has the benefit of reducing the number of small files on the datanode. 
One container file can contain thousands of Ozone files. 

However, this also comes with some drawbacks:
 * Bookkeeping is required
 ** We need to keep track of the offset at which each block is stored in the 
container file
 * Deletion is not direct
 ** Deleting a block only marks that block as deleted
 ** A separate background task runs compaction, creating a new container file 
with the deleted blocks removed
 *** This can temporarily increase the datanode space usage, since a new file 
needs to be created

This is a wish; we can discuss the feasibility of this storage layout in Ozone 
in the future.


> One File per Container Storage Layout
> -------------------------------------
>
>                 Key: HDDS-12659
>                 URL: https://issues.apache.org/jira/browse/HDDS-12659
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major


