On Thu, May 31, 2018 at 1:06 PM, Gao Xiang <gaoxian...@huawei.com> wrote:
> Hi all,
>
> Read-only file systems are used in many cases, such as read-only storage media.
> We are now focusing on Android devices, on which several read-only partitions
> exist. Due to the limited choice of read-only solutions, a new read-only file
> system, EROFS (Extendable Read-Only File System), is introduced.

In which sense is it extendable?

> As in other read-only file systems, several metadata regions found in generic
> file systems, such as the free space bitmap, are omitted. The difference is that
> EROFS focuses more on performance than purely on saving as much storage space
> as possible.
>
> Furthermore, we also add compression support, called z_erofs.
>
> Traditional file systems with compression support use fixed-sized input
> compression, so the compressed output units can have arbitrary lengths.
> However, block devices are accessed in units of blocks, which means:
> (A) If the accessed compressed data is not buffered, some of the data read from
> the physical block cannot be used, as illustrated below:
>
>    ++-----------++-----------++         ++-----------++-----------++
> ...||           ||           ||   ...   ||           ||           ||  ... original data
>    ++-----------++-----------++         ++-----------++-----------++
>     \                         /          \                         /
>        \                   /                \                    /
>           \             /                      \               /
>       ++---|-------++--|--------++       ++-----|----++--------|--++
>       ||xxx|       ||  |xxxxxxxx||  ...  ||xxxxx|    ||        |xx||  compressed data
>       ++---|-------++--|--------++       ++-----|----++--------|--++
>
> The shaded regions (x) are read from the block device but cannot be used for
> decompression.
>
> (B) If the compressed data is also buffered, the memory overhead increases.
> Because the data is compressed, it cannot be used directly, and we don't know
> when the corresponding compressed blocks will be accessed, which is unfriendly
> to random reads.
>
> To reduce the proportion of data that cannot be directly decompressed, larger
> compressed sizes tend to be selected, which is also unfriendly to random reads.
>
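
To make the waste described in (A) concrete, here is a rough sketch of the
arithmetic. It is not taken from the EROFS patches; the 4096-byte block size and
the helper name read_cost are assumptions chosen only for illustration.

```python
BLOCK_SIZE = 4096  # assumed device block size in bytes


def read_cost(start: int, length: int, block_size: int = BLOCK_SIZE):
    """Return (bytes_read, bytes_wasted) for one compressed unit.

    `start` is the unit's byte offset on the device and `length` is its
    compressed length. The device can only be read in whole blocks.
    """
    first_block = start // block_size
    last_block = (start + length - 1) // block_size
    bytes_read = (last_block - first_block + 1) * block_size
    return bytes_read, bytes_read - length


# A 5000-byte compressed unit that starts 3000 bytes into a block
# forces two full blocks to be read.
print(read_cost(3000, 5000))  # (8192, 3192)
# A unit that exactly fills one aligned block wastes nothing.
print(read_cost(4096, 4096))  # (4096, 0)
```

The second case is the situation a fixed-size, block-aligned compressed unit
would always be in.
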
> EROFS implements compression with a different approach, the details of which
> will be discussed in the next section.
>
> In brief, the following points summarize our design at a high level:
>
> 1) Use page-sized blocks so that there are no buffer heads.
>
> 2) By introducing more general inline data / xattrs, metadata and small data
> can be read together with the inode metadata.
>
> 3) Introduce another shared xattr region in order to store the common xattrs
> (e.g. SELinux labels) or xattrs too large to be stored inline with the metadata.
>
> 4) Metadata and data can be mixed by design, which gives mkfs more flexibility
> in organizing files and data.
>
> 5) Instead of using fixed-sized input compression, we put forward a new fixed
> output compression to make full use of IO (which means all data from IO can be
> decompressed), reduce read amplification, improve random reads, and keep
> relatively low compression ratios, as illustrated below:
>
>
>         |---- variable-length extent ---|------ VLE ------|---  VLE ---|
>          /> clusterofs                  /> clusterofs     /> clusterofs /> clusterofs
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
> ...||   |       ||           ||         | ||           || |         || | ... original data
>    ++---|-------++-----------++---------|-++-----------++-|---------++-|
>    ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++
>         size         size         size         size         size
>          \                             /                 /            /
>           \                      /              /            /
>            \               /            /            /
>             ++-----------++-----------++-----------++
>         ... ||           ||           ||           || ... compressed clusters
>             ++-----------++-----------++-----------++
>             ++->cluster<-++->cluster<-++->cluster<-++
>                  size         size         size
>
>    A cluster could consist of more than one block by design, but currently we
> only have the page-sized cluster implementation (page-sized fixed output
> compression can also have a better compression ratio than fixed input
> compression).
>
>    All compressed clusters have a fixed size but can be decompressed into
> extents of arbitrary lengths.
>
>    In addition, if a buffered IO reads the following shaded region (x), we
>    could add a more customized path (to replace generic_file_buffered_read)
>    which reads only one compressed cluster and makes the partial page available.
>          /> clusterofs
>    ++---|-------++
> ...||   | xxxx  || ...
>    ||---|-------||
>
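
For readers trying to picture fixed output compression, the following is a
minimal sketch of the idea as I understand it from the description above, not
the z_erofs implementation. It uses zlib purely as a stand-in compressor, and
the names pack_fixed_output and read_at, the zero padding, and the binary
search are all assumptions made for illustration.

```python
import bisect
import zlib

CLUSTER_SIZE = 4096  # fixed compressed cluster size (one page), as in the text


def pack_fixed_output(data: bytes, cluster_size: int = CLUSTER_SIZE):
    """Cut `data` into variable-length extents, one fixed-size cluster each.

    Returns a list of (logical_offset, logical_length, cluster_bytes).
    """
    extents = []
    pos = 0
    while pos < len(data):
        # Binary-search the longest prefix whose zlib output fits the cluster.
        # Compressed size is only roughly monotonic in input length, so the
        # result always fits but is not guaranteed to be the longest possible.
        lo, hi = 1, len(data) - pos
        best_len, best_comp = 0, b""
        while lo <= hi:
            mid = (lo + hi) // 2
            comp = zlib.compress(data[pos:pos + mid])
            if len(comp) <= cluster_size:
                best_len, best_comp = mid, comp
                lo = mid + 1
            else:
                hi = mid - 1
        if best_len == 0:
            raise ValueError("cluster_size too small for even one byte")
        # Pad to the fixed cluster size; the decompressor ignores the padding.
        extents.append((pos, best_len, best_comp.ljust(cluster_size, b"\0")))
        pos += best_len
    return extents


def read_at(extents, offset: int, length: int) -> bytes:
    """Read bytes lying within a single extent by decompressing one cluster."""
    starts = [ext[0] for ext in extents]
    idx = bisect.bisect_right(starts, offset) - 1
    ext_off, _ext_len, cluster = extents[idx]
    # decompressobj() stops at the end of the zlib stream and leaves the
    # zero padding in unused_data.
    plain = zlib.decompressobj().decompress(cluster)
    rel = offset - ext_off
    return plain[rel:rel + length]
```

Every cluster produced this way is exactly CLUSTER_SIZE bytes, so a random read
that falls inside one extent needs exactly one block-aligned read, and
everything read from the device is input to the decompressor.
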
> Some numbers using fixed output compression (VLE, cluster size = block size =
> 4k) on the server and Android phone (kirin970 platform):
>
> Server (magnetic disk):
>
> compression  EROFS seq read  EXT4 seq read  EROFS random read  EXT4 random read
> ratio           bw[MB/s]       bw[MB/s]      bw[MB/s] (20%)     bw[MB/s] (20%)
>
>   4              480.3           502.5             69.8             11.1
>  10              472.3           503.3             56.4             10.0
>  15              457.6           495.3             47.0             10.9
>  26              401.5           511.2             34.7             11.1
>  35              389.1           512.5             28.0             11.0
>  48              375.4           496.5             23.2             10.6
>  53              370.2           512.0             21.8             11.0
>  66              349.2           512.0             19.0             11.4
>  76              310.5           497.3             17.3             11.6
>  85              301.2           512.0             16.0             11.0
>  94              292.7           496.5             14.6             11.1
> 100              538.9           512.0             11.4             10.8
>
> Kirin970 (A73 Big-core 2361MHz, A53 little-core 0MHz, DDR 1866MHz):

What storage was used? An eMMC?

> compression  EROFS seq read  EXT4 seq read  EROFS random read  EXT4 random read
> ratio           bw[MB/s]       bw[MB/s]      bw[MB/s] (20%)     bw[MB/s] (20%)
>
>   4              546.7           544.3            157.7             57.9
>  10              535.7           521.0            152.7             62.0
>  15              529.0           520.3            125.0             65.0
>  26              418.0           526.3             97.6             63.7
>  35              367.7           511.7             89.0             63.7
>  48              415.7           500.7             78.2             61.2
>  53              423.0           566.7             72.8             62.9
>  66              334.3           537.3             69.8             58.3
>  76              387.3           546.0             65.2             56.0
>  85              306.3           546.0             63.8             57.7
>  94              345.0           589.7             59.2             49.9
> 100              579.7           556.7             62.1             57.7

How does it compare to existing read only filesystems, such as squashfs?

-- 
Thanks,
//richard
