[ https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837743#comment-17837743 ]
ASF subversion and git services commented on KUDU-3371:
-------------------------------------------------------

Commit 61d437fcd3723b93c57b039d1fdbb7ed03fc8ae9 in kudu's branch refs/heads/master from Yingchun Lai
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=61d437fcd ]

KUDU-3371 Unify the 'rdb' directory name

Change-Id: I398bfd4a6a91d95f825362fc55c3b6c7dfe5c724
Reviewed-on: http://gerrit.cloudera.org:8080/21298
Reviewed-by: Yifan Zhang <chinazhangyi...@163.com>
Reviewed-by: KeDeng <kdeng...@gmail.com>
Tested-by: Yingchun Lai <laiyingc...@apache.org>

> Use RocksDB to store LBM metadata
> ---------------------------------
>
>                 Key: KUDU-3371
>                 URL: https://issues.apache.org/jira/browse/KUDU-3371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Yingchun Lai
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h1. Motivation
> The current LBM containers use separate .data and .metadata files. The
> .data file stores the real user data, and hole punching can be used to
> reduce its disk space. The metadata, in contrast, is written to a file as
> serialized protobuf strings, in append-only mode. Each protobuf object is
> a BlockRecordPB:
>
> {code:java}
> message BlockRecordPB {
>   required BlockIdPB block_id = 1;       // int64
>   required BlockRecordType op_type = 2;  // CREATE or DELETE
>   required uint64 timestamp_us = 3;
>   optional int64 offset = 4;             // Required for CREATE.
>   optional int64 length = 5;             // Required for CREATE.
> } {code}
> That means each record is of type either CREATE or DELETE. A deleted
> block therefore has two records in the metadata: one of type CREATE and
> one of type DELETE.
> The current LBM metadata storage mechanism has several weak points:
> h2. 1. Disk space amplification
> The ratio of live blocks in a metadata file can be very low. In the worst
> case only one block is alive (assuming the container has not yet reached
> the runtime compaction threshold) while the other thousands of blocks are
> dead (i.e. recorded as CREATE-DELETE pairs).
> So the disk space amplification can be very serious.
> h2. 2. Long bootstrap time
> During Kudu server bootstrap, all the metadata files have to be replayed
> to find the live blocks. In the worst case, thousands of block records
> are replayed only to find that very few blocks are alive.
> This wastes a lot of time in almost all cases: a Kudu cluster in a
> production environment typically runs for several months without a
> restart, so the LBM metadata can become very sparse.
> h2. 3. Metadata compaction
> To mitigate the issues above, the LBM has a metadata compaction
> mechanism, both at runtime and in the bootstrap stage.
> The runtime compaction locks the container and is synchronous.
> The bootstrap-stage compaction is synchronous too, and may make the
> bootstrap take longer.
> h1. Optimization by using RocksDB
> h2. Storage design
> * RocksDB instance: one RocksDB instance per data directory.
> * Key: <container_id>.<block_id>
> * Value: the same as before, i.e. the serialized protobuf string, stored
> only for CREATE entries.
> * Put/Delete: put the value into RocksDB when a block is created, and
> delete it from RocksDB when the block is deleted.
> * Scan: happens only in the bootstrap stage, to retrieve all blocks.
> * DeleteRange: happens only when a container is invalidated.
> h2. Advantages
> # Disk space amplification: there is still a disk space amplification
> problem, but RocksDB can be tuned to reach a balance. I believe that in
> most cases RocksDB does better than an append-only file.
> # Bootstrap time: since only valid blocks are left in RocksDB, bootstrap
> may be much faster than before.
> # Metadata compaction: this work can be left to RocksDB, though of
> course tuning is needed.
> h2. Test & benchmark
> I have recently been working on using RocksDB to store LBM container
> metadata. Most of the work is finished, and I have run some benchmarks.
> The benchmarks show that the fs module's block read/write/delete
> performance is similar to or slightly worse than the old implementation,
> while the bootstrap time may be reduced several-fold.
> I'm not sure whether it is worth continuing this work, or whether anybody
> knows of any previous discussion on this topic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
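
Editorial sketch for the "Motivation" and "Storage design" sections of the quoted issue: a toy Python model, not Kudu or RocksDB code, contrasting a replay of the append-only metadata log with a key-value layout that keeps only live entries. Everything here is an assumption made for illustration: `BlockRecord` stands in for BlockRecordPB, `ToyMetadataStore` is a plain dict standing in for one per-data-directory RocksDB instance, keys are built as the string `<container_id>.<block_id>`, and container ids are assumed not to contain a `.`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

CREATE, DELETE = "CREATE", "DELETE"


@dataclass(frozen=True)
class BlockRecord:
    """Hypothetical in-memory stand-in for BlockRecordPB."""
    block_id: int
    op_type: str
    timestamp_us: int
    offset: Optional[int] = None  # set for CREATE records only
    length: Optional[int] = None  # set for CREATE records only


def replay_append_only(log: List[BlockRecord]) -> Dict[int, BlockRecord]:
    """Replay every record of an append-only log and return the live blocks."""
    live: Dict[int, BlockRecord] = {}
    for rec in log:
        if rec.op_type == CREATE:
            live[rec.block_id] = rec
        else:
            live.pop(rec.block_id, None)
    return live


class ToyMetadataStore:
    """Dict-backed stand-in for a key-value store; not the RocksDB API."""

    def __init__(self) -> None:
        self._kv: Dict[str, BlockRecord] = {}

    @staticmethod
    def _key(container_id: str, block_id: int) -> str:
        return f"{container_id}.{block_id}"

    def create_block(self, container_id: str, rec: BlockRecord) -> None:
        # "Put": only CREATE entries are ever stored.
        self._kv[self._key(container_id, rec.block_id)] = rec

    def delete_block(self, container_id: str, block_id: int) -> None:
        # "Delete": the entry disappears instead of gaining a DELETE record.
        self._kv.pop(self._key(container_id, block_id), None)

    def scan(self) -> List[BlockRecord]:
        # "Scan": used at bootstrap; returns live blocks only.
        return [self._kv[k] for k in sorted(self._kv)]

    def invalidate_container(self, container_id: str) -> None:
        # "DeleteRange": drop every key that belongs to one container.
        prefix = f"{container_id}."
        for key in [k for k in self._kv if k.startswith(prefix)]:
            del self._kv[key]


# Worst case from the issue: 1000 blocks created, all but one deleted.
log: List[BlockRecord] = []
store = ToyMetadataStore()
for block_id in range(1000):
    rec = BlockRecord(block_id, CREATE, timestamp_us=block_id,
                      offset=block_id * 4096, length=4096)
    log.append(rec)
    store.create_block("c1", rec)
for block_id in range(999):
    log.append(BlockRecord(block_id, DELETE, timestamp_us=1000 + block_id))
    store.delete_block("c1", block_id)

live = replay_append_only(log)
print(len(log), len(live), len(store.scan()))  # prints: 1999 1 1
```

In this toy model the append-only log holds 1999 records that must all be replayed to find one live block, while the key-value layout holds a single entry. It says nothing about RocksDB's own on-disk amplification or compaction behaviour, which the issue notes still need tuning.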