[ https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831132#comment-17831132 ]

ASF subversion and git services commented on KUDU-3371:
-------------------------------------------------------

Commit 4da8b20070a7c0070a1829dfd50fdc78cad88b6a in kudu's branch 
refs/heads/master from Yingchun Lai
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=4da8b2007 ]

KUDU-3371 [fs] Use RocksDB to store LBM metadata (2nd try)

The first try is: http://gerrit.cloudera.org:8080/18569
The second try mainly fixes the linkage error, on top of the first
try, by using snappy instead of lz4 when linking RocksDB.
The lz4 build and link issues themselves can be fixed in
follow-up patches.

Since LogBlockContainerNativeMeta stores block records
sequentially in the metadata file, the ratio of live blocks may
be very low, which can cause serious disk space amplification
and long bootstrap times.

This patch introduces a new class, LogBlockContainerRdbMeta,
which uses RocksDB to store LBM metadata. A new item is Put()
into RocksDB when a new block is created in the LBM, and the
item is Delete()d from RocksDB when the block is removed from
the LBM. The data in RocksDB is maintained by RocksDB itself,
i.e. deleted items are garbage collected, so there is no need
to rewrite the metadata as LogBlockContainerNativeMeta does.
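
As a rough illustration (not the actual patch code), the per-block
bookkeeping reduces to a RocksDB Put() on block creation and a Delete()
on block removal; the key layout follows item (a) below, and the helper
names are assumptions made for this sketch:

{code:cpp}
// Minimal sketch of the create/delete bookkeeping, assuming a shared
// rocksdb::DB* handle per data directory. Key layout follows the commit
// message: "<container_id>.<block_id>". Names are illustrative only.
#include <cstdint>
#include <string>
#include <rocksdb/db.h>

std::string RdbKey(const std::string& container_id, int64_t block_id) {
  return container_id + "." + std::to_string(block_id);
}

// Called when a new block is created in the LBM: store the serialized
// BlockRecordPB (CREATE record) under the composite key.
rocksdb::Status PutBlockRecord(rocksdb::DB* db,
                               const std::string& container_id,
                               int64_t block_id,
                               const std::string& serialized_record) {
  return db->Put(rocksdb::WriteOptions(),
                 RdbKey(container_id, block_id),
                 serialized_record);
}

// Called when the block is removed from the LBM: drop the key so RocksDB
// itself garbage-collects the record during compaction.
rocksdb::Status DeleteBlockRecord(rocksdb::DB* db,
                                  const std::string& container_id,
                                  int64_t block_id) {
  return db->Delete(rocksdb::WriteOptions(), RdbKey(container_id, block_id));
}
{code}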

The implementation also reuses most of the logic of the base class
LogBlockContainer; the main difference from LogBlockContainerNativeMeta
is that LogBlockContainerRdbMeta stores block record metadata in RocksDB
rather than in a native file. The main implementations of the interfaces
from the base class are as follows (a sketch of the key scheme and the
corresponding RocksDB operations is shown after this list):
a. Create a container
   The data file is created in the same way as in
   LogBlockContainerNativeMeta, but the metadata part is stored in
   RocksDB with keys constructed as "<container_id>.<block_id>", and
   the values are the same as the records stored in the metadata file
   of LogBlockContainerNativeMeta.
b. Open a container
   Similar to LogBlockContainerNativeMeta, but there is no need to
   check the metadata part, because it has already been checked when
   loading containers during the bootstrap phase.
c. Destroy a container
   If the container is dead (full and with no live blocks), remove
   the data file and clean up the metadata part by deleting all
   the keys prefixed by "<container_id>".
d. Load a container (by ProcessRecords())
   Iterate over RocksDB in the key range
   [<container_id>, <next_container_id>). Because dead blocks
   have been deleted directly, only live block records are populated,
   and we can use them just as LogBlockContainerNativeMeta does.
e. Create blocks in a container
   Put() serialized BlockRecordPB records into RocksDB; the keys are
   constructed as described above.
f. Remove blocks from a container
   Construct the keys as described above and Delete() them from
   RocksDB in a batch.
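
The following is a rough sketch (not the patch itself) of how (c), (d),
and (f) could map onto the RocksDB API, assuming a shared rocksdb::DB*
handle per data directory and string container ids; the helper names are
illustrative, and the prefix-based upper bound stands in for the
<next_container_id> bound described in (d):

{code:cpp}
#include <memory>
#include <string>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// (d) Load a container: iterate the key range ["<container_id>.",
// "<container_id>/"), where '/' is the character after '.', so the range
// covers exactly this container's keys. Only live blocks remain, since
// dead blocks were deleted directly.
void LoadContainer(rocksdb::DB* db, const std::string& container_id,
                   std::vector<std::string>* serialized_records) {
  const std::string start = container_id + ".";
  const std::string end = container_id + "/";
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(start); it->Valid() && it->key().compare(end) < 0; it->Next()) {
    serialized_records->push_back(it->value().ToString());
  }
}

// (c) Destroy a dead container: remove every key with the container prefix.
rocksdb::Status DestroyContainerMetadata(rocksdb::DB* db,
                                         const std::string& container_id) {
  return db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                         container_id + ".", container_id + "/");
}

// (f) Remove a batch of blocks: group the Delete()s into one atomic write.
rocksdb::Status DeleteBlocks(rocksdb::DB* db, const std::string& container_id,
                             const std::vector<int64_t>& block_ids) {
  rocksdb::WriteBatch batch;
  for (int64_t id : block_ids) {
    batch.Delete(container_id + "." + std::to_string(id));
  }
  return db->Write(rocksdb::WriteOptions(), &batch);
}
{code}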

This patch contains the following changes:
- Adds a new block manager type named 'logr', which uses RocksDB
  to store LBM metadata. The new block manager is enabled by setting
  --block_manager=logr.
- Adds a new parameterized value to the related tests to cover the
  "--block_manager=logr" case.

Using RocksDB is optional; the former LBM can be used as before.
More tools to convert data between the two implementations will
be introduced in the future.

The optimization is significant, as shown in JIRA KUDU-3371: the time
spent re-opening the tablet server's metadata when 99.99% of all the
records have been removed is reduced by about 9.5 times when using
LogBlockContainerRdbMeta instead of LogBlockContainerNativeMeta.

Change-Id: I23b7d2a16802af01a382a1d74cd9869baf364688
Reviewed-on: http://gerrit.cloudera.org:8080/21075
Tested-by: Yingchun Lai <laiyingc...@apache.org>
Reviewed-by: Alexey Serbin <ale...@apache.org>


> Use RocksDB to store LBM metadata
> ---------------------------------
>
>                 Key: KUDU-3371
>                 URL: https://issues.apache.org/jira/browse/KUDU-3371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Yingchun Lai
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h1. Motivation
> The current LBM container uses separate .data and .metadata files. The .data 
> file stores the real user data, and we can use hole punching to reduce its 
> disk space. The metadata, in contrast, is written as protobuf-serialized 
> strings to a file in append-only mode. Each protobuf object is a 
> BlockRecordPB:
>  
> {code:java}
> message BlockRecordPB {
>   required BlockIdPB block_id = 1;  // int64
>   required BlockRecordType op_type = 2;  // CREATE or DELETE
>   required uint64 timestamp_us = 3;
>   optional int64 offset = 4; // Required for CREATE.
>   optional int64 length = 5; // Required for CREATE.
> } {code}
> That means each object is of either CREATE or DELETE type. To mark a 'block' 
> as deleted, there will be 2 objects in the metadata: one of CREATE type and 
> the other of DELETE type.
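> As a simplified illustration of that append-only pattern (a hedged sketch, 
> assuming the generated C++ class for the BlockRecordPB message above and 
> that BlockIdPB carries a plain integer id; the real file adds framing and 
> checksums, and namespace qualifiers are omitted):
> {code:cpp}
> // Simplified model of the append-only metadata stream: a block's lifetime
> // leaves two records behind, CREATE followed by DELETE, and the CREATE
> // record is never rewritten or reclaimed in place.
> // BlockRecordPB is assumed to be the generated class from the message above.
> #include <cstdint>
> #include <string>
> #include <vector>
> 
> void AppendBlockLifetime(std::vector<std::string>* metadata_stream,
>                          int64_t block_id, int64_t offset, int64_t length,
>                          uint64_t now_us) {
>   BlockRecordPB create;
>   create.mutable_block_id()->set_id(block_id);  // assuming BlockIdPB exposes set_id()
>   create.set_op_type(CREATE);
>   create.set_timestamp_us(now_us);
>   create.set_offset(offset);
>   create.set_length(length);
>   metadata_stream->push_back(create.SerializeAsString());
> 
>   // Deleting the block later appends a second record instead of removing
>   // the first one; the pair stays in the file until a metadata compaction.
>   BlockRecordPB del;
>   del.mutable_block_id()->set_id(block_id);
>   del.set_op_type(DELETE);
>   del.set_timestamp_us(now_us + 1);
>   metadata_stream->push_back(del.SerializeAsString());
> } {code}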
> There are some weak points in the current LBM metadata storage mechanism:
> h2. 1. Disk space amplification
> The ratio of live blocks in the metadata may be very low. In the worst case 
> there is only 1 live block (supposing it hasn't reached the runtime 
> compaction threshold) while all the other thousands of blocks are dead 
> (i.e. exist as CREATE-DELETE pairs), so the disk space amplification is 
> very serious.
> h2. 2. Long bootstrap time
> In the Kudu server bootstrap stage, all the metadata files have to be 
> replayed to find the live blocks. In the worst case, we may replay thousands 
> of blocks from the metadata only to find that very few of them are alive.
> This wastes much time in almost all cases, since a Kudu cluster in a 
> production environment usually runs for several months without a bootstrap, 
> so the LBM may be very loose.
> h2. 3. Metadata compaction
> To mitigate the issues above, there is a metadata compaction mechanism in 
> the LBM, both at runtime and in the bootstrap stage.
> The one at runtime locks the container, and it's synchronous.
> The one in the bootstrap stage is synchronous too, and may make the 
> bootstrap time even longer.
> h1. Optimization by using RocksDB
> h2. Storage design
>  * RocksDB instance: one RocksDB instance per data directory (see the 
> sketch after this list).
>  * Key: <container_id>.<block_id>
>  * Value: the same as before, i.e. the serialized protobuf string, stored 
> only for CREATE entries.
>  * Put/Delete: put the value into RocksDB when a block is created, delete it 
> from RocksDB when the block is deleted.
>  * Scan: happens only in the bootstrap stage to retrieve all blocks.
>  * DeleteRange: happens only when a container is invalidated.
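> A minimal sketch of the per-data-directory instance under the design above 
> (the subdirectory name and option values are assumptions, not part of the 
> design):
> {code:cpp}
> // One RocksDB instance per data directory; all containers in that directory
> // share it, keyed as "<container_id>.<block_id>". Path and options are
> // illustrative only.
> #include <memory>
> #include <string>
> #include <rocksdb/db.h>
> #include <rocksdb/options.h>
> 
> std::unique_ptr<rocksdb::DB> OpenMetadataDb(const std::string& data_dir) {
>   rocksdb::Options opts;
>   opts.create_if_missing = true;
>   // Compression/compaction settings would be tuned here to balance the
>   // remaining space amplification against write throughput.
>   rocksdb::DB* db = nullptr;
>   rocksdb::Status s = rocksdb::DB::Open(opts, data_dir + "/rdb", &db);
>   if (!s.ok()) {
>     return nullptr;  // real code would propagate the error instead
>   }
>   return std::unique_ptr<rocksdb::DB>(db);
> } {code}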
> h2. Advantages
>  # Disk space amplification: there is still a disk space amplification 
> problem, but we can tune RocksDB to reach a balanced point; I believe that 
> in most cases RocksDB is better than an append-only file.
>  # Bootstrap time: since only valid blocks are left in RocksDB, it may be 
> much faster than before.
>  # Metadata compaction: we can leave this work to RocksDB itself, though 
> tuning is needed.
> h2. Test & benchmark
> I have recently been working on using RocksDB to store LBM container 
> metadata; most of the work is finished now, and I did some benchmarking. It 
> shows that the fs module's block read/write/delete performance is similar 
> to, or slightly worse than, the old implementation, while the bootstrap time 
> may be reduced several times.
> I'm not sure whether it is worth continuing this work, or whether there has 
> ever been any discussion on this topic.


