Attendees: Guo Hao, Bing Hong, George Huang, Xi Chen, Guangbao Zhao, Jackson Yao, Hualong Zhang, Minyu Liu, Yuanben Wang, Sammi Chen
Agenda:
1. George raised the question about the 1.3.1 release. After discussion, we reached a consensus that it is better to put all our resources into 1.4.0 to speed it up; once 1.4.0 is released, most partners plan to upgrade to it. George will send a notice to the community dev mailing list later announcing that we are going to cancel the 1.3.1 release.
2. Bing raised a question about block deletion. They deleted a lot of files recently and found that block deletion was still going on after several days. Tuning the block-deletion-related properties of OM, SCM, and DN did not help. It was finally found that there were too many closeContainerCommands in the queue, and because the handler thread was busy with those commands, the block deletion commands waited a long time to get executed (see the first sketch after this list). Please refer to https://docs.google.com/document/d/1g1h-63fvA-be-clvyVRAHLWehoadCnjNRWyX8Bp-UIU/edit#heading=h.358h6n518zuj for details.
3. Xi shared their block deletion tuning experience: after some tuning, deletion of 1K blocks per second can be achieved on one DN. Based on these real environment cases, improvements were proposed to the community (see the lock-timeout sketch after this list):
   - HDDS-8888. Consider Datanode queue capacity when sending DeleteBlocks command #4939 <https://github.com/apache/ozone/pull/4939>
   - HDDS-8869. Make DN DeleteBlocksCommandHandler wait for the lock can timeout. #4913 <https://github.com/apache/ozone/pull/4913>
   - HDDS-8690. Ozone Support deletion related parameter dynamic configuration #4798 <https://github.com/apache/ozone/pull/4798>
   - HDDS-8882. Add status management of SCM's DeleteBlocksCommand to avoid sending duplicate delete transactions to the DN #4988 <https://github.com/apache/ozone/pull/4988>
   Review help is needed for the above patches.
4. Xi also mentioned that when the SCM RocksDB had 2 billion keys in the delete transaction table, it took nearly 60-70 seconds for the RocksDB iterator to seek to the first key (see the timing sketch after this list).
5. Hao found that their SCM failed to start up because there was a Raft log gap, and it is very hard for SCM to recover from that state without any SCM metadata loss. He raised RATIS-1887.
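
For item 2, a minimal sketch of the head-of-line blocking that was observed: a single handler thread drains one FIFO queue, so a backlog of closeContainerCommands delays any deleteBlocks command queued behind it. The class and enum names here are illustrative only, not Ozone's actual datanode code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: one handler thread, one FIFO queue, so a flood of
// close-container commands starves the deletion command behind them.
public class CommandQueueStarvation {
  enum Cmd { CLOSE_CONTAINER, DELETE_BLOCKS }

  public static void main(String[] args) {
    Queue<Cmd> queue = new ArrayDeque<>();
    for (int i = 0; i < 1_000_000; i++) {
      queue.add(Cmd.CLOSE_CONTAINER);   // large backlog of close commands
    }
    queue.add(Cmd.DELETE_BLOCKS);       // the deletion arrives last

    int handledBefore = 0;
    while (queue.poll() != Cmd.DELETE_BLOCKS) {
      handledBefore++;                  // each real command costs handler time
    }
    System.out.println("deleteBlocks ran only after " + handledBefore
        + " closeContainer commands were handled first");
  }
}
```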
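For item 3, a generic sketch of the bounded-wait locking pattern that HDDS-8869 describes ("wait for the lock can timeout"): instead of blocking indefinitely on a contended lock, the handler gives up after a timeout and lets the command be retried later. This is not Ozone's actual DeleteBlocksCommandHandler, just the pattern using the standard java.util.concurrent API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a delete handler that waits for a container lock with a
// timeout rather than blocking forever behind other commands.
public class BoundedLockWait {
  private final ReentrantLock containerLock = new ReentrantLock();

  boolean tryDeleteBlocks(long timeoutMillis) throws InterruptedException {
    if (!containerLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
      return false; // lock is busy; skip for now and retry on the next run
    }
    try {
      // ... apply the delete transactions while holding the lock ...
      return true;
    } finally {
      containerLock.unlock();
    }
  }
}
```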
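For item 4, a minimal sketch of the access pattern whose cost Xi reported: timing the first seek of a RocksDB iterator. With around 2 billion keys in the table, seekToFirst() can take tens of seconds. The DB path is a placeholder, and a plain single-column-family database is assumed for simplicity; SCM's actual DB layout uses column families, so this is illustrative only.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

// Times the initial seek of an iterator over a large RocksDB instance.
public class SeekFirstTiming {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options opts = new Options().setCreateIfMissing(false);
         RocksDB db = RocksDB.open(opts, "/path/to/db");   // placeholder path
         RocksIterator it = db.newIterator()) {
      long start = System.nanoTime();
      it.seekToFirst();                                    // the slow step
      System.out.printf("seekToFirst took %.1f s (valid=%b)%n",
          (System.nanoTime() - start) / 1e9, it.isValid());
    }
  }
}
```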