July 19th
Attenders: Hao, Mingyu, Hongbin, Jianghua, Wei-Chiu, Sammi

Hongbin/Jianghua:
- Further investigation shows that the "hundreds of RaftServer objects on Datanodes" come from the large number of pipeline directories under the Datanode ratis directory. Apart from consuming some Datanode resources to maintain their state, they have no impact on the integrity of data reads/writes. It would be nice if the Datanode had some logic to cleanly close these RaftServer objects.
- It was found that SCM took hours to reflect the new state of a specific container replica. Suggestion: check the SCM metrics and, if needed, tune the SCM container report handler configuration "ozone.scm.event.Incremental_Container_Report.thread.pool.size" or "ozone.scm.event.Container_Report.thread.pool.size".
- Client failed to read EC data during SCM restart because the EC pipeline lacked enough DN information during this period. The pipeline is cached in OM and is currently refreshed only after 6h. Refer to the last issue in https://docs.google.com/document/d/1g1h-63fvA-be-clvyVRAHLWehoadCnjNRWyX8Bp-UIU/edit .

Hao:
- RocksDB block cache hit rate on one Datanode is about 15%. The block cache is currently fixed at 1GB in Ozone. Want to improve the block cache hit rate.
- Discussed one common cause of EC reconstruction failure: failed, uncommitted block data left in some EC container replicas after a client-side write failure and retry.
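
For reference on the block cache item, the hit rate is the number of cache hits divided by total lookups. A minimal sketch, assuming the two counters are taken from RocksDB's `rocksdb.block.cache.hit` and `rocksdb.block.cache.miss` statistics; how they are collected from a Datanode is not covered here, and the counter values below are hypothetical, chosen only to reproduce the 15% figure:

```python
def block_cache_hit_rate(hits: int, misses: int) -> float:
    """Return the block cache hit ratio in [0, 1]; 0.0 if there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# Hypothetical counter values (not measured data) that give a 15% hit rate.
rate = block_cache_hit_rate(hits=15_000, misses=85_000)
print(f"block cache hit rate: {rate:.0%}")
```

Comparing this ratio before and after a cache size change is one way to judge whether a larger block cache actually helps.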