July 19th
Attendees: Hao, Mingyu, Hongbin, Jianghua, Wei-Chiu, Sammi

Hongbin/Jianghua:

   - Further investigation shows that the "hundreds of RaftServer objects
   on Datanodes" are caused by the large number of pipeline directories
   under the Datanode ratis directory. Apart from consuming some Datanode
   resources to maintain their state, they have no impact on the integrity
   of data reads/writes. It would be nice if the Datanode had some logic to
   cleanly close these RaftServer objects.
   - It was found that it took hours for SCM to reflect the new state of a
   specific container replica. Suggestion: check the SCM metrics, and tune
   the SCM container report handler configuration
   "ozone.scm.event.Incremental_Container_Report.thread.pool.size" or
   "ozone.scm.event.Container_Report.thread.pool.size" if needed.
   - The client failed to read EC data during an SCM restart, because the EC
   pipeline lacked sufficient DN information during this period. The
   pipeline is also cached in OM and is currently refreshed only after 6h.
   See the last issue in:
   https://docs.google.com/document/d/1g1h-63fvA-be-clvyVRAHLWehoadCnjNRWyX8Bp-UIU/edit
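
For the first item above, a quick way to gauge how many pipeline directories
have accumulated on a Datanode is to count them. The sketch below is only an
illustration: it assumes each pipeline has one subdirectory, named by its
UUID, directly under the ratis storage directory, and the helper name
count_pipeline_dirs is hypothetical (not part of Ozone).

```python
import uuid
from pathlib import Path


def count_pipeline_dirs(ratis_dir):
    """Count subdirectories of ratis_dir whose names parse as UUIDs.

    Assumption: one UUID-named subdirectory per pipeline (Raft group).
    """
    count = 0
    for entry in Path(ratis_dir).iterdir():
        if not entry.is_dir():
            continue  # skip plain files
        try:
            uuid.UUID(entry.name)  # raises ValueError for non-UUID names
        except ValueError:
            continue
        count += 1
    return count
```

Running it against the directory configured as the Datanode ratis storage
location and comparing the result with the number of open pipelines reported
by SCM would give a rough idea of how many directories are stale.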

Hao:

   - The RocksDB block cache hit rate on one Datanode is about 15%. The
   block cache is currently fixed at 1GB in Ozone. The goal is to improve
   the block cache hit rate.
   - Discussed one common cause of EC reconstruction failure: failed,
   uncommitted block data left in some EC container replicas by a
   client-side write failure and retry.
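
For reference, the block cache hit rate mentioned above is hits divided by
total lookups. A minimal sketch, assuming the two counters are read from the
RocksDB statistics tickers rocksdb.block.cache.hit and
rocksdb.block.cache.miss (how the numbers were actually collected was not
discussed; the function name is hypothetical):

```python
def block_cache_hit_rate(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None  # no lookups yet, so the rate is undefined
    return hits / total


# Illustrative numbers only: 15 hits out of 100 lookups is a 15% hit rate.
rate = block_cache_hit_rate(15, 85)
```

Tracking this ratio before and after changing the cache size would show
whether a larger block cache actually helps for the workload in question.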
