Hi everyone, I've implemented a container lease POC [1], and the result looks good.
Here's what's changed in the POC:

1. SCM keeps a LeaseExipreAt for each OPEN container. If SCM receives a container close command, it changes the container state to CLOSING, but it does not send the close container command to the DN until the lease expires.
2. OM forwards the container lease request from the Client to SCM.
3. The Client acquires a lease when a block is allocated (to be improved), and renews the leases for open blocks before they expire. The Client ignores any lease errors and keeps writing chunks to the DN even if the lease expires, because the worst case is simply a ContainerNotOpenException.

Although this POC is not perfect, the results in my tests look good.

Cluster: 48 datanodes on 4 machines
Client: Ozone freon ockg
Threads: 100
Key count: 1000
Key size: 1000 MB
ReplicationConfig: EC/RS-10-4-1024K

We should expect 14000 x 100 MB blocks under ideal conditions (1000 keys x 14 blocks each: 10 data + 4 parity). I'm only showing the data from 1 of the 4 machines.

Before the change (commit 1cf5678224bf00dee580ffdb14ab8b650cc1e2e0):
(The number before each size is the count of blocks of that size)

15 1.0M  48 2.0M  40 3.0M  48 4.0M  37 5.0M  33 6.0M  48 7.0M  51 8.0M  30 9.0M  49 10M
40 11M  65 12M  33 13M  18 14M  43 15M  46 16M  38 17M  20 18M  46 19M  32 20M
5 21M  54 22M  58 23M  33 24M  25 25M  39 26M  44 27M  48 28M  25 29M  18 30M
34 31M  42 32M  22 33M  23 34M  27 35M  26 36M  33 37M  27 38M  30 39M  60 40M
25 41M  27 42M  26 43M  20 44M  13 45M  18 46M  40 47M  27 48M  25 49M  15 50M
40 51M  26 52M  41 53M  41 54M  9 55M  11 56M  11 57M  19 58M  30 59M  28 60M
44 61M  36 62M  21 63M  14 64M  19 65M  14 66M  23 67M  33 68M  40 69M  34 70M
17 71M  10 72M  35 73M  28 74M  24 75M  21 76M  34 77M  26 78M  35 79M  18 80M
27 81M  26 82M  14 83M  19 84M  23 85M  29 86M  4 87M  23 88M  37 89M  11 90M
23 91M  38 92M  16 93M  12 94M  18 95M  21 96M  27 97M  19 98M  35 99M  2099 100M

Container size before the change:

$ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1001390080, 1002438656, 1003487232, 1003487232, 1004535808, 1004535808, 1004535808, 1004535808,
1006632960, 1007681536, 1010827264, 1011875840, 1011875840, 1011875840, 1013972992, 1016070144,
1016070144, 1016070144, 1019215872, 1024458752, 1028653056, 1028653056, 1031798784, 1032847360,
1032847360, 1032847360, 1033895936, 1035993088, 1044381696, 1046478848, 1050673152, 1062207488,
1092616192, 1096810496, 968884224, 968884224, 970981376, 970981376, 972029952, 972029952,
973078528, 973078528, 974127104, 974127104, 975175680, 976224256, 976224256, 976224256,
976224256, 976224256, 976224256, 976224256, 976224256, 979369984, 980418560, 980418560,
980418560, 981467136, 981467136, 983564288, 983564288, 983564288, 984612864, 984612864,
984612864, 985661440, 985661440, 985661440, 985661440, 986710016, 986710016, 987758592,
987758592, 988807168, 988807168, 989855744, 989855744, 989855744, 989855744, 990904320,
990904320, 990904320, 990904320, 990904320, 990904320, 991952896, 991952896, 993001472,
994050048, 996147200, 997195776, 998244352, 998244352,

After the change (commit 52c903ccc644aba63bbd5354bae98bc8bbe13675):
(Occasionally, a few blocks are broken into smaller ones)

3571 100M

Container sizes after the change:
**Note: "ozone.scm.container.size" was set to 1G**
**Note: "hdds.datanode.storage.utilization.critical.threshold" was set to 0.99**

$ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
0, 1258291200, 1258291200, 1363148800, 1468006400, 1782579200, 1887436800, 1887436800,
1992294400, 2306867200, 2621440000, 2621440000, 2726297600, 2831155200, 2831155200, 2936012800,
2936012800, 3040870400, 3040870400, 3040870400, 3040870400, 3040870400, 3145728000, 3250585600,
3250585600, 3355443200, 3355443200, 3460300800, 3565158400, 3565158400, 3670016000, 3670016000,
3774873600, 3879731200, 3879731200, 4404019200, 4404019200,

I've also run the tests with RATIS/THREE, and the results look similar.
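To make the SCM-side rule from the POC description easier to discuss, here is a minimal sketch of how I think about it. It is not the code in the branch: the class and method names (ContainerLeaseSketch, acquireOrRenew, requestClose, shouldSendCloseToDatanodes) are made up for illustration, and refusing renewals once a container is CLOSING is my assumption, so that a CLOSING container is guaranteed to close eventually.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the SCM-side lease rule; not the POC's actual classes. */
class ContainerLeaseSketch {
  enum State { OPEN, CLOSING }

  static final class ContainerInfo {
    State state = State.OPEN;
    long leaseExpireAtMillis; // time at which the current lease expires
  }

  private final Map<Long, ContainerInfo> containers = new HashMap<>();
  private final long leaseDurationMillis;

  ContainerLeaseSketch(long leaseDurationMillis) {
    this.leaseDurationMillis = leaseDurationMillis;
  }

  /** Acquire or renew a lease. Assumption: only OPEN containers can be extended. */
  boolean acquireOrRenew(long containerId, long nowMillis) {
    ContainerInfo c = containers.computeIfAbsent(containerId, id -> new ContainerInfo());
    if (c.state != State.OPEN) {
      return false; // the client ignores this and keeps writing until the DN rejects it
    }
    c.leaseExpireAtMillis = nowMillis + leaseDurationMillis;
    return true;
  }

  /** A close request moves the container to CLOSING right away. */
  void requestClose(long containerId) {
    ContainerInfo c = containers.get(containerId);
    if (c != null) {
      c.state = State.CLOSING;
    }
  }

  /** The close command is sent to the DN only after the lease has expired. */
  boolean shouldSendCloseToDatanodes(long containerId, long nowMillis) {
    ContainerInfo c = containers.get(containerId);
    return c != null
        && c.state == State.CLOSING
        && nowMillis >= c.leaseExpireAtMillis;
  }
}
```

With a 1000 ms lease acquired at t=0 and a close request arriving shortly after, the sketch holds the close command back at t=500 and releases it at t=1000, which is the window that lets in-flight blocks finish.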
What I've implemented in the POC basically prevents a DN from closing a container that has been written to recently. This could be implemented solely in the DN, using a lastUpdated timestamp on each container, so we wouldn't need extra RPCs to achieve it. What do you think?

Please help verify, and share your feedback and suggestions.

Thanks,
Kaijie

---
[1]: https://github.com/kaijchen/ozone/tree/container-lease
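
P.S. To make the DN-only alternative concrete, here is a rough sketch of what a lastUpdated check could look like. Nothing here exists in Ozone or in the POC branch: RecentWriteGuard, onChunkWritten, canCloseNow and the quiet-period parameter are hypothetical names, and the choice of a fixed quiet period is an assumption.

```java
/** Hypothetical DN-side guard: delay closing a container that was written to recently. */
class RecentWriteGuard {
  private final long quietPeriodMillis;   // assumed tunable: how long a container must be idle
  private volatile long lastUpdatedMillis; // time of the most recent chunk write

  RecentWriteGuard(long quietPeriodMillis, long createdAtMillis) {
    this.quietPeriodMillis = quietPeriodMillis;
    this.lastUpdatedMillis = createdAtMillis;
  }

  /** Called on every successful chunk write to this container. */
  void onChunkWritten(long nowMillis) {
    lastUpdatedMillis = nowMillis;
  }

  /** On a close command, the DN closes only if the container has been idle long enough. */
  boolean canCloseNow(long nowMillis) {
    return nowMillis - lastUpdatedMillis >= quietPeriodMillis;
  }
}
```

The DN would call canCloseNow when it handles a close command and retry later if it returns false. Compared with the lease approach, all of the state lives on the DN, so the Client, OM and SCM need no new RPCs.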