Hi Sumit,

Thank you for sharing your ideas. The block expiry time list sounds appealing.

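Just to check my understanding before asking the questions, here is a rough
sketch of how I picture the SCM-side bookkeeping. All class and field names
below are made up by me for illustration; they are not taken from your
document or from the current code.

    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch of my reading of the proposal; names are illustrative only.
    public class OpenContainerBlockTracker {

      // One entry per in-progress block of an OPEN container.
      private static class BlockEntry {
        final long localId;     // block id within the container
        final long expiryTime;  // when SCM stops waiting for this block
        long usedSpace;         // space currently accounted to this block

        BlockEntry(long localId, long expiryTime, long usedSpace) {
          this.localId = localId;
          this.expiryTime = expiryTime;
          this.usedSpace = usedSpace;
        }
      }

      // The blockExpiryTimeList from the proposal, oldest entry first.
      private final List<BlockEntry> blockExpiryTimeList = new ArrayList<>();

      // CloseBlockNotification ICR handling, steps 1-3 as I read them.
      public void onCloseBlockNotification(long localId, long occupiedSize) {
        int idx = -1;
        for (int i = 0; i < blockExpiryTimeList.size(); i++) {  // step 1: find entry
          if (blockExpiryTimeList.get(i).localId == localId) {
            idx = i;
            break;
          }
        }
        if (idx < 0) {
          return;  // no matching entry for this block
        }
        blockExpiryTimeList.remove(idx);                        // step 2: remove entry
        // step 3: add the occupied size to the usedSpace of the previous
        // blocks that are still in progress in the list.
        for (int i = 0; i < idx; i++) {
          blockExpiryTimeList.get(i).usedSpace += occupiedSize;
        }
        // Question 1 below is about whether the blocks after the finished
        // one also need an adjustment here.
      }
    }

My questions below are based on this reading.
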
I have a few questions here:

> CloseBlockNotification ICR handling at SCM:
> 1) Check for containerID and get the matching entry from blockExpiryTimeList
> 2) Remove the block entry from blockExpiryTimeList
> 3) Add the occupied size to all other usedSpace (To avoid extra block) to
> previous blocks still in-progress in list

Question 1: In step 3 above, do we need to change the blocks after the finished
block as well? (+occupied size, -full block size) If so, can we maintain only
one usedSpace value? (the expiryTime list and usedSpace are kept)

Question 2: In an openKey request, OM may pre-allocate some blocks for the
client. How do we set the expiration time for the pre-allocated blocks?

Question 3: When is it safe to close the container? (I assume when the list is
empty and usedSpace exceeds some limit?)

PS: Are you planning to implement a POC to see how it works in a real cluster?

Thanks,
Kaijie

---- On Wed, 02 Nov 2022 22:08:31 +0800 Sumit Agrawal wrote ---

> Hi Devs,
> I have another approach without having much impact on the system, keeping
> some restrictions on usage to minimize the impact.
> I have attached the proposal, please have a look.
> Regards
> Sumit
>
> On Wed, Nov 2, 2022 at 3:49 PM Nandakumar Vadivelu <nvadiv...@cloudera.com>
> wrote:
>
> + Sumit Agrawal
> (He is also working on the design for Reserve Space for Allocated Blocks)
>
>
> On 25-Oct-2022, at 9:18 AM, Kaijie Chen <c...@apache.org> wrote:
> >
> > Looking into the AllocateBlock interface, it assumes all blocks allocated
> > are of the same size.
> >
> > List<AllocatedBlock> allocateBlock(long size, int numBlocks,
> >     ReplicationConfig replicationConfig, String owner,
> >     ExcludeList excludeList) throws IOException;
> >
> > I'm wondering if we can change this API to allocate optimistically,
> > and track the exact space allocated. Such as:
> >
> > List<AllocatedBlock> allocateBlock(long totalSize,
> >     ReplicationConfig replicationConfig, String owner,
> >     ExcludeList excludeList) throws IOException;
> >
> > Suppose we want to write a 300 MB key, we should expect
> > 256 MB + 44 MB blocks instead of 256 MB + 256 MB blocks.
> >
> > Yes, exceptions could happen and the final block size may vary,
> > but we should optimize for the most common case.
> >
> > Best,
> > Kaijie
> >
> > ---- On Thu, 29 Sep 2022 09:54:40 +0800 anu engineer wrote ---
> >> 15 GB sounds excessive; I would first investigate how that can happen and
> >> whether we have some sort of path that is not explored fully, or perhaps
> >> a bug in the allocation, or the clients are moving too fast for us to
> >> respond.
> >>
> >> If you think the issue is with the clients being able to get leases too
> >> fast, I think that you need a solution combining tracking and leases.
> >>
> >> We can limit two things:
> >> 1. The maximum number of times you can renew the lease - it limits the
> >> maximum time a client can force the container to remain open.
> >> 2. The maximum number of outstanding leases - have a policy, for example
> >> you can say that we will have only 50% of unallocated space at any given
> >> time as leases -- that is the proposal that we were discussing on the
> >> other thread.
> >>
> >> Also be aware that this is a soft constraint -- if a large number of your
> >> containers behave and tend to converge to your expected size, overall your
> >> system is stable(r).
> >>
> >> Thanks
> >> Anu
> >>
> >> On Wed, Sep 28, 2022 at 5:56 AM Kaijie Chen <c...@apache.org> wrote:
> >>
> >>> Hi Anu,
> >>>
> >>> Thanks for your suggestions.
> >>> These are indeed areas where we can improve the code. I have something
> >>> more to share.
> >>>
> >>> I did more tests today, and I have observed containers over 15 GB,
> >>> which is 15 times the configured container size limit (1 GB).
> >>> It might be related to the pipeline choosing policy and the container
> >>> close threshold (99%).
> >>>
> >>> Because we have no control over how many blocks can be allocated
> >>> simultaneously, it seems there is a risk that we can get abnormally
> >>> large containers. What do you think?
> >>>
> >>> I have also tested the simple delay proposal. It sometimes works well,
> >>> but it sometimes still produces fragmented blocks. This is expected.
> >>>
> >>> Kaijie
> >>>
> >>> ---- On Wed, 28 Sep 2022 08:00:38 +0800 anu engineer wrote ---
> >>>> Thank you for the POC, and the numbers from your POC. It looks very good.
> >>>> I know this is a private POC proposal, yet I have two minor questions.
> >>>>
> >>>> 1. Should we maintain the client ID in the "private final Map<ContainerID,
> >>>> Long> containerLeases" map? So instead of a Long we maintain a Long +
> >>>> client ID is what I was thinking. Might be useful for debugging.
> >>>> 2. Suppose a client keeps on renewing a container lease, do we want to
> >>>> enforce a maximum limit? It is not needed per se -- more like a question
> >>>> that I am asking myself.
> >>>>
> >>>> Thanks
> >>>> Anu
> >>>>
> >>>> On Mon, Sep 26, 2022 at 2:42 AM Kaijie Chen <c...@apache.org> wrote:
> >>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I've implemented a container lease POC [1], and the result looks good.
> >>>>>
> >>>>> Here's what's changed in the POC:
> >>>>>
> >>>>> 1. SCM will keep a LeaseExpireAt for each OPEN container. If SCM
> >>>>>    receives a container close command, it will change the container
> >>>>>    state to CLOSING, but it will not send the close container command
> >>>>>    to the DN until the lease expires.
> >>>>> 2. OM will forward the container lease request from the Client to SCM.
> >>>>> 3. The Client will acquire a lease when a block is allocated (to be
> >>>>>    improved), and it will renew leases for open blocks before they
> >>>>>    expire. The Client will ignore any errors with leases, and keep
> >>>>>    writing chunks to the DN even if the lease expires, because the
> >>>>>    worst case is simply ContainerNotOpenException.
> >>>>>
> >>>>> Although this POC is not perfect, the results in my tests look good.
> >>>>>
> >>>>> Cluster: 48 datanodes on 4 machines
> >>>>> Client: Ozone freon ockg
> >>>>> Threads: 100
> >>>>> Key count: 1000
> >>>>> Key size: 1000 MB
> >>>>> ReplicationConfig: EC/RS-10-4-1024K
> >>>>>
> >>>>> We should expect 14000 x 100 MB blocks in the ideal condition.
> >>>>> I'm only showing the data from 1 of the 4 machines.
> >>>>>
> >>>>> Before the change (commit 1cf5678224bf00dee580ffdb14ab8b650cc1e2e0):
> >>>>> (The number before each size is the count of blocks of that size)
> >>>>>
> >>>>> 15 1.0M 48 2.0M 40 3.0M 48 4.0M 37 5.0M 33 6.0M 48 7.0M 51 8.0M
> >>>>> 30 9.0M 49 10M 40 11M 65 12M 33 13M 18 14M 43 15M 46 16M 38 17M
> >>>>> 20 18M 46 19M 32 20M 5 21M 54 22M 58 23M 33 24M 25 25M 39 26M
> >>>>> 44 27M 48 28M 25 29M 18 30M 34 31M 42 32M 22 33M 23 34M 27 35M
> >>>>> 26 36M 33 37M 27 38M 30 39M 60 40M 25 41M 27 42M 26 43M 20 44M
> >>>>> 13 45M 18 46M 40 47M 27 48M 25 49M 15 50M 40 51M 26 52M 41 53M
> >>>>> 41 54M 9 55M 11 56M 11 57M 19 58M 30 59M 28 60M 44 61M 36 62M
> >>>>> 21 63M 14 64M 19 65M 14 66M 23 67M 33 68M 40 69M 34 70M 17 71M
> >>>>> 10 72M 35 73M 28 74M 24 75M 21 76M 34 77M 26 78M 35 79M 18 80M
> >>>>> 27 81M 26 82M 14 83M 19 84M 23 85M 29 86M 4 87M 23 88M 37 89M
> >>>>> 11 90M 23 91M 38 92M 16 93M 12 94M 18 95M 21 96M 27 97M 19 98M
> >>>>> 35 99M 2099 100M
> >>>>>
> >>>>> Container sizes before the change:
> >>>>>
> >>>>> $ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
> >>>>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>> 0, 0, 0, 1001390080,
> >>>>> 1002438656, 1003487232, 1003487232, 1004535808, 1004535808, 1004535808,
> >>>>> 1004535808, 1006632960, 1007681536, 1010827264, 1011875840, 1011875840,
> >>>>> 1011875840, 1013972992, 1016070144, 1016070144, 1016070144, 1019215872,
> >>>>> 1024458752, 1028653056, 1028653056, 1031798784, 1032847360, 1032847360,
> >>>>> 1032847360, 1033895936, 1035993088, 1044381696, 1046478848, 1050673152,
> >>>>> 1062207488, 1092616192, 1096810496, 968884224, 968884224, 970981376,
> >>>>> 970981376, 972029952, 972029952, 973078528, 973078528, 974127104,
> >>>>> 974127104, 975175680, 976224256, 976224256, 976224256, 976224256,
> >>>>> 976224256, 976224256, 976224256, 976224256, 979369984, 980418560,
> >>>>> 980418560, 980418560, 981467136, 981467136, 983564288, 983564288,
> >>>>> 983564288, 984612864, 984612864, 984612864, 985661440, 985661440,
> >>>>> 985661440, 985661440, 986710016, 986710016, 987758592, 987758592,
> >>>>> 988807168, 988807168, 989855744, 989855744, 989855744, 989855744,
> >>>>> 990904320, 990904320, 990904320, 990904320, 990904320, 990904320,
> >>>>> 991952896, 991952896, 993001472, 994050048, 996147200, 997195776,
> >>>>> 998244352, 998244352,
> >>>>>
> >>>>>
> >>>>> After the change (commit 52c903ccc644aba63bbd5354bae98bc8bbe13675):
> >>>>> (Occasionally, there are a few blocks broken into smaller ones)
> >>>>>
> >>>>> 3571 100M
> >>>>>
> >>>>> Container sizes after the change:
> >>>>>
> >>>>> **Note: "ozone.scm.container.size" was set to 1G**
> >>>>> **Note: "hdds.datanode.storage.utilization.critical.threshold" was set to 0.99**
> >>>>>
> >>>>> $ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
> >>>>> 0, 1258291200, 1258291200, 1363148800, 1468006400, 1782579200, 1887436800,
> >>>>> 1887436800, 1992294400, 2306867200, 2621440000, 2621440000, 2726297600,
> >>>>> 2831155200, 2831155200, 2936012800, 2936012800, 3040870400, 3040870400,
> >>>>> 3040870400, 3040870400, 3040870400, 3145728000, 3250585600, 3250585600,
> >>>>> 3355443200, 3355443200, 3460300800, 3565158400, 3565158400, 3670016000,
> >>>>> 3670016000, 3774873600, 3879731200, 3879731200, 4404019200, 4404019200,
> >>>>>
> >>>>> I've also done tests in RATIS/THREE, and the results look similar.
> >>>>>
> >>>>> What I've implemented in the POC is basically: don't let the DN close a
> >>>>> container if it was recently written to. It could be implemented
> >>>>> solely in the DN with a lastUpdated timestamp in containers.
> >>>>> So we won't need extra RPCs to achieve this. What do you think?
> >>>>>
> >>>>> Please help verify and give feedback and suggestions.
> >>>>>
> >>>>> Thanks,
> >>>>> Kaijie
> >>>>>
> >>>>> ---
> >>>>>
> >>>>> [1]: https://github.com/kaijchen/ozone/tree/container-lease
> >>>>>
>
> --
> Sumit Agrawal | Senior Staff Engineer
>
> cloudera.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
For additional commands, e-mail: dev-h...@ozone.apache.org