Re: [RFC] Proposal: Reserve Space for Allocated Blocks

Sumit Agrawal Wed, 02 Nov 2022 07:09:01 -0700

Hi Devs,

I have another approach that has little impact on the system: it keeps some
restrictions on usage to minimize the impact.


I have attached the proposal, please have a look.

Regards
Sumit

On Wed, Nov 2, 2022 at 3:49 PM Nandakumar Vadivelu <nvadiv...@cloudera.com>
wrote:

> + Sumit Agrawal
> (He is also working on the design for Reserve Space for Allocated Blocks)
>
> > On 25-Oct-2022, at 9:18 AM, Kaijie Chen <c...@apache.org> wrote:
> >
> > Looking into the AllocateBlock interface, it assumes all blocks allocated
> > are in the same size.
> >
> >    List<AllocatedBlock> allocateBlock(long size, int numBlocks,
> >        ReplicationConfig replicationConfig, String owner,
> >        ExcludeList excludeList) throws IOException;
> >
> > I'm wondering if we can change this API to allocate optimistically,
> > and track the exact space allocated. Such as,
> >
> >    List<AllocatedBlock> allocateBlock(long totalSize,
> >        ReplicationConfig replicationConfig, String owner,
> >        ExcludeList excludeList) throws IOException;
> >
> > Suppose we want to write a 300 MB key, we should expect
> > 256 MB + 44 MB blocks instead of 256 MB + 256 MB blocks.
> >
> > Yes, exceptions could happen and the final block size may vary,
> > but we should optimize for the most common case.
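
To make the 300 MB example concrete, here is a minimal sketch of the split, assuming a 256 MB maximum block size. `BlockSizePlanner` and `plan` are hypothetical names for illustration only; they are not part of the Ozone API, and the real allocation path may choose sizes differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an Ozone class: splits a key of totalSize bytes
// into block sizes of at most maxBlockSize bytes each.
final class BlockSizePlanner {
  static List<Long> plan(long totalSize, long maxBlockSize) {
    if (totalSize < 0 || maxBlockSize <= 0) {
      throw new IllegalArgumentException("invalid sizes");
    }
    List<Long> sizes = new ArrayList<>();
    long remaining = totalSize;
    while (remaining > 0) {
      // Every block is full-sized except possibly the last one.
      long size = Math.min(remaining, maxBlockSize);
      sizes.add(size);
      remaining -= size;
    }
    return sizes;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // A 300 MB key yields two entries: 256 MB and 44 MB (in bytes).
    System.out.println(plan(300 * mb, 256 * mb));
  }
}
```

With sizes known up front like this, the space tracked for the second block would be 44 MB rather than a full 256 MB.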
> >
> > Best,
> > Kaijie
> >
> > ---- On Thu, 29 Sep 2022 09:54:40 +0800  anu engineer  wrote ---
> >> 15 GB sounds excessive; I would first investigate how that can happen,
> >> and whether we have some path that is not fully explored, a bug in the
> >> allocation, or clients that are moving too fast for us to respond.
> >>
> >> If you think the issue is with the clients being able to get leases too
> >> fast, I think that you need a solution that combines tracking and leases.
> >>
> >> We can limit two things:
> >> 1. The maximum number of times you can renew the lease. This limits the
> >> maximum time a client can force the container to remain open.
> >> 2. The maximum number of outstanding leases. Have a policy; for example,
> >> you can say that we will have only 50% of unallocated space at any given
> >> time as leases. That is the proposal that we were discussing on the other
> >> thread.
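
The two limits above can be sketched as follows. This is only one possible reading: `LeasePolicy`, its fields, and the idea of counting outstanding leases in bytes against a fraction of unallocated space are assumptions made for illustration, not code from the POC or from Ozone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy object; all names here are illustrative assumptions.
final class LeasePolicy {
  private final int maxRenewals;           // limit 1: renewals per container
  private final double maxLeasedFraction;  // limit 2: e.g. 0.5 for 50%
  private final Map<Long, Integer> renewals = new HashMap<>();
  private long leasedBytes;                // bytes currently promised as leases

  LeasePolicy(int maxRenewals, double maxLeasedFraction) {
    this.maxRenewals = maxRenewals;
    this.maxLeasedFraction = maxLeasedFraction;
  }

  // Grants a lease only if the total leased bytes would stay within the
  // configured fraction of the currently unallocated space.
  boolean tryAcquire(long containerId, long bytes, long unallocatedBytes) {
    if (leasedBytes + bytes > maxLeasedFraction * unallocatedBytes) {
      return false;
    }
    leasedBytes += bytes;
    renewals.putIfAbsent(containerId, 0);
    return true;
  }

  // Renews a lease unless the container is unknown or has used up its
  // renewals, which bounds how long a client can keep a container open.
  boolean tryRenew(long containerId) {
    Integer count = renewals.get(containerId);
    if (count == null || count >= maxRenewals) {
      return false;
    }
    renewals.put(containerId, count + 1);
    return true;
  }
}
```

Both checks are soft limits in the sense described in this message: they bound the common case without guaranteeing an exact container size.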
> >>
> >>
> >> Also be aware that this is a soft constraint -- if a large number of your
> >> containers behave and tend to converge to your expected size, overall your
> >> system is stable(r).
> >>
> >>
> >> Thanks
> >> Anu
> >>
> >>
> >>
> >>
> >> On Wed, Sep 28, 2022 at 5:56 AM Kaijie Chen <c...@apache.org> wrote:
> >>
> >>> Hi Anu,
> >>>
> >>> Thanks for your suggestions. These are indeed where we can
> >>> improve the code. I have something more to share.
> >>>
> >>> I did more tests today, and I have observed containers over 15 GB,
> >>> which is 15 times the configured container size limit (1 GB).
> >>> It might be related to the pipeline choosing policy and the container
> >>> close threshold (99%).
> >>>
> >>> Because we have no control over how many blocks can be allocated
> >>> simultaneously, there seems to be a risk of getting abnormally
> >>> large containers. What do you think?
> >>>
> >>> I have also tested the simple delay proposal. It sometimes works well,
> >>> but sometimes it still produces fragmented blocks. This is expected.
> >>>
> >>> Kaijie
> >>>
> >>> ---- On Wed, 28 Sep 2022 08:00:38 +0800  anu engineer  wrote ---
> >>>> Thank you for the POC, and the numbers from your POC. It looks very good.
> >>>> I know this is a private POC proposal, yet I have two minor questions.
> >>>>
> >>>> 1. Should we maintain the client ID in the "private final
> >>>> Map<ContainerID, Long> containerLeases" map? Instead of a Long, we would
> >>>> maintain a Long + Client ID, is what I was thinking. Might be useful for
> >>>> debugging.
> >>>> 2. Suppose a client keeps on renewing a container lease, do we want to
> >>>> enforce a maximum limit? It is not needed per se -- more like a question
> >>>> that I am asking myself.
> >>>>
> >>>> Thanks
> >>>> Anu
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Sep 26, 2022 at 2:42 AM Kaijie Chen <c...@apache.org> wrote:
> >>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I've implemented a container lease POC [1], and the result looks good.
> >>>>>
> >>>>> Here's what's changed in the POC:
> >>>>>
> >>>>> 1. SCM will keep a LeaseExipreAt for each OPEN container. If SCM
> >>>>>    receives container close command, it will change the container
> >>>>>    state to CLOSING, but it will not send close container command
> >>>>>    to DN until the lease expires.
> >>>>> 2. OM will forward the container lease request from Client to SCM.
> >>>>> 3. Client will acquire a lease when a block is allocated (to be
> >>>>>    improved), and it will renew leases for open blocks before they
> >>>>>    expire. Client will ignore any errors with leases, and keep writing
> >>>>>    chunks to DN even if the lease expires, because the worst case is
> >>>>>    simply ContainerNotOpenException.
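
The SCM-side behaviour in step 1 can be sketched as below. This is not the POC code from [1]: `ContainerLeaseTracker`, its methods, and the use of plain `long` container IDs and millisecond timestamps are assumptions made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of deferred close; names are illustrative only.
final class ContainerLeaseTracker {
  enum State { OPEN, CLOSING }

  private final Map<Long, Long> leaseExpireAt = new HashMap<>();
  private final Map<Long, State> states = new HashMap<>();
  private final long leaseMillis;

  ContainerLeaseTracker(long leaseMillis) {
    this.leaseMillis = leaseMillis;
  }

  void open(long containerId) {
    states.put(containerId, State.OPEN);
  }

  // Acquire or renew: the expiry only ever moves forward.
  void renewLease(long containerId, long nowMillis) {
    leaseExpireAt.merge(containerId, nowMillis + leaseMillis, Long::max);
  }

  // A close request changes the state but does not reach the DN yet.
  void requestClose(long containerId) {
    states.replace(containerId, State.CLOSING);
  }

  // The close command may be sent only once the container is CLOSING
  // and its lease (if any) has expired.
  boolean canSendCloseToDatanode(long containerId, long nowMillis) {
    return states.get(containerId) == State.CLOSING
        && nowMillis >= leaseExpireAt.getOrDefault(containerId, 0L);
  }
}
```

A caller would poll `canSendCloseToDatanode` for CLOSING containers and send the close command once it returns true; a client that stops renewing therefore delays the close by at most one lease period.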
> >>>>>
> >>>>> Although this POC is not perfect, the results in my tests look good.
> >>>>>
> >>>>> Cluster: 48 datanodes on 4 machines
> >>>>> Client: Ozone freon ockg
> >>>>> Threads: 100
> >>>>> Key count: 1000
> >>>>> Key size: 1000 MB
> >>>>> ReplicationConfig: EC/RS-10-4-1024K
> >>>>>
> >>>>> We should expect 14000 x 100 MB blocks under ideal conditions.
> >>>>> I'm only showing the data from 1 of the 4 machines.
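
The 14000 figure follows from the test parameters. A small sketch of the arithmetic, assuming each 1000 MB key fits in a single EC block group so that every data and parity block holds keySize / 10 (`ExpectedBlocks` is a hypothetical name):

```java
// Hypothetical helper showing the expected-block arithmetic for EC keys.
final class ExpectedBlocks {
  // Each key produces one block per data unit and one per parity unit.
  static long totalBlocks(long keyCount, int dataUnits, int parityUnits) {
    return keyCount * (dataUnits + parityUnits);
  }

  // With striping, the key's bytes are spread evenly over the data units.
  static long blockSizeMb(long keySizeMb, int dataUnits) {
    return keySizeMb / dataUnits;
  }

  public static void main(String[] args) {
    // 1000 keys of 1000 MB with RS-10-4.
    System.out.println(totalBlocks(1000, 10, 4) + " blocks of "
        + blockSizeMb(1000, 10) + " MB");
  }
}
```

If the blocks were spread evenly over the 4 machines, each would hold about 3500 of them, which is the scale of the per-machine counts reported in this message.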
> >>>>>
> >>>>>
> >>>>> Before the change (commit 1cf5678224bf00dee580ffdb14ab8b650cc1e2e0):
> >>>>>    (The number before each size is the count of blocks of that size)
> >>>>>
> >>>>>    15 1.0M 48 2.0M 40 3.0M 48 4.0M 37 5.0M 33 6.0M 48 7.0M 51 8.0M
> >>>>>    30 9.0M 49 10M 40 11M 65 12M 33 13M 18 14M 43 15M 46 16M 38 17M
> >>>>>    20 18M 46 19M 32 20M 5 21M 54 22M 58 23M 33 24M 25 25M 39 26M
> >>>>>    44 27M 48 28M 25 29M 18 30M 34 31M 42 32M 22 33M 23 34M 27 35M
> >>>>>    26 36M 33 37M 27 38M 30 39M 60 40M 25 41M 27 42M 26 43M 20 44M
> >>>>>    13 45M 18 46M 40 47M 27 48M 25 49M 15 50M 40 51M 26 52M 41 53M
> >>>>>    41 54M 9 55M 11 56M 11 57M 19 58M 30 59M 28 60M 44 61M 36 62M
> >>>>>    21 63M 14 64M 19 65M 14 66M 23 67M 33 68M 40 69M 34 70M 17 71M
> >>>>>    10 72M 35 73M 28 74M 24 75M 21 76M 34 77M 26 78M 35 79M 18 80M
> >>>>>    27 81M 26 82M 14 83M 19 84M 23 85M 29 86M 4 87M 23 88M 37 89M
> >>>>>    11 90M 23 91M 38 92M 16 93M 12 94M 18 95M 21 96M 27 97M 19 98M
> >>>>>    35 99M 2099 100M
> >>>>>
> >>>>> Container size before the change:
> >>>>>
> >>>>>    $ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
> >>>>>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >>>>>    0, 0, 0, 0, 1001390080, 1002438656, 1003487232, 1003487232,
> >>>>>    1004535808, 1004535808, 1004535808, 1004535808, 1006632960,
> >>>>>    1007681536, 1010827264, 1011875840, 1011875840, 1011875840,
> >>>>>    1013972992, 1016070144, 1016070144, 1016070144, 1019215872,
> >>>>>    1024458752, 1028653056, 1028653056, 1031798784, 1032847360,
> >>>>>    1032847360, 1032847360, 1033895936, 1035993088, 1044381696,
> >>>>>    1046478848, 1050673152, 1062207488, 1092616192, 1096810496,
> >>>>>    968884224, 968884224, 970981376,
> >>>>>    970981376, 972029952, 972029952, 973078528, 973078528, 974127104,
> >>>>>    974127104, 975175680, 976224256, 976224256, 976224256, 976224256,
> >>>>>    976224256, 976224256, 976224256, 976224256, 979369984, 980418560,
> >>>>>    980418560, 980418560, 981467136, 981467136, 983564288, 983564288,
> >>>>>    983564288, 984612864, 984612864, 984612864, 985661440, 985661440,
> >>>>>    985661440, 985661440, 986710016, 986710016, 987758592, 987758592,
> >>>>>    988807168, 988807168, 989855744, 989855744, 989855744, 989855744,
> >>>>>    990904320, 990904320, 990904320, 990904320, 990904320, 990904320,
> >>>>>    991952896, 991952896, 993001472, 994050048, 996147200, 997195776,
> >>>>>    998244352, 998244352,
> >>>>>
> >>>>>
> >>>>> After the change (commit 52c903ccc644aba63bbd5354bae98bc8bbe13675):
> >>>>>    (Occasionally, a few blocks are broken into smaller ones)
> >>>>>
> >>>>>    3571 100M
> >>>>>
> >>>>> Container sizes after the change:
> >>>>>
> >>>>>    **Note: "ozone.scm.container.size" was set to 1G**
> >>>>>    **Note: "hdds.datanode.storage.utilization.critical.threshold" was set to 0.99**
> >>>>>
> >>>>>    $ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
> >>>>>    0, 1258291200, 1258291200, 1363148800, 1468006400, 1782579200,
> >>>>>    1887436800, 1887436800, 1992294400, 2306867200, 2621440000,
> >>>>>    2621440000, 2726297600, 2831155200, 2831155200, 2936012800,
> >>>>>    2936012800, 3040870400, 3040870400, 3040870400, 3040870400,
> >>>>>    3040870400, 3145728000, 3250585600, 3250585600, 3355443200,
> >>>>>    3355443200, 3460300800, 3565158400, 3565158400, 3670016000,
> >>>>>    3670016000, 3774873600, 3879731200, 3879731200, 4404019200,
> >>>>>    4404019200,
> >>>>>
> >>>>> I've also done tests in RATIS/THREE, and the results look similar.
> >>>>>
> >>>>>
> >>>>> What I've implemented in the POC basically prevents DN from closing a
> >>>>> container if it was recently written to. This could be implemented
> >>>>> solely in DN with a lastUpdated timestamp in containers, so we
> >>>>> wouldn't need extra RPCs to achieve it. What do you think?
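
A minimal sketch of the DN-only variant, assuming a per-container timestamp and a configurable quiet period. `RecentWriteGuard` and its parameters are hypothetical and do not correspond to existing Ozone classes or configuration keys.

```java
// Hypothetical DN-side guard; the names are illustrative, not Ozone classes.
final class RecentWriteGuard {
  private final long quietMillis;  // idle time required before closing
  private long lastUpdatedMillis;  // time of the most recent write

  RecentWriteGuard(long quietMillis, long createdAtMillis) {
    this.quietMillis = quietMillis;
    this.lastUpdatedMillis = createdAtMillis;
  }

  // Called on every chunk write to the container.
  void onWrite(long nowMillis) {
    lastUpdatedMillis = Math.max(lastUpdatedMillis, nowMillis);
  }

  // A close command is honoured only after the container has been idle
  // for at least quietMillis.
  boolean mayClose(long nowMillis) {
    return nowMillis - lastUpdatedMillis >= quietMillis;
  }
}
```

Because the timestamp is updated by writes the DN already sees, this variant needs no lease RPCs; the trade-off is that a steady writer could keep a container open unless some upper bound is added.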
> >>>>>
> >>>>> Please help verify and give feedback and suggestions.
> >>>>>
> >>>>> Thanks,
> >>>>> Kaijie
> >>>>>
> >>>>> ---
> >>>>>
> >>>>> [1]: https://github.com/kaijchen/ozone/tree/container-lease
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> >>>>> For additional commands, e-mail: dev-h...@ozone.apache.org
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
>
>

-- 
*Sumit Agrawal* | Senior Staff Engineer
cloudera.com <https://www.cloudera.com>
------------------------------

SCM Reserve space for allocated blocks of container.docx
Description: MS-Word 2007 document
