Thanks Sumit for the proposal. In general it looks good. However, IIUC, most of the proposals so far are trying to address one important factor: when to close the container safely. Even with this solution, that turns out to be waiting until the last block expiry time. But the expiry list at least gives us a handle on how many such in-progress blocks remain. Once the list is empty and the container is already beyond its size limit, you propose to close it? Of course, slower writers can face container-close exceptions if they continue to write after the last block expiry time.
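To make sure I am reading the proposal right, here is a minimal sketch of that close check as I understand it (blockExpiryTimeList, usedBytes and sizeLimit are names I am assuming from the proposal, not actual Ozone code):

import java.util.SortedMap;

final class CloseCheckSketch {
  // A container is safe to close only when no in-progress block can
  // still write to it (the expiry list is empty) and it has already
  // grown beyond the configured size limit.
  static boolean canCloseContainer(long usedBytes, long sizeLimit,
      SortedMap<Long, Long> blockExpiryTimeList) {
    return blockExpiryTimeList.isEmpty() && usedBytes >= sizeLimit;
  }
}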
Regards,
Uma

On Wed, Nov 2, 2022 at 9:03 AM Kaijie Chen <c...@apache.org> wrote:
> Hi Sumit,
>
> Thank you for sharing your ideas. The block expiry time list sounds
> appealing.
>
> I have a few questions here:
>
> > CloseBlockNotification ICR handling at SCM:
> > 1) Check for containerID and get the matching entry from
> >    blockExpiryTimeList
> > 2) Remove the block entry from blockExpiryTimeList
> > 3) Add the occupied size to the usedSpace of all previous blocks
> >    still in progress in the list (to avoid an extra block)
>
> Question 1:
> In step 3 above, do we need to change the blocks after the finished
> block as well? (+occupied size, -full block size)
> If so, can we maintain only one usedSpace value? (while the list of
> expiryTime and usedSpace is kept)
>
> Question 2:
> In an openKey request, OM may pre-allocate some blocks for the client.
> How do we set the expiration time for the pre-allocated blocks?
>
> Question 3:
> When is it safe to close the container?
> (I assume when the list is empty and usedSpace exceeds some limit?)
>
> PS: Are you planning to implement a POC to see how it works in a real
> cluster?
>
> Thanks,
> Kaijie
>
> ---- On Wed, 02 Nov 2022 22:08:31 +0800 Sumit Agrawal wrote ---
> > Hi Devs,
> >
> > I have another approach without much impact to the system, keeping
> > some restrictions on usage to minimize the impact.
> >
> > I have attached the proposal, please have a look.
> >
> > Regards
> > Sumit
> >
> > On Wed, Nov 2, 2022 at 3:49 PM Nandakumar Vadivelu
> > <nvadiv...@cloudera.com> wrote:
> > + Sumit Agrawal
> > (He is also working on the design for Reserve Space for Allocated
> > Blocks)
> >
> > On 25-Oct-2022, at 9:18 AM, Kaijie Chen <c...@apache.org> wrote:
> > >
> > > Looking into the AllocateBlock interface, it assumes all blocks
> > > allocated are of the same size.
> > >
> > > List<AllocatedBlock> allocateBlock(long size, int numBlocks,
> > >     ReplicationConfig replicationConfig, String owner,
> > >     ExcludeList excludeList) throws IOException;
> > >
> > > I'm wondering if we can change this API to allocate optimistically,
> > > and track the exact space allocated. Such as,
> > >
> > > List<AllocatedBlock> allocateBlock(long totalSize,
> > >     ReplicationConfig replicationConfig, String owner,
> > >     ExcludeList excludeList) throws IOException;
> > >
> > > Suppose we want to write a 300 MB key, we should expect
> > > 256 MB + 44 MB blocks instead of 256 MB + 256 MB blocks.
> > >
> > > Yes, exceptions could happen and the final block size may vary,
> > > but we should optimize for the most common case.
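> > > As a concrete illustration, the server could plan block sizes
> > > like this (a minimal sketch only; planBlockSizes is a made-up
> > > helper, not part of the proposed interface):
> > >
> > > import java.util.ArrayList;
> > > import java.util.List;
> > >
> > > final class BlockPlanSketch {
> > >   // Split totalSize into full blocks plus one smaller tail block,
> > >   // instead of rounding every block up to blockSize.
> > >   static List<Long> planBlockSizes(long totalSize, long blockSize) {
> > >     List<Long> sizes = new ArrayList<>();
> > >     for (long remaining = totalSize; remaining > 0; remaining -= blockSize) {
> > >       sizes.add(Math.min(remaining, blockSize));
> > >     }
> > >     return sizes; // 300 MB key, 256 MB blocks -> [256 MB, 44 MB]
> > >   }
> > > }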
> > > Best,
> > > Kaijie
> > >
> > > ---- On Thu, 29 Sep 2022 09:54:40 +0800 anu engineer wrote ---
> > >> 15 GB sounds excessive; I would first investigate how that can
> > >> happen, and whether we have some code path that is not explored
> > >> fully, or perhaps a bug in the allocation, or the clients are
> > >> moving too fast for us to respond.
> > >>
> > >> If you think the issue is with the clients being able to get
> > >> leases too fast, I think you need a solution combining tracking
> > >> and leases.
> > >>
> > >> If we can limit two things:
> > >> 1. The maximum number of times you can renew the lease - it limits
> > >> the maximum time a client can force the container to remain open.
> > >> 2. The maximum number of outstanding leases - have a policy, for
> > >> example that we will have only 50% of unallocated space at any
> > >> given time as leases -- that is the proposal that we were
> > >> discussing on the other thread.
> > >>
> > >> Also be aware that this is a soft constraint -- if a large number
> > >> of your containers behave and tend to converge to your expected
> > >> size, overall your system is stable(r).
> > >>
> > >> Thanks
> > >> Anu
> > >>
> > >> On Wed, Sep 28, 2022 at 5:56 AM Kaijie Chen <c...@apache.org> wrote:
> > >>
> > >>> Hi Anu,
> > >>>
> > >>> Thanks for your suggestions. These are indeed areas where we can
> > >>> improve the code. I have something more to share.
> > >>>
> > >>> I did more tests today, and I have observed containers over 15 GB,
> > >>> which is 15 times the configured container size limit (1 GB).
> > >>> It might be related to the pipeline choosing policy and the
> > >>> container close threshold (99%).
> > >>>
> > >>> Because we have no control over how many blocks can be allocated
> > >>> simultaneously, it seems there is a risk of getting abnormally
> > >>> large containers. What do you think?
> > >>>
> > >>> I have also tested the simple delay proposal. It sometimes works
> > >>> well, but sometimes still produces fragmented blocks. This is
> > >>> expected.
> > >>>
> > >>> Kaijie
> > >>>
> > >>> ---- On Wed, 28 Sep 2022 08:00:38 +0800 anu engineer wrote ---
> > >>>> Thank you for the POC, and the numbers from your POC. It looks
> > >>>> very good. I know this is a private POC proposal, yet I have two
> > >>>> minor questions.
> > >>>>
> > >>>> 1. Should we maintain the client ID in the "private final
> > >>>> Map<ContainerID, Long> containerLeases" map? So instead of a Long
> > >>>> we maintain a Long + client ID is what I was thinking. Might be
> > >>>> useful for debugging.
> > >>>> 2. Suppose a client keeps on renewing a container lease, do we
> > >>>> want to enforce a maximum limit? It is not needed per se -- more
> > >>>> like a question that I am asking myself.
> > >>>>
> > >>>> Thanks
> > >>>> Anu
> > >>>>
> > >>>> On Mon, Sep 26, 2022 at 2:42 AM Kaijie Chen <c...@apache.org> wrote:
> > >>>>
> > >>>>> Hi everyone,
> > >>>>>
> > >>>>> I've implemented a container lease POC [1], and the result
> > >>>>> looks good.
> > >>>>>
> > >>>>> Here's what's changed in the POC:
> > >>>>>
> > >>>>> 1. SCM will keep a LeaseExpireAt for each OPEN container. If SCM
> > >>>>>    receives a container close command, it will change the
> > >>>>>    container state to CLOSING, but it will not send the close
> > >>>>>    container command to the DN until the lease expires (see the
> > >>>>>    sketch after this list).
> > >>>>> 2. OM will forward container lease requests from the Client to
> > >>>>>    SCM.
> > >>>>> 3. Client will acquire a lease when a block is allocated (to be
> > >>>>>    improved), and it will renew leases for open blocks before
> > >>>>>    they expire. Client will ignore any errors with leases, and
> > >>>>>    keep writing chunks to the DN even if the lease expires,
> > >>>>>    because the worst case is simply a ContainerNotOpenException.
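> > >>>>> A simplified sketch of the deferral in point 1 (illustrative
> > >>>>> shape only, not the exact code in [1]; markClosing and
> > >>>>> sendCloseCommandToDatanodes are placeholders):
> > >>>>>
> > >>>>> import java.util.HashMap;
> > >>>>> import java.util.Iterator;
> > >>>>> import java.util.Map;
> > >>>>>
> > >>>>> final class LeaseDeferredClose {
> > >>>>>   // containerId -> LeaseExpireAt (epoch millis)
> > >>>>>   private final Map<Long, Long> containerLeases = new HashMap<>();
> > >>>>>
> > >>>>>   // Close request arrives: flip the container to CLOSING now,
> > >>>>>   // but do not tell the datanodes yet.
> > >>>>>   void handleCloseContainer(long containerId) {
> > >>>>>     markClosing(containerId);
> > >>>>>   }
> > >>>>>
> > >>>>>   // Periodic check: only after the lease expires is the close
> > >>>>>   // command actually sent to the datanodes.
> > >>>>>   void checkLeases(long nowMillis) {
> > >>>>>     Iterator<Map.Entry<Long, Long>> it =
> > >>>>>         containerLeases.entrySet().iterator();
> > >>>>>     while (it.hasNext()) {
> > >>>>>       Map.Entry<Long, Long> e = it.next();
> > >>>>>       if (nowMillis >= e.getValue()) {
> > >>>>>         sendCloseCommandToDatanodes(e.getKey());
> > >>>>>         it.remove();
> > >>>>>       }
> > >>>>>     }
> > >>>>>   }
> > >>>>>
> > >>>>>   private void markClosing(long containerId) { /* placeholder */ }
> > >>>>>   private void sendCloseCommandToDatanodes(long id) { /* placeholder */ }
> > >>>>> }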
> > >>>>> Although this POC is not perfect, the results in my tests look
> > >>>>> good.
> > >>>>>
> > >>>>> Cluster: 48 datanodes on 4 machines
> > >>>>> Client: Ozone freon ockg
> > >>>>> Threads: 100
> > >>>>> Key count: 1000
> > >>>>> Key size: 1000 MB
> > >>>>> ReplicationConfig: EC/RS-10-4-1024K
> > >>>>>
> > >>>>> We should expect 14000x 100 MB blocks in the ideal condition.
> > >>>>> I'm only showing the data from 1 of the 4 machines.
> > >>>>>
> > >>>>> Before the change (commit 1cf5678224bf00dee580ffdb14ab8b650cc1e2e0):
> > >>>>> (The number before each size is the count of blocks of that size)
> > >>>>>
> > >>>>> 15 1.0M 48 2.0M 40 3.0M 48 4.0M 37 5.0M 33 6.0M 48 7.0M 51 8.0M
> > >>>>> 30 9.0M 49 10M 40 11M 65 12M 33 13M 18 14M 43 15M 46 16M 38 17M
> > >>>>> 20 18M 46 19M 32 20M 5 21M 54 22M 58 23M 33 24M 25 25M 39 26M
> > >>>>> 44 27M 48 28M 25 29M 18 30M 34 31M 42 32M 22 33M 23 34M 27 35M
> > >>>>> 26 36M 33 37M 27 38M 30 39M 60 40M 25 41M 27 42M 26 43M 20 44M
> > >>>>> 13 45M 18 46M 40 47M 27 48M 25 49M 15 50M 40 51M 26 52M 41 53M
> > >>>>> 41 54M 9 55M 11 56M 11 57M 19 58M 30 59M 28 60M 44 61M 36 62M
> > >>>>> 21 63M 14 64M 19 65M 14 66M 23 67M 33 68M 40 69M 34 70M 17 71M
> > >>>>> 10 72M 35 73M 28 74M 24 75M 21 76M 34 77M 26 78M 35 79M 18 80M
> > >>>>> 27 81M 26 82M 14 83M 19 84M 23 85M 29 86M 4 87M 23 88M 37 89M
> > >>>>> 11 90M 23 91M 38 92M 16 93M 12 94M 18 95M 21 96M 27 97M 19 98M
> > >>>>> 35 99M 2099 100M
> > >>>>>
> > >>>>> Container sizes before the change:
> > >>>>>
> > >>>>> $ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
> > >>>>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > >>>>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > >>>>> 1001390080, 1002438656, 1003487232, 1003487232, 1004535808, 1004535808,
> > >>>>> 1004535808, 1004535808, 1006632960, 1007681536, 1010827264, 1011875840,
> > >>>>> 1011875840, 1011875840, 1013972992, 1016070144, 1016070144, 1016070144,
> > >>>>> 1019215872, 1024458752, 1028653056, 1028653056, 1031798784, 1032847360,
> > >>>>> 1032847360, 1032847360, 1033895936, 1035993088, 1044381696, 1046478848,
> > >>>>> 1050673152, 1062207488, 1092616192, 1096810496, 968884224, 968884224,
> > >>>>> 970981376, 970981376, 972029952, 972029952, 973078528, 973078528,
> > >>>>> 974127104, 974127104, 975175680, 976224256, 976224256, 976224256,
> > >>>>> 976224256, 976224256, 976224256, 976224256, 976224256, 979369984,
> > >>>>> 980418560, 980418560, 980418560, 981467136, 981467136, 983564288,
> > >>>>> 983564288, 983564288, 984612864, 984612864, 984612864, 985661440,
> > >>>>> 985661440, 985661440, 985661440, 986710016, 986710016, 987758592,
> > >>>>> 987758592, 988807168, 988807168, 989855744, 989855744, 989855744,
> > >>>>> 989855744, 990904320, 990904320, 990904320, 990904320, 990904320,
> > >>>>> 990904320, 991952896, 991952896, 993001472, 994050048, 996147200,
> > >>>>> 997195776, 998244352, 998244352,
> > >>>>>
> > >>>>> After the change (commit 52c903ccc644aba63bbd5354bae98bc8bbe13675):
> > >>>>> (Occasionally, there are a few blocks broken into smaller ones)
> > >>>>>
> > >>>>> 3571 100M
> > >>>>>
> > >>>>> Container sizes after the change:
> > >>>>>
> > >>>>> **Note: "ozone.scm.container.size" was set to 1G**
> > >>>>> **Note: "hdds.datanode.storage.utilization.critical.threshold" was set to 0.99**
> > >>>>>
> > >>>>> $ ./ozone admin container list -c 10000 | grep usedBytes | awk '{print $3}' | sort | xargs echo
> > >>>>> 0, 1258291200, 1258291200, 1363148800, 1468006400, 1782579200,
> > >>>>> 1887436800, 1887436800, 1992294400, 2306867200, 2621440000, 2621440000,
> > >>>>> 2726297600, 2831155200, 2831155200, 2936012800, 2936012800, 3040870400,
> > >>>>> 3040870400, 3040870400, 3040870400, 3040870400, 3145728000, 3250585600,
> > >>>>> 3250585600, 3355443200, 3355443200, 3460300800, 3565158400, 3565158400,
> > >>>>> 3670016000, 3670016000, 3774873600, 3879731200, 3879731200, 4404019200,
> > >>>>> 4404019200,
> > >>>>>
> > >>>>> I've also done tests in RATIS/THREE, and the results look similar.
> > >>>>>
> > >>>>> What I've implemented in the POC is basically: don't let the DN
> > >>>>> close a container if it was recently written to. And it could be
> > >>>>> implemented solely in the DN with a lastUpdated timestamp in each
> > >>>>> container, so we won't need extra RPCs to achieve this. What do
> > >>>>> you think?
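> > >>>>> For illustration, the DN-only variant could look roughly like
> > >>>>> this (a sketch; the class, field, and parameter names are made
> > >>>>> up, not code from [1]):
> > >>>>>
> > >>>>> final class RecentWriteCloseGuard {
> > >>>>>   // Bumped on every successful chunk write to the container.
> > >>>>>   private volatile long lastUpdatedMillis;
> > >>>>>
> > >>>>>   void onChunkWrite() {
> > >>>>>     lastUpdatedMillis = System.currentTimeMillis();
> > >>>>>   }
> > >>>>>
> > >>>>>   // When a close command arrives, defer it while the container
> > >>>>>   // was written to within the grace period.
> > >>>>>   boolean shouldDeferClose(long graceMillis) {
> > >>>>>     return System.currentTimeMillis() - lastUpdatedMillis < graceMillis;
> > >>>>>   }
> > >>>>> }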
> > >>>>> Please help verify and give feedback and suggestions.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Kaijie
> > >>>>>
> > >>>>> ---
> > >>>>>
> > >>>>> [1]: https://github.com/kaijchen/ozone/tree/container-lease
> >
> > --
> > Sumit Agrawal | Senior Staff Engineer
> > cloudera.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> For additional commands, e-mail: dev-h...@ozone.apache.org