Thanks Kaijie and Uma,

Question 1,
In step 3 above, do we need to change the blocks after the finished block as
well? (+occupied size -full block size)
If so, can we maintain only one usedSpace value? (the list of expiryTime
and usedSpace are kept)
>>
Once a block has expired, the actual space used by that block is no longer
known. I tried to track it somehow, but it seems complex and inefficient.

I will change the logic as below:

*container capacity:*
totalWrittenBlockSize: <blocks written as confirmed by ICRs + container
usage on first allocation>
containerUsedCapacity: Max(nrOfBlockReserve * BlockSize +
totalWrittenBlockSize, currentUsagesOfContainer)

Extra space consumed by a slow writer will reduce the actual capacity
available to other clients, but those clients can retry with a new block.
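
For illustration, a minimal sketch of the above calculation (variable names
such as nrOfReservedBlocks and currentContainerUsage are placeholders here,
not the actual SCM fields):

    // Sketch only; all names are placeholders, not the real SCM code.
    static long containerUsedCapacity(long nrOfReservedBlocks, long blockSize,
        long totalWrittenBlockSize, long currentContainerUsage) {
      long reservedSpace = nrOfReservedBlocks * blockSize;
      return Math.max(reservedSpace + totalWrittenBlockSize,
          currentContainerUsage);
    }

    // SCM would allow a new block only while a full block still fits:
    //   containerUsedCapacity(...) + blockSize <= containerSizeLimit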

Updated the doc with this.

Question 2,
In the openKey request, OM may pre-allocate some blocks for the client.
How do we set the expiration time for the pre-allocated blocks?

>> Multiple blocks are pre-allocated via SCM, so we can stagger the expiry
times using that information:
    block1: expiry time: current Time + 1*BlockExpiryTime
    block2: expiry time: current Time + 2*BlockExpiryTime
    ...
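
In code form, roughly (a sketch only; blockExpiryTimeMs stands for the
configured per-block expiry interval, and the names are placeholders):

    // Sketch: staggered expiry times for n pre-allocated blocks.
    long now = System.currentTimeMillis();
    for (int i = 1; i <= numPreAllocatedBlocks; i++) {
      long expiryTime = now + i * blockExpiryTimeMs;
      // record expiryTime for the i-th block in blockExpiryTimeList
    }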


Question 3,
When is it safe to close the container?
(I assume when the list is empty plus usedSpace exceeds some limit?)

>>
Currently the DN initiates closure of a container when it reaches the
threshold (90% of container usage), and SCM triggers the close event.

We can use the criteria below, with the following impact (a rough sketch of
the combined close decision follows after this list):
1. Since SCM already checks at block allocation time and tries to avoid
over-subscription,
    -- the DN can notify SCM when the container is full, since the actual
write is handled by the DN. The container-full condition can be 100% of
container usage.
        Here, *the DN can fail hard on further writes, similar to a closed
container; the client retries by allocating a new block and continues.*
    -- A slow-writer client needs to retry with a new block, as is the case
today.
       If slow clients need to be supported more often, the block expiry
time can be increased.
2. If the list is empty and the space available in the container is less
than the block size, the container is no longer usable and SCM can trigger
a close in this scenario.
    Slow-writer clients will be impacted, but they can create a new block
and continue writing with it.
    This impact will be smaller than in the current behavior, because the
DN was initiating the close at 90% of the container size by default (i.e.
with roughly twice the block size still free).
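
A minimal sketch of how SCM might combine these criteria (placeholder names,
not the actual SCM code):

    // Sketch of the close decision described above.
    boolean containerFull =                      // criterion 1, reported by DN
        currentContainerUsage >= containerSizeLimit;
    boolean noPendingBlocks = blockExpiryTimeList.isEmpty();
    boolean noRoomForNewBlock =                  // criterion 2
        containerSizeLimit - containerUsedCapacity < blockSize;
    if (containerFull || (noPendingBlocks && noRoomForNewBlock)) {
      // SCM triggers the container close event
    }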



PS: Are you planning to implement a POC to see how it works in a real cluster?

I will try to simulate the above scenarios and test using Freon and a Java
client, with around 3-5 datanodes.

Regards
Sumit

On Fri, Nov 4, 2022 at 1:51 AM Uma Maheswara Rao Gangumalla <
umaganguma...@gmail.com> wrote:

> Thanks Sumit for the proposal. In general it looks good. However IIUC, most
> of the proposals so far here are trying to address one important factor,
> that is when to close the container safely. Even with this solution that is
> turning out to be waiting until the last block expiry time. But this seems
> like giving a hold to understand how many such blocks are there in the
> expiry list. Once the list is empty and the container already beyond size,
> you propose to close? Of course, slower writers can face container close
> exceptions if they continue to write after the last block expiry time.
>
> Regards,
> Uma
>
> On Wed, Nov 2, 2022 at 9:03 AM Kaijie Chen <c...@apache.org> wrote:
>
> > Hi Sumit,
> >
> > Thank you for sharing your ideas. The block expiry time list sounds
> > appealing.
> >
> > I have a few questions here:
> >
> > > CloseBlockNotification ICR handling at SCM:
> > > 1) Check for containerID and get the matching entry from
> > blockExpiryTimeList
> > > 2) Remove the block entry from blockExpiryTimeList
> > > 3) Add the occupied size to all other usedSpace (To avoid extra block)
> > to previous blocks still in-progress in list
> >
> > Question 1,
> > In step 3 above, do we need change the blocks after the finished block as
> > well? (+occupied size -full block size)
> > If so, can we maintain only one usedSpace value? (the list of expiryTime
> > and usedSpace are keeped)
> >
> > Question 2,
> > In openKey request, OM may pre-allocate some blocks for the client,
> > How do we set the expiration time for the pre-allocated blocks?
> >
> > Question 3,
> > When it's safe to close the container?
> > (I assume when the list is empty plus usedSpace exceeds some limit?)
> >
> > PS: Are you planning to implement a POC to see how it works in real
> > cluster?
> >
> > Thanks,
> > Kaijie
> >
> >  ---- On Wed, 02 Nov 2022 22:08:31 +0800  Sumit Agrawal  wrote ---
> >  > Hi Devs,
> >  > I have another approach without have much impact to system keeping
> some
> > restrictions as usages and minimize the impact.
> >  > I have attached the proposal, please have a look.
> >  > RegardsSumit
> >  > On Wed, Nov 2, 2022 at 3:49 PM Nandakumar Vadivelu
> > nvadiv...@cloudera.com> wrote:
> >  > + Sumit Agrawal
> >  > (He is also working on the design for Reserve Space for Allocated
> > Blocks)
> >  >
> >  > > On 25-Oct-2022, at 9:18 AM, Kaijie Chen c...@apache.org> wrote:
> >  > >
> >  > > Looking into the AllocateBlock interface, it assumes all blocks
> > allocated
> >  > > are in the same size.
> >  > >
> >  > >    List allocateBlock(long size, int numBlocks,
> >  > >        ReplicationConfig replicationConfig, String owner,
> >  > >        ExcludeList excludeList) throws IOException;
> >  > >
> >  > > I'm wondering if we can change this API to allocate optimistically,
> >  > > and track the exact space allocated. Such as,
> >  > >
> >  > >    List allocateBlock(long totalSize,
> >  > >        ReplicationConfig replicationConfig, String owner,
> >  > >        ExcludeList excludeList) throws IOException;
> >  > >
> >  > > Suppose we want to write a 300 MB key, we should expect
> >  > > 256 MB + 44 MB blocks instead of 256 MB + 256 MB blocks.
> >  > >
> >  > > Yes, exceptions could happen and the final block size may vary,
> >  > > but we should optimize for the most common case.
> >  > >
> >  > > Best,
> >  > > Kaijie
> >  > >
> >  > > ---- On Thu, 29 Sep 2022 09:54:40 +0800  anu engineer  wrote ---
> >  > >> 15 GB sounds excessive; I would first investigate how that can
> > happen and
> >  > >> if we have some sort of path this is not explored fully or perhaps
> a
> > bug,
> >  > >> in the allocation or the client are moving too fast for us to
> > respond.
> >  > >>
> >  > >> If you think the issue is with the clients being able to get leases
> > too
> >  > >> fast, I think that you need a solution combination of tracking and
> > leases.
> >  > >>
> >  > >> if we can limit, two things :
> >  > >> 1. The maximum times you can renew the lease - It limits the
> maximum
> > time a
> >  > >> client can force the container to remain open.
> >  > >> 2. The maximum number of outstanding leases - Have a policy, for
> > example if
> >  > >> you can say that we will have only 50% of unallocated space at any
> > given
> >  > >> time as leases -- That is the proposal that we were discussing on
> > the other
> >  > >> thread.
> >  > >>
> >  > >>
> >  > >> Also be aware that this is a soft constraint -- if a large number
> of
> > your
> >  > >> containers behave and tend to converge to your expected size,
> > overall your
> >  > >> system is stable(r).
> >  > >>
> >  > >>
> >  > >> Thanks
> >  > >> Anu
> >  > >>
> >  > >>
> >  > >>
> >  > >>
> >  > >> On Wed, Sep 28, 2022 at 5:56 AM Kaijie Chen c...@apache.org> wrote:
> >  > >>
> >  > >>> Hi Anu,
> >  > >>>
> >  > >>> Thanks for your suggestions. These are indeed where we can
> >  > >>> improve the code. I have something more to share.
> >  > >>>
> >  > >>> I did more tests today, and I have observed containers over 15 GB,
> >  > >>> which is 15 times of the configured container size limit (1 GB).
> >  > >>> It might be related to the pipeline chosing policy and the
> container
> >  > >>> close threshold (99%).
> >  > >>>
> >  > >>> Because we have no control of how many block can be allocated
> >  > >>> simultaneously, it seems there is risk we can get abnormally
> >  > >>> large containers. What do you think?
> >  > >>>
> >  > >>> I have also tested the simple delay proposal. It sometimes works
> > well.
> >  > >>> But sometimes still produces fragmented blocks. This is expected.
> >  > >>>
> >  > >>> Kaijie
> >  > >>>
> >  > >>> ---- On Wed, 28 Sep 2022 08:00:38 +0800  anu engineer  wrote ---
> >  > >>>> Thank you for the POC, and the numbers from your POC. It looks
> very
> >  > >>> good.
> >  > >>>> I know this is a private POCproposal, yet I have two minor
> > questions.
> >  > >>>>
> >  > >>>> 1.  Should we maintain the client ID in  "private final
> > Map<ContainerID,
> >  > >>>> Long> containerLeases" map ? so instead of a long we maintain a
> > Long +
> >  > >>>> Client ID is what I was thinking. Might be useful for debugging.
> >  > >>>> 2. Suppose a client keeps on renewing a container lease, do we
> > want to
> >  > >>>> enforce a maximum limit ? It is not needed per se -- more like a
> >  > >>> question
> >  > >>>> that I am asking myself.
> >  > >>>>
> >  > >>>> Thanks
> >  > >>>> Anu
> >  > >>>>
> >  > >>>>
> >  > >>>>
> >  > >>>>
> >  > >>>> On Mon, Sep 26, 2022 at 2:42 AM Kaijie Chen c...@apache.org>
> wrote:
> >  > >>>>
> >  > >>>>> Hi everyone,
> >  > >>>>>
> >  > >>>>> I've implemented a container lease POC [1], and the result looks
> > good.
> >  > >>>>>
> >  > >>>>> Here's what's changed in the POC:
> >  > >>>>>
> >  > >>>>> 1. SCM will keep a LeaseExipreAt for each OPEN container. If SCM
> >  > >>>>>    receives container close command, it will change the
> container
> >  > >>>>>    state to CLOSING, but it will not send close container
> command
> >  > >>>>>    to DN until the lease expires.
> >  > >>>>> 2. OM will forward the container lease request from Client to
> SCM.
> >  > >>>>> 3. Client will acquire lease when a block is allocated (to be
> >  > >>> improved),
> >  > >>>>>    and it will renew leases for open blocks before its
> expiration.
> >  > >>>>>    Client will ignore any errors with leases, and keep writing
> > chunks
> >  > >>>>>    to DN even if lease expires. Because the wrost case is simply
> >  > >>>>>    ContainerNotOpenException.
> >  > >>>>>
> >  > >>>>> Despite this POC is not perfect, the result in my tests looks
> > good.
> >  > >>>>>
> >  > >>>>> Cluster: 48 datanodes on 4 machines
> >  > >>>>> Client: Ozone freon ockg
> >  > >>>>> Threads: 100
> >  > >>>>> Key count: 1000
> >  > >>>>> Key size: 1000 MB
> >  > >>>>> ReplicationConfig: EC/RS-10-4-1024K
> >  > >>>>>
> >  > >>>>> We should expect 14000x 100 MB blocks in ideal condition.
> >  > >>>>> I'm only showing the data from 1 of the 4 machines.
> >  > >>>>>
> >  > >>>>>
> >  > >>>>> Before the change (commit
> > 1cf5678224bf00dee580ffdb14ab8b650cc1e2e0):
> >  > >>>>>    (The number before each sizes is the count of blocks in that
> > size)
> >  > >>>>>
> >  > >>>>>    15 1.0M 48 2.0M 40 3.0M 48 4.0M 37 5.0M 33 6.0M 48 7.0M 51
> 8.0M
> >  > >>>>>    30 9.0M 49 10M 40 11M 65 12M 33 13M 18 14M 43 15M 46 16M 38
> 17M
> >  > >>>>>    20 18M 46 19M 32 20M 5 21M 54 22M 58 23M 33 24M 25 25M 39 26M
> >  > >>>>>    44 27M 48 28M 25 29M 18 30M 34 31M 42 32M 22 33M 23 34M 27
> 35M
> >  > >>>>>    26 36M 33 37M 27 38M 30 39M 60 40M 25 41M 27 42M 26 43M 20
> 44M
> >  > >>>>>    13 45M 18 46M 40 47M 27 48M 25 49M 15 50M 40 51M 26 52M 41
> 53M
> >  > >>>>>    41 54M 9 55M 11 56M 11 57M 19 58M 30 59M 28 60M 44 61M 36 62M
> >  > >>>>>    21 63M 14 64M 19 65M 14 66M 23 67M 33 68M 40 69M 34 70M 17
> 71M
> >  > >>>>>    10 72M 35 73M 28 74M 24 75M 21 76M 34 77M 26 78M 35 79M 18
> 80M
> >  > >>>>>    27 81M 26 82M 14 83M 19 84M 23 85M 29 86M 4 87M 23 88M 37 89M
> >  > >>>>>    11 90M 23 91M 38 92M 16 93M 12 94M 18 95M 21 96M 27 97M 19
> 98M
> >  > >>>>>    35 99M 2099 100M
> >  > >>>>>
> >  > >>>>> Container size before the change:
> >  > >>>>>
> >  > >>>>>    $ ./ozone admin container list -c 10000 | grep usedBytes |
> awk
> >  > >>> '{print
> >  > >>>>> $3}' | sort | xargs echo
> >  > >>>>>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0,
> > 0,
> >  > >>> 0,
> >  > >>>>> 0, 0, 0, 1001390080,
> >  > >>>>>    1002438656, 1003487232, 1003487232, 1004535808, 1004535808,
> >  > >>> 1004535808,
> >  > >>>>>    1004535808, 1006632960, 1007681536, 1010827264, 1011875840,
> >  > >>> 1011875840,
> >  > >>>>>    1011875840, 1013972992, 1016070144, 1016070144, 1016070144,
> >  > >>> 1019215872,
> >  > >>>>>    1024458752, 1028653056, 1028653056, 1031798784, 1032847360,
> >  > >>> 1032847360,
> >  > >>>>>    1032847360, 1033895936, 1035993088, 1044381696, 1046478848,
> >  > >>> 1050673152,
> >  > >>>>>    1062207488, 1092616192, 1096810496, 968884224, 968884224,
> >  > >>> 970981376,
> >  > >>>>>    970981376, 972029952, 972029952, 973078528, 973078528,
> > 974127104,
> >  > >>>>>    974127104, 975175680, 976224256, 976224256, 976224256,
> > 976224256,
> >  > >>>>>    976224256, 976224256, 976224256, 976224256, 979369984,
> > 980418560,
> >  > >>>>>    980418560, 980418560, 981467136, 981467136, 983564288,
> > 983564288,
> >  > >>>>>    983564288, 984612864, 984612864, 984612864, 985661440,
> > 985661440,
> >  > >>>>>    985661440, 985661440, 986710016, 986710016, 987758592,
> > 987758592,
> >  > >>>>>    988807168, 988807168, 989855744, 989855744, 989855744,
> > 989855744,
> >  > >>>>>    990904320, 990904320, 990904320, 990904320, 990904320,
> > 990904320,
> >  > >>>>>    991952896, 991952896, 993001472, 994050048, 996147200,
> > 997195776,
> >  > >>>>>    998244352, 998244352,
> >  > >>>>>
> >  > >>>>>
> >  > >>>>> After the change (commit
> > 52c903ccc644aba63bbd5354bae98bc8bbe13675):
> >  > >>>>>    (Occasionally, there are a few blocks breaked into smaller
> > ones)
> >  > >>>>>
> >  > >>>>>    3571 100M
> >  > >>>>>
> >  > >>>>> Container sizes after the change:
> >  > >>>>>
> >  > >>>>>    **Note: "ozone.scm.container.size" was set to 1G**
> >  > >>>>>    **Note:
> "hdds.datanode.storage.utilization.critical.threshold"
> >  > >>> was set
> >  > >>>>> to 0.99**
> >  > >>>>>
> >  > >>>>>    $ ./ozone admin container list -c 10000 | grep usedBytes |
> awk
> >  > >>> '{print
> >  > >>>>> $3}' | sort | xargs echo
> >  > >>>>>    0, 1258291200, 1258291200, 1363148800, 1468006400,
> 1782579200,
> >  > >>>>> 1887436800,
> >  > >>>>>    1887436800, 1992294400, 2306867200, 2621440000, 2621440000,
> >  > >>> 2726297600,
> >  > >>>>>    2831155200, 2831155200, 2936012800, 2936012800, 3040870400,
> >  > >>> 3040870400,
> >  > >>>>>    3040870400, 3040870400, 3040870400, 3145728000, 3250585600,
> >  > >>> 3250585600,
> >  > >>>>>    3355443200, 3355443200, 3460300800, 3565158400, 3565158400,
> >  > >>> 3670016000,
> >  > >>>>>    3670016000, 3774873600, 3879731200, 3879731200, 4404019200,
> >  > >>> 4404019200,
> >  > >>>>>
> >  > >>>>> I've also done tests in RATIS/THREE, the results looks similiar.
> >  > >>>>>
> >  > >>>>>
> >  > >>>>> What I've implemented in POC is basically don't let DN close a
> >  > >>>>> container if it is recently written to. And it could be
> > implemented
> >  > >>>>> solely in DN by a lastUpdated timestamp in containers.
> >  > >>>>> So we won't need extra RPCs to achieve this, what do you think?
> >  > >>>>>
> >  > >>>>> Please help verify and give feedbacks and suggestions.
> >  > >>>>>
> >  > >>>>> Thanks,
> >  > >>>>> Kaijie
> >  > >>>>>
> >  > >>>>> ---
> >  > >>>>>
> >  > >>>>> [1]: https://github.com/kaijchen/ozone/tree/container-lease
> >  > >>>>>
> >  > >>>>>
> > ---------------------------------------------------------------------
> >  > >>>>> To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> >  > >>>>> For additional commands, e-mail: dev-h...@ozone.apache.org
> >  > >>>>>
> >  > >>>>>
> >  > >>>>
> >  > >>>
> >  > >>>
> > ---------------------------------------------------------------------
> >  > >>> To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> >  > >>> For additional commands, e-mail: dev-h...@ozone.apache.org
> >  > >>>
> >  > >>>
> >  > >>
> >  > >
> >  > >
> ---------------------------------------------------------------------
> >  > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> >  > > For additional commands, e-mail: dev-h...@ozone.apache.org
> >  > >
> >  >
> >  >
> >  >
> >  > --
> >  > Sumit Agrawal | Senior Staff Engineer
> >  >
> >  > cloudera.com
> >  >
> >  >
> >  >
> >  >
> >  >
> >  >
> >  >
> >  >
> >  >
> >  >
> >  >
> >  > ---------------------------------------------------------------------
> >  > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> >  > For additional commands, e-mail: dev-h...@ozone.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > For additional commands, e-mail: dev-h...@ozone.apache.org
> >
> >
>


-- 
*Sumit Agrawal* | Senior Staff Engineer
cloudera.com <https://www.cloudera.com>

Attachment: SCM Reserve space for allocated blocks of container_V1.docx
Description: MS-Word 2007 document

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
For additional commands, e-mail: dev-h...@ozone.apache.org
