I am not sure I understand every bit of the discussion, but let me add some thoughts here as well.
Waiting on clients to finish writes and only then checking whether the container is full sounds like a bad idea to me: it would tie the number of writes that can be done in parallel to the available space in the open containers, and with that it could decrease write performance. The best solution would be the simplest one, and someone may already have suggested something like this, but here is my idea for a very simple logic.

Some background to establish where I am coming from:
- SCM, as I understand it, knows how much space is still unused in a container, and it can track block allocations against the remaining space, as it is the service that actually allocates blocks.
- A container is a directory on a DataNode, so the 5GB limit is just an arbitrary number we came up with as the container size. If a container grows to 10GB, then that is just that; there are no further implications of a larger container (except for re-replication and validating the health of the container).
- Our 256MB block size is also an arbitrary number. It is not based on any science or on data from clusters about average file sizes; we just made it up by doubling the value HDFS used. It is simply the unit of allocation and an upper limit on one unit of uniquely identifiable data.

Now, if SCM holds a counter per container with the number of open allocations, and allows over-allocation by 50% of the remaining space in the container, we will end up with containers somewhat larger than 5GB in the average case, but we get a simple system that ensures allocated blocks can be finished by the clients. Here is how:
- At allocation, SCM checks whether the available space and the number of open allocations allow a new block to be allocated in the container. If not, it checks another open container; if no open container can allocate more blocks, it creates a new container.
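To make the bookkeeping concrete, here is a rough Java sketch of the allocation check I have in mind. Everything in it (the class name, `tryAllocateBlock`, `blockFinished`) is made up for illustration; it is not the actual SCM code:

```java
// Rough sketch only; all names here are hypothetical, not actual SCM code.
public class ContainerAllocationTracker {
  private static final long SOFT_LIMIT = 5L * 1024 * 1024 * 1024;  // 5GB container soft limit
  private static final long BLOCK_SIZE = 256L * 1024 * 1024;       // 256MB block unit

  private long usedBytes;       // bytes already committed to the container
  private int openAllocations;  // blocks allocated but not yet finalized

  // Admit a new block only if the worst-case container size (committed data
  // plus every outstanding allocation written out in full) stays within the
  // soft limit plus 50% of the currently remaining space.
  public synchronized boolean tryAllocateBlock() {
    long remaining = Math.max(0, SOFT_LIMIT - usedBytes);
    long worstCase = usedBytes + (openAllocations + 1) * BLOCK_SIZE;
    if (worstCase > SOFT_LIMIT + remaining / 2) {
      return false;  // caller tries another open container, or creates a new one
    }
    openAllocations++;
    return true;
  }

  // Called when the DN reports the final putBlock (via ICR or heartbeat),
  // or when the abandoned-block cleanup routine frees a half-written block.
  public synchronized void blockFinished(long bytesWritten) {
    openAllocations--;
    usedBytes += bytesWritten;
  }
}
```

On an empty container this admits 30 outstanding 256MB allocations (30 × 256MB = 7.5GB worst case), which is where the 7.5GB worst-case container size below comes from.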
(A block can be allocated if the theoretical size of the container, considering the open allocations, cannot exceed the 5GB soft limit by more than 50% of the remaining space in the container.)
- At the final putBlock, the DN reports in the ICR or in the heartbeat that the block write is finished in container X, so that SCM can decrement the counter.
- If a block is abandoned, (afaik) we have a routine that deletes the half-written block data; that routine can also ask SCM to decrement the counter while freeing up the space.
- If a block has been abandoned for over a week (the token lifetime), we can decrement the counter (though I believe this should already be handled by abandoned block detection).
- If the container's open allocations drop enough that the container size can no longer exceed the 5GB soft limit, SCM can open the container up for allocations again.
- If all open allocations are closed (the counter is 0) and the container exceeds the 5GB soft limit, SCM closes the container.

This logic seems simple enough. It does not require a lease, as we do not have one today either and we somehow (as far as I know) handle abandoned blocks. It also does not require any big changes in the protocols; the only things needed are to store the counter in SCM's container data structure and to report from the DN to SCM that the block had its final putBlock call.

In case of a DDoS-style block allocation, we will end up with enough containers to hold the data, and we can have 7.5GB containers in the worst case, while in the general case we will most likely have containers around 5GB. Of course, if needed, we can introduce a hard limit on the number of open containers (pipelines), and if we exceed that, the client would need to wait to allocate more...
(some form of QoS mentioned in the doc), but I am not really a fan of this idea, as it introduces possibly large delays, and not just for the rogue client. Do we know what the consequence would be of having, say, 1000 open containers on a 10-node cluster? (That would mean ~5TB of open allocations...)

Please let me know if this idea fails to account for some problems. I drafted it quickly to come up with something fairly simple that I believe helps solve the problem with the least amount of disruption, but I possibly have not thought through all the failure scenarios, and I have not checked what is actually in the code at the moment. I just wanted to share my idea quickly and validate it with folks who know the actual code in the codebase better, to see whether it is a feasible alternative.

Pifta

anu engineer <anu.engin...@gmail.com> wrote (on Fri, Sep 9, 2022, 19:10):

> Extending the same thought from Steven. If you are going to do a small
> delay, it is better to do it via a Lease.
>
> So SCM could offer a lease for 60 seconds, with a provision to reacquire
> the lease one more time.
> This does mean that a single container inside the data node technically
> could become larger than 5GB (but that is possible even today).
>
> I do think a lease or a timeout based approach (as suggested by Steven)
> might be easier than pre-allocating blocks.
>
> Thanks
> Anu
>
>
> On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell
> <sodonn...@cloudera.com.invalid> wrote:
>
> > > 4. Datanode wait until no write commands to this container, then close
> > it.
> >
> > This could be done on SCM, with a simple delay. Ie hold back the close
> > commands for a "normal" close for some configurable amount of time. Eg if
> > we hold for 60 seconds, it is likely almost all blocks will get written. If
> > a very small number fail, it is likely OK.
> >
> > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen <c...@apache.org> wrote:
> >
> > > Thanks Ethan.
> > > Yes this could be a simpler solution.
> > > The main idea is allowing container size limit to be exceeded,
> > > to ensure all allocated blocks can be finished.
> > >
> > > We can change it to something like this:
> > > 1. Datanode notices the container is near full.
> > > 2. Datanode sends close container action to SCM immediately.
> > > 3. SCM closes the container and stops allocating new blocks in it.
> > > 4. Datanode wait until no write commands to this container, then close it.
> > >
> > > It's still okay to wait for the next heartbeat in step 2.
> > > Step 4 is a little bit tricky, we need a lease or timeout to determine the
> > > time.
> > >
> > > Kaijie
> > >
> > > ---- On Fri, 09 Sep 2022 08:54:34 +0800 Ethan Rose wrote ---
> > > > I believe the flow is:
> > > > 1. Datanode notices the container is near full.
> > > > 2. Datanode sends close container action to SCM on its next heartbeat.
> > > > 3. SCM closes the container and sends a close container command on the
> > > > heartbeat response.
> > > > 4. Datanodes get the response and close the container. If it is a Ratis
> > > > container, the leader will send the close via Ratis.
> > > >
> > > > There is a "grace period" of sorts between steps 1 and 2, but this does
> > > > not help the situation because SCM does not stop issuing blocks to this
> > > > container until after step 3. Perhaps some amount of pause between steps 3
> > > > and 4 would help, either on the SCM or datanode side. This would provide a
> > > > "grace period" between when SCM stops allocating blocks for the container
> > > > and when the container is actually closed. I'm not sure exactly how this
> > > > would be implemented in the code given the current setup, but it seems like
> > > > a simple option we should try before other more complicated solutions.
> > > >
> > > > Ethan
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > For additional commands, e-mail: dev-h...@ozone.apache.org
> > >

-- 
Pifta
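PS: the open/close decision at the end of my bullet list could be as simple as the following sketch (again, hypothetical names and a sketch only, not the real SCM code):

```java
// Hypothetical sketch of the close/reopen decision described in the list above.
public final class ContainerStateHelper {
  private static final long SOFT_LIMIT = 5L * 1024 * 1024 * 1024;  // 5GB soft limit
  private static final long BLOCK_SIZE = 256L * 1024 * 1024;       // 256MB block unit

  // Close once every open allocation has been finalized (counter == 0)
  // and the container already exceeds the soft limit.
  public static boolean shouldClose(long usedBytes, int openAllocations) {
    return openAllocations == 0 && usedBytes > SOFT_LIMIT;
  }

  // Reopen for allocations once the outstanding allocations can no longer
  // push the container past the soft limit, even if all are written in full.
  public static boolean canReopen(long usedBytes, int openAllocations) {
    return usedBytes + openAllocations * BLOCK_SIZE <= SOFT_LIMIT;
  }
}
```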