Hi Kaijie, Anu,

Yes, my suggestions are mainly to introduce some rules that help us keep
the containers around the expected size, and I think this is one of, if not
the, most important aspects of the suggestion.

As Anu stated earlier, there were calculations and modelling done that
led us to the 5GB container size, so as I understand it, we should not just
let containers grow indefinitely.

Plan B is problematic in this sense, though I might not understand all
aspects of it either; I am not worried about under-allocation, but about
limitless over-allocation.
If we just issue a lease, then let's say we have a container in which,
during a rapid allocation phase, we allocate 100 blocks; that means 25GB of
data at most, and on a larger cluster I can imagine a job doing something
like this. Now let's say the node has a mere 40Gb full-duplex network, so
we have enough bandwidth to push the data onto the node in about 5 seconds
if I count properly, but writing it to disk is a good question... Assume a
spinning disk with mixed random writes and reads, but let's say the disk
can still write around 200MB/s; this means ~125 seconds are needed to write
that amount of data.
This can lead to two problems... One is that there might still be writes
that have to be redone as they cannot finish within the lease period; this
is the smaller problem... The other is that we will still have an
arbitrarily large container. On an SSD this arbitrarily large container
could mean even 100GB in a more extreme case... Please do not hesitate to
correct my math if I miscalculated... :)
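For reference, a quick script to double-check the arithmetic above (the
figures are just the assumptions of this scenario: 256MB blocks, a 40Gb/s
link, a 200MB/s disk; the small difference from my ~125 second figure comes
from counting in binary rather than decimal gigabytes):

```python
# Back-of-envelope check for the 100-block scenario above.
BLOCK_MB = 256
blocks = 100
total_gb = blocks * BLOCK_MB / 1024          # ~25GB of data at most

network_gbps = 40                            # gigabits per second, full duplex
network_gb_per_s = network_gbps / 8          # ~5GB/s
transfer_s = total_gb / network_gb_per_s     # ~5 seconds to push it to the node

disk_mb_per_s = 200                          # spinning disk, mixed random I/O
write_s = blocks * BLOCK_MB / disk_mb_per_s  # ~128 seconds to write it all

print(total_gb, transfer_s, write_s)  # 25.0 5.0 128.0
```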

So I think without Plan A we cannot ensure that the containers remain at,
or at least around, the 5GB limit; that is why I suggested adding
additional rules and allocation accounting, as otherwise I do not see how
we can preserve the property of having roughly 5GB containers when a
container is closed.

BUT,
there is a significant aspect of my suggestion that I seemingly did not
express well or did not stress enough...
As part of the suggestion I also tried to propose that the containers
should not be closed right away after the last allocated blocks have been
written... If we account for the allocations, and prevent allocation when
the container would already be over 5GB (in that 5.5-7.5GB range if we
apply my rules), then we do not have to close the container right after
these blocks are committed... At commit time, we can check whether the
container size has reached the expected 5GB size (or the 95% threshold);
if yes, we close it, if no, we can leave it open, and the next allocation
will be able to allocate a block or even more based on the same rules, as
the remaining space will be enough for a block or more.
So closing the container is necessary only if, after finishing the writes,
the container is above the limit, and not when all the pre-allocated blocks
have been written.
I hope, Anu, that this helps you see why the under-allocation problem
should not be an issue with my suggestion.
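To make the rules concrete, here is a rough sketch in Python; every name in
it (can_allocate, committed_size, open_allocations, and so on) is made up
for illustration and is not actual Ozone code:

```python
# Sketch of the allocation-accounting rules described above.
# All identifiers here are illustrative, not Ozone code.
BLOCK = 256 * 1024**2           # 256MB block size
LIMIT = 5 * 1024**3             # 5GB soft limit
CLOSE_THRESHOLD = 0.95 * LIMIT  # the 95% threshold mentioned above

def can_allocate(committed_size, open_allocations):
    # Remaining space before the 5GB soft limit.
    remaining = LIMIT - committed_size
    # Allow over-allocation by 50% of the remaining space, or by
    # 2 blocks, whichever is more.
    allowance = max(remaining * 1.5, 2 * BLOCK)
    # Would one more allocated block still fit within the allowance?
    return (open_allocations + 1) * BLOCK <= allowance

def should_close(committed_size, open_allocations):
    # Close only when every allocated block has been committed or
    # abandoned AND the container has reached the expected size.
    return open_allocations == 0 and committed_size >= CLOSE_THRESHOLD
```

So a container that commits its last outstanding block at, say, 4GB stays
open for further allocations, and only a commit that lands at or above the
threshold with no allocations outstanding closes it.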

So I still think that accounting for the allocations gives us a better
solution, one that keeps the container size around the limit much more
reliably, and it does not seem to be more complicated than introducing a
lease into the protocol, though it is more complex than an arbitrary delay,
which, as Anu correctly stated, is not easy to reason about, is hard to
test, and leads to more uncertainty and is harder to understand later on.

Pifta


anu engineer <anu.engin...@gmail.com> ezt írta (időpont: 2022. szept. 16.,
P, 3:54):

> As I said in my previous email, I prefer working on these kinds of issues
> by creating detailed models so I can reason about them. It is also quite
> possible that I am not understanding the current issue in its full depth.
>
> For what it is worth, I think your solution is correct. But I think that
> you are only solving the over allocation case. That is the case where
> everyone is trying to write more than the container size.
>
> But apply the same logic that you proposed, that is count the total number
> of blocks you can write in a container to under allocation case.
>
> Here is an example, Let us simply start with the assumption that clients
> are well behaved and it will chunk the block at the size that we are
> proposing.
>
> Assumptions:
>
> 1. Block size is 64MB
> 2. Container Size is 5GB
> 3. Therefore, the number of blocks you can write is 78 blocks per
> container.
> 4. Imagine a scenario, where a client writes single bytes to a file. That
> is each I/O is a single byte into different files.
> 5. Further imagine a scenario, where all those small files are landing in
> the same container -- what do you have now -- a container with 78 bytes
> (ignoring the metadata for now),
> 6. if you only keep the allocated block count, I am afraid that you might
> stop allocation of the blocks as soon as you hit 78 blocks, in this case,
> that would be a measly 78 bytes.
>
> Is that an ideal scenario? No, that is the reason why Ozone does not just
> go by counting allocated blocks, instead tries to get the containers to an
> expected size.
> In short, the average number of blocks and the size of containers is
> dependent on your workload, and the BIG promise we made to ourselves that
> we will try to get away from the small file problem of HDFS.
>
> Now, take this case to the extreme: if all files in a cluster are one byte
> files, will the container space explode? If we follow the "let us count"
> the blocks algo, yes it will, because for each 78 files, we will allocate a
> new container.
> I know it is a very delicate balance that we are dealing with and no
> algorithm will truly make us happy; it is about which trade-offs make us
> least sad.
> If we really want to pack as much data into containers, and reduce the
> number of containers, IMHO we need to go beyond the block counts.
>
> While this case I am illustrating is an extreme case, I wanted to show you
> that we have to solve for both the under allocation and over allocation in
> the containers.
> Again, this is a theoretical case, which might never happen, and if you are
> confident that counting the blocks algo will work, please go ahead.
>
>
> just my 2 cents.
> Thanks
> Anu
>
>
>
>
> On Wed, Sep 14, 2022 at 10:19 AM István Fajth <fapi...@gmail.com> wrote:
>
> > Thank you for the comments Anu,
> >
> > I think if we choose wisely how we enable over-allocation, we will not go
> > above 5.5GB in almost any case; let me reason about this later on.
> >
> > I would like to get some more clarity on the lease, as that also seems to
> > be easy, but I feel that it is an orthogonal solution to our current
> > problem and does not solve the current problem, more than a "simple
> delay"
> > would.
> > Let me share why I think so, and please tell me why I am wrong if I do :)
> > The current problem as I understand is that there are some clients that
> are
> > allocating blocks around the same time from the same container, and they
> > write a full block, or at least good amounts of data compared to block
> > size, but because there isn't a check whether the container has enough
> > space to write all the allocated blocks, so some clients fail to finish
> the
> > write, as in the middle of writing data to the container the container
> gets
> > closed.
> > An arbitrary delay might help somewhat, a lease that can be renewed for 2
> > minutes might help somewhat, but if there is a slow client that will
> still
> > fail the write, and we end up with dead data in the containers, which I
> > believe (but I am not sure) will be cleaned eventually by some background
> > process that checks after unreferenced blocks and removes them from
> > containers, while the client has to rewrite the whole data.
> >
> > In this situation, if we do not calculate how many blocks we may allocate
> > in an open container, there will anyway be cases where client can not
> > finish the write in time, while we can end up with arbitrarily large
> > containers, as based on the lease there will be a lot of writes that may
> > finish in time without a constraint on size. (I imagine a scenario where
> a
> > distributed client job allocates 10k blocks and then starts to write into
> > all of them on a cluster where we have 500 open containers.)
> >
> > Now if we allow over allocation by 50% of free space or by 2 blocks
> > (whichever is more), then we can end up having 5.5GB containers most of
> the
> > cases, but we also have an upper bound, as blocks in an empty container
> can
> > be allocated up to 7.5GB worth of total data if all blocks are fully
> > written.
> > The average case still would be 5.5GB, as we do not see this problem
> often
> > so far, so in most of the cases data is written fine and the container is
> > closed after, at least based on the problem description in the document
> > shared. Why am I saying this? If things go smooth and normal, there can
> be
> > two cases, 50% of the remaining space is bigger than 2 blocks, in this
> case
> > we will go beyond 5.5GB with allocation, if all blocks allocated last are
> > fully written, if the remaining space is less than 2 blocks, then we will
> > over allocate by 2 blocks, which if fully written will make the container
> > closed with ~5.5GB data. Now if you consider the number of full blocks
> > compared to the number of partial blocks on a real system, and the fact
> > that the largest size of a container is still limited to 7.5GB, and the
> > average should not go much bigger than the original 5GB, I think this is
> > enough to ensure that writes can be finished, while containers does not
> > grow beyond the limit too much.
> >
> > On the other side with leases we can have way larger containers in the
> > extreme case, but we do not need to open containers right away when there
> > is a flood of block allocation. (Though sooner or later we need to open
> new
> > ones).
> >
> > So the options seem to be:
> > - having container that might be heavily over allocated and writes that
> are
> > limited by the lease timeout, from which some may fail the write still
> the
> > same way
> > - having containers with a size close to our current limit, with the
> burden
> > of opening containers during a flood of block allocation, and add
> metadata
> > and bookkeeping around the number of open allocations in SCM to limit the
> > number of allocations that can go to one container.
> >
> > The first one seems to be simple, but does not solve the problem just
> hides
> > it a bit more due to delays in container closure, the second one seems to
> > solve the problem as it closes the container when all writes have been
> > finished, or abandoned, while limiting the number of writes that can go
> > into a container, at the price of having more open containers (for which
> I
> > still don't know what effects it might have if we have too many
> containers
> > open).
> >
> > Pifta
> >
> >
> >
> > anu engineer <anu.engin...@gmail.com> ezt írta (időpont: 2022. szept.
> 13.,
> > K, 19:24):
> >
> > > What I am arguing is that if you are going to use time, use it in a way
> > > that plays well with a distributed system.
> > >
> > > That is where leases come from. What you are suggesting is very similar
> > to
> > > the original design, that says keep track of allocated
> > > blocks (the original design doc that was shared) and your argument is
> > that
> > > keeping track can be done in terms of metadata if we are willing to
> > > make the size of containers variable. That is absolutely true. I think
> we
> > > will need to really think if we can make containers arbitrarily large
> and
> > > what is the impact on the rest of the system.
> > >
> > > My point was simply -- There is no notion of a "simple delay" in a
> > > distributed system -- since we will not be able to reason about it.
> > > If you want to create such delays, do it via a lease mechanism -- See
> my
> > > other email in response to Steven. There are subtle correctness issues
> > with
> > > arbitrary delays.
> > >
> > > Personally, I am fine with making containers larger, but I think really
> > it
> > > should be done after:
> > >
> > > 1. Measuring the average container sizes in a long running ozone
> cluster.
> > > 2. The average time it takes to replicate a container in a cluster.
> > > 3. After modelling the recovery time of Ozone node loss. The issue is
> > that
> > > if you have clusters with a small number of nodes, Ozone does not do as
> > > well a job as HDFS on scattering the blocks.
> > >
> > > So it really needs some data plotted on a curve with cluster size,
> > scatter,
> > > and that will give you a ball-park with recovery times. Long, long time
> > ago
> > > -- we did do an extensive model of this, and kind of convinced
> ourselves
> > > that the 5 GB size is viable. You should do a similar exercise to
> > convince
> > > yourself that it is possible to do much larger container sizes.
> > >
> > >  Another option you have is to support splitting closed containers --
> let
> > > it grow to any size, and then once it is closed, you do a special
> > operation
> > > via SCM to split the containers into two or so.
> > >
> > > Best of luck with either approach, the proposal you are making is
> > certainly
> > > doable, but it just needs some data analysis.
> > >
> > > Thanks
> > > Anu
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Sep 13, 2022 at 2:55 AM István Fajth <fapi...@gmail.com>
> wrote:
> > >
> > > > I am not sure I understand every bits of the discussion, but let me
> add
> > > > some thought as well here.
> > > >
> > > > Waiting on clients to finish writes, and see if the container is full
> > > > after, sounds like a horrible idea, as with that we will decrease the
> > > > amount of writes that can be done parallelly and make it to a
> function
> > of
> > > > the available space in the open containers, with that potentially
> > > decrease
> > > > write performance.
> > > >
> > > > The best solution would be the simplest, and someone might already
> > > > suggested something like that... but here is my idea to a very simple
> > > > logic:
> > > > Some information to establish the base where I am coming from:
> > > > - SCM as I understand knows how much space is not already used in a
> > > > container, and it can track the block allocations for the remaining
> > > space,
> > > > as it is the service actually allocates a block
> > > > - The container is a directory on a DataNode, and hence the 5GB limit
> > is
> > > an
> > > > arbitrary number we came up with as container size, but in theory if
> it
> > > > goes up to 10GB then that is just that and there are no further
> > > > implications by a higher sized container (except for re-replication
> and
> > > > validating health of the container)
> > > > - our block size is also an arbitrary number 256MB which does not
> come
> > > > based on any science, or any data from clusters about average file
> size
> > > we
> > > > just made it up, and doubled the one HDFS used, so that is just the
> > unit
> > > of
> > > > allocation and an upper limit of one unit of uniquely identifiable
> > data.
> > > >
> > > > Now in SCM if we hold a counter per container that holds the number
> of
> > > open
> > > > allocations, and allow over allocation by 50% of the remaining space
> in
> > > the
> > > > container, we will end up somewhat larger containers than 5GB in the
> > > > average case but we might get a simple system that ensures that
> > allocated
> > > > blocks can be finished writing to by the clients. Here is how:
> > > > - at allocation SCM checks if the available space and the number of
> > > > allocations allow any new block to be allocated in the container, if
> it
> > > > does not , then it checks another open container, if no open
> containers
> > > can
> > > > allocate more blocks it creates a new container. (A block can be
> > > allocated
> > > > if the in theory size of the container can not exceed the 5GB soft
> > limit
> > > by
> > > > more than 50% of the remaining space in the container considering the
> > > open
> > > > allocations)
> > > > - at final putBlock the DN in the ICR or in the heartBeat reports
> that
> > > the
> > > > block write is finished, in container x, so that SCM can decrement
> the
> > > > counter.
> > > > - if a block is abandoned, (afaik) we have a routine that deletes the
> > > half
> > > > written block data, we can also ask SCM to decrement the counter from
> > > this
> > > > routine while freeing up the space.
> > > > - if a block is abandoned for over a week (the token lifetime), we
> can
> > > > decrement the counter (but I believe this should happen with
> abandoned
> > > > block detection).
> > > > - if the container allocations are going down that much that the
> > > container
> > > > size can not exceed the 5GB soft-limit anymore, we can open up the
> > > > container for allocations again in SCM
> > > > - if all the open allocations are closed (the counter is 0), and the
> > > > container exceeds the 5GB soft-limit, we close the container from SCM
> > > >
> > > > Seems like this logic is simple enough, it does not require a lease,
> as
> > > we
> > > > do not have one today either and we somehow (as I know) handle the
> > > > abandoned blocks. Also it does not require any big changes in the
> > > > protocols, the only thing needed is to store the counter within SCM's
> > > > container data structure, and report from the DN to SCM that the
> block
> > > had
> > > > the final putBlock call.
> > > >
> > > > In case of a DDos style block allocation, we will end up with enough
> > > > containers to hold the data, and we can have 7.5GB containers in the
> > > worst
> > > > case, while in the general case we most likely will have around 5GB
> > > > containers.
> > > > Of course if needed we can introduce a hard limit for the number of
> > open
> > > > containers (pipelines), and if we exceed that then the client would
> > need
> > > to
> > > > wait to allocate more... (some form of QoS mentioned in the doc),
> but I
> > > am
> > > > not really a fan of this idea, as it introduces possibly large delays
> > not
> > > > just for the rogue client, do we know what is the consequence of
> having
> > > > like 1000 open containers on a 10 node cluster for example? (As that
> > > would
> > > > mean ~5TB of open allocation...)
> > > >
> > > > Please let me know if this idea does not account for some problems, I
> > > just
> > > > quickly drafted it to come up with something fairly simple that I
> > believe
> > > > helps to solve the problem with the least amount of disruption, but I
> > > > possibly did not thought through all the possible failure scenarios,
> > and
> > > > haven't checked what is actually in the code at the moment, just
> wanted
> > > to
> > > > share my idea quickly to validate with folks who know more about the
> > > actual
> > > > code out there in the codebase, to see if it is a feasible
> alternative.
> > > >
> > > > Pifta
> > > >
> > > >
> > > > anu engineer <anu.engin...@gmail.com> ezt írta (időpont: 2022.
> szept.
> > > 9.,
> > > > P, 19:10):
> > > >
> > > > > Extending the same thought from Steven. If you are going to do a
> > small
> > > > > delay, it is better to do it via a Lease.
> > > > >
> > > > > So SCM could offer a lease for 60 seconds, with a provision to
> > > reacquire
> > > > > the lease one more time.
> > > > > This does mean that a single container inside the data node
> > technically
> > > > > could become larger than 5GB (but that is possible even today).
> > > > >
> > > > > I do think a lease or a timeout based approach (as suggested by
> > Steven)
> > > > > might be easier than pre-allocating blocks.
> > > > >
> > > > > Thanks
> > > > > Anu
> > > > >
> > > > >
> > > > > On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell
> > > > > <sodonn...@cloudera.com.invalid> wrote:
> > > > >
> > > > > > > 4. Datanode wait until no write commands to this container,
> then
> > > > close
> > > > > > it.
> > > > > >
> > > > > > This could be done on SCM, with a simple delay. Ie hold back the
> > > close
> > > > > > commands for a "normal" close for some configurable amount of
> time.
> > > Eg
> > > > if
> > > > > > we hold for 60 seconds, it is likely almost all blocks will get
> > > > written.
> > > > > If
> > > > > > a very small number fail, it is likely OK.
> > > > > >
> > > > > > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen <c...@apache.org>
> wrote:
> > > > > >
> > > > > > > Thanks Ethan. Yes this could be a simpler solution.
> > > > > > > The main idea is allowing container size limit to be exceeded,
> > > > > > > to ensure all allocated blocks can be finished.
> > > > > > >
> > > > > > > We can change it to something like this:
> > > > > > > 1. Datanode notices the container is near full.
> > > > > > > 2. Datanode sends close container action to SCM immediately.
> > > > > > > 3. SCM closes the container and stops allocating new blocks in
> > it.
> > > > > > > 4. Datanode wait until no write commands to this container,
> then
> > > > close
> > > > > > it.
> > > > > > >
> > > > > > > It's still okay to wait for the next heartbeat in step 2.
> > > > > > > Step 4 is a little bit tricky, we need a lease or timeout to
> > > > determine
> > > > > > the
> > > > > > > time.
> > > > > > >
> > > > > > > Kaijie
> > > > > > >
> > > > > > >  ---- On Fri, 09 Sep 2022 08:54:34 +0800  Ethan Rose  wrote ---
> > > > > > >  > I believe the flow is:
> > > > > > >  > 1. Datanode notices the container is near full.
> > > > > > >  > 2. Datanode sends close container action to SCM on its next
> > > > > heartbeat.
> > > > > > >  > 3. SCM closes the container and sends a close container
> > command
> > > on
> > > > > the
> > > > > > >  > heartbeat response.
> > > > > > >  > 4. Datanodes get the response and close the container. If it
> > is
> > > a
> > > > > > Ratis
> > > > > > >  > container, the leader will send the close via Ratis.
> > > > > > >  >
> > > > > > >  > There is a "grace period" of sorts between steps 1 and 2,
> but
> > > this
> > > > > > does
> > > > > > > not
> > > > > > >  > help the situation because SCM does not stop issuing blocks
> to
> > > > this
> > > > > > >  > container until after step 3. Perhaps some amount of pause
> > > between
> > > > > > > steps 3
> > > > > > >  > and 4 would help, either on the SCM or datanode side. This
> > would
> > > > > > > provide a
> > > > > > >  > "grace period" between when SCM stops allocating blocks for
> > the
> > > > > > > container
> > > > > > >  > and when the container is actually closed. I'm not sure
> > exactly
> > > > how
> > > > > > this
> > > > > > >  > would be implemented in the code given the current setup,
> but
> > it
> > > > > seems
> > > > > > > like
> > > > > > >  > a simple option we should try before other more complicated
> > > > > solutions.
> > > > > > >  >
> > > > > > >  > Ethan
> > > > > > >
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > > > > > For additional commands, e-mail: dev-h...@ozone.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Pifta
> > > >
> > >
> >
> >
> > --
> > Pifta
> >
>


-- 
Pifta
