Thanks everyone for the comments.
If I understand correctly, Pifta's suggestion is similar to the OP doc plus
some additional rules. Allowing slight over-allocation is like increasing
the container size limit but closing containers a little earlier to avoid
the final "bounce" stage.
Here's a brief summary of the current plans. Plan C was in the "P.S." section
of my original post, and I'll explain it in more detail below.
A. "Reserve space or limit blocks": may cause slow writes.
B. "Lease or delay before closing": may cause too large containers.
C. "Lazy allocation until commit": may be complicated to implement.
Details of plan C:
1. Create a special container "Eden" in each pipeline, with no space limit.
2. All new blocks will be created in "Eden".
3. When a key is being committed, its blocks will be moved into a regular
container.
The move operation is performed with link and unlink syscalls in the local
FS (no data copy).
4. If any error happens, close the old "Eden" and create a new one.
The old "Eden" is retired and blocks inside it will be collected later.
5. Only "Eden" and retired "Eden"s are scanned for uncommitted/orphan blocks.
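To illustrate step 3: the move can be cheap because a hard link plus an unlink
only changes directory entries, never copying block data. A minimal sketch of
the idea (the class and method names here are made up, not actual Ozone code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of plan C's commit-time move: the block file is
// hard-linked into the regular container, then unlinked from "Eden".
// Requires both directories to be on the same local filesystem.
public class EdenMove {
    // Moves blockFile from the Eden container dir into containerDir.
    // Returns the new path of the block inside the regular container.
    static Path commitBlock(Path blockFile, Path containerDir) throws IOException {
        Path target = containerDir.resolve(blockFile.getFileName());
        Files.createLink(target, blockFile); // new dirent, same inode
        Files.delete(blockFile);             // drop the Eden dirent
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path eden = Files.createTempDirectory("eden");
        Path container = Files.createTempDirectory("container-42");
        Path block = Files.write(eden.resolve("block-1001"), new byte[]{1, 2, 3});
        Path committed = commitBlock(block, container);
        System.out.println("committed to: " + committed);
        System.out.println("eden copy gone: " + Files.notExists(block));
    }
}
```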
This makes block allocation much simpler, because we will know the size of
the blocks before moving them into a regular container. But it is a huge
change and might not be easy to implement without breaking compatibility.
I'm working on a POC of plan B right now because it's easier to implement.
I'll share the results when it's ready.
Thanks
Kaijie
---- On Thu, 15 Sep 2022 01:18:33 +0800 István Fajth wrote ---
> Thank you for the comments Anu,
>
> I think if we choose wisely how we enable over-allocation, we will not go
> above 5.5GB in almost any case; let me reason about this later on.
>
> I would like to get some more clarity on the lease, as that also seems
> easy, but I feel that it is orthogonal to our current problem and does
> not solve it any more than a "simple delay" would.
> Let me share why I think so, and please tell me why I am wrong if I am :)
> The current problem, as I understand it, is that there are some clients
> allocating blocks around the same time from the same container, and they
> write a full block, or at least a good amount of data compared to the
> block size; but because there is no check whether the container has enough
> space to write all the allocated blocks, some clients fail to finish the
> write, as the container gets closed while they are still writing data to
> it.
> An arbitrary delay might help somewhat, and a lease that can be renewed
> for 2 minutes might help somewhat, but a slow client will still fail the
> write, and we end up with dead data in the containers, which I believe
> (but I am not sure) will eventually be cleaned by some background
> process that checks for unreferenced blocks and removes them from
> containers, while the client has to rewrite the whole data.
>
> In this situation, if we do not calculate how many blocks we may allocate
> in an open container, there will anyway be cases where the client cannot
> finish the write in time, while we can end up with arbitrarily large
> containers, as with the lease there will be a lot of writes that may
> finish in time without a constraint on size. (I imagine a scenario where a
> distributed client job allocates 10k blocks and then starts to write into
> all of them on a cluster where we have 500 open containers.)
>
> Now if we allow over-allocation by 50% of free space or by 2 blocks
> (whichever is more), then we can end up with 5.5GB containers in most
> cases, but we also have an upper bound, as an empty container can have
> blocks allocated up to 7.5GB worth of total data if all blocks are fully
> written.
> The average case would still be 5.5GB, as we do not see this problem often
> so far, so in most cases data is written fine and the container is closed
> after, at least based on the problem description in the document shared.
> Why am I saying this? If things go smoothly and normally, there are two
> cases. If 50% of the remaining space is bigger than 2 blocks, we will go
> beyond 5.5GB with allocation if all the blocks allocated last are fully
> written. If the remaining space is less than 2 blocks, we will
> over-allocate by 2 blocks, which if fully written will close the container
> with ~5.5GB of data. Now if you consider the number of full blocks
> compared to the number of partial blocks on a real system, and the fact
> that the largest container size is still limited to 7.5GB while the
> average should not go much beyond the original 5GB, I think this is
> enough to ensure that writes can finish, while containers do not grow
> beyond the limit too much.
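To make Pifta's bound concrete: with the 5GB soft limit and 256MB blocks,
allowing allocations to exceed the limit by max(50% of remaining space, 2
blocks) gives exactly the 7.5GB worst case and the ~5.5GB typical case. A
small sketch of the arithmetic (the names are made up for illustration):

```java
// Sketch of the over-allocation bound described above: a container may
// receive allocations until its theoretical size would exceed
// SOFT_LIMIT + max(remaining / 2, 2 * BLOCK).
public class OverAllocationBound {
    static final long GB = 1024L * 1024 * 1024;
    static final long BLOCK = 256L * 1024 * 1024; // 256 MB block size
    static final long SOFT_LIMIT = 5 * GB;        // 5 GB container soft limit

    // Upper bound on total data allocatable to a container that already
    // holds `used` bytes.
    static long allocationCap(long used) {
        long remaining = Math.max(0, SOFT_LIMIT - used);
        return SOFT_LIMIT + Math.max(remaining / 2, 2 * BLOCK);
    }

    public static void main(String[] args) {
        // Empty container: 5 GB + 2.5 GB = 7.5 GB worst case.
        System.out.println(allocationCap(0) / (double) GB);               // 7.5
        // Nearly full container: 5 GB + 2 blocks = 5.5 GB.
        System.out.println(allocationCap(SOFT_LIMIT - BLOCK) / (double) GB); // 5.5
    }
}
```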
>
> On the other hand, with leases we can have way larger containers in the
> extreme case, but we do not need to open containers right away when there
> is a flood of block allocations. (Though sooner or later we need to open
> new ones.)
>
> So the options seem to be:
> - having containers that might be heavily over-allocated, and writes that
> are limited by the lease timeout, some of which may still fail the write
> the same way
> - having containers with a size close to our current limit, with the burden
> of opening containers during a flood of block allocations, and adding
> metadata and bookkeeping around the number of open allocations in SCM to
> limit the number of allocations that can go to one container.
>
> The first one seems simple, but it does not solve the problem, just hides
> it a bit more due to delays in container closure. The second one seems to
> solve the problem, as it closes the container when all writes have been
> finished or abandoned, while limiting the number of writes that can go
> into a container, at the price of having more open containers (and I
> still don't know what effects having too many open containers might
> have).
>
> Pifta
>
>
>
> anu engineer [email protected]> wrote (on 13 Sept 2022, Tue, 19:24):
>
> > What I am arguing is that if you are going to use time, use it in a way
> > that plays well with a distributed system.
> >
> > That is where leases come from. What you are suggesting is very similar
> > to the original design (the design doc that was shared), which says to
> > keep track of allocated blocks, and your argument is that the tracking
> > can be done in terms of metadata if we are willing to make the size of
> > containers variable. That is absolutely true. I think we will need to
> > really think about whether we can make containers arbitrarily large and
> > what the impact on the rest of the system would be.
> >
> > My point was simply -- There is no notion of a "simple delay" in a
> > distributed system -- since we will not be able to reason about it.
> > If you want to create such delays, do it via a lease mechanism -- see my
> > other email in response to Stephen. There are subtle correctness issues
> > with arbitrary delays.
> >
> > Personally, I am fine with making containers larger, but I think really it
> > should be done after:
> >
> > 1. Measuring the average container sizes in a long-running Ozone cluster.
> > 2. Measuring the average time it takes to replicate a container in a
> > cluster.
> > 3. Modelling the recovery time of Ozone node loss. The issue is that
> > if you have clusters with a small number of nodes, Ozone does not do as
> > good a job as HDFS at scattering the blocks.
> >
> > So it really needs some data plotted on a curve with cluster size and
> > scatter, and that will give you a ball-park of recovery times. Long, long
> > ago we did an extensive model of this, and kind of convinced ourselves
> > that the 5 GB size is viable. You should do a similar exercise to convince
> > yourself that it is possible to do much larger container sizes.
> >
> > Another option you have is to support splitting closed containers -- let
> > it grow to any size, and then once it is closed, you do a special operation
> > via SCM to split the containers into two or so.
> >
> > Best of luck with either approach, the proposal you are making is certainly
> > doable, but it just needs some data analysis.
> >
> > Thanks
> > Anu
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Sep 13, 2022 at 2:55 AM István Fajth [email protected]> wrote:
> >
> > > I am not sure I understand every bit of the discussion, but let me add
> > > some thoughts as well here.
> > >
> > > Waiting on clients to finish writes, and seeing if the container is
> > > full after, sounds like a horrible idea, as with that we will decrease
> > > the amount of writes that can be done in parallel and make it a
> > > function of the available space in the open containers, potentially
> > > decreasing write performance.
> > >
> > > The best solution would be the simplest, and someone may have already
> > > suggested something like this... but here is my idea for a very simple
> > > logic.
> > > Some information to establish the base I am coming from:
> > > - SCM, as I understand, knows how much free space remains in a
> > > container, and it can track the block allocations for the remaining
> > > space, as it is the service that actually allocates blocks
> > > - the container is a directory on a DataNode, and hence the 5GB limit
> > > is an arbitrary number we came up with as the container size; in theory
> > > if it goes up to 10GB then that is just that, and there are no further
> > > implications of a larger container (except for re-replication and
> > > validating the health of the container)
> > > - our block size, 256MB, is also an arbitrary number: it is not based
> > > on any science or on data from clusters about average file size, we
> > > just made it up by doubling the one HDFS used. So it is just the unit
> > > of allocation and an upper limit on one unit of uniquely identifiable
> > > data.
> > >
> > > Now in SCM, if we hold a counter per container for the number of open
> > > allocations, and allow over-allocation by 50% of the remaining space in
> > > the container, we will end up with somewhat larger containers than 5GB
> > > in the average case, but we might get a simple system that ensures that
> > > allocated blocks can be finished writing by the clients. Here is how:
> > > - at allocation, SCM checks whether the available space and the number
> > > of open allocations allow any new block to be allocated in the
> > > container; if not, it checks another open container, and if no open
> > > container can allocate more blocks, it creates a new container. (A
> > > block can be allocated if the theoretical size of the container,
> > > considering the open allocations, cannot exceed the 5GB soft limit by
> > > more than 50% of the remaining space in the container.)
> > > - at the final putBlock, the DN reports in the ICR or in the heartbeat
> > > that the block write is finished in container x, so that SCM can
> > > decrement the counter.
> > > - if a block is abandoned, (afaik) we have a routine that deletes the
> > > half-written block data; we can also ask SCM to decrement the counter
> > > from this routine while freeing up the space.
> > > - if a block is abandoned for over a week (the token lifetime), we can
> > > decrement the counter (but I believe this should happen with abandoned
> > > block detection).
> > > - if the container's open allocations go down enough that the container
> > > size can no longer exceed the 5GB soft limit, we can open up the
> > > container for allocations again in SCM
> > > - if all the open allocations are closed (the counter is 0) and the
> > > container exceeds the 5GB soft limit, we close the container from SCM
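The bookkeeping in the list above can be sketched roughly like this; none of
these names exist in the actual codebase, this is just an illustration of the
counter logic:

```java
// Hypothetical sketch of the per-container counter Pifta proposes. SCM
// admits a new block only while the container's theoretical size (data
// already written plus all open allocations at full block size) stays
// under SOFT_LIMIT + max(remaining / 2, 2 * BLOCK).
public class ContainerAllocationTracker {
    static final long BLOCK = 256L * 1024 * 1024;            // 256 MB
    static final long SOFT_LIMIT = 5L * 1024 * 1024 * 1024;  // 5 GB

    private long usedBytes;        // data already committed to the container
    private int openAllocations;   // blocks allocated but not yet finalized

    // Called by SCM when considering this container for a new block.
    synchronized boolean tryAllocateBlock() {
        long remaining = Math.max(0, SOFT_LIMIT - usedBytes);
        long cap = SOFT_LIMIT + Math.max(remaining / 2, 2 * BLOCK);
        long theoreticalSize = usedBytes + (openAllocations + 1L) * BLOCK;
        if (theoreticalSize > cap) {
            return false;          // caller tries another open container
        }
        openAllocations++;
        return true;
    }

    // Called when the DN reports the final putBlock, or when an abandoned
    // block is cleaned up (bytesWritten == 0 in that case).
    synchronized void finishBlock(long bytesWritten) {
        openAllocations--;
        usedBytes += bytesWritten;
    }

    // SCM closes the container once all allocations are settled and the
    // soft limit has been exceeded.
    synchronized boolean shouldClose() {
        return openAllocations == 0 && usedBytes >= SOFT_LIMIT;
    }
}
```

With these numbers, an empty container admits exactly 30 blocks (7.5GB / 256MB)
before allocation is refused.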
> > >
> > > This logic seems simple enough. It does not require a lease, as we do
> > > not have one today either and we somehow (as far as I know) handle the
> > > abandoned blocks. Also, it does not require any big changes in the
> > > protocols; the only things needed are to store the counter within SCM's
> > > container data structure, and to report from the DN to SCM that the
> > > block had its final putBlock call.
> > >
> > > In case of a DDoS-style block allocation, we will end up with enough
> > > containers to hold the data, and we can have 7.5GB containers in the
> > > worst case, while in the general case we will most likely have around
> > > 5GB containers.
> > > Of course, if needed, we can introduce a hard limit on the number of
> > > open containers (pipelines), and if we exceed that, the client would
> > > need to wait to allocate more... (some form of the QoS mentioned in the
> > > doc). But I am not really a fan of this idea, as it introduces possibly
> > > large delays not just for the rogue client. Do we know the consequence
> > > of having, say, 1000 open containers on a 10-node cluster? (As that
> > > would mean ~5TB of open allocations...)
> > >
> > > Please let me know if this idea does not account for some problems. I
> > > just quickly drafted it to come up with something fairly simple that I
> > > believe helps solve the problem with the least amount of disruption,
> > > but I possibly did not think through all the possible failure
> > > scenarios, and I haven't checked what is actually in the code at the
> > > moment. I just wanted to share my idea quickly to validate it with
> > > folks who know the actual code in the codebase better, to see if it is
> > > a feasible alternative.
> > >
> > > Pifta
> > >
> > >
> > > anu engineer [email protected]> wrote (on 9 Sept 2022, Fri, 19:10):
> > >
> > > > Extending the same thought from Stephen: if you are going to do a
> > > > small delay, it is better to do it via a lease.
> > > >
> > > > So SCM could offer a lease for 60 seconds, with a provision to
> > > > reacquire the lease one more time.
> > > > This does mean that a single container inside the datanode could
> > > > technically become larger than 5GB (but that is possible even today).
> > > >
> > > > I do think a lease or a timeout based approach (as suggested by Steven)
> > > > might be easier than pre-allocating blocks.
> > > >
> > > > Thanks
> > > > Anu
> > > >
> > > >
> > > > On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell
> > > > [email protected]> wrote:
> > > >
> > > > > > 4. Datanode waits until there are no more write commands to this
> > > > > > container, then closes it.
> > > > >
> > > > > This could be done on SCM with a simple delay, i.e. hold back the
> > > > > close commands for a "normal" close for some configurable amount of
> > > > > time. E.g. if we hold for 60 seconds, it is likely almost all
> > > > > blocks will get written. If a very small number fail, it is likely
> > > > > OK.
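The hold-back Stephen describes could be sketched as below; this is a rough
illustration only, with made-up names, not the real SCM code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the SCM-side delay: mark the container as
// non-allocatable right away, but send the actual close command only
// after a configurable grace period so in-flight writes can finish.
public class DelayedContainerClose {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final long graceMillis;

    public DelayedContainerClose(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    public void requestClose(long containerId, Runnable sendCloseCommand) {
        stopAllocations(containerId); // takes effect immediately
        scheduler.schedule(sendCloseCommand, graceMillis, TimeUnit.MILLISECONDS);
    }

    private void stopAllocations(long containerId) {
        // In SCM this would move the container out of the allocatable set;
        // here it is just a placeholder.
        System.out.println("container " + containerId + " closed to new blocks");
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}
```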
> > > > >
> > > > > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen [email protected]> wrote:
> > > > >
> > > > > > Thanks Ethan. Yes this could be a simpler solution.
> > > > > > The main idea is allowing container size limit to be exceeded,
> > > > > > to ensure all allocated blocks can be finished.
> > > > > >
> > > > > > We can change it to something like this:
> > > > > > 1. Datanode notices the container is near full.
> > > > > > 2. Datanode sends a close container action to SCM immediately.
> > > > > > 3. SCM closes the container and stops allocating new blocks in it.
> > > > > > 4. Datanode waits until there are no more write commands to this
> > > > > > container, then closes it.
> > > > > >
> > > > > > It's still okay to wait for the next heartbeat in step 2.
> > > > > > Step 4 is a little bit tricky: we need a lease or timeout to
> > > > > > determine the time.
> > > > > >
> > > > > > Kaijie
> > > > > >
> > > > > > ---- On Fri, 09 Sep 2022 08:54:34 +0800 Ethan Rose wrote ---
> > > > > > > I believe the flow is:
> > > > > > > 1. Datanode notices the container is near full.
> > > > > > > 2. Datanode sends a close container action to SCM on its next
> > > > > > > heartbeat.
> > > > > > > 3. SCM closes the container and sends a close container command
> > > > > > > on the heartbeat response.
> > > > > > > 4. Datanodes get the response and close the container. If it is
> > > > > > > a Ratis container, the leader will send the close via Ratis.
> > > > > > >
> > > > > > > There is a "grace period" of sorts between steps 1 and 2, but
> > > > > > > this does not help the situation, because SCM does not stop
> > > > > > > issuing blocks to this container until after step 3. Perhaps
> > > > > > > some amount of pause between steps 3 and 4 would help, either
> > > > > > > on the SCM or the datanode side. This would provide a "grace
> > > > > > > period" between when SCM stops allocating blocks for the
> > > > > > > container and when the container is actually closed. I'm not
> > > > > > > sure exactly how this would be implemented in the code given
> > > > > > > the current setup, but it seems like a simple option we should
> > > > > > > try before other, more complicated solutions.
> > > > > > >
> > > > > > > Ethan
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]