Thanks everyone for the comments.

If I understand correctly, Pifta's suggestion is similar to the OP doc plus
some additional rules. Allowing slight overallocation is like increasing the
container size limit, but closing containers a little earlier to avoid the
final "bounce" stage.

Here's a brief summary of the current plans. Plan C was in the "P.S." section
of my original post, and I'll explain it in more detail below.

  A. "Reserve space or limit blocks": may cause slow writes.
  B. "Lease or delay before closing": may cause too large containers.
  C. "Lazy allocation until commit": may be complicated to implement.

Details of plan C:

  1. Create a special container "Eden" in each pipeline, with no space limit.
  2. All new blocks will be created in "Eden".
  3. When a key is committed, its blocks will be moved into a regular container.
      The move is performed with link and unlink syscalls on the local FS
      (no data copy); see the sketch below.
  4. If any error happens, close the old "Eden" and create a new one.
      The old "Eden" is retired and blocks inside it will be collected later.
  5. Only "Eden" and retired "Eden"s are scanned for uncommitted/orphan blocks.
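
Here is a minimal sketch of the move in step 3, assuming blocks are plain
files and that the "Eden" directory and the target container directory are
on the same local file system. The class and method names are hypothetical,
not existing Ozone code:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class EdenBlockMover {
      /**
       * Moves a committed block from the "Eden" container into a regular
       * container by hard-linking the file at the destination and then
       * unlinking the source, so no block data is copied.
       */
      public static void moveBlock(Path edenBlockFile, Path targetBlockFile)
          throws IOException {
        Files.createLink(targetBlockFile, edenBlockFile);  // link
        Files.delete(edenBlockFile);                        // unlink
      }
    }

If the process crashes between the link and the unlink, the block is simply
reachable from both places until the retired "Eden" is collected; removing
the extra link later does not touch the shared data.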

This makes block allocation much simpler, because we will already know the
size of each block before moving it into a regular container. But it is a
huge change and might not be easy to implement without breaking compatibility.

I'm working on a POC of plan B right now because it's easier to implement.
I'll share the results when it's ready.
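
For the plan B POC, the idea (as discussed in the thread) is to stop
allocating new blocks to a near-full container right away, but to hold back
the actual close command for a grace period or a renewable lease so that
in-flight writes can finish. Below is a minimal sketch of what that could
look like on the SCM side; the class and method names are placeholders, not
the real SCM code:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DelayedContainerCloser {
      private final ScheduledExecutorService scheduler =
          Executors.newSingleThreadScheduledExecutor();
      private final long graceSeconds;

      public DelayedContainerCloser(long graceSeconds) {
        this.graceSeconds = graceSeconds;
      }

      /** Called when a datanode reports that a container is near full. */
      public void requestClose(long containerId) {
        // Stop handing out new blocks from this container right away.
        stopBlockAllocation(containerId);
        // Issue the actual close command only after the grace period,
        // so already-allocated blocks have time to finish writing.
        scheduler.schedule(() -> issueCloseCommand(containerId),
            graceSeconds, TimeUnit.SECONDS);
      }

      private void stopBlockAllocation(long containerId) {
        // placeholder: mark the container unavailable for new allocations
      }

      private void issueCloseCommand(long containerId) {
        // placeholder: queue the close-container command for the datanodes
      }
    }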

Thanks
Kaijie

 ---- On Thu, 15 Sep 2022 01:18:33 +0800  István Fajth  wrote --- 
 > Thank you for the comments Anu,
 > 
 > I think if we choose wisely how we enable over-allocation, we will not go
 > above 5.5GB in almost any of the cases; let me reason about this later on.
 > 
 > I would like to get some more clarity on the lease, as that also seems to
 > be easy, but I feel that it is an orthogonal solution to our current
 > problem and does not solve it any more than a "simple delay" would.
 > Let me share why I think so, and please tell me why I am wrong if I do :)
 > The current problem as I understand it is that some clients are allocating
 > blocks around the same time from the same container, and they write a full
 > block, or at least a good amount of data compared to the block size, but
 > because there isn't a check whether the container has enough space to hold
 > all the allocated blocks, some clients fail to finish the write, as the
 > container gets closed while they are still writing data to it.
 > An arbitrary delay might help somewhat, and a lease that can be renewed for
 > 2 minutes might help somewhat, but a slow client will still fail the write,
 > and we end up with dead data in the containers, which I believe (but am not
 > sure) will eventually be cleaned up by some background process that looks
 > for unreferenced blocks and removes them from containers, while the client
 > has to rewrite the whole data.
 > 
 > In this situation, if we do not calculate how many blocks we may allocate
 > in an open container, there will anyway be cases where the client cannot
 > finish the write in time, while we can end up with arbitrarily large
 > containers, as based on the lease there will be a lot of writes that may
 > finish in time without a constraint on size. (I imagine a scenario where a
 > distributed client job allocates 10k blocks and then starts to write into
 > all of them on a cluster where we have 500 open containers.)
 > 
 > Now if we allow over-allocation by 50% of the free space or by 2 blocks
 > (whichever is more), then we can end up with 5.5GB containers in most of
 > the cases, but we also have an upper bound, as blocks in an empty container
 > can be allocated up to 7.5GB worth of total data if all blocks are fully
 > written.
 > The average case would still be 5.5GB, as we do not see this problem often
 > so far, so in most of the cases data is written fine and the container is
 > closed after, at least based on the problem description in the shared
 > document. Why am I saying this? If things go smoothly, there are two cases:
 > if 50% of the remaining space is bigger than 2 blocks, we will go beyond
 > 5.5GB with allocation if all the last allocated blocks are fully written;
 > if the remaining space is less than 2 blocks, then we will over-allocate by
 > 2 blocks, which if fully written will close the container with ~5.5GB of
 > data. Now if you consider the number of full blocks compared to the number
 > of partial blocks on a real system, and the fact that the largest size of a
 > container is still limited to 7.5GB, and that the average should not go
 > much beyond the original 5GB, I think this is enough to ensure that writes
 > can be finished while containers do not grow beyond the limit too much.
 > 
 > On the other hand, with leases we can have way larger containers in the
 > extreme case, but we do not need to open containers right away when there
 > is a flood of block allocations. (Though sooner or later we need to open
 > new ones.)
 > 
 > So the options seem to be:
 > - having containers that might be heavily over-allocated, and writes that
 > are limited by the lease timeout, from which some may still fail the same
 > way
 > - having containers with a size close to our current limit, with the burden
 > of opening containers during a flood of block allocations, and adding
 > metadata and bookkeeping around the number of open allocations in SCM to
 > limit the number of allocations that can go to one container.
 > 
 > The first one seems simple, but it does not solve the problem, just hides
 > it a bit more due to delays in container closure; the second one seems to
 > solve the problem, as it closes the container when all writes have been
 > finished or abandoned, while limiting the number of writes that can go
 > into a container, at the price of having more open containers (and I still
 > don't know what effects having too many open containers might have).
 > 
 > Pifta
 > 
 > 
 > 
 > anu engineer anu.engin...@gmail.com> wrote (on Tue, 13 Sept 2022, 19:24):
 > 
 > > What I am arguing is that if you are going to use time, use it in a way
 > > that plays well with a distributed system.
 > >
 > > That is where leases come from. What you are suggesting is very similar to
 > > the original design, that says keep track of allocated
 > > blocks (the original design doc that was shared) and your argument is that
 > > keeping track can be done in terms of metadata if we are willing to
 > > make the size of containers variable. That is absolutely true. I think we
 > > will need to really think if we can make containers arbitrarily large and
 > > what is the impact on the rest of the system.
 > >
 > > My point was simply -- There is no notion of a "simple delay" in a
 > > distributed system -- since we will not be able to reason about it.
 > > If you want to create such delays, do it via a lease mechanism -- See my
 > > other email in response to Steven. There are subtle correctness issues with
 > > arbitrary delays.
 > >
 > > Personally, I am fine with making containers larger, but I think really it
 > > should be done after:
 > >
 > > 1. Measuring the average container sizes in a long-running Ozone cluster.
 > > 2. Measuring the average time it takes to replicate a container in a cluster.
 > > 3. Modelling the recovery time of Ozone node loss. The issue is that
 > > if you have clusters with a small number of nodes, Ozone does not do as
 > > good a job as HDFS at scattering the blocks.
 > >
 > > So it really needs some data plotted on a curve with cluster size, scatter,
 > > and that will give you a ball-park with recovery times. Long, long time ago
 > > -- we did do an extensive model of this, and kind of convinced ourselves
 > > that the 5 GB size is viable. You should do a similar exercise to convince
 > > yourself that it is possible to do much larger container sizes.
 > >
 > >  Another option you have is to support splitting closed containers -- let
 > > it grow to any size, and then once it is closed, you do a special operation
 > > via SCM to split the containers into two or so.
 > >
 > > Best of luck with either approach, the proposal you are making is certainly
 > > doable, but it just needs some data analysis.
 > >
 > > Thanks
 > > Anu
 > >
 > > On Tue, Sep 13, 2022 at 2:55 AM István Fajth fapi...@gmail.com> wrote:
 > >
 > > > I am not sure I understand every bit of the discussion, but let me add
 > > > some thoughts as well here.
 > > >
 > > > Waiting on clients to finish writes, and seeing if the container is full
 > > > afterwards, sounds like a horrible idea, as with that we will decrease
 > > > the amount of writes that can be done in parallel and make it a function
 > > > of the available space in the open containers, potentially decreasing
 > > > write performance.
 > > >
 > > > The best solution would be the simplest, and someone might have already
 > > > suggested something like that... but here is my idea for a very simple
 > > > logic.
 > > > Some information to establish where I am coming from:
 > > > - SCM, as I understand, knows how much space is not already used in a
 > > > container, and it can track the block allocations for the remaining
 > > > space, as it is the service that actually allocates a block
 > > > - the container is a directory on a DataNode, and hence the 5GB limit is
 > > > an arbitrary number we came up with as the container size, but in theory
 > > > if it goes up to 10GB then that is just that and there are no further
 > > > implications of a larger container (except for re-replication and
 > > > validating the health of the container)
 > > > - our block size is also an arbitrary number, 256MB, which is not based
 > > > on any science or any data from clusters about average file size; we
 > > > just made it up and doubled the one HDFS used, so it is just the unit of
 > > > allocation and an upper limit on one unit of uniquely identifiable data.
 > > >
 > > > Now if in SCM we hold a counter per container that holds the number of
 > > > open allocations, and allow over-allocation by 50% of the remaining
 > > > space in the container, we will end up with somewhat larger containers
 > > > than 5GB in the average case, but we might get a simple system that
 > > > ensures that clients can finish writing the allocated blocks. Here is how:
 > > > - at allocation, SCM checks if the available space and the number of
 > > > allocations allow a new block to be allocated in the container; if not,
 > > > it checks another open container, and if no open container can allocate
 > > > more blocks it creates a new container. (A block can be allocated if,
 > > > considering the open allocations, the theoretical size of the container
 > > > cannot exceed the 5GB soft limit by more than 50% of the remaining space
 > > > in the container.)
 > > > - at the final putBlock, the DN reports in the ICR or in the heartbeat
 > > > that the block write is finished in container x, so that SCM can
 > > > decrement the counter.
 > > > - if a block is abandoned, (afaik) we have a routine that deletes the
 > > > half-written block data; we can also ask SCM to decrement the counter
 > > > from this routine while freeing up the space.
 > > > - if a block is abandoned for over a week (the token lifetime), we can
 > > > decrement the counter (but I believe this should happen with abandoned
 > > > block detection).
 > > > - if the container allocations go down so much that the container size
 > > > can no longer exceed the 5GB soft limit, we can open up the container
 > > > for allocations again in SCM
 > > > - if all the open allocations are closed (the counter is 0), and the
 > > > container exceeds the 5GB soft-limit, we close the container from SCM
 > > >
 > > > This logic seems simple enough; it does not require a lease, as we do
 > > > not have one today either and we somehow (as far as I know) handle
 > > > abandoned blocks. Also, it does not require any big changes in the
 > > > protocols; the only thing needed is to store the counter within SCM's
 > > > container data structure, and to report from the DN to SCM that the
 > > > block had its final putBlock call.
 > > >
 > > > In case of a DDoS-style block allocation, we will end up with enough
 > > > containers to hold the data, and we can have 7.5GB containers in the
 > > > worst case, while in the general case we will most likely have around
 > > > 5GB containers.
 > > > Of course, if needed we can introduce a hard limit on the number of open
 > > > containers (pipelines), and if we exceed that then the client would need
 > > > to wait to allocate more... (some form of the QoS mentioned in the doc),
 > > > but I am not really a fan of this idea, as it introduces possibly large
 > > > delays not just for the rogue client. Do we know what the consequence of
 > > > having, say, 1000 open containers on a 10-node cluster would be? (As
 > > > that would mean ~5TB of open allocations...)
 > > >
 > > > Please let me know if this idea does not account for some problems; I
 > > > just quickly drafted it to come up with something fairly simple that I
 > > > believe helps to solve the problem with the least amount of disruption,
 > > > but I possibly did not think through all the possible failure scenarios,
 > > > and I haven't checked what is actually in the code at the moment. I just
 > > > wanted to share my idea quickly and validate it with folks who know more
 > > > about the actual code in the codebase, to see if it is a feasible
 > > > alternative.
 > > >
 > > > Pifta
 > > >
 > > >
 > > > anu engineer anu.engin...@gmail.com> wrote (on Fri, 9 Sept 2022, 19:10):
 > > >
 > > > > Extending the same thought from Steven. If you are going to do a small
 > > > > delay, it is better to do it via a Lease.
 > > > >
 > > > > So SCM could offer a lease for 60 seconds, with a provision to
 > > reacquire
 > > > > the lease one more time.
 > > > > This does mean that a single container inside the data node technically
 > > > > could become larger than 5GB (but that is possible even today).
 > > > >
 > > > > I do think a lease or a timeout based approach (as suggested by Steven)
 > > > > might be easier than pre-allocating blocks.
 > > > >
 > > > > Thanks
 > > > > Anu
 > > > >
 > > > >
 > > > > On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell
 > > > > sodonn...@cloudera.com.invalid> wrote:
 > > > >
 > > > > > > 4. Datanode wait until no write commands to this container, then
 > > > close
 > > > > > it.
 > > > > >
 > > > > > This could be done on SCM, with a simple delay. Ie hold back the
 > > close
 > > > > > commands for a "normal" close for some configurable amount of time.
 > > Eg
 > > > if
 > > > > > we hold for 60 seconds, it is likely almost all blocks will get
 > > > written.
 > > > > If
 > > > > > a very small number fail, it is likely OK.
 > > > > >
 > > > > > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen c...@apache.org> wrote:
 > > > > >
 > > > > > > Thanks Ethan. Yes this could be a simpler solution.
 > > > > > > The main idea is to allow the container size limit to be exceeded,
 > > > > > > to ensure all allocated blocks can be finished.
 > > > > > >
 > > > > > > We can change it to something like this:
 > > > > > > 1. Datanode notices the container is near full.
 > > > > > > 2. Datanode sends close container action to SCM immediately.
 > > > > > > 3. SCM closes the container and stops allocating new blocks in it.
 > > > > > > 4. Datanode waits until there are no more write commands to this
 > > > > > > container, then closes it.
 > > > > > >
 > > > > > > It's still okay to wait for the next heartbeat in step 2.
 > > > > > > Step 4 is a little bit tricky; we need a lease or timeout to
 > > > > > > determine the time.
 > > > > > >
 > > > > > > Kaijie
 > > > > > >
 > > > > > >  ---- On Fri, 09 Sep 2022 08:54:34 +0800  Ethan Rose  wrote ---
 > > > > > >  > I believe the flow is:
 > > > > > >  > 1. Datanode notices the container is near full.
 > > > > > >  > 2. Datanode sends close container action to SCM on its next
 > > > > heartbeat.
 > > > > > >  > 3. SCM closes the container and sends a close container command
 > > on
 > > > > the
 > > > > > >  > heartbeat response.
 > > > > > >  > 4. Datanodes get the response and close the container. If it is
 > > a
 > > > > > Ratis
 > > > > > >  > container, the leader will send the close via Ratis.
 > > > > > >  >
 > > > > > >  > There is a "grace period" of sorts between steps 1 and 2, but
 > > this
 > > > > > does
 > > > > > > not
 > > > > > >  > help the situation because SCM does not stop issuing blocks to
 > > > this
 > > > > > >  > container until after step 3. Perhaps some amount of pause
 > > between
 > > > > > > steps 3
 > > > > > >  > and 4 would help, either on the SCM or datanode side. This would
 > > > > > > provide a
 > > > > > >  > "grace period" between when SCM stops allocating blocks for the
 > > > > > > container
 > > > > > >  > and when the container is actually closed. I'm not sure exactly
 > > > how
 > > > > > this
 > > > > > >  > would be implemented in the code given the current setup, but it
 > > > > seems
 > > > > > > like
 > > > > > >  > a simple option we should try before other more complicated
 > > > > solutions.
 > > > > > >  >
 > > > > > >  > Ethan
 > > > > > >
 > > > > > >
 > > > > > >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
For additional commands, e-mail: dev-h...@ozone.apache.org
