Thanks everyone for the comments. If I understand correctly, Pifta's suggestion is similar to the OP doc plus some additional rules. Allowing slight over-allocation is like increasing the container size limit but closing containers a little earlier to avoid the final "bounce" stage.
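For a rough sense of the numbers in Pifta's rule quoted below (over-allocate by 50% of the remaining space or by 2 blocks, whichever is more, with the 5GB limit and 256MB block size discussed in the thread), here is a back-of-the-envelope sketch. The class and method names are made up purely for illustration and are not taken from Ozone code:

    // Worst-case container size under the over-allocation rule discussed below.
    // Constants are the numbers from the thread, not values read from Ozone configs.
    final class OverAllocationBound {
      static final long GB = 1024L * 1024 * 1024;
      static final long CONTAINER_LIMIT = 5 * GB;            // current size limit
      static final long BLOCK_SIZE = 256L * 1024 * 1024;     // default block size

      static long maxPossibleSize(long usedBytes) {
        long remaining = CONTAINER_LIMIT - usedBytes;
        long slack = Math.max(remaining / 2, 2 * BLOCK_SIZE); // 50% of remaining, or 2 blocks
        return usedBytes + remaining + slack;                 // if everything is fully written
      }
      // maxPossibleSize(0)                            == 7.5GB -> the upper bound
      // maxPossibleSize(CONTAINER_LIMIT - BLOCK_SIZE) == 5.5GB -> the typical near-full case
    }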
Here's a brief summary of the current plans. Plan C was in the "P.S." section of my original post, and I'll explain it in more detail below.

A. "Reserve space or limit blocks": may cause slow writes.
B. "Lease or delay before closing": may cause overly large containers.
C. "Lazy allocation until commit": may be complicated to implement.

Details of plan C:

1. Create a special container "Eden" in each pipeline, with no space limit.
2. All new blocks are created in "Eden".
3. When a key is committed, its blocks are moved into a regular container. The move is performed with link and unlink syscalls in the local FS (no data copy; see the sketch below).
4. If any error happens, close the old "Eden" and create a new one. The old "Eden" is retired, and the blocks inside it will be collected later.
5. Only "Eden" and retired "Eden"s are scanned for uncommitted/orphan blocks.

This makes block allocation much simpler, because we will know the size of the blocks before moving them into a regular container. But it is a huge change and might not be easy to implement without breaking compatibility.
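To illustrate step 3, here is a minimal sketch of the link-and-unlink move, assuming each block is a single file on the datanode's local FS and that the "Eden" directory and the target container directory sit on the same filesystem (a hard link requires that). The class name and path layout are hypothetical, not actual datanode code:

    // Sketch of the "move block from Eden to a regular container" step in plan C.
    // Assumes one file per block; names and layout are made up for illustration.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    final class EdenBlockMover {
      /**
       * Moves a block file by creating a hard link in the target container
       * directory and then unlinking the original, so no data is copied.
       */
      static void moveBlock(Path edenBlockFile, Path targetContainerDir) throws IOException {
        Path target = targetContainerDir.resolve(edenBlockFile.getFileName());
        Files.createLink(target, edenBlockFile); // link(2): new name, same inode
        Files.delete(edenBlockFile);             // unlink(2): drop the Eden name
      }
    }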
I'm working on a POC of plan B right now because it's easier to implement. I'll share the results when it's ready.

Thanks
Kaijie

---- On Thu, 15 Sep 2022 01:18:33 +0800 István Fajth wrote ---

> Thank you for the comments Anu,
>
> I think if we choose wisely how we enable over-allocation, we will not go above 5.5GB in almost any case; let me reason about this later on.
>
> I would like to get some more clarity on the lease, as that also seems to be easy, but I feel that it is an orthogonal solution to our current problem and does not solve it any more than a "simple delay" would.
> Let me share why I think so, and please tell me why I am wrong if I do :)
> The current problem as I understand it is that there are some clients allocating blocks around the same time from the same container, and they write a full block, or at least a good amount of data compared to block size; but because there isn't a check whether the container has enough space to write all the allocated blocks, some clients fail to finish the write, as the container gets closed in the middle of writing data to it.
> An arbitrary delay might help somewhat, and a lease that can be renewed for 2 minutes might help somewhat, but a slow client will still fail the write, and we end up with dead data in the containers, which I believe (but I am not sure) will be cleaned eventually by some background process that checks for unreferenced blocks and removes them from containers, while the client has to rewrite the whole data.
>
> In this situation, if we do not calculate how many blocks we may allocate in an open container, there will anyway be cases where a client cannot finish the write in time, while we can end up with arbitrarily large containers, as based on the lease there will be a lot of writes that may finish in time without a constraint on size. (I imagine a scenario where a distributed client job allocates 10k blocks and then starts to write into all of them on a cluster where we have 500 open containers.)
>
> Now if we allow over-allocation by 50% of free space or by 2 blocks (whichever is more), then we can end up having 5.5GB containers in most cases, but we also have an upper bound, as blocks in an empty container can be allocated up to 7.5GB worth of total data if all blocks are fully written.
> The average case would still be 5.5GB, as we do not see this problem often so far, so in most cases data is written fine and the container is closed after, at least based on the problem description in the document shared. Why am I saying this? If things go smoothly and normally, there can be two cases: if 50% of the remaining space is bigger than 2 blocks, we will go beyond 5.5GB with allocation if all blocks allocated last are fully written; if the remaining space is less than 2 blocks, then we will over-allocate by 2 blocks, which if fully written will make the container close with ~5.5GB of data. Now if you consider the number of full blocks compared to the number of partial blocks on a real system, the fact that the largest size of a container is still limited to 7.5GB, and that the average should not go much beyond the original 5GB, I think this is enough to ensure that writes can be finished, while containers do not grow beyond the limit too much.
>
> On the other side, with leases we can have way larger containers in the extreme case, but we do not need to open containers right away when there is a flood of block allocation. (Though sooner or later we need to open new ones.)
>
> So the options seem to be:
> - having containers that might be heavily over-allocated and writes that are limited by the lease timeout, some of which may still fail the write the same way
> - having containers with a size close to our current limit, with the burden of opening containers during a flood of block allocation, and adding metadata and bookkeeping around the number of open allocations in SCM to limit the number of allocations that can go to one container.
>
> The first one seems to be simple, but does not solve the problem, just hides it a bit more due to delays in container closure; the second one seems to solve the problem as it closes the container when all writes have been finished or abandoned, while limiting the number of writes that can go into a container, at the price of having more open containers (for which I still don't know what effects it might have if we have too many containers open).
>
> Pifta
>
> anu engineer anu.engin...@gmail.com> wrote (on Tue, 13 Sep 2022 at 19:24):
>
> > What I am arguing is that if you are going to use time, use it in a way that plays well with a distributed system.
> >
> > That is where leases come from. What you are suggesting is very similar to the original design, which says keep track of allocated blocks (the original design doc that was shared), and your argument is that keeping track can be done in terms of metadata if we are willing to make the size of containers variable. That is absolutely true. I think we will need to really think about whether we can make containers arbitrarily large and what the impact is on the rest of the system.
> >
> > My point was simply -- there is no notion of a "simple delay" in a distributed system -- since we will not be able to reason about it. If you want to create such delays, do it via a lease mechanism -- see my other email in response to Steven. There are subtle correctness issues with arbitrary delays.
> >
> > Personally, I am fine with making containers larger, but I think it really should be done after:
> >
> > 1. Measuring the average container sizes in a long running ozone cluster.
> > 2. Measuring the average time it takes to replicate a container in a cluster.
> > 3. Modelling the recovery time of Ozone node loss. The issue is that if you have clusters with a small number of nodes, Ozone does not do as good a job as HDFS at scattering the blocks.
> >
> > So it really needs some data plotted on a curve with cluster size and scatter, and that will give you a ball-park on recovery times. A long, long time ago we did do an extensive model of this, and kind of convinced ourselves that the 5 GB size is viable. You should do a similar exercise to convince yourself that it is possible to do much larger container sizes.
> >
> > Another option you have is to support splitting closed containers -- let a container grow to any size, and then once it is closed, you do a special operation via SCM to split it into two or so.
> >
> > Best of luck with either approach; the proposal you are making is certainly doable, but it just needs some data analysis.
> >
> > Thanks
> > Anu
> >
> > On Tue, Sep 13, 2022 at 2:55 AM István Fajth fapi...@gmail.com> wrote:
> >
> > > I am not sure I understand every bit of the discussion, but let me add some thoughts as well here.
> > >
> > > Waiting on clients to finish writes, and seeing if the container is full after, sounds like a horrible idea, as with that we will decrease the number of writes that can be done in parallel and make it a function of the available space in the open containers, potentially decreasing write performance.
> > >
> > > The best solution would be the simplest, and someone might have already suggested something like this... but here is my idea for a very simple logic.
> > > Some information to establish the base where I am coming from:
> > > - SCM, as I understand, knows how much space is not already used in a container, and it can track the block allocations for the remaining space, as it is the service that actually allocates a block
> > > - The container is a directory on a DataNode, and hence the 5GB limit is an arbitrary number we came up with as container size, but in theory if it goes up to 10GB then that is just that and there are no further implications of a larger container (except for re-replication and validating the health of the container)
> > > - our block size is also an arbitrary number, 256MB, which is not based on any science or any data from clusters about average file size; we just made it up, doubling the one HDFS used, so that is just the unit of allocation and an upper limit of one unit of uniquely identifiable data.
> > >
> > > Now in SCM, if we hold a counter per container that holds the number of open allocations, and allow over-allocation by 50% of the remaining space in the container, we will end up with somewhat larger containers than 5GB in the average case, but we might get a simple system that ensures that allocated blocks can be finished writing to by the clients. Here is how:
> > > - at allocation, SCM checks if the available space and the number of allocations allow any new block to be allocated in the container; if not, it checks another open container, and if no open containers can allocate more blocks it creates a new container. (A block can be allocated if the theoretical size of the container, considering the open allocations, does not exceed the 5GB soft limit by more than 50% of the remaining space in the container.)
> > > - at the final putBlock, the DN reports in the ICR or in the heartbeat that the block write is finished in container x, so that SCM can decrement the counter.
> > > - if a block is abandoned, (afaik) we have a routine that deletes the half-written block data; we can also ask SCM to decrement the counter from this routine while freeing up the space.
> > > - if a block is abandoned for over a week (the token lifetime), we can decrement the counter (but I believe this should happen with abandoned block detection).
> > > - if the container's open allocations go down enough that the container size can no longer exceed the 5GB soft limit, we can open up the container for allocations again in SCM
> > > - if all the open allocations are closed (the counter is 0), and the container exceeds the 5GB soft limit, we close the container from SCM
> > >
> > > Seems like this logic is simple enough; it does not require a lease, as we do not have one today either and we somehow (as far as I know) handle the abandoned blocks. Also it does not require any big changes in the protocols; the only thing needed is to store the counter within SCM's container data structure, and to report from the DN to SCM that the block had the final putBlock call.
> > >
> > > In case of a DDoS-style block allocation, we will end up with enough containers to hold the data, and we can have 7.5GB containers in the worst case, while in the general case we will most likely have around 5GB containers.
> > > Of course, if needed, we can introduce a hard limit for the number of open containers (pipelines), and if we exceed that then the client would need to wait to allocate more... (some form of QoS mentioned in the doc), but I am not really a fan of this idea, as it introduces possibly large delays not just for the rogue client. Do we know what the consequence is of having, say, 1000 open containers on a 10 node cluster? (As that would mean ~5TB of open allocation...)
> > >
> > > Please let me know if this idea does not account for some problems. I just quickly drafted it to come up with something fairly simple that I believe helps to solve the problem with the least amount of disruption, but I possibly did not think through all the possible failure scenarios, and haven't checked what is actually in the code at the moment; I just wanted to share my idea quickly to validate it with folks who know more about the actual code in the codebase, to see if it is a feasible alternative.
> > >
> > > Pifta
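A minimal sketch of the counter-based check Pifta describes above, on the SCM side. The rule and the numbers are the ones from his mail; the class, field, and method names are hypothetical and do not correspond to actual Ozone code:

    // Illustrative only: tracks open block allocations per container and applies the
    // "theoretical size must not exceed the soft limit by more than 50% of the
    // remaining space" rule from the proposal above.
    final class ContainerAllocationTracker {
      private static final long GB = 1024L * 1024 * 1024;
      private static final long SOFT_LIMIT = 5 * GB;
      private static final long BLOCK_SIZE = 256L * 1024 * 1024;

      private long usedBytes;        // bytes already written to the container
      private int openAllocations;   // blocks allocated but not yet finalized by putBlock

      /** Returns true and records the allocation if one more block may be allocated. */
      synchronized boolean tryAllocateBlock() {
        long remaining = Math.max(SOFT_LIMIT - usedBytes, 0);
        long theoreticalSize = usedBytes + (long) (openAllocations + 1) * BLOCK_SIZE;
        if (theoreticalSize > SOFT_LIMIT + remaining / 2) {
          return false; // caller tries another open container or creates a new one
        }
        openAllocations++;
        return true;
      }

      /** Called when the DN reports the final putBlock, or when an abandoned block
       *  is cleaned up (then writtenBytes is whatever was actually kept, possibly 0). */
      synchronized void blockFinalized(long writtenBytes) {
        openAllocations--;
        usedBytes += writtenBytes;
      }
    }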
> > > anu engineer anu.engin...@gmail.com> wrote (on Fri, 9 Sep 2022 at 19:10):
> > >
> > > > Extending the same thought from Steven: if you are going to do a small delay, it is better to do it via a lease.
> > > >
> > > > So SCM could offer a lease for 60 seconds, with a provision to reacquire the lease one more time. This does mean that a single container inside the data node technically could become larger than 5GB (but that is possible even today).
> > > >
> > > > I do think a lease or a timeout based approach (as suggested by Steven) might be easier than pre-allocating blocks.
> > > >
> > > > Thanks
> > > > Anu
> > > >
> > > > On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell sodonn...@cloudera.com.invalid> wrote:
> > > >
> > > > > > 4. Datanode waits until there are no write commands to this container, then closes it.
> > > > >
> > > > > This could be done on SCM, with a simple delay, i.e. hold back the close commands for a "normal" close for some configurable amount of time. E.g. if we hold for 60 seconds, it is likely almost all blocks will get written. If a very small number fail, it is likely OK.
> > > > >
> > > > > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen c...@apache.org> wrote:
> > > > >
> > > > > > Thanks Ethan. Yes, this could be a simpler solution. The main idea is allowing the container size limit to be exceeded, to ensure all allocated blocks can be finished.
> > > > > >
> > > > > > We can change it to something like this:
> > > > > > 1. Datanode notices the container is near full.
> > > > > > 2. Datanode sends a close container action to SCM immediately.
> > > > > > 3. SCM closes the container and stops allocating new blocks in it.
> > > > > > 4. Datanode waits until there are no write commands to this container, then closes it.
> > > > > >
> > > > > > It's still okay to wait for the next heartbeat in step 2.
> > > > > > Step 4 is a little bit tricky; we need a lease or timeout to determine the time.
> > > > > >
> > > > > > Kaijie
> > > > > >
> > > > > > ---- On Fri, 09 Sep 2022 08:54:34 +0800 Ethan Rose wrote ---
> > > > > >
> > > > > > > I believe the flow is:
> > > > > > > 1. Datanode notices the container is near full.
> > > > > > > 2. Datanode sends close container action to SCM on its next heartbeat.
> > > > > > > 3. SCM closes the container and sends a close container command on the heartbeat response.
> > > > > > > 4. Datanodes get the response and close the container. If it is a Ratis container, the leader will send the close via Ratis.
> > > > > > >
> > > > > > > There is a "grace period" of sorts between steps 1 and 2, but this does not help the situation because SCM does not stop issuing blocks to this container until after step 3. Perhaps some amount of pause between steps 3 and 4 would help, either on the SCM or datanode side. This would provide a "grace period" between when SCM stops allocating blocks for the container and when the container is actually closed. I'm not sure exactly how this would be implemented in the code given the current setup, but it seems like a simple option we should try before other more complicated solutions.
> > > > > > >
> > > > > > > Ethan
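Since plan B builds on the grace-period idea from Stephen, Anu, and Ethan above (SCM stops allocating into the container first, then holds the actual close command for a bounded time), here is a minimal sketch of how such a delayed close could look on the SCM side. It only illustrates the idea under discussion: the names are hypothetical, the 60-second value and the single renewal are taken from the thread, and none of this mirrors real SCM code:

    // Sketch only: after SCM decides a container is full, it stops allocating new
    // blocks immediately but defers sending the close command for a grace period
    // (a lease that can be renewed once, per the suggestion above).
    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class DelayedContainerCloser {
      private static final Duration GRACE = Duration.ofSeconds(60);
      private static final int MAX_RENEWALS = 1;

      private static final class PendingClose {
        Instant deadline = Instant.now().plus(GRACE);
        int renewals;
      }

      private final Map<Long, PendingClose> pending = new ConcurrentHashMap<>();

      /** Step 3: stop allocations and start the grace period instead of closing right away. */
      void markFull(long containerId) {
        pending.putIfAbsent(containerId, new PendingClose());
        // block allocation for containerId would be disabled here
      }

      /** A datanode still writing may ask for one extension of the lease. */
      boolean renewLease(long containerId) {
        PendingClose p = pending.get(containerId);
        if (p == null || p.renewals >= MAX_RENEWALS) {
          return false;
        }
        p.renewals++;
        p.deadline = Instant.now().plus(GRACE);
        return true;
      }

      /** Step 4: polled periodically; returns true once the close command should be sent. */
      boolean shouldSendClose(long containerId) {
        PendingClose p = pending.get(containerId);
        return p != null && Instant.now().isAfter(p.deadline);
      }
    }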