Hi Stephen,
Thank you for the comments. My earlier comment was very brief, and I did not explain things in sufficient detail. Your comments made me realize that, and I have taken the opportunity to clarify what I was thinking. Hopefully, it is much clearer now.

1. The lease is a simple lock with a time-out. The datanode will not close the container while there is a "please keep it open" lock, that is, a lease.
2. A client cannot extend the lease indefinitely. The lease is capped at two attempts (the initial lease plus one extension), and I am proposing something like 60 seconds for each attempt.

Now let me discuss each of the concerns.

>> Adding a lease will be quite complex, and may not be needed if the simple delay works.

That is correct if the "simple delay" works. The problem is that a "simple delay" is never simple. You will not be able to reason about the correctness of the system. With a simple delay, we would not be able to tell whether the client did a write, whether someone is waiting, or worse, whether the delay is always introduced for no purpose, because all we see is a simple delay, with no information about the actual writes or clients.

So it would help if we had some way to introduce a "deterministic delay" instead of a simple delay. This delay allows us to reason about the state from the SCM point of view, the client point of view, or the datanode point of view. Overall, that is much better for the debuggability of the system. If there is a write failure, it does not depend on some arbitrary notion of delay. Instead, the client comes and says: I want to write, SCM has allocated a block for me, and SCM gave me (the client) 60 seconds to write my data.

Let us say there is a failure or a bug. With a lease-based system, we will be able to reason about the mistakes we made in introducing the notion of a "simple delay." To be completely honest, I am just proposing a formal method that is easy to reason about and debug, and that delivers the idea you are offering: "the simple delay."

If the client runs out of that time, the client can request a lease extension one more time. However, it is still deterministic since SCM can decide whether the lease can be issued, and we are arguing that the client should extend the lease only once (this is not a strictly necessary condition). I am arguing that "delay" is a good notion, but it should not be arbitrary, and it should not be something that we cannot see and measure. If you are going to use time, use it by leveraging a well-known pattern: leases. Now I can log the lease info on a client if needed, and if I run into bugs, I have some way of reasoning about them.

>> Then on SCM you will need to track all open blocks, lease expiry due to crashed clients etc

I don't understand this assertion. Perhaps there are things in Ozone now that I have no clue about. It has been a while. But let us play this from first principles.

1. You issue a lease; this is a signed lease that says: "Client, you have 60 seconds from 11:00:00."
2. The client attempts to write and sends this lease.
3. The client crashes -- why do we care? The lease is gone at the time-out anyway. We use leases in the first place (as opposed to a distributed lock manager) because we don't need to care about client crashes.
4. Why does SCM need to track anything? The datanode has to honor the request if the client has a block write lease and the write arrives within time.
5. If we know that any lease is issued for 60 seconds for a block write, all the datanode needs to do before closing a container is tell SCM to stop giving block write leases for that container and wait for 121 seconds (two 60-second lease periods plus 1 second for extra safety). This guarantees that any outstanding client with a lease has had sufficient time to write.

Please let me know if you imagine this differently. I am struggling to understand where the state is coming from; the whole point of leases is that it is easy to forget state, and we can still get to a consistent state if we wait.

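The five steps above can be sketched roughly as follows. This is only an illustration in Python, not Ozone code. The names (`Scm`, `Lease`, `datanode_accepts_write`), the HMAC signing with a key shared between SCM and the datanode, and the rule that an already-issued lease may still be extended once while a container is closing are all assumptions made for the example. Time is passed in explicitly so the behavior is easy to follow.

```python
import hashlib
import hmac
from dataclasses import dataclass

LEASE_SECONDS = 60         # per-lease duration proposed in this thread
MAX_LEASE_PERIODS = 2      # initial lease plus one extension
SAFETY_MARGIN_SECONDS = 1  # "1 second for extra safety"
# Wait before closing a container: 2 * 60 + 1 = 121 seconds.
CLOSE_WAIT_SECONDS = MAX_LEASE_PERIODS * LEASE_SECONDS + SAFETY_MARGIN_SECONDS


@dataclass(frozen=True)
class Lease:
    """Hypothetical signed lease: 'you have 60 seconds from issued_at'."""
    container_id: int
    block_id: int
    issued_at: float  # seconds, as seen by SCM
    period: int       # 1 = initial lease, 2 = the single extension
    signature: bytes

    @property
    def expires_at(self) -> float:
        return self.issued_at + LEASE_SECONDS


def _sign(key, container_id, block_id, issued_at, period):
    msg = f"{container_id}:{block_id}:{issued_at}:{period}".encode()
    return hmac.new(key, msg, hashlib.sha256).digest()


def _signature_ok(key, lease):
    expected = _sign(key, lease.container_id, lease.block_id,
                     lease.issued_at, lease.period)
    return hmac.compare_digest(expected, lease.signature)


class Scm:
    """Hypothetical SCM side. It keeps no per-lease state, only the set of
    containers for which it must stop issuing new leases."""

    def __init__(self, key):
        self._key = key
        self._closing = set()

    def _new_lease(self, container_id, block_id, now, period):
        sig = _sign(self._key, container_id, block_id, now, period)
        return Lease(container_id, block_id, now, period, sig)

    def issue_lease(self, container_id, block_id, now):
        if container_id in self._closing:
            return None  # no new leases once a close was requested
        return self._new_lease(container_id, block_id, now, 1)

    def extend_lease(self, previous, now):
        # Assumption: an extension must be requested while the previous
        # lease is still valid, which bounds the total time at 120 seconds.
        if not _signature_ok(self._key, previous):
            return None
        if previous.period >= MAX_LEASE_PERIODS or now > previous.expires_at:
            return None
        return self._new_lease(previous.container_id, previous.block_id,
                               now, previous.period + 1)

    def request_close(self, container_id):
        """Stop issuing leases; return how long the datanode should wait."""
        self._closing.add(container_id)
        return CLOSE_WAIT_SECONDS


def datanode_accepts_write(key, lease, now):
    """Datanode honors a write if the lease is genuine and not yet expired."""
    return _signature_ok(key, lease) and now <= lease.expires_at
```

In this sketch a crashed client needs no cleanup: its lease simply stops being accepted after `expires_at`, and a datanode that waits `CLOSE_WAIT_SECONDS` after `request_close` has outlived every lease that could have been issued before the request.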
Now let us talk about real disasters, say SCM crashes completely. (Assume there is no HA for the time being, to make it interesting as a technical exercise, or to make the point that leases need not be replicated.) Now, if any request comes from any datanode saying "I am going to close the container," the standard response works: "Please wait 121 seconds and then close." From that moment onwards, SCM remembers that it cannot issue any leases for this container, which it has just learned about.

Now, what about a notice "I want to close container X" that arrived a few seconds *before the crash*? SCM, after recovery, will not issue any leases for 120 seconds. *This is the most important part.* By then, every lease issued before the crash has expired. That way, the whole write pipeline is stalled, and we are in a safe place. This theoretical discussion illustrates how leases would work and create no additional state requirements.

>> You would also need to decide what to do with a block that is kept open for many hours or days by a persistent client slowly writing to one or more files - we cannot keep the container open indefinitely.

Not possible at all. A lease is not infinite. It is NOT a lock. By definition, a lease expires after "X" amount of time. In this case, I am picking a number like 60 seconds. So, if we create a model in which a client can hold a lease for at most two periods (the initial lease plus one extension), then the maximum time limit would be 120 seconds for a block write. We can pick whatever number looks reasonable.

Please let me know if this makes sense.

Thanks
Anu

On Tue, Sep 13, 2022 at 2:33 AM Stephen O'Donnell <sodonn...@cloudera.com.invalid> wrote:

> I would recommend a proof of concept on your cluster, doing the simplest
> thing first, which is holding the close commands in SCM for a configurable
> delay, and see if you can alleviate the problem that way.
>
> Adding a lease will be quite complex, and may not be needed if the simple
> delay works.
> You will need a heartbeat thread from client to OM, then relay that from
> OM to SCM, placing more front end load onto SCM. Then on SCM you will need
> to track all open blocks, lease expiry due to crashed clients etc. You
> would also need to decide what to do with a block that is kept open for
> many hours or days by a persistent client slowly writing to one or more
> files - we cannot keep the container open indefinitely.
>
> The front end load on SCM here is probably minimal, but there is work under
> way to cache the container locations in OM partly to speed up KeyInfo
> calls, but also to prevent increasing client load increasing load on SCM.
>
> On Tue, Sep 13, 2022 at 9:19 AM Kaijie Chen <c...@apache.org> wrote:
>
> > Thanks Anu and Steven for the suggestion. Granting a lease to client
> > sounds like a more controllable way.
> >
> > However, if I understand correctly, clients don't talk to SCM directly.
> > Does it mean OM has to relay the renew lease request to SCM?
> > Is there a better way to implement it?
> >
> > Regards
> > Kaijie
> >
> > ---- On Sat, 10 Sep 2022 01:10:28 +0800 anu engineer wrote ---
> > > Extending the same thought from Steven. If you are going to do a small
> > > delay, it is better to do it via a Lease.
> > >
> > > So SCM could offer a lease for 60 seconds, with a provision to reacquire
> > > the lease one more time.
> > > This does mean that a single container inside the data node technically
> > > could become larger than 5GB (but that is possible even today).
> > >
> > > I do think a lease or a timeout based approach (as suggested by Steven)
> > > might be easier than pre-allocating blocks.
> > >
> > > Thanks
> > > Anu