Ivan, I would prefer the forward-only implementation, even knowing that it allows false-positive check results.
Why I think so:
1) From my experience, whenever we have a future waiting for a reply, we have to take failover into consideration, and failover implementations are usually way more complex than the initial algorithm itself. The forward-only version doesn't need any failover implementation, because lost probes are expected (a probe that got lost simply didn't find a deadlock).
2) A simple, lightweight implementation of the feature is a really good starting point: it can be extended if needed, but quite often such an extension turns out not to be needed at all.
3) Any implementation allows false-positive results. We are working with a distributed system where a bunch of processes happen at the same time; for example, a concurrent rollback on timeout may resolve a deadlock while a probe is still in flight, and then we get a rollback on deadlock on top of it.
4) The described case (where a false-positive result is probable) is not just unusual but potentially extremely rare. I don't think we should solve it, since it doesn't break the grid.

Regards,
Igor

Tue, Dec 18, 2018 at 17:57, Павлухин Иван <vololo...@gmail.com>:
>
> Hi folks,
>
> During the implementation of an edge-chasing deadlock detection algorithm in the scope of [1], it became clear that we basically have 2 options for the "chasing" strategy. I will use the term Near when GridNearTxLocal is assumed and Dht when GridDhtTxLocal (a tx which updates a primary partition) is assumed. So, here are the 2 mentioned strategies:
> 1. Send an initial probe when a Dht tx blocks waiting for another tx to release a key. The initial probe is sent from the waiting Dht tx to the Near tx holding the lock. If the receiving Near tx is blocked as well, it relays the probe to the Dht tx it awaits a response from. So, the probe travels like Dht -> Near -> Dht -> Near -> ... Let's call this approach "forward-only".
> 2. Send probes only between Near transactions. This approach requires an additional request-response call which a Near tx issues to a Dht node to check whether the tx sending the request is waiting for another tx. The call returns the identifier of the Near tx blocking the requesting tx (if the requesting tx is in fact blocked). The waiting Near tx then relays a probe to that blocking Near tx. Let's call this approach "lock-checking".
>
> I would like to define several points to consider while comparing the approaches (please point out more if I missed something):
> 1. Correctness, especially regarding reporting false deadlocks.
> 2. Extensibility. Rollback to savepoint and generalization for classic transactions should be considered.
> 3. Messaging overhead.
> 4. Simplicity of implementation.
>
> You can find an implementation of the "lock-checking" approach in the PR attached to the ticket [1]. Currently it works correctly only for transactions obeying strict two-phase locking (2PL). It is fairly easy to adapt the PR to implement the "forward-only" approach, which will also work correctly under the conditions of 2PL. The "lock-checking" algorithm uses 1.5 times more messages than "forward-only". The implementation of the "forward-only" algorithm is simpler in the sense that it does not require introducing lock-wait-check messages and a future, but to me that does not look like a big deal. I think the implementation efforts for both approaches are almost equal so far.
>
> But the cornerstone here is extensibility. Let's imagine that we are going to implement rollback to savepoint. One suggested way to extend the "forward-only" approach for that case is invalidating a sent probe. Basically, a tx that initiated a deadlock check assigns a unique id to the probe, and this probe id is invalidated when the tx wakes up from waiting. If that tx receives back a probe with an invalidated id, it simply discards it. If the id is not invalidated, it means a deadlock.
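> To make the idea concrete, a rough sketch in Java (hypothetical names; this is not the code from the PR):
>
> import java.util.UUID;
>
> // A probe carries the initiator tx id and a probe id which the
> // initiator invalidates when it wakes up from waiting.
> final class DeadlockProbe {
>     final UUID initiatorTx;
>     final long probeId;
>
>     DeadlockProbe(UUID initiatorTx, long probeId) {
>         this.initiatorTx = initiatorTx;
>         this.probeId = probeId;
>     }
> }
>
> interface TxView {
>     UUID id();
>     boolean isWaiting();                    // Blocked waiting for a lock?
>     boolean probeInvalidated(long probeId); // Woke up since the probe was sent?
>     UUID blockerNodeId();                   // Node of the tx we are waiting for.
> }
>
> abstract class ForwardOnlyDetector {
>     // Called when a probe arrives addressed to a local tx.
>     void onProbe(TxView tx, DeadlockProbe probe) {
>         if (tx.id().equals(probe.initiatorTx)) {
>             // The probe returned to its initiator: report a deadlock,
>             // unless the probe id was invalidated in the meantime.
>             if (!tx.probeInvalidated(probe.probeId))
>                 rollbackOnDeadlock(tx);
>         }
>         else if (tx.isWaiting())
>             relay(tx.blockerNodeId(), probe); // Dht -> Near -> Dht -> ...
>         // Otherwise the probe silently dies: no deadlock along this path.
>     }
>
>     abstract void relay(UUID nodeId, DeadlockProbe probe);
>     abstract void rollbackOnDeadlock(TxView tx);
> }
>
> Note that a probe arriving at a non-waiting tx simply dies, so the scheme needs no reply messages and no future.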
> But a false deadlock can still be detected; I will provide an example where it does not work correctly.
>
> Please note that our transactions can request multiple locks simultaneously (the so-called AND model). Also, rollback to savepoint releases the locks obtained after the savepoint was created. Let's use the following notation: "*" means a savepoint, "^" means rollback to the previous savepoint, "!" means acquiring a lock, "?" means waiting for a lock, and "&" means requesting multiple locks simultaneously. See the following execution scenario for 3 transactions and 3 keys:
>
> TA      | TB | TC
>         |    | *
>         | 1! | 2!
> 1? & 3! |    |
>         | 2? |
>         |    | ^
>         |    | 3!
>
> We can notice that the suggested scenario does not introduce a deadlock, because locks are requested in a consistent order by all transactions. But at different moments the waiting relations TA w TB, TB w TC and TC w TA exist, so we suspect that a false deadlock could be detected here. One can also notice that the relation TC w TA is established only after TB w TC is destroyed. The messages on the timeline diagram [2] demonstrate how the "forward-only with probe invalidation" approach can detect a false deadlock here. On the picture, L means "lock", U "unlock" and W "wait"; the red cross means reporting a false deadlock. The intuition is that while a probe is in flight, one wait-for edge can be destroyed (TB w TC) and another one created (TC w TA). So, the approach can report false deadlocks. Also note that a false deadlock can be found even when the locking order is consistent!
>
> After a while I will describe how "lock-checking" can be adapted to support rollback to savepoint. A disclaimer here is that it will require even more messages.
> ----
>
> I think that we can already start a discussion. Actually, while false deadlocks can be surprising, we can perhaps tolerate them, because our transaction implementation assumes that transactions can be rolled back by the system. The "forward-only" approach requires fewer messages, but in big-O terms both approaches have the same complexity. Unfortunately, both the correctness impact and the complexity impact are hardly predictable for real workloads. In an ideal world, thorough testing could give the answers.
>
> Please, share your vision.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9322
> [2] https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg
>
> Mon, Nov 26, 2018 at 15:27, Павлухин Иван <vololo...@gmail.com>:
> >
> > Vladimir,
> >
> > I think it might work. So, if nobody minds, I can start prototyping the edge-chasing approach.
> >
> > Mon, Nov 26, 2018 at 14:32, Vladimir Ozerov <voze...@gridgain.com>:
> > >
> > > Ivan,
> > >
> > > The problem is that in our system a transaction may wait for N locks simultaneously. This may form complex graphs which spread across many nodes. Now consider that I have a deadlock between 4 nodes: A -> B -> *C* -> D -> A. I've sent a message from A which never reached D, because C failed. Well, maybe that is good, because some transactions will be rolled back. But when they are rolled back, other cycles from pending multi-way deadlocks may form again, e.g. A -> B -> *E* -> D -> A. So the question is whether we will be able to detect them reliably. I think we may use the following assumptions:
> > > 1) If a data node fails, the relevant transactions will be rolled back.
> > > 2) It means that some other transactions will make at least partial progress with lock acquisition.
> > >
> > > So, we can introduce a kind of counter which is advanced on every acquired lock on a given node, and define a rule that a deadlock check for the given pair (NODE_ID, COUNTER) should be requested at most once. This way, even if a specific deadlock check request is lost due to a node failure and another deadlock has formed, some other node will re-trigger the deadlock check request sooner or later, as its counter advances.
> > >
> > > Makes sense?
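> > > To illustrate the rule, a tiny sketch (made-up names; races around the map are ignored for brevity):
> > >
> > > import java.util.Map;
> > > import java.util.UUID;
> > > import java.util.concurrent.ConcurrentHashMap;
> > > import java.util.concurrent.atomic.AtomicLong;
> > >
> > > class DeadlockCheckDedup {
> > >     // Advanced on every lock acquired on this node.
> > >     private final AtomicLong lockProgress = new AtomicLong();
> > >
> > >     // Highest counter for which a check was already requested, per node.
> > >     private final Map<UUID, Long> requested = new ConcurrentHashMap<>();
> > >
> > >     long onLockAcquired() {
> > >         return lockProgress.incrementAndGet();
> > >     }
> > >
> > >     // True if no check was requested yet for (nodeId, counter). A newer
> > >     // counter value re-enables the check, so a check request lost due to
> > >     // a node failure is re-triggered once locks make progress again.
> > >     boolean shouldRequestCheck(UUID nodeId, long counter) {
> > >         Long prev = requested.get(nodeId);
> > >         if (prev != null && prev >= counter)
> > >             return false;
> > >         requested.put(nodeId, counter);
> > >         return true;
> > >     }
> > > }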
> > > On Sat, Nov 24, 2018 at 5:40 PM Павлухин Иван <vololo...@gmail.com> wrote:
> > > >
> > > > Hi Vladimir,
> > > >
> > > > Regarding fault tolerance, it seems that it is not a problem for edge-chasing approaches. A found deadlock is identified by a message returning to the detection initiator and carrying the initiator's identifier. If there is no deadlock, then no such message will arrive. And if some node containing a deadlocked transaction fails, the failure itself breaks the deadlock. Am I missing something?
> > > >
> > > > About messaging overhead: indeed, it looks like edge-chasing approaches can bring redundant messages. Perhaps we can borrow some optimization ideas from [1]. I am also thinking about a timeout before starting a deadlock detection.
> > > >
> > > > Thoughts about adaptive timeouts lead me to some thoughts about monitoring. I assume that long waits for locks signal suboptimal performance of the system. I think it would be great to have means of monitoring transaction contention. It could help users improve their workloads, and it could also help us build some reasoning about how contention correlates with deadlock detection overhead.
> > > >
> > > > [1] http://mentallandscape.com/Papers_podc84.pdf
> > > >
> > > > Sun, Nov 18, 2018 at 10:52, Vladimir Ozerov <voze...@gridgain.com>:
> > > > >
> > > > > Hi Ivan,
> > > > >
> > > > > Great analysis. I agree that edge-chasing looks like the better candidate. First, it will be applicable to both normal and MVCC transactions. Second, in MVCC we will probably also need to release some locks when doing rollbacks. What we should think about is failover - what if a node which is in the middle of a graph fails? We need to craft fault tolerance carefully.
> > > > >
> > > > > Another critically important point is how to trigger deadlock detection. My concern about edge-chasing was not latency, but the number of messages which travel between nodes. A good algorithm must produce little to no messaging in case of normal contention while still providing protection in case of real deadlocks. So how would we trigger graph traversal for a given transaction? Maybe we can start with a hard timeout and then employ a kind of adaptive increment in case a high rate of false positives is observed. Maybe something else.
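> > > > > For example, something along these lines (all numbers are made up, only to illustrate the adaptive idea):
> > > > >
> > > > > // Detection for a waiting tx is triggered after a hard timeout; the
> > > > > // timeout backs off while checks keep finding no deadlock.
> > > > > class AdaptiveDetectionTrigger {
> > > > >     private volatile long timeoutMs = 500; // Initial hard timeout.
> > > > >
> > > > >     long currentTimeoutMs() {
> > > > >         return timeoutMs;
> > > > >     }
> > > > >
> > > > >     void onCheckFinished(boolean deadlockFound) {
> > > > >         if (deadlockFound)
> > > > >             timeoutMs = Math.max(500, timeoutMs / 2);    // Checks pay off.
> > > > >         else
> > > > >             timeoutMs = Math.min(30_000, timeoutMs * 2); // Mostly false positives.
> > > > >     }
> > > > > }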
> > > > > On Sun, Nov 18, 2018 at 10:21 AM Павлухин Иван <vololo...@gmail.com> wrote:
> > > > > >
> > > > > > Vladimir,
> > > > > >
> > > > > > Thanks for the articles! I studied them and a couple of others, and I would like to share the knowledge I found.
> > > > > >
> > > > > > BACKGROUND
> > > > > > First of all, our algorithm implemented in org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection is not an edge-chasing algorithm. In essence, a lock-waiting transaction site polls the nodes responsible for the keys of interest one by one and reconstructs the global wait-for graph (WFG) locally.
> > > > > > The centralized algorithm discussed in this thread looks similar to the algorithms proposed by Ho [1]. The simplest of them (two-phase) reports false deadlocks when unlocking before transaction finish is permitted, so it seems that it only works when strict two-phase locking (2PL) is obeyed. Another one (called one-phase) requires tracking all keys locked by each transaction, which is not desirable for MVCC transactions.
> > > > > > The aforementioned edge-chasing algorithm by Chandy, Misra and Haas [2] is proven to work even when 2PL is not obeyed.
> > > > > > Also, the performance target is not clear to me. It looks like centralized approaches can use fewer messages and provide lower latency than distributed ones, but I would like to understand the orders of magnitude of the latency and the messaging overhead. Many of the algorithms are described in [3], and some performance details are mentioned there. It is said that "As per [GRAY81], most deadlocks involve two to four transactions." I see one more argument making deadlock detection latency not so important: we can consider deadlocks an unavoidable situation, but even a lightning-fast detector will not save a database from performance degradation in case of tons of deadlocks.
> > > > > > What is also interesting, Google Cloud Spanner employs simple "wound-wait" deadlock prevention [4] instead of any detection algorithm.
> > > > > >
> > > > > > DISCUSSION
> > > > > > So, I see the following options:
> > > > > > 1. An edge-chasing algorithm like Chandy, Misra and Haas [2].
> > > > > > 2. The centralized two-phase algorithm described by Ho [1], with the restriction that transactions must obey 2PL.
> > > > > > Personally, I lean towards the edge-chasing approach because it looks simple and universal. Restricting ourselves to 2PL would restrict us in features which could be implemented in the future (e.g. transaction savepoints), wouldn't it?
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > Ivan
> > > > > >
> > > > > > [1] https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
> > > > > > [2] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > [3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
> > > > > > [4] https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance
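> > > > > > P.S. To give a feeling of how small the core of [2] is, here is a sketch for the AND model (simplified: the "dependent" bookkeeping from the paper, which suppresses duplicate probes, is omitted):
> > > > > >
> > > > > > import java.util.Set;
> > > > > >
> > > > > > // A probe is (initiator, sender, receiver); the initiator declares
> > > > > > // a deadlock when its own probe comes back to it.
> > > > > > record Probe(int initiator, int sender, int receiver) {}
> > > > > >
> > > > > > interface WaitForState {
> > > > > >     boolean isBlocked(int process);
> > > > > >     Set<Integer> waitsFor(int process); // AND model: blocked on all of them.
> > > > > > }
> > > > > >
> > > > > > abstract class EdgeChasing {
> > > > > >     final WaitForState state;
> > > > > >
> > > > > >     EdgeChasing(WaitForState state) { this.state = state; }
> > > > > >
> > > > > >     // Called on the node of p.receiver() when a probe arrives.
> > > > > >     void onProbe(Probe p) {
> > > > > >         if (p.receiver() == p.initiator()) {
> > > > > >             declareDeadlock(p.initiator()); // The probe closed a cycle.
> > > > > >             return;
> > > > > >         }
> > > > > >         if (state.isBlocked(p.receiver()))
> > > > > >             for (int m : state.waitsFor(p.receiver()))
> > > > > >                 send(new Probe(p.initiator(), p.receiver(), m));
> > > > > >         // A non-blocked receiver simply drops the probe.
> > > > > >     }
> > > > > >
> > > > > >     abstract void send(Probe p);
> > > > > >     abstract void declareDeadlock(int process);
> > > > > > }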
> > > > > > Wed, Nov 14, 2018 at 22:13, Vladimir Ozerov <voze...@gridgain.com>:
> > > > > > >
> > > > > > > Ivan,
> > > > > > >
> > > > > > > This is an interesting question. I think we should spend some time on formal verification of whether this algorithm works or not. Several articles you may use as a starting point: [1], [2]. From what I understand, Ignite falls into the "AND" model, and the currently implemented algorithm is a variation of the "edge-chasing" approach as per Chandy, Misra and Haas [3], which is *proven to be correct* in that it both detects deadlocks when they are present and does not produce false positives. But it might be too heavy for a system under contention.
> > > > > > >
> > > > > > > We need to search for a formal proof of correctness of the proposed algorithm. This area is already researched thoroughly enough, so we should be able to find an answer quickly.
> > > > > > >
> > > > > > > Vladimir.
> > > > > > >
> > > > > > > [1] http://www.cse.scu.edu/~jholliday/dd_9_16.htm
> > > > > > > [2] https://www.cs.uic.edu/~ajayk/Chapter10.pdf
> > > > > > > [3] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > >
> > > > > > > On Wed, Nov 14, 2018 at 6:55 PM Павлухин Иван <vololo...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Next part as promised. A working item for me is a deadlock detector for MVCC transactions [1]. The message is structured in 2 parts. The first is an analysis of the current state of affairs and the possible options to go with; the second is a proposed option. The first part is going to be not so short, so some might prefer to skip it.
> > > > > > > >
> > > > > > > > ANALYSIS
> > > > > > > > The immediate question is "why can't we use the existing deadlock detector?". The differences between the classic and MVCC transaction implementations are the answer. Currently a collection of IgniteTxEntry is used for detection, but such a collection is not maintained for MVCC transactions, so it will not work out of the box.
> > > > > > > > Also, it looks like the current distributed iterative approach cannot be low-latency in the worst case, because it may do many network requests sequentially.
> > > > > > > > So, what options do we have? Generally, we should choose between centralized and distributed approaches. By a centralized approach I mean a dedicated deadlock detector located on a single node. In the centralized approach we can face difficulties related to failover, as the node running the deadlock detector can fail. In the distributed approach, extra network messaging overhead can strike, because different nodes participating in a deadlock can start detection independently and send redundant messages. I see some aspects which matter when choosing an implementation. Here they are, with the approach that is (roughly speaking) better in parentheses:
> > > > > > > > * Detection latency (centralized).
> > > > > > > > * Messaging overhead (centralized).
> > > > > > > > * Failover (distributed).
> > > > > > > > Also, having a park of deadlock detectors does not sound very good. I hope that it is possible to develop a common solution suitable for both kinds of transactions. I suggest piloting the new solution with MVCC and then adopting it for classic transactions.
> > > > > > > >
> > > > > > > > PROPOSAL
> > > > > > > > Actually, I propose to start with the centralized algorithm described by Vladimir at the beginning of the thread. I will try to outline its main points (a minimal sketch of the detector side follows the list):
> > > > > > > > 1. A single deadlock detector exists in the cluster, which maintains the transaction wait-for graph (WFG).
> > > > > > > > 2. Each cluster node sends and invalidates wait-for edges at the detector.
> > > > > > > > 3. The detector periodically searches for cycles in the WFG and chooses and aborts a victim transaction if a cycle is found.
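> > > > > > > > A minimal sketch of the detector side (hypothetical names, just to make the proposal concrete):
> > > > > > > >
> > > > > > > > import java.util.*;
> > > > > > > >
> > > > > > > > // The detector keeps the reported wait-for edges and periodically
> > > > > > > > // runs a DFS looking for a cycle; any tx on a found cycle is a
> > > > > > > > // victim candidate.
> > > > > > > > class WaitForGraph {
> > > > > > > >     private final Map<UUID, Set<UUID>> waitsFor = new HashMap<>();
> > > > > > > >
> > > > > > > >     synchronized void addEdge(UUID waiter, UUID holder) {
> > > > > > > >         waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     synchronized void removeEdge(UUID waiter, UUID holder) {
> > > > > > > >         Set<UUID> out = waitsFor.get(waiter);
> > > > > > > >         if (out != null)
> > > > > > > >             out.remove(holder);
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     // Returns some tx on a cycle (a victim candidate) or null.
> > > > > > > >     synchronized UUID findVictim() {
> > > > > > > >         Set<UUID> explored = new HashSet<>();
> > > > > > > >         for (UUID start : waitsFor.keySet()) {
> > > > > > > >             UUID victim = dfs(start, explored, new HashSet<>());
> > > > > > > >             if (victim != null)
> > > > > > > >                 return victim;
> > > > > > > >         }
> > > > > > > >         return null;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private UUID dfs(UUID tx, Set<UUID> explored, Set<UUID> onPath) {
> > > > > > > >         if (onPath.contains(tx))
> > > > > > > >             return tx; // This tx closes a cycle.
> > > > > > > >         if (!explored.add(tx))
> > > > > > > >             return null; // Already checked, no cycle through it.
> > > > > > > >         onPath.add(tx);
> > > > > > > >         for (UUID next : waitsFor.getOrDefault(tx, Set.of())) {
> > > > > > > >             UUID victim = dfs(next, explored, onPath);
> > > > > > > >             if (victim != null)
> > > > > > > >                 return victim;
> > > > > > > >         }
> > > > > > > >         onPath.remove(tx);
> > > > > > > >         return null;
> > > > > > > >     }
> > > > > > > > }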
> > > > > > > >
> > > > > > > > Currently I have one fundamental question: is there a possibility of falsely detected deadlocks because of concurrent WFG updates?
> > > > > > > > Of course, there are many points of improvement and optimization, but I would like to start by discussing the key points.
> > > > > > > >
> > > > > > > > Please share your thoughts!
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > > > >
> > > > > > > > Wed, Nov 14, 2018 at 15:47, ipavlukhin <vololo...@gmail.com>:
> > > > > > > > >
> > > > > > > > > Hi Igniters,
> > > > > > > > >
> > > > > > > > > I would like to resume the discussion about a deadlock detector. I will start with the motivation for further work on the subject. As I see it, the current implementation (entry point IgniteTxManager.detectDeadlock) starts a detection only after a transaction has timed out. To my mind, this is not very good from a product usability standpoint. As you know, in a deadlock situation some keys become unusable for an infinite amount of time. Currently the only way to work around it is configuring a timeout, but in practice it can be rather tricky to choose a proper/universal value for it. So, I see the main point as:
> > > > > > > > >
> > > > > > > > > The ability to break deadlocks without a need to configure timeouts explicitly.
> > > > > > > > >
> > > > > > > > > I will return soon with some thoughts about the implementation. Meanwhile, does anybody have in mind any other usability points which I am missing? Or are there any alternative approaches?
> > > > > > > > >
> > > > > > > > > On 2017/11/21 08:32:02, Dmitriy Setrakyan <d...@apache.org> wrote:
> > > > > > > > > > On Mon, Nov 20, 2017 at 10:15 PM, Vladimir Ozerov <vo...@gridgain.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > It doesn't need all txes. Instead, other nodes will send info about suspicious txes to it from time to time.
> > > > > > > > > >
> > > > > > > > > > I see your point, I think it might work.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Ivan Pavlukhin
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ivan Pavlukhin
> > > >
> > > > --
> > > > Best regards,
> > > > Ivan Pavlukhin
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
>
> --
> Best regards,
> Ivan Pavlukhin