Denis, Sorry, may be I was not clear enough - "tuple-approach" and "persistent approach" are the same. By "tuple" I mean a row stored inside a data block. Currently we store lock information in Java heap and proposal is to move it to data blocks. The main driver is memory - if there are a rows to be locked we will either run out of memory, or produce serious memory pressure. For example, currently update of 1M entries will consume ~500Mb of heap. With proposed approach it will consume almost nothing. The drawback is increased number of dirty data pages, but it should not be a problem because in final implementation we will update data rows before prepare phase anyway, so I do not expect any write amplification in usual case.
This approach is only applicable for Ignite persistence. On Thu, Dec 14, 2017 at 1:53 AM, Denis Magda <dma...@apache.org> wrote: > Vladimir, > > Thanks for a throughout overview and proposal. > > > Also we could try employing tiered approach > > 1) Try to keep everything in-memory to minimize writes to blocks > > 2) Fallback to persistent lock data if certain threshold is reached. > > What are the benefits of the backed-by-persistence approach in compare to > the one based on tuples? Specifically: > - will the persistence approach work for both 3rd party and Ignite > persistence? > - any performance impacts depending on a chosen method? > - what’s faster to implement? > > — > Denis > > > On Dec 13, 2017, at 2:10 AM, Vladimir Ozerov <voze...@gridgain.com> > wrote: > > > > Igniters, > > > > As you probably we know we work actively on MVCC [1] and transactional > SQL > > [2] features which could be treated as a single huge improvement. We > face a > > number of challenges and one of them is locking. > > > > At the moment information about all locks is kept in memory on per-entry > > basis (see GridCacheMvccManager). For every locked key we maintain > current > > lock owner (XID) and the list of would-be-owner transactions. When > > transaction is about to lock an entry two scenarios are possible: > > 1) If entry is not locked we obtain the lock immediately > > 2) if entry is locked we add current transaction to the wait list and > jumps > > to the next entry to be locked. Once the first entry is released by > > conflicting transaction, current transaction becomes an owner of the > first > > entry and tries to promote itself for subsequent entries. > > > > Once all required locks are obtained, response is sent to the caller. > > > > This approach doesn't work well for transactional SQL - if we update > > millions of rows in a single transaction we will simply run out of > memory. > > To mitigate the problem other database vendors keep information about > locks > > inside the tuples. I propose to apply the similar design as follows: > > > > 1) No per-entry lock information is stored in memory anymore. > > 2) The list of active transactions are maintained in memory still > > 3) When TX locks an entry, it sets special marker to the tuple [3] > > 4) When TX meets already locked entry, it enlists itself to wait queue of > > conflicting transaction and suspends > > 5) When first transaction releases conflicting lock, it notifies and > wakes > > up suspended transactions, so they resume locking > > 6) Entry lock data is cleared on transaction commit > > 7) Entry lock data is not cleared on rollback or node restart; Instead, > we > > will could use active transactions list to identify invalid locks and > > overwrite them as needed. > > > > Also we could try employing tiered approach > > 1) Try to keep everything in-memory to minimize writes to blocks > > 2) Fallback to persistent lock data if certain threshold is reached. > > > > Thoughts? > > > > [1] https://issues.apache.org/jira/browse/IGNITE-3478 > > [2] https://issues.apache.org/jira/browse/IGNITE-4191 > > [3] Depends on final MVCC design - it could be per-tuple XID, undo > vectors, > > per-block transaction lists, etc.. > > > > Vladimir. > >