That would be great, actually. We have a lot of data structures whose size depends heavily on the topology size and the number of caches, and all of those data structures are placed on-heap. I am looking forward to moving that stuff to a limited memory region with disk offload.
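For illustration, here is a minimal sketch of what declaring such a bounded, restart-volatile region could look like with the existing DataRegionConfiguration API (the region name and sizes below are made up, and actually routing those internal structures into the region - plus the disk offload for it - is exactly the part that would still have to be built):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataPageEvictionMode;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class VolatileRegionSketch {
    public static void main(String[] args) {
        // A bounded region with persistence disabled: no WAL, no checkpoints, and its
        // contents are naturally dropped on restart. Page eviction keeps it within the cap.
        DataRegionConfiguration volatileRegion = new DataRegionConfiguration()
            .setName("volatileDataRegion")            // hypothetical name
            .setInitialSize(256L * 1024 * 1024)       // 256 MB
            .setMaxSize(1024L * 1024 * 1024)          // hard cap at 1 GB
            .setPersistenceEnabled(false)
            .setPageEvictionMode(DataPageEvictionMode.RANDOM_2_LRU);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDataRegionConfigurations(volatileRegion));

        Ignition.start(cfg);
    }
}

The point of the sketch is only that the building blocks (bounded regions, disabled persistence, page eviction) are already there; the missing piece is teaching the internal structures to allocate from such a region and to offload it to disk.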
The DataRegion and disk offload are fairly easy to implement - almost all of the mechanics are already in the current page memory implementation. We only need to get rid of the copy-on-write and checkpointing logic and alter the page eviction mechanics a little bit. Another side note - we could also look into range lock mechanics. They will take a greater effort to implement, but I think it is worth it.

2017-12-15 16:59 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:

> Well, there is a completely different approach to handle the problem - lock
> escalation. E.g. SQL Server typically requires 96 bytes per lock. Much less
> than Ignite, but still. And as soon as some query requires more than 5000
> row locks, it is escalated to an exclusive table lock. Not only does this
> eliminate the memory consumption problem, it increases the performance of
> massive updates dramatically.
>
> On the other hand, this approach is more prone to deadlocks, since
> transactions updating disjoint data sets now have shared resources to be
> locked.
>
> There is no silver bullet, apparently.
>
> On Fri, Dec 15, 2017 at 4:42 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>
> > Alex,
> >
> > That might be a very good idea. In fact, what you describe closely
> > resembles TempDB in MS SQL Server [1]. It is also offloaded to disk,
> > minimally logged and purged on restart. Ideally we could try designing
> > this component in a generic way, so that it could store a lot of
> > different temporary stuff:
> > 1) Locks
> > 2) UNDO data
> > 3) Sort/join data (for SELECT and CREATE INDEX, statistics, whatsoever)
> > 4) If needed - visibility info (e.g. for covering indexes and purge/vacuum)
> >
> > WDYT?
> >
> > Vladimir.
> >
> > [1] https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database
> >
> > On Fri, Dec 15, 2017 at 4:26 PM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >
> >> Vladimir,
> >>
> >> What about moving the entire locking mechanism to a separate off-heap
> >> memory region which would be volatile wrt restarts, but would still
> >> support offload to disk? In the current architecture, it means that we
> >> would need to allocate a separate DataRegion with no WAL and no crash
> >> recovery - locks are meaningless after a restart, and we will
> >> automatically drop them. It would be interesting to prototype this,
> >> because I think we may be on par with on-heap lock placement, as we
> >> have already proved for in-memory caches.
> >>
> >> 2017-12-14 21:53 GMT+03:00 Denis Magda <dma...@apache.org>:
> >>
> >> > Vladimir,
> >> >
> >> > No, it's crystal clear, thanks.
> >> >
> >> > If this approach works only for Ignite persistence based deployments,
> >> > how will we handle locking for the pure in-memory and 3rd party
> >> > database caching scenarios? As I understand it, the tuples will still
> >> > be stored in the page memory, while there won't be any opportunity to
> >> > fall back to disk if the memory usage exceeds some threshold.
> >> >
> >> > —
> >> > Denis
> >> >
> >> > > On Dec 13, 2017, at 11:21 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> >> > >
> >> > > Denis,
> >> > >
> >> > > Sorry, maybe I was not clear enough - the "tuple approach" and the
> >> > > "persistent approach" are the same. By "tuple" I mean a row stored
> >> > > inside a data block. Currently we store lock information in the Java
> >> > > heap, and the proposal is to move it to data blocks. The main driver
> >> > > is memory - if there are a lot of rows to be locked, we will either
> >> > > run out of memory or produce serious memory pressure. For example,
> >> > > currently an update of 1M entries will consume ~500 MB of heap. With
> >> > > the proposed approach it will consume almost nothing. The drawback is
> >> > > an increased number of dirty data pages, but it should not be a
> >> > > problem, because in the final implementation we will update data rows
> >> > > before the prepare phase anyway, so I do not expect any write
> >> > > amplification in the usual case.
> >> > >
> >> > > This approach is only applicable to Ignite persistence.
> >> > >
> >> > > On Thu, Dec 14, 2017 at 1:53 AM, Denis Magda <dma...@apache.org> wrote:
> >> > >
> >> > >> Vladimir,
> >> > >>
> >> > >> Thanks for the thorough overview and proposal.
> >> > >>
> >> > >>> Also we could try employing a tiered approach:
> >> > >>> 1) Try to keep everything in-memory to minimize writes to blocks
> >> > >>> 2) Fall back to persistent lock data if a certain threshold is reached.
> >> > >>
> >> > >> What are the benefits of the backed-by-persistence approach compared
> >> > >> to the one based on tuples? Specifically:
> >> > >> - will the persistence approach work for both 3rd party and Ignite
> >> > >> persistence?
> >> > >> - any performance impacts depending on the chosen method?
> >> > >> - what's faster to implement?
> >> > >>
> >> > >> —
> >> > >> Denis
> >> > >>
> >> > >>> On Dec 13, 2017, at 2:10 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> >> > >>>
> >> > >>> Igniters,
> >> > >>>
> >> > >>> As you probably know, we are actively working on the MVCC [1] and
> >> > >>> transactional SQL [2] features, which could be treated as a single
> >> > >>> huge improvement. We face a number of challenges, and one of them
> >> > >>> is locking.
> >> > >>>
> >> > >>> At the moment, information about all locks is kept in memory on a
> >> > >>> per-entry basis (see GridCacheMvccManager). For every locked key we
> >> > >>> maintain the current lock owner (XID) and the list of would-be-owner
> >> > >>> transactions. When a transaction is about to lock an entry, two
> >> > >>> scenarios are possible:
> >> > >>> 1) If the entry is not locked, we obtain the lock immediately
> >> > >>> 2) If the entry is locked, we add the current transaction to the
> >> > >>> wait list and jump to the next entry to be locked. Once the first
> >> > >>> entry is released by the conflicting transaction, the current
> >> > >>> transaction becomes the owner of the first entry and tries to
> >> > >>> promote itself for the subsequent entries.
> >> > >>>
> >> > >>> Once all required locks are obtained, a response is sent to the caller.
> >> > >>>
> >> > >>> This approach doesn't work well for transactional SQL - if we update
> >> > >>> millions of rows in a single transaction, we will simply run out of
> >> > >>> memory. To mitigate the problem, other database vendors keep
> >> > >>> information about locks inside the tuples. I propose to apply a
> >> > >>> similar design as follows:
> >> > >>>
> >> > >>> 1) No per-entry lock information is stored in memory anymore
> >> > >>> 2) The list of active transactions is still maintained in memory
> >> > >>> 3) When a TX locks an entry, it sets a special marker in the tuple [3]
> >> > >>> 4) When a TX meets an already locked entry, it enlists itself to the
> >> > >>> wait queue of the conflicting transaction and suspends
> >> > >>> 5) When the first transaction releases the conflicting lock, it
> >> > >>> notifies and wakes up the suspended transactions, so they resume
> >> > >>> locking
> >> > >>> 6) Entry lock data is cleared on transaction commit
> >> > >>> 7) Entry lock data is not cleared on rollback or node restart;
> >> > >>> instead, we could use the active transactions list to identify
> >> > >>> invalid locks and overwrite them as needed
> >> > >>>
> >> > >>> Also we could try employing a tiered approach:
> >> > >>> 1) Try to keep everything in-memory to minimize writes to blocks
> >> > >>> 2) Fall back to persistent lock data if a certain threshold is reached.
> >> > >>>
> >> > >>> Thoughts?
> >> > >>>
> >> > >>> [1] https://issues.apache.org/jira/browse/IGNITE-3478
> >> > >>> [2] https://issues.apache.org/jira/browse/IGNITE-4191
> >> > >>> [3] Depends on the final MVCC design - it could be a per-tuple XID,
> >> > >>> undo vectors, per-block transaction lists, etc.
> >> > >>>
> >> > >>> Vladimir.
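To make points 1)-7) of the quoted proposal a bit more concrete, here is a deliberately simplified, heap-based toy model of the protocol (all class and method names are invented; a real implementation would keep the lock marker inside the data page and CAS it there, and would also need deadlock detection and MVCC visibility handling, which this sketch ignores):

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the "lock marker in the tuple" idea discussed above (not Ignite code). */
public class TupleLockSketch {
    /** Stand-in for a row in a data page; the lock word would live next to the row, not on heap. */
    public static final class Tuple {
        final AtomicLong lockXid = new AtomicLong(); // 0 = no lock marker recorded
    }

    /** Point 2: the list of active transactions kept in memory, XID -> queue of suspended waiters. */
    private final Map<Long, Queue<CountDownLatch>> activeTxs = new ConcurrentHashMap<>();

    public void begin(long xid) {
        activeTxs.put(xid, new ConcurrentLinkedQueue<>());
    }

    /** Points 3-5: stamp the tuple with our XID, or enlist in the owner's wait queue and suspend. */
    public void lock(Tuple row, long xid) throws InterruptedException {
        while (true) {
            long owner = row.lockXid.get();

            if (owner == xid)
                return; // already locked by this transaction

            // Point 7: a marker left by a transaction that is no longer active is stale
            // (rollback or restart) and may simply be overwritten.
            if (owner == 0 || !activeTxs.containsKey(owner)) {
                if (row.lockXid.compareAndSet(owner, xid))
                    return;

                continue; // lost the race to another transaction; re-read the marker
            }

            // Conflicting lock: enlist into the owner's wait queue and suspend (point 4).
            CountDownLatch wakeUp = new CountDownLatch(1);
            Queue<CountDownLatch> waiters = activeTxs.get(owner);

            if (waiters == null)
                continue; // owner finished in the meantime; retry

            waiters.add(wakeUp);

            if (activeTxs.containsKey(owner))
                wakeUp.await(); // resumed by the owner's finish() (point 5)
        }
    }

    /** Point 6: clear markers on commit; on rollback they are left stale on purpose (point 7). */
    public void finish(long xid, Iterable<Tuple> locked, boolean commit) {
        if (commit)
            for (Tuple row : locked)
                row.lockXid.compareAndSet(xid, 0);

        Queue<CountDownLatch> waiters = activeTxs.remove(xid);

        if (waiters != null)
            waiters.forEach(CountDownLatch::countDown);
    }
}

A caller would begin(xid), lock() every row it updates, and then finish() with commit or rollback; a rolled-back transaction simply disappears from the active list, so its stale markers are overwritten lazily, exactly as in point 7.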