Alexander, thanks for the detailed explanation. I would start simple, without JOINs. However, we should throw proper exceptions if unsupported SQL is used.
D. On Tue, Feb 20, 2018 at 9:45 AM, Alexander Paschenko < alexander.a.pasche...@gmail.com> wrote: > Hello again, and here I go with some more thoughts on SELECT FOR UPDATE. > After having talking to Vlad, we returned again to question about how > SELECT FOR UPDATE should work with JOINs. > > It seems that distributed joins currently are better to be left aside > as long as rework of distributed queries and joins is a separate big > issue. > However, probably we still should think about what we could make work > after all with what we have in our hands now. And it seems that some > scenarios are pretty doable - say, when there's an ordinary JOIN of > few tables, we can simply append keys of those tables to list of > selected columns, and that would work in case of collocated data. > > * Alas, major vendors (Postgres, MySQL, Oracle) all also honor > subqueries in JOINs - say, like this: > > select p.id , c.id from person p inner join (select * from company) c > on p.company_id = c.id where p.id > 3 and p.id < 10 for update; > > Obviously, this is a bit corner case as such JOIN is not much > different from an ordinary multi-table one, but it gives overall idea > of what I'm talking about - major databases lock what's been selected > in subquery as well. > > That said, aforementioned case probably could also be processed as > suggested above - that is, we take innermost query, append key to list > of its selected columns, so that we can refer to that new column from > outer query, etc, etc. In principle we could process even > multiple-level nesting of queries with such approach. > Also no major vendors, except probably MySQL, support GROUP BY with > SELECT FOR UPDATE, and this simplifies our task a bit. :) > > If someone can think of an example of a query that cannot be processed > in the way described above, please write about it. > > * Also it looks like no major vendors out of three mentioned above > lock record from tables mentioned in subqueries in SELECT columns or > in WHERE - say, this query > > select p.id, (select count(*) from company) from person p > > will lock only records in Person table. > > Thoughts? Should we attempt to implement JOINs support right away? Or > would it be better to leave it be for (near) future and start with > simple implementation that would ban JOINs altogether? > > Regards, > Alex > > > 2018-01-25 10:58 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>: > > Alex, > > > > Thank you for detailed analysis. My 50 cents: > > > > 1) I would not allow GROUP BYs in the first place. These are good > > candidates for further iterations IMO > > 2) Same as p.1 > > 3) In final TX SQL solution we woll lock rows, not keys > > (GridCacheMapEntry). Can we simply lock every row returned from the > query? > > 4) Same as p.1 > > 5) Yes, it should work the same way we would implement it for normal > queries > > > > Also I am not quite understand why should we keep results on map node > until > > all keys are locked. This increases memory pressure on the server. > Instead, > > I would start sending batches to reducer immediately, but do not return > the > > very first result to the user until all results are collected. This way > > pressure is moved to the client what increases cluster stability. Will > that > > work? > > > > On Wed, Jan 24, 2018 at 6:23 PM, Alexander Paschenko < > > alexander.a.pasche...@gmail.com> wrote: > > > >> Hello Igniters, > >> > >> I'd like to bring up the discussion about implementation details of > >> https://issues.apache.org/jira/browse/IGNITE-6937 about > >> support of SELECT FOR UPDATE statements as a part of overall activity > >> on transactional SQL. > >> > >> That kind of statements allows the user proactively obtain row level > >> locks for the records that they may (or may not) want to > >> update in further statements of the same transaction. > >> > >> Suggested general approach to implementation is as follows: > >> > >> - Perform two-step SELECT as usual (e.g. split, map, reduce stages), BUT > >> - Map nodes start giving away their first result pages for such query > >> only after they have traversed whole their result set > >> and have obtained locks for all keys corresponding to selected rows > >> (e.g. rows matching WHERE clause). > >> > >> I've made some research on how some RDBMSs behave (MySQL, Oracle, Pg) > >> and found that they all also consider SELECT FOR UPDATE > >> finished and give away its result set only after all locks have been > >> acquired, so in general suggested strategy is similar to > >> what other databases do. Alas, everyone concerned is welcome to share > >> their views on how this feature could work in Ignite - as well as on > >> some stuff stated below. These are some caveats that I'd like to > >> discuss in addition to overall approach. > >> > >> First: should we allow GROUP BY clause in SELECT FOR UPDATE? Oracle > >> and Pg do not support it in any way, while MySQL does. > >> > >> Second: should we allow JOIN clause in SELECT FOR UPDATE? All > >> mentioned major vendors support it, but with Ignite implementation > >> likely > >> will not be straightforward - this has to do with pt. 3. > >> > >> Third: in general case, how do we detect *keys* that we should lock? > >> Consider following statement: SELECT name FROM person FOR UPDATE. > >> This query does not mention table key at all. Moreover, if we have > >> JOIN clause in the query, it is concerned with more than one table > >> (=cache), thus making it even harder to detect which keys we should > >> lock. Suggested solution is as follows: on split stage, > >> append to MAP query key columns for all concerned tables (i.e. SELECT > >> a.name, b.surname from a join b will become > >> SELECT a.name, b.surname, a._key, b._key) - with that done, we'll > >> definitely be able to lock everything we need on MAP nodes. > >> > >> Fourth: should we allow LIMIT clause in SELECT FOR UPDATE? In general, > >> RDBMS vendors do not recommend combining LIMIT with > >> SELECT FOR UPDATE. First, this is because LIMIT results are subject to > >> ordering, and in this case it's harder to predict what records > >> will be locked after all. Second, some vendors state that they lock > >> not just what LIMIT yields, but also *everything before it* - say, > >> if your SELECT returns 100 records and you apply LIMIT 50,60, then all > >> first 60 records will be locked even if you don't need first 50. > >> Moreover, in case of Ignite, any LIMIT will make query need merge > >> table, all data will have to be copied to reduce node first, > >> yadda yadda yadda, and we won't be able to actually lock anything > >> until all data is on reduce node. See pt. 5 for more on this. > >> > >> Fifth: in fact, above question stands for all cases when we need merge > >> table: how things should work when we have to pull all data to > >> reduce node prior to actually filtering anything? Should we aim to > >> support such cases in SELECT FOR UPDATE? > >> > >> Regards, > >> Alex > >> >