On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> I already expressed my concern - this is a counterintuitive approach,
> because without happens-before a pure streaming model can be applied only
> on independent chunks of data. It means that the mentioned ETL use case is
> not feasible - ETL always depends on implicit or explicit links between
> tables, and hence streaming is not applicable here. My question still
> stands - what products, except possibly Ignite, do this kind of JDBC
> streaming?

Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
DataStreamer.addData(). JDBC batching and putAll() are absolutely
identical. If you see it as counter-intuitive, I would ask for a concrete
example.

As far as links between data go, Ignite does not have foreign-key
constraints, so DataStreamer can insert data in any order (but again, not
as part of a JDBC batch).

> Another problem is that a connection-wide property doesn't fit well into
> the JDBC pooling model. Users will have to use different connections for
> streaming and non-streaming approaches.

Using DataStreamer is not possible within the JDBC batching paradigm,
period. I wish we could drop the high-level-feels-good discussions
altogether, because it seems like we are spinning wheels here. There is no
way to use the streamer in a JDBC context unless we add a connection flag.
Again, if you disagree, I would prefer to see a concrete example explaining
why.

> Please see how Oracle did that, this is precisely what I am talking about:
> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> Two batching modes - one with explicit flush, another one with implicit
> flush, where Oracle decides on its own when it is better to communicate
> with the server. Batching mode can be declared globally or at the
> per-statement level. Simple and flexible.

> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrak...@apache.org>
> wrote:
>
> > Gents,
> >
> > As Sergi suggested, batching and streaming are very different
> > semantically.
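[Editor's note: the claim above that a JDBC batch is semantically identical to putAll() can be sketched with a toy model. All class and method names below are hypothetical illustrations, not the real Ignite driver internals: addBatch() only buffers parameter rows on the client, and executeBatch() collapses the whole buffer into a single map that could then be handed to IgniteCache.putAll() in one network hop.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical names): a JDBC batch is just a client-side buffer
// of parameter rows; executeBatch() flushes it as one putAll-style bulk map.
class ToyBatchingStatement {
    private final List<Object[]> batch = new ArrayList<>();

    /** Mirrors PreparedStatement.addBatch(): buffer one (key, value) row locally. */
    void addBatch(Object key, Object val) {
        batch.add(new Object[] {key, val});
    }

    /** Mirrors executeBatch(): collapse the buffer into a single bulk update. */
    Map<Object, Object> executeBatchAsPutAll() {
        Map<Object, Object> entries = new LinkedHashMap<>();
        for (Object[] row : batch)
            entries.put(row[0], row[1]);

        batch.clear(); // JDBC clears the batch after execution

        return entries; // in a real driver this map would go to IgniteCache.putAll()
    }
}
```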
> > To use standard JDBC batching, all we need to do is convert it to a
> > cache.putAll() method, as semantically a putAll(...) call is identical
> > to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in
> > between, then we may have to break a batch into several chunks and
> > execute the update in between. The DataStreamer should not be used here.
> >
> > I believe that for streaming we need to add a special JDBC/ODBC
> > connection flag. Whenever this flag is set to true, we should only allow
> > INSERT or single-UPDATE operations and use the DataStreamer API
> > internally. All operations other than INSERT or single-UPDATE should be
> > prohibited.
> >
> > I think this design is semantically clear. Any objections?
> >
> > D.
> >
> > On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vlady...@gmail.com>
> > wrote:
> >
> > > If we use the Streamer, then we always have `happens-before` broken.
> > > This is ok, because the Streamer is for data loading, not for usual
> > > operation.
> > >
> > > We are not inventing any bicycles, just separating concerns: Batching
> > > and Streaming.
> > >
> > > My point here is that they should not depend on each other at all:
> > > Batching can work with or without Streaming, just as Streaming can
> > > work with or without Batching.
> > >
> > > Your proposal is a set of non-obvious rules for them to work. I see no
> > > reason for these complications.
> > >
> > > Sergi
> > >
> > > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> > >
> > > > Sergi,
> > > >
> > > > If a user calls a single *execute()* operation, then most likely it
> > > > is not batching. We should not rely on a strange case where the user
> > > > performs batching without using the standard and well-adopted
> > > > batching JDBC API.
> > > > The main problem with the streamer is that it is async and hence
> > > > breaks happens-before guarantees in a single thread: a SELECT after
> > > > an INSERT might not return the inserted value.
> > > >
> > > > Honestly, I do not really understand why we are trying to re-invent
> > > > a bicycle here. There is a standard API - let's just use it and make
> > > > it flexible enough to take advantage of IgniteDataStreamer if
> > > > needed.
> > > >
> > > > Is there any use case which is not covered by this solution? Or let
> > > > me ask from the opposite side - are there any well-known JDBC
> > > > drivers which perform batching/streaming from non-batched update
> > > > statements?
> > > >
> > > > Vladimir.
> > > >
> > > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin
> > > > <sergi.vlady...@gmail.com> wrote:
> > > >
> > > > > Vladimir,
> > > > >
> > > > > I see no reason to forbid Streamer usage from non-batched
> > > > > statement execution. It is common that users already have their
> > > > > ETL tools, and you can't be sure if they use batching or not.
> > > > >
> > > > > Alex,
> > > > >
> > > > > I guess we have to decide on Streaming first and then discuss
> > > > > Batching separately, ok? Because this decision may become
> > > > > important for the batching implementation.
> > > > >
> > > > > Sergi
> > > > >
> > > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
> > > > >
> > > > > > Alex,
> > > > > >
> > > > > > In most cases JdbcQueryTask should be executed locally on the
> > > > > > client node started by the JDBC driver:
> > > > > >
> > > > > > JdbcQueryTask.QueryResult res =
> > > > > >     loc ? qryTask.call() :
> > > > > >     ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
> > > > > >
> > > > > > Is it valid behavior after introducing DML functionality?
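[Editor's note: the happens-before hazard Vladimir describes above can be illustrated with a toy async buffer. The names are hypothetical; the real IgniteDataStreamer buffers per-node batches and flushes them in the background, but the visible effect is the same: a read issued right after add() may miss the value until the buffer is flushed.]

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration (hypothetical names) of why an async streamer breaks
// single-thread happens-before: addData() only buffers; the backing store
// sees the entry only after flush(), so a read between the two returns null.
class ToyStreamer {
    private final Map<Object, Object> buffer = new HashMap<>();
    private final Map<Object, Object> store = new HashMap<>();

    /** Streamer-style write: the value is buffered, not yet stored. */
    void addData(Object key, Object val) { buffer.put(key, val); }

    /** Data reaches the store only here (the streamer decides when). */
    void flush() { store.putAll(buffer); buffer.clear(); }

    /** SELECT-style read: sees only flushed data. */
    Object select(Object key) { return store.get(key); }
}
```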
> > > > > > In cases when a user wants to execute a query on a specific
> > > > > > node, he should fully understand what he wants and what can go
> > > > > > wrong.
> > > > > >
> > > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
> > > > > > <alexander.a.pasche...@gmail.com> wrote:
> > > > > > > Sergi,
> > > > > > >
> > > > > > > JDBC batching might work quite differently from driver to
> > > > > > > driver. Say, MySQL happily rewrites queries as I suggested in
> > > > > > > the beginning of this thread (it's not the only strategy, but
> > > > > > > one of the possible options) - and, BTW, I would like to hear
> > > > > > > at least an opinion about it.
> > > > > > >
> > > > > > > On your first approach, the section before the streamer: you
> > > > > > > suggest that we send a single statement and multiple parameter
> > > > > > > sets as a single query task, am I right? (Just to make sure
> > > > > > > that I got you properly.) If so, do you also mean that the API
> > > > > > > (namely JdbcQueryTask) between server and client should also
> > > > > > > change? Or should new API means be added to facilitate
> > > > > > > batching tasks?
> > > > > > >
> > > > > > > - Alex
> > > > > > >
> > > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin
> > > > > > > <sergi.vlady...@gmail.com>:
> > > > > > >> Guys,
> > > > > > >>
> > > > > > >> I discussed this feature with Dmitriy and we came to the
> > > > > > >> conclusion that batching in JDBC and Data Streaming in Ignite
> > > > > > >> have different semantics and performance characteristics.
> > > > > > >> Thus they are independent features (they may work together or
> > > > > > >> separately, but this is another story).
> > > > > > >>
> > > > > > >> Let me explain.
> > > > > > >>
> > > > > > >> This is how JDBC batching works:
> > > > > > >> - Add N sets of parameters to a prepared statement.
> > > > > > >> - Manually execute the prepared statement.
> > > > > > >> - Repeat until all the data is loaded.
> > > > > > >>
> > > > > > >> This is how the data streamer works:
> > > > > > >> - Keep adding data.
> > > > > > >> - The streamer will buffer the data and load the buffered
> > > > > > >> per-node batches when they are big enough.
> > > > > > >> - Close the streamer to make sure that everything is loaded.
> > > > > > >>
> > > > > > >> As you can see, we have a difference in the semantics of when
> > > > > > >> we send data: if our JDBC driver allows sending batches to
> > > > > > >> nodes without calling `execute` (and probably we would need
> > > > > > >> to make `execute` a no-op here), then we are violating the
> > > > > > >> semantics of JDBC; if we disallow this behavior, then this
> > > > > > >> batching will underperform.
> > > > > > >>
> > > > > > >> Thus I suggest keeping these features (JDBC Batching and JDBC
> > > > > > >> Streaming) separate.
> > > > > > >>
> > > > > > >> As I already said, they can work together: Batching will
> > > > > > >> batch parameters, and on `execute` they will go to the
> > > > > > >> Streamer in one shot, and the Streamer will deal with the
> > > > > > >> rest.
> > > > > > >>
> > > > > > >> Sergi
> > > > > > >>
> > > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov
> > > > > > >> <voze...@gridgain.com>:
> > > > > > >>
> > > > > > >>> Hi Alex,
> > > > > > >>>
> > > > > > >>> To my understanding there are two possible approaches to
> > > > > > >>> batching in the JDBC layer:
> > > > > > >>>
> > > > > > >>> 1) Rely on the default batching API, specifically
> > > > > > >>> *PreparedStatement.addBatch()* [1] and others.
> > > > > > >>> This is a nice and clear API, users are used to it, and its
> > > > > > >>> adoption will minimize user code changes when migrating from
> > > > > > >>> other JDBC sources. We simply copy updates locally and then
> > > > > > >>> execute them all at once with only a single network hop to
> > > > > > >>> the servers. *IgniteDataStreamer* can be used underneath.
> > > > > > >>>
> > > > > > >>> 2) Or we can have a separate connection flag which will move
> > > > > > >>> all INSERT/UPDATE/DELETE statements through the streamer.
> > > > > > >>>
> > > > > > >>> I prefer the first approach.
> > > > > > >>>
> > > > > > >>> Also we need to keep in mind that the data streamer has poor
> > > > > > >>> performance when adding single key-value pairs due to high
> > > > > > >>> overhead on concurrency and other bookkeeping. Instead, it
> > > > > > >>> is better to pre-batch key-value pairs before giving them to
> > > > > > >>> the streamer.
> > > > > > >>>
> > > > > > >>> Vladimir.
> > > > > > >>>
> > > > > > >>> [1]
> > > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
> > > > > > >>>
> > > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko
> > > > > > >>> <alexander.a.pasche...@gmail.com> wrote:
> > > > > > >>>
> > > > > > >>> > Hello Igniters,
> > > > > > >>> >
> > > > > > >>> > One of the major improvements to DML has to be support of
> > > > > > >>> > batch statements. I'd like to discuss its implementation.
> > > > > > >>> > The suggested approach is to rewrite the given query,
> > > > > > >>> > turning it from a few INSERTs into a single statement and
> > > > > > >>> > processing the arguments accordingly.
I > > > suggest > > > > > this > > > > > > >>> > as long as the whole point of batching is to make as little > > > > > > >>> > interactions with cluster as possible and to make > operations > > as > > > > > > >>> > condensed as possible, and in case of Ignite it means that > we > > > > > should > > > > > > >>> > send as little JdbcQueryTasks as possible. And, as long as > a > > > > query > > > > > > >>> > task holds single query and its arguments, this approach > will > > > not > > > > > > >>> > require any changes to be done to current design and won't > > > break > > > > > any > > > > > > >>> > backward compatibility - all dirty work on rewriting will > be > > > done > > > > > by > > > > > > >>> > JDBC driver. > > > > > > >>> > Without rewriting, we could introduce some new query task > for > > > > batch > > > > > > >>> > operations, but that would make impossible sending such > > > requests > > > > > from > > > > > > >>> > newer clients to older servers (say, servers of version > > 1.8.0, > > > > > which > > > > > > >>> > does not know about batching, let alone older versions). > > > > > > >>> > I'd like to hear comments and suggestions from the > community. > > > > > Thanks! > > > > > > >>> > > > > > > > >>> > - Alex > > > > > > >>> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > -- > Vladimir Ozerov > Senior Software Architect > GridGain Systems > www.gridgain.com > *+7 (960) 283 98 40* >