Alexander, a couple of comments regarding the streaming mode.
I would rename the existing property to “ignite.jdbc.streaming” and add a few more that will help to manage and tune the streaming behavior:

ignite.jdbc.streaming.perNodeBufferSize
ignite.jdbc.streaming.perNodeParallelOperations
ignite.jdbc.streaming.autoFlushFrequency

Any other thoughts?

— Denis

> On Dec 19, 2016, at 8:02 AM, Alexander Paschenko
> <alexander.a.pasche...@gmail.com> wrote:
>
> OK folks, both data streamer support and batching support have been
> implemented.
>
> The resulting design fully conforms to what Dima suggested initially -
> these two concepts are separated.
>
> Streamed statements are turned on by a connection flag, and the stream
> auto flush timeout can be tuned in the same way; these statements
> support INSERT and MERGE w/o subquery as well as fast key-bounded
> DELETE and UPDATE; each prepared statement in streamed mode has its
> own streamer object, and their lifecycles are the same - on close, the
> statement closes its streamer. Streaming mode is available only in the
> "local" mode of connection between the JDBC driver and the Ignite
> client (the default mode, when the JDBC driver creates an Ignite
> client node by itself) - there would be no sense in streaming if query
> args had to travel over the network.
>
> Batched statements are used via the conventional JDBC API (setArgs...
> addBatch... executeBatch...); they also support INSERT and MERGE w/o
> subquery as well as fast key (and, optionally, value) bounded DELETE
> and UPDATE. These work in a similar manner to non-batched statements
> and likewise rely on the traditional putAll/invokeAll routines.
> Essentially, batching is just a way to pass a bigger map to
> cache.putAll without writing a single very long query. This works in
> the local as well as the "remote" Ignite JDBC connectivity mode.
>
> More info (details are in the comments):
>
> Batching - https://issues.apache.org/jira/browse/IGNITE-4269
> Streaming - https://issues.apache.org/jira/browse/IGNITE-4169
>
> Regards,
> Alex
>
> 2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>:
>> Alex,
>>
>> It seems to me that replace semantics can be implemented with a
>> StreamReceiver, no?
>>
>> D.
>>
>> On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <
>> alexander.a.pasche...@gmail.com> wrote:
>>
>>> Sorry, "no relation w/JDBC" in my previous message should read "no
>>> relation w/JDBC batching".
>>>
>>> — Alex
>>> On Dec 10, 2016, at 1:52 PM, "Alexander Paschenko" <
>>> alexander.a.pasche...@gmail.com> wrote:
>>>
>>>> Dima,
>>>>
>>>> I would like to point out that data streamer support had already
>>>> been implemented in the course of the work on DML in 1.8, exactly as
>>>> you are suggesting now (turned on via a connection flag; allowed
>>>> only MERGE - the data streamer can't do putIfAbsent stuff, right?;
>>>> absolutely no relation w/JDBC), *but* that patch was reverted - on
>>>> advice from Vlad, which I believe had been agreed with you - so it
>>>> didn't make it into 1.8 after all.
>>>> Also, while it's possible to maintain INSERT vs MERGE semantics
>>>> using the streamer's allowOverwrite flag, I can't see how we could
>>>> mimic UPDATE here: the streamer skips a put only when the key is
>>>> present AND allowOverwrite is false, while UPDATE should not put
>>>> anything when the key is *missing* - i.e., there's no way to emulate
>>>> the cache's *replace* operation semantics with the streamer (update
>>>> the value only if the key is present, otherwise do nothing).
>>>>
>>>> — Alex
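For reference, a minimal sketch of the semantic gap Alex describes, using the public cache and streamer APIs; the cache name "person" and the key/value types are illustrative:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.IgniteDataStreamer;

    public class ReplaceVsStreamer {
        static void compare(Ignite ignite) {
            // UPDATE-like (replace) semantics: write the value only if the key is present.
            IgniteCache<Integer, String> cache = ignite.cache("person");
            cache.replace(1, "updated"); // no-op if key 1 is absent

            // Streamer semantics: allowOverwrite(false) is "insert if absent" (INSERT-like),
            // allowOverwrite(true) is "always put" (MERGE-like). Neither variant skips
            // the write when the key is missing, so UPDATE/replace cannot be expressed.
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("person")) {
                streamer.allowOverwrite(true);
                streamer.addData(1, "updated"); // still written even if key 1 never existed
            }
        }
    }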
>>>> On Dec 9, 2016, at 10:00 PM, "Dmitriy Setrakyan" <
>>>> dsetrak...@apache.org> wrote:
>>>>
>>>>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com>
>>>>> wrote:
>>>>>
>>>>>> I already expressed my concern - this is a counterintuitive
>>>>>> approach, because without happens-before the pure streaming model
>>>>>> can be applied only to independent chunks of data. It means that
>>>>>> the mentioned ETL use case is not feasible - ETL always depends on
>>>>>> implicit or explicit links between tables, and hence streaming is
>>>>>> not applicable here. My question still stands - which products,
>>>>>> except possibly Ignite, do this kind of JDBC streaming?
>>>>>>
>>>>>
>>>>> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
>>>>> DataStreamer.addData().
>>>>>
>>>>> JDBC batching and putAll() are absolutely identical. If you see it as
>>>>> counterintuitive, I would ask for a concrete example.
>>>>>
>>>>> As far as links between data go, Ignite does not have foreign-key
>>>>> constraints, so the DataStreamer can insert data in any order (but
>>>>> again, not as part of a JDBC batch).
>>>>>
>>>>>
>>>>>>
>>>>>> Another problem is that a connection-wide property doesn't fit well
>>>>>> into the JDBC pooling model. Users will have to use different
>>>>>> connections for streaming and non-streaming approaches.
>>>>>>
>>>>>
>>>>> Using the DataStreamer is not possible within the JDBC batching
>>>>> paradigm, period. I wish we could drop the high-level-feels-good
>>>>> discussions altogether, because it seems like we are spinning wheels
>>>>> here.
>>>>>
>>>>> There is no way to use the streamer in a JDBC context, unless we add
>>>>> a connection flag. Again, if you disagree, I would prefer to see a
>>>>> concrete example explaining why.
>>>>>
>>>>>
>>>>>> Please see how Oracle did that, this is precisely what I am talking
>>>>>> about:
>>>>>> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>>>>> Two batching modes - one with explicit flush, another one with
>>>>>> implicit flush, where Oracle decides on its own when it is better to
>>>>>> communicate with the server. The batching mode can be declared
>>>>>> globally or at the per-statement level. Simple and flexible.
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <
>>>>>> dsetrak...@apache.org> wrote:
>>>>>>
>>>>>>> Gents,
>>>>>>>
>>>>>>> As Sergi suggested, batching and streaming are very different
>>>>>>> semantically.
>>>>>>>
>>>>>>> To use standard JDBC batching, all we need to do is convert it to a
>>>>>>> cache.putAll() call, as semantically a putAll(...) call is
>>>>>>> identical to a JDBC batch. Of course, if we see an UPDATE with a
>>>>>>> WHERE clause in between, then we may have to break the batch into
>>>>>>> several chunks and execute the update in between. The DataStreamer
>>>>>>> should not be used here.
>>>>>>>
>>>>>>> I believe that for streaming we need to add a special JDBC/ODBC
>>>>>>> connection flag. Whenever this flag is set to true, we should only
>>>>>>> allow INSERT or single-UPDATE operations and use the DataStreamer
>>>>>>> API internally. All operations other than INSERT or single-UPDATE
>>>>>>> should be prohibited.
>>>>>>>
>>>>>>> I think this design is semantically clear. Any objections?
>>>>>>>
>>>>>>> D.
>>>>>>>
>>>>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <
>>>>>>> sergi.vlady...@gmail.com> wrote:
>>>>>>>
>>>>>>>> If we use the Streamer, then we always have `happens-before`
>>>>>>>> broken. This is ok, because the Streamer is for data loading, not
>>>>>>>> for usual operation.
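To illustrate Sergi's point, a minimal sketch of the broken happens-before on a streaming connection; the streaming flag in the URL is the one under discussion in this thread, so its exact name and syntax are assumptions, not a released API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamingVisibility {
        public static void main(String[] args) throws Exception {
            // "ignite.jdbc.streaming=true" is the proposed connection flag; assumption only.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:ignite:cfg://ignite.jdbc.streaming=true@file:///etc/ignite-jdbc.xml");
                 Statement stmt = conn.createStatement()) {

                stmt.executeUpdate("INSERT INTO Person (_key, name) VALUES (1, 'Ann')");

                // The row may still sit in the client-side streamer buffer, so this
                // SELECT is not guaranteed to see it - happens-before is broken.
                ResultSet rs = stmt.executeQuery("SELECT name FROM Person WHERE _key = 1");
                System.out.println(rs.next() ? rs.getString(1) : "not visible yet");
            }
        }
    }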
>>>>>>>> We are not inventing any bicycles, just separating concerns:
>>>>>>>> Batching and Streaming.
>>>>>>>>
>>>>>>>> My point here is that they should not depend on each other at all:
>>>>>>>> Batching can work with or without Streaming, just as Streaming can
>>>>>>>> work with or without Batching.
>>>>>>>>
>>>>>>>> Your proposal is a set of non-obvious rules for how they should
>>>>>>>> work together. I see no reason for these complications.
>>>>>>>>
>>>>>>>> Sergi
>>>>>>>>
>>>>>>>>
>>>>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>
>>>>>>>>> Sergi,
>>>>>>>>>
>>>>>>>>> If a user calls a single *execute()* operation, then most likely
>>>>>>>>> it is not batching. We should not rely on the strange case where
>>>>>>>>> a user performs batching without using the standard and
>>>>>>>>> well-adopted JDBC batching API. The main problem with the
>>>>>>>>> streamer is that it is async and hence breaks happens-before
>>>>>>>>> guarantees within a single thread: a SELECT after an INSERT might
>>>>>>>>> not return the inserted value.
>>>>>>>>>
>>>>>>>>> Honestly, I do not really understand why we are trying to
>>>>>>>>> re-invent a bicycle here. There is a standard API - let's just
>>>>>>>>> use it and make it flexible enough to take advantage of
>>>>>>>>> IgniteDataStreamer if needed.
>>>>>>>>>
>>>>>>>>> Is there any use case which is not covered by this solution? Or
>>>>>>>>> let me ask from the opposite side - are there any well-known JDBC
>>>>>>>>> drivers which perform batching/streaming from non-batched update
>>>>>>>>> statements?
>>>>>>>>>
>>>>>>>>> Vladimir.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <
>>>>>>>>> sergi.vlady...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Vladimir,
>>>>>>>>>>
>>>>>>>>>> I see no reason to forbid Streamer usage from non-batched
>>>>>>>>>> statement execution.
>>>>>>>>>> It is common that users already have their ETL tools, and you
>>>>>>>>>> can't be sure whether they use batching or not.
>>>>>>>>>>
>>>>>>>>>> Alex,
>>>>>>>>>>
>>>>>>>>>> I guess we have to decide on Streaming first and then discuss
>>>>>>>>>> Batching separately, ok? Because this decision may become
>>>>>>>>>> important for the batching implementation.
>>>>>>>>>>
>>>>>>>>>> Sergi
>>>>>>>>>>
>>>>>>>>>> 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Alex,
>>>>>>>>>>>
>>>>>>>>>>> In most cases JdbcQueryTask should be executed locally on the
>>>>>>>>>>> client node started by the JDBC driver:
>>>>>>>>>>>
>>>>>>>>>>> JdbcQueryTask.QueryResult res = loc
>>>>>>>>>>>     ? qryTask.call()
>>>>>>>>>>>     : ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
>>>>>>>>>>>
>>>>>>>>>>> Is this still valid behavior after introducing the DML
>>>>>>>>>>> functionality?
>>>>>>>>>>>
>>>>>>>>>>> In cases when a user wants to execute a query on a specific
>>>>>>>>>>> node, he should fully understand what he wants and what can go
>>>>>>>>>>> wrong.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
>>>>>>>>>>> <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>> Sergi,
>>>>>>>>>>>>
>>>>>>>>>>>> JDBC batching might work quite differently from driver to
>>>>>>>>>>>> driver. Say, MySQL happily rewrites queries as I had suggested
>>>>>>>>>>>> in the beginning of this thread (it's not the only strategy,
>>>>>>>>>>>> but one of the possible options) - and, BTW, I would like to
>>>>>>>>>>>> hear at least an opinion about it.
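A rough sketch of that rewriting strategy; the table and column names are illustrative, and this shows only the rewrite idea, not actual driver code:

    import java.util.Collections;

    public class BatchRewrite {
        // Collapse N batched parameter sets for a single-row INSERT into one
        // multi-row statement, so the whole batch travels as one query task.
        static String rewrite(String prefix, int batchSize) {
            return prefix + String.join(", ", Collections.nCopies(batchSize, "(?, ?)"));
        }

        public static void main(String[] args) {
            // Prints: INSERT INTO Person (_key, name) VALUES (?, ?), (?, ?), (?, ?)
            System.out.println(rewrite("INSERT INTO Person (_key, name) VALUES ", 3));
            // The argument arrays of the individual batch entries are then
            // concatenated in the same order to match the placeholders.
        }
    }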
>>>>>>>>>>>> On your first approach, the section before the streamer: you
>>>>>>>>>>>> suggest that we send a single statement and multiple parameter
>>>>>>>>>>>> sets as a single query task, am I right? (Just to make sure
>>>>>>>>>>>> that I got you properly.) If so, do you also mean that the API
>>>>>>>>>>>> (namely JdbcQueryTask) between server and client should also
>>>>>>>>>>>> change? Or should new API facilities be added to facilitate
>>>>>>>>>>>> batching tasks?
>>>>>>>>>>>>
>>>>>>>>>>>> - Alex
>>>>>>>>>>>>
>>>>>>>>>>>> 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <
>>>>>>>>>>>> sergi.vlady...@gmail.com>:
>>>>>>>>>>>>> Guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I discussed this feature with Dmitriy and we came to the
>>>>>>>>>>>>> conclusion that batching in JDBC and Data Streaming in Ignite
>>>>>>>>>>>>> have different semantics and performance characteristics.
>>>>>>>>>>>>> Thus they are independent features (they may work together or
>>>>>>>>>>>>> separately, but this is another story).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me explain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is how JDBC batching works:
>>>>>>>>>>>>> - Add N sets of parameters to a prepared statement.
>>>>>>>>>>>>> - Manually execute the prepared statement.
>>>>>>>>>>>>> - Repeat until all the data is loaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is how the data streamer works:
>>>>>>>>>>>>> - Keep adding data.
>>>>>>>>>>>>> - The streamer will buffer the data and load buffered
>>>>>>>>>>>>> per-node batches when they are big enough.
>>>>>>>>>>>>> - Close the streamer to make sure that everything is loaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see, there is a difference in the semantics of
>>>>>>>>>>>>> when we send data: if our JDBC allows sending batches to
>>>>>>>>>>>>> nodes without calling `execute` (and probably we will need to
>>>>>>>>>>>>> make `execute` a no-op here), then we are violating JDBC
>>>>>>>>>>>>> semantics; if we disallow this behavior, then this batching
>>>>>>>>>>>>> will underperform.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thus I suggest keeping these features (JDBC Batching and JDBC
>>>>>>>>>>>>> Streaming) separate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As I already said, they can work together: Batching will
>>>>>>>>>>>>> batch parameters, and on `execute` they will go to the
>>>>>>>>>>>>> Streamer in one shot, and the Streamer will deal with the
>>>>>>>>>>>>> rest.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sergi
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <
>>>>>>>>>>>>> voze...@gridgain.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To my understanding there are two possible approaches to
>>>>>>>>>>>>>> batching at the JDBC layer:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Rely on the default batching API, specifically
>>>>>>>>>>>>>> *PreparedStatement.addBatch()* [1] and others. This is a
>>>>>>>>>>>>>> nice and clear API, users are used to it, and its adoption
>>>>>>>>>>>>>> will minimize user code changes when migrating from other
>>>>>>>>>>>>>> JDBC sources. We simply accumulate updates locally and then
>>>>>>>>>>>>>> execute them all at once with only a single network hop to
>>>>>>>>>>>>>> the servers. *IgniteDataStreamer* can be used underneath.
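A minimal sketch of approach 1 from the user's side, using only the standard JDBC batching API; the Person table and the data are illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class JdbcBatchLoad {
        // Standard JDBC batching: parameter sets accumulate locally and
        // executeBatch() ships them to the cluster in one shot (e.g. backed
        // by a single cache.putAll underneath, as described above).
        static void load(Connection conn) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO Person (_key, name) VALUES (?, ?)")) {
                for (int i = 0; i < 1000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "name-" + i);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }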
>>>>>>>>>>>>>> 2) Or we can have a separate connection flag which will
>>>>>>>>>>>>>> route all INSERT/UPDATE/DELETE statements through the
>>>>>>>>>>>>>> streamer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I prefer the first approach.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, we need to keep in mind that the data streamer has
>>>>>>>>>>>>>> poor performance when adding single key-value pairs due to
>>>>>>>>>>>>>> the high overhead of concurrency and other bookkeeping.
>>>>>>>>>>>>>> Instead, it is better to pre-batch key-value pairs before
>>>>>>>>>>>>>> giving them to the streamer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vladimir.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
>>>>>>>>>>>>>> alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Igniters,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One of the major improvements to DML has to be support for
>>>>>>>>>>>>>>> batch statements. I'd like to discuss its implementation.
>>>>>>>>>>>>>>> The suggested approach is to rewrite the given query,
>>>>>>>>>>>>>>> turning it from a few INSERTs into a single statement and
>>>>>>>>>>>>>>> processing the arguments accordingly. I suggest this
>>>>>>>>>>>>>>> because the whole point of batching is to make as few
>>>>>>>>>>>>>>> interactions with the cluster as possible and to make
>>>>>>>>>>>>>>> operations as condensed as possible, and in the case of
>>>>>>>>>>>>>>> Ignite that means we should send as few JdbcQueryTasks as
>>>>>>>>>>>>>>> possible. And, since a query task holds a single query and
>>>>>>>>>>>>>>> its arguments, this approach will not require any changes
>>>>>>>>>>>>>>> to the current design and won't break any backward
>>>>>>>>>>>>>>> compatibility - all the dirty work of rewriting will be
>>>>>>>>>>>>>>> done by the JDBC driver.
>>>>>>>>>>>>>>> Without rewriting, we could introduce some new query task
>>>>>>>>>>>>>>> for batch operations, but that would make it impossible to
>>>>>>>>>>>>>>> send such requests from newer clients to older servers
>>>>>>>>>>>>>>> (say, servers of version 1.8.0, which does not know about
>>>>>>>>>>>>>>> batching, let alone older versions).
>>>>>>>>>>>>>>> I'd like to hear comments and suggestions from the
>>>>>>>>>>>>>>> community. Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Alex
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Vladimir Ozerov
>>>>>> Senior Software Architect
>>>>>> GridGain Systems
>>>>>> www.gridgain.com
>>>>>> *+7 (960) 283 98 40*
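For completeness, a minimal sketch of the streamer flow Sergi outlines above, combined with Vladimir's pre-batching advice; the cache name, chunk size, and data are illustrative:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;

    public class StreamerLoad {
        static void load(Ignite ignite) {
            // Streamer flow: keep adding data; the streamer buffers per-node
            // batches and flushes them when they are big enough; close()
            // flushes whatever remains. Pairs are pre-batched into a map
            // instead of being handed over one at a time.
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("person")) {
                Map<Integer, String> chunk = new HashMap<>();
                for (int i = 0; i < 1_000_000; i++) {
                    chunk.put(i, "name-" + i);
                    if (chunk.size() == 1024) { // pre-batch before handing off
                        streamer.addData(chunk);
                        chunk = new HashMap<>();
                    }
                }
                if (!chunk.isEmpty())
                    streamer.addData(chunk);
            } // close() flushes the remaining buffered batches
        }
    }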