Alexander, a couple of comments regarding the streaming mode.
I would rename the existing property to “ignite.jdbc.streaming” and add a few more that will help to manage and tune the streaming behavior:

ignite.jdbc.streaming.perNodeBufferSize
ignite.jdbc.streaming.perNodeParallelOperations
ignite.jdbc.streaming.autoFlushFrequency

Any other thoughts?

— Denis

> On Dec 19, 2016, at 8:02 AM, Alexander Paschenko
> <alexander.a.pasche...@gmail.com> wrote:
>
> OK folks, both data streamer support and batching support have been
> implemented.
>
> The resulting design fully conforms to what Dima suggested initially -
> these two concepts are separated.
>
> Streamed statements are turned on by a connection flag, and the stream
> auto flush timeout can be tuned in the same way; these statements
> support INSERT and MERGE w/o subquery as well as fast key-bounded
> DELETE and UPDATE; each prepared statement in streamed mode has its
> own streamer object, and their lifecycles are the same - on close, the
> statement closes its streamer. Streaming mode is available only in the
> "local" mode of connection between the JDBC driver and the Ignite
> client (the default mode, when the JDBC driver creates an Ignite
> client node by itself) - there would be no sense in streaming if query
> args had to travel over the network.
>
> Batched statements are used via the conventional JDBC API (setArgs...
> addBatch... executeBatch...); they also support INSERT and MERGE w/o
> subquery as well as fast key (and, optionally, value) bounded DELETE
> and UPDATE. These work in a similar manner to non-batched statements
> and likewise rely on the traditional putAll/invokeAll routines.
> Essentially, batching is just a way to pass a bigger map to
> cache.putAll without writing a single very long query. This works in
> the local as well as the "remote" Ignite JDBC connectivity mode.
>
> More info (details are in the comments):
>
> Batching - https://issues.apache.org/jira/browse/IGNITE-4269
> Streaming - https://issues.apache.org/jira/browse/IGNITE-4169
>
> Regards,
> Alex
>
> 2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>:
>> Alex,
>>
>> It seems to me that replace semantics can be implemented with a
>> StreamReceiver, no?
>>
>> D.
>>
>> On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <
>> alexander.a.pasche...@gmail.com> wrote:
>>
>>> Sorry, "no relation w/JDBC" in my previous message should read "no
>>> relation w/JDBC batching".
>>>
>>> — Alex
>>> On Dec 10, 2016, at 1:52 PM, "Alexander Paschenko" <
>>> alexander.a.pasche...@gmail.com> wrote:
>>>
>>>> Dima,
>>>>
>>>> I would like to point out that data streamer support had already
>>>> been implemented in the course of the work on DML in 1.8, exactly as
>>>> you are suggesting now (turned on via a connection flag; allowed
>>>> only MERGE - the data streamer can't do putIfAbsent stuff, right?;
>>>> absolutely no relation w/JDBC), *but* that patch was reverted - on
>>>> advice from Vlad, which I believe had been agreed with you - so it
>>>> didn't make it into 1.8 after all.
>>>> Also, while it's possible to maintain INSERT vs MERGE semantics
>>>> using the streamer's allowOverwrite flag, I can't see how we could
>>>> mimic UPDATE here: the streamer skips a put only when the key is
>>>> present AND allowOverwrite is false, while UPDATE should not put
>>>> anything when the key is *missing* - i.e., there's no way to emulate
>>>> the cache's *replace* operation semantics with the streamer (update
>>>> the value only if the key is present, otherwise do nothing).
>>>>
>>>> — Alex
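For reference, a minimal sketch of the semantic gap Alex describes, using the public cache and streamer APIs; the cache name "person" and the key/value types are illustrative:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.IgniteDataStreamer;

    public class ReplaceVsStreamer {
        static void compare(Ignite ignite) {
            // UPDATE-like (replace) semantics: write the value only if the key is present.
            IgniteCache<Integer, String> cache = ignite.cache("person");
            cache.replace(1, "updated"); // no-op if key 1 is absent

            // Streamer semantics: allowOverwrite(false) is "insert if absent" (INSERT-like),
            // allowOverwrite(true) is "always put" (MERGE-like). Neither variant skips
            // the write when the key is missing, so UPDATE/replace cannot be expressed.
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("person")) {
                streamer.allowOverwrite(true);
                streamer.addData(1, "updated"); // still written even if key 1 never existed
            }
        }
    }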
>>>> On Dec 9, 2016, at 10:00 PM, "Dmitriy Setrakyan" <
>>>> dsetrak...@apache.org> wrote:
>>>>
>>>>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com>
>>>>> wrote:
>>>>>
>>>>>> I already expressed my concern - this is a counterintuitive
>>>>>> approach, because without happens-before the pure streaming model
>>>>>> can be applied only to independent chunks of data. It means that
>>>>>> the mentioned ETL use case is not feasible - ETL always depends on
>>>>>> implicit or explicit links between tables, and hence streaming is
>>>>>> not applicable here. My question still stands - which products,
>>>>>> except possibly Ignite, do this kind of JDBC streaming?
>>>>>>
>>>>>
>>>>> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
>>>>> DataStreamer.addData().
>>>>>
>>>>> JDBC batching and putAll() are absolutely identical. If you see it as
>>>>> counterintuitive, I would ask for a concrete example.
>>>>>
>>>>> As far as links between data go, Ignite does not have foreign-key
>>>>> constraints, so the DataStreamer can insert data in any order (but
>>>>> again, not as part of a JDBC batch).
>>>>>
>>>>>
>>>>>>
>>>>>> Another problem is that a connection-wide property doesn't fit well
>>>>>> into the JDBC pooling model. Users will have to use different
>>>>>> connections for streaming and non-streaming approaches.
>>>>>>
>>>>>
>>>>> Using the DataStreamer is not possible within the JDBC batching
>>>>> paradigm, period. I wish we could drop the high-level-feels-good
>>>>> discussions altogether, because it seems like we are spinning wheels
>>>>> here.
>>>>>
>>>>> There is no way to use the streamer in a JDBC context, unless we add
>>>>> a connection flag. Again, if you disagree, I would prefer to see a
>>>>> concrete example explaining why.
>>>>>
>>>>>
>>>>>> Please see how Oracle did that, this is precisely what I am talking
>>>>>> about:
>>>>>> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>>>>> Two batching modes - one with explicit flush, another one with
>>>>>> implicit flush, where Oracle decides on its own when it is better to
>>>>>> communicate with the server. The batching mode can be declared
>>>>>> globally or at the per-statement level. Simple and flexible.
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <
>>>>>> dsetrak...@apache.org> wrote:
>>>>>>
>>>>>>> Gents,
>>>>>>>
>>>>>>> As Sergi suggested, batching and streaming are very different
>>>>>>> semantically.
>>>>>>>
>>>>>>> To use standard JDBC batching, all we need to do is convert it to a
>>>>>>> cache.putAll() call, as semantically a putAll(...) call is
>>>>>>> identical to a JDBC batch. Of course, if we see an UPDATE with a
>>>>>>> WHERE clause in between, then we may have to break the batch into
>>>>>>> several chunks and execute the update in between. The DataStreamer
>>>>>>> should not be used here.
>>>>>>>
>>>>>>> I believe that for streaming we need to add a special JDBC/ODBC
>>>>>>> connection flag. Whenever this flag is set to true, we should only
>>>>>>> allow INSERT or single-UPDATE operations and use the DataStreamer
>>>>>>> API internally. All operations other than INSERT or single-UPDATE
>>>>>>> should be prohibited.
>>>>>>>
>>>>>>> I think this design is semantically clear. Any objections?
>>>>>>>
>>>>>>> D.
>>>>>>>
>>>>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <
>>>>>>> sergi.vlady...@gmail.com> wrote:
>>>>>>>
>>>>>>>> If we use the Streamer, then we always have `happens-before`
>>>>>>>> broken. This is ok, because the Streamer is for data loading, not
>>>>>>>> for usual operation.
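To illustrate Sergi's point, a minimal sketch of the broken happens-before on a streaming connection; the streaming flag in the URL is the one under discussion in this thread, so its exact name and syntax are assumptions, not a released API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamingVisibility {
        public static void main(String[] args) throws Exception {
            // "ignite.jdbc.streaming=true" is the proposed connection flag; assumption only.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:ignite:cfg://ignite.jdbc.streaming=true@file:///etc/ignite-jdbc.xml");
                 Statement stmt = conn.createStatement()) {

                stmt.executeUpdate("INSERT INTO Person (_key, name) VALUES (1, 'Ann')");

                // The row may still sit in the client-side streamer buffer, so this
                // SELECT is not guaranteed to see it - happens-before is broken.
                ResultSet rs = stmt.executeQuery("SELECT name FROM Person WHERE _key = 1");
                System.out.println(rs.next() ? rs.getString(1) : "not visible yet");
            }
        }
    }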
>>>>>>>> We are not inventing any bicycles, just separating concerns:
>>>>>>>> Batching and Streaming.
>>>>>>>>
>>>>>>>> My point here is that they should not depend on each other at all:
>>>>>>>> Batching can work with or without Streaming, just as Streaming can
>>>>>>>> work with or without Batching.
>>>>>>>>
>>>>>>>> Your proposal is a set of non-obvious rules for how they should
>>>>>>>> work together. I see no reason for these complications.
>>>>>>>>
>>>>>>>> Sergi
>>>>>>>>
>>>>>>>>
>>>>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>
>>>>>>>>> Sergi,
>>>>>>>>>
>>>>>>>>> If a user calls a single *execute()* operation, then most likely
>>>>>>>>> it is not batching. We should not rely on the strange case where
>>>>>>>>> a user performs batching without using the standard and
>>>>>>>>> well-adopted JDBC batching API. The main problem with the
>>>>>>>>> streamer is that it is async and hence breaks happens-before
>>>>>>>>> guarantees within a single thread: a SELECT after an INSERT might
>>>>>>>>> not return the inserted value.
>>>>>>>>>
>>>>>>>>> Honestly, I do not really understand why we are trying to
>>>>>>>>> re-invent a bicycle here. There is a standard API - let's just
>>>>>>>>> use it and make it flexible enough to take advantage of
>>>>>>>>> IgniteDataStreamer if needed.
>>>>>>>>>
>>>>>>>>> Is there any use case which is not covered by this solution? Or
>>>>>>>>> let me ask from the opposite side - are there any well-known JDBC
>>>>>>>>> drivers which perform batching/streaming from non-batched update
>>>>>>>>> statements?
>>>>>>>>>
>>>>>>>>> Vladimir.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <
>>>>>>>>> sergi.vlady...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Vladimir,
>>>>>>>>>>
>>>>>>>>>> I see no reason to forbid Streamer usage from non-batched
>>>>>>>>>> statement execution.
>>>>>>>>>> It is common that users already have their ETL tools, and you
>>>>>>>>>> can't be sure whether they use batching or not.
>>>>>>>>>>
>>>>>>>>>> Alex,
>>>>>>>>>>
>>>>>>>>>> I guess we have to decide on Streaming first and then discuss
>>>>>>>>>> Batching separately, ok? Because this decision may become
>>>>>>>>>> important for the batching implementation.
>>>>>>>>>>
>>>>>>>>>> Sergi
>>>>>>>>>>
>>>>>>>>>> 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
>>>>>>>>>>
>>>>>>>>>>> Alex,
>>>>>>>>>>>
>>>>>>>>>>> In most cases JdbcQueryTask should be executed locally on the
>>>>>>>>>>> client node started by the JDBC driver:
>>>>>>>>>>>
>>>>>>>>>>> JdbcQueryTask.QueryResult res = loc
>>>>>>>>>>>     ? qryTask.call()
>>>>>>>>>>>     : ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
>>>>>>>>>>>
>>>>>>>>>>> Is this still valid behavior after introducing the DML
>>>>>>>>>>> functionality?
>>>>>>>>>>>
>>>>>>>>>>> In cases when a user wants to execute a query on a specific
>>>>>>>>>>> node, he should fully understand what he wants and what can go
>>>>>>>>>>> wrong.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
>>>>>>>>>>> <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>> Sergi,
>>>>>>>>>>>>
>>>>>>>>>>>> JDBC batching might work quite differently from driver to
>>>>>>>>>>>> driver. Say, MySQL happily rewrites queries as I had suggested
>>>>>>>>>>>> in the beginning of this thread (it's not the only strategy,
>>>>>>>>>>>> but one of the possible options) - and, BTW, I would like to
>>>>>>>>>>>> hear at least an opinion about it.
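A rough sketch of that rewriting strategy; the table and column names are illustrative, and this shows only the rewrite idea, not actual driver code:

    import java.util.Collections;

    public class BatchRewrite {
        // Collapse N batched parameter sets for a single-row INSERT into one
        // multi-row statement, so the whole batch travels as one query task.
        static String rewrite(String prefix, int batchSize) {
            return prefix + String.join(", ", Collections.nCopies(batchSize, "(?, ?)"));
        }

        public static void main(String[] args) {
            // Prints: INSERT INTO Person (_key, name) VALUES (?, ?), (?, ?), (?, ?)
            System.out.println(rewrite("INSERT INTO Person (_key, name) VALUES ", 3));
            // The argument arrays of the individual batch entries are then
            // concatenated in the same order to match the placeholders.
        }
    }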
>>>>>>>>>>>> On your first approach, the section before the streamer: you
>>>>>>>>>>>> suggest that we send a single statement and multiple parameter
>>>>>>>>>>>> sets as a single query task, am I right? (Just to make sure
>>>>>>>>>>>> that I got you properly.) If so, do you also mean that the API
>>>>>>>>>>>> (namely JdbcQueryTask) between server and client should also
>>>>>>>>>>>> change? Or should new API facilities be added to facilitate
>>>>>>>>>>>> batching tasks?
>>>>>>>>>>>>
>>>>>>>>>>>> - Alex
>>>>>>>>>>>>
>>>>>>>>>>>> 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <
>>>>>>>>>>>> sergi.vlady...@gmail.com>:
>>>>>>>>>>>>> Guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I discussed this feature with Dmitriy and we came to the
>>>>>>>>>>>>> conclusion that batching in JDBC and Data Streaming in Ignite
>>>>>>>>>>>>> have different semantics and performance characteristics.
>>>>>>>>>>>>> Thus they are independent features (they may work together or
>>>>>>>>>>>>> separately, but this is another story).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me explain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is how JDBC batching works:
>>>>>>>>>>>>> - Add N sets of parameters to a prepared statement.
>>>>>>>>>>>>> - Manually execute the prepared statement.
>>>>>>>>>>>>> - Repeat until all the data is loaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is how the data streamer works:
>>>>>>>>>>>>> - Keep adding data.
>>>>>>>>>>>>> - The streamer will buffer the data and load buffered
>>>>>>>>>>>>> per-node batches when they are big enough.
>>>>>>>>>>>>> - Close the streamer to make sure that everything is loaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see, there is a difference in the semantics of
>>>>>>>>>>>>> when we send data: if our JDBC allows sending batches to
>>>>>>>>>>>>> nodes without calling `execute` (and probably we will need to
>>>>>>>>>>>>> make `execute` a no-op here), then we are violating JDBC
>>>>>>>>>>>>> semantics; if we disallow this behavior, then this batching
>>>>>>>>>>>>> will underperform.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thus I suggest keeping these features (JDBC Batching and JDBC
>>>>>>>>>>>>> Streaming) separate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As I already said, they can work together: Batching will
>>>>>>>>>>>>> batch parameters, and on `execute` they will go to the
>>>>>>>>>>>>> Streamer in one shot, and the Streamer will deal with the
>>>>>>>>>>>>> rest.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sergi
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <
>>>>>>>>>>>>> voze...@gridgain.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To my understanding there are two possible approaches to
>>>>>>>>>>>>>> batching at the JDBC layer:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Rely on the default batching API, specifically
>>>>>>>>>>>>>> *PreparedStatement.addBatch()* [1] and others. This is a
>>>>>>>>>>>>>> nice and clear API, users are used to it, and its adoption
>>>>>>>>>>>>>> will minimize user code changes when migrating from other
>>>>>>>>>>>>>> JDBC sources. We simply accumulate updates locally and then
>>>>>>>>>>>>>> execute them all at once with only a single network hop to
>>>>>>>>>>>>>> the servers. *IgniteDataStreamer* can be used underneath.
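A minimal sketch of approach 1 from the user's side, using only the standard JDBC batching API; the Person table and the data are illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class JdbcBatchLoad {
        // Standard JDBC batching: parameter sets accumulate locally and
        // executeBatch() ships them to the cluster in one shot (e.g. backed
        // by a single cache.putAll underneath, as described above).
        static void load(Connection conn) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO Person (_key, name) VALUES (?, ?)")) {
                for (int i = 0; i < 1000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "name-" + i);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }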
>>>>>>>>>>>>>> 2) Or we can have a separate connection flag which will
>>>>>>>>>>>>>> route all INSERT/UPDATE/DELETE statements through the
>>>>>>>>>>>>>> streamer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I prefer the first approach.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, we need to keep in mind that the data streamer has
>>>>>>>>>>>>>> poor performance when adding single key-value pairs due to
>>>>>>>>>>>>>> the high overhead of concurrency and other bookkeeping.
>>>>>>>>>>>>>> Instead, it is better to pre-batch key-value pairs before
>>>>>>>>>>>>>> giving them to the streamer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vladimir.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
>>>>>>>>>>>>>> alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Igniters,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One of the major improvements to DML has to be support for
>>>>>>>>>>>>>>> batch statements. I'd like to discuss its implementation.
>>>>>>>>>>>>>>> The suggested approach is to rewrite the given query,
>>>>>>>>>>>>>>> turning it from a few INSERTs into a single statement and
>>>>>>>>>>>>>>> processing the arguments accordingly. I suggest this
>>>>>>>>>>>>>>> because the whole point of batching is to make as few
>>>>>>>>>>>>>>> interactions with the cluster as possible and to make
>>>>>>>>>>>>>>> operations as condensed as possible, and in the case of
>>>>>>>>>>>>>>> Ignite that means we should send as few JdbcQueryTasks as
>>>>>>>>>>>>>>> possible. And, since a query task holds a single query and
>>>>>>>>>>>>>>> its arguments, this approach will not require any changes
>>>>>>>>>>>>>>> to the current design and won't break any backward
>>>>>>>>>>>>>>> compatibility - all the dirty work of rewriting will be
>>>>>>>>>>>>>>> done by the JDBC driver.
>>>>>>>>>>>>>>> Without rewriting, we could introduce some new query task
>>>>>>>>>>>>>>> for batch operations, but that would make it impossible to
>>>>>>>>>>>>>>> send such requests from newer clients to older servers
>>>>>>>>>>>>>>> (say, servers of version 1.8.0, which does not know about
>>>>>>>>>>>>>>> batching, let alone older versions).
>>>>>>>>>>>>>>> I'd like to hear comments and suggestions from the
>>>>>>>>>>>>>>> community. Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Alex
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Vladimir Ozerov
>>>>>> Senior Software Architect
>>>>>> GridGain Systems
>>>>>> www.gridgain.com
>>>>>> *+7 (960) 283 98 40*
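For completeness, a minimal sketch of the streamer flow Sergi outlines above, combined with Vladimir's pre-batching advice; the cache name, chunk size, and data are illustrative:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;

    public class StreamerLoad {
        static void load(Ignite ignite) {
            // Streamer flow: keep adding data; the streamer buffers per-node
            // batches and flushes them when they are big enough; close()
            // flushes whatever remains. Pairs are pre-batched into a map
            // instead of being handed over one at a time.
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("person")) {
                Map<Integer, String> chunk = new HashMap<>();
                for (int i = 0; i < 1_000_000; i++) {
                    chunk.put(i, "name-" + i);
                    if (chunk.size() == 1024) { // pre-batch before handing off
                        streamer.addData(chunk);
                        chunk = new HashMap<>();
                    }
                }
                if (!chunk.isEmpty())
                    streamer.addData(chunk);
            } // close() flushes the remaining buffered batches
        }
    }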