Auto flush frequency is already there; I just forgot to mention it in the comments. Will add the rest today.
— Alex

On Dec 19, 2016 at 10:29 PM, "Denis Magda" <dma...@apache.org> wrote:

> Alexander,
>
> A couple of comments in regards to the streaming mode.
>
> I would rename the existing property to "ignite.jdbc.streaming" and add additional ones that will help to manage and tune the streaming behavior:
>
> ignite.jdbc.streaming.perNodeBufferSize
> ignite.jdbc.streaming.perNodeParallelOperations
> ignite.jdbc.streaming.autoFlushFrequency
>
> Any other thoughts?
>
> — Denis
>
>> On Dec 19, 2016, at 8:02 AM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>
>> OK folks, both data streamer support and batching support have been implemented.
>>
>> The resulting design fully conforms to what Dima suggested initially: these two concepts are separated.
>>
>> Streamed statements are turned on by a connection flag, and the stream auto flush timeout can be tuned the same way. These statements support INSERT and MERGE without a subquery, as well as fast key-bounded DELETE and UPDATE. Each prepared statement in streamed mode has its own streamer object, and their lifecycles coincide: on close, the statement closes its streamer. Streaming mode is available only in the "local" mode of connection between the JDBC driver and the Ignite client (the default mode, in which the JDBC driver creates an Ignite client node by itself), since there would be no sense in streaming if query arguments had to travel over the network.
>>
>> Batched statements are used via the conventional JDBC API (setArgs... addBatch... executeBatch...). They also support INSERT and MERGE without a subquery, as well as fast key- (and, optionally, value-) bounded DELETE and UPDATE. These work in a similar manner to non-batched statements and likewise rely on the traditional putAll/invokeAll routines. Essentially, batching is just a way to pass a bigger map to cache.putAll without writing a single very long query.
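Alex's point that "batching is just a way to pass a bigger map to cache.putAll" can be sketched with a toy model. This is not the actual driver code: a plain map stands in for the Ignite cache, and the class and method names are illustrative; only the shape (buffer locally, apply in one putAll) mirrors the described design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of JDBC batching: addBatch only buffers, executeBatch does one putAll. */
class BatchModel {
    private final Map<Integer, String> pending = new LinkedHashMap<>();
    private final Map<Integer, String> cache; // stands in for the Ignite cache

    BatchModel(Map<Integer, String> cache) { this.cache = cache; }

    /** Like PreparedStatement.addBatch(): buffers locally, no cluster interaction. */
    void addBatch(int key, String val) { pending.put(key, val); }

    /** Like executeBatch(): the whole batch goes out as one putAll instead of N single-row statements. */
    int executeBatch() {
        int n = pending.size();
        cache.putAll(pending); // the single "network hop" in the real driver
        pending.clear();
        return n;
    }
}
```

The design benefit is the same one discussed in the thread: N buffered rows cost one round trip instead of N.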
>> This works in local as well as "remote" Ignite JDBC connectivity mode.
>>
>> More info (details are in the comments):
>>
>> Batching - https://issues.apache.org/jira/browse/IGNITE-4269
>> Streaming - https://issues.apache.org/jira/browse/IGNITE-4169
>>
>> Regards,
>> Alex
>>
>> 2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>:
>>
>>> Alex,
>>>
>>> It seems to me that replace semantics can be implemented with a StreamReceiver, no?
>>>
>>> D.
>>>
>>> On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>
>>>> Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".
>>>>
>>>> — Alex
>>>> On Dec 10, 2016 at 1:52 PM, "Alexander Paschenko" <alexander.a.pasche...@gmail.com> wrote:
>>>>
>>>>> Dima,
>>>>>
>>>>> I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8, exactly as you are suggesting now (turned on via a connection flag; allowed only MERGE, since the data streamer can't do putIfAbsent stuff, right?; absolutely no relation w/JDBC), *but* that patch was reverted on advice from Vlad, which I believe had been agreed with you, so it didn't make it into 1.8 after all.
>>>>>
>>>>> Also, while it's possible to maintain INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips a put only when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing*. I.e., there's no way to emulate the cache's *replace* operation semantics with the streamer (update the value only if the key is present, otherwise do nothing).
>>>>>
>>>>> — Alex
>>>>> On Dec 9, 2016 at 10:00 PM, "Dmitriy Setrakyan" <dsetrak...@apache.org> wrote:
>>>>>
>>>>>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>>>>>
>>>>>>> I already expressed my concern - this is a counterintuitive approach, because without happens-before the pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?
>>>>>>
>>>>>> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and DataStreamer.addData().
>>>>>>
>>>>>> JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.
>>>>>>
>>>>>> As far as links between data, Ignite does not have foreign-key constraints, so the DataStreamer can insert data in any order (but again, not as part of a JDBC batch).
>>>>>>
>>>>>>> Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.
>>>>>>
>>>>>> Using the DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.
>>>>>>
>>>>>> There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.
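The semantic gap Alex describes, that the streamer's allowOverwrite flag can express INSERT-like and MERGE-like writes but not UPDATE's replace-only behavior, can be made concrete with a map-based sketch. This is a model of the three semantics over a plain map, not streamer code; the method names are illustrative.

```java
import java.util.Map;

/** Models the three write semantics discussed in the thread over a plain map. */
class OverwriteSemantics {
    /** allowOverwrite == false: INSERT-like, never replaces an existing value. */
    static void addNoOverwrite(Map<String, Integer> m, String k, int v) { m.putIfAbsent(k, v); }

    /** allowOverwrite == true: MERGE-like, writes unconditionally. */
    static void addOverwrite(Map<String, Integer> m, String k, int v) { m.put(k, v); }

    /** UPDATE-like: write only if the key is already present; neither mode above behaves like this. */
    static void replaceOnly(Map<String, Integer> m, String k, int v) { m.replace(k, v); }
}
```

The first two methods cover the streamer's two modes; the third is what UPDATE needs, and it is not expressible as a combination of the other two, which is Alex's point.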
>>>>>>> Please see how Oracle did that; this is precisely what I am talking about:
>>>>>>> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>>>>>>
>>>>>>> Two batching modes - one with explicit flush, another one with implicit flush, where Oracle decides on its own when it is better to communicate with the server. The batching mode can be declared globally or on a per-statement level. Simple and flexible.
>>>>>>>
>>>>>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>>>>>
>>>>>>>> Gents,
>>>>>>>>
>>>>>>>> As Sergi suggested, batching and streaming are very different semantically.
>>>>>>>>
>>>>>>>> To use standard JDBC batching, all we need to do is convert it to a cache.putAll() method, as semantically a putAll(...) call is identical to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between, then we may have to break a batch into several chunks and execute the update in between. The DataStreamer should not be used here.
>>>>>>>>
>>>>>>>> I believe that for streaming we need to add a special JDBC/ODBC connection flag. Whenever this flag is set to true, we should only allow INSERT or single-UPDATE operations and use the DataStreamer API internally. All operations other than INSERT or single-UPDATE should be prohibited.
>>>>>>>>
>>>>>>>> I think this design is semantically clear. Any objections?
>>>>>>>>
>>>>>>>> D.
>>>>>>>>
>>>>>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> If we use the Streamer, then we always have `happens-before` broken. This is OK, because the Streamer is for data loading, not for usual operation.
>>>>>>>>> We are not inventing any bicycles, just separating concerns: Batching and Streaming.
>>>>>>>>>
>>>>>>>>> My point here is that they should not depend on each other at all: Batching can work with or without Streaming, just as Streaming can work with or without Batching.
>>>>>>>>>
>>>>>>>>> Your proposal is a set of non-obvious rules for them to work. I see no reason for these complications.
>>>>>>>>>
>>>>>>>>> Sergi
>>>>>>>>>
>>>>>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>>
>>>>>>>>>> Sergi,
>>>>>>>>>>
>>>>>>>>>> If a user calls a single *execute()* operation, then most likely it is not batching. We should not rely on the strange case where a user performs batching without using the standard and well-adopted JDBC batching API. The main problem with the streamer is that it is async and hence breaks happens-before guarantees in a single thread: a SELECT after an INSERT might not return the inserted value.
>>>>>>>>>>
>>>>>>>>>> Honestly, I do not really understand why we are trying to re-invent a bicycle here. There is a standard API - let's just use it and make it flexible enough to take advantage of IgniteDataStreamer if needed.
>>>>>>>>>>
>>>>>>>>>> Is there any use case which is not covered by this solution? Or let me ask from the opposite side - are there any well-known JDBC drivers which perform batching/streaming from non-batched update statements?
>>>>>>>>>>
>>>>>>>>>> Vladimir.
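Vladimir's happens-before concern, that with an async streamer a SELECT issued right after an INSERT may not see the inserted row until a flush, can be sketched with a toy buffered writer. This is not IgniteDataStreamer itself; it is a synchronous model whose buffer-then-flush shape reproduces the visibility gap being discussed.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy streamer: writes are buffered and reach the "cache" only on flush. */
class ToyStreamer {
    private final Map<Integer, String> cache = new HashMap<>();
    private final Map<Integer, String> buf = new LinkedHashMap<>();

    /** Like IgniteDataStreamer.addData(): asynchronous in the real thing, buffered here. */
    void addData(int key, String val) { buf.put(key, val); }

    /** A read goes straight to the cache and bypasses the buffer. */
    String select(int key) { return cache.get(key); }

    /** Buffered entries become visible only once flushed. */
    void flush() { cache.putAll(buf); buf.clear(); }
}
```

The read-after-write miss before flush is exactly why the thread treats streaming as a data-loading mode rather than a drop-in replacement for ordinary statements.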
>>>>>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Vladimir,
>>>>>>>>>>>
>>>>>>>>>>> I see no reason to forbid Streamer usage from non-batched statement execution. It is common that users already have their ETL tools, and you can't be sure whether they use batching or not.
>>>>>>>>>>>
>>>>>>>>>>> Alex,
>>>>>>>>>>>
>>>>>>>>>>> I guess we have to decide on Streaming first, and then we will discuss Batching separately, OK? Because this decision may become important for the batching implementation.
>>>>>>>>>>>
>>>>>>>>>>> Sergi
>>>>>>>>>>>
>>>>>>>>>>> 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
>>>>>>>>>>>
>>>>>>>>>>>> Alex,
>>>>>>>>>>>>
>>>>>>>>>>>> In most cases JdbcQueryTask should be executed locally on the client node started by the JDBC driver:
>>>>>>>>>>>>
>>>>>>>>>>>> JdbcQueryTask.QueryResult res =
>>>>>>>>>>>>     loc ? qryTask.call() :
>>>>>>>>>>>>     ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
>>>>>>>>>>>>
>>>>>>>>>>>> Is this valid behavior after introducing the DML functionality?
>>>>>>>>>>>>
>>>>>>>>>>>> In cases when a user wants to execute a query on a specific node, he should fully understand what he wants and what can go wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sergi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> JDBC batching might work quite differently from driver to driver.
>>>>>>>>>>>>> Say, MySQL happily rewrites queries as I suggested at the beginning of this thread (it's not the only strategy, but one of the possible options) - and, BTW, I would like to hear at least an opinion about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On your first approach, the section before the streamer: you suggest that we send a single statement and multiple parameter sets as a single query task, am I right? (Just to make sure that I got you properly.) If so, do you also mean that the API (namely JdbcQueryTask) between server and client should also change? Or should new API means be added to facilitate batching tasks?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <sergi.vlady...@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I discussed this feature with Dmitriy and we came to the conclusion that batching in JDBC and Data Streaming in Ignite have different semantics and performance characteristics. Thus they are independent features (they may work together or separately, but that is another story).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me explain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is how JDBC batching works:
>>>>>>>>>>>>>> - Add N sets of parameters to a prepared statement.
>>>>>>>>>>>>>> - Manually execute the prepared statement.
>>>>>>>>>>>>>> - Repeat until all the data is loaded.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is how the data streamer works:
>>>>>>>>>>>>>> - Keep adding data.
>>>>>>>>>>>>>> - The streamer will buffer and load buffered per-node batches when they are big enough.
>>>>>>>>>>>>>> - Close the streamer to make sure that everything is loaded.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you can see, we have a difference in the semantics of when we send data: if in our JDBC we allow sending batches to nodes without calling `execute` (and probably we would need to make `execute` a no-op here), then we are violating JDBC semantics; if we disallow this behavior, then this batching will underperform.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thus I suggest keeping these features (JDBC Batching and JDBC Streaming) separate.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As I already said, they can work together: Batching will batch parameters, and on `execute` they will go to the Streamer in one shot, and the Streamer will deal with the rest.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sergi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To my understanding there are two possible approaches to batching in the JDBC layer:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Rely on the default batching API, specifically *PreparedStatement.addBatch()* [1] and others.
>>>>>>>>>>>>>>> This is a nice and clear API, users are used to it, and its adoption will minimize user code changes when migrating from other JDBC sources. We simply copy updates locally and then execute them all at once with only a single network hop to the servers. *IgniteDataStreamer* can be used underneath.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) Or we can have a separate connection flag which will move all INSERT/UPDATE/DELETE statements through the streamer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I prefer the first approach.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also we need to keep in mind that the data streamer has poor performance when adding single key-value pairs, due to high overhead on concurrency and other bookkeeping. Instead, it is better to pre-batch key-value pairs before giving them to the streamer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Vladimir.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Igniters,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One of the major improvements to DML has to be support of batch statements. I'd like to discuss its implementation.
>>>>>>>>>>>>>>>> The suggested approach is to rewrite the given query, turning it from a few INSERTs into a single statement, and to process the arguments accordingly. I suggest this because the whole point of batching is to make as few interactions with the cluster as possible and to make operations as condensed as possible, and in the case of Ignite that means we should send as few JdbcQueryTasks as possible. And, as long as a query task holds a single query and its arguments, this approach will not require any changes to the current design and won't break any backward compatibility - all the dirty work on rewriting will be done by the JDBC driver.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Without rewriting, we could introduce some new query task for batch operations, but that would make it impossible to send such requests from newer clients to older servers (say, servers of version 1.8.0, which do not know about batching, let alone older versions).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd like to hear comments and suggestions from the community. Thanks!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Alex
>>>>>>>
>>>>>>> --
>>>>>>> Vladimir Ozerov
>>>>>>> Senior Software Architect
>>>>>>> GridGain Systems
>>>>>>> www.gridgain.com
>>>>>>> +7 (960) 283 98 40
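The rewriting strategy Alex proposes, collapsing a batch of identical single-row INSERTs into one multi-row INSERT (the approach the MySQL driver takes with its rewriteBatchedStatements option), can be sketched as follows. The helper class is hypothetical, written for illustration only; it handles the happy path of a `... VALUES (...)` statement and does not deal with parameter flattening or edge cases a real driver would.

```java
/** Hypothetical sketch of batch-to-multi-row INSERT rewriting. */
class BatchRewriter {
    /**
     * Rewrites "INSERT INTO t (a, b) VALUES (?, ?)" batched N times into a single
     * multi-row statement: "INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ...".
     */
    static String rewrite(String singleRowInsert, int batchSize) {
        int at = singleRowInsert.toUpperCase().indexOf("VALUES");
        String head = singleRowInsert.substring(0, at + "VALUES".length());
        String row = singleRowInsert.substring(at + "VALUES".length()).trim();
        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(row); // repeat the row template per batch entry
        return sb.toString();
    }
}
```

The resulting statement travels as one JdbcQueryTask, which is exactly the "as few interactions with the cluster as possible" goal stated above.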