Hi Alex,

>>> Let's give the user the reusable code which is convenient, reliable and
>>> fast.

Convenience is exactly why I asked for an example of how the API could look
and how users are going to use it.

Vladimir.

On Tue, Nov 15, 2016 at 11:18 AM, Alexandr Kuramshin <ein.nsk...@gmail.com>
wrote:

> Hi all,
>
> I think the discussion is going in the wrong direction. Certainly it's not
> a big deal to implement some custom user logic to load the data into
> caches. But the Ignite framework gives the user reusable code built on top
> of the basic system.
>
> So the main question is: why should developers leave the user with a
> convenient way to load caches that is a totally non-optimal solution?
>
> We could talk at length about different persistence storage types, but
> whenever we initiate loading with IgniteCache.loadCache, the current
> implementation imposes significant overhead on the network.
>
> Partition-aware data loading may be used in some scenarios to avoid this
> network overhead, but users are compelled to perform additional steps to
> achieve this optimization: adding a partition column to tables, adding
> compound indices that include the added column, writing a piece of
> repetitive code to load the data into different caches in a fault-tolerant
> fashion, etc.
>
> Let's give the user the reusable code which is convenient, reliable and
> fast.
>
> 2016-11-14 20:56 GMT+03:00 Valentin Kulichenko <
> valentin.kuliche...@gmail.com>:
>
> > Hi Aleksandr,
> >
> > Data streamer is already outlined as one of the possible approaches for
> > loading the data [1]. Basically, you start a designated client node or
> > choose a leader among the server nodes [2] and then use the
> > IgniteDataStreamer API to load the data. With this approach there is no
> > need to have a CacheStore implementation at all. Can you please
> > elaborate on what additional value you are trying to add here?
> >
> > [1] https://apacheignite.readme.io/docs/data-loading#ignitedatastreamer
> > [2] https://apacheignite.readme.io/docs/leader-election
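> >
> > A minimal sketch of that approach, assuming a hypothetical "personCache"
> > and a JDBC source (all names are illustrative, not a concrete API):
> >
> >     import java.sql.*;
> >     import org.apache.ignite.*;
> >
> >     public class Loader {
> >         public static void main(String[] args) throws Exception {
> >             // Start a designated client node and stream rows from the DB.
> >             try (Ignite ignite = Ignition.start("client-config.xml");
> >                  IgniteDataStreamer<Long, String> streamer =
> >                      ignite.dataStreamer("personCache");
> >                  Connection conn =
> >                      DriverManager.getConnection("jdbc:h2:mem:test");
> >                  ResultSet rs = conn.createStatement()
> >                      .executeQuery("select id, name from Person")) {
> >                 while (rs.next())
> >                     // The streamer batches entries and routes each one
> >                     // directly to its primary (and backup) nodes.
> >                     streamer.addData(rs.getLong(1), rs.getString(2));
> >             }
> >         }
> >     }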
> >
> > -Val
> >
> > On Mon, Nov 14, 2016 at 8:23 AM, Dmitriy Setrakyan <
> dsetrak...@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > I just want to clarify a couple of API details from the original email
> > > to make sure that we are making the right assumptions here.
> > >
> > > *"Because no keys are passed to the CacheStore.loadCache methods, the
> > > underlying implementation is forced to read all the data from the
> > > persistence storage"*
> > >
> > > According to the javadoc, the loadCache(...) method receives optional
> > > arguments from the user. You can pass anything you like, including a
> > > list of keys, an SQL where clause, etc.
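> > >
> > > For example (a sketch only; the cache and the WHERE-clause convention
> > > are made up, the signatures are the real ones):
> > >
> > >     // Caller side: the varargs are forwarded to CacheStore.loadCache.
> > >     cache.loadCache(null, "id < 1000000");
> > >
> > >     // Store side: interpret the arguments however you like.
> > >     @Override public void loadCache(IgniteBiInClosure<Long, Person> clo,
> > >         Object... args) {
> > >         String where = args != null && args.length > 0
> > >             ? (String)args[0] : "1=1";
> > >         // ... run "select * from Person where " + where and call
> > >         // clo.apply(key, value) for each row.
> > >     }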
> > >
> > > *"The partition-aware data loading approach is not a choice. It
> requires
> > > > persistence of the volatile data depended on affinity function
> > > > implementation and settings."*
> > >
> > >
> > > This is only partially true. While Ignite allows plugging in custom
> > > affinity functions, the affinity function is not something that changes
> > > dynamically, and it should always return the same partition for the
> > > same key. So the partition assignments are not volatile at all. If, in
> > > some very rare case, the partition assignment logic needs to change,
> > > then you could also update the partition assignments that you may have
> > > persisted elsewhere, e.g. in a database.
> > >
> > > D.
> > >
> > > On Mon, Nov 14, 2016 at 10:23 AM, Vladimir Ozerov <
> voze...@gridgain.com>
> > > wrote:
> > >
> > > > Alexandr, Alexey,
> > > >
> > > > While I agree with you that the current cache loading logic is far
> > > > from ideal, it would be cool to see API drafts based on your
> > > > suggestions to get a better understanding of your ideas. How exactly
> > > > are users going to use them?
> > > >
> > > > My main concern is that initial load is not a trivial task in the
> > > > general case. Some users have centralized RDBMS systems, some have
> > > > NoSQL, others work with distributed persistent stores (e.g. HDFS).
> > > > Sometimes we have Ignite nodes "near" the persistent data, sometimes
> > > > we don't. Sharding, affinity, co-location, etc. If we try to support
> > > > all (or many) cases out of the box, we may end up with a very messy
> > > > and difficult API. So we should carefully balance simplicity,
> > > > usability and richness of features here.
> > > >
> > > > Personally, I think that if a user is not satisfied with the
> > > > "loadCache()" API, they can just write a simple closure with a data
> > > > streamer and queries and send it to whatever node they find
> > > > convenient. Not a big deal. Only very common cases should be added to
> > > > the Ignite API.
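> > > >
> > > > Something like this sketch (cache name and node selection are made
> > > > up for illustration):
> > > >
> > > >     // Run a user-written loading closure on one convenient node.
> > > >     ignite.compute(ignite.cluster().forOldest()).run(() -> {
> > > >         Ignite local = Ignition.localIgnite();
> > > >         try (IgniteDataStreamer<Long, String> streamer =
> > > >                  local.dataStreamer("personCache")) {
> > > >             // Query the store here and feed every row into the
> > > >             // streamer; it will route entries to their owners.
> > > >             // streamer.addData(key, value);
> > > >         }
> > > >     });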
> > > >
> > > > Vladimir.
> > > >
> > > >
> > > > On Mon, Nov 14, 2016 at 12:43 PM, Alexey Kuznetsov <
> > > > akuznet...@gridgain.com>
> > > > wrote:
> > > >
> > > > > Looks good to me.
> > > > >
> > > > > But I would suggest considering one more use case:
> > > > >
> > > > > If users know their data, they could manually split the loading.
> > > > > For example, the Person table contains 10M rows, and the user
> > > > > could provide something like:
> > > > >
> > > > > cache.loadCache(null,
> > > > >     "Person", "select * from Person where id < 1_000_000",
> > > > >     "Person", "select * from Person where id >= 1_000_000 and id < 2_000_000",
> > > > >     ...
> > > > >     "Person", "select * from Person where id >= 9_000_000 and id < 10_000_000");
> > > > >
> > > > > or maybe it could be some descriptor object like:
> > > > >
> > > > > {
> > > > >   sql: "select * from Person where id >= ? and id < ?",
> > > > >   range: 0...10_000_000
> > > > > }
> > > > >
> > > > > In this case the provided queries would be sent to as many nodes as
> > > > > there are queries. The data would be loaded in parallel, and for
> > > > > keys that are not local a data streamer should be used (as described
> > > > > in Alexandr's description). See the sketch below.
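> > > > >
> > > > > A rough sketch of how the descriptor could expand into range
> > > > > queries on the caller side (names and the step size are
> > > > > illustrative only):
> > > > >
> > > > >     // Expand {sql, range, step} into per-range loadCache arguments.
> > > > >     long step = 1_000_000L;
> > > > >     List<Object> args = new ArrayList<>();
> > > > >     for (long lo = 0; lo < 10_000_000L; lo += step) {
> > > > >         args.add("Person");
> > > > >         args.add("select * from Person where id >= " + lo +
> > > > >             " and id < " + (lo + step));
> > > > >     }
> > > > >     cache.loadCache(null, args.toArray());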
> > > > >
> > > > > I think it is a good issue for Ignite 2.0
> > > > >
> > > > > Vova, Val - what do you think?
> > > > >
> > > > >
> > > > > On Mon, Nov 14, 2016 at 4:01 PM, Alexandr Kuramshin <
> > > > ein.nsk...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> All right,
> > > > >>
> > > > >> Let's assume a simple scenario. When IgniteCache.loadCache is
> > > > >> invoked, we check whether the cache is non-local, and if so, we
> > > > >> initiate the new loading logic.
> > > > >>
> > > > >> First, we pick a "streamer" node. That could be done by utilizing
> > > > >> LoadBalancingSpi, or the node may be configured statically, for
> > > > >> example because the streamer node runs on the same host as the
> > > > >> persistence storage provider.
> > > > >>
> > > > >> After that we start a loading task on the streamer node which
> > > > >> creates an IgniteDataStreamer and loads the cache with
> > > > >> CacheStore.loadCache. Every call to IgniteBiInClosure.apply simply
> > > > >> invokes IgniteDataStreamer.addData.
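> > > > >>
> > > > >> As a sketch (assuming a hypothetical "myCache" and a store
> > > > >> instance available on the streamer node):
> > > > >>
> > > > >>     // Run CacheStore.loadCache on the chosen streamer node and
> > > > >>     // forward every loaded entry into a data streamer.
> > > > >>     ignite.compute(ignite.cluster().forNode(streamerNode)).run(() -> {
> > > > >>         Ignite local = Ignition.localIgnite();
> > > > >>         try (IgniteDataStreamer<Object, Object> streamer =
> > > > >>                  local.dataStreamer("myCache")) {
> > > > >>             store.loadCache((k, v) -> streamer.addData(k, v));
> > > > >>         }
> > > > >>     });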
> > > > >>
> > > > >> This implementation will completely remove the overhead on the
> > > > >> persistence storage provider. Network overhead is also decreased
> > > > >> in the case of partitioned caches. For two nodes we get 1-1/2 the
> > > > >> amount of data transferred over the network (1 part is transferred
> > > > >> from the persistence storage to the streamer, and then 1/2 from
> > > > >> the streamer node to the other node). For three nodes it is 1-2/3,
> > > > >> and so on: in general 1 + (N-1)/N for N nodes, approaching two
> > > > >> times the amount of data on big clusters.
> > > > >>
> > > > >> I'd like to propose an additional optimization here. If we place
> > > > >> the streamer node on the same machine as the persistence storage
> > > > >> provider, then we remove that part of the network overhead as
> > > > >> well. It could be some special daemon node assigned for cache
> > > > >> loading in the cache configuration, or an ordinary server node.
> > > > >>
> > > > >> Certainly these calculations assume an evenly partitioned cache
> > > > >> with only primary copies (no backups). In the case of one backup
> > > > >> (the most frequent case, I think), we get 2 times the amount of
> > > > >> data transferred over the network on two nodes, 2-1/3 on three,
> > > > >> 2-1/2 on four, and so on (1 + 2(N-1)/N in general), up to three
> > > > >> times the amount of data on big clusters. Hence it's still better
> > > > >> than the current implementation. In the worst case, with a fully
> > > > >> replicated cache, we get N+1 times the amount of data transferred
> > > > >> over the network (where N is the number of nodes in the cluster).
> > > > >> But that's not a problem in small clusters, and only a small
> > > > >> overhead in big clusters. And we still gain the persistence
> > > > >> storage provider optimization.
> > > > >>
> > > > >> Now let's take a more complex scenario. To achieve some level of
> > > > >> parallelism, we could split our cluster into several groups. It
> > > > >> could be a parameter of the IgniteCache.loadCache method or a
> > > > >> cache configuration option. The number of groups could be a fixed
> > > > >> value, or it could be calculated dynamically from the maximum
> > > > >> number of nodes per group.
> > > > >>
> > > > >> After splitting the whole cluster into groups, we pick a streamer
> > > > >> node in each group and submit a cache-loading task similar to the
> > > > >> single-streamer scenario, except that only the keys that belong to
> > > > >> the cluster group where that streamer node is running are passed
> > > > >> to IgniteDataStreamer.addData (see the sketch below).
> > > > >>
> > > > >> In this case the overhead grows with the level of parallelism,
> > > > >> not with the total number of nodes in the cluster.
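> > > > >>
> > > > >> A sketch of the per-group filtering, continuing the earlier one
> > > > >> (again with a hypothetical "myCache"; groupNodes is assumed to be
> > > > >> the set of nodes in this streamer's group):
> > > > >>
> > > > >>     // Inside the loading task on one group's streamer node:
> > > > >>     Affinity<Object> aff = local.affinity("myCache");
> > > > >>     store.loadCache((k, v) -> {
> > > > >>         // Stream only entries whose primary node is in this group.
> > > > >>         if (groupNodes.contains(aff.mapKeyToNode(k)))
> > > > >>             streamer.addData(k, v);
> > > > >>     });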
> > > > >>
> > > > >> 2016-11-11 15:37 GMT+03:00 Alexey Kuznetsov <
> akuznet...@apache.org
> > >:
> > > > >>
> > > > >> > Alexandr,
> > > > >> >
> > > > >> > Could you describe your proposal in more detail?
> > > > >> > Especially the case with several nodes.
> > > > >> >
> > > > >> > On Fri, Nov 11, 2016 at 6:34 PM, Alexandr Kuramshin <
> > > > >> ein.nsk...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi,
> > > > >> > >
> > > > >> > > You know the CacheStore API that is commonly used for the
> > > > >> > > read/write-through relationship between the in-memory data and
> > > > >> > > the persistence storage.
> > > > >> > >
> > > > >> > > There is also the IgniteCache.loadCache method for hot-loading
> > > > >> > > the cache on startup. Invocation of this method causes
> > > > >> > > execution of CacheStore.loadCache on all nodes storing the
> > > > >> > > cache partitions. Because no keys are passed to the
> > > > >> > > CacheStore.loadCache methods, the underlying implementation is
> > > > >> > > forced to read all the data from the persistence storage, but
> > > > >> > > only part of the data will be stored on each node.
> > > > >> > >
> > > > >> > > So, the current implementation has two general drawbacks:
> > > > >> > >
> > > > >> > > 1. The persistence storage is forced to perform as many
> > > > >> > > identical queries as there are nodes in the cluster. Each
> > > > >> > > query may involve a lot of additional computation on the
> > > > >> > > persistence storage server.
> > > > >> > >
> > > > >> > > 2. The network is forced to transfer much more data, which is
> > > > >> > > obviously a big disadvantage on large systems.
> > > > >> > >
> > > > >> > > The partition-aware data loading approach, described in
> > > > >> > > https://apacheignite.readme.io/docs/data-loading#section-
> > > > >> > > partition-aware-data-loading
> > > > >> > > , is not an option. It requires persisting volatile data that
> > > > >> > > depends on the affinity function implementation and settings.
> > > > >> > >
> > > > >> > > I propose using something like IgniteDataStreamer inside the
> > > > >> > > IgniteCache.loadCache implementation.
> > > > >> > >
> > > > >> > >
> > > > >> > > --
> > > > >> > > Thanks,
> > > > >> > > Alexandr Kuramshin
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Alexey Kuznetsov
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Thanks,
> > > > >> Alexandr Kuramshin
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Alexey Kuznetsov
> > > > > GridGain Systems
> > > > > www.gridgain.com
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Thanks,
> Alexandr Kuramshin
>
