Re: [DISCUSSION] Spark Data Frame through Thin Client

Valentin Kulichenko Sat, 20 Oct 2018 18:34:03 -0700

Guys,

From my experience, Ignite and Spark clusters typically run in the same
environment, which makes a client node the preferable option, mainly
because of performance. BTW, I doubt partition awareness on the thin client
will help either, because in data frames we only run SQL queries, and I
believe the thin client will execute them through a proxy anyway. But correct
me if I’m wrong.


Either way, it sounds like we just have usability issues with the Ignite/Spark
integration. Why don’t we concentrate on fixing them then? For example, #3
can be fixed by loading the XML content on the master and then distributing it
to the workers, instead of loading it on every worker independently. Then
there are certain procedures like deploying JARs, etc. First of all, they will
still exist with the thin client. Second of all, I’m sure there are ways to
simplify these procedures and make integration easier. My opinion is that
working on such improvements is going to add more value than another
implementation based on the thin client.
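
To illustrate the idea for #3, here is a minimal sketch in plain Java. It is
not the actual integration code: the class ShippedXmlConfig and its methods
are hypothetical names, and it assumes the XML text travels to the workers
inside a serializable object (for example, captured by a task closure), so
that only the master needs the file on disk. How the parsed content is turned
into an Ignite configuration is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical holder that carries the XML text itself, not a file path. */
final class ShippedXmlConfig implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Full XML content, read once on the master. */
    private final String xml;

    private ShippedXmlConfig(String xml) {
        this.xml = xml;
    }

    /** Runs on the master: the only place where the file has to exist. */
    static ShippedXmlConfig readOnMaster(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new ShippedXmlConfig(new String(bytes, StandardCharsets.UTF_8));
    }

    /** Convenience factory for content that is already in memory. */
    static ShippedXmlConfig fromString(String xml) {
        return new ShippedXmlConfig(xml);
    }

    /** Runs on a worker after deserialization: no file system access. */
    Document parseOnWorker() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
    }
}
```

With something along these lines, the requirement to have the same absolute
path on every worker goes away, because the workers never open the file.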

-Val

On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dma...@apache.org> wrote:

> Hello Nikolay,
>
> Your proposal sounds reasonable. However, I would suggest that we wait until
> partition awareness is supported in the Java thin client first. With that
> feature, the client can connect to any node directly, while presently all
> the communication goes through a proxy (the node the client is connected to),
> which is bad for performance.
>
>
> Vladimir, how hard would it be to support partition awareness in the Java
> client? Perhaps Nikolay can take it over.
>
> --
> Denis
>
>
> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhi...@apache.org>
> wrote:
>
> > Hello, Igniters.
> >
> > Currently, the Spark Data Frame integration is implemented via a client
> > node connection.
> > Whenever we need to retrieve some data into a Spark worker (or master) from
> > Ignite, we start a client node.
> >
> > It has several major disadvantages:
> >
> >         1. We have to copy the whole Ignite distribution onto each Spark
> > worker [1]
> >         2. We have to copy the whole Ignite distribution onto the Spark
> > master to get the catalog working.
> >         3. We have to have the same absolute path to the Ignite
> > configuration file on every worker and provide it during data frame
> > construction [2]
> >         4. We have to additionally configure the Spark workers' classpath
> > to include the Ignite libraries.
> >
> > For now, almost all operations we need in the Spark Data Frame
> > integration are supported by the Java Thin Client:
> >         * obtain the list of caches.
> >         * get cache configuration.
> >         * execute SQL query.
> >         * stream data to the table - not supported by the thin client for
> > now, but can be implemented using simple SQL INSERT statements.
> >
> > Advantages of using the Java Thin Client in the Spark integration (they all
> > follow from the known advantages of the Java Thin Client):
> >         1. Easy to configure: only IP addresses of server nodes are
> > required.
> >         2. Easy to deploy: only 1 additional jar required. No server-side
> > (Ignite worker) configuration required.
> >
> > I propose to implement Spark Data Frame integration through Java Thin
> > Client.
> >
> > Thoughts?
> >
> > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> >
>
