I don't think there are concrete plans to remove the _key and _val fields. That would be quite hard, considering that many users rely on them and they are deeply integrated into the product. However, we have already had multiple usability and other issues caused by their existence, and while fixing those we are gradually getting rid of _key/_val in the public API. It's hard to tell when we will be able to completely deprecate/remove these fields, but we should definitely avoid building new features based on them.

On top of that, I also don't like this approach because it doesn't seem Spark-friendly to me. That's not how typed datasets are typically created (I already provided a documentation link [1] with examples earlier).

From an API standpoint, I think we should do the following (rough sketch below):

1. Add an 'IgniteSparkSession#createDataset(IgniteCache[K, V] cache): Dataset[(K, V)]' method that would create a dataset based on a cache.
2. (Scala only) Introduce 'IgniteCache.toDS()' that would do the same, but via an implicit conversion instead of a SparkSession extension.
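Something like this, roughly (a sketch of the proposed API shape only; names and signatures are illustrative, not a final design):

    import org.apache.ignite.IgniteCache
    import org.apache.spark.sql.Dataset

    // 1. Session-level factory method: a typed Dataset[(K, V)] built
    //    straight from a cache.
    trait IgniteSparkSessionSketch {
      def createDataset[K, V](cache: IgniteCache[K, V]): Dataset[(K, V)]
    }

    // 2. Scala-only sugar: the same thing via an implicit conversion
    //    on the cache itself instead of a SparkSession extension.
    object IgniteImplicits {
      implicit class IgniteCacheOps[K, V](val cache: IgniteCache[K, V]) extends AnyVal {
        def toDS()(implicit session: IgniteSparkSessionSketch): Dataset[(K, V)] =
          session.createDataset(cache)
      }
    }

User code would then boil down to 'val persons: Dataset[(Long, Person)] = cache.toDS()'.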
On the implementation level, we can use the SqlQuery API (not SqlFieldsQuery), which is specifically designed to return key-value pairs instead of individual fields while still providing all SQL capabilities.
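For illustration, here is roughly what that looks like with plain SqlQuery today. This is a minimal sketch: the 'persons' cache name and the Person class are made up, and the cache would need query entities configured for SQL to see it.

    import scala.collection.JavaConverters._

    import org.apache.ignite.Ignition
    import org.apache.ignite.cache.query.SqlQuery

    case class Person(name: String, salary: Double)

    object SqlQueryPairs extends App {
      val ignite = Ignition.start()
      val cache = ignite.getOrCreateCache[Long, Person]("persons")

      // Unlike SqlFieldsQuery, SqlQuery takes the value type plus a WHERE
      // clause and returns whole Cache.Entry[K, V] pairs, not columns.
      val qry = new SqlQuery[Long, Person](classOf[Person], "salary > ?")
        .setArgs(Double.box(50000.0))

      // Exactly the (K, V) shape a Dataset[(K, V)] needs.
      val pairs: Seq[(Long, Person)] =
        cache.query(qry).getAll.asScala.map(e => (e.getKey, e.getValue)).toSeq
    }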
*Nikolay*, does this make sense to you? Is it feasible, and how hard would it be to implement? How much of the existing code can we reuse (I believe it should be the majority of it)?

[1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets

-Val

On Tue, Jul 31, 2018 at 2:03 PM Denis Magda <dma...@apache.org> wrote:

> Hello folks,
>
> The documentation carries only a small reference about _key and _val
> usage, and only for the Ignite SQL APIs (Java, .NET, C++). I tried to
> clean up all the documentation code snippets.
>
> As for the GitHub examples, they require a major overhaul. Instead of
> _key and _val, we need to use SQL fields. Hopefully, someone will groom
> the examples.
>
> Considering this, I wouldn't suggest exposing _key and _val in other
> places like Spark. Are there any alternatives to this approach?
>
> --
> Denis
>
> On Tue, Jul 31, 2018 at 2:49 AM Nikolay Izhikov <nizhi...@apache.org>
> wrote:
>
> > Hello, Igniters.
> >
> > Valentin,
> >
> > > We never recommend to use these fields
> >
> > Actually, we did:
> >
> > * Documentation [1]. Please see the "Predefined Fields" section.
> > * Java example [2]
> > * .NET example [3]
> > * Scala example [4]
> >
> > > ...hopefully will be removed altogether one day
> >
> > This is new to me.
> >
> > Do we have specific plans for it?
> >
> > [1] https://apacheignite-sql.readme.io/docs/schema-and-indexes
> > [2] https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/sql/SqlDmlExample.java#L88
> > [3] https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/examples/Apache.Ignite.Examples/Sql/SqlDmlExample.cs#L91
> > [4] https://github.com/apache/ignite/blob/master/examples/src/main/scala/org/apache/ignite/scalar/examples/ScalarCachePopularNumbersExample.scala#L124
> >
> > On Fri, 27/07/2018 at 15:22 -0700, Valentin Kulichenko wrote:
> > > Stuart,
> > >
> > > The _key and _val fields are quite a dirty hack that was added years
> > > ago and is virtually never used now. We never recommend using these
> > > fields, and I would definitely avoid building new features based on
> > > them.
> > >
> > > Having said that, I'm not disputing the use case, but we need a
> > > better implementation approach here. I suggest we think it over and
> > > come back to this next week :) I'm sure Nikolay will also chime in
> > > and share his thoughts.
> > >
> > > -Val
> > >
> > > On Fri, Jul 27, 2018 at 12:39 PM Stuart Macdonald <stu...@stuwee.org>
> > > wrote:
> > >
> > > > If your predicates and joins are expressed in Spark SQL, you cannot
> > > > currently optimise them and also gain access to the key/val
> > > > objects. If you went without the Ignite Spark SQL optimisations and
> > > > expressed your query in Ignite SQL, you would still need to use the
> > > > _key/_val columns. The Ignite documentation has a specific example
> > > > using the _val column (right at the end):
> > > > https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd
> > > >
> > > > Stuart.
> > > >
> > > > On 27 Jul 2018, at 20:05, Valentin Kulichenko
> > > > <valentin.kuliche...@gmail.com> wrote:
> > > >
> > > > Well, the second approach would use the optimizations, no?
> > > >
> > > > -Val
> > > >
> > > > On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald
> > > > <stu...@stuwee.org> wrote:
> > > >
> > > > Val,
> > > >
> > > > Yes, you can already get access to the cache objects as an RDD or
> > > > Dataset, but you can't use the Ignite-optimised DataFrames with
> > > > these mechanisms. Optimised DataFrames have to be passed through
> > > > Spark SQL's Catalyst engine to allow for predicate pushdown to
> > > > Ignite. So the use case we're talking about here is when we want to
> > > > push Spark filters/joins down to Ignite to optimise, but still have
> > > > access to the underlying cache objects, which is not possible
> > > > currently.
> > > >
> > > > Can you elaborate on the reason the _key and _val columns in Ignite
> > > > SQL will be removed?
> > > >
> > > > Stuart.
> > > >
> > > > On 27 Jul 2018, at 19:39, Valentin Kulichenko
> > > > <valentin.kuliche...@gmail.com> wrote:
> > > >
> > > > Stuart, Nikolay,
> > > >
> > > > I really don't like the idea of exposing the '_key' and '_val'
> > > > fields. This is legacy stuff that hopefully will be removed
> > > > altogether one day. Let's not use it in new features.
> > > >
> > > > Actually, I don't think it's even needed. The Spark docs [1]
> > > > suggest two ways of creating a typed dataset:
> > > > 1. Based on an RDD. This should be supported using IgniteRDD, I
> > > > believe.
> > > > 2. Based on a DataFrame providing a class. This would just work
> > > > out of the box, I guess.
> > > >
> > > > Of course, this needs to be tested and verified, and there might be
> > > > certain pieces missing to fully support the use case. But generally
> > > > I like these approaches much more.
> > > >
> > > > [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
> > > >
> > > > -Val
> > > >
> > > > On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald
> > > > <stu...@stuwee.org> wrote:
> > > >
> > > > > > Here's the ticket:
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/IGNITE-9108
> > > > > >
> > > > > > Stuart.
> > > > > >
> > > > > > On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
> > > > > >
> > > > > > Sure.
> > > > > >
> > > > > > Please send the ticket number in this thread.
> > > > > >
> > > > > > On Fri, 27 Jul 2018 at 16:16, Stuart Macdonald
> > > > > > <stu...@stuwee.org> wrote:
> > > > > >
> > > > > > Thanks, Nikolay. For both options, if the cache object isn't a
> > > > > > simple type, we'd probably do something like this in our Ignite
> > > > > > SQL statement:
> > > > > >
> > > > > > select cast(_key as binary), cast(_val as binary), ...
> > > > > > That would give us the BinaryObject's byte[]. Then for option 1
> > > > > > we keep the Ignite format and introduce a new Spark Encoder for
> > > > > > Ignite binary types
> > > > > > (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
> > > > > > so that the end-user interface would be something like:
> > > > > >
> > > > > > IgniteSparkSession session = ...
> > > > > > Dataset<Row> dataFrame = ...
> > > > > > Dataset<MyValClass> valDataSet =
> > > > > >   dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
> > > > > >
> > > > > > Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF,
> > > > > > so that the user interface would be standard Spark:
> > > > > >
> > > > > > Dataset<Row> dataFrame = ...
> > > > > > Dataset<MyValClass> dataSet =
> > > > > >   dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
> > > > > >
> > > > > > I'll create a ticket and maybe put together a test case for
> > > > > > further discussion?
> > > > > >
> > > > > > Stuart.
> > > > > >
> > > > > > On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > Hello, Stuart.
> > > > > >
> > > > > > I like your idea.
> > > > > >
> > > > > > > 1. Ignite BinaryObjects, in which case we'd need to supply a
> > > > > > > Spark Encoder implementation for BinaryObjects
> > > > > > > 2. Kryo-serialised versions of the objects.
> > > > > >
> > > > > > It seems like the first option is a simple adapter. Am I right?
> > > > > > If yes, I think it's more efficient than transforming each
> > > > > > object to some other (Kryo) format.
> > > > > >
> > > > > > Can you provide some additional links for both options?
> > > > > > Where can I find the API or(and) examples?
> > > > > >
> > > > > > As a second step, we can apply the same approach to regular
> > > > > > key-value caches.
> > > > > >
> > > > > > Feel free to create a ticket.
> > > > > >
> > > > > > On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
> > > > > > > Ignite Dev Community,
> > > > > > >
> > > > > > > Within Ignite-supplied Spark DataFrames, I'd like to propose
> > > > > > > adding support for _key and _val columns which represent the
> > > > > > > cache key and value objects, similar to the current _key/_val
> > > > > > > column semantics in Ignite SQL.
> > > > > > >
> > > > > > > If the cache key or value objects are standard SQL types
> > > > > > > (e.g. String, Int, etc.) they will be represented as such in
> > > > > > > the DataFrame schema; otherwise they are represented as
> > > > > > > Binary types encoded as either: 1. Ignite BinaryObjects, in
> > > > > > > which case we'd need to supply a Spark Encoder implementation
> > > > > > > for BinaryObjects, or 2. Kryo-serialised versions of the
> > > > > > > objects. Option 1 would probably be more efficient, but
> > > > > > > option 2 would be more idiomatic Spark.
> > > > > > > This feature would be controlled with an optional parameter
> > > > > > > in the Ignite data source, defaulting to the current
> > > > > > > implementation, which doesn't supply _key or _val columns.
> > > > > > > The rationale is the same as for the Ignite SQL _key and _val
> > > > > > > columns: to allow access to the full cache objects from a SQL
> > > > > > > context.
> > > > > > >
> > > > > > > Can I ask for feedback on this proposal, please?
> > > > > > >
> > > > > > > I'd be happy to contribute this feature if we agree on the
> > > > > > > concept.
> > > > > > >
> > > > > > > Stuart.
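(For reference, a rough end-to-end sketch of option 2 above. Encoders.kryo and the "ignite" data source with its "table"/"config" options exist today; the MyVal class, the config path, and the proposed _val column are assumptions from the proposal, and whether the encoder picks up the renamed column this way is exactly what the suggested test case would need to verify.)

    import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

    case class MyVal(name: String, salary: Double)

    object KryoValSketch extends App {
      val spark = SparkSession.builder().appName("kryo-val-sketch").getOrCreate()

      // Ignite data source as it exists today; the table name and config
      // path are hypothetical.
      val dataFrame = spark.read
        .format("ignite")
        .option("table", "person")
        .option("config", "ignite-config.xml")
        .load()

      // Kryo-backed datasets carry a single binary column named "value",
      // so the proposed _val bytes are renamed before applying the
      // encoder. This only round-trips if the bytes are Kryo-serialised.
      val values: Dataset[MyVal] =
        dataFrame.select(dataFrame("_val").as("value")).as(Encoders.kryo[MyVal])
    }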