Stuart, I don't see a reason why it would work with DataFrames, but not with Datasets - they are pretty much the same thing. If you have any particular thoughts on this, please let us know.
In any case, I would like to hear from Nikolay as he is the implementor of this functionality. Nikolay, please share your thoughts on my suggestion above.

-Val

On Wed, Aug 1, 2018 at 12:05 AM Stuart Macdonald <stu...@stuwee.org> wrote:

> I believe the suggested approach will not work with the Spark SQL
> relational optimisations which perform predicate pushdown from Spark
> to Ignite. For that to work we need both the key/val and the
> relational fields in a DataFrame schema.
>
> Stuart.
>
> On 1 Aug 2018, at 04:23, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
>
> > I don't think there are exact plans to remove the _key and _val fields,
> > as that is pretty hard considering that many users rely on them and that
> > they are deeply integrated into the product. However, we have already had
> > multiple usability and other issues due to their existence, and while
> > fixing those we are gradually getting rid of _key/_val on the public API.
> > It's hard to tell when we will be able to completely deprecate/remove
> > these fields, but we should definitely avoid building new features based
> > on them.
> >
> > On top of that, I also don't like this approach because it doesn't seem
> > Spark-friendly to me. That's not how they typically create typed
> > datasets (I already provided a documentation link [1] with examples
> > earlier).
> >
> > From an API standpoint, I think we should do the following:
> > 1. Add an 'IgniteSparkSession#createDataset(IgniteCache[K, V] cache):
> > Dataset[(K, V)]' method that would create a dataset based on a cache.
> > 2. (Scala only) Introduce 'IgniteCache.toDS()' that would do the same,
> > but via implicit conversions instead of a SparkSession extension.
> >
> > At the implementation level, we can use the SqlQuery API (not
> > SqlFieldsQuery), which is specifically designed to return key-value pairs
> > instead of specific fields while still providing all SQL capabilities.
> >
> > *Nikolay*, does this make sense to you? Is it feasible, and how hard
> > would it be to implement? How much of the existing code can we reuse
> > (I believe it should be the majority of it)?
> >
> > [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
> >
> > -Val
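For illustration, a minimal Scala sketch of the API shape proposed above. The createDataset and toDS names come from the proposal itself; the rest (a ScanQuery-based body that collects locally) is an assumption made to keep the sketch self-contained — the proposal suggests SqlQuery underneath, which likewise returns key-value pairs, and a real implementation would build a distributed Dataset:

    import scala.collection.JavaConverters._

    import org.apache.ignite.IgniteCache
    import org.apache.ignite.cache.query.ScanQuery
    import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

    object IgniteDatasetSketch {
      // Proposed: build a Dataset[(K, V)] directly from a cache. Sketch only:
      // it collects all entries to the driver, which a real implementation
      // would avoid by distributing the query across the cluster.
      def createDataset[K, V](spark: SparkSession, cache: IgniteCache[K, V])
                             (implicit enc: Encoder[(K, V)]): Dataset[(K, V)] = {
        val pairs = cache.query(new ScanQuery[K, V]()).getAll.asScala
          .map(e => (e.getKey, e.getValue)).toSeq
        spark.createDataset(pairs)
      }

      // Proposed Scala-only sugar: cache.toDS() via an implicit conversion
      // instead of a SparkSession extension.
      implicit class RichIgniteCache[K, V](cache: IgniteCache[K, V]) {
        def toDS()(implicit spark: SparkSession,
                   enc: Encoder[(K, V)]): Dataset[(K, V)] =
          createDataset(spark, cache)
      }
    }

With 'import spark.implicits._' in scope, the tuple encoder comes for free for simple key and value types.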
> >> On Tue, Jul 31, 2018 at 2:03 PM Denis Magda <dma...@apache.org> wrote:
> >>
> >> Hello folks,
> >>
> >> The documentation carries only a small reference to _key and _val usage,
> >> and only for the Ignite SQL APIs (Java, .NET, C++). I tried to clean up
> >> all the documentation code snippets.
> >>
> >> As for the GitHub examples, they require a major overhaul. Instead of
> >> using _key and _val, we need to use SQL fields. Hopefully, someone will
> >> groom the examples.
> >>
> >> Considering this, I wouldn't suggest exposing _key and _val in other
> >> places like Spark. Are there any alternatives to this approach?
> >>
> >> --
> >> Denis
> >>
> >> On Tue, Jul 31, 2018 at 2:49 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> >>
> >>> Hello, Igniters.
> >>>
> >>> Valentin,
> >>>
> >>>> We never recommend using these fields
> >>>
> >>> Actually, we did:
> >>>
> >>> * Documentation [1]. Please see the "Predefined Fields" section.
> >>> * Java example [2]
> >>> * .NET example [3]
> >>> * Scala example [4]
> >>>
> >>>> ...hopefully will be removed altogether one day
> >>>
> >>> This is new to me. Do we have specific plans for it?
> >>>
> >>> [1] https://apacheignite-sql.readme.io/docs/schema-and-indexes
> >>> [2] https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/sql/SqlDmlExample.java#L88
> >>> [3] https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/examples/Apache.Ignite.Examples/Sql/SqlDmlExample.cs#L91
> >>> [4] https://github.com/apache/ignite/blob/master/examples/src/main/scala/org/apache/ignite/scalar/examples/ScalarCachePopularNumbersExample.scala#L124
> >>>
> >>> On Fri, 27/07/2018 at 15:22 -0700, Valentin Kulichenko wrote:
> >>>> Stuart,
> >>>>
> >>>> The _key and _val fields are quite a dirty hack that was added years
> >>>> ago and is virtually never used now. We never recommend using these
> >>>> fields, and I would definitely avoid building new features based on
> >>>> them.
> >>>>
> >>>> Having said that, I'm not arguing the use case, but we need a better
> >>>> implementation approach here. I suggest we think it over and come back
> >>>> to this next week :) I'm sure Nikolay will also chime in and share his
> >>>> thoughts.
> >>>>
> >>>> -Val
> >>>>
> >>>> On Fri, Jul 27, 2018 at 12:39 PM Stuart Macdonald <stu...@stuwee.org> wrote:
> >>>>
> >>>>> If your predicates and joins are expressed in Spark SQL, you cannot
> >>>>> currently optimise those and also gain access to the key/val objects.
> >>>>> If you went without the Ignite Spark SQL optimisations and expressed
> >>>>> your query in Ignite SQL, you would still need to use the _key/_val
> >>>>> columns. The Ignite documentation has this specific example using the
> >>>>> _val column (right at the end):
> >>>>> https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd
> >>>>>
> >>>>> Stuart.
> >>>>>
> >>>>> On 27 Jul 2018, at 20:05, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
> >>>>>
> >>>>> Well, the second approach would use the optimizations, no?
> >>>>>
> >>>>> -Val
> >>>>>
> >>>>> On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald <stu...@stuwee.org> wrote:
> >>>>>
> >>>>> Val,
> >>>>>
> >>>>> Yes, you can already get access to the cache objects as an RDD or
> >>>>> Dataset, but you can't use the Ignite-optimised DataFrames with these
> >>>>> mechanisms. Optimised DataFrames have to be passed through Spark SQL's
> >>>>> Catalyst engine to allow for predicate pushdown to Ignite. So the
> >>>>> use case we're talking about here is when we want to be able to push
> >>>>> Spark filters/joins to Ignite to optimise, but still have access to
> >>>>> the underlying cache objects, which is not possible currently.
> >>>>>
> >>>>> Can you elaborate on the reason the _key and _val columns in Ignite
> >>>>> SQL will be removed?
> >>>>>
> >>>>> Stuart.
> >>>>>
> >>>>> On 27 Jul 2018, at 19:39, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:
> >>>>>
> >>>>> Stuart, Nikolay,
> >>>>>
> >>>>> I really don't like the idea of exposing the '_key' and '_val' fields.
> >>>>> This is legacy stuff that hopefully will be removed altogether one day.
> >>>>> Let's not use it in new features.
> >>>>>
> >>>>> Actually, I don't think it's even needed. The Spark docs [1] suggest
> >>>>> two ways of creating a typed dataset:
> >>>>> 1. Based on an RDD. This should be supported using IgniteRDD, I believe.
> >>>>> 2. Based on a DataFrame, providing a class. This would just work out
> >>>>> of the box, I guess.
> >>>>>
> >>>>> Of course, this needs to be tested and verified, and there might be
> >>>>> certain pieces missing to fully support the use case. But generally I
> >>>>> like these approaches much more.
> >>>>>
> >>>>> [1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
> >>>>>
> >>>>> -Val
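For reference, the two mechanisms from the Spark guide cited above, sketched in plain Spark with a hypothetical Person class (no Ignite involved; in the Ignite case, option 1 would start from IgniteRDD rather than a parallelized collection):

    import org.apache.spark.sql.SparkSession

    // Hypothetical value class used only for this illustration.
    case class Person(id: Long, name: String)

    object TypedDatasetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("typed-ds").getOrCreate()
        import spark.implicits._

        // 1. From an RDD: an implicit encoder turns each element into a
        // typed Dataset row.
        val fromRdd = spark.createDataset(
          spark.sparkContext.parallelize(Seq(Person(1L, "a"), Person(2L, "b"))))

        // 2. From a DataFrame by providing a class: columns are matched to
        // fields by name, with no _key/_val columns required.
        val fromDf = fromRdd.toDF().as[Person]

        fromDf.show()
        spark.stop()
      }
    }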
> >>>>> On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <stu...@stuwee.org> wrote:
> >>>>>
> >>>>> Here's the ticket:
> >>>>>
> >>>>> https://issues.apache.org/jira/browse/IGNITE-9108
> >>>>>
> >>>>> Stuart.
> >>>>>
> >>>>> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
> >>>>>
> >>>>> Sure. Please send the ticket number in this thread.
> >>>>>
> >>>>> On Fri, 27 Jul 2018 at 16:16, Stuart Macdonald <stu...@stuwee.org> wrote:
> >>>>>
> >>>>> Thanks, Nikolay. For both options, if the cache object isn't a simple
> >>>>> type, we'd probably do something like this in our Ignite SQL statement:
> >>>>>
> >>>>> select cast(_key as binary), cast(_val as binary), ...
> >>>>>
> >>>>> which would give us the BinaryObject's byte[]. Then for option 1 we
> >>>>> keep the Ignite format and introduce a new Spark Encoder for Ignite
> >>>>> binary types
> >>>>> (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
> >>>>> so that the end-user interface would be something like:
> >>>>>
> >>>>> IgniteSparkSession session = ...
> >>>>> Dataset<Row> dataFrame = ...
> >>>>> Dataset<MyValClass> valDataSet =
> >>>>>     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
> >>>>>
> >>>>> Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so that
> >>>>> the user interface would be standard Spark:
> >>>>>
> >>>>> Dataset<Row> dataFrame = ...
> >>>>> Dataset<MyValClass> dataSet =
> >>>>>     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
> >>>>>
> >>>>> I'll create a ticket and maybe put together a test case for further
> >>>>> discussion?
> >>>>>
> >>>>> Stuart.
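A Scala rendering of what option 2's user-facing code might look like. The _val column is the proposed feature, not existing API, and the sketch assumes the hypothetical Ignite-to-Kryo UDF has already produced Kryo-serialised bytes; one practical wrinkle is that Spark's Kryo encoder binds to a single binary column named "value", so a rename is likely needed:

    import org.apache.spark.sql.{DataFrame, Dataset, Encoders}
    import org.apache.spark.sql.functions.col

    // Hypothetical value class stored in the Ignite cache.
    case class MyValClass(id: Long, name: String)

    object KryoValColumnSketch {
      // `dataFrame` is assumed to expose the proposed _val column holding
      // Kryo-serialised bytes.
      def valDataset(dataFrame: DataFrame): Dataset[MyValClass] =
        dataFrame
          .select(col("_val").alias("value")) // Kryo encoder expects "value"
          .as(Encoders.kryo(classOf[MyValClass]))
    }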
> >>>>> On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:
> >>>>>
> >>>>> Hello, Stuart.
> >>>>>
> >>>>> I like your idea.
> >>>>>
> >>>>>> 1. Ignite BinaryObjects, in which case we'd need to supply a Spark
> >>>>>> Encoder implementation for BinaryObjects
> >>>>>> 2. Kryo-serialised versions of the objects.
> >>>>>
> >>>>> It seems like the first option is a simple adapter. Am I right?
> >>>>> If so, I think it's a more efficient way compared with transforming
> >>>>> each object to some other (Kryo) format.
> >>>>>
> >>>>> Can you provide some additional links for both options?
> >>>>> Where can I find the API and/or examples?
> >>>>>
> >>>>> As a second step, we can apply the same approach to regular key-value
> >>>>> caches.
> >>>>>
> >>>>> Feel free to create a ticket.
> >>>>>
> >>>>> On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
> >>>>>
> >>>>> Ignite Dev Community,
> >>>>>
> >>>>> Within Ignite-supplied Spark DataFrames, I'd like to propose adding
> >>>>> support for _key and _val columns which represent the cache key and
> >>>>> value objects, similar to the current _key/_val column semantics in
> >>>>> Ignite SQL.
> >>>>>
> >>>>> If the cache key or value objects are standard SQL types (e.g. String,
> >>>>> Int, etc.) they will be represented as such in the DataFrame schema;
> >>>>> otherwise they are represented as Binary types encoded as either:
> >>>>> 1. Ignite BinaryObjects, in which case we'd need to supply a Spark
> >>>>> Encoder implementation for BinaryObjects, or 2. Kryo-serialised
> >>>>> versions of the objects. Option 1 would probably be more efficient,
> >>>>> but option 2 would be more idiomatic Spark.
> >>>>>
> >>>>> This feature would be controlled with an optional parameter in the
> >>>>> Ignite data source, defaulting to the current implementation, which
> >>>>> doesn't supply _key or _val columns. The rationale behind this is the
> >>>>> same as for the Ignite SQL _key and _val columns: to allow access to
> >>>>> the full cache objects from a SQL context.
> >>>>>
> >>>>> Can I ask for feedback on this proposal, please?
> >>>>>
> >>>>> I'd be happy to contribute this feature if we agree on the concept.
> >>>>>
> >>>>> Stuart.
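Finally, a sketch of how the optional data source parameter proposed above might look to a user. FORMAT_IGNITE, OPTION_CONFIG_FILE and OPTION_TABLE are the existing constants from IgniteDataFrameSettings; the key/value flag name and the "person" table are purely hypothetical, and by default the schema would stay exactly as it is today:

    import org.apache.ignite.spark.IgniteDataFrameSettings._
    import org.apache.spark.sql.SparkSession

    object KeyValOptionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("kv-option").getOrCreate()

        val df = spark.read
          .format(FORMAT_IGNITE)
          .option(OPTION_CONFIG_FILE, "ignite-config.xml")
          .option(OPTION_TABLE, "person")
          .option("keyValColumns", "true") // hypothetical flag from the proposal
          .load()

        // With the flag enabled, the schema would carry _key and _val
        // alongside the relational fields, so pushed-down Spark SQL filters
        // and the raw cache objects could be used together.
        df.printSchema()
        spark.stop()
      }
    }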