Stuart,

The _key and _val fields are quite a dirty hack that was added years ago and
are virtually never used now. We never recommend using these fields, and I
would definitely avoid building new features on top of them.

Having said that, I'm not disputing the use case, but we need a better
implementation approach here. I suggest we think it over and come back to
this next week :) I'm sure Nikolay will also chime in and share his
thoughts.

-Val

On Fri, Jul 27, 2018 at 12:39 PM Stuart Macdonald <stu...@stuwee.org> wrote:

> If your predicates and joins are expressed in Spark SQL, you cannot
> currently optimise those and also gain access to the key/val objects. If
> you went without the Ignite Spark SQL optimisations and expressed your
> query in Ignite SQL, you would still need to use the _key/_val columns. The
> Ignite documentation has this specific example using the _val column (right
> at the end):
> https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd
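>
> The gist of that docs example is along these lines (paraphrased from
> memory, so the exact cache name and method signatures may differ):
>
> // Expose an Ignite cache as a Spark RDD, then query it with Ignite SQL.
> JavaIgniteContext<Long, Person> ic =
>     new JavaIgniteContext<>(sparkContext, "ignite-config.xml");
> JavaIgniteRDD<Long, Person> rdd = ic.fromCache("person");
> // _val selects the whole Person object rather than individual fields.
> Dataset<Row> result = rdd.sql("select _val from Person where salary > 1000");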
>
> Stuart.
>
> On 27 Jul 2018, at 20:05, Valentin Kulichenko
> <valentin.kuliche...@gmail.com> wrote:
>
> Well, the second approach would use the optimizations, no?
>
> -Val
>
> On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald <stu...@stuwee.org>
> wrote:
>
> Val,
>
> Yes you can already get access to the cache objects as an RDD or
> Dataset, but you can’t use the Ignite-optimised DataFrames with these
> mechanisms. Optimised DataFrames have to be passed through Spark SQL’s
> Catalyst engine to allow for predicate pushdown to Ignite. So the
> use case we’re talking about here is when we want to be able to push
> Spark filters/joins to Ignite to optimise, but still have access to
> the underlying cache objects, which is not possible currently.
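>
> To illustrate the gap (hypothetical sketch, assuming igniteDataFrame was
> loaded via the optimised Ignite data source):
>
> // This filter is pushed down to Ignite by the Catalyst optimiser...
> Dataset<Row> people = igniteDataFrame.filter(col("salary").gt(1000));
> // ...but the result only exposes the SQL-visible columns; there is no
> // column from which to recover the original cache value object.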
>
> Can you elaborate on the reason _key and _val columns in Ignite SQL
> will be removed?
>
> Stuart.
>
> On 27 Jul 2018, at 19:39, Valentin Kulichenko
> <valentin.kuliche...@gmail.com> wrote:
>
> Stuart, Nikolay,
>
> I really don't like the idea of exposing the '_key' and '_val' fields. This
> is legacy stuff that hopefully will be removed altogether one day. Let's not
> use it in new features.
>
> Actually, I don't think it's even needed. The Spark docs [1] suggest two
> ways of creating a typed dataset:
> 1. Based on an RDD. This should be supported using IgniteRDD, I believe.
> 2. Based on a DataFrame, providing a class. This would just work out of the
> box, I guess.
>
> Of course, this needs to be tested and verified, and there might be certain
> pieces missing to fully support the use case. But generally I like these
> approaches much more.
>
> [1]
> https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
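>
> For illustration, the second approach might look roughly like this
> (untested sketch; MyValClass is a hypothetical bean matching the cache
> value, and Encoders.bean() needs the DataFrame column names to line up
> with the bean's fields):
>
> Dataset<Row> df = spark.read()
>     .format("ignite")                    // Ignite data source
>     .option("config", igniteConfigPath)  // path to Ignite config XML
>     .option("table", "person")           // hypothetical table name
>     .load();
>
> // Turn the untyped DataFrame into a typed Dataset via a bean encoder.
> Dataset<MyValClass> ds = df.as(Encoders.bean(MyValClass.class));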
>
> -Val
>
> On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <stu...@stuwee.org> wrote:
>
> Here’s the ticket:
>
> https://issues.apache.org/jira/browse/IGNITE-9108
>
> Stuart.
>
> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
>
> Sure.
>
> Please send the ticket number in this thread.
>
> Fri, 27 Jul 2018, 16:16 Stuart Macdonald <stu...@stuwee.org>:
>
> Thanks Nikolay. For both options, if the cache object isn’t a simple type,
> we’d probably do something like this in our Ignite SQL statement:
>
> select cast(_key as binary), cast(_val as binary), ...
>
> which would give us the BinaryObject’s byte[]. Then for option 1 we keep
> the Ignite format and introduce a new Spark Encoder for Ignite binary types
> (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
> so that the end user interface would be something like:
>
> IgniteSparkSession session = ...
> Dataset<Row> dataFrame = ...
> Dataset<MyValClass> valDataSet =
>     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class));
>
> Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF, so that the
> user interface would be standard Spark:
>
> Dataset<Row> dataFrame = ...
> Dataset<MyValClass> dataSet =
>     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class));
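>
> For discussion, a rough sketch of what that behind-the-scenes UDF might
> look like (purely illustrative: deserializeIgniteBinary() is a placeholder
> since I’m not aware of a public API for rehydrating raw Ignite binary
> bytes, and whether plain kryo.writeClassAndObject() matches the byte
> layout Encoders.kryo() expects would need verifying):
>
> spark.udf().register("igniteToKryo", (UDF1<byte[], byte[]>) igniteBytes -> {
>     // Placeholder: turn Ignite binary bytes back into the user object.
>     Object obj = deserializeIgniteBinary(igniteBytes);
>     // Re-serialise with Kryo in the format Encoders.kryo() should expect.
>     Kryo kryo = new Kryo();
>     Output out = new Output(4096, -1); // grows as needed
>     kryo.writeClassAndObject(out, obj);
>     return out.toBytes();
> }, DataTypes.BinaryType);
>
> Dataset<MyValClass> dataSet = dataFrame
>     .selectExpr("igniteToKryo(_val) as value")
>     .as(Encoders.kryo(MyValClass.class));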
>
> I’ll create a ticket and maybe put together a test case for further
> discussion?
>
> Stuart.
>
> On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:
>
> Hello, Stuart.
>
> I like your idea.
>
> 1. Ignite BinaryObjects, in which case we’d need to supply a Spark Encoder
> implementation for BinaryObjects
> 2. Kryo-serialised versions of the objects.
>
> Seems like the first option is a simple adapter. Am I right?
> If yes, I think it's a more efficient way compared with transforming each
> object to some other (Kryo) format.
>
> Can you provide some additional links for both options?
> Where can I find the API and/or examples?
>
> As a second step, we can apply the same approach to regular key-value
> caches.
>
> Feel free to create a ticket.
>
> On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
>
> Ignite Dev Community,
>
> Within Ignite-supplied Spark DataFrames, I’d like to propose adding support
> for _key and _val columns which represent the cache key and value objects,
> similar to the current _key/_val column semantics in Ignite SQL.
>
> If the cache key or value objects are standard SQL types (eg. String, Int,
> etc) they will be represented as such in the DataFrame schema; otherwise
> they are represented as Binary types encoded as either: 1. Ignite
> BinaryObjects, in which case we’d need to supply a Spark Encoder
> implementation for BinaryObjects, or 2. Kryo-serialised versions of the
> objects. Option 1 would probably be more efficient but option 2 would be
> more idiomatic Spark.
>
> This feature would be controlled with an optional parameter in the Ignite
> data source, defaulting to the current implementation which doesn’t supply
> _key or _val columns. The rationale behind this is the same as for the
> Ignite SQL _key and _val columns: to allow access to the full cache objects
> from a SQL context.
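>
> To make the shape of this concrete, usage might look something like the
> following (the new option name is hypothetical and would be settled during
> review; "ignite", "config" and "table" are the existing data source format
> and option names as I understand them):
>
> Dataset<Row> df = spark.read()
>     .format("ignite")
>     .option("config", igniteConfigPath)
>     .option("table", "person")
>     .option("keepKeyValColumns", "true") // hypothetical new parameter
>     .load();
>
> // The schema would now include _key and _val alongside the SQL-visible
> // fields; simple types map directly, complex types surface as Binary.
> df.printSchema();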
>
> Can I ask for feedback on this proposal, please?
>
> I’d be happy to contribute this feature if we agree on the concept.
>
> Stuart.
>
