Re: Spark DataFrames With Cache Key and Value Objects

Valentin Kulichenko Fri, 27 Jul 2018 11:40:03 -0700

Stuart, Nikolay,

I really don't like the idea of exposing '_key' and '_val' fields. This is
legacy stuff that hopefully will be removed altogether one day. Let's not
use it in new features.


Actually, I don't even think it's needed. Spark docs [1] suggest two
ways of creating a typed dataset:
1. Based on RDD. This should be supported using IgniteRDD I believe.
2. Based on DataFrame providing a class. This would just work out of the
box I guess.

Of course, this needs to be tested and verified, and there might be certain
pieces missing to fully support the use case. But generally I like these
approaches much more.

[1] https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets

-Val

On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <stu...@stuwee.org> wrote:

> Here’s the ticket:
>
> https://issues.apache.org/jira/browse/IGNITE-9108
>
> Stuart.
>
>
> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
>
> > Sure.
> >
> > Please send the ticket number in this thread.
> >
> > Fri, 27 Jul 2018, 16:16 Stuart Macdonald <stu...@stuwee.org>:
> >
> > > Thanks Nikolay. For both options if the cache object isn’t a simple type,
> > > we’d probably do something like this in our Ignite SQL statement:
> > >
> > > select cast(_key as binary), cast(_val as binary), ...
> > >
> > > That would give us the BinaryObject’s byte[]. For option 1, we would keep
> > > the Ignite format and introduce a new Spark Encoder for Ignite binary types
> > > (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
> > > so that the end user interface would be something like:
> > >
> > > IgniteSparkSession session = ...
> > > Dataset<Row> dataFrame = ...
> > > Dataset<MyValClass> valDataSet =
> > >     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
> > >
> > > Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so that the
> > > user interface would be standard Spark:
> > >
> > > Dataset<Row> dataFrame = ...
> > > Dataset<MyValClass> dataSet =
> > >     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
> > >
> > > I’ll create a ticket and maybe put together a test case for further
> > > discussion?
> > >
> > > Stuart.
> > >
> > > On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > >
> > > Hello, Stuart.
> > >
> > > I like your idea.
> > >
> > > 1. Ignite BinaryObjects, in which case we’d need to supply a Spark Encoder
> > > implementation for BinaryObjects
> > >
> > > 2. Kryo-serialised versions of the objects.
> > >
> > >
> > > It seems the first option is a simple adapter. Am I right?
> > > If so, I think it's more efficient than transforming each object
> > > to some other (Kryo) format.
> > >
> > > Can you provide some additional links for both options?
> > > Where can I find the API and/or examples?
> > >
> > > As a second step, we can apply the same approach to regular key-value
> > > caches.
> > >
> > > Feel free to create a ticket.
> > >
> > > On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
> > >
> > > Ignite Dev Community,
> > >
> > > Within Ignite-supplied Spark DataFrames, I’d like to propose adding support
> > > for _key and _val columns which represent the cache key and value objects
> > > similar to the current _key/_val column semantics in Ignite SQL.
> > >
> > > If the cache key or value objects are standard SQL types (eg. String, Int,
> > > etc) they will be represented as such in the DataFrame schema, otherwise
> > > they are represented as Binary types encoded as either: 1. Ignite
> > > BinaryObjects, in which case we’d need to supply a Spark Encoder
> > > implementation for BinaryObjects, or 2. Kryo-serialised versions of the
> > > objects. Option 1 would probably be more efficient but option 2 would be
> > > more idiomatic Spark.
> > >
> > > This feature would be controlled with an optional parameter in the Ignite
> > > data source, defaulting to the current implementation which doesn’t supply
> > > _key or _val columns. The rationale behind this is the same as the Ignite
> > > SQL _key and _val columns: to allow access to the full cache objects from a
> > > SQL context.
> > >
> > > Can I ask for feedback on this proposal please?
> > >
> > > I’d be happy to contribute this feature if we agree on the concept.
> > >
> > > Stuart.
>
>
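
For illustration only, the binary-column round trip from the original
proposal (a non-SQL value object stored as byte[], then decoded back into a
typed object on the consumer side) can be sketched in plain Java. This is a
minimal sketch under stated assumptions: it uses standard Java serialization
as a stand-in for the Ignite BinaryObject or Kryo formats, and it does not
touch Spark or Ignite at all. The class BinaryColumnSketch and its
encode/decode helpers are hypothetical names; MyValClass is borrowed from the
snippets in the thread, with made-up fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class BinaryColumnSketch {
    /** Hypothetical cache value class; the fields are invented for this sketch. */
    public static class MyValClass implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final int age;

        public MyValClass(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /** Producer side: turn a value object into the byte[] a binary column would hold. */
    public static byte[] encode(Serializable value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes everything into buf.
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Consumer side: recover a typed object from the column bytes, as an encoder would. */
    public static <T> T decode(byte[] bytes, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // Class.cast fails fast if the bytes hold a different type than requested.
            return type.cast(in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] column = encode(new MyValClass("alice", 30));
        MyValClass back = decode(column, MyValClass.class);
        System.out.println(back.name + " " + back.age);
    }
}
```

Either option in the proposal would replace the serialization step with its
own format; the shape of the consumer-side call (bytes plus target class in,
typed object out) is the part this sketch is meant to show.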
