Thanks Nikolay. For both options, if the cache object isn’t a simple type,
we’d probably do something like this in our Ignite SQL statement:

select cast(_key as binary), cast(_val as binary), ...

That would give us each BinaryObject’s byte[]. Then for option 1 we keep
the Ignite format and introduce a new Spark Encoder for Ignite binary types
(
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
so that the end-user interface would be something like:

IgniteSparkSession session = ...
Dataset<Row> dataFrame = ...
Dataset<MyValClass> valDataSet =
    dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class));

Or for option 2, we would have a behind-the-scenes Ignite-to-Kryo UDF, so
that the user interface would be standard Spark:

Dataset<Row> dataFrame = ...
Dataset<MyValClass> dataSet =
    dataFrame.select("_val").as(Encoders.kryo(MyValClass.class));
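To make what that hidden UDF does concrete: it would take each row’s
Ignite-binary byte[], decode it to the value object, and re-encode it in
Kryo format so Encoders.kryo can deserialise it. Here is a minimal
plain-Java sketch of that byte[]-to-byte[] step, using JDK serialisation
as a stand-in for both the Ignite binary and Kryo formats (the real
implementation would of course use the BinaryObject and Kryo APIs, and
the class/method names here are purely illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class ReencodeSketch {

    // Stand-in for the proposed Ignite-to-Kryo UDF: decode the incoming
    // byte[] to an object, then encode it in the target format. Both
    // "formats" are JDK serialisation here, purely for illustration.
    static byte[] reencode(byte[] sourceBytes)
            throws IOException, ClassNotFoundException {
        Object val;
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(sourceBytes))) {
            val = in.readObject(); // decode from the "source" format
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(val);  // re-encode in the "target" format
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Simulate a cache value that arrived as cast(_val as binary).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject("my cache value");
        }
        byte[] recoded = reencode(bos.toByteArray());
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(recoded))) {
            System.out.println(in.readObject()); // prints: my cache value
        }
    }
}
```

The point of the sketch is just that the per-row cost is a full
decode/re-encode cycle, which is why option 1 (keeping the Ignite format
and decoding it directly in the Encoder) would likely be more efficient.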

I’ll create a ticket and maybe put together a test case for further
discussion?

Stuart.

On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:

Hello, Stuart.

I like your idea.

1. Ignite BinaryObjects, in which case we’d need to supply a Spark Encoder
implementation for BinaryObjects

2. Kryo-serialised versions of the objects.


It seems like the first option is a simple adapter. Am I right?
If so, I think it's more efficient than transforming each object into
some other (Kryo) format.

Can you provide some additional links for both options?
Where can I find the API and/or examples?

As a second step, we can apply the same approach to regular key-value
caches.

Feel free to create a ticket.

В Пт, 27/07/2018 в 09:37 +0100, Stuart Macdonald пишет:

Ignite Dev Community,

Within Ignite-supplied Spark DataFrames, I’d like to propose adding support
for _key and _val columns which represent the cache key and value objects,
similar to the current _key/_val column semantics in Ignite SQL.

If the cache key or value objects are standard SQL types (eg. String, Int,
etc) they will be represented as such in the DataFrame schema, otherwise
they are represented as Binary types encoded as either: 1. Ignite
BinaryObjects, in which case we’d need to supply a Spark Encoder
implementation for BinaryObjects, or 2. Kryo-serialised versions of the
objects. Option 1 would probably be more efficient but option 2 would be
more idiomatic Spark.

This feature would be controlled with an optional parameter in the Ignite
data source, defaulting to the current implementation which doesn’t supply
_key or _val columns. The rationale behind this is the same as the Ignite
SQL _key and _val columns: to allow access to the full cache objects from a
SQL context.

Can I ask for feedback on this proposal please?

I’d be happy to contribute this feature if we agree on the concept.

Stuart.
