Re: Spark DataFrames With Cache Key and Value Objects

Stuart Macdonald Wed, 01 Aug 2018 00:05:58 -0700

I believe the suggested approach will not work with the Spark SQL
relational optimisations which perform predicate pushdown from Spark
to Ignite. For that to work we need both the key/val and the
relational fields in a dataframe schema.
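
To make the point concrete, here is a rough toy model in plain Java. It is only a sketch: `Person`, `Row`, `Source` and `scan` are hypothetical names invented for illustration and are not Ignite or Spark APIs. The filter is written against a relational field, is evaluated inside the store stand-in (the "pushdown"), and each returned row still carries the key and value objects because they are part of the same row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model only: none of these classes are Ignite or Spark APIs.
class PushdownSketch {
    // Hypothetical cache value object.
    static final class Person {
        final String name;
        final int salary;

        Person(String name, int salary) {
            this.name = name;
            this.salary = salary;
        }
    }

    // One row: relational fields (name, salary) plus the key and value objects.
    static final class Row {
        final String name;
        final int salary;
        final Long key;
        final Person val;

        Row(Long key, Person val) {
            this.name = val.name;
            this.salary = val.salary;
            this.key = key;
            this.val = val;
        }
    }

    // Stand-in for the store: the predicate runs here, before rows are shipped.
    static final class Source {
        private final Map<Long, Person> cache;
        int rowsShipped = 0;

        Source(Map<Long, Person> cache) {
            this.cache = cache;
        }

        List<Row> scan(Predicate<Row> pushedFilter) {
            List<Row> out = new ArrayList<>();
            for (Map.Entry<Long, Person> e : cache.entrySet()) {
                Row row = new Row(e.getKey(), e.getValue());
                if (pushedFilter.test(row)) {
                    out.add(row);
                    rowsShipped++;
                }
            }
            return out;
        }
    }

    static Map<Long, Person> sampleCache() {
        Map<Long, Person> cache = new LinkedHashMap<>();
        cache.put(1L, new Person("Ann", 1500));
        cache.put(2L, new Person("Bob", 900));
        cache.put(3L, new Person("Cid", 700));
        return cache;
    }

    public static void main(String[] args) {
        Source source = new Source(sampleCache());
        // The filter refers to a relational field; the result still exposes key and val.
        List<Row> rows = source.scan(r -> r.salary > 1000);
        for (Row r : rows) {
            System.out.println(r.key + " -> " + r.val.name);
        }
        System.out.println("rows shipped: " + source.rowsShipped);
    }
}
```

If the key/val objects were fetched through a separate mechanism, the filter above would have nothing to be pushed into, which is the limitation I am describing.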


Stuart.

> On 1 Aug 2018, at 04:23, Valentin Kulichenko <valentin.kuliche...@gmail.com> 
> wrote:
>
> I don't think there are exact plans to remove _key and _value fields as
> it's pretty hard considering the fact that many users use them and that
> they are deeply integrated into the product. However, we already had
> multiple usability and other issues due to their existence, and while
> fixing them we are gradually removing _key/_val from the public API. Hard to tell
> when we will be able to completely deprecate/remove these fields, but we
> definitely should avoid building new features based on them.
>
> On top of that, I also don't like this approach because it doesn't seem to
> be Spark-friendly to me. That's not how they typically create typed
> datasets (I already provided a documentation link [1] with examples
> earlier).
>
> From an API standpoint, I think we should do the following:
> 1. Add 'IgniteSparkSession#createDataset(IgniteCache[K, V] cache):
> Dataset[(K, V)]' method that would create a dataset based on a cache.
> 2. (Scala only) Introduce 'IgniteCache.toDS()' that would do the same, but
> via implicit conversions instead of SparkSession extension.
>
> At the implementation level, we can use the SqlQuery API (not SqlFieldsQuery) that
> is specifically designed to return key-value pairs instead of specific
> fields, while still providing all SQL capabilities.
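
The difference in result shape between the two query styles can be sketched in plain Java. This is a toy model under stated assumptions: a `Map` stands in for the cache, and `queryEntries` / `queryFields` are hypothetical helpers that only mimic what an entry-returning query (SqlQuery style) and a fields query (SqlFieldsQuery style) hand back. Nothing here calls Ignite.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model only: a Map stands in for a cache; no Ignite classes are used.
class QueryShapes {
    // Hypothetical cache value object.
    static final class Person {
        final String name;
        final int salary;

        Person(String name, int salary) {
            this.name = name;
            this.salary = salary;
        }
    }

    // Entry-style result: each hit is a (key, value) pair, so the objects stay intact.
    static List<Map.Entry<Long, Person>> queryEntries(Map<Long, Person> cache,
                                                      Predicate<Person> where) {
        List<Map.Entry<Long, Person>> out = new ArrayList<>();
        for (Map.Entry<Long, Person> e : cache.entrySet()) {
            if (where.test(e.getValue())) {
                out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        return out;
    }

    // Fields-style result: each hit is a list of column values; the objects are not returned.
    static List<List<Object>> queryFields(Map<Long, Person> cache,
                                          Predicate<Person> where) {
        List<List<Object>> out = new ArrayList<>();
        for (Person p : cache.values()) {
            if (where.test(p)) {
                out.add(Arrays.<Object>asList(p.name, p.salary));
            }
        }
        return out;
    }

    static Map<Long, Person> sampleCache() {
        Map<Long, Person> cache = new LinkedHashMap<>();
        cache.put(1L, new Person("Ann", 1500));
        cache.put(2L, new Person("Bob", 900));
        return cache;
    }

    public static void main(String[] args) {
        Map<Long, Person> cache = sampleCache();
        for (Map.Entry<Long, Person> e : queryEntries(cache, p -> p.salary > 1000)) {
            System.out.println("entry: " + e.getKey() + " -> " + e.getValue().name);
        }
        for (List<Object> row : queryFields(cache, p -> p.salary > 1000)) {
            System.out.println("fields: " + row);
        }
    }
}
```

A dataset of (K, V) pairs maps naturally onto the entry-style shape, which is presumably why the entry-returning query is suggested here.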
>
> *Nikolay*, does this make sense to you? Is it feasible and how hard would
> it be to implement? How much of the existing code can we reuse (I believe
> it should be the majority of it)?
>
> [1]
> https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
>
> -Val
>
>> On Tue, Jul 31, 2018 at 2:03 PM Denis Magda <dma...@apache.org> wrote:
>>
>> Hello folks,
>>
>> The documentation contains only a small reference to _key and _val usage,
>> and only for the Ignite SQL APIs (Java, .NET, C++). I tried to clean up all the
>> documentation code snippets.
>>
>> As for the GitHub examples, they require a major overhaul. Instead of _key
>> and _val usage, we need to use SQL fields. Hopefully, someone will groom
>> the examples.
>>
>> Considering this, I wouldn't suggest exposing _key and _val in other
>> places like Spark. Are there any alternatives to this approach?
>>
>> --
>> Denis
>>
>>
>>
>> On Tue, Jul 31, 2018 at 2:49 AM Nikolay Izhikov <nizhi...@apache.org>
>> wrote:
>>
>>> Hello, Igniters.
>>>
>>> Valentin,
>>>
>>>> We never recommend to use these fields
>>>
>>> Actually, we did:
>>>
>>>        * Documentation [1]. Please, see "Predefined Fields" section.
>>>        * Java Example [2]
>>>        * DotNet Example [3]
>>>        * Scala Example [4]
>>>
>>>> ...hopefully will be removed altogether one day
>>>
>>> This is news to me.
>>>
>>> Do we have specific plans for it?
>>>
>>> [1] https://apacheignite-sql.readme.io/docs/schema-and-indexes
>>> [2]
>>> https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/sql/SqlDmlExample.java#L88
>>> [3]
>>> https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/examples/Apache.Ignite.Examples/Sql/SqlDmlExample.cs#L91
>>> [4]
>>> https://github.com/apache/ignite/blob/master/examples/src/main/scala/org/apache/ignite/scalar/examples/ScalarCachePopularNumbersExample.scala#L124
>>>
>>> On Fri, 27/07/2018 at 15:22 -0700, Valentin Kulichenko wrote:
>>>> Stuart,
>>>>
>>>> The _key and _val fields are quite a dirty hack that was added years ago
>>>> and is virtually never used now. We never recommend to use these fields and I
>>>> would definitely avoid building new features based on them.
>>>>
>>>> Having said that, I'm not disputing the use case, but we need a better
>>>> implementation approach here. I suggest we think it over and come back to
>>>> this next week :) I'm sure Nikolay will also chime in and share his
>>>> thoughts.
>>>>
>>>> -Val
>>>>
>>>> On Fri, Jul 27, 2018 at 12:39 PM Stuart Macdonald <stu...@stuwee.org>
>>>> wrote:
>>>>
>>>>> If your predicates and joins are expressed in Spark SQL, you cannot
>>>>> currently optimise those and also gain access to the key/val objects. If
>>>>> you went without the Ignite Spark SQL optimisations and expressed your
>>>>> query in Ignite SQL, you still need to use the _key/_val columns. The
>>>>> Ignite documentation has this specific example using the _val column
>>>>> (right at the end):
>>>>> https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd
>>>>>
>>>>> Stuart.
>>>>>
>>>>> On 27 Jul 2018, at 20:05, Valentin Kulichenko <
>>>>> valentin.kuliche...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Well, the second approach would use the optimizations, no?
>>>>>
>>>>> -Val
>>>>>
>>>>>
>>>>> On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald <stu...@stuwee.org>
>>>>> wrote:
>>>>>
>>>>> Val,
>>>>>
>>>>> Yes, you can already get access to the cache objects as an RDD or
>>>>> Dataset, but you can’t use the Ignite-optimised DataFrames with these
>>>>> mechanisms. Optimised DataFrames have to be passed through Spark SQL’s
>>>>> Catalyst engine to allow for predicate pushdown to Ignite. So the
>>>>> use case we’re talking about here is when we want to be able to push
>>>>> Spark filters/joins to Ignite to optimise, but still have access to
>>>>> the underlying cache objects, which is not possible currently.
>>>>>
>>>>> Can you elaborate on the reason _key and _val columns in Ignite SQL
>>>>> will be removed?
>>>>>
>>>>> Stuart.
>>>>>
>>>>>
>>>>> On 27 Jul 2018, at 19:39, Valentin Kulichenko <
>>>>> valentin.kuliche...@gmail.com> wrote:
>>>>>
>>>>> Stuart, Nikolay,
>>>>>
>>>>> I really don't like the idea of exposing '_key' and '_val' fields. This is
>>>>> legacy stuff that hopefully will be removed altogether one day. Let's not
>>>>> use it in new features.
>>>>>
>>>>> Actually, I don't think it's even needed. Spark docs [1] suggest two
>>>>> ways of creating a typed dataset:
>>>>> 1. Based on RDD. This should be supported using IgniteRDD I believe.
>>>>> 2. Based on DataFrame providing a class. This would just work out of the
>>>>> box I guess.
>>>>>
>>>>> Of course, this needs to be tested and verified, and there might be certain
>>>>> pieces missing to fully support the use case. But generally I like these
>>>>> approaches much more.
>>>>>
>>>>> [1]
>>>>> https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#creating-datasets
>>>>>
>>>>> -Val
>>>>>
>>>>>
>>>>> On Fri, Jul 27, 2018 at 6:31 AM Stuart Macdonald <stu...@stuwee.org>
>>>>> wrote:
>>>>>
>>>>> Here’s the ticket:
>>>>>
>>>>> https://issues.apache.org/jira/browse/IGNITE-9108
>>>>>
>>>>> Stuart.
>>>>>
>>>>> On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
>>>>>
>>>>> Sure.
>>>>>
>>>>> Please send the ticket number in this thread.
>>>>>
>>>>> On Fri, 27 Jul 2018 at 16:16, Stuart Macdonald <stu...@stuwee.org> wrote:
>>>>>
>>>>> Thanks Nikolay. For both options, if the cache object isn’t a simple type,
>>>>> we’d probably do something like this in our Ignite SQL statement:
>>>>>
>>>>> select cast(_key as binary), cast(_val as binary), ...
>>>>>
>>>>> Which would give us the BinaryObject’s byte[]. Then for option 1 we keep
>>>>> the Ignite format and introduce a new Spark Encoder for Ignite binary types
>>>>> (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Encoder.html),
>>>>> so that the end user interface would be something like:
>>>>>
>>>>> IgniteSparkSession session = ...
>>>>> Dataset<Row> dataFrame = ...
>>>>> Dataset<MyValClass> valDataSet =
>>>>>     dataFrame.select("_val").as(session.binaryObjectEncoder(MyValClass.class))
>>>>>
>>>>> Or for option 2 we have a behind-the-scenes Ignite-to-Kryo UDF so that the
>>>>> user interface would be standard Spark:
>>>>>
>>>>> Dataset<Row> dataFrame = ...
>>>>> Dataset<MyValClass> dataSet =
>>>>>     dataFrame.select("_val").as(Encoders.kryo(MyValClass.class))
>>>>>
>>>>> I’ll create a ticket and maybe put together a test case for further
>>>>> discussion?
>>>>>
>>>>> Stuart.
>>>>>
>>>>>
>>>>> On 27 Jul 2018, at 09:50, Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>>>
>>>>> Hello, Stuart.
>>>>>
>>>>> I like your idea.
>>>>>
>>>>> 1. Ignite BinaryObjects, in which case we’d need to supply a Spark Encoder
>>>>> implementation for BinaryObjects
>>>>>
>>>>> 2. Kryo-serialised versions of the objects.
>>>>>
>>>>> It seems the first option is a simple adapter. Am I right?
>>>>> If yes, I think it's more efficient than transforming
>>>>> each object to some other (Kryo) format.
>>>>>
>>>>> Can you provide some additional links for both options?
>>>>> Where can I find the API and/or examples?
>>>>>
>>>>> As a second step, we can apply the same approach to regular key-value
>>>>> caches.
>>>>>
>>>>> Feel free to create a ticket.
>>>>>
>>>>> On Fri, 27/07/2018 at 09:37 +0100, Stuart Macdonald wrote:
>>>>>
>>>>>
>>>>> Ignite Dev Community,
>>>>>
>>>>> Within Ignite-supplied Spark DataFrames, I’d like to propose adding support
>>>>> for _key and _val columns which represent the cache key and value objects
>>>>> similar to the current _key/_val column semantics in Ignite SQL.
>>>>>
>>>>> If the cache key or value objects are standard SQL types (e.g. String, Int,
>>>>> etc.) they will be represented as such in the DataFrame schema, otherwise
>>>>> they are represented as Binary types encoded as either: 1. Ignite
>>>>> BinaryObjects, in which case we’d need to supply a Spark Encoder
>>>>> implementation for BinaryObjects, or 2. Kryo-serialised versions of the
>>>>> objects. Option 1 would probably be more efficient but option 2 would be
>>>>> more idiomatic Spark.
>>>>>
>>>>> This feature would be controlled with an optional parameter in the Ignite
>>>>> data source, defaulting to the current implementation which doesn’t supply
>>>>> _key or _val columns. The rationale behind this is the same as the Ignite
>>>>> SQL _key and _val columns: to allow access to the full cache objects from a
>>>>> SQL context.
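
The schema rule and the opt-in switch described above can be sketched in a few lines of plain Java. This is a minimal sketch under assumptions: the type labels, the set of classes treated as "standard SQL types" (the proposal only names String and Int), and the `includeKeyVal` flag name are all hypothetical, and nothing here touches the real Ignite data source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: type labels and the option name are illustrative assumptions.
class KeyValColumnTypes {
    // Assumed set of "standard SQL types"; the proposal only names String and Int.
    private static final Map<Class<?>, String> SQL_TYPES = new LinkedHashMap<>();

    static {
        SQL_TYPES.put(String.class, "STRING");
        SQL_TYPES.put(Integer.class, "INTEGER");
        SQL_TYPES.put(Long.class, "LONG");
        SQL_TYPES.put(Double.class, "DOUBLE");
        SQL_TYPES.put(Boolean.class, "BOOLEAN");
    }

    // Standard SQL types keep their own column type; anything else becomes BINARY.
    static String columnType(Class<?> cls) {
        return SQL_TYPES.getOrDefault(cls, "BINARY");
    }

    // includeKeyVal is the hypothetical optional parameter; false keeps today's schema.
    static Map<String, String> schema(Map<String, String> relationalFields,
                                      Class<?> keyClass,
                                      Class<?> valClass,
                                      boolean includeKeyVal) {
        Map<String, String> result = new LinkedHashMap<>(relationalFields);
        if (includeKeyVal) {
            result.put("_key", columnType(keyClass));
            result.put("_val", columnType(valClass));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "STRING");
        fields.put("salary", "INTEGER");
        // Object.class stands in for an arbitrary user-defined value class.
        System.out.println(schema(fields, Long.class, Object.class, false));
        System.out.println(schema(fields, Long.class, Object.class, true));
    }
}
```

With the flag off the schema is unchanged, so existing users of the data source would see no difference.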
>>>>>
>>>>> Can I ask for feedback on this proposal please?
>>>>>
>>>>> I’d be happy to contribute this feature if we agree on the concept.
>>>>>
>>>>> Stuart.
>>
