Re: Spark DataFrames With Cache Key and Value Objects

2018-08-01 Thread Stuart Macdonald
What’s perhaps not clear is that the PrunedFilteredScan trait, which needs to be implemented to allow for predicate pushdown, does need to return an RDD: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/sources/PrunedFilteredScan.html IgniteSQLRelation implements this to perform Ignite…
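The contract being discussed can be sketched with a toy, Spark-free model. `ToyRelation` and the simplified `GreaterThan` filter below are illustrative stand-ins, not Spark classes; the real trait's method is `buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]`.

```java
import java.util.*;
import java.util.stream.*;

// Toy model of the PrunedFilteredScan idea: the engine hands the
// relation the columns it needs and the filters it can push down, and
// the relation returns only matching rows with only those columns.
public class ToyRelation {
    // One pushed-down filter kind: column > value (Spark also has
    // EqualTo, LessThan, and so on).
    public record GreaterThan(String column, int value) {}

    private final List<Map<String, Integer>> rows;

    public ToyRelation(List<Map<String, Integer>> rows) { this.rows = rows; }

    // Column pruning + predicate pushdown: rows failing the pushed-down
    // filters are never returned, and only requested columns come back.
    public List<Map<String, Integer>> buildScan(List<String> requiredColumns,
                                                List<GreaterThan> filters) {
        return rows.stream()
            .filter(r -> filters.stream().allMatch(f -> r.get(f.column()) > f.value()))
            .map(r -> r.entrySet().stream()
                .filter(e -> requiredColumns.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)))
            .collect(Collectors.toList());
    }
}
```

A scan asking only for "age" with "age > 30" pushed down returns a single one-column row, which is the behaviour a data source must provide for the optimisation to be worthwhile.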

Re: Spark DataFrames With Cache Key and Value Objects

2018-08-01 Thread Valentin Kulichenko
Stuart, From an API standpoint a DataFrame is just a Dataset of Rows, so they have the same set of methods. You have a valid point, but it's valid for both Dataset and DataFrame in the same way. If you provide a function that filters Rows in a DataFrame, the current integration would also not take advantage…

Re: Spark DataFrames With Cache Key and Value Objects

2018-08-01 Thread Stuart Macdonald
Val, happy to clarify my thoughts. Let’s take an example: say we have an Ignite cache of Person objects, so with Nikolay’s Ignite Spark SQL implementation you can currently obtain a DataFrame with a column called “age”, because that has been registered as a field in Ignite. Then you can do something like…
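A hypothetical sketch of the step this example relies on: compiling Spark-style filters on a column such as "age" into an Ignite SQL WHERE clause, in the spirit of what IgniteSQLRelation does. `Filter`, `Eq`, `Gt`, and `compileWhere` are invented names for illustration, not the real integration's API.

```java
import java.util.*;
import java.util.stream.*;

// Invented mini-compiler from pushed-down filters to a WHERE clause
// that Ignite can evaluate server-side, so Spark never receives the
// non-matching rows.
public class WhereCompiler {
    public interface Filter { String sql(); }

    public record Eq(String column, Object value) implements Filter {
        public String sql() { return column + " = " + value; }
    }

    public record Gt(String column, Object value) implements Filter {
        public String sql() { return column + " > " + value; }
    }

    // Join the individual filter fragments with AND.
    public static String compileWhere(List<Filter> filters) {
        return filters.stream()
            .map(Filter::sql)
            .collect(Collectors.joining(" AND "));
    }
}
```

For example, a filter on age and city compiles to `age > 18 AND city = 'London'`, which Ignite can append to the generated SELECT.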

Re: Spark DataFrames With Cache Key and Value Objects

2018-08-01 Thread Valentin Kulichenko
Stuart, I don't see a reason why it would work with DataFrames but not with Datasets; they are pretty much the same thing. If you have any particular thoughts on this, please let us know. In any case, I would like to hear from Nikolay, as he is the implementor of this functionality. Nikolay, plea…

Re: Spark DataFrames With Cache Key and Value Objects

2018-08-01 Thread Stuart Macdonald
I believe the suggested approach will not work with the Spark SQL relational optimisations which perform predicate pushdown from Spark to Ignite. For that to work we need both the key/val objects and the relational fields in a DataFrame schema. Stuart. > On 1 Aug 2018, at 04:23, Valentin Kulichenko wrote…
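The shape of a schema carrying both kinds of field can be sketched as below. This is illustrative only, not Ignite's actual schema or column naming: a row exposes the relational fields (so predicates on "age" can be pruned and pushed down) alongside the serialized key/value bytes (so callers can still recover the original cache objects).

```java
// Invented row type combining the relational fields with the
// serialized cache key/value, per the approach discussed above.
public class CacheBackedRow {
    public final byte[] keyBytes;  // serialized cache key   (the "_key" column)
    public final byte[] valBytes;  // serialized cache value (the "_val" column)
    public final int age;          // relational field registered in Ignite

    public CacheBackedRow(byte[] keyBytes, byte[] valBytes, int age) {
        this.keyBytes = keyBytes;
        this.valBytes = valBytes;
        this.age = age;
    }
}
```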

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-31 Thread Valentin Kulichenko
I don't think there are any concrete plans to remove the _key and _val fields, as that's pretty hard considering the fact that many users use them and that they are deeply integrated into the product. However, we have already had multiple usability and other issues due to their existence, and while fixing them we g…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-31 Thread Denis Magda
Hello folks, the documentation includes only a small reference to _key and _val usage, and only for the Ignite SQL APIs (Java, .NET, C++). I have tried to clean up all of the documentation code snippets. As for the GitHub examples, they require a major overhaul: instead of _key and _val usage, we need to use S…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-31 Thread Nikolay Izhikov
Hello, Igniters. Valentin,
> We never recommend to use these fields
Actually, we did:
* Documentation [1]. Please see the "Predefined Fields" section.
* Java Example [2]
* DotNet Example [3]
* Scala Example [4]
> ...hopefully will be removed altogether one day
Th…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Valentin Kulichenko
Stuart, the _key and _val fields are quite a dirty hack that was added years ago and is virtually never used now. We never recommend using these fields, and I would definitely avoid building new features on top of them. Having said that, I'm not arguing against the use case, but we need a better implementation a…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Stuart Macdonald
If your predicates and joins are expressed in Spark SQL, you cannot currently optimise them and also gain access to the key/val objects. If you went without the Ignite Spark SQL optimisations and expressed your query in Ignite SQL, you would still need to use the _key/_val columns. The Ignite documentat…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Valentin Kulichenko
Well, the second approach would use the optimizations, no? -Val

On Fri, Jul 27, 2018 at 11:49 AM Stuart Macdonald wrote:
> Val,
> Yes, you can already get access to the cache objects as an RDD or Dataset, but you can’t use the Ignite-optimised DataFrames with these mechanisms. Optimised Da…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Stuart Macdonald
Val, yes, you can already get access to the cache objects as an RDD or Dataset, but you can’t use the Ignite-optimised DataFrames with those mechanisms. Optimised DataFrames have to be passed through Spark SQL’s Catalyst engine to allow for predicate pushdown to Ignite. So the use case we’re talking…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Valentin Kulichenko
Stuart, Nikolay, I really don't like the idea of exposing the '_key' and '_val' fields. This is legacy stuff that hopefully will be removed altogether one day; let's not use it in new features. Actually, I don't think it's even needed. The Spark docs [1] suggest two ways of creating a typed dataset:…
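The Encoder idea referenced here can be modelled with a toy, Spark-free sketch. `ToyEncoder` is an invented stand-in, not `org.apache.spark.sql.Encoder`; the point is that an encoder is a two-way mapping between a typed object and a generic row, which is the same kind of mapping a BinaryObject-backed encoder would need to provide.

```java
import java.util.*;

// Invented minimal encoder interface: typed object <-> generic row.
public interface ToyEncoder<T> {
    Map<String, Object> toRow(T value);
    T fromRow(Map<String, Object> row);

    // Example domain type, mirroring the Person cache discussed above.
    record Person(String name, int age) {}

    // Encoder for Person: field name -> field value, and back.
    ToyEncoder<Person> PERSON = new ToyEncoder<>() {
        public Map<String, Object> toRow(Person p) {
            return Map.of("name", p.name(), "age", p.age());
        }
        public Person fromRow(Map<String, Object> row) {
            return new Person((String) row.get("name"), (Integer) row.get("age"));
        }
    };
}
```

Round-tripping an object through `toRow`/`fromRow` yields an equal object, which is the property any real encoder implementation must preserve.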

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Stuart Macdonald
Here’s the ticket: https://issues.apache.org/jira/browse/IGNITE-9108 Stuart.

On Friday, 27 July 2018 at 14:19, Nikolay Izhikov wrote:
> Sure.
> Please send the ticket number in this thread.
> Fri, 27 Jul 2018, 16:16 Stuart Macdonald (mailto:stu...@stuwee.org):
> > Thanks Nikolay…

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Nikolay Izhikov
Sure. Please send the ticket number in this thread.

Fri, 27 Jul 2018, 16:16 Stuart Macdonald:
> Thanks Nikolay. For both options, if the cache object isn’t a simple type, we’d probably do something like this in our Ignite SQL statement:
>
> select cast(_key as binary), cast(_val as binary), ...

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Stuart Macdonald
Thanks Nikolay. For both options, if the cache object isn’t a simple type, we’d probably do something like this in our Ignite SQL statement:

select cast(_key as binary), cast(_val as binary), ...

which would give us the BinaryObject’s byte[]; then for option 1 we keep the Ignite format and introdu…
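The byte[] round trip can be illustrated with a small sketch. Plain Java serialization stands in here for Kryo/BinaryObject, and in the real design the bytes would come from the `cast(_key as binary)` / `cast(_val as binary)` columns rather than from `toBytes` below; `ByteRoundTrip` and `PersonKey` are invented names.

```java
import java.io.*;

// Illustrative serialize/deserialize pair for the byte[] columns idea.
public class ByteRoundTrip {
    // Example cache key type; records are serializable when their
    // components are.
    public record PersonKey(long id) implements Serializable {}

    public static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T> T fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Whatever serialization format is chosen (Ignite binary or Kryo), the DataFrame column holds opaque bytes and the encoder side is responsible for reconstructing the typed object.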

Re: Spark DataFrames With Cache Key and Value Objects

2018-07-27 Thread Nikolay Izhikov
Hello, Stuart. I like your idea.
> 1. Ignite BinaryObjects, in which case we’d need to supply a Spark Encoder implementation for BinaryObjects
> 2. Kryo-serialised versions of the objects.
Seems like the first option is a simple adapter. Am I right? If yes, I think it's a more efficient way compari…