Re: Load Spark dataframes in Arrow buffer using Scala (to be used by Gandiva)

2018-07-30 Thread Bryan Cutler
Hi Richard, take a look at this JIRA: https://issues.apache.org/jira/browse/SPARK-24579 It is geared towards exporting Spark data to DL frameworks, but it's likely to add a general method to map Spark data partitions to a function using Arrow data. In that function you should be able to apply Gandiva
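
To make the "map partitions to a function using Arrow data" idea concrete, here is a rough sketch that uses only the existing `mapPartitions` call on `df.rdd`. It is not the method proposed in SPARK-24579. The object and method names (`PartitionToArrowSketch`, `sumPartition`, `sumPerPartition`) are hypothetical, and the code assumes the first column is a non-null LongType column. The plain sum over the vector is only a placeholder for whatever Arrow-based processing (for example Gandiva) would run on each partition.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.BigIntVector
import org.apache.spark.sql.{DataFrame, Row}

object PartitionToArrowSketch {

  // Copies column 0 of one partition into an Arrow BigIntVector and
  // returns the sum of its values. Assumes column 0 is a non-null LongType.
  def sumPartition(rows: Iterator[Row]): Iterator[Long] = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vector = new BigIntVector("value", allocator)
    try {
      vector.allocateNew()
      var count = 0
      rows.foreach { row =>
        // setSafe grows the underlying buffer when needed.
        vector.setSafe(count, row.getLong(0))
        count += 1
      }
      vector.setValueCount(count)

      // Placeholder for the real Arrow-based computation.
      var sum = 0L
      var i = 0
      while (i < count) {
        sum += vector.get(i)
        i += 1
      }
      Iterator.single(sum)
    } finally {
      // Arrow memory is off-heap and must be released explicitly.
      vector.close()
      allocator.close()
    }
  }

  // Runs sumPartition on every partition and collects one result per partition.
  def sumPerPartition(df: DataFrame): Array[Long] =
    df.rdd.mapPartitions(rows => sumPartition(rows)).collect()
}
```

With this shape the data stays distributed: each executor builds its own Arrow vector, and only the per-partition results travel back to the driver.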

Re: Load Spark dataframes in Arrow buffer using Scala (to be used by Gandiva)

2018-07-25 Thread Li Jin
Another pointer to look at: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3369 The function Dataset.toArrowPayload there turns a Spark Dataset into an RDD[ArrowPayload], where ArrowPayload is basically serialized bytes in Arrow file format.

Re: Load Spark dataframes in Arrow buffer using Scala (to be used by Gandiva)

2018-07-25 Thread Richard Siebeling
Hi, @Li, same as Jieun, I'd like to start with a single machine, but I can imagine that there are use cases for a distributed approach. @Wes, thanks, I'll look into it. Richard
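
For the single-machine case, a minimal sketch could collect the rows to the driver and copy one column into an Arrow vector through the public Arrow Java API. This is not how Spark's internal conversion works. The names `DataFrameToArrowSketch` and `longColumnToArrow` are hypothetical, the column is assumed to be a non-null LongType, and the whole DataFrame is assumed to fit in driver memory.

```scala
import org.apache.arrow.memory.{BufferAllocator, RootAllocator}
import org.apache.arrow.vector.BigIntVector
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameToArrowSketch {

  // Collects one non-null LongType column to the driver and copies it
  // into an Arrow BigIntVector. The caller owns (and must close) the vector.
  def longColumnToArrow(df: DataFrame,
                        column: String,
                        allocator: BufferAllocator): BigIntVector = {
    val values: Array[Long] = df.select(column).collect().map(_.getLong(0))
    val vector = new BigIntVector(column, allocator)
    vector.allocateNew(values.length)
    var i = 0
    while (i < values.length) {
      vector.setSafe(i, values(i))
      i += 1
    }
    vector.setValueCount(values.length)
    vector
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-to-arrow-sketch")
      .getOrCreate()
    val allocator = new RootAllocator(Long.MaxValue)
    try {
      // Hypothetical example data: a single LongType column named "id".
      val df = spark.range(0, 5).toDF("id")
      val vector = longColumnToArrow(df, "id", allocator)
      try {
        println(s"Arrow vector holds ${vector.getValueCount} values")
      } finally {
        vector.close()
      }
    } finally {
      allocator.close()
      spark.stop()
    }
  }
}
```

A full version would build one vector per column according to the DataFrame schema and handle nulls. The sketch keeps to a single column so the Arrow calls stay visible.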

Re: Load Spark dataframes in Arrow buffer using Scala (to be used by Gandiva)

2018-07-24 Thread Wes McKinney
hi Richard, I might start here in the Spark codebase to see how Spark SQL tables are converted to Arrow record batches: https://github.com/apache/spark/blob/d8aaa771e249b3f54b57ce24763e53fd65a0dbf7/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala The code has be

Re: Load Spark dataframes in Arrow buffer using Scala (to be used by Gandiva)

2018-07-24 Thread Li Jin
Hi, Do you want to collect a Spark DataFrame into Arrow format on a single machine or do you still want to keep the data distributed?

Load Spark dataframes in Arrow buffer using Scala (to be used by Gandiva)

2018-07-24 Thread Richard Siebeling
Hi, how can I load a Spark dataframe into Arrow using Scala? I've seen some older posts regarding this subject, but am hoping that there has been some development around this... I'd like to load a Spark dataframe into Arrow and then use the Gandiva project to do some analytics on that, thanks in