Hi Richard,
Take a look at this JIRA: https://issues.apache.org/jira/browse/SPARK-24579.
It is geared towards exporting Spark data to DL frameworks, but it is likely
to add a general method for mapping Spark data partitions to a function over
Arrow data. In that function you should be able to apply Gandiva.
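To make the Gandiva step concrete, here is a rough Scala sketch using
Gandiva's Java bindings (Projector and TreeBuilder are the real Gandiva
classes, but applyGandiva is a hypothetical helper, and getting one
ArrowRecordBatch per partition is exactly the piece SPARK-24579 would have
to provide):

import java.util.Collections
import scala.collection.JavaConverters._
import org.apache.arrow.gandiva.evaluator.Projector
import org.apache.arrow.gandiva.expression.TreeBuilder
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{Float8Vector, ValueVector}
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}

// Build a Gandiva projection for "x + y" over two float64 columns.
val float64 = new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
val x = Field.nullable("x", float64)
val y = Field.nullable("y", float64)
val out = Field.nullable("out", float64)
val expr = TreeBuilder.makeExpression("add", List(x, y).asJava, out)
val projector = Projector.make(new Schema(List(x, y).asJava),
  Collections.singletonList(expr))

// Hypothetical per-partition step: evaluate the projection on one batch.
def applyGandiva(batch: ArrowRecordBatch, allocator: RootAllocator): Float8Vector = {
  val result = new Float8Vector("out", allocator)
  result.allocateNew(batch.getLength)  // one output slot per input row
  projector.evaluate(batch, List[ValueVector](result).asJava)
  result
}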
Another pointer to look at:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3369
This function, Dataset.toArrowPayload, turns a Spark Dataset into an
RDD[ArrowPayload], where an ArrowPayload is basically the serialized bytes
of the data in Arrow file format.
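Note that toArrowPayload is private[sql], so it isn't callable from user code
without a small shim in the org.apache.spark.sql package. Once you do have
payload bytes, each one is a complete Arrow file, so you can read it back
with Arrow's Java reader. A minimal sketch, assuming a recent Arrow Java
version:

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowFileReader
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel

// Read one payload's bytes back into Arrow vectors.
def readPayload(bytes: Array[Byte]): Unit = {
  val allocator = new RootAllocator(Long.MaxValue)
  val reader = new ArrowFileReader(
    new ByteArrayReadableSeekableByteChannel(bytes), allocator)
  try {
    while (reader.loadNextBatch()) {
      val root = reader.getVectorSchemaRoot  // schema plus the column vectors
      println(s"read ${root.getRowCount} rows, ${root.getFieldVectors.size} columns")
    }
  } finally {
    reader.close()
    allocator.close()
  }
}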
Hi,
@Li, same as Jieun, I'd like to start with a single machine, but I can
imagine that there are use cases for a distributed approach.
@Wes, thanks, I'll look into it,
Richard
On Wed, 25 Jul 2018 at 03:59, Wes McKinney wrote:
hi Richard,
I might start here in the Spark codebase to see how Spark SQL tables
are converted to Arrow record batches:
https://github.com/apache/spark/blob/d8aaa771e249b3f54b57ce24763e53fd65a0dbf7/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
The code has be...
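ArrowConverters (and ArrowPayload) are private[sql], so one way to experiment
with them from your own project is a small shim object placed in the
org.apache.spark.sql package. A sketch against the master branch linked above
(ArrowShim is a made-up name, and since these are private APIs the signatures
may change):

package org.apache.spark.sql

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.arrow.ArrowConverters

object ArrowShim {
  // Turn each partition of a DataFrame into serialized Arrow batches,
  // capping batches at 10000 rows and using the session's local time zone.
  def toArrowBytes(df: DataFrame): RDD[Array[Byte]] = {
    val schema = df.schema
    val timeZoneId = df.sparkSession.sessionState.conf.sessionLocalTimeZone
    df.queryExecution.toRdd.mapPartitions { iter =>
      ArrowConverters
        .toPayloadIterator(iter, schema, 10000, timeZoneId, TaskContext.get())
        .map(_.asPythonSerializable)  // raw bytes in Arrow file format
    }
  }
}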
Hi,
Do you want to collect a Spark DataFrame into Arrow format on a single
machine, or do you still want to keep the data distributed?
Hi,
How can I load a Spark DataFrame into Arrow using Scala?
I've seen some older posts on this subject, but I'm hoping that there has
been some development around this...
I'd like to load a Spark DataFrame into Arrow and then use the Gandiva
project to do some analytics on it.
Thanks in advance,