Hi Richard,

Take a look at this JIRA: https://issues.apache.org/jira/browse/SPARK-24579.
It is geared towards exporting Spark data to DL frameworks, but it's likely
to add a general method to map Spark data partitions to a function using
Arrow data. In that function you should be able to apply Gandiva to the
data, although I don't know many details of that project. Otherwise, some
of the machinery to do this already exists internally in Spark but is not
public, as others have pointed out.
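
For illustration, here is a rough sketch of the shape such a hook might
take. Note that mapArrowPartitions is hypothetical (SPARK-24579 hasn't
settled on an API yet), and the Gandiva call in the comments is only a
pointer, not a worked example:

    import org.apache.arrow.vector.VectorSchemaRoot

    // Hypothetical sketch: a per-partition hook that hands user code each
    // partition as Arrow data. `mapArrowPartitions` is NOT an existing
    // Spark API; it only illustrates what SPARK-24579 might enable.
    val df = spark.read.parquet("/data/events")
    val result = df.mapArrowPartitions { batches: Iterator[VectorSchemaRoot] =>
      batches.map { root =>
        // A Gandiva projector could be applied per batch here, e.g. via
        // Projector.make(...) from Gandiva's Java bindings.
        root
      }
    }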

Bryan

On Wed, Jul 25, 2018 at 1:10 PM, Li Jin <ice.xell...@gmail.com> wrote:

> Another pointer to look at:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3369
>
> The function Dataset.toArrowPayload turns a Spark Dataset into an
> RDD[ArrowPayload], where ArrowPayload is basically serialized bytes in
> the Arrow file format.
>
> But like Wes mentioned, this is a private function in Spark and probably
> needs a bit of effort to use. I believe Bryan has a PR to change
> ArrowPayload to hold deserialized record batches.
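>
> For illustration, a rough sketch of reaching that private API (assuming
> the Spark 2.3-era internals, where both Dataset.toArrowPayload and
> ArrowPayload.asPythonSerializable are private[sql], so the helper has to
> be compiled into the org.apache.spark.sql package):
>
>     package org.apache.spark.sql
>
>     import org.apache.spark.rdd.RDD
>
>     // Sketch only: relies on private[sql] internals, so it must live in
>     // the org.apache.spark.sql package and may break across versions.
>     object ArrowExport {
>       // Each element is one record batch serialized in Arrow file format.
>       def toArrowBytes(df: DataFrame): RDD[Array[Byte]] =
>         df.toArrowPayload.map(_.asPythonSerializable)
>     }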
>
> On Wed, Jul 25, 2018 at 7:34 AM, Richard Siebeling <rsiebel...@gmail.com>
> wrote:
>
> > Hi,
> >
> > @Li, same as Jieun, I'd like to start with a single machine but can
> > imagine that there are use cases for a distributed approach.
> > @Wes, thanks, I'll look into it,
> >
> > Richard
> >
> > On Wed, 25 Jul 2018 at 03:59, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Richard,
> > >
> > > I might start here in the Spark codebase to see how Spark SQL tables
> > > are converted to Arrow record batches:
> > >
> > >
> > > https://github.com/apache/spark/blob/d8aaa771e249b3f54b57ce24763e53fd65a0dbf7/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> > >
> > > The code has been developed to send payloads over a socket to PySpark,
> > > but it could perhaps be adapted for your needs without too much
> > > effort. Li and Bryan and others have worked on this, so they should be
> > > able to answer your questions about it.
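> > >
> > > As a sketch of consuming the resulting payload bytes back on the JVM
> > > side with Arrow's Java library (package locations as in recent Arrow
> > > releases; adjust the imports for your version):
> > >
> > >     import org.apache.arrow.memory.RootAllocator
> > >     import org.apache.arrow.vector.ipc.ArrowFileReader
> > >     import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel
> > >
> > >     // Read one serialized Arrow file-format payload back into record
> > >     // batches and report their sizes.
> > >     val allocator = new RootAllocator(Long.MaxValue)
> > >     def readBatches(bytes: Array[Byte]): Unit = {
> > >       val channel = new ByteArrayReadableSeekableByteChannel(bytes)
> > >       val reader = new ArrowFileReader(channel, allocator)
> > >       try {
> > >         while (reader.loadNextBatch()) {
> > >           val root = reader.getVectorSchemaRoot
> > >           println(s"record batch with ${root.getRowCount} rows")
> > >         }
> > >       } finally reader.close()
> > >     }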
> > >
> > > - Wes
> > >
> > > On Tue, Jul 24, 2018 at 8:21 AM, Li Jin <ice.xell...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > Do you want to collect a Spark DataFrame into Arrow format on a single
> > > > machine or do you still want to keep the data distributed?
> > >
> >
>
