I just updated my PR for SPARK-13534
https://github.com/apache/spark/pull/15821 to use the latest from Arrow;
hopefully that helps.  I have also been playing around with Python
UDFs in Spark with Arrow.  I have something sort of working; there are
still some issues and the branch is kind of messy right now, but
feel free to check it out:
https://github.com/BryanCutler/spark/tree/wip-arrow-stream-serializer - I
just mention this because I saw you created a related Spark PR and I'd be
glad to help out if you want.

Bryan

On Wed, Apr 26, 2017 at 2:21 PM, Julien Le Dem <jul...@dremio.com> wrote:

> Example of writing to and reading from a file:
> https://github.com/apache/arrow/blob/master/java/vector/
> src/test/java/org/apache/arrow/vector/file/TestArrowFile.java
> Similarly, in case you don't want to go through a file:
> Unloading a vector into buffers and loading from buffers:
> https://github.com/apache/arrow/blob/master/java/vector/
> src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
> The VectorLoader/VectorUnloader are used to read/write files.
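>
> Roughly, the round trip looks like this (an untested sketch; the exact
> class names and packages shift a bit between Arrow versions, and the
> "ints" column is just for illustration):
>
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.IntVector;
> import org.apache.arrow.vector.VectorLoader;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.VectorUnloader;
> import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
>
> public class UnloadLoadExample {
>   public static void main(String[] args) throws Exception {
>     try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>          IntVector vector = new IntVector("ints", allocator)) {
>       vector.allocateNew(3);
>       vector.set(0, 1); vector.set(1, 2); vector.set(2, 3);
>       vector.setValueCount(3);
>       VectorSchemaRoot source = VectorSchemaRoot.of(vector);
>
>       // unload the populated vectors into a record batch (buffers + metadata)
>       VectorUnloader unloader = new VectorUnloader(source);
>       try (ArrowRecordBatch batch = unloader.getRecordBatch();
>            VectorSchemaRoot target =
>                VectorSchemaRoot.create(source.getSchema(), allocator)) {
>         // load the buffers into a fresh root with the same schema
>         new VectorLoader(target).load(batch);
>       }
>     }
>   }
> }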
>
> On Wed, Apr 26, 2017 at 10:31 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
> > Thanks for the various pointers. I was looking at ArrowFileWriter/Reader
> > and got a little bit confused.
> >
> > So what I am trying to do is convert a list of Spark rows into some Arrow
> > format in Java (I will probably go with the file format for now), send
> > the bytes to Python, and deserialize them into a pyarrow Table.
> >
> > Here is what I currently plan to do:
> > (1) convert the rows to one or more Arrow record batches (using the
> > ValueVectors; a rough sketch of this step follows below)
> > (2) serialize the Arrow record batches and send them over to Python (not
> > sure what to use here, ArrowFileWriter?)
> > (3) deserialize the bytes into a pyarrow.Table using pyarrow.FileReader
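> >
> > For (1), I am imagining something along these lines (an untested sketch;
> > the single int column and the name "id" are made up, and the Arrow class
> > names depend on the version):
> >
> > import java.util.List;
> > import org.apache.arrow.memory.BufferAllocator;
> > import org.apache.arrow.vector.IntVector;
> > import org.apache.arrow.vector.VectorSchemaRoot;
> > import org.apache.spark.sql.Row;
> >
> > // build one Arrow vector per column, filled row by row
> > static VectorSchemaRoot rowsToRoot(List<Row> rows, BufferAllocator allocator) {
> >   IntVector ids = new IntVector("id", allocator);
> >   ids.allocateNew(rows.size());
> >   int i = 0;
> >   for (Row row : rows) {
> >     ids.setSafe(i++, row.getInt(0));
> >   }
> >   ids.setValueCount(i);
> >   // a VectorSchemaRoot is just the column vectors plus their schema
> >   return VectorSchemaRoot.of(ids);
> > }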
> >
> > I *think* ArrowFileWriter is what I should use to send data over in (2),
> > but:
> > (1) I would need to turn the Arrow record batches into a VectorSchemaRoot
> > by doing something like this:
> > https://github.com/icexelloss/spark/blob/pandas-udf/sql/
> > core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala#L226
> > (2) I am not sure how to write all the data in a VectorSchemaRoot using
> > ArrowFileWriter.
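> >
> > Is it something like this? (untested; I am writing into an in-memory
> > channel here just to get bytes I can ship to Python, and the package
> > for ArrowFileWriter may differ by Arrow version)
> >
> > import java.io.ByteArrayOutputStream;
> > import java.nio.channels.Channels;
> > import org.apache.arrow.vector.VectorSchemaRoot;
> > import org.apache.arrow.vector.ipc.ArrowFileWriter;
> >
> > // serialize the contents of a VectorSchemaRoot in the Arrow file format
> > static byte[] rootToBytes(VectorSchemaRoot root) throws Exception {
> >   ByteArrayOutputStream out = new ByteArrayOutputStream();
> >   try (ArrowFileWriter writer =
> >            new ArrowFileWriter(root, null, Channels.newChannel(out))) {
> >     writer.start();
> >     writer.writeBatch();  // writes the current contents of `root` as one batch
> >     writer.end();
> >   }
> >   return out.toByteArray();  // these bytes would go over to Python
> > }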
> >
> > Does this sound like the right thing to do?
> >
> > Thanks,
> > Li
> >
> > On Tue, Apr 25, 2017 at 8:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Also, now that we have a website that is easier to write content for
> > > (in Markdown), it would be great if some Java developers could
> > > volunteer some time to write user-facing documentation to go with the
> > > Javadocs.
> > >
> > > On Tue, Apr 25, 2017 at 8:51 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > There is also https://github.com/apache/arrow/blob/master/java/
> > > > vector/src/test/java/org/apache/arrow/vector/file/
> > > > TestArrowStreamPipe.java
> > > >
> > > > On Tue, Apr 25, 2017 at 8:46 PM, Li Jin <ice.xell...@gmail.com> wrote:
> > > >
> > > >> Thanks Julien. I will follow
> > > >> https://github.com/apache/arrow/blob/990e2bde758ac8bc6e4497a
> > > >> e1bc37f89b71bb5cf/java/vector/src/test/java/org/apache/
> > > >> arrow/vector/stream/MessageSerializerTest.java#L91
> > > >>
> > > >
> > > >
> > >
> >
>
>
>
> --
> Julien
>
