I just updated my PR for SPARK-13534 https://github.com/apache/spark/pull/15821 to use the latest from Arrow; hopefully that should help. I have also been playing around with Python UDFs in Spark with Arrow. I have something sort of working; there are still some issues and the branch is kind of messy right now, but feel free to check it out: https://github.com/BryanCutler/spark/tree/wip-arrow-stream-serializer - I just mention this because I saw you created a related Spark PR, and I'd be glad to help out if you want.
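Li, since your (2) below ("I am not sure how to write all the data in a VectorSchemaRoot using ArrowFileWriter") is the part I've been poking at, here is roughly what the write side looks like: create one VectorSchemaRoot for the whole file, then for each record batch fill the vectors, set the row count, and call writeBatch(). This is an untested sketch with a toy one-column schema, and I'm using current class and package names (IntVector, the dictionary-provider constructor arg, etc.), which have moved around between Arrow versions, so adjust for whatever you build against:

import java.io.FileOutputStream;
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ArrowFileWriteSketch {
  public static void main(String[] args) throws Exception {
    // Toy schema: a single nullable 32-bit int column named "value".
    Schema schema = new Schema(Collections.singletonList(
        Field.nullable("value", new ArrowType.Int(32, true))));

    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
         FileOutputStream out = new FileOutputStream("data.arrow");
         // null = no dictionary-encoded vectors in this example
         ArrowFileWriter writer =
             new ArrowFileWriter(root, null, out.getChannel())) {
      writer.start();
      // One writeBatch() per record batch: refill the same root,
      // set the row count, write, repeat.
      for (int batch = 0; batch < 2; batch++) {
        IntVector vector = (IntVector) root.getVector("value");
        vector.allocateNew(3);
        for (int i = 0; i < 3; i++) {
          vector.setSafe(i, batch * 3 + i);
        }
        root.setRowCount(3);
        writer.writeBatch();
      }
      writer.end();
    }
  }
}

The file this produces should be what the pyarrow.FileReader in your step (3) expects on the Python side.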
Bryan

On Wed, Apr 26, 2017 at 2:21 PM, Julien Le Dem <jul...@dremio.com> wrote:

> Example of writing to and reading from a file:
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java
> Similarly, in case you don't want to go through a file, unloading a
> vector into buffers and loading from buffers:
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
> The VectorLoader/Unloader are used to read/write files.
>
> On Wed, Apr 26, 2017 at 10:31 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
> > Thanks for the various pointers. I was looking at ArrowFileWriter/Reader
> > and got a little bit confused.
> >
> > What I am trying to do is convert a list of Spark rows into some Arrow
> > format in Java (I will probably go with the file format for now), send
> > the bytes to Python, and deserialize them into a pyarrow Table.
> >
> > Here is what I currently plan to do:
> > (1) Convert the rows to one or more Arrow record batches (using the
> > ValueVectors).
> > (2) Serialize the record batches and send them over to Python (not sure
> > what to use here; ArrowFileWriter?).
> > (3) Deserialize the bytes into a pyarrow.Table using pyarrow.FileReader.
> >
> > I *think* ArrowFileWriter is what I should use to send data over in (2),
> > but:
> > (1) I would need to turn the record batches into a VectorSchemaRoot by
> > doing something like this:
> > https://github.com/icexelloss/spark/blob/pandas-udf/sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala#L226
> > (2) I am not sure how to write all the data in a VectorSchemaRoot using
> > ArrowFileWriter.
> >
> > Does this sound like the right thing to do?
> >
> > Thanks,
> > Li
> >
> > On Tue, Apr 25, 2017 at 8:52 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Also, now that we have a website that is easier to write content for
> > > (in Markdown), it would be great if some Java developers could
> > > volunteer some time to write user-facing documentation to go with
> > > the Javadocs.
> > >
> > > On Tue, Apr 25, 2017 at 8:51 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > There is also
> > > > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java
> > > >
> > > > On Tue, Apr 25, 2017 at 8:46 PM, Li Jin <ice.xell...@gmail.com> wrote:
> > > >
> > > > > Thanks Julien. I will follow
> > > > > https://github.com/apache/arrow/blob/990e2bde758ac8bc6e4497ae1bc37f89b71bb5cf/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java#L91
>
> --
> Julien
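PS for Li: the in-memory alternative Julien mentions above (TestVectorUnloadLoad) boils down to an unload/load round trip between two roots that share a schema, and the ArrowRecordBatch it produces is the same unit the file and stream writers frame onto the wire. Again a rough, untested sketch; ArrowRecordBatch's package in particular has moved between releases, so treat the imports as approximate:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

public class UnloadLoadSketch {
  // Copy the contents of one root into a fresh root via an
  // ArrowRecordBatch, without going through a file.
  static VectorSchemaRoot copyBatch(VectorSchemaRoot sourceRoot,
                                    BufferAllocator allocator) {
    // The target root must share the source's schema; the caller
    // owns the returned root and must close it.
    VectorSchemaRoot targetRoot =
        VectorSchemaRoot.create(sourceRoot.getSchema(), allocator);
    VectorUnloader unloader = new VectorUnloader(sourceRoot);
    // getRecordBatch() packages up the validity/value buffers plus
    // the per-field lengths; close it to release its references.
    try (ArrowRecordBatch recordBatch = unloader.getRecordBatch()) {
      new VectorLoader(targetRoot).load(recordBatch);
    }
    return targetRoot;
  }
}

The loader takes its own references to the buffers, which is why the batch can be closed right after load; that is the pattern in TestVectorUnloadLoad.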