Serializing a Spark DataFrame in either Java or Scala would suffice for the use case, but there may be follow-on JIRAs to make the Arrow adapters more accessible. For example, pandas only needs access to flat schemas for now, so nested Spark SQL schemas could be handled in follow-up work.
Note: this is somewhat dependent on the separate thread around the metadata specification -- ideally Spark SQL would be able to adapt its schema metadata to a form that any Arrow consumer can use.

- Wes

On Thu, Mar 3, 2016 at 12:39 AM, Dmitriy Morozov <int.2...@gmail.com> wrote:
> Hi Wes,
>
> Thanks for raising the ticket. So it seems like Spark 2.0 will not have
> support for Arrow.
> Also, does SPARK-13534 cover Arrow serialization for Spark's Java API, or do
> we need to raise a separate ticket for that?
>
> As of now, I only have a high-level understanding of Arrow and its data
> structure, but I'm willing to dive deeper and provide any help I can, mainly
> in testing, the Java serializer, or additional examples. Let me know how I can
> help.
>
> Thanks,
> Dima
>
> On 1 March 2016 at 00:46, Wes McKinney <w...@cloudera.com> wrote:
>
> > hi Dmitriy,
> >
> > I created the following JIRA
> > https://issues.apache.org/jira/browse/SPARK-13534 related to PySpark
> > which seems relevant. I would be happy to collaborate with you on
> > this. Since I understand that the Spark developers are exploring an
> > in-memory columnar layout for Spark DataFrames/Datasets and Spark SQL,
> > any conversion code we write right now may end up being temporary.
> > Hopefully the Spark columnar memory layout will end up being very
> > nearly the same as the official Arrow layout so that limited or no
> > conversion will be necessary.
> >
> > Thanks
> > Wes
> >
> > On Wed, Feb 24, 2016 at 12:38 PM, Dmitriy Morozov <int.2...@gmail.com>
> > wrote:
> > > Hello everyone,
> > >
> > > I'm just starting with Arrow. I'd like to see how good Arrow is at caching
> > > when used in conjunction with Alluxio (Tachyon). The use case that I'm
> > > going to validate involves reading data from Spark's DataFrame, storing it
> > > in Tachyon in Arrow and then reading it back into a DataFrame. I checked the
> > > source code of Arrow but couldn't find any examples or tests. Can anyone
> > > please guide me on where I should start looking in order to convert a
> > > DataFrame to an Arrow struct?
> > >
> > > Thanks!
> > > Dmitriy
> > >
> >
>
> --
> Kind regards,
> Dima