There was a discussion/proposal a while ago on the Spark mailing list about using the Arrow memory format natively within Spark [1], but as I understand it, the proposal was scaled back to exposing vectorized APIs only.
Looking quickly at the links Wes provided, one option for a potential speed-up could be a dynamically generated ArrowWriter [2]. I'm not sure how much of a performance benefit this would provide (and it also doesn't solve the memory overhead issues). I think this type of functionality should be discussed on the Spark mailing list. For the Parquet parsing, there are some open pull requests and some discussion on exposing more of the C++ functionality through JNI bindings, which, combined with ArrowColumnVector, might provide a feasible way to reduce overhead. Maybe members of the Spark community can comment on how we can better collaborate to further reduce overhead?

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/SPARK-27396
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

On Fri, Dec 6, 2019 at 5:11 AM GaoXiang Wang <wgx...@gmail.com> wrote:

> Hi Wes and Liya,
>
> I appreciate your feedback and information.
>
> I'm looking forward to a more efficient integration between Arrow and
> Spark at the Java/Scala level, and I would like to contribute if I can
> help in any way during my free time.
>
> Thank you very much.
>
> Best Regards,
> WANG GAOXIANG (Eric)
> National University of Singapore Graduate ::
> API Craft Singapore Co-organiser ::
> Singapore Python User Group Co-organiser
> +6597685360 (P) :: wgx...@gmail.com (E) :: https://medium.com/@wgx731 (W)
>
> On Fri, Dec 6, 2019 at 6:17 PM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi folks,
> >
> > Thanks for your clarification.
> >
> > I also think this is a universal requirement (including Java UDFs in
> > Arrow format).
> >
> > The Java converter provided by Spark is inefficient, for two reasons
> > (IMO):
> >
> > 1. There are frequent memory copies between on-heap and off-heap
> > memory.
> > 2. The Spark API presents a row-oriented view (an Iterator of
> > InternalRow), so we need to perform column/row conversions and cannot
> > copy data in batches.
> >
> > To solve the problem, maybe we need something equivalent to pandas in
> > Java (I think pandas acts as a bridge between PyArrow and PySpark).
> > In addition, we would need to integrate it into Arrow and Spark.
> >
> > Best,
> > Liya Fan
> >
> > On Fri, Dec 6, 2019 at 2:14 AM Chen Li <c...@fb.com> wrote:
> >
> > > We have a similar use case, and we use the ArrowConverters.scala
> > > mentioned by Wes. However, the overhead of the conversion is quite
> > > high.
> > >
> > > ------------------------------
> > > From: Wes McKinney <wesmck...@gmail.com>
> > > Sent: Thursday, December 5, 2019 6:53 AM
> > > To: dev <dev@arrow.apache.org>
> > > Cc: Fan Liya <liya.fa...@gmail.com>;
> > > jeetendra.jais...@impetus.co.in.invalid
> > > Subject: Re: Java - Spark dataframe to Arrow format
> > >
> > > hi folks,
> > >
> > > I understand the question to be about serialization.
> > >
> > > See:
> > >
> > > * https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
> > > * https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
> > > * https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala
> > >
> > > This code is used to convert between Spark DataFrames and the Arrow
> > > columnar format for UDF evaluation purposes.
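To make that last pointer concrete: ArrowColumnVector is a public class, so
once data is already in Arrow vectors it can be handed to Spark's columnar
APIs without copying. A minimal, untested sketch (the class name is mine,
and it assumes the Arrow Java version on the classpath matches the one
Spark was built against):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ArrowToColumnarBatch {
      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector ids = new IntVector("id", allocator)) {
          ids.allocateNew(3);
          for (int i = 0; i < 3; i++) {
            ids.set(i, i * 10);
          }
          ids.setValueCount(3);

          // Wrap the off-heap Arrow buffers so Spark reads them directly,
          // without copying the data on-heap.
          ColumnarBatch batch =
              new ColumnarBatch(new ColumnVector[] {new ArrowColumnVector(ids)});
          batch.setNumRows(3);
          batch.rowIterator().forEachRemaining(row ->
              System.out.println(row.getInt(0)));
        }
      }
    }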
> > > On Thu, Dec 5, 2019 at 6:58 AM GaoXiang Wang <wgx...@gmail.com> wrote:
> > > >
> > > > Hi Jeetendra and Liya,
> > > >
> > > > I actually have a similar use case. We have some data stored in
> > > > Parquet format in HDFS and would like to make use of Apache Arrow
> > > > to improve compute performance if possible. Right now, I don't see
> > > > a direct way to do this in Java with Spark.
> > > >
> > > > I have searched the Spark documentation, and it looks like Python
> > > > support was added in 2.3.0
> > > > (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
> > > > Is there any plan from the Apache Arrow team to provide Spark
> > > > integration for Java?
> > > >
> > > > Thank you very much.
> > > >
> > > > Best Regards,
> > > > WANG GAOXIANG (Eric)
> > > > National University of Singapore Graduate ::
> > > > API Craft Singapore Co-organiser ::
> > > > Singapore Python User Group Co-organiser
> > > > +6597685360 (P) :: wgx...@gmail.com (E) :: https://medium.com/@wgx731 (W)
> > > >
> > > > On Thu, Dec 5, 2019 at 6:58 PM Fan Liya <liya.fa...@gmail.com> wrote:
> > > >
> > > > > Hi Jeetendra,
> > > > >
> > > > > I am not sure if I understand your question correctly.
> > > > >
> > > > > Arrow is an in-memory columnar data format, and Spark has its own
> > > > > in-memory data format for DataFrames, which is invisible to end
> > > > > users. So the Spark user has no control over the underlying
> > > > > in-memory layout.
> > > > >
> > > > > If you really want to convert a DataFrame into Arrow format,
> > > > > maybe you can save the results of a Spark job to some external
> > > > > store (e.g. in ORC format), and then load them back into memory
> > > > > in Arrow format (if this is what you want).
> > > > >
> > > > > Best,
> > > > > Liya Fan
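The "load it back in Arrow format" half of that suggestion is the part that
is still awkward from Java: Spark writes ORC or Parquet easily, but reading
those into Arrow from Java is exactly the JNI work mentioned above. If the
intermediate files can instead be staged as Arrow IPC files (produced by
some other process, since Spark does not write that format natively), the
Java read side is straightforward. An untested sketch; the file path and
class name are placeholders:

    import java.io.FileInputStream;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileReader;

    public class LoadArrowIpc {
      public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileInputStream in = new FileInputStream("/tmp/result.arrow");
             ArrowFileReader reader =
                 new ArrowFileReader(in.getChannel(), allocator)) {
          // loadNextBatch() refills the reader's VectorSchemaRoot with the
          // next record batch until the file is exhausted.
          VectorSchemaRoot root = reader.getVectorSchemaRoot();
          while (reader.loadNextBatch()) {
            System.out.println(root.getRowCount() + " rows in this batch");
          }
        }
      }
    }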
> > > > > On Thu, Dec 5, 2019 at 5:53 PM Jeetendra Kumar Jaiswal
> > > > > <jeetendra.jais...@impetus.co.in.invalid> wrote:
> > > > >
> > > > > > Hi Dev Team,
> > > > > >
> > > > > > Can someone please let me know how to convert a Spark data
> > > > > > frame to Arrow format? I am coding in Java.
> > > > > >
> > > > > > The Java documentation for Arrow only has API-level
> > > > > > information. It is a little hard to develop without proper
> > > > > > documentation.
> > > > > >
> > > > > > Is there a way to directly convert Spark DataFrames to Arrow
> > > > > > format DataFrames?
> > > > > >
> > > > > > Thanks,
> > > > > > Jeetendra
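Circling back to the original question at the bottom of the thread: there
is still no public one-call conversion in Spark's Java API, and per Liya's
two points above, a hand-rolled version ends up copying row by row. For
illustration only, an untested sketch of that manual copy for a single int
column; the "id" column is assumed, and a live SparkSession is required:

    import java.util.List;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class ManualRowCopy {
      // Copies one int column of a DataFrame into an Arrow vector. This is
      // exactly the row-at-a-time, on-heap to off-heap copying that makes
      // the conversion expensive.
      static IntVector toArrowColumn(Dataset<Row> df, BufferAllocator allocator) {
        List<Row> rows = df.select("id").collectAsList(); // on-heap rows
        IntVector vec = new IntVector("id", allocator);
        vec.allocateNew(rows.size());
        for (int i = 0; i < rows.size(); i++) {
          vec.set(i, rows.get(i).getInt(0));              // copy off-heap
        }
        vec.setValueCount(rows.size());
        return vec; // caller owns the vector and must close() it
      }
    }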