Re: [DISCUSS] Add support for Apache Arrow format

Jacky Lau Tue, 11 Apr 2023 19:27:24 -0700

Hi
   I also think arrow format  will be useful when reading/writing with
message queue.
   Arrow defines a language-independent columnar memory format for flat and
hierarchical data, organized for efficient analytic operations on modern
hardware like CPUs and GPUs. The Arrow memory format also supports
zero-copy reads for lightning-fast data access without serialization
overhead. it will bring a lot.
   And we  may do some surveys, what other engines support like
spark/hive/presto and so on, how that supports and how it be used.


   Best,
   Jacky.

Aitozi <[email protected]> 于2023年4月2日周日 22:22写道：

> Hi all,
>     Thanks for your input.
>
> @Ran > However, as mentioned in the issue you listed, it may take a lot of
> work
> and the community's consideration for integrating Arrow.
>
> To clarify, this proposal solely aims to introduce flink-arrow as a new
> format,
> similar to flink-csv and flink-protobuf. It will not impact the internal
> data
> structure representation in Flink. For proof of concept, please refer to:
> https://github.com/Aitozi/flink/commits/arrow-format.
>
> @Martijn > I'm wondering if there's really much benefit for the Flink
> project to
> add another file format, over properly supporting the format that we
> already
> have in the project.
>
> Maintain the format we already have and introduce new formats should be
> orthogonal. The requirement of supporting arrow format originally observed
> in
> our internal usage to deserialize the data(VectorSchemaRoot) from other
> storage
> systems to flink internal RowData and serialize the flink internal RowData
> to
> VectorSchemaRoot out to the storage system.  And the requirement from the
> slack[1] is to support the arrow file format. Although, Arrow is not
> usually
> used as the final disk storage format.  But it has a tendency to be used
> as the
> inter-exchange format between different systems or temporary storage for
> analysis due to its columnar format and can be memory mapped to other
> analysis
> programs.
>
> So, I think it's meaningful to support arrow formats in Flink.
>
> @Jim >  If the Flink format interface is used there, then it may be useful
> to
> consider Arrow along with other columnar formats.
>
> I am not well-versed with the formats utilized in Paimon. Upon checking
> [2], it
> appears that Paimon does not directly employ flink formats. Instead, it
> utilizes
> FormatWriterFactory and FormatReaderFactory to handle data serialization
> and
> deserialization. Therefore, I believe that the current work may not be
> applicable for reuse in Paimon at this time.
>
> Best,
> Aitozi.
>
> [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> [2]:
> https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
>
> Jim Hughes <[email protected]> 于2023年3月31日周五 00:36写道：
> >
> > Hi all,
> >
> > How do Flink formats relate to or interact with Paimon (formerly
> > Flink-Table-Store)?  If the Flink format interface is used there, then it
> > may be useful to consider Arrow along with other columnar formats.
> >
> > Separately, from previous experience, I've seen the Arrow format be
> useful
> > as an output format for clients to read efficiently.  Arrow does support
> > returning batches of records, so there may be some options to use the
> > format in a streaming situation where a sufficient collection of records
> > can be gathered.
> >
> > Cheers,
> >
> > Jim
> >
> >
> >
> > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <[email protected]
> >
> > wrote:
> >
> > > Hi,
> > >
> > > To be honest, I haven't seen that much demand for supporting the Arrow
> > > format directly in Flink as a flink-format. I'm wondering if there's
> really
> > > much benefit for the Flink project to add another file format, over
> > > properly supporting the format that we already have in the project.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <[email protected]> wrote:
> > >
> > > > It is a good point that flink integrates apache arrow as a format.
> > > > Arrow can take advantage of SIMD-specific or vectorized
> optimizations,
> > > > which should be of great benefit to batch tasks.
> > > > However, as mentioned in the issue you listed, it may take a lot of
> work
> > > > and the community's consideration for integrating Arrow.
> > > >
> > > > I think you can try to make a simple poc for verification and some
> > > specific
> > > > plans.
> > > >
> > > >
> > > > Best Regards,
> > > > Ran Tao
> > > >
> > > >
> > > > Aitozi <[email protected]> 于2023年3月29日周三 19:12写道：
> > > >
> > > > > Hi guys
> > > > >      I'm opening this thread to discuss supporting the Apache Arrow
> > > > format
> > > > > in Flink.
> > > > >      Arrow is a language-independent columnar memory format that
> has
> > > > become
> > > > > widely used in different systems, and It can also serve as an
> > > > > inter-exchange format between other systems.
> > > > > So, using it directly in the Flink system will be nice. We also
> > > received
> > > > > some requests from slack[1][2] and jira[3].
> > > > >      In our company's internal usage, we have used flink-python
> > > moudle's
> > > > > ArrowReader and ArrowWriter to support this work. But it still can
> not
> > > > > integrate with the current flink-formats framework closely.
> > > > > So, I'd like to introduce the flink-arrow formats module to
> support the
> > > > > arrow format naturally.
> > > > >      Looking forward to some suggestions.
> > > > >
> > > > >
> > > > > Best,
> > > > > Aitozi
> > > > >
> > > > >
> > > > > [1]:
> > > >
> https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > >
> > > > > [2]:
> > > >
> https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > >
> > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929
> > > > >
> > > >
> > >
>

Re: [DISCUSS] Add support for Apache Arrow format

Reply via email to