> Which connectors would be commonly used when reading in Arrow format? Filesystem?

Currently, yes. Ideally it could also be combined with other connectors, but I
have not yet figured out how to integrate the Arrow deserializer with the
`DecodingFormat` or `DeserializationSchema` interfaces. So, as a first step, I
want to introduce it as a file bulk format.
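To make that first step a bit more concrete, the read side boils down to
decoding Arrow IPC record batches into `RowData`. A minimal sketch of that core
loop (class names like `ArrowToRowDataSketch` are purely illustrative, and I am
assuming the `ArrowReader`/`ArrowUtils` helpers from the flink-python module;
the exact factory method name may differ between Flink versions):

```java
import java.io.InputStream;
import java.util.function.Consumer;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.arrow.ArrowReader;
import org.apache.flink.table.runtime.arrow.ArrowUtils;
import org.apache.flink.table.types.logical.RowType;

/** Illustrative sketch only: decode an Arrow IPC stream into Flink RowData. */
public final class ArrowToRowDataSketch {

    public static void readAll(InputStream in, RowType rowType, Consumer<RowData> out)
            throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
                ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            // Assumption: ArrowUtils#createArrowReader comes from the flink-python
            // module's Arrow utilities; the exact factory name may differ per version.
            ArrowReader arrowReader = ArrowUtils.createArrowReader(root, rowType);
            // Each loadNextBatch() call (re)fills the VectorSchemaRoot with one record batch.
            while (reader.loadNextBatch()) {
                for (int i = 0; i < root.getRowCount(); i++) {
                    // The returned RowData is a view over the current batch, so it must
                    // be consumed (or copied) before the next batch is loaded.
                    out.accept(arrowReader.read(i));
                }
            }
        }
    }
}
```

Wrapping this loop behind the file source's `BulkFormat`/`StreamFormat` reader
lifecycle would then be the actual work of the first step.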
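For the reverse direction (serializing `RowData` back into Arrow's
`VectorSchemaRoot`, which is what we already do internally today, as mentioned
further down in the thread), the flow with the flink-python Arrow utilities
looks roughly like the following. Again, treat the helper names
(`toArrowSchema`, `createRowDataArrowWriter`) as approximations rather than a
final API:

```java
import java.io.OutputStream;
import java.util.List;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.arrow.ArrowUtils;
import org.apache.flink.table.runtime.arrow.ArrowWriter;
import org.apache.flink.table.types.logical.RowType;

/** Illustrative sketch only: serialize Flink RowData into an Arrow IPC stream. */
public final class RowDataToArrowSketch {

    public static void writeAll(List<RowData> rows, RowType rowType, OutputStream out)
            throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
                // Assumption: ArrowUtils#toArrowSchema maps the Flink RowType to an Arrow schema.
                VectorSchemaRoot root =
                        VectorSchemaRoot.create(ArrowUtils.toArrowSchema(rowType), allocator);
                ArrowStreamWriter streamWriter = new ArrowStreamWriter(root, null, out)) {
            streamWriter.start();
            // Assumption: factory name mirrors what the flink-python module uses internally;
            // it may be createArrowWriter/createRowDataArrowWriter depending on the version.
            ArrowWriter<RowData> writer = ArrowUtils.createRowDataArrowWriter(root, rowType);
            for (RowData row : rows) {
                writer.write(row);
            }
            writer.finish();           // flush the buffered rows into the Arrow vectors
            streamWriter.writeBatch(); // emit one Arrow record batch
        }
    }
}
```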
On Wed, Apr 12, 2023 at 22:53, Martijn Visser <martijnvis...@apache.org> wrote:

> Which connectors would be commonly used when reading in Arrow format?
> Filesystem?
>
> On Wed, Apr 12, 2023 at 4:27 AM Jacky Lau <liuyong...@gmail.com> wrote:
>
> > Hi,
> > I also think the Arrow format will be useful when reading/writing with
> > message queues.
> > Arrow defines a language-independent columnar memory format for flat and
> > hierarchical data, organized for efficient analytic operations on modern
> > hardware like CPUs and GPUs. The Arrow memory format also supports
> > zero-copy reads for lightning-fast data access without serialization
> > overhead, so it would bring a lot of benefit.
> > We may also do some surveys on how other engines such as Spark/Hive/Presto
> > support Arrow and how it is used there.
> >
> > Best,
> > Jacky
> >
> > On Sun, Apr 2, 2023 at 22:22, Aitozi <gjying1...@gmail.com> wrote:
> >
> > > Hi all,
> > > Thanks for your input.
> > >
> > > @Ran > However, as mentioned in the issue you listed, it may take a lot
> > > of work and the community's consideration for integrating Arrow.
> > >
> > > To clarify, this proposal solely aims to introduce flink-arrow as a new
> > > format, similar to flink-csv and flink-protobuf. It will not impact the
> > > internal data structure representation in Flink. For proof of concept,
> > > please refer to: https://github.com/Aitozi/flink/commits/arrow-format.
> > >
> > > @Martijn > I'm wondering if there's really much benefit for the Flink
> > > project to add another file format, over properly supporting the format
> > > that we already have in the project.
> > >
> > > Maintaining the formats we already have and introducing new formats
> > > should be orthogonal. The requirement to support the Arrow format was
> > > originally observed in our internal usage: deserializing data
> > > (VectorSchemaRoot) from other storage systems into Flink's internal
> > > RowData, and serializing Flink's internal RowData back to
> > > VectorSchemaRoot for the storage system. The request from Slack [1] is
> > > to support the Arrow file format. Although Arrow is not usually used as
> > > the final on-disk storage format, it tends to be used as an
> > > inter-exchange format between different systems or as temporary storage
> > > for analysis, because of its columnar layout and because it can be
> > > memory-mapped by other analysis programs.
> > >
> > > So, I think it's meaningful to support the Arrow format in Flink.
> > >
> > > @Jim > If the Flink format interface is used there, then it may be
> > > useful to consider Arrow along with other columnar formats.
> > >
> > > I am not well-versed in the formats utilized in Paimon. Upon checking
> > > [2], it appears that Paimon does not directly employ Flink formats.
> > > Instead, it utilizes FormatWriterFactory and FormatReaderFactory to
> > > handle data serialization and deserialization. Therefore, I believe the
> > > current work may not be applicable for reuse in Paimon at this time.
> > >
> > > Best,
> > > Aitozi
> > >
> > > [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > [2]: https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
> > >
> > > On Fri, Mar 31, 2023 at 00:36, Jim Hughes <jhug...@confluent.io.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > How do Flink formats relate to or interact with Paimon (formerly
> > > > Flink-Table-Store)? If the Flink format interface is used there, then
> > > > it may be useful to consider Arrow along with other columnar formats.
> > > >
> > > > Separately, from previous experience, I've seen the Arrow format be
> > > > useful as an output format for clients to read efficiently. Arrow does
> > > > support returning batches of records, so there may be some options to
> > > > use the format in a streaming situation where a sufficient collection
> > > > of records can be gathered.
> > > >
> > > > Cheers,
> > > >
> > > > Jim
> > > >
> > > > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <martijnvis...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > To be honest, I haven't seen that much demand for supporting the
> > > > > Arrow format directly in Flink as a flink-format. I'm wondering if
> > > > > there's really much benefit for the Flink project to add another
> > > > > file format, over properly supporting the format that we already
> > > > > have in the project.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <chucheng...@gmail.com> wrote:
> > > > >
> > > > > > It is a good point for Flink to integrate Apache Arrow as a format.
> > > > > > Arrow can take advantage of SIMD-specific or vectorized
> > > > > > optimizations, which should be of great benefit to batch tasks.
> > > > > > However, as mentioned in the issue you listed, it may take a lot of
> > > > > > work and the community's consideration to integrate Arrow.
> > > > > >
> > > > > > I think you can try to make a simple PoC for verification and some
> > > > > > specific plans.
> > > > > >
> > > > > > Best Regards,
> > > > > > Ran Tao
> > > > > >
> > > > > > On Wed, Mar 29, 2023 at 19:12, Aitozi <gjying1...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > > I'm opening this thread to discuss supporting the Apache Arrow
> > > > > > > format in Flink.
> > > > > > > Arrow is a language-independent columnar memory format that has
> > > > > > > become widely used in different systems, and it can also serve
> > > > > > > as an inter-exchange format between other systems. So, using it
> > > > > > > directly in the Flink system would be nice. We have also received
> > > > > > > some requests from Slack [1][2] and JIRA [3].
> > > > > > > In our company's internal usage, we have used the flink-python
> > > > > > > module's ArrowReader and ArrowWriter to support this work, but
> > > > > > > they still cannot integrate closely with the current
> > > > > > > flink-formats framework.
> > > > > > > So, I'd like to introduce a flink-arrow formats module to support
> > > > > > > the Arrow format naturally.
> > > > > > > Looking forward to some suggestions.
> > > > > > >
> > > > > > > Best,
> > > > > > > Aitozi
> > > > > > >
> > > > > > > [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > > > > [2]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929