Which connectors would be commonly used when reading in Arrow format?
Filesystem?

On Wed, Apr 12, 2023 at 4:27 AM Jacky Lau <liuyong...@gmail.com> wrote:

> Hi
>    I also think the Arrow format will be useful when reading from and writing
> to message queues.
>    Arrow defines a language-independent columnar memory format for flat and
> hierarchical data, organized for efficient analytic operations on modern
> hardware like CPUs and GPUs. The Arrow memory format also supports
> zero-copy reads for lightning-fast data access without serialization
> overhead. It would bring a lot of benefit.
>    We could also survey how other engines such as Spark/Hive/Presto support
> Arrow and how it is used there.
>
>    Best,
>    Jacky.
>
> Aitozi <gjying1...@gmail.com> wrote on Sun, Apr 2, 2023 at 22:22:
>
> > Hi all,
> >     Thanks for your input.
> >
> > @Ran > However, as mentioned in the issue you listed, it may take a lot of
> > work and the community's consideration for integrating Arrow.
> >
> > To clarify, this proposal solely aims to introduce flink-arrow as a new
> > format, similar to flink-csv and flink-protobuf. It will not impact the
> > internal data structure representation in Flink. For a proof of concept,
> > please refer to:
> > https://github.com/Aitozi/flink/commits/arrow-format.
> >
> > @Martijn > I'm wondering if there's really much benefit for the Flink
> > project to add another file format, over properly supporting the format
> > that we already have in the project.
> >
> > Maintaining the formats we already have and introducing new formats should
> > be orthogonal. The requirement to support the Arrow format was originally
> > observed in our internal usage: deserializing data (VectorSchemaRoot) from
> > other storage systems into Flink's internal RowData, and serializing
> > Flink's internal RowData to VectorSchemaRoot for writing out to the storage
> > system. The requirement from Slack [1] is to support the Arrow file format.
> > Arrow is not usually used as the final on-disk storage format, but it
> > tends to be used as the inter-exchange format between different systems or
> > as temporary storage for analysis, because it is columnar and can be
> > memory-mapped by other analysis programs.
> >
> > So, I think it's meaningful to support arrow formats in Flink.
> >
> > @Jim >  If the Flink format interface is used there, then it may be useful
> > to consider Arrow along with other columnar formats.
> >
> > I am not well-versed in the formats used in Paimon. Upon checking [2], it
> > appears that Paimon does not directly use Flink formats. Instead, it uses
> > FormatWriterFactory and FormatReaderFactory to handle data serialization
> > and deserialization. Therefore, I believe that the current work may not be
> > reusable in Paimon at this time.
> >
> > Best,
> > Aitozi.
> >
> > [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > [2]: https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
> >
> > Jim Hughes <jhug...@confluent.io.invalid> wrote on Fri, Mar 31, 2023 at 00:36:
> > >
> > > Hi all,
> > >
> > > > How do Flink formats relate to or interact with Paimon (formerly
> > > > Flink-Table-Store)?  If the Flink format interface is used there, then it
> > > > may be useful to consider Arrow along with other columnar formats.
> > >
> > > > Separately, from previous experience, I've seen the Arrow format be useful
> > > > as an output format for clients to read efficiently.  Arrow does support
> > > > returning batches of records, so there may be some options to use the
> > > > format in a streaming situation where a sufficient collection of records
> > > > can be gathered.
> > >
> > > Cheers,
> > >
> > > Jim
> > >
> > >
> > >
> > > > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <martijnvis...@apache.org>
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > > > To be honest, I haven't seen that much demand for supporting the Arrow
> > > > > format directly in Flink as a flink-format. I'm wondering if there's really
> > > > > much benefit for the Flink project to add another file format, over
> > > > > properly supporting the format that we already have in the project.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <chucheng...@gmail.com> wrote:
> > > >
> > > > > > It is a good idea for Flink to integrate Apache Arrow as a format.
> > > > > > Arrow can take advantage of SIMD-specific or vectorized optimizations,
> > > > > > which should be of great benefit to batch tasks.
> > > > > > However, as mentioned in the issue you listed, it may take a lot of work
> > > > > > and the community's consideration for integrating Arrow.
> > > > > >
> > > > > > I think you can try to make a simple PoC for verification and some
> > > > > > specific plans.
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > Ran Tao
> > > > >
> > > > >
> > > > > > Aitozi <gjying1...@gmail.com> wrote on Wed, Mar 29, 2023 at 19:12:
> > > > >
> > > > > > > Hi guys
> > > > > > >      I'm opening this thread to discuss supporting the Apache Arrow
> > > > > > > format in Flink.
> > > > > > >      Arrow is a language-independent columnar memory format that has
> > > > > > > become widely used in different systems, and it can also serve as an
> > > > > > > inter-exchange format between systems.
> > > > > > > So, using it directly in Flink would be nice. We have also received
> > > > > > > some requests from Slack [1][2] and Jira [3].
> > > > > > >      In our company's internal usage, we have used the flink-python
> > > > > > > module's ArrowReader and ArrowWriter to support this work. But it still
> > > > > > > cannot integrate closely with the current flink-formats framework.
> > > > > > > So, I'd like to introduce a flink-arrow formats module to support the
> > > > > > > Arrow format naturally.
> > > > > > >      Looking forward to your suggestions.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Aitozi
> > > > > >
> > > > > >
> > > > > > [1]:
> > > > >
> > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > > >
> > > > > > [2]:
> > > > >
> > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > > >
> > > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929
> > > > > >
> > > > >
> > > >
> >
>
