Hi,

I also think the Arrow format will be useful when reading from and writing to message queues. Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads, giving fast data access without serialization overhead, so it should bring a lot of benefit. We may also want to survey other engines such as Spark, Hive, and Presto: whether they support Arrow, how they support it, and how it is used.

Best,
Jacky.

Aitozi <gjying1...@gmail.com> wrote on Sun, Apr 2, 2023 at 22:22:

> Hi all,
> Thanks for your input.
>
> @Ran
> > However, as mentioned in the issue you listed, it may take a lot of work and the community's consideration for integrating Arrow.
>
> To clarify, this proposal solely aims to introduce flink-arrow as a new format, similar to flink-csv and flink-protobuf. It will not impact the internal data structure representation in Flink. For proof of concept, please refer to: https://github.com/Aitozi/flink/commits/arrow-format.
>
> @Martijn
> > I'm wondering if there's really much benefit for the Flink project to add another file format, over properly supporting the format that we already have in the project.
>
> Maintaining the formats we already have and introducing new formats should be orthogonal. The requirement to support the Arrow format was originally observed in our internal usage: deserializing data (VectorSchemaRoot) from other storage systems into Flink's internal RowData, and serializing Flink's internal RowData to VectorSchemaRoot for output to the storage system. The requirement from Slack [1] is to support the Arrow file format. Although Arrow is not usually used as the final disk storage format, it tends to be used as the inter-exchange format between different systems, or as temporary storage for analysis, because of its columnar format and because it can be memory-mapped into other analysis programs.
>
> So, I think it's meaningful to support Arrow formats in Flink.
>
> @Jim
> > If the Flink format interface is used there, then it may be useful to consider Arrow along with other columnar formats.
>
> I am not well-versed in the formats utilized in Paimon. Upon checking [2], it appears that Paimon does not directly employ Flink formats. Instead, it utilizes FormatWriterFactory and FormatReaderFactory to handle data serialization and deserialization.
> Therefore, I believe that the current work may not be applicable for reuse in Paimon at this time.
>
> Best,
> Aitozi.
>
> [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> [2]: https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
>
> Jim Hughes <jhug...@confluent.io.invalid> wrote on Fri, Mar 31, 2023 at 00:36:
>
> > Hi all,
> >
> > How do Flink formats relate to or interact with Paimon (formerly Flink-Table-Store)? If the Flink format interface is used there, then it may be useful to consider Arrow along with other columnar formats.
> >
> > Separately, from previous experience, I've seen the Arrow format be useful as an output format for clients to read efficiently. Arrow does support returning batches of records, so there may be some options to use the format in a streaming situation where a sufficient collection of records can be gathered.
> >
> > Cheers,
> >
> > Jim
> >
> > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <martijnvis...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > To be honest, I haven't seen that much demand for supporting the Arrow format directly in Flink as a flink-format. I'm wondering if there's really much benefit for the Flink project to add another file format, over properly supporting the format that we already have in the project.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <chucheng...@gmail.com> wrote:
> > >
> > > > It is a good point that flink integrates apache arrow as a format.
> > > > Arrow can take advantage of SIMD-specific or vectorized optimizations, which should be of great benefit to batch tasks.
> > > > However, as mentioned in the issue you listed, it may take a lot of work and the community's consideration for integrating Arrow.
> > > >
> > > > I think you can try to make a simple poc for verification and some specific plans.
> > > >
> > > > Best Regards,
> > > > Ran Tao
> > > >
> > > > Aitozi <gjying1...@gmail.com> wrote on Wed, Mar 29, 2023 at 19:12:
> > > >
> > > > > Hi guys,
> > > > > I'm opening this thread to discuss supporting the Apache Arrow format in Flink.
> > > > > Arrow is a language-independent columnar memory format that has become widely used in different systems, and it can also serve as an inter-exchange format between other systems.
> > > > > So, using it directly in the Flink system will be nice. We also received some requests from Slack [1][2] and Jira [3].
> > > > > In our company's internal usage, we have used the flink-python module's ArrowReader and ArrowWriter to support this work. But it still cannot integrate closely with the current flink-formats framework.
> > > > > So, I'd like to introduce the flink-arrow formats module to support the Arrow format naturally.
> > > > > Looking forward to some suggestions.
> > > > >
> > > > > Best,
> > > > > Aitozi
> > > > >
> > > > > [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > > [2]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929