Hi all, Thanks for your input. @Ran > However, as mentioned in the issue you listed, it may take a lot of work and the community's consideration for integrating Arrow.
To clarify, this proposal solely aims to introduce flink-arrow as a new format, similar to flink-csv and flink-protobuf. It will not impact the internal data structure representation in Flink. For proof of concept, please refer to: https://github.com/Aitozi/flink/commits/arrow-format. @Martijn > I'm wondering if there's really much benefit for the Flink project to add another file format, over properly supporting the format that we already have in the project. Maintain the format we already have and introduce new formats should be orthogonal. The requirement of supporting arrow format originally observed in our internal usage to deserialize the data(VectorSchemaRoot) from other storage systems to flink internal RowData and serialize the flink internal RowData to VectorSchemaRoot out to the storage system. And the requirement from the slack[1] is to support the arrow file format. Although, Arrow is not usually used as the final disk storage format. But it has a tendency to be used as the inter-exchange format between different systems or temporary storage for analysis due to its columnar format and can be memory mapped to other analysis programs. So, I think it's meaningful to support arrow formats in Flink. @Jim > If the Flink format interface is used there, then it may be useful to consider Arrow along with other columnar formats. I am not well-versed with the formats utilized in Paimon. Upon checking [2], it appears that Paimon does not directly employ flink formats. Instead, it utilizes FormatWriterFactory and FormatReaderFactory to handle data serialization and deserialization. Therefore, I believe that the current work may not be applicable for reuse in Paimon at this time. Best, Aitozi. [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629 [2]: https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format Jim Hughes <jhug...@confluent.io.invalid> 于2023年3月31日周五 00:36写道: > > Hi all, > > How do Flink formats relate to or interact with Paimon (formerly > Flink-Table-Store)? If the Flink format interface is used there, then it > may be useful to consider Arrow along with other columnar formats. > > Separately, from previous experience, I've seen the Arrow format be useful > as an output format for clients to read efficiently. Arrow does support > returning batches of records, so there may be some options to use the > format in a streaming situation where a sufficient collection of records > can be gathered. > > Cheers, > > Jim > > > > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <martijnvis...@apache.org> > wrote: > > > Hi, > > > > To be honest, I haven't seen that much demand for supporting the Arrow > > format directly in Flink as a flink-format. I'm wondering if there's really > > much benefit for the Flink project to add another file format, over > > properly supporting the format that we already have in the project. > > > > Best regards, > > > > Martijn > > > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <chucheng...@gmail.com> wrote: > > > > > It is a good point that flink integrates apache arrow as a format. > > > Arrow can take advantage of SIMD-specific or vectorized optimizations, > > > which should be of great benefit to batch tasks. > > > However, as mentioned in the issue you listed, it may take a lot of work > > > and the community's consideration for integrating Arrow. > > > > > > I think you can try to make a simple poc for verification and some > > specific > > > plans. > > > > > > > > > Best Regards, > > > Ran Tao > > > > > > > > > Aitozi <gjying1...@gmail.com> 于2023年3月29日周三 19:12写道: > > > > > > > Hi guys > > > > I'm opening this thread to discuss supporting the Apache Arrow > > > format > > > > in Flink. > > > > Arrow is a language-independent columnar memory format that has > > > become > > > > widely used in different systems, and It can also serve as an > > > > inter-exchange format between other systems. > > > > So, using it directly in the Flink system will be nice. We also > > received > > > > some requests from slack[1][2] and jira[3]. > > > > In our company's internal usage, we have used flink-python > > moudle's > > > > ArrowReader and ArrowWriter to support this work. But it still can not > > > > integrate with the current flink-formats framework closely. > > > > So, I'd like to introduce the flink-arrow formats module to support the > > > > arrow format naturally. > > > > Looking forward to some suggestions. > > > > > > > > > > > > Best, > > > > Aitozi > > > > > > > > > > > > [1]: > > > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629 > > > > > > > > [2]: > > > https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789 > > > > > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929 > > > > > > > > >