> Which connectors would be commonly used when reading in Arrow format? Filesystem?

Currently, yes. Ideally it could also be combined with other connectors, but I
have not yet figured out how to integrate the Arrow deserializer with the
`DecodingFormat` or `DeserializationSchema` interfaces. So, as a first step, I
want to introduce it as a file bulk format.
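To make that first step a bit more concrete, the read side boils down to
decoding Arrow IPC record batches into `RowData`. A minimal sketch of that core
loop (class names like `ArrowToRowDataSketch` are purely illustrative, and I am
assuming the `ArrowReader`/`ArrowUtils` helpers from the flink-python module;
the exact factory method name may differ between Flink versions):

```java
import java.io.InputStream;
import java.util.function.Consumer;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.arrow.ArrowReader;
import org.apache.flink.table.runtime.arrow.ArrowUtils;
import org.apache.flink.table.types.logical.RowType;

/** Illustrative sketch only: decode an Arrow IPC stream into Flink RowData. */
public final class ArrowToRowDataSketch {

    public static void readAll(InputStream in, RowType rowType, Consumer<RowData> out)
            throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
                ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            // Assumption: ArrowUtils#createArrowReader comes from the flink-python
            // module's Arrow utilities; the exact factory name may differ per version.
            ArrowReader arrowReader = ArrowUtils.createArrowReader(root, rowType);
            // Each loadNextBatch() call (re)fills the VectorSchemaRoot with one record batch.
            while (reader.loadNextBatch()) {
                for (int i = 0; i < root.getRowCount(); i++) {
                    // The returned RowData is a view over the current batch, so it must
                    // be consumed (or copied) before the next batch is loaded.
                    out.accept(arrowReader.read(i));
                }
            }
        }
    }
}
```

Wrapping this loop behind the file source's `BulkFormat`/`StreamFormat` reader
lifecycle would then be the actual work of the first step.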
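For the reverse direction (serializing `RowData` back into Arrow's
`VectorSchemaRoot`, which is what we already do internally today, as mentioned
further down in the thread), the flow with the flink-python Arrow utilities
looks roughly like the following. Again, treat the helper names
(`toArrowSchema`, `createRowDataArrowWriter`) as approximations rather than a
final API:

```java
import java.io.OutputStream;
import java.util.List;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.arrow.ArrowUtils;
import org.apache.flink.table.runtime.arrow.ArrowWriter;
import org.apache.flink.table.types.logical.RowType;

/** Illustrative sketch only: serialize Flink RowData into an Arrow IPC stream. */
public final class RowDataToArrowSketch {

    public static void writeAll(List<RowData> rows, RowType rowType, OutputStream out)
            throws Exception {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
                // Assumption: ArrowUtils#toArrowSchema maps the Flink RowType to an Arrow schema.
                VectorSchemaRoot root =
                        VectorSchemaRoot.create(ArrowUtils.toArrowSchema(rowType), allocator);
                ArrowStreamWriter streamWriter = new ArrowStreamWriter(root, null, out)) {
            streamWriter.start();
            // Assumption: factory name mirrors what the flink-python module uses internally;
            // it may be createArrowWriter/createRowDataArrowWriter depending on the version.
            ArrowWriter<RowData> writer = ArrowUtils.createRowDataArrowWriter(root, rowType);
            for (RowData row : rows) {
                writer.write(row);
            }
            writer.finish();           // flush the buffered rows into the Arrow vectors
            streamWriter.writeBatch(); // emit one Arrow record batch
        }
    }
}
```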
On Wed, Apr 12, 2023 at 22:53, Martijn Visser <martijnvis...@apache.org> wrote:

> Which connectors would be commonly used when reading in Arrow format?
> Filesystem?
>
> On Wed, Apr 12, 2023 at 4:27 AM Jacky Lau <liuyong...@gmail.com> wrote:
>
> > Hi,
> > I also think the Arrow format will be useful when reading/writing with
> > message queues.
> > Arrow defines a language-independent columnar memory format for flat and
> > hierarchical data, organized for efficient analytic operations on modern
> > hardware like CPUs and GPUs. The Arrow memory format also supports
> > zero-copy reads for lightning-fast data access without serialization
> > overhead, so it would bring a lot of benefit.
> > We may also do some surveys on how other engines such as Spark/Hive/Presto
> > support Arrow and how it is used there.
> >
> > Best,
> > Jacky
> >
> > On Sun, Apr 2, 2023 at 22:22, Aitozi <gjying1...@gmail.com> wrote:
> >
> > > Hi all,
> > > Thanks for your input.
> > >
> > > @Ran > However, as mentioned in the issue you listed, it may take a lot
> > > of work and the community's consideration for integrating Arrow.
> > >
> > > To clarify, this proposal solely aims to introduce flink-arrow as a new
> > > format, similar to flink-csv and flink-protobuf. It will not impact the
> > > internal data structure representation in Flink. For proof of concept,
> > > please refer to: https://github.com/Aitozi/flink/commits/arrow-format.
> > >
> > > @Martijn > I'm wondering if there's really much benefit for the Flink
> > > project to add another file format, over properly supporting the format
> > > that we already have in the project.
> > >
> > > Maintaining the formats we already have and introducing new formats
> > > should be orthogonal. The requirement to support the Arrow format was
> > > originally observed in our internal usage: deserializing data
> > > (VectorSchemaRoot) from other storage systems into Flink's internal
> > > RowData, and serializing Flink's internal RowData back to
> > > VectorSchemaRoot for the storage system. The request from Slack [1] is
> > > to support the Arrow file format. Although Arrow is not usually used as
> > > the final on-disk storage format, it tends to be used as an
> > > inter-exchange format between different systems or as temporary storage
> > > for analysis, because of its columnar layout and because it can be
> > > memory-mapped by other analysis programs.
> > >
> > > So, I think it's meaningful to support the Arrow format in Flink.
> > >
> > > @Jim > If the Flink format interface is used there, then it may be
> > > useful to consider Arrow along with other columnar formats.
> > >
> > > I am not well-versed in the formats utilized in Paimon. Upon checking
> > > [2], it appears that Paimon does not directly employ Flink formats.
> > > Instead, it utilizes FormatWriterFactory and FormatReaderFactory to
> > > handle data serialization and deserialization. Therefore, I believe the
> > > current work may not be applicable for reuse in Paimon at this time.
> > >
> > > Best,
> > > Aitozi
> > >
> > > [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > [2]: https://github.com/apache/incubator-paimon/tree/master/paimon-format/src/main/java/org/apache/paimon/format
> > >
> > > On Fri, Mar 31, 2023 at 00:36, Jim Hughes <jhug...@confluent.io.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > How do Flink formats relate to or interact with Paimon (formerly
> > > > Flink-Table-Store)? If the Flink format interface is used there, then
> > > > it may be useful to consider Arrow along with other columnar formats.
> > > >
> > > > Separately, from previous experience, I've seen the Arrow format be
> > > > useful as an output format for clients to read efficiently. Arrow does
> > > > support returning batches of records, so there may be some options to
> > > > use the format in a streaming situation where a sufficient collection
> > > > of records can be gathered.
> > > >
> > > > Cheers,
> > > >
> > > > Jim
> > > >
> > > > On Thu, Mar 30, 2023 at 8:32 AM Martijn Visser <martijnvis...@apache.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > To be honest, I haven't seen that much demand for supporting the
> > > > > Arrow format directly in Flink as a flink-format. I'm wondering if
> > > > > there's really much benefit for the Flink project to add another
> > > > > file format, over properly supporting the format that we already
> > > > > have in the project.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Thu, Mar 30, 2023 at 2:21 PM Ran Tao <chucheng...@gmail.com> wrote:
> > > > >
> > > > > > It is a good point for Flink to integrate Apache Arrow as a format.
> > > > > > Arrow can take advantage of SIMD-specific or vectorized
> > > > > > optimizations, which should be of great benefit to batch tasks.
> > > > > > However, as mentioned in the issue you listed, it may take a lot of
> > > > > > work and the community's consideration to integrate Arrow.
> > > > > >
> > > > > > I think you can try to make a simple PoC for verification and some
> > > > > > specific plans.
> > > > > >
> > > > > > Best Regards,
> > > > > > Ran Tao
> > > > > >
> > > > > > On Wed, Mar 29, 2023 at 19:12, Aitozi <gjying1...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > > I'm opening this thread to discuss supporting the Apache Arrow
> > > > > > > format in Flink.
> > > > > > > Arrow is a language-independent columnar memory format that has
> > > > > > > become widely used in different systems, and it can also serve
> > > > > > > as an inter-exchange format between other systems. So, using it
> > > > > > > directly in the Flink system would be nice. We have also received
> > > > > > > some requests from Slack [1][2] and JIRA [3].
> > > > > > > In our company's internal usage, we have used the flink-python
> > > > > > > module's ArrowReader and ArrowWriter to support this work, but
> > > > > > > they still cannot integrate closely with the current
> > > > > > > flink-formats framework.
> > > > > > > So, I'd like to introduce a flink-arrow formats module to support
> > > > > > > the Arrow format naturally.
> > > > > > > Looking forward to some suggestions.
> > > > > > >
> > > > > > > Best,
> > > > > > > Aitozi
> > > > > > >
> > > > > > > [1]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1677915016551629
> > > > > > > [2]: https://apache-flink.slack.com/archives/C03GV7L3G2C/p1666326443152789
> > > > > > > [3]: https://issues.apache.org/jira/browse/FLINK-10929