[ https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826677#comment-16826677 ]
Liya Fan edited comment on FLINK-10929 at 4/30/19 3:01 AM: ----------------------------------------------------------- Hi [Stephan Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen] and [Kurt Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836], thanks a lot for your valuable comments. It seems there is not much debate for point 1). For point 2), I want to list a few advantages of using Arrow as internal data structures for Flink. These conclusions are based on our investigations of research papers, as well as our engineering experiences in recent development on Blink. # Arrow adopts columnar data format, which makes it possible to apply SIMD ([link|[http://prestodb.rocks/code/simd/])|http://prestodb.rocks/code/simd/%5d).]. SIMD means higher performance. Java starts to support SIMD in Java 8, and it is expected that high versions of Java will provide better support for SIMD. # Arrow provides more compact memory layout. To make it simple, encoding a batch of rows as Arrow vectors will require much less memory space, compared with encoding by BinaryRow. This, in turn, leads to two results: ## With the same amount of memory space, more data can be loaded into memory, and fewer spills are required. ## The cost of shuffling can be reduced significantly. For example, the amount of shuffled data for TPC-H Q4 reduces to almost the half. # Columnar format makes it much easier for data compression (see [paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example). Our initial experiments have shown some positive results. There are more advantages for Arrow, but I just list a few, due to space constraints. The TPC-H performance in the attachment speaks for itself. We understand and respect that Blink is currently being merged, and some changes should be postponed. We would like to provide our devoted help in this process. However, too much change is not a good reason to refuse Arrow, if we want to make Flink the world-top-level computing engine. All these difficulties can be overcome. We have a plan to carry out the changes incrementally. was (Author: fan_li_ya): Hi [Stephan Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen] and [Kurt Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836], thanks a lot for your valuable comments. It seems there is not much debate for point 1). For point 2), I want to list a few advantages of using Arrow as internal data structures for Flink. These conclusions are based on our investigations of research papers, as well as our engineering experiences in secondary development on Blink. # Arrow adopts columnar data format, which makes it possible to apply SIMD ([link|[http://prestodb.rocks/code/simd/])|http://prestodb.rocks/code/simd/%5d).]. SIMD means higher performance. Java starts to support SIMD in Java 8, and it is expected that high versions of Java will provide better support for SIMD. # Arrow provides more compact memory layout. To make it simple, encoding a batch of rows as Arrow vectors will require much less memory space, compared with encoding by BinaryRow. This, in turn, leads to two results: ## With the same amount of memory space, more data can be loaded into memory, and fewer spills are required. ## The cost of shuffling can be reduced significantly. For example, the amount of shuffled data for TPC-H Q4 reduces to almost the half. # Columnar format makes it much easier for data compression (see [paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example). Our initial experiments have shown some positive results. There are more advantages for Arrow, but I just list a few, due to space constraints. The TPC-H performance in the attachment speaks for itself. We understand and respect that Blink is currently being merged, and some changes should be postponed. We would like to provide our devoted help in this process. However, too much change is not a good reason to refuse Arrow, if we want to make Flink the world-top-level computing engine. All these difficulties can be overcome. We have a plan to carry out the changes incrementally. > Add support for Apache Arrow > ---------------------------- > > Key: FLINK-10929 > URL: https://issues.apache.org/jira/browse/FLINK-10929 > Project: Flink > Issue Type: Wish > Components: Runtime / State Backends > Reporter: Pedro Cardoso Silva > Priority: Minor > Attachments: image-2019-04-10-13-43-08-107.png > > > Investigate the possibility of adding support for Apache Arrow as a > standardized columnar, memory format for data. > Given the activity that [https://github.com/apache/arrow] is currently > getting and its claims objective of providing a zero-copy, standardized data > format across platforms, I think it makes sense for Flink to look into > supporting it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)