[jira] [Comment Edited] (FLINK-10929) Add support for Apache Arrow

Liya Fan (JIRA) Mon, 29 Apr 2019 20:02:15 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826677#comment-16826677
 ]


Liya Fan edited comment on FLINK-10929 at 4/30/19 3:01 AM:
-----------------------------------------------------------

Hi [Stephan 
Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen] 
and [Kurt 
Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836], 
thanks a lot for your valuable comments. 

It seems there is not much debate for point 1). 

For point 2), I want to list a few advantages of using Arrow as internal data 
structures for Flink. These conclusions are based on our investigations of 
research papers, as well as our engineering experiences in recent development 
on Blink.
 # Arrow adopts columnar data format, which makes it possible to apply SIMD 
([link|[http://prestodb.rocks/code/simd/])|http://prestodb.rocks/code/simd/%5d).].
 SIMD means higher performance. Java starts to support SIMD in Java 8, and it 
is expected that high versions of Java will provide better support for SIMD. 
 # Arrow provides more compact memory layout. To make it simple, encoding a 
batch of rows as Arrow vectors will require much less memory space, compared 
with encoding by BinaryRow. This, in turn, leads to two results:
 ## With the same amount of memory space, more data can be loaded into memory, 
and fewer spills are required.
 ## The cost of shuffling can be reduced significantly. For example, the amount 
of shuffled data for TPC-H Q4 reduces to almost the half. 
 # Columnar format makes it much easier for data compression (see 
[paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example). 
Our initial experiments have shown some positive results.

There are more advantages for Arrow, but I just list a few, due to space 
constraints. The TPC-H performance in the attachment speaks for itself.

We understand and respect that Blink is currently being merged, and some 
changes should be postponed. We would like to provide our devoted help in this 
process.

However, too much change is not a good reason to refuse Arrow, if we want to 
make Flink the world-top-level computing engine. All these difficulties can be 
overcome. We have a plan to carry out the changes incrementally.

 

 


was (Author: fan_li_ya):
Hi [Stephan 
Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen] 
and [Kurt 
Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836], 
thanks a lot for your valuable comments. 

It seems there is not much debate for point 1). 

For point 2), I want to list a few advantages of using Arrow as internal data 
structures for Flink. These conclusions are based on our investigations of 
research papers, as well as our engineering experiences in secondary 
development on Blink.
 # Arrow adopts columnar data format, which makes it possible to apply SIMD 
([link|[http://prestodb.rocks/code/simd/])|http://prestodb.rocks/code/simd/%5d).].
 SIMD means higher performance. Java starts to support SIMD in Java 8, and it 
is expected that high versions of Java will provide better support for SIMD. 
 # Arrow provides more compact memory layout. To make it simple, encoding a 
batch of rows as Arrow vectors will require much less memory space, compared 
with encoding by BinaryRow. This, in turn, leads to two results:
 ## With the same amount of memory space, more data can be loaded into memory, 
and fewer spills are required.
 ## The cost of shuffling can be reduced significantly. For example, the amount 
of shuffled data for TPC-H Q4 reduces to almost the half. 
 # Columnar format makes it much easier for data compression (see 
[paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example). 
Our initial experiments have shown some positive results.

There are more advantages for Arrow, but I just list a few, due to space 
constraints. The TPC-H performance in the attachment speaks for itself.

We understand and respect that Blink is currently being merged, and some 
changes should be postponed. We would like to provide our devoted help in this 
process.

However, too much change is not a good reason to refuse Arrow, if we want to 
make Flink the world-top-level computing engine. All these difficulties can be 
overcome. We have a plan to carry out the changes incrementally.

 

 

> Add support for Apache Arrow
> ----------------------------
>
>                 Key: FLINK-10929
>                 URL: https://issues.apache.org/jira/browse/FLINK-10929
>             Project: Flink
>          Issue Type: Wish
>          Components: Runtime / State Backends
>            Reporter: Pedro Cardoso Silva
>            Priority: Minor
>         Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a 
> standardized columnar, memory format for data.
> Given the activity that [https://github.com/apache/arrow] is currently 
> getting and its claims objective of providing a zero-copy, standardized data 
> format across platforms, I think it makes sense for Flink to look into 
> supporting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (FLINK-10929) Add support for Apache Arrow

Reply via email to