[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

Jamie Grier (Jira) Fri, 09 Oct 2020 17:12:35 -0700


    [ https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211477#comment-17211477 ]


Jamie Grier commented on FLINK-19481:
-------------------------------------

I think a native GCS filesystem would be a major benefit to Flink users.  The 
only way to support GCS currently is, as stated, through the Hadoop FileSystem 
implementation, which brings several problems along with it.  The two largest 
problems I've experienced are:

1) Hadoop has a huge dependency footprint, which is a significant headache for 
Flink application developers dealing with dependency hell.

2) The total stack of FileSystem abstractions on this path becomes very 
difficult to tune, understand, and support.  By stack I'm referring to Flink's 
own FileSystem abstraction, then the Hadoop layer, then the GCS libraries.  
This is very difficult to work with in production as each layer has its own 
intricacies, connection pools, thread pools, tunable configuration, versions, 
dependency versions, etc.

Having gone down this path with the old-style Hadoop+S3 filesystem approach, I 
know how difficult it can be.  A native implementation should prove to be 
much simpler to support and easier to tune and modify for performance.  This is 
why the presto-s3-fs filesystem was adopted, for example.

 

> Add support for a flink native GCS FileSystem
> ---------------------------------------------
>
>                 Key: FLINK-19481
>                 URL: https://issues.apache.org/jira/browse/FLINK-19481
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, FileSystems
>    Affects Versions: 1.12.0
>            Reporter: Ben Augarten
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Currently, GCS is supported, but only by using the Hadoop connector [1].
>  
> The objective of this improvement is to add support for checkpointing to 
> Google Cloud Storage with the Flink FileSystem.
>  
> This would allow the `gs://` scheme to be used for savepointing and 
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem 
> as a source and sink in Flink jobs as well. 
>  
> I also hope that implementing a Flink-native GCS FileSystem will 
> simplify usage of GCS, because the Hadoop FileSystem ends up bringing in many 
> unshaded dependencies.
>  
> [1] https://github.com/GoogleCloudDataproc/hadoop-connectors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
