[ 
https://issues.apache.org/jira/browse/FLINK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897097#comment-15897097
 ] 

Stephan Ewen commented on FLINK-5706:
-------------------------------------

[~steve_l] Thanks for joining the discussion and for sharing all the experience 
you have gathered. This issue is certainly up for discussion, we have not done 
any work there, yet.

To avoid confusion, let me start by saying that we don't want to drop all 
Hadoop-related code in Flink; we simply want to make Hadoop an optional 
dependency. Hadoop and its transitive dependencies have frequently clashed with 
dependencies from users who work on non-Hadoop stacks, so we wanted to make 
Flink's required dependencies as slim as possible, with optional modules for 
Hadoop and other stacks that can be added. One place where this was causing 
problems was around the S3 connectors.

One point that I should mention is that Flink's FileSystem abstraction is much 
less rich than Hadoop's FileSystem abstraction.
Furthermore, the requirements we have for writing checkpoints to 
that destination are very small:
  - We don't make any {{list()}}, {{getStatus()}}, or any form of {{exists()}} 
calls in the checkpointing path (from Flink 1.3 on).
  - We really only rely on raw create consistency, meaning that after an object 
has been created, everyone can open a stream with the absolute path.
  - We tried to spell out what we need here: 
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L71
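To make the create-consistency point above concrete, here is a minimal sketch of the contract the checkpointing path relies on. This is illustrative only: the class name and helper method are invented, and a local temp file stands in for the object store.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the only guarantee checkpointing needs: once an object has
// been created and the stream closed, anyone holding the absolute path
// can open and read it. No list(), getStatus(), or exists() call is
// involved. (Hypothetical demo class, not Flink code.)
public class CreateConsistencyDemo {

    static String writeThenRead(String payload) throws IOException {
        Path checkpoint = Files.createTempFile("chk-", ".bin");
        try {
            // "create": write the state and close the stream
            Files.write(checkpoint, payload.getBytes(StandardCharsets.UTF_8));
            // "open after create": a reader holding the absolute path
            // must see the full data
            return new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(checkpoint);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeThenRead("state-bytes"));
    }
}
```

On S3, this is exactly the read-after-write consistency story for new objects, which is why the checkpointing path can get away with so little of the file system API.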

There is code that uses more file system functionality, in the sink operators 
that write batch or rolling streaming results.

What do you think about the general idea of having something like a 
{{SimpleConnector}} that mainly allows opening streams to specific absolute 
paths? That one could do direct writes and avoid a lot of the more complex 
logic. The more feature-rich FileSystem abstraction (which supports listing, 
renames, etc.) could still be based on Hadoop's S3a file system.
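To sketch what such a {{SimpleConnector}} might look like: the name comes from the comment above, but both method signatures and the in-memory stand-in below are assumptions for illustration, not a proposed Flink API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shape of the proposed connector: nothing but creating
// and opening streams against absolute paths.
interface SimpleConnector {
    OutputStream create(String absolutePath) throws IOException;
    InputStream open(String absolutePath) throws IOException;
}

// In-memory stand-in, just enough to exercise the contract. A real
// implementation would do direct puts/gets against S3 via the AWS SDK.
class InMemoryConnector implements SimpleConnector {
    private final Map<String, byte[]> objects = new ConcurrentHashMap<>();

    @Override
    public OutputStream create(String absolutePath) {
        return new ByteArrayOutputStream() {
            @Override
            public void close() {
                // publishing on close models "after creation, everyone
                // can open a stream with the absolute path"
                objects.put(absolutePath, toByteArray());
            }
        };
    }

    @Override
    public InputStream open(String absolutePath) throws IOException {
        byte[] data = objects.get(absolutePath);
        if (data == null) {
            throw new IOException("No object at " + absolutePath);
        }
        return new ByteArrayInputStream(data);
    }
}
```

The appeal of such a narrow interface is that it needs no rename, no directory emulation, and no listing, which is where most of the complexity (and most of the consistency trouble) in the Hadoop S3 file systems lives.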


> Implement Flink's own S3 filesystem
> -----------------------------------
>
>                 Key: FLINK-5706
>                 URL: https://issues.apache.org/jira/browse/FLINK-5706
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Stephan Ewen
>
> As part of the effort to make Flink completely independent from Hadoop, Flink 
> needs its own S3 filesystem implementation. Currently Flink relies on 
> Hadoop's S3a and S3n file systems.
> A dedicated S3 file system can be implemented using the AWS SDK. As the basis 
> of the implementation, the Hadoop file system can be used (Apache licensed, 
> so it should be okay to reuse some code as long as we do proper attribution).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
