[ https://issues.apache.org/jira/browse/FLINK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897097#comment-15897097 ]
Stephan Ewen commented on FLINK-5706:
-------------------------------------

[~steve_l] Thanks for joining the discussion and for sharing all the experience you have gathered. This issue is certainly up for discussion; we have not done any work there yet.

To avoid confusion, let me start by saying that we don't want to drop all Hadoop-related code in Flink; we simply want to make Hadoop an optional dependency. Hadoop and its transitive dependencies have frequently clashed with dependencies from users that work on non-Hadoop stacks, so we want to keep Flink's required dependencies as slim as possible and offer optional modules for Hadoop and other stacks that can be added on top. One place where this caused problems was the S3 connectors.

One point I should mention is that Flink's FileSystem abstraction is much less rich than Hadoop's FileSystem abstraction. Furthermore, the requirements we have for writing checkpoints to such a destination are very small:
- We don't do any {{list()}}, {{getStatus()}}, or any form of {{exists()}} calls in the checkpointing path (from Flink 1.3 on).
- We really only rely on raw create consistency, meaning that after an object has been created, everyone can open a stream to the absolute path.
- We tried to spell out what we need here: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L71

There is code that uses more file system functionality in the sink operators that write batch or rolling streaming results.

What do you think about the general idea of having something like a {{SimpleConnector}} that mainly allows opening streams to specific absolute paths? That one could do direct writes and avoid a lot of the more complex logic. The more feature-rich FileSystem abstraction (that supports listing, renames, etc.) could still be based on Hadoop's S3a file system.
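To make the idea concrete, a {{SimpleConnector}} along these lines might only need a create and an open operation. The sketch below is purely illustrative (the interface name and an in-memory implementation are hypothetical, not Flink or AWS API); it only demonstrates the create-consistency contract described above, where an object becomes readable at its absolute path once the creating stream is closed:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal connector: only direct create-and-read of absolute paths. */
interface SimpleConnector {
    /** Direct write to the final path; no temp files, no rename. */
    OutputStream create(URI path) throws IOException;

    /** Must succeed for any path whose create() stream was closed (create consistency). */
    InputStream open(URI path) throws IOException;
}

/** In-memory stand-in used only to illustrate the contract. */
class InMemoryConnector implements SimpleConnector {
    private final Map<URI, byte[]> objects = new HashMap<>();

    @Override
    public OutputStream create(URI path) {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                // The object becomes visible atomically when the stream is closed.
                objects.put(path, toByteArray());
            }
        };
    }

    @Override
    public InputStream open(URI path) throws IOException {
        byte[] data = objects.get(path);
        if (data == null) {
            throw new FileNotFoundException(path.toString());
        }
        return new ByteArrayInputStream(data);
    }
}
```

A checkpoint writer backed by such a connector would write each state file directly to its final destination and hand the absolute path to whoever needs to read it later, with no listing or renaming involved.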
> Implement Flink's own S3 filesystem
> -----------------------------------
>
>                 Key: FLINK-5706
>                 URL: https://issues.apache.org/jira/browse/FLINK-5706
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Stephan Ewen
>
> As part of the effort to make Flink completely independent from Hadoop, Flink needs its own S3 filesystem implementation. Currently Flink relies on Hadoop's S3a and S3n file systems.
>
> An own S3 file system can be implemented using the AWS SDK. As the basis of the implementation, the Hadoop File System can be used (Apache Licensed, should be okay to reuse some code as long as we do a proper attribution).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)