GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/4818

    [FLINK-5706] [file systems] Add S3 file systems without Hadoop dependencies

    ## What is the purpose of the change
    
    This adds two file system implementations that write to S3, so that users can use Flink with S3 without depending on Hadoop, and have an alternative to Hadoop's S3 connectors.
    
    Neither is an actual re-implementation; both wrap existing implementations and shade their dependencies.
    
    1. The first is a wrapper around Hadoop's s3a file system. By pulling in a smaller dependency tree and shading all dependencies away, this keeps the appearance of Flink being Hadoop-free, from a dependency perspective. We can also bump the shaded Hadoop dependency here to pick up improvements to s3a (as in Hadoop 3.0) without causing dependency conflicts.
    
    2. The second S3 file system is from the Presto project. Initial simple tests suggest that it responds slightly faster and in a more lightweight manner to write/read/list requests than the Hadoop s3a FS, but it has some semantic differences. For example, creating a directory does not mean the file system recognizes that the directory exists; the directory is only recognized as existing once files are inserted below it (see the sketch after this list). For checkpointing, that could even be preferable.
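    To make the directory behavior concrete, the following is a minimal sketch against Flink's `FileSystem` API. The bucket name is hypothetical, and the snippet illustrates the observed semantics rather than code from this PR:
    
    ```java
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;
    
    public class DirectorySemanticsSketch {
    
        public static void main(String[] args) throws Exception {
            // hypothetical bucket, for illustration only
            Path dir = new Path("s3://my-bucket/some-dir");
            FileSystem fs = dir.getFileSystem();
    
            fs.mkdirs(dir);   // succeeds, but creates no object in S3
            fs.exists(dir);   // with the Presto FS this may still be false
    
            // once a file is written below the path, the directory "exists"
            try (FSDataOutputStream out = fs.create(
                    new Path(dir, "part-0"), FileSystem.WriteMode.NO_OVERWRITE)) {
                out.write(new byte[] {42});
            }
    
            fs.exists(dir);   // now true, via the directory's contents
        }
    }
    ```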
    
    Both file systems register themselves under the `s3` scheme, so as not to overload the `s3n` and `s3a` schemes used by Hadoop.
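    For context, each implementation announces its scheme through Flink's `FileSystemFactory` interface and is picked up via Java's `ServiceLoader` mechanism. Roughly, as a sketch with the bodies elided (not the exact factory code from this PR):
    
    ```java
    import java.io.IOException;
    import java.net.URI;
    
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.FileSystemFactory;
    
    // sketch of a factory binding a wrapped S3 implementation to the "s3" scheme
    public class S3FileSystemFactory implements FileSystemFactory {
    
        @Override
        public String getScheme() {
            return "s3";  // claims s3://, leaving s3n:// and s3a:// to Hadoop
        }
    
        @Override
        public void configure(Configuration config) {
            // forward Flink configuration to the wrapped client (elided)
        }
    
        @Override
        public FileSystem create(URI fsUri) throws IOException {
            // instantiate and wrap the shaded implementation (elided)
            throw new UnsupportedOperationException("sketch only");
        }
    }
    ```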
    
    ## Brief change log
    
      - Adds `flink-filesystems/flink-s3-fs-hadoop`
      - Adds `flink-filesystems/flink-s3-fs-presto`
    
    ## Verifying this change
    
    This adds some initial integration tests, which depend on S3 credentials. The credentials are not in the code, but exist only encrypted on Travis. As a result, the tests can only run in a meaningful way either on the `apache/flink` master branch, or in a committer repository where the committer has enabled Travis uploads to S3 (for logs); the tests use the same secret credentials.
    
    Since this change does not re-implement the actual S3 communication, there are no tests for that logic itself. The tests cover instantiation and whether communication with S3 can be established (simple reads/writes to a bucket, listing, etc.), as sketched below.
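    For illustration, such a connectivity test has roughly the following shape. The bucket name and test class are hypothetical; the real tests obtain the bucket and credentials from the encrypted environment:
    
    ```java
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;
    
    import org.junit.Test;
    
    import static org.junit.Assert.assertTrue;
    
    public class S3ConnectivitySketchTest {
    
        @Test
        public void testWriteListDelete() throws Exception {
            Path path = new Path("s3://test-bucket/tests/hello");
            FileSystem fs = path.getFileSystem();
    
            // simple write to the bucket
            try (FSDataOutputStream out =
                    fs.create(path, FileSystem.WriteMode.OVERWRITE)) {
                out.write("hello".getBytes());
            }
    
            assertTrue(fs.exists(path));
    
            // listing the parent should show at least the file just written
            FileStatus[] listing = fs.listStatus(path.getParent());
            assertTrue(listing.length > 0);
    
            fs.delete(path, false);
        }
    }
    ```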
    
    The change can also be verified by building Flink, moving the respective S3 FS jar from `/opt` into `/lib`, and running a workload that checkpoints or writes to S3 (see the sketch below).
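    For that manual check, a minimal workload could look like the following sketch (the bucket path is hypothetical):
    
    ```java
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    public class S3CheckpointSmokeTest {
    
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
    
            // checkpoints go to whatever file system is registered under "s3"
            env.enableCheckpointing(10_000);
            env.setStateBackend(new FsStateBackend("s3://my-bucket/checkpoints"));
    
            env.fromElements(1, 2, 3)
               .map(i -> 2 * i)
               .print();
    
            env.execute("s3-checkpoint-smoke-test");
        }
    }
    ```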
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (**yes** / no)
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
    
    Proper behavior of the file systems is important; otherwise checkpointing may fail. In some sense, we already rely on the Hadoop project's tests of its HDFS and S3 connectors; this adds a similar dependency.
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (**yes** / no)
      - If yes, how is the feature documented? (not applicable / docs / 
JavaDocs / **not documented**)
    
    Will add documentation once the details of this feature are agreed upon.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink fs_s3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4818
    
----
commit a0b89ad02d4cb67c4d5e1e28efcb6872af0540e6
Author: Stephan Ewen <se...@apache.org>
Date:   2017-10-06T15:41:00Z

    [FLINK-5706] [file systems] Add S3 file systems without Hadoop dependencies
    
    This adds two implementations of a file system that write to S3.
    Neither is an actual re-implementation; both wrap other implementations and shade dependencies.
    
    (1) A wrapper around Hadoop's s3a file system. By pulling a smaller 
dependency tree and
        shading all dependencies away, this keeps the appearance of Flink being 
Hadoop-free,
        from a dependency perspective.
    
    (2) The second S3 file system is from the Presto Project.
        Initial simple tests seem to indicate that it responds slightly faster
        and in a bit more lightweight manner to write/read/list requests,
        compared to the Hadoop s3a FS, but it has some semantic differences.

----

