[ https://issues.apache.org/jira/browse/FLINK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895648#comment-15895648 ]
Steve Loughran commented on FLINK-5706:
---------------------------------------

Stefan, I don't think you appreciate how hard it is to do this.

I will draw your attention to all the features coming in Hadoop 2.8 under HADOOP-11694, including seek-optimised input streams, disk/heap/byte-block buffered uploads, support for encryption, and optimisation of all requests, HTTPS call by HTTPS call. Then there's the todo list for later, HADOOP-13204. One aspect of this, HADOOP-13345 (S3Guard), uses DynamoDB for a consistent view of the metadata, and HADOOP-13786 adds a way to commit work directly to S3 while supporting speculation and fault tolerance. These are all the things you get to replicate, along with the scale tests, which do find things: HADOOP-14028 showed up on 70 GB writes. Then there are the various intermittent failures you don't see often but which cause serious problems when they do. Example: the final POST of a multipart PUT doesn't do retries; you have to do that yourself. After you find the problem.

As a sibling project, you are free to lift the entirety of the s3a code, along with all the tests it includes. But you then take on the burden of maintaining it, fielding support calls, doing your entire integration test work yourself, and performance tuning. Did [I mention testing?|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md] We have put a lot of effort in there. You have to balance remote test runs with scalable performance and affordability; the secret is reusing Amazon's own public datasets, our own TPC-DS datasets, and running tests in EC2 to really stress things. This is not trivial, even if you lift the latest branch-2 code. We are still finding bugs there, ones that surface in the field.

We may have a broader set of downstream things to deal with: distcp, mapreduce, hive, spark, even flink, but that helps us with the test reports (we keep an eye on JIRAs and Stack Overflow for the word "s3a") and the different deployment scenarios.

Please, do not take this on lightly.

Returning to your example above (a sketch of the probe-then-write pattern follows this list):
# It's not just that the {{exists()/HEAD}} probe can take time to respond; it is that the directory listing lags the direct {{HEAD object}} call. Even if the {{exists()}} check returns 404, a LIST operation may still list the entry. And because the front-end load balancers cache things, the code deleting the object may get a 404 indicating that the object is gone, but *there is no guarantee that a different caller will not get a 200*.
# You may even get it in the same process. If your calls are using a thread pool of keep-alive HTTP/1.1 connections and all calls go over the same TCP connection, you'll be hitting the same load balancer and so get the cached 404. Because yes, load balancers cache 404 entries, meaning you don't even get create consistency if you do a check first.
# S3 doesn't have full read-after-write consistency. It now has create consistency across all regions (yes, for a long time it behaved differently on US-East vs the others), provided you don't do a HEAD first.
# You don't get PUT-over-PUT consistency or DELETE consistency, and metadata queries invariably lag the object state, even on create.
# There is no such thing as {{rename()}}, merely a COPY at approximately 6-10 MB/s, so it is O(data) and non-atomic.
# If you are copying on top of objects with the same name, you hit update consistency, for which there are no guarantees. Again, different callers may see different results, irrespective of call ordering, and listing will lag creation.
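To make the first two points concrete, here is a minimal sketch of the probe-then-write pattern being warned about. It assumes the AWS SDK for Java 1.x; the bucket and key names are hypothetical and not from this issue, and this is not the s3a code path, just an illustration. The HEAD issued by the existence probe can leave a cached 404 on a front-end load balancer, so the object may remain invisible to that probe, and to LIST, for some time after the PUT succeeds.

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class CreateConsistencyProbe {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "my-bucket";          // hypothetical bucket
    String key = "checkpoints/_SUCCESS";  // hypothetical key

    // Step 1: the existence probe. This HEAD can be answered, and its 404
    // cached, by a front-end load balancer.
    boolean existedBefore = s3.doesObjectExist(bucket, key);

    // Step 2: the write. Even after this PUT returns successfully...
    s3.putObject(bucket, key, "marker");

    // Step 3: ...a follow-up HEAD may still hit the cached 404, and a LIST may
    // lag even longer. Polling only narrows the window; it is not a guarantee,
    // and a different caller (or cluster) can still observe the old state.
    for (int i = 0; i < 5 && !s3.doesObjectExist(bucket, key); i++) {
      Thread.sleep(1000L);
    }

    System.out.println("existed before: " + existedBefore
        + ", visible now: " + s3.doesObjectExist(bucket, key));
  }
}
{code}

Skipping the HEAD before the PUT avoids priming that 404 cache, which is exactly why the create-consistency guarantee only holds "provided you don't do a HEAD first".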
What you have seen so far is "demo scale" behaviour over a reused HTTP/1.1 connection against the same load balancer. You cannot extrapolate from what works there to what offers guaranteed outcomes for large-scale operations on production data across multiple clusters, except for the special case "if it doesn't work here, it won't magically work in production".

> Implement Flink's own S3 filesystem
> -----------------------------------
>
>         Key: FLINK-5706
>         URL: https://issues.apache.org/jira/browse/FLINK-5706
>     Project: Flink
>  Issue Type: New Feature
>  Components: filesystem-connector
>    Reporter: Stephan Ewen
>
> As part of the effort to make Flink completely independent from Hadoop, Flink needs its own S3 filesystem implementation. Currently Flink relies on Hadoop's S3a and S3n file systems.
> An own S3 file system can be implemented using the AWS SDK. As the basis of the implementation, the Hadoop File System can be used (Apache Licensed, should be okay to reuse some code as long as we do a proper attribution).