I'm a soft -1 on depending on a private copy of a 3rd-party dependency unless necessary (in this case, it feels avoidable), but I won't block this if others think it's a good way forward.
An AWS SDK bump sounds like a patch that the Presto community might happily take in. Have you explored that option? (Can you provide additional context on "This is because that Presto still uses AWS SDK V1 and won’t add support for AWS SDK V2 in short term"?)

Best,
D.

On Wed, Oct 11, 2023 at 8:59 AM Min, Maomao <mimao...@amazon.com> wrote:

> Thanks everyone for your responses!
>
> A little more context about why we are reaching out: AWS SDK V1 will be
> deprecated soon and our team (AWS EMR) is working on upgrading all our
> components to AWS SDK V2 including Flink.
>
> Regarding the suggestion of native S3 filesystem, considering its
> complexity, I think it should be a long term solution with thorough
> discussion and investigation. However, in short term, it’s urgent for our
> team to upgrade to AWS SDK V2 for Flink in AWS EMR, and we want to align
> with Flink Community and contribute back.
>
> So, in short term, we still want to choose one of the following three
> options:
>
> 1. Keep a private copy of Presto’s S3 FileSystem with AWS SDK V2
>    support added.
> 2. Add AWS SDK V2 support in Presto’s S3 FileSystem and use new Presto
>    version with this feature in Flink.
> 3. Deprecate Presto’s FileSystem as we still can use Hadoop’s
>    FileSystem (or native one in the future).
>
> Any suggestions would be appreciated!
>
> Best,
> Maomao
>
> ------------------------------------------------------------------------
>
> *From:* Jing Ge <j...@ververica.com>
> *Date:* Tuesday, October 10, 2023 at 16:04
> *To:* "dev@flink.apache.org" <dev@flink.apache.org>
> *Cc:* "Zhao, Kevin" <kevnz...@amazon.com>,
>   "Josephraj, Prabhu" <jopra...@amazon.com>,
>   emr-flink-team <emr-flink-t...@amazon.com>
> *Subject:* RE: [EXTERNAL] Support AWS SDK V2 for Flink's S3 FileSystem
>
> +1 for the s3 file consolidation. We already have many issues with
> internal communication and talking to customers. Different file schemas are
> not very user friendly, btw.
>
> Best regards,
> Jing
>
> *From:* Matthias Pohl <matthias.p...@aiven.io>
> *Date:* Tuesday, October 10, 2023 at 15:35
> *To:* "dev@flink.apache.org" <dev@flink.apache.org>
> *Cc:* "Zhao, Kevin" <kevnz...@amazon.com>,
>   "Josephraj, Prabhu" <jopra...@amazon.com>,
>   emr-flink-team <emr-flink-t...@amazon.com>
> *Subject:* RE: [EXTERNAL] Support AWS SDK V2 for Flink's S3 FileSystem
>
> Just to add a bit more context to the performance test question: What I
> had in mind was the exists call on (non-existing) directories in a bucket
> with a lot of objects. A comment from one of the SDK contributors about
> that call was that it could be an expensive call in an object store if
> implemented wrongly. I would imagine that this could be a valid concern
> because the concept of directories is not really present in an object store
> like S3, if I'm not mistaken?!
>
> On Mon, Oct 9, 2023 at 6:49 PM Matthias Pohl <matthias.p...@aiven.io>
> wrote:
>
> I would agree with David's proposal as well.
>
> Would it make sense to come up with some performance comparisons for the
> different S3 implementations in the end? ...just to ensure that we're
> improving things or (at least) don't make things worse. Or is there
> something like that already somewhere?
>
> A bit out of scope:
>
> We noticed that the FileSystem contract is not well defined. The JavaDoc
> is ambiguous (IMHO) for some operations.
> For instance, the return value of
> delete [1] is true "if the operation was successful": It's unclear (at
> least to me) what success means here. Is it about the processing (i.e. the
> delete was performed on an existing file) or the outcome (i.e. success is
> reached as well if the file didn't exist in the first place). Removing the
> return type could help to make the contract clearer. In the end, only the
> outcome (i.e. the file doesn't exist anymore) matters in my opinion. A
> similar argument could be applied to mkdirs [2] and rename [3].
>
> That said, I'm not suggesting you adapt the interface as part of your
> work. But it would be good to collect other improvements as part of it. We
> could consider improving the FileSystem interface as part of the 2.0
> efforts as a follow-up.
>
> [1]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L695
>
> [2]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L706
>
> [3]
> https://github.com/apache/flink/blob/d78d52b27af2550f50b44349d3ec6dc84b966a8a/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L773
>
> On Tue, Oct 3, 2023 at 6:25 PM Martijn Visser <martijnvis...@apache.org>
> wrote:
>
> +1 for David's suggestion. We should get away from the current
> approach with two abstractions and get to one rock solid one.
>
> On Mon, Oct 2, 2023 at 11:13 PM David Morávek <d...@apache.org> wrote:
> >
> > Hi Maomao,
> >
> > I wonder whether it would make sense to take a stab at consolidating the S3
> > filesystems instead and introduce a native one. The whole Hadoop wrapper
> > around the S3 client exists for legacy reasons, and it adds complexity and
> > probably an unnecessary performance penalty.
> >
> > If you take a look at the underlying presto implementation, it's actually
> > not too complex to adapt to Flink interfaces (since you're proposing to
> > maintain a copy of it anyway).
> >
> > Overall, the S3 FS is probably the most used one that we have so this could
> > be rather high impact. It would also eliminate user confusion when choosing
> > the implementation to use.
> >
> > WDYT?
> >
> > Best,
> > D.
> >
> > On Fri, Sep 29, 2023 at 2:41 PM Min, Maomao <mimao...@amazon.com.invalid>
> > wrote:
> >
> > > Hi Flink Dev,
> > >
> > > I’m Maomao, a developer from AWS EMR.
> > >
> > > Recently, our team is working on adding AWS SDK V2 support for Flink’s S3
> > > Filesystem. During development, we found out that our work was blocked by
> > > Presto. This is because that Presto still uses AWS SDK V1 and won’t add
> > > support for AWS SDK V2 in short term. To unblock, our team proposed several
> > > options and I’ve created a JIRA issue as here
> > > <https://issues.apache.org/jira/browse/FLINK-33157>.
> > >
> > > Since our team plans to contribute this work back to the community later,
> > > we’d like to collect feedback from the community about the options we
> > > proposed in the long term so that the community won’t need to duplicate
> > > this work in the future.
> > >
> > > Best,
> > > Maomao
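
A footnote on the FileSystem contract question raised in the thread (what "success" means for delete): the outcome-based reading, where the call succeeds as long as the file is gone afterwards, can be sketched roughly as below. This is only an illustration on the local filesystem using java.nio; the class DeleteContractSketch and the method ensureAbsent are hypothetical names and are not part of Flink's FileSystem interface or of any S3 implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of Flink's FileSystem API. */
final class DeleteContractSketch {

    private DeleteContractSketch() {}

    /**
     * Outcome-based delete: returns normally once {@code path} no longer exists,
     * whether or not this call was the one that removed it. There is no boolean
     * return value, so the caller cannot confuse "deleted" with "was already absent".
     */
    static void ensureAbsent(Path path) throws IOException {
        // deleteIfExists returns false for a missing file instead of throwing.
        Files.deleteIfExists(path);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("delete-contract", ".tmp");
        ensureAbsent(tmp); // removes the existing file
        ensureAbsent(tmp); // file is already gone: still succeeds
        System.out.println("exists after delete: " + Files.exists(tmp));
    }
}
```

Under this reading a genuine failure (for example, a permission problem) surfaces as an IOException, which is one way the "processing vs. outcome" ambiguity could be avoided; whether Flink should adopt such a contract is exactly the open question in the thread.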