Hi Gabor,

They are absolutely not coupled; I'm merely mentioning it because it could potentially "waste" someone's time: first by creating a bandaid which would then need to be replaced with a more permanent solution. That could be fine, but it all depends on how much effort the bandaid takes and how error-prone it is, versus creating a structural solution.
Thanks,
Martijn

On Wed, Oct 26, 2022 at 11:27 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> Hi Martijn,
>
> > I'm not sure that creating another bandaid is a good idea.
>
> I think we have multiple issues here which are not coupled:
> * Hadoop config handling in Flink is not consistent (for example, the runtime uses the Hadoop FS connector code, and each connector has its own implementation)
> * The quality of the S3 connector for Flink can definitely be improved
>
> Fixing the first doesn't require a full S3 connector rewrite, though I agree with your direction.
>
> > I'm not sure if there are any experts on S3 in the Flink Community but it would be great if we could find them and see if/how we can improve.
>
> I'm not calling myself an S3 expert, but I'm planning to implement a token provider for S3, so I intended to drive that effort.
> While I'm there I can have a look at how it's implemented.
>
> BR,
> G
>
> On Tue, Oct 25, 2022 at 8:52 PM Martijn Visser <martijnvis...@apache.org> wrote:
>
> > Hi all,
> >
> > I have been thinking that we should consider creating one new, rock-solid S3 connector for Flink. I think it's confusing for users that there are both an S3 Presto and an S3 Hadoop implementation, neither of which is perfect. I'm not sure that creating another bandaid is a good idea.
> >
> > I'm not sure if there are any experts on S3 in the Flink Community, but it would be great if we could find them and see if/how we can improve.
> >
> > Thanks,
> >
> > Martijn
> >
> > On Tue, Oct 25, 2022 at 4:56 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > > Thanks for the answer, Gabor!
> > >
> > > Just for the sake of clarity:
> > > - The issue is that `flink-s3-fs-hadoop` does not even read the `core-site.xml` if it is not on the classpath
> > >
> > > Do I understand correctly that the proposal is:
> > > - Write a new `getHadoopConfiguration` method somewhere without using the dependencies, reading the files as plain Configuration files
> > > - Start using this new way of accessing these Configurations everywhere in the Flink code?
> > >
> > > Thanks,
> > > Peter
> > >
> > > On Tue, Oct 25, 2022 at 1:31 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > >
> > > > Hi Peter,
> > > >
> > > > > would this cause issues for the users?
> > > >
> > > > I think yes, it is going to make trouble for users who want to use S3 without the HDFS client.
> > > > Adding the HDFS client may happen, but enforcing it is not a good direction.
> > > >
> > > > As mentioned, I've realized that we have six different ways of loading the Hadoop conf, but I'm not sure one can make a single generic one from them. Sometimes one needs HdfsConfiguration or YarnConfiguration instances, which is hard to generalize.
> > > >
> > > > What I can imagine is the following (but super time-consuming):
> > > > * Create a specific configuration instance in the connector (HdfsConfiguration, YarnConfiguration)
> > > > * Cast it to a Configuration instance
> > > > * Call a generic loadConfiguration(Configuration conf, List<String> filesToLoad)
> > > > * Use the locations which are covered in HadoopUtils.getHadoopConfiguration (except the deprecated ones)
> > > > * Use this function in all the places around Flink
> > > >
> > > > In filesToLoad one could specify core-site.xml, hdfs-site.xml, etc.
> > > > I never tried it out, but this idea has been in my head for quite some time...
> > > >
> > > > BR,
> > > > G
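As a rough illustration of the generic loader sketched above (a hypothetical snippet, not existing Flink code: only the loadConfiguration(Configuration, List<String>) shape comes from G's mail, the class name and directory lookup are my assumptions):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public final class HadoopConfigLoader {

        // Merges the given config files from each candidate directory into
        // the passed-in Configuration. The connector creates the specific
        // instance (e.g. new HdfsConfiguration()) and hands it over as a
        // plain Configuration, so this loader itself only needs hadoop-common.
        public static Configuration loadConfiguration(
                Configuration conf, List<String> filesToLoad) {
            for (String dir : candidateConfDirs()) {
                for (String fileName : filesToLoad) {
                    File file = new File(dir, fileName);
                    if (file.exists()) {
                        // Plain resource loading, no HDFS/YARN classes involved.
                        conf.addResource(new Path(file.toURI()));
                    }
                }
            }
            return conf;
        }

        // Assumed locations: HADOOP_CONF_DIR and $HADOOP_HOME/etc/hadoop; the
        // real HadoopUtils.getHadoopConfiguration also honours Flink config keys.
        private static List<String> candidateConfDirs() {
            List<String> dirs = new ArrayList<>();
            String confDir = System.getenv("HADOOP_CONF_DIR");
            if (confDir != null) {
                dirs.add(confDir);
            }
            String hadoopHome = System.getenv("HADOOP_HOME");
            if (hadoopHome != null) {
                dirs.add(hadoopHome + "/etc/hadoop");
            }
            return dirs;
        }
    }

A connector would then call something like loadConfiguration(new HdfsConfiguration(), Arrays.asList("core-site.xml", "hdfs-site.xml")), keeping the HdfsConfiguration dependency local to the connectors that actually need it.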
> > > > On Tue, Oct 25, 2022 at 11:43 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> > > >
> > > > > Hi Team,
> > > > >
> > > > > I have recently faced the issue that the S3 FileSystem read my core-site.xml as long as it was on the classpath, but when I later tried to add it using HADOOP_CONF_DIR, the configuration file was not loaded. I filed a Jira [1] and created a PR [2] to fix it.
> > > > >
> > > > > HadoopUtils.getHadoopConfiguration is the method which considers all the relevant configurations for accessing / loading the Hadoop configuration files, so I used it to fix the issue. The downside is that in this method we instantiate the HdfsConfiguration object, which requires me to add hadoop-hdfs-client as a provided dependency.
> > > > >
> > > > > My question for the more experienced folks: would this cause issues for the users? Could we assume that if hadoop-common is on the classpath, then hadoop-hdfs-client is on the classpath as well? Do you see other possible drawbacks or issues with my approach?
> > > > >
> > > > > Thanks,
> > > > > Peter
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-29754
> > > > > [2] https://github.com/apache/flink/pull/21148
> >
> > --
> > Martijn
> > https://twitter.com/MartijnVisser82
> > https://github.com/MartijnVisser
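PS: a quick way to see the plain-Configuration approach discussed above in action, using only hadoop-common (a hypothetical sketch, not the code from the PR referenced in [2]):

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public final class S3ConfSmokeTest {
        public static void main(String[] args) {
            // A plain Configuration instead of HdfsConfiguration, so
            // hadoop-hdfs-client is not needed on the classpath.
            Configuration conf = new Configuration();

            // Pick up core-site.xml from HADOOP_CONF_DIR even when that
            // directory is not on the classpath -- the case from the Jira.
            String confDir = System.getenv("HADOOP_CONF_DIR");
            if (confDir != null) {
                File coreSite = new File(confDir, "core-site.xml");
                if (coreSite.exists()) {
                    conf.addResource(new Path(coreSite.toURI()));
                }
            }

            // Any fs.s3a.* key from the file should be visible now
            // ("fs.s3a.endpoint" is just an example key).
            System.out.println("fs.s3a.endpoint = " + conf.get("fs.s3a.endpoint"));
        }
    }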