Hi all,

I have been thinking that we should consider creating one new, rock-solid
S3 connector for Flink. I think it's confusing for users that there are
both an S3 Presto and an S3 Hadoop implementation, neither of which is
perfect. I'm not sure that creating another band-aid is a good idea.

I'm not sure if there are any S3 experts in the Flink community, but it
would be great if we could find them and see if/how we can improve.

Thanks,

Martijn

On Tue, Oct 25, 2022 at 16:56, Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Thanks for the answer Gabor!
>
> Just for the sake of clarity:
> - The issue is that `flink-s3-fs-hadoop` does not read `core-site.xml`
> at all if the file is not on the classpath
>
> Do I understand correctly that the proposal is:
> - Write a new `getHadoopConfiguration` method somewhere that does not use
> these dependencies, and instead reads the files as plain `Configuration`
> files (something like the sketch below)?
> - Start using this new way of accessing these configurations everywhere
> in the Flink code?
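>
> Just to illustrate what I mean, an untested sketch (the HADOOP_CONF_DIR
> handling is my assumption about where the file would come from):
>
>     import java.io.File;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>
>     // Build a plain Configuration from HADOOP_CONF_DIR without
>     // instantiating HdfsConfiguration, so hadoop-hdfs-client is
>     // not needed on the classpath.
>     Configuration conf = new Configuration();
>     String confDir = System.getenv("HADOOP_CONF_DIR");
>     if (confDir != null) {
>         File coreSite = new File(confDir, "core-site.xml");
>         if (coreSite.exists()) {
>             conf.addResource(new Path(coreSite.toURI()));
>         }
>     }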
>
> Thanks,
> Peter
>
> Gabor Somogyi <gabor.g.somo...@gmail.com> wrote on Tue, Oct 25, 2022 at
> 13:31:
>
> > Hi Peter,
> >
> > > would this cause issues for the users?
> >
> > I think yes, it is going to cause trouble for users who want to use S3
> > without the HDFS client.
> > Adding the HDFS client may happen, but enforcing it is not a good
> > direction.
> >
> > As mentioned, I've realized that we have 6 different ways of loading
> > the Hadoop conf, but I'm not sure one can make a single generic
> > mechanism out of them. Sometimes one needs HdfsConfiguration or
> > YarnConfiguration instances, which is hard to generalize.
> >
> > What I can imagine is the following (but it is super time consuming):
> > * Create the specific configuration instance in the connector
> > (HdfsConfiguration, YarnConfiguration)
> > * Cast it to a Configuration instance
> > * Call a generic loadConfiguration(Configuration conf, List<String>
> > filesToLoad)
> > * Use the locations which are covered in
> > HadoopUtils.getHadoopConfiguration (except the deprecated ones)
> > * Use this function in all the places around Flink
> >
> > In filesToLoad one could specify core-site.xml, hdfs-site.xml, etc.
> > I've never tried it out, but this idea has been in my head for quite
> > some time...
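> >
> > Roughly like this - an untested sketch, where the method and parameter
> > names are placeholders rather than an agreed API:
> >
> >     import java.io.File;
> >     import java.util.List;
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.Path;
> >
> >     // Untested sketch: merge the given files into an already-created
> >     // Configuration (HdfsConfiguration, YarnConfiguration, ...).
> >     // Only HADOOP_CONF_DIR is shown here; a real version would cover
> >     // the other locations from HadoopUtils.getHadoopConfiguration.
> >     public static Configuration loadConfiguration(
> >             Configuration conf, List<String> filesToLoad) {
> >         String confDir = System.getenv("HADOOP_CONF_DIR");
> >         if (confDir != null) {
> >             for (String fileName : filesToLoad) {
> >                 File file = new File(confDir, fileName);
> >                 if (file.exists()) {
> >                     conf.addResource(new Path(file.toURI()));
> >                 }
> >             }
> >         }
> >         return conf;
> >     }
> >
> > A connector would then call e.g. loadConfiguration(new
> > HdfsConfiguration(), Arrays.asList("core-site.xml", "hdfs-site.xml")).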
> >
> > BR,
> > G
> >
> >
> > On Tue, Oct 25, 2022 at 11:43 AM Péter Váry
> > <peter.vary.apa...@gmail.com> wrote:
> >
> > > Hi Team,
> > >
> > > I have recently faced the issue that the S3 FileSystem read my
> > > core-site.xml as long as it was on the classpath, but when I later
> > > tried to provide it via HADOOP_CONF_DIR instead, the configuration
> > > file was not loaded. I filed a Jira [1] and created a PR [2] to fix
> > > it.
> > >
> > > HadoopUtils.getHadoopConfiguration is the method that considers all
> > > the relevant locations when loading the Hadoop configuration files,
> > > so I used it to fix the issue. The downside is that this method
> > > instantiates an HdfsConfiguration object, which required me to add
> > > hadoop-hdfs-client as a provided dependency.
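> > >
> > > For reference, the part that pulls in the dependency is roughly this
> > > (simplified from my understanding of the method, not the exact code):
> > >
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.hdfs.HdfsConfiguration;
> > >
> > >     // HdfsConfiguration lives in hadoop-hdfs-client, so even just
> > >     // constructing it requires that artifact on the classpath; it
> > >     // registers hdfs-default.xml / hdfs-site.xml as resources.
> > >     Configuration conf = new HdfsConfiguration();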
> > >
> > > My question for the more experienced folks: would this cause issues
> > > for the users? Can we assume that if hadoop-common is on the
> > > classpath, then hadoop-hdfs-client is on the classpath as well? Do
> > > you see other possible drawbacks or issues with my approach?
> > >
> > > Thanks,
> > > Peter
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-29754
> > > [2] https://github.com/apache/flink/pull/21148
> > >
> >
>
-- 
Martijn
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser
