Hi Martijn,

> I'm not sure that creating another bandaid is a good idea.

I think we have multiple issues here which are not coupled:
* Hadoop config handling in Flink is not consistent (for example, the
runtime uses the Hadoop FS connector code, and each connector has its own
implementation)
* The quality of the S3 connector for Flink can definitely be improved

Fixing the first doesn't require a full S3 connector rewrite, though I agree
with your direction.

> I'm not sure if there are any experts on S3 in the Flink Community but it
would be great if we could find them and see if/how we can improve.

I wouldn't call myself an S3 expert, but I'm planning to implement a token
provider for S3, so I intend to drive that effort.
While I'm there, I can have a look at how it's implemented.
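
Just to make the idea of loading the Hadoop config without the HDFS client
(discussed further down in this thread) a bit more concrete: a
dependency-free helper could parse the Hadoop-style *-site.xml files with
plain JDK XML APIs. This is a rough, untested sketch; the class and method
names are made up and are not existing Flink or Hadoop code:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical, dependency-free parser for Hadoop-style configuration
// files (<configuration><property><name>..</name><value>..</value>..).
// It uses only JDK XML APIs, so no Hadoop client jars are required.
public final class PlainHadoopConf {

    private PlainHadoopConf() {}

    public static Map<String, String> parse(String xml) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Each <property> element carries one name/value pair.
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            String name = property.getElementsByTagName("name")
                    .item(0).getTextContent().trim();
            String value = property.getElementsByTagName("value")
                    .item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }
}
```

A real implementation would of course still have to resolve HADOOP_CONF_DIR
and the other locations that HadoopUtils.getHadoopConfiguration already
covers, and merge the result into whatever Configuration instance the
connector needs.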

BR,
G


On Tue, Oct 25, 2022 at 8:52 PM Martijn Visser <martijnvis...@apache.org>
wrote:

> Hi all,
>
> I have been thinking that we should consider creating one new, rock-solid
> S3 connector for Flink. I think it's confusing for users that there are
> both an S3 Presto and an S3 Hadoop implementation, neither of which is
> perfect. I'm not sure that creating another bandaid is a good idea.
>
> I'm not sure if there are any experts on S3 in the Flink Community but it
> would be great if we could find them and see if/how we can improve.
>
> Thanks,
>
> Martijn
>
> On Tue, Oct 25, 2022 at 16:56 Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
> > Thanks for the answer, Gabor!
> >
> > Just for the sake of clarity:
> > - The issue is that the `flink-s3-fs-hadoop` does not even read the
> > `core-site.xml` if it is not on the classpath
> >
> > Do I understand correctly that the proposal is:
> > - Write a new `getHadoopConfiguration` method somewhere without using the
> > dependencies, and read the files as plain Configuration files
> > - Start using this new way of accessing these Configurations everywhere
> > in the Flink code?
> >
> > Thanks,
> > Peter
> >
> > Gabor Somogyi <gabor.g.somo...@gmail.com> wrote (on Tue, Oct 25, 2022
> > at 13:31):
> >
> > > Hi Peter,
> > >
> > > > would this cause issues for the users?
> > >
> > > I think yes, it is going to make trouble for users who want to use S3
> > > without the HDFS client.
> > > Adding the HDFS client may happen, but enforcing it is not a good
> > > direction.
> > >
> > > As mentioned, I've realized that we have 6 different ways in which the
> > > Hadoop conf is loaded, but I'm not sure one can make a generic one from
> > > them. Sometimes one needs HdfsConfiguration or YarnConfiguration
> > > instances, which is hard to generalize.
> > >
> > > What I can imagine is the following (but it is super time-consuming):
> > > * Create the specific configuration instance in the connector
> > > (HdfsConfiguration, YarnConfiguration)
> > > * Cast it to a Configuration instance
> > > * Call a generic loadConfiguration(Configuration conf, List<String>
> > > filesToLoad)
> > > * Use the locations which are covered in
> > > HadoopUtils.getHadoopConfiguration (except the deprecated ones)
> > > * Use this function in all the places around Flink
> > >
> > > In filesToLoad one could specify core-site.xml, hdfs-site.xml, etc.
> > > I have never tried it out, but this idea has been in my head for quite
> > > some time...
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Tue, Oct 25, 2022 at 11:43 AM Péter Váry <
> > > peter.vary.apa...@gmail.com> wrote:
> > >
> > > > Hi Team,
> > > >
> > > > I have recently faced the issue that the S3 FileSystem read my
> > > > core-site.xml as long as it was on the classpath, but when I later
> > > > tried to add it using HADOOP_CONF_DIR, the configuration file was
> > > > not loaded. I filed a Jira [1] and created a PR [2] to fix it.
> > > >
> > > > HadoopUtils.getHadoopConfiguration is the method which considers all
> > > > the relevant configurations for accessing/loading the Hadoop
> > > > configuration files, so I used it to fix the issue. The downside is
> > > > that in this method we instantiate the HdfsConfiguration object,
> > > > which requires me to add hadoop-hdfs-client as a provided dependency.
> > > >
> > > > My question for the more experienced folks: would this cause issues
> > > > for the users? Can we assume that if hadoop-common is on the
> > > > classpath, then hadoop-hdfs-client is on the classpath as well? Do
> > > > you see other possible drawbacks or issues with my approach?
> > > >
> > > > Thanks,
> > > > Peter
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-29754
> > > > [2] https://github.com/apache/flink/pull/21148
> > > >
> > >
> >
> --
> Martijn
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
