Hi Gabor,

They are absolutely not coupled; I'm merely mentioning it because it could potentially "waste" someone's time: first by creating a bandaid which would then need to be replaced with a more permanent solution. That could be fine, but it all depends on how much effort the bandaid takes and how error-prone it is, versus creating a structural solution.
Thanks,
Martijn

On Wed, Oct 26, 2022 at 11:27 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> Hi Martijn,
>
> > I'm not sure that creating another bandaid is a good idea.
>
> I think we have multiple issues here which are not coupled:
> * Hadoop config handling in Flink is not consistent (for example, the runtime uses the Hadoop FS connector code, and each connector has its own implementation)
> * The quality of the S3 connector for Flink can definitely be improved
>
> Fixing the first doesn't require a full S3 connector rewrite, though I agree with your direction.
>
> > I'm not sure if there are any experts on S3 in the Flink Community but it would be great if we could find them and see if/how we can improve.
>
> I'm not calling myself an S3 expert, but I'm planning to implement a token provider for S3, so I intended to drive that effort.
> While I'm there I can have a look at how it's implemented.
>
> BR,
> G
>
> On Tue, Oct 25, 2022 at 8:52 PM Martijn Visser <martijnvis...@apache.org> wrote:
>
> > Hi all,
> >
> > I have been thinking that we should consider creating one new, rock-solid S3 connector for Flink. I think it's confusing for users that there are both an S3 Presto and an S3 Hadoop implementation, neither of which is perfect. I'm not sure that creating another bandaid is a good idea.
> >
> > I'm not sure if there are any experts on S3 in the Flink Community, but it would be great if we could find them and see if/how we can improve.
> >
> > Thanks,
> >
> > Martijn
> >
> > On Tue, Oct 25, 2022 at 4:56 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >
> > > Thanks for the answer, Gabor!
> > >
> > > Just for the sake of clarity:
> > > - The issue is that `flink-s3-fs-hadoop` does not even read the `core-site.xml` if it is not on the classpath
> > >
> > > Do I understand correctly that the proposal is:
> > > - Write a new `getHadoopConfiguration` method somewhere without using the dependencies, reading the files as plain Configuration files
> > > - Start using this new way of accessing these Configurations everywhere in the Flink code?
> > >
> > > Thanks,
> > > Peter
> > >
> > > On Tue, Oct 25, 2022 at 1:31 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > >
> > > > Hi Peter,
> > > >
> > > > > would this cause issues for the users?
> > > >
> > > > I think yes, it is going to make trouble for users who want to use S3 without the HDFS client.
> > > > Adding the HDFS client may happen, but enforcing it is not a good direction.
> > > >
> > > > As mentioned, I've realized that we have six different ways of loading the Hadoop conf, but I'm not sure one can make a single generic one from them. Sometimes one needs HdfsConfiguration or YarnConfiguration instances, which is hard to generalize.
> > > >
> > > > What I can imagine is the following (but super time-consuming):
> > > > * Create a specific configuration instance in the connector (HdfsConfiguration, YarnConfiguration)
> > > > * Cast it to a Configuration instance
> > > > * Call a generic loadConfiguration(Configuration conf, List<String> filesToLoad)
> > > > * Use the locations which are covered in HadoopUtils.getHadoopConfiguration (except the deprecated ones)
> > > > * Use this function in all the places around Flink
> > > >
> > > > In filesToLoad one could specify core-site.xml, hdfs-site.xml, etc.
> > > > I never tried it out, but this idea has been in my head for quite some time...
> > > >
> > > > BR,
> > > > G
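As a rough illustration of the generic loader sketched above (a hypothetical snippet, not existing Flink code: only the loadConfiguration(Configuration, List<String>) shape comes from G's mail, the class name and directory lookup are my assumptions):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public final class HadoopConfigLoader {

        // Merges the given config files from each candidate directory into
        // the passed-in Configuration. The connector creates the specific
        // instance (e.g. new HdfsConfiguration()) and hands it over as a
        // plain Configuration, so this loader itself only needs hadoop-common.
        public static Configuration loadConfiguration(
                Configuration conf, List<String> filesToLoad) {
            for (String dir : candidateConfDirs()) {
                for (String fileName : filesToLoad) {
                    File file = new File(dir, fileName);
                    if (file.exists()) {
                        // Plain resource loading, no HDFS/YARN classes involved.
                        conf.addResource(new Path(file.toURI()));
                    }
                }
            }
            return conf;
        }

        // Assumed locations: HADOOP_CONF_DIR and $HADOOP_HOME/etc/hadoop; the
        // real HadoopUtils.getHadoopConfiguration also honours Flink config keys.
        private static List<String> candidateConfDirs() {
            List<String> dirs = new ArrayList<>();
            String confDir = System.getenv("HADOOP_CONF_DIR");
            if (confDir != null) {
                dirs.add(confDir);
            }
            String hadoopHome = System.getenv("HADOOP_HOME");
            if (hadoopHome != null) {
                dirs.add(hadoopHome + "/etc/hadoop");
            }
            return dirs;
        }
    }

A connector would then call something like loadConfiguration(new HdfsConfiguration(), Arrays.asList("core-site.xml", "hdfs-site.xml")), keeping the HdfsConfiguration dependency local to the connectors that actually need it.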
> > > > On Tue, Oct 25, 2022 at 11:43 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> > > >
> > > > > Hi Team,
> > > > >
> > > > > I have recently faced the issue that the S3 FileSystem read my core-site.xml as long as it was on the classpath, but when I later tried to add it using HADOOP_CONF_DIR, the configuration file was not loaded. I filed a Jira [1] and created a PR [2] to fix it.
> > > > >
> > > > > HadoopUtils.getHadoopConfiguration is the method which considers all the relevant configurations for accessing / loading the Hadoop configuration files, so I used it to fix the issue. The downside is that in this method we instantiate the HdfsConfiguration object, which requires me to add hadoop-hdfs-client as a provided dependency.
> > > > >
> > > > > My question for the more experienced folks: would this cause issues for the users? Could we assume that if hadoop-common is on the classpath, then hadoop-hdfs-client is on the classpath as well? Do you see other possible drawbacks or issues with my approach?
> > > > >
> > > > > Thanks,
> > > > > Peter
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-29754
> > > > > [2] https://github.com/apache/flink/pull/21148
> >
> > --
> > Martijn
> > https://twitter.com/MartijnVisser82
> > https://github.com/MartijnVisser
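PS: a quick way to see the plain-Configuration approach discussed above in action, using only hadoop-common (a hypothetical sketch, not the code from the PR referenced in [2]):

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public final class S3ConfSmokeTest {
        public static void main(String[] args) {
            // A plain Configuration instead of HdfsConfiguration, so
            // hadoop-hdfs-client is not needed on the classpath.
            Configuration conf = new Configuration();

            // Pick up core-site.xml from HADOOP_CONF_DIR even when that
            // directory is not on the classpath -- the case from the Jira.
            String confDir = System.getenv("HADOOP_CONF_DIR");
            if (confDir != null) {
                File coreSite = new File(confDir, "core-site.xml");
                if (coreSite.exists()) {
                    conf.addResource(new Path(coreSite.toURI()));
                }
            }

            // Any fs.s3a.* key from the file should be visible now
            // ("fs.s3a.endpoint" is just an example key).
            System.out.println("fs.s3a.endpoint = " + conf.get("fs.s3a.endpoint"));
        }
    }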