Hi Sivaprasanna,

thanks a lot for your proposal. Now that I have run into a HadoopUtils-related issue myself [1], I see the benefit of this proposal.
I'm happy to be the Flink committer who mentors this change. If we do this,
I would like to keep the scope of the initial change small:
- create a "flink-hadoop-utils" module
- move generic, common utils into that module (for example,
  SerializableHadoopConfiguration)

I agree with Till that we should initially leave out the Hadoop
compatibility modules.

You can go ahead and file a JIRA ticket! Let's discuss the exact scope there.

[1] https://github.com/apache/flink/pull/12146

On Thu, Apr 30, 2020 at 6:54 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:

> Bump.
>
> Please let me know if someone is interested in reviewing this one. I am
> willing to start working on it. BTW, a small and new addition to the list:
> with FLINK-10114 merged, OrcBulkWriterFactory can also reuse
> `SerializableHadoopConfiguration` along with SequenceFileWriterFactory and
> CompressWriterFactory.
>
> CC - Kostas Kloudas, since he has a better understanding of
> `SerializableHadoopConfiguration`.
>
> Cheers,
> Sivaprasanna
>
> On Mon, Mar 30, 2020 at 3:17 PM Chesnay Schepler <ches...@apache.org> wrote:
>
> > I would recommend waiting until a committer has signed up to review your
> > changes before preparing any PR. Otherwise the chances are high that you
> > invest a lot of time but the changes never get in.
> >
> > On 30/03/2020 11:42, Sivaprasanna wrote:
> > > Hello Till,
> > >
> > > I agree with keeping the scope limited and more focused. I can file a
> > > JIRA and get started with the code changes; as and when someone has
> > > some bandwidth, the review can be done as well. What do you think?
> > >
> > > Cheers,
> > > Sivaprasanna
> > >
> > > On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > >
> > >> Hi Sivaprasanna,
> > >>
> > >> thanks for starting this discussion. In general I like the idea of
> > >> removing duplication and moving common code to a shared module. As a
> > >> recommendation, I would exclude the whole part about Flink's Hadoop
> > >> compatibility modules because they are legacy code and hardly used
> > >> anymore. This would also have the benefit of making the scope of the
> > >> proposal a bit smaller.
> > >>
> > >> What we now need is a committer who wants to help with this effort.
> > >> It might be that this takes a bit of time, as many of the committers
> > >> are quite busy.
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Continuing an earlier discussion [1] about having a separate module
> > >>> for Hadoop-related utility components, I have gone through our
> > >>> project briefly and found the following components which I feel
> > >>> could be moved to a separate module for reusability and a better
> > >>> module structure.
> > >>>
> > >>> Module name / Class name / Used at & remarks:
> > >>>
> > >>> flink-hadoop-fs
> > >>>   flink.runtime.util.HadoopUtils
> > >>>   Used at: flink-runtime => HadoopModule & HadoopModuleFactory;
> > >>>   flink-swift-fs-hadoop => SwiftFileSystemFactory;
> > >>>   flink-yarn => Utils, YarnClusterDescriptor
> > >>>
> > >>> flink-hadoop-compatibility
> > >>>   api.java.hadoop.mapred.utils.HadoopUtils
> > >>>   api.java.hadoop.mapreduce.utils.HadoopUtils
> > >>>   Both belong to the same module but live in different packages
> > >>>   (api.java.hadoop.mapred and api.java.hadoop.mapreduce).
> > >>>
> > >>> flink-sequence-file
> > >>>   formats.sequencefile.SerializableHadoopConfiguration
> > >>>   Currently used by formats.sequencefile.SequenceFileWriterFactory, but
> > >>>   it can also be used by HadoopCompressionBulkWriter, a potential
> > >>>   OrcBulkWriter, and pretty much everywhere else to avoid
> > >>>   NotSerializableException.
> > >>>
> > >>> *Proposal*
> > >>> To summarise, I believe we can create a new module (flink-hadoop-utils?)
> > >>> and move these reusable components into it; the new module would have
> > >>> an optional/provided dependency on flink-shaded-hadoop-2.
> > >>>
> > >>> *Structure*
> > >>> In its initial form, I think we will have two classes, with the package
> > >>> structure being *org.apache.flink.hadoop.[utils/serialization]*:
> > >>> 1. HadoopUtils with all static methods (after combining and eliminating
> > >>> the duplicate code fragments from the three HadoopUtils classes
> > >>> mentioned above).
> > >>> 2. The existing SerializableHadoopConfiguration, moved from
> > >>> flink-sequence-file to this new module.
> > >>>
> > >>> *Justification*
> > >>> * With this change, we would strip away the dependency on
> > >>> flink-hadoop-fs from flink-runtime, as I don't see any other classes
> > >>> from flink-hadoop-fs being used anywhere in the flink-runtime module.
> > >>> * We will have a common place where all the Hadoop-related utilities
> > >>> can go and be reused easily without leading to jar hell.
> > >>>
> > >>> In addition to this, if you are aware of any other classes that fit
> > >>> this approach, please share the details here.
> > >>>
> > >>> *Note*
> > >>> I don't have a complete understanding here, but I did see two
> > >>> implementations of each of the following classes under two different
> > >>> packages, *.mapred and *.mapreduce:
> > >>> * HadoopInputFormat
> > >>> * HadoopInputFormatBase
> > >>> * HadoopOutputFormat
> > >>> * HadoopOutputFormatBase
> > >>>
> > >>> Can we somehow figure this out and have them in the new module as well?
> > >>>
> > >>> Thanks,
> > >>> Sivaprasanna
> > >>>
> > >>> [1]
> > >>> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
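For context on the class that keeps coming up in this thread: SerializableHadoopConfiguration is essentially a thin Serializable wrapper around Hadoop's Configuration, which implements Hadoop's Writable interface but not java.io.Serializable. A minimal sketch of such a wrapper (the names and details below are illustrative, not necessarily the exact Flink source) could look like this:

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    import org.apache.hadoop.conf.Configuration;

    // Sketch: wraps a Hadoop Configuration so it can travel through Java
    // serialization. The field is transient because Configuration is not
    // Serializable; we delegate to its Writable methods instead.
    public class SerializableHadoopConfiguration implements Serializable {

        private static final long serialVersionUID = 1L;

        private transient Configuration hadoopConfig;

        public SerializableHadoopConfiguration(Configuration hadoopConfig) {
            this.hadoopConfig = hadoopConfig;
        }

        public Configuration get() {
            return hadoopConfig;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            hadoopConfig.write(out);          // Configuration implements Writable
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            hadoopConfig = new Configuration();
            hadoopConfig.readFields(in);      // rebuild the wrapped configuration
        }
    }

Any writer factory that has to ship a Hadoop Configuration to the task managers (SequenceFileWriterFactory, CompressWriterFactory, a future OrcBulkWriterFactory) can hold such a wrapper instead of the raw Configuration and thereby avoid the NotSerializableException mentioned above, which is what makes it a natural candidate for a shared flink-hadoop-utils module.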