Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Robert Metzger Fri, 29 May 2020 04:39:14 -0700

Thanks a lot! Let's continue the discussion in the ticket! (I might not be
able to respond before Monday there)


On Thu, May 28, 2020 at 5:08 PM Sivaprasanna <sivaprasanna...@gmail.com>
wrote:

> FYI.
>
> I created a Jira to track this improvement.
> https://issues.apache.org/jira/browse/FLINK-18013
>
> -
> Sivaprasanna
>
> On Thu, May 28, 2020 at 12:22 PM Sivaprasanna <sivaprasanna...@gmail.com>
> wrote:
>
> > Awesome. : )
> > Thanks, Robert for signing up to be the reviewer. I will create Jira and
> > share the link here.
> >
> > Stay safe.
> >
> > -
> > Sivaprasanna
> >
> > On Thu, May 28, 2020 at 12:13 PM Robert Metzger <rmetz...@apache.org>
> > wrote:
> >
> >> Hi Sivaprasanna,
> >>
> >> thanks a lot for your proposal. Now that I ran into a
> HadoopUtils-related
> >> issue myself [1] I see the benefit in this proposal.
> >>
> >> I'm happy to be the Flink committer that mentors this change. If we do
> >> this, I would like to have a small scope for the initial change:
> >> - create a "flink-hadoop-utils" module
> >> - move generic, common utils into that module (for example
> >> SerializableHadoopConfiguration)
> >>
> >> I agree with Till that we should initially leave out the Hadoop
> >> compatibility modules.
> >>
> >> You can go ahead with filing a JIRA ticket! Let's discuss the exact
> scope
> >> there.
> >>
> >>
> >> [1] https://github.com/apache/flink/pull/12146
> >>
> >>
> >> On Thu, Apr 30, 2020 at 6:54 PM Sivaprasanna <sivaprasanna...@gmail.com
> >
> >> wrote:
> >>
> >> > Bump.
> >> >
> >> > Please let me know, if someone is interested in reviewing this one. I
> am
> >> > willing to start working on this. BTW, a small and new addition to the
> >> > list: With FLINK-10114 merged, OrcBulkWriterFactory can also reuse
> >> > `SerializableHadoopConfiguration` along with SequenceFileWriterFactory
> >> and
> >> > CompressWriterFactory.
> >> >
> >> > CC - Kostas Kloudas since he has a better understanding on the
> >> > `SerializableHadoopConfiguration.`
> >> >
> >> > Cheers,
> >> > Sivaprasanna
> >> >
> >> > On Mon, Mar 30, 2020 at 3:17 PM Chesnay Schepler <ches...@apache.org>
> >> > wrote:
> >> >
> >> > > I would recommend to wait until a committer has signed up for
> >> reviewing
> >> > > your changes before preparing any PR.
> >> > > Otherwise the chances are high that you invest a lot of time but the
> >> > > changes never get in.
> >> > >
> >> > > On 30/03/2020 11:42, Sivaprasanna wrote:
> >> > > > Hello Till,
> >> > > >
> >> > > > I agree with having the scope limited and more concentrated. I can
> >> > file a
> >> > > > Jira and get started with the code changes, as and when someone
> has
> >> > some
> >> > > > bandwidth, the review can also be done. What do you think?
> >> > > >
> >> > > > Cheers,
> >> > > > Sivaprasanna
> >> > > >
> >> > > > On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <
> trohrm...@apache.org
> >> >
> >> > > wrote:
> >> > > >
> >> > > >> Hi Sivaprasanna,
> >> > > >>
> >> > > >> thanks for starting this discussion. In general I like the idea
> to
> >> > > remove
> >> > > >> duplications and move common code to a shared module. As a
> >> > > recommendation,
> >> > > >> I would exclude the whole part about Flink's Hadoop compatibility
> >> > > modules
> >> > > >> because they are legacy code and hardly used anymore. This would
> >> also
> >> > > have
> >> > > >> the benefit of making the scope of the proposal a bit smaller.
> >> > > >>
> >> > > >> What we now need is a committer who wants to help with this
> >> effort. It
> >> > > >> might be that this takes a bit of time as many of the committers
> >> are
> >> > > quite
> >> > > >> busy.
> >> > > >>
> >> > > >> Cheers,
> >> > > >> Till
> >> > > >>
> >> > > >> On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <
> >> > sivaprasanna...@gmail.com
> >> > > >
> >> > > >> wrote:
> >> > > >>
> >> > > >>> Hi,
> >> > > >>>
> >> > > >>> Continuing on an earlier discussion[1] regarding having a
> separate
> >> > > module
> >> > > >>> for Hadoop related utility components, I have gone through our
> >> > project
> >> > > >>> briefly and found the following components which I feel could be
> >> > moved
> >> > > >> to a
> >> > > >>> separate module for reusability, and better module structure.
> >> > > >>>
> >> > > >>> Module Name Class Name Used at / Remarks
> >> > > >>>
> >> > > >>> flink-hadoop-fs
> >> > > >>> flink.runtime.util.HadoopUtils
> >> > > >>> flink-runtime => HadoopModule & HadoopModuleFactory
> >> > > >>> flink-swift-fs-hadoop => SwiftFileSystemFactory
> >> > > >>> flink-yarn => Utils, YarnClusterDescriptor
> >> > > >>>
> >> > > >>> flink-hadoop-compatability
> >> > > >>> api.java.hadoop.mapred.utils.HadoopUtils
> >> > > >>> Both belong to the same module but with different packages
> >> > > >>> (api.java.hadoop.mapred and api.java.hadoop.mapreduce)
> >> > > >>> api.java.hadoop.mapreduce.utils.HadoopUtils
> >> > > >>> flink-sequeunce-file
> >> > > >>> formats.sequeuncefile.SerializableHadoopConfiguration Currently,
> >> > > >>> it is used at formats.sequencefile.SequenceFileWriterFactory but
> >> can
> >> > > also
> >> > > >>> be used at HadoopCompressionBulkWriter, a potential
> OrcBulkWriter
> >> and
> >> > > >>> pretty much everywhere to avoid NotSerializableException.
> >> > > >>>
> >> > > >>> *Proposal*
> >> > > >>> To summarise, I believe we can create a new module
> >> > (flink-hadoop-utils
> >> > > ?)
> >> > > >>> and move these reusable components to this new module which will
> >> have
> >> > > an
> >> > > >>> optional/provided dependency on flink-shaded-hadoop-2.
> >> > > >>>
> >> > > >>> *Structure*
> >> > > >>> In the present form, I think we will have two classes with the
> >> > > packaging
> >> > > >>> structure being *org.apache.flink.hadoop.[utils/serialization]*
> >> > > >>> 1. HadoopUtils with all static methods ( after combining and
> >> > > eliminating
> >> > > >>> the duplicate code fragments from the three HadoopUtils classes
> >> > > mentioned
> >> > > >>> above)
> >> > > >>> 2. Move the existing SerializableHadoopConfiguration from the
> >> > > >>> flink-sequence-file to this new module .
> >> > > >>>
> >> > > >>> *Justification*
> >> > > >>> * With this change, we would be stripping away the dependency on
> >> > > >>> flink-hadoop-fs from flink-runtime as I don't see any other
> >> classes
> >> > > from
> >> > > >>> flink-hadoop-fs is being used anywhere in flink-runtime module.
> >> > > >>> * We will have a common place where all the utilities related to
> >> > Hadoop
> >> > > >> can
> >> > > >>> go which can be reused easily without leading to jar hell.
> >> > > >>>
> >> > > >>> In addition to this, if you are aware of any other classes that
> >> fit
> >> > in
> >> > > >> this
> >> > > >>> approach, please share the details here.
> >> > > >>>
> >> > > >>> *Note*
> >> > > >>> I don't have a complete understanding here but I did see two
> >> > > >>> implementations of the following classes under two different
> >> packages
> >> > > >>> *.mapred and *.mapreduce.
> >> > > >>> * HadoopInputFormat
> >> > > >>> * HadoopInputFormatBase
> >> > > >>> * HadoopOutputFormat
> >> > > >>> * HadoopOutputFormatBase
> >> > > >>>
> >> > > >>> Can we somehow figure and have them in this new module?
> >> > > >>>
> >> > > >>> Thanks,
> >> > > >>> Sivaprasanna
> >> > > >>>
> >> > > >>> [1]
> >> > > >>>
> >> > > >>>
> >> > > >>
> >> > >
> >> >
> >>
> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Reply via email to