I would recommend waiting until a committer has signed up to review your changes before preparing any PR. Otherwise there is a high chance that you invest a lot of time but the changes never get in.

On 30/03/2020 11:42, Sivaprasanna wrote:
Hello Till,

I agree with keeping the scope limited and more focused. I can file a
Jira and get started with the code changes; as and when someone has some
bandwidth, the review can be done as well. What do you think?

Cheers,
Sivaprasanna

On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Sivaprasanna,

thanks for starting this discussion. In general I like the idea of removing
duplication and moving common code to a shared module. As a recommendation,
I would exclude the whole part about Flink's Hadoop compatibility modules,
because they are legacy code and hardly used anymore. This would also have
the benefit of making the scope of the proposal a bit smaller.

What we now need is a committer who wants to help with this effort. It
might be that this takes a bit of time as many of the committers are quite
busy.

Cheers,
Till

On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <sivaprasanna...@gmail.com>
wrote:

Hi,

Continuing an earlier discussion [1] about having a separate module
for Hadoop-related utility components, I went through our project
briefly and found the following components that I feel could be moved
to a separate module for reusability and a better module structure.

Module: flink-hadoop-fs
  Class: flink.runtime.util.HadoopUtils
  Used at / Remarks:
    - flink-runtime => HadoopModule & HadoopModuleFactory
    - flink-swift-fs-hadoop => SwiftFileSystemFactory
    - flink-yarn => Utils, YarnClusterDescriptor

Module: flink-hadoop-compatibility
  Classes: api.java.hadoop.mapred.utils.HadoopUtils,
           api.java.hadoop.mapreduce.utils.HadoopUtils
  Used at / Remarks: both belong to the same module but under different
  packages (api.java.hadoop.mapred and api.java.hadoop.mapreduce).

Module: flink-sequence-file
  Class: formats.sequencefile.SerializableHadoopConfiguration
  Used at / Remarks: currently used in
  formats.sequencefile.SequenceFileWriterFactory, but it could also be
  used in HadoopCompressionBulkWriter, a potential OrcBulkWriter, and
  pretty much everywhere else to avoid NotSerializableException.
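For anyone unfamiliar with why SerializableHadoopConfiguration helps here: Hadoop's Configuration is not java.io.Serializable, so the usual fix is a wrapper with custom serialization hooks. A minimal self-contained sketch of that pattern follows; it uses java.util.Properties as a stand-in for Hadoop's Configuration purely so the example compiles without Hadoop on the classpath, and the class name is made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Sketch of the wrapper pattern: hold the config in a transient field and
// write/read its contents by hand in writeObject/readObject. The real
// SerializableHadoopConfiguration delegates to Configuration's own
// write(DataOutput)/readFields(DataInput); Properties stands in here.
class SerializableConfigSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient Properties config;

    SerializableConfigSketch(Properties config) {
        this.config = config;
    }

    Properties get() {
        return config;
    }

    // Serialize the wrapped config's key/value pairs explicitly.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(config.size());
        for (String name : config.stringPropertyNames()) {
            out.writeUTF(name);
            out.writeUTF(config.getProperty(name));
        }
    }

    // Rebuild the config when deserializing.
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        config = new Properties();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            config.setProperty(in.readUTF(), in.readUTF());
        }
    }
}
```

Keeping the field transient and rebuilding it in readObject is what lets user functions (e.g. a BulkWriter factory) carry the configuration through Flink's serialization without hitting NotSerializableException.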

*Proposal*
To summarise, I believe we can create a new module (flink-hadoop-utils?)
and move these reusable components into it; the new module would have an
optional/provided dependency on flink-shaded-hadoop-2.
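As a rough sketch of what that could look like in the new module's pom (the version properties are placeholders, not the actual coordinates):

```xml
<!-- Illustrative sketch only: version properties are placeholders. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-hadoop-2</artifactId>
    <version>${hadoop.version}-${flink.shaded.version}</version>
    <!-- optional, so downstream modules only pull in Hadoop when they need it -->
    <optional>true</optional>
</dependency>
```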

*Structure*
In its present form, I think we will have two classes, with the package
structure being *org.apache.flink.hadoop.[utils/serialization]*:
1. HadoopUtils with all static methods (after combining and eliminating
the duplicate code fragments from the three HadoopUtils classes mentioned
above).
2. The existing SerializableHadoopConfiguration, moved from
flink-sequence-file to this new module.

*Justification*
* With this change, we would strip away flink-runtime's dependency on
flink-hadoop-fs, as I don't see any other classes from flink-hadoop-fs
being used anywhere in the flink-runtime module.
* We will have a common place where all Hadoop-related utilities can go,
which can be reused easily without leading to jar hell.

In addition to this, if you are aware of any other classes that fit this
approach, please share the details here.

*Note*
I don't have a complete understanding here, but I did see two
implementations of each of the following classes under two different
packages, *.mapred and *.mapreduce:
* HadoopInputFormat
* HadoopInputFormatBase
* HadoopOutputFormat
* HadoopOutputFormatBase

Can we somehow figure this out and move them into this new module as well?

Thanks,
Sivaprasanna

[1]


https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E

