Hi,

Continuing an earlier discussion[1] about having a separate module for
Hadoop-related utility components, I have gone through our project briefly
and found the following components which I feel could be moved to a
separate module for better reusability and module structure.

Module Name: flink-hadoop-fs
Class Name: flink.runtime.util.HadoopUtils
Used at / Remarks:
  flink-runtime => HadoopModule & HadoopModuleFactory
  flink-swift-fs-hadoop => SwiftFileSystemFactory
  flink-yarn => Utils, YarnClusterDescriptor

Module Name: flink-hadoop-compatibility
Class Names: api.java.hadoop.mapred.utils.HadoopUtils
             api.java.hadoop.mapreduce.utils.HadoopUtils
Used at / Remarks: Both belong to the same module but under different
packages (api.java.hadoop.mapred and api.java.hadoop.mapreduce).

Module Name: flink-sequence-file
Class Name: formats.sequencefile.SerializableHadoopConfiguration
Used at / Remarks: Currently used by
formats.sequencefile.SequenceFileWriterFactory, but it can also be used by
HadoopCompressionBulkWriter, a potential OrcBulkWriter, and pretty much
anywhere else to avoid NotSerializableException (see the illustrative
sketch after this table).
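
For anyone not familiar with it, the idea behind
SerializableHadoopConfiguration is roughly the following. This is only a
minimal sketch of such a wrapper (not the actual Flink source), assuming
the usual Writable-based write()/readFields() round trip on Hadoop's
Configuration:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import org.apache.hadoop.conf.Configuration;

// Sketch of a serializable wrapper around Hadoop's Configuration, which
// itself does not implement java.io.Serializable. The configuration is
// written/read via Hadoop's own write()/readFields() during Java
// serialization, so user functions holding it do not fail with
// NotSerializableException.
public class SerializableHadoopConfiguration implements Serializable {

    private static final long serialVersionUID = 1L;

    private transient Configuration hadoopConfig;

    public SerializableHadoopConfiguration(Configuration hadoopConfig) {
        this.hadoopConfig = hadoopConfig;
    }

    public Configuration get() {
        return this.hadoopConfig;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        this.hadoopConfig.write(out);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.hadoopConfig = new Configuration();
        this.hadoopConfig.readFields(in);
    }
}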

*Proposal*
To summarise, I believe we can create a new module (flink-hadoop-utils?)
and move these reusable components into it; the new module would have an
optional/provided dependency on flink-shaded-hadoop-2.

*Structure*
In the present form, I think we will have two classes, with the package
structure being *org.apache.flink.hadoop.[utils/serialization]*:
1. HadoopUtils with all static methods (after combining and eliminating
the duplicate code fragments from the three HadoopUtils classes mentioned
above); a rough sketch follows this list.
2. The existing SerializableHadoopConfiguration, moved from
flink-sequence-file to this new module.
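
For illustration, the consolidated class could look roughly like this.
This is only a sketch; the method name getHadoopConfiguration and its
signature are assumptions borrowed from the shape of the existing
utilities, not a final API:

package org.apache.flink.hadoop.utils;

// Illustrative sketch only: method names and signatures are assumptions
// about what the consolidated class could expose.
public final class HadoopUtils {

    private HadoopUtils() {
        // static utility class, no instances
    }

    /**
     * Merges Hadoop settings found in the Flink configuration (and in the
     * usual HADOOP_CONF_DIR locations) into a single Hadoop Configuration.
     */
    public static org.apache.hadoop.conf.Configuration getHadoopConfiguration(
            org.apache.flink.configuration.Configuration flinkConfig) {
        org.apache.hadoop.conf.Configuration hadoopConfig =
                new org.apache.hadoop.conf.Configuration();
        // ... consolidate the lookup logic currently duplicated in the
        // flink-runtime and flink-hadoop-compatibility HadoopUtils here ...
        return hadoopConfig;
    }
}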

*Justification*
* With this change, we would be stripping away the dependency on
flink-hadoop-fs from flink-runtime, as I don't see any other classes from
flink-hadoop-fs being used anywhere in the flink-runtime module.
* We will have a common place for all Hadoop-related utilities, which can
be reused easily without leading to jar hell.

In addition to this, if you are aware of any other classes that fit into
this approach, please share the details here.

*Note*
I don't have a complete understanding here, but I did see two
implementations of the following classes under two different packages,
*.mapred and *.mapreduce:
* HadoopInputFormat
* HadoopInputFormatBase
* HadoopOutputFormat
* HadoopOutputFormatBase

Can we somehow figure out a way to have them in this new module as well?

Thanks,
Sivaprasanna

[1]
https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
