Hi,

Continuing an earlier discussion [1] about having a separate module for Hadoop-related utility components, I have gone through our project briefly and found the following components which I feel could be moved to a separate module for reusability and a better module structure.
1. Module: flink-hadoop-fs
   Class: flink.runtime.util.HadoopUtils
   Used at: flink-runtime => HadoopModule & HadoopModuleFactory;
            flink-swift-fs-hadoop => SwiftFileSystemFactory;
            flink-yarn => Utils, YarnClusterDescriptor

2. Module: flink-hadoop-compatibility
   Classes: api.java.hadoop.mapred.utils.HadoopUtils and
            api.java.hadoop.mapreduce.utils.HadoopUtils
   Remarks: both belong to the same module but live in different packages
            (api.java.hadoop.mapred and api.java.hadoop.mapreduce).

3. Module: flink-sequence-file
   Class: formats.sequencefile.SerializableHadoopConfiguration
   Remarks: currently used at formats.sequencefile.SequenceFileWriterFactory, but it
            can also be used at HadoopCompressionBulkWriter, a potential OrcBulkWriter,
            and pretty much everywhere else to avoid NotSerializableException.

*Proposal*

To summarise, I believe we can create a new module (flink-hadoop-utils ?) and move these reusable components to it. The new module would have an optional/provided dependency on flink-shaded-hadoop-2.

*Structure*

In its present form, I think we would have two classes, with the package structure being *org.apache.flink.hadoop.[utils/serialization]*:

1. HadoopUtils with all static methods (after combining the three HadoopUtils classes mentioned above and eliminating the duplicate code fragments).
2. The existing SerializableHadoopConfiguration, moved from flink-sequence-file to this new module.

(Rough sketches of both classes are attached at the end of this mail, purely for illustration.)

*Justification*

* With this change, we would strip away the dependency on flink-hadoop-fs from flink-runtime, as I don't see any other class from flink-hadoop-fs being used anywhere in the flink-runtime module.
* We will have a common place for all Hadoop-related utilities, which can then be reused easily without leading to jar hell.

In addition to this, if you are aware of any other classes that fit this approach, please share the details here.

*Note*

I don't have a complete understanding here, but I did see two implementations of the following classes under two different packages, *.mapred and *.mapreduce:

* HadoopInputFormat
* HadoopInputFormatBase
* HadoopOutputFormat
* HadoopOutputFormatBase

Can we figure out a way to move them into this new module as well?

Thanks,
Sivaprasanna

[1] https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
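
*Sketches (for illustration only)*

A minimal skeleton of how the consolidated utility class could look. The package follows the structure proposed above; the single method shown is only a placeholder to illustrate the "static methods only" shape of the class, not a final API:

    package org.apache.flink.hadoop.utils;

    import org.apache.hadoop.conf.Configuration;

    /** Hypothetical merged utility class: stateless, static methods only. */
    public final class HadoopUtils {

        private HadoopUtils() {
            // no instantiation
        }

        /**
         * Placeholder for the kind of logic currently duplicated across the three
         * HadoopUtils classes, e.g. building a Hadoop Configuration from the Flink
         * configuration and the usual HADOOP_CONF_DIR / HADOOP_HOME lookups.
         */
        public static Configuration getHadoopConfiguration(
                org.apache.flink.configuration.Configuration flinkConfiguration) {
            Configuration hadoopConfig = new Configuration();
            // ... resolve and add the Hadoop configuration resources here ...
            return hadoopConfig;
        }
    }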
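
And a rough sketch of the serialization wrapper that would move out of flink-sequence-file, modelled on the existing formats.sequencefile.SerializableHadoopConfiguration (package name again taken from the proposed structure):

    package org.apache.flink.hadoop.serialization;

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    import org.apache.hadoop.conf.Configuration;

    /**
     * Serializable wrapper around Hadoop's Configuration, which is not
     * java.io.Serializable itself. Java serialization is delegated to
     * Configuration's own Writable-style write()/readFields() methods.
     */
    public class SerializableHadoopConfiguration implements Serializable {

        private static final long serialVersionUID = 1L;

        private transient Configuration hadoopConfig;

        public SerializableHadoopConfiguration(Configuration hadoopConfig) {
            this.hadoopConfig = hadoopConfig;
        }

        public Configuration get() {
            return hadoopConfig;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            hadoopConfig.write(out);
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            hadoopConfig = new Configuration();
            hadoopConfig.readFields(in);
        }
    }

Any writer factory or user function that needs a Hadoop Configuration on the task side could then hold this wrapper as a field instead of the raw Configuration, which is what currently causes the NotSerializableException.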