FYI. I created a Jira to track this improvement. https://issues.apache.org/jira/browse/FLINK-18013
- Sivaprasanna

On Thu, May 28, 2020 at 12:22 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:

> Awesome. :)
> Thanks, Robert, for signing up to be the reviewer. I will create the Jira
> and share the link here.
>
> Stay safe.
>
> -
> Sivaprasanna
>
> On Thu, May 28, 2020 at 12:13 PM Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi Sivaprasanna,
>>
>> thanks a lot for your proposal. Now that I ran into a HadoopUtils-related
>> issue myself [1], I see the benefit in this proposal.
>>
>> I'm happy to be the Flink committer that mentors this change. If we do
>> this, I would like to have a small scope for the initial change:
>> - create a "flink-hadoop-utils" module
>> - move generic, common utils into that module (for example
>>   SerializableHadoopConfiguration)
>>
>> I agree with Till that we should initially leave out the Hadoop
>> compatibility modules.
>>
>> You can go ahead with filing a JIRA ticket! Let's discuss the exact scope
>> there.
>>
>> [1] https://github.com/apache/flink/pull/12146
>>
>> On Thu, Apr 30, 2020 at 6:54 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:
>>
>>> Bump.
>>>
>>> Please let me know if someone is interested in reviewing this one. I am
>>> willing to start working on this. BTW, a small and new addition to the
>>> list: with FLINK-10114 merged, OrcBulkWriterFactory can also reuse
>>> SerializableHadoopConfiguration, along with SequenceFileWriterFactory
>>> and CompressWriterFactory.
>>>
>>> CC - Kostas Kloudas, since he has a better understanding of
>>> SerializableHadoopConfiguration.
>>>
>>> Cheers,
>>> Sivaprasanna
>>>
>>> On Mon, Mar 30, 2020 at 3:17 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>>> I would recommend waiting until a committer has signed up for
>>>> reviewing your changes before preparing any PR.
>>>> Otherwise the chances are high that you invest a lot of time but the
>>>> changes never get in.
>>>>
>>>> On 30/03/2020 11:42, Sivaprasanna wrote:
>>>>> Hello Till,
>>>>>
>>>>> I agree with having the scope limited and more concentrated. I can
>>>>> file a Jira and get started with the code changes, and as and when
>>>>> someone has some bandwidth, the review can also be done. What do you
>>>>> think?
>>>>>
>>>>> Cheers,
>>>>> Sivaprasanna
>>>>>
>>>>> On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> Hi Sivaprasanna,
>>>>>>
>>>>>> thanks for starting this discussion. In general I like the idea of
>>>>>> removing duplication and moving common code to a shared module. As a
>>>>>> recommendation, I would exclude the whole part about Flink's Hadoop
>>>>>> compatibility modules because they are legacy code and hardly used
>>>>>> anymore. This would also have the benefit of making the scope of the
>>>>>> proposal a bit smaller.
>>>>>>
>>>>>> What we now need is a committer who wants to help with this effort.
>>>>>> It might be that this takes a bit of time, as many of the committers
>>>>>> are quite busy.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Continuing an earlier discussion [1] about having a separate module
>>>>>>> for Hadoop-related utility components, I have gone through our
>>>>>>> project briefly and found the following components which I feel
>>>>>>> could be moved to a separate module for reusability and a better
>>>>>>> module structure.
>> > > >>> >> > > >>> Module Name Class Name Used at / Remarks >> > > >>> >> > > >>> flink-hadoop-fs >> > > >>> flink.runtime.util.HadoopUtils >> > > >>> flink-runtime => HadoopModule & HadoopModuleFactory >> > > >>> flink-swift-fs-hadoop => SwiftFileSystemFactory >> > > >>> flink-yarn => Utils, YarnClusterDescriptor >> > > >>> >> > > >>> flink-hadoop-compatability >> > > >>> api.java.hadoop.mapred.utils.HadoopUtils >> > > >>> Both belong to the same module but with different packages >> > > >>> (api.java.hadoop.mapred and api.java.hadoop.mapreduce) >> > > >>> api.java.hadoop.mapreduce.utils.HadoopUtils >> > > >>> flink-sequeunce-file >> > > >>> formats.sequeuncefile.SerializableHadoopConfiguration Currently, >> > > >>> it is used at formats.sequencefile.SequenceFileWriterFactory but >> can >> > > also >> > > >>> be used at HadoopCompressionBulkWriter, a potential OrcBulkWriter >> and >> > > >>> pretty much everywhere to avoid NotSerializableException. >> > > >>> >> > > >>> *Proposal* >> > > >>> To summarise, I believe we can create a new module >> > (flink-hadoop-utils >> > > ?) >> > > >>> and move these reusable components to this new module which will >> have >> > > an >> > > >>> optional/provided dependency on flink-shaded-hadoop-2. >> > > >>> >> > > >>> *Structure* >> > > >>> In the present form, I think we will have two classes with the >> > > packaging >> > > >>> structure being *org.apache.flink.hadoop.[utils/serialization]* >> > > >>> 1. HadoopUtils with all static methods ( after combining and >> > > eliminating >> > > >>> the duplicate code fragments from the three HadoopUtils classes >> > > mentioned >> > > >>> above) >> > > >>> 2. Move the existing SerializableHadoopConfiguration from the >> > > >>> flink-sequence-file to this new module . 
>> > > >>> >> > > >>> *Justification* >> > > >>> * With this change, we would be stripping away the dependency on >> > > >>> flink-hadoop-fs from flink-runtime as I don't see any other >> classes >> > > from >> > > >>> flink-hadoop-fs is being used anywhere in flink-runtime module. >> > > >>> * We will have a common place where all the utilities related to >> > Hadoop >> > > >> can >> > > >>> go which can be reused easily without leading to jar hell. >> > > >>> >> > > >>> In addition to this, if you are aware of any other classes that >> fit >> > in >> > > >> this >> > > >>> approach, please share the details here. >> > > >>> >> > > >>> *Note* >> > > >>> I don't have a complete understanding here but I did see two >> > > >>> implementations of the following classes under two different >> packages >> > > >>> *.mapred and *.mapreduce. >> > > >>> * HadoopInputFormat >> > > >>> * HadoopInputFormatBase >> > > >>> * HadoopOutputFormat >> > > >>> * HadoopOutputFormatBase >> > > >>> >> > > >>> Can we somehow figure and have them in this new module? >> > > >>> >> > > >>> Thanks, >> > > >>> Sivaprasanna >> > > >>> >> > > >>> [1] >> > > >>> >> > > >>> >> > > >> >> > > >> > >> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E >> > > >> > > >> > > >> > >> >