Robert, Chamikara, Yes, I agree that we need to give enough time for that. I’m fine to wait until 3.0
> On 12 Sep 2018, at 19:27, Chamikara Jayalath <chamik...@google.com> wrote: > > +1 for going with option 3. > > On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <rober...@google.com > <mailto:rober...@google.com>> wrote: > On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <aromanenko....@gmail.com > <mailto:aromanenko....@gmail.com>> wrote: > Thank you everybody for your feedback! > > I think we can conclude that the most popular option, according to discussion > above, is number 3. Not sure if we need to do a separate vote for that but, > please, let me know if we need. > > So, for now, I’d split a work into the following steps: > a) Create new module "hadoop-mapreduce-format” which implements support for > MapReduce OutputFormat through new HadoopMapreduceFormat.Write class. For > that, I just need to change a bit my already created PR 6306 > <https://github.com/apache/beam/pull/6306> that I added recently (renaming of > module and class names). > b) Move all source and test classes of “hadoop-input-format” into the module > "hadoop-mapreduce-format” and create new class HadoopMapreduceFormat.Read > there to support MapReduce InputFormat. > c) Make old HadoopInputFormat.Read (in old “hadoop-input-format” module) > deprecated and as proxy class to newly created HadoopMapreduceFormat.Read (to > keep API compatibility) > > Sounds like a great plan. > > These 3 steps should be performed and completed within one release cycle > (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid having > a huge commit if it will include step “a” as well. > > Big +1. > > Then, in next release after: > d) Remove completely module “hadoop-input-format” (approx. in 2.9). > > I don't think we'd be able to remove this until 3.0. > > I think we technically we can remove HadoopInputFormat before 3.0 since it's > marked as experimental [1] but I'd suggest keeping it deprecated for at least > two releases (3 months) before removal. Not sure if we have a policy on this. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177 > > <https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177> > > > > Other two Hadoop modules (common and file-system) we leave as it is. > > I hope that this a correct summary of what community decided and I can move > forward. > > Sounds good. > > Please, let me know if there any objections against this plan or other > suggestions. > > >> On 11 Sep 2018, at 16:08, Thomas Weise <t...@apache.org >> <mailto:t...@apache.org>> wrote: >> >> I'm in favor of a combination of 2) and 3): New module >> "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify >> what it is). Turn existing " hadoop-input-format" into a proxy for new >> module for backward compatibility (marked deprecated and removed in next >> major version). >> >> I don't think everything "Hadoop" should be merged, purpose and usage is >> just too different. As an example, the Hadoop file system abstraction itself >> has implementation for multiple other systems and is not limited to HDFS. >> >> On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <aromanenko....@gmail.com >> <mailto:aromanenko....@gmail.com>> wrote: >> Dharmendra, >> For now, you can’t write with Hadoop MapReduce OutputFormat. However, you >> can use FileIO or TextIO to write to HDFS, these IOs support different file >> systems. >> >>> On 11 Sep 2018, at 11:11, dharmendra pratap singh <dharmendra0...@gmail.com >>> <mailto:dharmendra0...@gmail.com>> wrote: >>> >>> Hello Team, >>> Does this mean, as of today we can read from Hadoop FS but can't write to >>> Hadoop FS using Beam HDFS API ? >>> >>> Regards >>> Dharmendra >>> >>> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <aromanenko....@gmail.com >>> <mailto:aromanenko....@gmail.com>> wrote: >>> Hello everyone, >>> >>> I’d like to discuss the following topic (see below) with community since >>> the optimal solution is not clear for me. >>> >>> There is Java IO module, called “hadoop-input-format”, which allows to use >>> MapReduce InputFormat implementations to read data from different sources >>> (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). According >>> to its name, it has only “Read" and it's missing “Write” part, so, I'm >>> working on “hadoop-output-format” to support MapReduce OutputFormat (PR >>> 6306 <https://github.com/apache/beam/pull/6306>). For this I created >>> another module with this name. So, in the end, we will have two different >>> modules “hadoop-input-format” and “hadoop-output-format” and it looks quite >>> strange for me since, afaik, every existed Java IO, that we have, >>> incapsulates Read and Write parts into one module. Additionally, we have >>> “hadoop-common” and “hadoop-file-system” as other hadoop-related modules. >>> >>> Now I’m thinking how it will be better to organise all these Hadoop modules >>> better. There are several options in my mind: >>> >>> 1) Add new module “hadoop-output-format” and leave all Hadoop modules “as >>> it is”. >>> Pros: no breaking changes, no additional work >>> Cons: not logical for users to have the same IO in two different >>> modules and with different names. >>> >>> 2) Merge “hadoop-input-format” and “hadoop-output-format” into one module >>> called, say, “hadoop-format” or “hadoop-mapreduce-format”, keep the other >>> Hadoop modules “as it is”. >>> Pros: to have InputFormat/OutputFormat in one IO module which is >>> logical for users >>> Cons: breaking changes for user code because of module/IO renaming >>> >>> 3) Add new module “hadoop-format” (or “hadoop-mapreduce-format”) which will >>> include new “write” functionality and be a proxy for old >>> “hadoop-input-format”. In its turn, “hadoop-input-format” should become >>> deprecated and be finally moved to common “hadoop-format” module in future >>> releases. Keep the other Hadoop modules “as it is”. >>> Pros: finally it will be only one module for hadoop MR format; changes >>> are less painful for user >>> Cons: hidden difficulties of implementation this strategy; a bit >>> confusing for user >>> >>> 4) Add new module “hadoop” and move all already existed modules there as >>> submodules (like we have for “io/google-cloud-platform”), merge >>> “hadoop-input-format” and “hadoop-output-format” into one module. >>> Pros: unification of all hadoop-related modules >>> Cons: breaking changes for user code, additional complexity with deps >>> and testing >>> >>> 5) Your suggestion?.. >>> >>> My personal preferences are lying between 2 and 3 (if 3 is possible). >>> >>> I’m wondering if there were similar situations in Beam before and how it >>> was finally resolved. If yes then probably we need to do here in similar >>> way. >>> Any suggestions/advices/comments would be very appreciated. >>> >>> Thanks, >>> Alexey