Re: [DISCUSS] Unification of Hadoop related IO modules

Alexey Romanenko Thu, 13 Sep 2018 11:26:56 -0700

Robert, Chamikara,
Yes, I agree that we need to allow enough time for that. I’m fine with waiting 
until 3.0.
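
As a rough illustration of step (c) in the plan quoted below, here is a minimal 
Java sketch of a deprecated class that only delegates to its replacement. 
NewRead and OldRead are hypothetical stand-ins for HadoopMapreduceFormat.Read 
and HadoopInputFormat.Read. The method names withInputFormat and describe are 
invented for the example, and none of this is the actual Beam API.

```java
import java.util.Objects;

/** Sketch of the deprecate-and-delegate pattern; not actual Beam code. */
public class ProxySketch {

  /** Hypothetical stand-in for the new HadoopMapreduceFormat.Read. */
  static final class NewRead {
    private final String inputFormatClass;

    private NewRead(String inputFormatClass) {
      this.inputFormatClass = Objects.requireNonNull(inputFormatClass);
    }

    static NewRead withInputFormat(String inputFormatClass) {
      return new NewRead(inputFormatClass);
    }

    String describe() {
      return "read via " + inputFormatClass;
    }
  }

  /**
   * Hypothetical stand-in for the old HadoopInputFormat.Read.
   * It keeps the old entry point alive but holds no logic of its own.
   */
  @Deprecated
  static final class OldRead {
    private final NewRead delegate;

    private OldRead(NewRead delegate) {
      this.delegate = delegate;
    }

    static OldRead withInputFormat(String inputFormatClass) {
      // Forward construction to the new class so behaviour stays identical.
      return new OldRead(NewRead.withInputFormat(inputFormatClass));
    }

    String describe() {
      return delegate.describe();
    }
  }

  @SuppressWarnings("deprecation")
  public static void main(String[] args) {
    String name = "org.apache.hadoop.mapreduce.lib.db.DBInputFormat";
    // Old and new entry points produce the same result.
    System.out.println(OldRead.withInputFormat(name).describe());
    System.out.println(NewRead.withInputFormat(name).describe());
  }
}
```

With this shape, existing user code keeps compiling (with a deprecation 
warning) while all behaviour lives in the new class only.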


> On 12 Sep 2018, at 19:27, Chamikara Jayalath <chamik...@google.com> wrote:
> 
> +1 for going with option 3.
> 
> On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <rober...@google.com> wrote:
> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <aromanenko....@gmail.com> 
> wrote:
> Thank you everybody for your feedback!
> 
> I think we can conclude that the most popular option, according to the 
> discussion above, is number 3. I'm not sure whether we need a separate vote 
> for that, but please let me know if we do.
> 
> So, for now, I’d split the work into the following steps:
> a) Create a new module “hadoop-mapreduce-format” which implements support for 
> MapReduce OutputFormat through a new HadoopMapreduceFormat.Write class. For 
> that, I just need to slightly change my existing PR 6306 
> <https://github.com/apache/beam/pull/6306> that I added recently (renaming the 
> module and classes).
> b) Move all source and test classes of “hadoop-input-format” into the module 
> “hadoop-mapreduce-format” and create a new class HadoopMapreduceFormat.Read 
> there to support MapReduce InputFormat.
> c) Deprecate the old HadoopInputFormat.Read (in the old “hadoop-input-format” 
> module) and turn it into a proxy class for the newly created 
> HadoopMapreduceFormat.Read (to keep API compatibility).
> 
> Sounds like a great plan. 
>  
> These 3 steps should be completed within one release cycle (approx. in 2.8). 
> For steps “b” and “c” I’d create a separate PR, to avoid one huge commit that 
> also includes step “a”.
> 
> Big +1. 
>  
> Then, in the following release:
> d) Completely remove the “hadoop-input-format” module (approx. in 2.9).
> 
> I don't think we'd be able to remove this until 3.0. 
> 
> I think technically we can remove HadoopInputFormat before 3.0 since it's 
> marked as experimental [1], but I'd suggest keeping it deprecated for at least 
> two releases (3 months) before removal. Not sure if we have a policy on this.
> 
> [1] 
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177
> 
>  
> 
> The other two Hadoop modules (common and file-system) we leave as they are.
> 
> I hope that this is a correct summary of what the community decided and that 
> I can move forward.
> 
> Sounds good. 
>  
> Please let me know if there are any objections to this plan or other 
> suggestions.
> 
> 
>> On 11 Sep 2018, at 16:08, Thomas Weise <t...@apache.org> wrote:
>> 
>> I'm in favor of a combination of 2) and 3): a new module 
>> "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify 
>> what it is). Turn the existing "hadoop-input-format" into a proxy for the 
>> new module for backward compatibility (marked deprecated and removed in the 
>> next major version).
>> 
>> I don't think everything "Hadoop" should be merged; the purposes and usages 
>> are just too different. As an example, the Hadoop file system abstraction 
>> itself has implementations for multiple other systems and is not limited to 
>> HDFS.
>> 
>> On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <aromanenko....@gmail.com> 
>> wrote:
>> Dharmendra,
>> For now, you can’t write with Hadoop MapReduce OutputFormat. However, you 
>> can use FileIO or TextIO to write to HDFS; these IOs support different file 
>> systems.
>> 
>>> On 11 Sep 2018, at 11:11, dharmendra pratap singh <dharmendra0...@gmail.com> 
>>> wrote:
>>> 
>>> Hello Team,
>>> Does this mean that, as of today, we can read from Hadoop FS but can't 
>>> write to Hadoop FS using the Beam HDFS API?
>>> 
>>> Regards
>>> Dharmendra
>>> 
>>> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <aromanenko....@gmail.com> 
>>> wrote:
>>> Hello everyone,
>>> 
>>> I’d like to discuss the following topic (see below) with the community, 
>>> since the optimal solution is not clear to me.
>>> 
>>> There is a Java IO module called “hadoop-input-format”, which allows using 
>>> MapReduce InputFormat implementations to read data from different sources 
>>> (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). As its 
>>> name suggests, it has only a “Read” part and is missing a “Write” part, so 
>>> I'm working on “hadoop-output-format” to support MapReduce OutputFormat (PR 
>>> 6306 <https://github.com/apache/beam/pull/6306>). For this I created 
>>> another module with this name. So, in the end, we will have two different 
>>> modules, “hadoop-input-format” and “hadoop-output-format”, which looks 
>>> quite strange to me since, afaik, every existing Java IO that we have 
>>> encapsulates its Read and Write parts in one module. Additionally, we have 
>>> “hadoop-common” and “hadoop-file-system” as other Hadoop-related modules.
>>> 
>>> Now I’m thinking about how best to organise all these Hadoop modules. I 
>>> have several options in mind:
>>> 
>>> 1) Add a new module “hadoop-output-format” and leave all Hadoop modules “as 
>>> they are”.
>>>     Pros: no breaking changes, no additional work
>>>     Cons: not logical for users to have the same IO in two different 
>>> modules and with different names.
>>> 
>>> 2) Merge “hadoop-input-format” and “hadoop-output-format” into one module 
>>> called, say, “hadoop-format” or “hadoop-mapreduce-format”, and keep the 
>>> other Hadoop modules “as they are”.
>>>     Pros: InputFormat/OutputFormat are in one IO module, which is logical 
>>> for users
>>>     Cons: breaking changes for user code because of module/IO renaming
>>> 
>>> 3) Add a new module “hadoop-format” (or “hadoop-mapreduce-format”) which 
>>> will include the new “write” functionality and be a proxy for the old 
>>> “hadoop-input-format”. In turn, “hadoop-input-format” should become 
>>> deprecated and finally be moved to the common “hadoop-format” module in 
>>> future releases. Keep the other Hadoop modules “as they are”.
>>>     Pros: eventually there will be only one module for the Hadoop MR 
>>> format; changes are less painful for users
>>>     Cons: hidden difficulties in implementing this strategy; a bit 
>>> confusing for users
>>> 
>>> 4) Add a new module “hadoop” and move all existing modules there as 
>>> submodules (like we have for “io/google-cloud-platform”), and merge 
>>> “hadoop-input-format” and “hadoop-output-format” into one module.
>>>     Pros: unification of all Hadoop-related modules
>>>     Cons: breaking changes for user code, additional complexity with deps 
>>> and testing
>>> 
>>> 5) Your suggestion?..
>>> 
>>> My personal preference lies between 2 and 3 (if 3 is possible).
>>> 
>>> I’m wondering if there were similar situations in Beam before and how they 
>>> were finally resolved. If so, we should probably handle this in a similar 
>>> way.
>>> Any suggestions/advice/comments would be much appreciated.
>>> 
>>> Thanks,
>>> Alexey
