Re: [DISCUSS] Unification of Hadoop related IO modules

Alexey Romanenko Tue, 11 Sep 2018 05:47:40 -0700

Dharmendra,
For now, you can’t write with Hadoop MapReduce OutputFormat. However, you can 
use FileIO or TextIO to write to HDFS; these IOs support different file systems.
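
As a rough illustration of the TextIO route, here is a minimal sketch. The 
class name, the namenode address and the output path are made up for the 
example. It assumes that beam-sdks-java-core, 
beam-sdks-java-io-hadoop-file-system, the Hadoop client libraries and a runner 
are on the classpath, and that the HDFS configuration is passed on the command 
line via --hdfsConfiguration.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

// Hypothetical example class, not part of Beam.
public class WriteToHdfsSketch {

  // Builds and runs a tiny pipeline that writes two lines under outputPrefix.
  // TextIO picks the file system from the scheme of the prefix (hdfs://, file path, ...).
  static void writeLines(PipelineOptions options, String outputPrefix) {
    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply(Create.of(Arrays.asList("first line", "second line")))
        .apply(TextIO.write().to(outputPrefix).withSuffix(".txt"));
    pipeline.run().waitUntilFinish();
  }

  public static void main(String[] args) {
    // Assumed invocation (address is a placeholder):
    //   --hdfsConfiguration=[{"fs.defaultFS":"hdfs://namenode:8020"}]
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(HadoopFileSystemOptions.class);

    // Placeholder output location; adjust to your cluster.
    writeLines(options, "hdfs://namenode:8020/tmp/beam/output");
  }
}
```

The same writeLines method should work with a local path as the prefix, which 
is a convenient way to try the pipeline before pointing it at a cluster.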


> On 11 Sep 2018, at 11:11, dharmendra pratap singh <dharmendra0...@gmail.com> 
> wrote:
> 
> Hello Team,
> Does this mean that, as of today, we can read from Hadoop FS but can't write 
> to Hadoop FS using the Beam HDFS API?
> 
> Regards
> Dharmendra
> 
> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <aromanenko....@gmail.com 
> <mailto:aromanenko....@gmail.com>> wrote:
> Hello everyone,
> 
> I’d like to discuss the following topic (see below) with the community, since 
> the optimal solution is not clear to me.
> 
> There is a Java IO module called “hadoop-input-format”, which allows using 
> MapReduce InputFormat implementations to read data from different sources 
> (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). As its name 
> suggests, it has only the “Read” part and is missing the “Write” part, so I'm 
> working on “hadoop-output-format” to support MapReduce OutputFormat (PR 6306 
> <https://github.com/apache/beam/pull/6306>). For this I created another 
> module with this name. So, in the end, we will have two different modules, 
> “hadoop-input-format” and “hadoop-output-format”, which looks quite strange 
> to me since, afaik, every existing Java IO that we have encapsulates its Read 
> and Write parts in one module. Additionally, we have “hadoop-common” and 
> “hadoop-file-system” as other Hadoop-related modules. 
> 
> Now I’m thinking about how best to organise all these Hadoop modules. There 
> are several options in my mind: 
> 
> 1) Add a new module “hadoop-output-format” and leave all Hadoop modules 
> “as is”. 
>       Pros: no breaking changes, no additional work 
>       Cons: not logical for users to have the same IO in two different 
> modules with different names.
> 
> 2) Merge “hadoop-input-format” and “hadoop-output-format” into one module 
> called, say, “hadoop-format” or “hadoop-mapreduce-format”, and keep the other 
> Hadoop modules “as is”.
>       Pros: InputFormat/OutputFormat live in one IO module, which is 
> logical for users
>       Cons: breaking changes for user code because of module/IO renaming 
> 
> 3) Add a new module “hadoop-format” (or “hadoop-mapreduce-format”) which will 
> include the new “write” functionality and be a proxy for the old 
> “hadoop-input-format”. In turn, “hadoop-input-format” should be deprecated 
> and eventually moved into the common “hadoop-format” module in future 
> releases. Keep the other Hadoop modules “as is”.
>       Pros: eventually there will be only one module for Hadoop MR formats; 
> the change is less painful for users
>       Cons: hidden difficulties in implementing this strategy; a bit 
> confusing for users 
> 
> 4) Add a new module “hadoop” and move all existing modules there as 
> submodules (like we have for “io/google-cloud-platform”), and merge 
> “hadoop-input-format” and “hadoop-output-format” into one module. 
>       Pros: unification of all Hadoop-related modules
>       Cons: breaking changes for user code, additional complexity with deps 
> and testing
> 
> 5) Your suggestion?..
> 
> My personal preference lies between 2 and 3 (if 3 is possible). 
> 
> I’m wondering if there were similar situations in Beam before and how they 
> were finally resolved. If so, we should probably handle this in a similar way.
> Any suggestions/advice/comments would be much appreciated.
> 
> Thanks,
> Alexey
