Re: [DISCUSS] Unification of Hadoop related IO modules

Thomas Weise Tue, 11 Sep 2018 07:09:02 -0700

I'm in favor of a combination of 2) and 3): a new module,
"hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify
what it is). Turn the existing "hadoop-input-format" into a proxy for the new
module for backward compatibility (marked deprecated and removed in the next
major version).
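
To illustrate the proxy idea, here is a minimal sketch in plain Java. The
class names UnifiedFormatReader and LegacyInputFormatReader are hypothetical
placeholders, not Beam's actual API, and the "read" is reduced to copying a
list so that the example stays self-contained. The only point is that the old
entry point stays callable, is marked deprecated, and delegates to the new one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical entry point of the new, unified module. */
final class UnifiedFormatReader {
    private UnifiedFormatReader() {}

    /** Stand-in for the real read logic: returns an unmodifiable copy of the records. */
    static List<String> read(List<String> records) {
        return Collections.unmodifiableList(new ArrayList<>(records));
    }
}

/**
 * Hypothetical old entry point, kept only for backward compatibility.
 *
 * @deprecated use {@link UnifiedFormatReader} instead; to be removed in the next major version.
 */
@Deprecated
final class LegacyInputFormatReader {
    private LegacyInputFormatReader() {}

    /** Thin proxy: no logic of its own, just forwards to the new module. */
    static List<String> read(List<String> records) {
        return UnifiedFormatReader.read(records);
    }
}
```

Existing user code that calls the old class keeps compiling (with a
deprecation warning), while new code can depend on the unified module directly.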


I don't think everything "Hadoop" should be merged; the purposes and usage are
just too different. As an example, the Hadoop file system abstraction
itself has implementations for multiple other systems and is not limited to
HDFS.

On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <[email protected]>
wrote:

> Dharmendra,
> For now, you can’t write with Hadoop MapReduce OutputFormat. However, you
> can use FileIO or TextIO to write to HDFS, since these IOs support different
> file systems.
>
> On 11 Sep 2018, at 11:11, dharmendra pratap singh <
> [email protected]> wrote:
>
> Hello Team,
> Does this mean that, as of today, we can read from Hadoop FS but can't write
> to Hadoop FS using the Beam HDFS API?
>
> Regards
> Dharmendra
>
> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I’d like to discuss the following topic (see below) with the community,
>> since the optimal solution is not clear to me.
>>
>> There is a Java IO module called “*hadoop-input-format*”, which allows using
>> MapReduce InputFormat implementations to read data from different
>> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>> As its name suggests, it has only a “Read” part and is missing a “Write”
>> part, so I'm working on “*hadoop-output-format*” to support MapReduce
>> OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>> this I created another module with this name. So, in the end, we will have
>> two different modules, “*hadoop-input-format*” and “*hadoop-output-format*”,
>> which looks quite strange to me since, afaik, every existing Java IO that
>> we have encapsulates its Read and Write parts in one module. Additionally,
>> we have “*hadoop-common*” and “*hadoop-file-system*” as other
>> Hadoop-related modules.
>>
>> Now I’m thinking about how best to organise all these Hadoop
>> modules. I have several options in mind:
>>
>> 1) Add a new module “*hadoop-output-format*” and leave all Hadoop modules
>> as they are.
>> Pros: no breaking changes, no additional work
>> Cons: not logical for users to have the same IO in two different modules
>> with different names.
>>
>> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”, and
>> keep the other Hadoop modules as they are.
>> Pros: InputFormat/OutputFormat live in one IO module, which is logical
>> for users
>> Cons: breaking changes for user code because of module/IO renaming
>>
>> 3) Add a new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
>> which will include the new “write” functionality and be a proxy for the old
>> “*hadoop-input-format*”. In turn, “*hadoop-input-format*” should
>> become deprecated and finally be moved to the common “*hadoop-format*”
>> module in future releases. Keep the other Hadoop modules as they are.
>> Pros: eventually there will be only one module for the Hadoop MR format;
>> changes are less painful for users
>> Cons: hidden difficulties in implementing this strategy; a bit
>> confusing for users
>>
>> 4) Add a new module “*hadoop*” and move all existing modules there
>> as submodules (like we have for “*io/google-cloud-platform*”); merge
>> “*hadoop-input-format*” and “*hadoop-output-format*” into one module.
>> Pros: unification of all Hadoop-related modules
>> Cons: breaking changes for user code, additional complexity with deps and
>> testing
>>
>> 5) Your suggestion?..
>>
>> My personal preference lies between 2 and 3 (if 3 is possible).
>>
>> I’m wondering whether there were similar situations in Beam before and how
>> they were finally resolved. If so, we should probably handle this one in a
>> similar way.
>> Any suggestions, advice or comments would be much appreciated.
>>
>> Thanks,
>> Alexey
>>
>
>
