Re: Running multiple Pig jobs simultaneously on same data

Jonathan Coveney Wed, 15 Jun 2011 06:37:32 -0700

Yong,

Currently, HDFS does not support appending to a file. So once a file is
created, it literally cannot be changed (although it can be deleted, I
suppose). This lets you avoid the kind of issue you get in a database, where
I do a SELECT * on the entire database and the DBA can't update a row in the
meantime, or other things like that. There are some append patches in the
works, but I am not sure how they handle the concurrency implications.
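
To make that concrete, here is a toy Python sketch of the write-once-read-many
idea. It is not the HDFS API: WriteOnceStore, count_lines and the path are
made-up names for illustration, and the "jobs" are just threads. The point is
only that readers of data that can never change need no locking between them.

```python
import threading


class WriteOnceStore:
    """Toy in-memory model of write-once-read-many files (NOT the HDFS API)."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        # A path can be written exactly once; a second write is rejected.
        if path in self._files:
            raise FileExistsError(path)
        self._files[path] = bytes(data)  # bytes objects are immutable

    def read(self, path):
        return self._files[path]

    def delete(self, path):
        # Deleting is allowed; modifying in place is not.
        del self._files[path]


def count_lines(store, path, results, job_id):
    # Stand-in for one job: it reads the shared input and never modifies it.
    results[job_id] = store.read(path).count(b"\n")


store = WriteOnceStore()
store.create("/data/input.txt", b"a\nb\nc\n")  # hypothetical path

# Four "jobs" read the same input at the same time, with no locks.
results = {}
jobs = [
    threading.Thread(target=count_lines,
                     args=(store, "/data/input.txt", results, i))
    for i in range(4)
]
for job in jobs:
    job.start()
for job in jobs:
    job.join()
```

Every job sees the same bytes, and a second create() on the same path raises
FileExistsError instead of changing what the readers see.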


Make sense?
Jon

2011/6/15 勇胡 <[email protected]>

> I read the link, and it seems to me that HDFS is designed for
> read-heavy workloads, not write-heavy ones ("A file once created,
> written, and closed need not be changed.").
>
> Regarding your description ("Immutable means that after creation it cannot
> be modified."): if I understand correctly, you mean that HDFS cannot
> implement "update" semantics the way a database does? A write operation
> cannot be applied directly to a specific tuple or record? The result of a
> write operation is just appended to the end of the file.
>
> Regards
>
> Yong
>
> 2011/6/15 Nathan Bijnens <[email protected]>
>
> > Immutable means that after creation it cannot be modified.
> >
> > HDFS applications need a write-once-read-many access model for files. A
> > file once created, written, and closed need not be changed. This assumption
> > simplifies data coherency issues and enables high throughput data access. A
> > MapReduce application or a web crawler application fits perfectly with this
> > model. There is a plan to support appending-writes to files in the future.
> >
> >
> > http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Simple+Coherency+Model
> >
> > Best regards,
> >  Nathan
> > ---
> > [email protected] : http://nathan.gs : http://twitter.com/nathan_gs
> >
> >
> > On Wed, Jun 15, 2011 at 12:58 PM, 勇胡 <[email protected]> wrote:
> >
> > > How should I understand "immutable"? I mean, does HDFS implement a
> > > locking mechanism to provide immutable data access when concurrent tasks
> > > process the same set of data, or does it use some other strategy to
> > > implement immutability?
> > >
> > > Thanks
> > >
> > > Yong
> > >
> > > 2011/6/14 Bill Graham <[email protected]>
> > >
> > > > Yes, this is possible. Data in HDFS is immutable and MR tasks are
> > > > spawned in their own VM, so multiple concurrent jobs acting on the same
> > > > input data are fine.
> > > >
> > > > On Tue, Jun 14, 2011 at 11:18 AM, Pradipta Kumar Dutta <
> > > > [email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We have a requirement where we have to process the same set of data
> > > > > (in a Hadoop cluster) by running multiple Pig jobs simultaneously.
> > > > >
> > > > > Any idea whether this is possible in Pig?
> > > > >
> > > > > Thanks,
> > > > > Pradipta
> > > > >
> > > >
> > >
> >
>
