Re: [VOTE] Release Hadoop 0.19.2 (candidate 0)

2009-07-01 Thread jason hadoop
Can you put  http://issues.apache.org/jira/browse/HADOOP-5589 in please?


On Wed, Jul 1, 2009 at 2:44 AM, Tom White  wrote:

> I have created a candidate build for Hadoop 0.19.2. This fixes 42
> issues in 0.19.1
> (
> http://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&pid=12310240&fixfor=12313650
> ).
>
> *** Please download, test and vote before the
> *** vote closes on Friday, July 3.
>
> http://people.apache.org/~tomwhite/hadoop-0.19.2-candidate-0/
>
> Cheers,
> Tom
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: [VOTE] Release Hadoop 0.19.2 (candidate 0)

2009-07-01 Thread jason hadoop
It is a bug fix: it makes the map-side join work as documented instead of
hitting an undocumented failure case when you have more than 32 tables in
your join statement.
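
For anyone unfamiliar with the limit, here is a rough, untested sketch of the
kind of job that trips it, using the 0.19-era mapred join API; the class name,
paths, and table count are hypothetical:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class ManyTableJoinSketch {
  public static void configure(JobConf conf) {
    // Hypothetical inputs: more than 32 tables, which is where the
    // unpatched TupleWritable hits the undocumented limit.
    String[] tables = new String[40];
    for (int i = 0; i < tables.length; i++) {
      tables[i] = "/data/table-" + i;
    }

    // Build an inner join expression over all the inputs.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", SequenceFileInputFormat.class, tables));
    // ... mapper, output types, output path, and job submission as usual ...
  }
}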

On Wed, Jul 1, 2009 at 5:40 PM, Nigel Daley  wrote:

> HADOOP-5589 is not a bug fix so it shouldn't go into branch-0.19.
>
> Nige
>
> On Jul 1, 2009, at 7:49 AM, jason hadoop wrote:
>
>  Can you put  http://issues.apache.org/jira/browse/HADOOP-5589 in please?
>>
>>
>> On Wed, Jul 1, 2009 at 2:44 AM, Tom White  wrote:
>>
>>  I have created a candidate build for Hadoop 0.19.2. This fixes 42
>>> issues in 0.19.1
>>> (
>>>
>>> http://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&pid=12310240&fixfor=12313650
>>> ).
>>>
>>> *** Please download, test and vote before the
>>> *** vote closes on Friday, July 3.
>>>
>>> http://people.apache.org/~tomwhite/hadoop-0.19.2-candidate-0/
>>>
>>> Cheers,
>>> Tom
>>>
>>>
>>
>>
>> --
>> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
>> http://www.amazon.com/dp/1430219424?tag=jewlerymall
>> www.prohadoopbook.com a community for Hadoop Professionals
>>
>
>


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: [VOTE] Release Hadoop 0.19.2 (candidate 0)

2009-07-02 Thread jason hadoop
I don't believe that patch breaks any compatibility; the change is
completely internal to TupleWritable.
The version for 0.18 requires larger changes, as CompositeInputReader needs
to change as well.

On Wed, Jul 1, 2009 at 11:50 PM, Owen O'Malley wrote:

> On Wed, Jul 1, 2009 at 8:09 PM, jason hadoop
> wrote:
> > It is a bug fix,
>
> It removes an undocumented limitation. *sigh*
>
> If I remember right, that patch breaks compatibility. If so, I'd vote
> against putting it into any of the release branches.
>
> -- Owen
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: [VOTE] Release Hadoop 0.19.2 (candidate 0)

2009-07-02 Thread jason hadoop
Since the only place that data is incompatible is in a map, I don't think it
should matter, unless someone has written out a file of TupleWritables in a
sequence file.
The new serialization format is designed to be able to read the old format.

On Thu, Jul 2, 2009 at 3:10 PM, Owen O'Malley  wrote:

> Ok,
>   I checked and it doesn't break compatibility, but it will write data that
> isn't readable by the unpatched code. Rather than push this back into what
> is likely the last 0.19 release, I'd propose that we just backport this to
> 0.20. Is that reasonable?
>
> -- Owen
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Need help understanding the source

2009-07-07 Thread jason hadoop
If your constraints are loose enough, you could consider using the chain
mapping (ChainMapper) that became available in 0.19, and have multiple
mappers for your job.
Each extra mapper only receives the output of the prior map in the chain,
and if I remember correctly, the combiner is run at the end of the chain of
mappers when a reduce is scheduled.
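
Roughly, the wiring looks like this (a minimal, untested sketch against the
0.19 ChainMapper/ChainReducer API; the mapper and reducer classes here are
trivial stand-ins for your own):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainSketch {

  // First mapper in the chain: sees the job input.
  public static class FirstMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(value, value);
    }
  }

  // Second mapper: sees only FirstMap's output, not the job input.
  public static class SecondMap extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(key, value);
    }
  }

  // The reduce (and the combiner, if one is set) runs after the whole chain.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      while (values.hasNext()) {
        out.collect(key, values.next());
      }
    }
  }

  public static void configure(JobConf conf) {
    ChainMapper.addMapper(conf, FirstMap.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(conf, SecondMap.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
    ChainReducer.setReducer(conf, Reduce.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
  }
}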

The other alternative you may try is simply to write your map outputs to
HDFS [i.e. setNumReduceTasks(0)] and have a consumer pick up the map outputs
as they appear. If the life of the files is short and you can withstand data
loss, you may turn down the replication factor to speed up the writes.
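
Something like this, as a minimal untested sketch (the input and output paths
are placeholders, and the lowered replication only applies to files the
tasks write):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlySketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlySketch.class);
    conf.setJobName("map-only");

    // No reduce phase: each map's output goes straight to the output directory.
    conf.setNumReduceTasks(0);
    // Short-lived output: lower replication to speed writes, at the cost of
    // durability if a datanode dies.
    conf.setInt("dfs.replication", 1);

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path("/data/in"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/map-out"));

    JobClient.runJob(conf);
  }
}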

The Map/Reduce framework is carefully constructed to provide a completely
sorted input set to the reducer, as that is part of the fundamental
contract.

On Tue, Jul 7, 2009 at 12:30 AM, Amr Awadallah  wrote:

> To add to Todd/Ted's wise words, the Hadoop (and MapReduce) architects
> didn't impose this limitation just for fun, it is very core to enabling
> Hadoop to be as reliable as it is. If the reducer starts processing mapper
> output immediately and a specific mapper fails then the reducer would have
> to know how to undo the specific pieces of work related to the failed
> mapper, not trivial at all. That said, the combiners do achieve a bit of
> that for you, as they start working immediately on the map out, but on a
> per-mapper basis (not global), so easy to handle failure in that case (you
> just redo that mapper and the combining for it).
>
> -- amr
>
>
> Ted Dunning wrote:
>
>> I would consider this to be a very delicate optimization with little
>> utility
>> in the real world.  It is very, very rare to reliably know how many
>> records
>> the reducer will see.  Getting this wrong would be a disaster.  Getting it
>> right would be very difficult in almost all cases.
>>
>> Moreover, this assumption is baked all through the map-reduce design and
>> thus doing a change to allow reduce to go ahead is likely to be really
>> tricky (not that I know this for a fact).
>>
>>
>> On Mon, Jul 6, 2009 at 11:14 AM, Naresh Rapolu <
>> nareshreddy.rap...@gmail.com
>>
>>
>>> wrote:
>>>
>>>
>>
>>
>>
>>> My aim is to make the reduce move ahead with reduction as and when it
>>> gets
>>> the data required, instead of waiting for all the maps to complete.  If
>>> it
>>> knows how many records it needs and compares it with number of records it
>>> has got until now,  it can move on once they become equal without waiting
>>> for all the maps to finish.
>>>
>>>
>>>
>>
>>
>>
>


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Need help understanding the source

2009-07-07 Thread jason hadoop
When you have 0 reduces, the map outputs themselves are moved to the output
directory for you.

It is also straightforward to open your own file and write to it directly
instead of using the output collector.
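
A minimal, untested sketch of the second approach; the side-file naming is
my own convention, not anything the framework requires:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SideFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private FSDataOutputStream out;

  public void configure(JobConf job) {
    try {
      // One side file per task attempt, written under the task's work output
      // directory so the framework promotes it to the job output on success.
      Path dir = FileOutputFormat.getWorkOutputPath(job);
      Path file = new Path(dir, "side-" + job.get("mapred.task.id"));
      out = file.getFileSystem(job).create(file);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<LongWritable, Text> collector, Reporter reporter)
      throws IOException {
    // Write directly, bypassing the output collector.
    out.writeBytes(value + "\n");
  }

  public void close() throws IOException {
    out.close();
  }
}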

On Tue, Jul 7, 2009 at 10:14 AM, Todd Lipcon  wrote:

> On Tue, Jul 7, 2009 at 1:13 AM, jason hadoop 
> wrote:
> >
> >
> > The other alternative you may try is simply to write your map outputs to
> > HDFS [i.e. setNumReduceTasks(0)] and have a consumer pick up the map outputs
> as
> > they appear. If the life of the files is short and you can withstand data
> > loss, you may turn down the replication factor, to speed the writes.
> >
>
> I'm not sure that would be very easy, since the output is initially written
> into a temporary directory. I suppose you could go digging through the
> temporary directory to catch the map outputs as they finish, but it's
> probably tricky at best and certainly not intended
>
> -Todd
>
>
> > On Tue, Jul 7, 2009 at 12:30 AM, Amr Awadallah  wrote:
> >
> > > To add to Todd/Ted's wise words, the Hadoop (and MapReduce) architects
> > > didn't impose this limitation just for fun, it is very core to enabling
> > > Hadoop to be as reliable as it is. If the reducer starts processing
> > mapper
> > > output immediately and a specific mapper fails then the reducer would
> > have
> > > to know how to undo the specific pieces of work related to the failed
> > > mapper, not trivial at all. That said, the combiners do achieve a bit
> of
> > > that for you, as they start working immediately on the map out, but on
> a
> > > per-mapper basis (not global), so easy to handle failure in that case
> > (you
> > > just redo that mapper and the combining for it).
> > >
> > > -- amr
> > >
> > >
> > > Ted Dunning wrote:
> > >
> > >> I would consider this to be a very delicate optimization with little
> > >> utility
> > >> in the real world.  It is very, very rare to reliably know how many
> > >> records
> > >> the reducer will see.  Getting this wrong would be a disaster.
>  Getting
> > it
> > >> right would be very difficult in almost all cases.
> > >>
> > >> Moreover, this assumption is baked all through the map-reduce design
> and
> > >> thus doing a change to allow reduce to go ahead is likely to be really
> > >> tricky (not that I know this for a fact).
> > >>
> > >>
> > >> On Mon, Jul 6, 2009 at 11:14 AM, Naresh Rapolu <
> > >> nareshreddy.rap...@gmail.com
> > >>
> > >>
> > >>> wrote:
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >>> My aim is to make the reduce move ahead with reduction as and when it
> > >>> gets
> > >>> the data required, instead of waiting for all the maps to complete.
>  If
> > >>> it
> > >>> knows how many records it needs and compares it with number of
> records
> > it
> > >>> has got until now,  it can move on once they become equal without
> > waiting
> > >>> for all the maps to finish.
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> >
> >
> > --
> > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> > http://www.amazon.com/dp/1430219424?tag=jewlerymall
> > www.prohadoopbook.com a community for Hadoop Professionals
> >
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals