You are right: I have actually already set these two parameters, and they
usually work well (at least when the loader used is BigStorage).
Nevertheless, when my own loader is used, even though the data
to be reduced is several gigabytes, the job ends up using only 1
reducer, and this is what I don't understand (and of course
pig.exec.reducers.bytes.per.reducer is set to 1 GB).
Is there something special a loader must do for this automatic guess
of the number of reducers to work correctly?
On 21/05/13 19:23, Norbert Burger wrote:
As Jonathan mentioned, TOP should obviate this particular use case. But
for future examples, the parameters
pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max
might be useful:
https://issues.apache.org/jira/browse/PIG-1249
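For reference, a minimal sketch of setting both knobs from a Pig script (the values are illustrative, not recommendations):

    -- illustrative values; tune to your cluster
    SET pig.exec.reducers.bytes.per.reducer 1000000000; -- aim for ~1 GB of input per reducer
    SET pig.exec.reducers.max 100;                      -- cap the automatic estimate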
Norbert
On Tue, May 21, 2013 at 9:23 AM, Vincent Barat <[email protected]> wrote:
Thanks for your reply.
My goal is actually to AVOID using PARALLEL, to let Pig guess a good
number of reducers by itself.
Usually this works well for me, so I don't understand why it does not
in this case.
On 19/05/13 15:37, Norbert Burger wrote:
Take a look at the PARALLEL clause:
http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause
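A minimal sketch of the clause (relation names and the reducer count are illustrative):

    -- force 10 reducers for this GROUP, overriding the automatic estimate
    grp = GROUP alias BY key PARALLEL 10;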
On Fri, May 17, 2013 at 10:48 AM, Vincent Barat <[email protected]> wrote:
Hi,
I use this query to remove duplicated entries from a set of input files
(I cannot use DISTINCT, since some fields can differ):
grp = GROUP alias BY key;
alias = FOREACH grp {
    record = LIMIT alias 1;
    GENERATE FLATTEN(record) AS ...;
};
It appears that this query always generates 1 reducer (I use 0 as the
default number of reducers to let Pig decide), whatever the size of my
input data.
Is this normal behavior? How can I speed up my query by using
several reducers?
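For what it's worth, a hedged rewrite of the query above using Pig's built-in TOP function, which keeps one tuple per group; the column index 0 used for comparison is an assumption about the schema:

    grp = GROUP alias BY key;
    -- TOP(1, 0, alias) keeps one tuple per group, compared on column 0
    deduped = FOREACH grp GENERATE FLATTEN(TOP(1, 0, alias));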
Thanks a lot for your help.