Hi Thomas,

If I understand your question correctly, what you want is to reduce the number of mappers that spawn streaming processes. default_parallel controls the number of reducers, so it won't have any effect on the number of mappers. While the number of mappers is determined automatically from the size of the input data, you can set "pig.maxCombinedSplitSize" to combine small input files into bigger splits. For more details, please refer to:
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
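Since each map task runs its own copy of the streaming command, combining small input files into fewer, larger splits also means fewer streaming processes. A minimal sketch (the 256 MB value is just a placeholder, tune it to your data):

    -- split combination is on by default in recent versions, shown here for clarity
    SET pig.splitCombination true;
    -- let a single map task process up to ~256 MB of input
    SET pig.maxCombinedSplitSize 268435456;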
You can also read a discussion on a similar topic here:
http://search-hadoop.com/m/J5hCw1UdxTa/How+can+I+set+the+mapper+number&subj=How+can+I+set+the+mapper+number+for+pig+script+

Thanks,
Cheolsoo

On Tue, Dec 18, 2012 at 12:00 PM, Thomas Bach <[email protected]> wrote:
> Hi,
>
> I have around 4 million time series. ~1000 of them had a special
> occurrence at some point. Now, I want to draw 10 samples for each
> special time series based on a similarity comparison.
>
> What I have currently implemented is a script in Python which consumes
> time series one by one and compares each of them with all 1000 special
> time series. If the similarity to one of them is sufficient, I pass the
> record back to Pig and strike out the corresponding special time series;
> subsequent time series will not be compared against it.
>
> This routine runs, but it takes around 6 hours.
>
> One of the problems I'm facing is that Pig starts >160 scripts
> although 10 would be sufficient. Is there some way to define the
> number of scripts Pig starts in a `STREAM THROUGH` step? I tried to
> set default_parallel to 10, but it doesn't seem to have any effect.
>
> I'm also open to any other ideas on how to accomplish the task.
>
> Regards,
> Thomas Bach.
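P.S. In case it's useful, a streaming step like the one you describe is usually wired up roughly like this; the script name, its argument, and the output schema below are only placeholders for illustration:

    -- ship the comparison script (and its list of special series) to each task
    DEFINE compare `python compare.py special.tsv` SHIP('compare.py', 'special.tsv');
    ts      = LOAD 'timeseries' AS (id:chararray, series:chararray);
    -- one compare.py process is started per map task
    matched = STREAM ts THROUGH compare AS (id:chararray, special_id:chararray);
    STORE matched INTO 'sampled_timeseries';

With the split-combining settings above, the number of compare.py processes drops along with the number of map tasks.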
