Re: Parallelism for small input data

Dipesh Kumar Singh Tue, 15 Jan 2013 10:24:04 -0800

Thanks, Dmitriy and Vitalii!

I am able to control the number of mappers by setting the split size. And yes,
there isn't any reason to re-read the dictionary for every phrase, except
that I was porting existing code. I will re-implement the UDF to read it once
and check the performance.
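
A minimal sketch of the split-size approach, assuming
pig.maxCombinedSplitSize and mapred.max.split.size are the relevant
properties on this Pig/Hadoop version. The 64 KB value is illustrative only,
not a tuned setting:

=====code snip====

-- Assumption: 65536 bytes (64 KB) is an example value, not a recommendation.
-- Cap the size of each input split produced for the file.
set mapred.max.split.size 65536;
-- Keep Pig from combining the small splits back into one larger split.
set pig.maxCombinedSplitSize 65536;

register 'Dudfs.jar';
define CorrectPhrases CorrectPhrases('/user/home/big.txt');
input_term = load '/user/home/input.txt' using PigStorage('\n')
    as (phrase:chararray);
checked_term = foreach input_term generate phrase,
    CorrectPhrases(phrase) as correctedTerms;
store checked_term into '/user/home/corrected_phrases'
    using PigStorage(',');

===================================

With a 1.2 MB input this should give on the order of 20 map tasks instead of
one. A smaller value gives more mappers, down to the per-line split Vitalii
describes.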


Regards,
Dipesh

On Mon, Jan 14, 2013 at 3:52 PM, Vitalii Tymchyshyn <[email protected]> wrote:

> Well, if you set the split size to 1, you should get a per-line split.
>
>
> 2013/1/13 Dipesh Kumar Singh <[email protected]>
>
> > Hello users,
> >
> > I have an input file (1.2 MB) which contains a list of words/phrases, one
> > per line. I read each phrase and pass it to a UDF to correct/check that
> > phrase.
> > The UDF (it simply extends EvalFunc) reads a 6 MB dictionary file for
> > each input phrase.
> >
> > Since the input dataset is very small, Pig launches only one mapper (out
> > of 150 slots) to process the input, so no parallelism is gained.
> >
> > I would like some input/suggestions on how this kind of scenario can be
> > implemented efficiently in Pig.
> >
> > =====code snip====
> >
> > register 'Dudfs.jar';
> > define CorrectPhrases CorrectPhrases('/user/home/big.txt');
> > input_term = load '/user/home/input.txt' using PigStorage('\n')
> >     as (phrase:chararray);
> > checked_term = foreach input_term generate phrase,
> >     CorrectPhrases(phrase) as correctedTerms;
> > store checked_term into '/user/home/corrected_phrases'
> >     using PigStorage(',');
> >
> > ===================================
> >
> > Forgive me if I am heading in the wrong direction; feel free to correct
> > me and suggest other approaches.
> >
> > Thanks in advance!
> >
> >
> > Regards,
> > Dipesh
> > --
> > Dipesh Kr. Singh
> >
>
>
>
> --
> Best regards,
>  Vitalii Tymchyshyn
>



-- 
Dipesh Kr. Singh
