Well, if you set the split size to 1, you should get a per-line split.
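
A rough sketch of how this could look in the script, untested against this job. It assumes a Hadoop 1.x-era cluster where the maximum split size is read from `mapred.max.split.size` (newer releases call it `mapreduce.input.fileinputformat.split.maxsize`), and it assumes Pig's split combination has to be turned off via `pig.splitCombination` so the small splits are not merged back into one mapper. The 8192-byte value is only illustrative.

```pig
-- Assumption: property names as used by Hadoop 1.x and Pig 0.10-era releases;
-- check them against the versions on your cluster.

-- Keep Pig from combining small splits back into a single map task.
set pig.splitCombination false;

-- Cap each input split at about 8 KB (value in bytes, illustrative only),
-- so the 1.2 MB input is spread over roughly 150 map tasks.
set mapred.max.split.size 8192;

register 'Dudfs.jar';
define CorrectPhrases CorrectPhrases('/user/home/big.txt');
input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;
store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
```

A split size of 1 is the extreme case. A value close to the input size divided by the number of free slots may be a gentler starting point, since every split becomes its own map task.
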

2013/1/13 Dipesh Kumar Singh <[email protected]>

> Hello users,
>
> I have an input file (1.2 MB) which contains list of words/phrases in every
> new line. I am reading each phrase per line and passing it to udf to
> correct/check that phrase.
> The udf (simple extends eval func) refers and reads a dictionary file of 6
> MB for each input phrase.
>
> Since, the input dataset is very small, Pig launches only one mapper (out
> of 150 slots) to process the input and no parallelism is gained here.
>
> I would like to get some input/suggestions on how these kind of scenarios
> are efficiently implemented in pig.
>
> =====code snip====
>
> register 'Dudfs.jar';
> define CorrectPhrases CorrectPhrases('/user/home/big.txt');
> input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
> checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;
> store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
>
> ===================================
>
> Forgive me if i am getting into wrong direction, feel free to correct me
> and suggest your ways.
>
> Thanks in advance!
>
>
> Regards,
> Dipesh
> --
> Dipesh Kr. Singh
>

--
Best regards,
Vitalii Tymchyshyn
