This is a map-side UDF. The Pig script loads a log file and grabs the contents inside square brackets:

a = load;
b = foreach a generate F(a);
dump b;

I see the following on the tasktrackers:

2011-02-23 18:01:25,992 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 5439488(5312K) used = 409337824(399743K) committed = 534118400(521600K) max = 715849728(699072K)
2011-02-23 18:01:26,102 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 5439488(5312K) used = 546751088(533936K) committed = 671547392(655808K) max = 715849728(699072K)

I am trying out some changes in the UDF to see if they work.

Thanks,
Aniket

On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
> Hi, Aniket,
> What is your Pig script? Is the UDF in map side or reduce side?
>
> Daniel
>
> Dmitriy Ryaboy wrote:
>
>> That's a max of 3.3K single-character strings. Even with the java
>> overhead that shouldn't be more than a meg right? none of these should
>> make it out of young gen assuming the list "cats" doesn't stick around
>> outside the udf.
>>
>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>> <amoka...@andrew.cmu.edu>wrote:
>>
>>> Hi Jai,
>>>
>>> Thanks for your email. I suspect that it's the Strings in tight loop
>>> reason as you have suggested. I have a loop in my udf that does the
>>> following.
>>>
>>> while((startInd = someLog.indexOf('[',startInd)) > 0) {
>>>     endInd = someLog.indexOf(']', startInd);
>>>     if(endInd > 0) {
>>>         category = someLog.substring(startInd, endInd+1);
>>>         cats.add(category);
>>>     }
>>>     startInd = endInd;
>>> }
>>>
>>> My jobs are failing in both local and mr mode. UDF works fine for a
>>> smaller input (a few lines). Also, I checked that the size of someLog
>>> doesn't exceed 10000.
>>>
>>> Thanks,
>>> Aniket
>>>
>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>
>>>> Sharing the code would be useful as mentioned. Also of help would
>>>> be the heap settings that the JVM had.
>>>>
>>>> However, off the top of my head, one common situation (esp. in text
>>>> processing/tokenizing) is instantiating Strings in a tight loop.
>>>>
>>>> Besides you could also exercise your UDF in a local JVM and take a
>>>> heap dump / profile it. If your heap is less than 512M, you could
>>>> use basic profiling via hprof/hat (see
>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
>>>>
>>>> Thanks,
>>>> Jai
>>>>
>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>>>>
>>>> Aniket, share the code?
>>>> It really depends on how you create them.
>>>>
>>>> -D
>>>>
>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>> <amoka...@andrew.cmu.edu>wrote:
>>>>
>>>>> I've written a simple UDF that parses a chararray (which looks
>>>>> like ...[a].....[b]...[a]...) to capture stuff inside brackets and
>>>>> return them as String a=2;b=1; and so on. The input chararrays are
>>>>> rarely more than 1000 characters and are not more than 100000 (I've
>>>>> added log.warn in my udf to ensure this). But, I still see java
>>>>> heap error while running this udf (even in local mode, the job
>>>>> simply fails). My assumption is that maps and lists that I use locally
>>>>> will be reclaimed by gc. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Aniket
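
A note on the loop quoted in the thread, offered as a possible explanation rather than a confirmed diagnosis. When a '[' has no ']' after it, indexOf(']', startInd) returns -1 and the loop assigns that to startInd. String.indexOf(int, int) treats a negative start index as 0, so the next pass searches from the beginning and adds the earlier categories to "cats" again, with no end. That would grow the list until the heap is exhausted, even for a short input. The "> 0" test also skips a '[' at index 0. Below is a minimal sketch of the same scan with both cases guarded. The class name BracketCategories and the method name extract are hypothetical, and the sketch stands alone; it is not the UDF from the thread and uses no Pig classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone helper, not the UDF discussed in the thread.
final class BracketCategories {

    // Collects every "[...]" token, brackets included, mirroring the
    // substring(startInd, endInd + 1) call in the posted loop.
    static List<String> extract(String someLog) {
        List<String> cats = new ArrayList<String>();
        int startInd = 0;
        // ">= 0" rather than "> 0" so a '[' at index 0 is not skipped.
        while ((startInd = someLog.indexOf('[', startInd)) >= 0) {
            int endInd = someLog.indexOf(']', startInd);
            if (endInd < 0) {
                // Unmatched '[': stop scanning. Assigning -1 to startInd
                // here would make the next indexOf start from 0 again.
                break;
            }
            cats.add(someLog.substring(startInd, endInd + 1));
            // Resume just past the closing bracket.
            startInd = endInd + 1;
        }
        return cats;
    }

    public static void main(String[] args) {
        System.out.println(extract("..[a]....[b]..[a].."));
        System.out.println(extract("..[a]..[b"));
    }
}
```

With this version an input that ends in an unmatched '[' returns the categories found so far instead of looping. Whether the failing log lines actually contain such a bracket is an assumption that would need checking against the data.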