I also just found out that the bag from the nested ORDER BY is
org.apache.pig.data.InternalCachedBag and not org.apache.pig.data.SortedDataBag.
Should it be like that?

On 28 Feb 2014, at 1:51 AM, Anastasis Andronidis <[email protected]> wrote:

> Hi again,
>
> I added this in my UDF:
>
> if(!((DataBag) input.get(0)).isSorted()) {
>     throw new IOException("It's not sorted");
> }
>
> And the exception arises. Why? I don't understand it. I specified ORDER BY in
> the nested foreach.
>
> Thank you for helping me btw!
>
> On 28 Feb 2014, at 1:12 AM, Pradeep Gollakota <[email protected]> wrote:
>
>> No... that wouldn't be related since you're not doing a GROUP ALL.
>>
>> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
>> wrong in your UDF. The output of your UDF is going to be a string that is
>> some generic status, right? My uneducated guess is that there's a bug in
>> your UDF. To confirm, do you get the correct result if you replace your UDF
>> with an out-of-the-box one, e.g. COUNT?
>>
>>
>> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
>> [email protected]> wrote:
>>
>>> BTW, is this somehow related[1]?
>>>
>>>
>>> [1]:
>>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
>>>
>>> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <
>>> [email protected]> wrote:
>>>
>>>> Yes, of course, my output is like that:
>>>>
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> .
>>>> .
>>>> .
>>>>
>>>> and when I put PARALLEL 1 in GROUP BY I get:
>>>>
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> .
>>>> .
>>>> .
>>>>
>>>>
>>>> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <[email protected]> wrote:
>>>>
>>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>>> question. Can you give an example please?
>>>>>
>>>>>
>>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> I have a foreach statement and inside of it, I use an order by. After the
>>>>>> order by, I have a UDF. Example like this:
>>>>>>
>>>>>>
>>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>>>
>>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>>>
>>>>>> service_flavors = FOREACH logs_g {
>>>>>>     t = ORDER logs BY status;
>>>>>>     GENERATE group.date as dates, group.site as site, group.profile as profile,
>>>>>>         FLATTEN(MY_UDF(t)) as (generic_status);
>>>>>> };
>>>>>>
>>>>>> The problem is that I get duplicate results. I know that MY_UDF is
>>>>>> running on mappers, but shouldn't each mapper take 1 group from the logs_g?
>>>>>> Is something wrong with order by? I tried to add order by parallel but I
>>>>>> get syntax errors...
>>>>>>
>>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>>> PARALLEL 1; But this is not a scalable solution. Can someone help me pls? I
>>>>>> am using pig 0.11
>>>>>>
>>>>>> Cheers,
>>>>>> Anastasis
