Re: Nested foreach with order by

Anastasis Andronidis Thu, 27 Feb 2014 17:08:18 -0800

I also just found out that the bag produced by the nested ORDER BY is an
org.apache.pig.data.InternalCachedBag and not an org.apache.pig.data.SortedDataBag.

Is that how it should be?
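
For reference, the order can also be checked independently of the bag's class by walking the elements and comparing neighbours. This is only a minimal sketch in plain Java with no Pig on the classpath: a `List<String>` stands in for the status field of the tuples in the bag, and the class name `OrderCheck`, the method `isNonDecreasing`, and the sample status values are all hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OrderCheck {
    // Returns true when every element is >= the one before it.
    // An empty or single-element sequence counts as sorted.
    public static <T extends Comparable<? super T>> boolean isNonDecreasing(Iterable<T> items) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            return true;
        }
        T prev = it.next();
        while (it.hasNext()) {
            T cur = it.next();
            if (prev.compareTo(cur) > 0) {
                return false; // found a pair that is out of order
            }
            prev = cur;
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical status values, compared lexicographically.
        List<String> sorted = Arrays.asList("CRITICAL", "OK", "WARNING");
        List<String> unsorted = Arrays.asList("OK", "CRITICAL", "WARNING");
        System.out.println(isNonDecreasing(sorted));   // true
        System.out.println(isNonDecreasing(unsorted)); // false
    }
}
```

Inside a real UDF the same loop would run over the bag's iterator and compare the status field of consecutive tuples, which tests the actual order rather than the bag implementation.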

On 28 Feb 2014, at 1:51 AM, Anastasis Andronidis <[email protected]>
wrote:

> Hi again,
> 
> I added this in my UDF:
> 
>     if(!((DataBag) input.get(0)).isSorted()) {
>         throw new IOException("It's not sorted");
>     }
> 
> And the exception is thrown. Why? I don't understand it. I specified ORDER BY
> in the nested foreach.
> 
> Thank you for helping me btw!
> 
> On 28 Feb 2014, at 1:12 AM, Pradeep Gollakota <[email protected]> wrote:
> 
>> No... that wouldn't be related since you're not doing a GROUP ALL.
>> 
>> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
>> wrong in your UDF. The output of your UDF is going to be a string that is
>> some generic status, right? My uneducated guess is that there's a bug in
>> your UDF. To confirm, do you get the correct result if you replace your UDF
>> with an out-of-the-box one, e.g. COUNT?
>> 
>> 
>> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
>> [email protected]> wrote:
>> 
>>> BTW, is this somehow related [1]?
>>> 
>>> 
>>> [1]:
>>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
>>> 
>>> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <
>>> [email protected]> wrote:
>>> 
>>>> Yes, of course, my output looks like this:
>>>> 
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> .
>>>> .
>>>> .
>>>> 
>>>> and when I put PARALLEL 1 in GROUP BY I get:
>>>> 
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>>> .
>>>> .
>>>> .
>>>> 
>>>> 
>>>> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <[email protected]> wrote:
>>>> 
>>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>>> question. Can you give an example please?
>>>>> 
>>>>> 
>>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Hello everyone,
>>>>>> 
>>>>>> I have a foreach statement and, inside it, I use an ORDER BY. After the
>>>>>> ORDER BY, I call a UDF. For example:
>>>>>> 
>>>>>> 
>>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>>> 
>>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>>> 
>>>>>> service_flavors = FOREACH logs_g {
>>>>>>  t = ORDER logs BY status;
>>>>>>  GENERATE group.date as dates, group.site as site,
>>>>>>           group.profile as profile,
>>>>>>           FLATTEN(MY_UDF(t)) as (generic_status);
>>>>>> };
>>>>>> 
>>>>>> The problem is that I get duplicate results. I know that MY_UDF is
>>>>>> running on mappers, but shouldn't each mapper take one group from
>>>>>> logs_g? Is something wrong with the ORDER BY? I tried to add PARALLEL
>>>>>> to the ORDER BY, but I get syntax errors...
>>>>>> 
>>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>>> PARALLEL 1; but this is not a scalable solution. Can someone help me,
>>>>>> please? I am using Pig 0.11.
>>>>>> 
>>>>>> Cheers,
>>>>>> Anastasis
>>>> 
>>> 
>>> 
> 
>
