Whoops, look at me, I didn't realize you were passing it to a Python UDF. The
same approach should work, just in Python: the bags will be lists, and the
input a tuple of those two bag lists.
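
As a rough sketch of what that could look like on the Python side (untested
against Pig itself): the function name pair_up and its parameter names are
made up for illustration, and I'm assuming each bag shows up as a plain list
of tuples and that the two bags arrive as two separate arguments. If they
arrive as a single tuple instead, unpack it first. In a real script you would
also need to declare the output schema for the UDF.

```python
def pair_up(actual_bag, calc_bag):
    """Pair every tuple of actual_bag with every tuple of calc_bag.

    Hypothetical helper: both arguments are assumed to be lists of tuples,
    which is how the bags are expected to look from Python.
    """
    # Guard against missing or empty input, as a real UDF would need to.
    if not actual_bag or not calc_bag:
        return []
    pairs = []
    for actual in actual_bag:            # tuples from the first bag
        for calc in calc_bag:            # the whole second bag, every time
            pairs.append(actual + calc)  # tuple concatenation
    return pairs


if __name__ == "__main__":
    # Stand-ins for Actual_data and Calculation_data.
    print(pair_up([(1, 2), (3, 4)], [(10,), (20,)]))
```

The nested loop is the same shape as the Java version quoted below; replace
the append with whatever calculation you actually need.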

2011/1/13 Jonathan Coveney <[email protected]>

> You absolutely can do this, I'm just not sure if you can do it using the
> accumulator interface (I THINK you can, as I think it ratchets only the
> first tuple input, and passes the entire second one, but am not sure,
> someone else can weigh in). If you CAN do it with the accumulator interface
> though I highly recommend it, as it's more memory efficient.
>
> Basically, you'll have something like this as your UDF:
>
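> import java.io.IOException;
> import java.util.Iterator;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.Tuple;
>
> // OutputType is a placeholder for whatever your UDF returns.
> public class doublebag extends EvalFunc<OutputType> {
>   public OutputType exec(Tuple input) throws IOException {
>     DataBag innerBag = (DataBag) input.get(0);
>     DataBag outerBag = (DataBag) input.get(1);
>     Iterator<Tuple> ibit = innerBag.iterator();
>     while (ibit.hasNext()) {
>       Tuple ibelem = ibit.next();
>       // Restart the second bag's iterator for every element of the first.
>       Iterator<Tuple> obit = outerBag.iterator();
>       while (obit.hasNext()) {
>         Tuple obelem = obit.next();
>         // ... work with ibelem and obelem here ...
>       }
>     }
>     return null; // placeholder: return your real result here
>   }
> }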
>
> Obviously, you need to do things like checking for empty input etc etc,
> this is just a rough rough example of the code. The point is, if you just do
> exec, you'll simply have a bag as your first input and a bag as the second.
>
> And then in your Pig script, if you want to run one entire bag against
> another, you'd just do a GROUP thing ALL; and pass the resulting bag to the
> UDF.
>
> Hope that explanation made sense, if it didn't just ask again. It's worth
> going over the bag->bag example in the UDF manual, and really, you just have
> 2 bag inputs instead of one.
>
> 2011/1/13 <[email protected]>
>
> Hi,
>>
>> I wish to pass an outer bag into a Python UDF.
>>
>> Something like:
>>
>> Calculation_data = LOAD '/path/to/data' AS (field:int);
>> Actual_data = LOAD '/path/to/data' AS (field1:int, field2:int);
>>
>>
>> Calculation_data is not a very big bag. Maybe about 500 tuples in all - a
>> single file.
>> Actual_data is the real data source lying on HDFS.
>>
>> For each tuple in Actual_data, I wish to have the entire Calculation_data
>> (whole bag) for some calculation that I wish to do.
>>
>> So, with every call to a UDF, I need to pass this bag along with tuple
>> from Actual_data.
>>
>> Simply, is there any way of passing an outer bag into a Python UDF?
>>
>> Regards,
>> Deepak
>
>
