Whoops, I didn't realize you were passing it to Python. The same approach should work, just in Python: the bags will be lists, and the input a tuple of those two bag lists.
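
For what it's worth, here is a minimal sketch of what that might look like on the Python side. The names pair_up and pair_up_from_tuple are made up for illustration, and I'm assuming each bag shows up as a plain list of tuples. Whether Pig hands you the two bags as separate arguments or as one tuple is something to check against your version, so both entry points are shown. The output schema declaration is left out.

```python
def pair_up(inner_bag, outer_bag):
    """Visit every (inner, outer) combination with two nested loops.

    Assumption: each bag is a list of tuples; a missing bag (None)
    is treated as empty.
    """
    pairs = []
    for inner in inner_bag or []:
        for outer in outer_bag or []:
            # Replace this with the real per-pair calculation.
            pairs.append((inner, outer))
    return pairs


def pair_up_from_tuple(input_tuple):
    """Same thing, for the case where both bags arrive as one tuple."""
    inner_bag, outer_bag = input_tuple
    return pair_up(inner_bag, outer_bag)


if __name__ == "__main__":
    calculation_data = [(1,), (2,)]        # small bag, one field per tuple
    actual_data = [(10, 20), (30, 40)]     # tuples of (field1, field2)
    for pair in pair_up(calculation_data, actual_data):
        print(pair)
```

The loop only collects the pairs; swap the append for whatever calculation you need, and add the empty-input checks that fit your data.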

2011/1/13 Jonathan Coveney <[email protected]>

> You absolutely can do this, I'm just not sure if you can do it using the
> accumulator interface (I THINK you can, as I think it ratchets only the
> first tuple input, and passes the entire second one, but am not sure,
> someone else can weigh in). If you CAN do it with the accumulator interface
> though I highly recommend it, as it's more memory efficient.
>
> Basically, you'll have something like this as your UDF:
>
> import java.io.IOException;
> import java.util.Iterator;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.Tuple;
>
> // OutputType is a placeholder for whatever type your UDF returns.
> public class doublebag extends EvalFunc<OutputType> {
>     public OutputType exec(Tuple input) throws IOException {
>         DataBag innerBag = (DataBag) input.get(0);
>         DataBag outerBag = (DataBag) input.get(1);
>         Iterator<Tuple> ibit = innerBag.iterator();
>         while (ibit.hasNext()) {
>             Tuple ibelem = ibit.next();
>             // Restart the second bag's iterator for every tuple of the first.
>             Iterator<Tuple> obit = outerBag.iterator();
>             while (obit.hasNext()) {
>                 Tuple obelem = obit.next();
>                 // work with ibelem and obelem here
>             }
>         }
>         return null; // placeholder: build and return the real result
>     }
> }
>
> Obviously, you need to do things like checking for empty input etc.,
> this is just a rough rough example of the code. The point is, if you just do
> exec, you'll simply have a bag as your first input and a bag as the second.
>
> And then in your code if you want to do one entire bag against another,
> you'd just do a group thing all; and pass it that thing.
>
> Hope that explanation made sense, if it didn't just ask again. It's worth
> going over the bag->bag example in the UDF manual, and really, you just have
> 2 bag inputs instead of one.
>
> 2011/1/13 <[email protected]>
>
>> Hi,
>>
>> I wish to pass an outer bag into a Python UDF.
>>
>> Something like:
>>
>> Calculation_data = LOAD '/path/to/data' AS (field:int);
>> Actual_data = LOAD '/path/to/data' AS (field1:int, field2:int);
>>
>> Calculation_data is not a very big bag. Maybe about 500 tuples in all - a
>> single file.
>> Actual_data is the real data source lying on HDFS.
>>
>> For each tuple in Actual_data, I wish to have the entire Calculation_data
>> (whole bag) for some calculation that I wish to do.
>>
>> So, with every call to a UDF, I need to pass this bag along with tuple
>> from Actual_data.
>>
>> Simply, is there any way of passing an outer bag into a Python UDF?
>>
>> Regards,
>> Deepak
