Thanks Prashant Kommireddi, But my question is: How to call the UDF in PIG, especially the parameters to put into the UDF.
在 2011年12月15日 下午4:05,Prashant Kommireddi <[email protected]>写道: > Not sure what you mean. Have you tried the code I forwarded? Are you facing > any issues there? > > If your question is regarding binarySearch implementation, here is > pseudo-code'ish implementation. I have not tested this, please treat this > as a general idea on how to go about accessing the elements within the > Tuple. > > ALSO, I am assuming you have defined schema for (inner) Tuple contents. > > public String binarySearch(Tuple tuple, long toSearch, int low, int high) { > if(low > high) > return "NOT FOUND"; //Handle this the way you would like > > if(tuple == null) > throw new IllegalArgumentException("Tuple is null"); //Handle > this the way you would like > > int mid = (low + high)/2; > Tuple midTuple = tuple.get(mid); > String tag = midTuple.get(0).toString(); > long ipstart = (Long)midTuple.get(1); > long ipend = (Long)midTuple.get(2); > String loc = midTuple.get(3).toString(); > > if(toSearch == ipstart) //Or ipend, I am not sure how you want to search > { > return loc; > } > else if(toSearch < ipstart) > return binarySearch(tuple, low, mid - 1); > > else > return binarySearch(tuple, mid+1, high); > > } > > > > > > > > 2011/12/14 唐亮 <[email protected]> > > > Hi Prashant Kommireddi, > > > > If so, how should I write the UDF, especially the data types in UDF? > > > > 2011/12/15 Prashant Kommireddi <[email protected]> > > > > > When you flatten your BAG all your segments are within a single tuple. > > > Something like > > > > > > ((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc)...(tagN, > > > ipstartN, ipendN, locN)) > > > > > > You can access the inner tuples positionally. > > > > > > Sent from my iPhone > > > > > > On Dec 14, 2011, at 6:28 PM, "唐亮" <[email protected]> wrote: > > > > > > > Now the question is: > > > > How should I put all the "IP Segments" in one TUPLE? > > > > > > > > Please help me! > > > > > > > > > > > > 2011/12/15 Prashant Kommireddi <[email protected]> > > > > > > > >> Michael, > > > >> > > > >> This would have no benefit over using a DistributedCache. For a > large > > > >> cluster this would mean poor performance. If the file is static and > > > needs > > > >> to be looked-up across the cluster, DistributedCache would be a > better > > > >> approach. > > > >> > > > >> Thanks, > > > >> Prashant > > > >> > > > >> On Wed, Dec 14, 2011 at 11:18 AM, jiang licht < > [email protected]> > > > >> wrote: > > > >> > > > >>> If that list of ip pairs is pretty static most time and will be > used > > > >>> frequently, maybe just copy it in hdfs with a high replication > > factor. > > > >> Then > > > >>> use it as a look up table or some binary tree or treemap kind of > > thing > > > by > > > >>> reading it from hdfs instead of using distributed cache if that > > sounds > > > an > > > >>> easier thing to do. > > > >>> > > > >>> > > > >>> Best regards, > > > >>> Michael > > > >>> > > > >>> > > > >>> ________________________________ > > > >>> From: Dmitriy Ryaboy <[email protected]> > > > >>> To: [email protected] > > > >>> Sent: Wednesday, December 14, 2011 10:28 AM > > > >>> Subject: Re: Implement Binary Search in PIG > > > >>> > > > >>> hbase has nothing to do with distributed cache. > > > >>> > > > >>> > > > >>> 2011/12/14 唐亮 <[email protected]> > > > >>> > > > >>>> Now, I didn't use HBase, > > > >>>> so, maybe I can't use DistributedCache. > > > >>>> > > > >>>> And if FLATTEN DataBag, the results are Tuples, > > > >>>> then in UDF I can process only one Tuple, which can't implement > > > >>>> BinarySearch. > > > >>>> > > > >>>> So, please help and show me the detailed solution. > > > >>>> Thanks! > > > >>>> > > > >>>> 在 2011年12月14日 下午5:59,唐亮 <[email protected]>写道: > > > >>>> > > > >>>>> Hi Prashant Kommireddi, > > > >>>>> > > > >>>>> If I do 1. and 2. as you mentioned, > > > >>>>> the schema will be {tag, ipStart, ipEnd, locName}. > > > >>>>> > > > >>>>> BUT, how should I write the UDF, especially how should I set the > > type > > > >>> of > > > >>>>> the input parameter? > > > >>>>> > > > >>>>> Currently, the UDF codes are as below, whose input parameter is > > > >>> DataBag: > > > >>>>> > > > >>>>> public class GetProvinceNameFromIPNum extends EvalFunc<String> { > > > >>>>> > > > >>>>> public String exec(Tuple input) throws IOException { > > > >>>>> if (input == null || input.size() == 0) > > > >>>>> return UnknownIP; > > > >>>>> if (input.size() != 2) { > > > >>>>> throw new IOException("Expected input's size is 2, but is: " + > > > >>>>> input.size()); > > > >>>>> } > > > >>>>> > > > >>>>> Object o1 = input.get(0); * // This should be the IP you > > want > > > >>> to > > > >>>>> look up* > > > >>>>> if (!(o1 instanceof Long)) { > > > >>>>> throw new IOException("Expected input 1 to be Long, > but > > > >>> got " > > > >>>>> + o1.getClass().getName()); > > > >>>>> } > > > >>>>> Object o2 = input.get(1); *// This is the Bag of IP segs* > > > >>>>> if (!(o2 instanceof *DataBag*)) { //* Should I change it > to > > > >>> "(o2 > > > >>>>> instanceof Tuple)"?* > > > >>>>> throw new IOException("Expected input 2 to be DataBag, > > > >> but > > > >>>> got > > > >>>>> " > > > >>>>> + o2.getClass().getName()); > > > >>>>> } > > > >>>>> > > > >>>>> ........... other codes ........... > > > >>>>> } > > > >>>>> > > > >>>>> } > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> 在 2011年12月14日 下午3:16,Prashant Kommireddi <[email protected] > >写道: > > > >>>>> > > > >>>>> Seems like at the end of this you have a Single bag with all the > > > >>>> elements, > > > >>>>>> and somehow you would like to check whether an element exists in > > it > > > >>>> based > > > >>>>>> on ipstart/end. > > > >>>>>> > > > >>>>>> > > > >>>>>> 1. Use FLATTEN > > > >> http://pig.apache.org/docs/r0.9.1/basic.html#flatten- > > > >>>>>> this will convert the Bag to Tuple: to_tuple = FOREACH > > > >>> order_ip_segs > > > >>>>>> GENERATE tag, FLATTEN(order_seq); ---- This is O(n) > > > >>>>>> 2. Now write a UDF that can access the elements positionally > for > > > >> the > > > >>>>>> BinarySearch > > > >>>>>> 3. Dmitriy and Jonathan's ideas with DistributedCache could > > > >> perform > > > >>>>>> better than the above approach, so you could go down that route > > > >> too. > > > >>>>>> > > > >>>>>> > > > >>>>>> 2011/12/13 唐亮 <[email protected]> > > > >>>>>> > > > >>>>>>> The detailed PIG codes are as below: > > > >>>>>>> > > > >>>>>>> raw_ip_segment = load ... > > > >>>>>>> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name; > > > >>>>>>> group_ip_segs = group ip_segs all; > > > >>>>>>> > > > >>>>>>> order_ip_segs = foreach group_ip_segs { > > > >>>>>>> order_seg = order ip_segs by ipstart, ipend; > > > >>>>>>> generate 't' as tag, order_seg; > > > >>>>>>> } > > > >>>>>>> describe order_ip_segs > > > >>>>>>> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: > > > >>>>>> long,poid: > > > >>>>>>> chararray}} > > > >>>>>>> > > > >>>>>>> Here, the order_ip_segs::order_seg is a BAG, > > > >>>>>>> how can I transer it to a TUPLE? > > > >>>>>>> > > > >>>>>>> And can I access the TUPLE randomly in UDF? > > > >>>>>>> > > > >>>>>>> 在 2011年12月14日 下午2:41,唐亮 <[email protected]>写道: > > > >>>>>>> > > > >>>>>>>> Then how can I transfer all the items in Bag to a Tuple? > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> 2011/12/14 Jonathan Coveney <[email protected]> > > > >>>>>>>> > > > >>>>>>>>> It's funny, but if you look wayyyy in the past, I actually > > > >> asked > > > >>> a > > > >>>>>> bunch > > > >>>>>>>>> of > > > >>>>>>>>> questions that circled around, literally, this exact problem. > > > >>>>>>>>> > > > >>>>>>>>> Dmitriy and Prahsant are correct: the best way is to make a > UDF > > > >>>> that > > > >>>>>> can > > > >>>>>>>>> do > > > >>>>>>>>> the lookup really efficiently. This is what the maxmind API > > > >> does, > > > >>>> for > > > >>>>>>>>> example. > > > >>>>>>>>> > > > >>>>>>>>> 2011/12/13 Prashant Kommireddi <[email protected]> > > > >>>>>>>>> > > > >>>>>>>>>> I am lost when you say "If enumerate every IP, it will be > > > >> more > > > >>>> than > > > >>>>>>>>>> 100000000 single IPs" > > > >>>>>>>>>> > > > >>>>>>>>>> If each bag is a collection of 30000 tuples it might not be > > > >> too > > > >>>>>> bad on > > > >>>>>>>>> the > > > >>>>>>>>>> memory if you used Tuple to store segments instead? > > > >>>>>>>>>> > > > >>>>>>>>>> (8 bytes long + 8 bytes long + 20 bytes for chararray ) = 36 > > > >>>>>>>>>> Lets say we incur an additional overhead 4X times this, > which > > > >>> is > > > >>>>>> ~160 > > > >>>>>>>>> bytes > > > >>>>>>>>>> per tuple. > > > >>>>>>>>>> Total per Bag = 30000 X 160 = ~5 MB > > > >>>>>>>>>> > > > >>>>>>>>>> You could probably store the ipsegments as Tuple and test it > > > >> on > > > >>>>>> your > > > >>>>>>>>>> servers. > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy < > > > >>>>>> [email protected]> > > > >>>>>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Do you have many such bags or just one? If one, and you > > > >> want > > > >>> to > > > >>>>>> look > > > >>>>>>>>> up > > > >>>>>>>>>>> many ups in it, might be more efficient to serialize this > > > >>>>>> relation > > > >>>>>>> to > > > >>>>>>>>>> hdfs, > > > >>>>>>>>>>> and write a lookup udf that specifies the serialized data > > > >> set > > > >>>> as > > > >>>>>> a > > > >>>>>>>>> file > > > >>>>>>>>>> to > > > >>>>>>>>>>> put in distributed cache. At init time, load up the file > > > >> into > > > >>>>>>> memory, > > > >>>>>>>>>> then > > > >>>>>>>>>>> for every ip do the binary search in exec() > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> > > > >> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> Thank you all! > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> The detail is: > > > >>>>>>>>>>>> A bag contains many "IP Segments", whose schema is > > > >>>>>> (ipStart:long, > > > >>>>>>>>>>>> ipEnd:long, locName:chararray) and the number of tuples > > > >> is > > > >>>>>> about > > > >>>>>>>>> 30000, > > > >>>>>>>>>>>> and I want to check wheather an IP is belong to one > > > >> segment > > > >>>> in > > > >>>>>> the > > > >>>>>>>>> bag. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I want to order the "IP Segments" by (ipStart, ipEnd) in > > > >>> MR, > > > >>>>>>>>>>>> and then binary search wheather an IP is in the bag in > > > >> UDF. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> If enumerate every IP, it will be more than 100000000 > > > >>> single > > > >>>>>> IPs, > > > >>>>>>>>>>>> I think it will also be time consuming by JOIN in PIG. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Please help me how can I deal with it efficiently! > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 2011/12/14 Thejas Nair <[email protected]> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> My assumption is that 唐亮 is trying to do binary search > > > >> on > > > >>>> bags > > > >>>>>>>>> within > > > >>>>>>>>>>> the > > > >>>>>>>>>>>>> tuples in a relation (ie schema of the relation has a > > > >> bag > > > >>>>>>> column). > > > >>>>>>>>> I > > > >>>>>>>>>>> don't > > > >>>>>>>>>>>>> think he is trying to treat the entire relation as one > > > >> bag > > > >>>>>> and do > > > >>>>>>>>>> binary > > > >>>>>>>>>>>>> search on that. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> -Thejas > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On 12/13/11 2:30 PM, Andrew Wells wrote: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>>> I don't think this could be done, > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> pig is just a hadoop job, and the idea behind hadoop is > > > >>> to > > > >>>>>> read > > > >>>>>>>>> all > > > >>>>>>>>>> the > > > >>>>>>>>>>>>>> data in a file. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> so by the time you put all the data into an array, you > > > >>>> would > > > >>>>>>> have > > > >>>>>>>>>> been > > > >>>>>>>>>>>>>> better off just checking each element for the one you > > > >>> were > > > >>>>>>> looking > > > >>>>>>>>>> for. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> So what you would get is [n + lg (n)], which will just > > > >> be > > > >>>> [n] > > > >>>>>>>>> after > > > >>>>>>>>>>>>>> putting > > > >>>>>>>>>>>>>> that into an array. > > > >>>>>>>>>>>>>> Second, hadoop is all about large data analysis, > > > >> usually > > > >>>> more > > > >>>>>>> than > > > >>>>>>>>>>> 100GB, > > > >>>>>>>>>>>>>> so putting this into memory is out of the question. > > > >>>>>>>>>>>>>> Third, hadoop is efficient because it processes this > > > >>> large > > > >>>>>>> amount > > > >>>>>>>>> of > > > >>>>>>>>>>> data > > > >>>>>>>>>>>>>> by splitting it up into multiple processes. To do an > > > >>>>>> efficient > > > >>>>>>>>> binary > > > >>>>>>>>>>>>>> search, you would need do this in one mapper or one > > > >>>> reducer. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> My opinion is just don't fight hadoop/pig. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair< > > > >>>>>>>>> [email protected]> > > > >>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> Bags can be very large might not fit into memory, and > > > >> in > > > >>>> such > > > >>>>>>>>> cases > > > >>>>>>>>>>> some > > > >>>>>>>>>>>>>>> or all of the bag might have to be stored on disk. In > > > >>> such > > > >>>>>>>>> cases, it > > > >>>>>>>>>>> is > > > >>>>>>>>>>>>>>> not > > > >>>>>>>>>>>>>>> efficient to do random access on the bag. That is why > > > >>> the > > > >>>>>>> DataBag > > > >>>>>>>>>>>>>>> interface > > > >>>>>>>>>>>>>>> does not support it. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> As Prashant suggested, storing it in a tuple would be > > > >> a > > > >>>> good > > > >>>>>>>>>>> alternative, > > > >>>>>>>>>>>>>>> if you want to have random access to do binary search. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> -Thejas > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> On 12/12/11 7:54 PM, 唐亮 wrote: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Hi all, > > > >>>>>>>>>>>>>>>> How can I implement a binary search in pig? > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> In one relation, there exists a bag whose items are > > > >>>> sorted. > > > >>>>>>>>>>>>>>>> And I want to check there exists a specific item in > > > >> the > > > >>>>>> bag. > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> In UDF, I can't random access items in DataBag > > > >>> container. > > > >>>>>>>>>>>>>>>> So I have to transfer the items in DataBag to an > > > >>>> ArrayList, > > > >>>>>>> and > > > >>>>>>>>>> this > > > >>>>>>>>>>> is > > > >>>>>>>>>>>>>>>> time consuming. > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> How can I implement the binary search efficiently in > > > >>> pig? > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > > > >
