The detailed Pig code is as follows:

raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, poid;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
  order_seg = order ip_segs by ipstart, ipend;
  generate 't' as tag, order_seg;
};

describe order_ip_segs;
order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}

Here, order_ip_segs::order_seg is a BAG;
how can I convert it to a TUPLE?

And can I randomly access items in the TUPLE in a UDF?
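
For reference, a minimal sketch of what the UDF side could look like. This is plain Java, with Pig's DataBag/Tuple stood in by a List of Object[] rows, and all class and field names are illustrative: copy the bag's sorted tuples once into parallel arrays, which then support O(1) random access by index.

```java
import java.util.List;

// Stand-in for flattening a bag of (ipStart, ipEnd, poid) tuples into
// parallel arrays that allow O(1) random access; a lookup UDF would do
// this once at init time, not per call.
public class SegmentIndex {
    public final long[] starts;
    public final long[] ends;
    public final String[] names;

    // rows are (ipStart, ipEnd, poid), already sorted by (ipStart, ipEnd)
    // as the ORDER ... BY in the nested foreach guarantees.
    public SegmentIndex(List<Object[]> rows) {
        int n = rows.size();
        starts = new long[n];
        ends = new long[n];
        names = new String[n];
        for (int i = 0; i < n; i++) {
            Object[] t = rows.get(i);
            starts[i] = (Long) t[0];
            ends[i] = (Long) t[1];
            names[i] = (String) t[2];
        }
    }
}
```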

On Dec 14, 2011, at 2:41 PM, 唐亮 <[email protected]> wrote:

> Then how can I transfer all the items in a Bag to a Tuple?
>
>
> 2011/12/14 Jonathan Coveney <[email protected]>
>
>> It's funny, but if you look wayyyy in the past, I actually asked a bunch
>> of questions that circled around, literally, this exact problem.
>>
>> Dmitriy and Prashant are correct: the best way is to make a UDF that can
>> do the lookup really efficiently. This is what the MaxMind API does, for
>> example.
>>
>> 2011/12/13 Prashant Kommireddi <[email protected]>
>>
>> > I am lost when you say "If enumerate every IP, it will be more than
>> > 100000000 single IPs".
>> >
>> > If each bag is a collection of 30000 tuples, it might not be too bad on
>> > the memory if you used a Tuple to store segments instead.
>> >
>> > (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes.
>> > Let's say we incur an additional overhead of ~4x this, which is ~160
>> > bytes per tuple.
>> > Total per bag = 30000 x 160 bytes = ~5 MB.
>> >
>> > You could probably store the IP segments as a Tuple and test it on
>> > your servers.
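
As a sanity check on the arithmetic above (the per-tuple sizes and the ~4x overhead factor are the thread's rough assumptions, not measurements; the class name is illustrative):

```java
// Trivial helper to check the back-of-envelope totals quoted above.
public class MemoryEstimate {
    public static long totalBytes(int numTuples, int bytesPerTuple) {
        return (long) numTuples * bytesPerTuple;
    }
}
```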
>> >
>> >
>> > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]>
>> > wrote:
>> >
>> > > Do you have many such bags or just one? If just one, and you want to
>> > > look up many IPs in it, it might be more efficient to serialize this
>> > > relation to HDFS and write a lookup UDF that specifies the serialized
>> > > data set as a file to put in the distributed cache. At init time,
>> > > load the file into memory, then for every IP do the binary search in
>> > > exec().
>> > >
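
The binary-search-in-exec() idea can be sketched as follows, in plain Java without the Pig API (class and method names are illustrative): given parallel arrays sorted by ipStart, find the index of the segment containing an IP, or -1 if none.

```java
// A minimal sketch of the lookup: find the last segment whose ipStart is
// <= ip, then verify ip falls inside it. Assumes non-overlapping segments
// sorted by ipStart.
public class IpSegmentLookup {
    public static int findSegment(long[] starts, long[] ends, long ip) {
        int lo = 0, hi = starts.length - 1, candidate = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= ip) {   // mid could contain ip; keep looking right
                candidate = mid;
                lo = mid + 1;
            } else {                   // mid starts after ip; look left
                hi = mid - 1;
            }
        }
        // candidate is the last segment starting at or before ip, if any
        return (candidate >= 0 && ip <= ends[candidate]) ? candidate : -1;
    }
}
```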
>> > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
>> > >
>> > > > Thank you all!
>> > > >
>> > > > The details are:
>> > > > A bag contains many "IP segments", whose schema is (ipStart:long,
>> > > > ipEnd:long, locName:chararray), and the number of tuples is about
>> > > > 30000. I want to check whether an IP belongs to one segment in the
>> > > > bag.
>> > > >
>> > > > I want to order the "IP segments" by (ipStart, ipEnd) in MR, and
>> > > > then binary search whether an IP is in the bag in a UDF.
>> > > >
>> > > > If I enumerate every IP, there will be more than 100000000 single
>> > > > IPs; I think it will also be time consuming to JOIN in Pig.
>> > > >
>> > > > Please help me figure out how to deal with this efficiently!
>> > > >
>> > > >
>> > > > 2011/12/14 Thejas Nair <[email protected]>
>> > > >
>> > > >> My assumption is that 唐亮 is trying to do binary search on bags
>> > > >> within the tuples in a relation (i.e., the schema of the relation
>> > > >> has a bag column). I don't think he is trying to treat the entire
>> > > >> relation as one bag and do binary search on that.
>> > > >>
>> > > >>
>> > > >> -Thejas
>> > > >>
>> > > >>
>> > > >>
>> > > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
>> > > >>
>> > > >>> I don't think this can be done.
>> > > >>>
>> > > >>> Pig is just a Hadoop job, and the idea behind Hadoop is to read
>> > > >>> all the data in a file.
>> > > >>>
>> > > >>> So by the time you have put all the data into an array, you would
>> > > >>> have been better off just checking each element for the one you
>> > > >>> were looking for. What you would get is O(n + lg n), which is
>> > > >>> just O(n), after putting the data into an array.
>> > > >>>
>> > > >>> Second, Hadoop is all about large data analysis, usually more
>> > > >>> than 100 GB, so putting this into memory is out of the question.
>> > > >>>
>> > > >>> Third, Hadoop is efficient because it processes this large amount
>> > > >>> of data by splitting it up into multiple processes. To do an
>> > > >>> efficient binary search, you would need to do this in one mapper
>> > > >>> or one reducer.
>> > > >>>
>> > > >>> My opinion is: just don't fight Hadoop/Pig.
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair
>> > > >>> <[email protected]> wrote:
>> > > >>>
>> > > >>>> Bags can be very large and might not fit into memory, and in
>> > > >>>> such cases some or all of the bag might have to be stored on
>> > > >>>> disk. In such cases, it is not efficient to do random access on
>> > > >>>> the bag. That is why the DataBag interface does not support it.
>> > > >>>>
>> > > >>>> As Prashant suggested, storing it in a tuple would be a good
>> > > >>>> alternative if you want random access to do binary search.
>> > > >>>>
>> > > >>>> -Thejas
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
>> > > >>>>
>> > > >>>>> Hi all,
>> > > >>>>> How can I implement a binary search in Pig?
>> > > >>>>>
>> > > >>>>> In one relation, there is a bag whose items are sorted,
>> > > >>>>> and I want to check whether a specific item exists in the bag.
>> > > >>>>>
>> > > >>>>> In a UDF, I can't randomly access items in the DataBag
>> > > >>>>> container, so I have to transfer the items in the DataBag to an
>> > > >>>>> ArrayList, and this is time consuming.
>> > > >>>>>
>> > > >>>>> How can I implement the binary search efficiently in Pig?
>> > > >>>>>
>> > > >>>>>
>> > > >>>>
>> > > >>>
>> > > >>
>> > >
>> >
>>
>
>
