I posted on this very same topic a few weeks ago and got no response. It is still an unresolved issue for me, so if anyone has any ideas, they would be greatly appreciated.
Interestingly enough, I ran into issues right around the same size that you are dealing with (50k rows), so I am wondering if it is an issue with how Pig handles things. I'd recommend tuning some of the parameters that I mention in my post (below), as it may help you complete the job.

http://search-hadoop.com/m/kJghFzruCA1/nested+cross&subj=Moving+Cross+of+Large+Data+to+be+Nested

On Thu, Apr 18, 2013 at 9:17 PM, KALLURI, RAJESH K (AG/1000) <[email protected]> wrote:

> I have a relation of about 50000 tuples that I want to join to itself
> either by using a cross operator or something similar. Then I would be
> doing pair wise computation of half the matrix (avoiding comparing to self
> and duplicate).
>
> I was wondering what the most effective way to do this, below is some
> pseudo pig latin.
>
> -- About 50,000 - 70,000 entries
> a = LOAD 'part-r-00000.txt' USING PigStorage()
>     AS (id:long, x:int, y:int);
> -- Same as a , About 50,000 - 70,000 entries
> b = LOAD 'part-r-00000.txt' USING PigStorage()
>     AS (id:long, x:int, y:int);
>
> jnd = join a by id , b by id;
> -- filter comparisons to self and duplicates from the matrix
> -- end up with 50000 X (50000-1)/2 entries
> filter_self = filter jnd by a::id != b::id and a::id > b::id;
>
> raw = foreach filter_self generate a::id as id1, b::id as id2,
>     TOBAG(a::x, b::y) as z;
> -- group pairs for comparison
> grpd = group raw by (id1, id2);
> -- calculate similarity between id1 and id2 based on a udf
> prjctd = foreach grpd generate flatten(group), UDF(raw.z);
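One note on the quoted pseudocode: an equi-join on id only pairs rows whose ids match, so the following filter on a::id != b::id would leave nothing. A minimal sketch of the half-matrix pairing using CROSS instead is below. It is untested at this scale and does not by itself address the size problem, since the cross of 50,000 rows with itself produces 2.5 billion intermediate tuples before filtering. SIMILARITY is a placeholder for the poster's UDF (it would need its own REGISTER/DEFINE), and the PARALLEL value is an arbitrary assumption.

```pig
-- Load the same file under two aliases so the relation can be crossed with itself.
a = LOAD 'part-r-00000.txt' USING PigStorage() AS (id:long, x:int, y:int);
b = LOAD 'part-r-00000.txt' USING PigStorage() AS (id:long, x:int, y:int);

-- Full n x n pairing; PARALLEL 20 is an assumed value, tune it for your cluster.
crossed = CROSS a, b PARALLEL 20;

-- a::id > b::id already excludes self-pairs, so a separate != test is redundant.
-- This keeps n*(n-1)/2 pairs (1,249,975,000 for n = 50,000).
pairs = FILTER crossed BY a::id > b::id;

-- SIMILARITY is a hypothetical UDF name standing in for the original "UDF".
scored = FOREACH pairs GENERATE a::id AS id1, b::id AS id2,
         SIMILARITY(a::x, a::y, b::x, b::y) AS score;
```

Because each (id1, id2) pair appears in exactly one row after the filter, the GROUP step in the quoted version is probably unnecessary if the UDF can take the fields directly.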
