Re: intersection of row ids

2011-03-13 Thread Ted Dunning
Well, since you can start iterating from any point, you can just do a map-reduce over the larger table. In each mapper, on the first call, initialize a scanner into the smaller table to start with the key that you get from the larger table. Each time you get a sequential key from the master table

Re: intersection of row ids

2011-03-13 Thread Jesse Daniels
Has anyone tried the "zig-zag" merge join algorithm that Google uses to do something similar with their AppEngine data store (BigTable)? It's described here starting on slide 29: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine

Re: intersection of row ids

2011-03-11 Thread Stack
Understand that ROWCOL can use more memory than ROW. In general, blooms could soak up a bunch of your RAM. Just be conscious of this fact.
St.Ack
On Fri, Mar 11, 2011 at 4:25 PM, Usman Waheed wrote:
> I suggest it to be ROWCOL because you have many columns to match against in
> your second tabl

Re: intersection of row ids

2011-03-11 Thread Bill Graham
You could also do this with MR easily using Pig's HBaseStorage and either an inner join or an outer join with a filter on null, depending on whether you want matches or misses, respectively.
On Fri, Mar 11, 2011 at 4:25 PM, Usman Waheed wrote:
> I suggest it to be ROWCOL because you have many columns

Re: intersection of row ids

2011-03-11 Thread Usman Waheed
I suggest ROWCOL, because you have many columns to match against in your second table (column qualifiers).
-Usman
> Should the Bloom filter be ROW or ROWCOL?
> Vishal
> On Fri, Mar 11, 2011 at 11:44 AM, Lars George wrote:
>> Hi,
>> If you expect a lot of misses with that approach then en

Re: intersection of row ids

2011-03-11 Thread Vishal Kapoor
Should the Bloom filter be ROW or ROWCOL?
Vishal
On Fri, Mar 11, 2011 at 11:44 AM, Lars George wrote:
> Hi,
>
> If you expect a lot of misses with that approach then enable bloom filters
> on the second table for fast lookups of misses.
>
> Lars
>
> On Mar 11, 2011, at 9:44, Amandeep Khurana w

Re: intersection of row ids

2011-03-11 Thread Dave Latham
If the ordering of the row ids is the same in both tables and both are of the same order of magnitude in size, I would recommend opening scanners on both tables, then comparing the current row in each scanner and advancing whichever scanner is behind. Whenever you hit a match, you output it and advan
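The two-scanner approach described above can be sketched in a few lines. This is only an illustration, not HBase client code: plain sorted Python lists stand in for the two scanners, the function name `sorted_intersection` is made up for this example, and Python string comparison is assumed to match the row ordering of both tables.

```python
from typing import Iterable, Iterator


def sorted_intersection(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Yield row ids present in both inputs; both must be sorted ascending."""
    left_it, right_it = iter(left), iter(right)
    a = next(left_it, None)
    b = next(right_it, None)
    while a is not None and b is not None:
        if a == b:
            # Match: output it and advance both "scanners".
            yield a
            a = next(left_it, None)
            b = next(right_it, None)
        elif a < b:
            # Left side is behind, so advance it.
            a = next(left_it, None)
        else:
            # Right side is behind, so advance it.
            b = next(right_it, None)


# Hypothetical row ids, standing in for the results of two table scans.
table_a = ["row01", "row02", "row05", "row09"]
table_b = ["row02", "row03", "row05", "row10"]
matches = list(sorted_intersection(table_a, table_b))
print(matches)  # ['row02', 'row05']
```

Each side is read once, so the cost is one full pass over both tables, which is why this suits tables of comparable size.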

Re: intersection of row ids

2011-03-11 Thread Lars George
Hi,
If you expect a lot of misses with that approach then enable bloom filters on the second table for fast lookups of misses.
Lars
On Mar 11, 2011, at 9:44, Amandeep Khurana wrote:
> You can scan through one table and see if the other one has those rowids or
> not.
>
> On Thu, Mar 10, 2011
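To see why a bloom filter helps when most lookups miss, here is a toy sketch. It is not how HBase implements its blooms; the class `TinyBloom`, its sizes, and the SHA-256-based hashing are all assumptions chosen for brevity. The property it shows is the one that matters here: a "no" answer is definite, so the expensive lookup can be skipped, while a "yes" only means "maybe".

```python
import hashlib
from typing import Iterator


class TinyBloom:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Hypothetical row ids stored in the second table.
stored = ["row02", "row05", "row10"]
bloom = TinyBloom()
for row in stored:
    bloom.add(row)

# Only rows the filter cannot rule out need a real lookup.
candidates = [row for row in ["row01", "row02", "row05"] if bloom.might_contain(row)]
```

The bit array is also where the memory cost comes from: more distinct keys (for example row plus column rather than row alone) need more bits for the same false-positive rate.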

Re: intersection of row ids

2011-03-11 Thread Amandeep Khurana
You can scan through one table and see if the other one has those rowids or not.
On Thu, Mar 10, 2011 at 8:08 PM, Vishal Kapoor wrote:
> Friends,
> how do I best achieve intersection of sets of row ids
> suppose I have two tables with similar row ids
> how can I get the row ids present in one and
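A minimal sketch of this scan-and-probe idea, under loud assumptions: a Python list stands in for the scan of the first table, a set membership test stands in for a point lookup against the second table, and the helper name `rows_missing_from` is invented for this example.

```python
from typing import Callable, Iterable, List


def rows_missing_from(scan_rows: Iterable[str], other_has_row: Callable[[str], bool]) -> List[str]:
    """Scan one table and keep the row ids that the other table does not have."""
    return [row for row in scan_rows if not other_has_row(row)]


# Hypothetical data: the list plays the scanned table, the set plays the probed one.
first_table = ["row01", "row02", "row05", "row09"]
second_table = {"row02", "row05", "row10"}

missing = rows_missing_from(first_table, second_table.__contains__)
present = [row for row in first_table if row in second_table]
print(missing)  # ['row01', 'row09']
print(present)  # ['row02', 'row05']
```

This does one lookup per scanned row, so it tends to fit best when the scanned table is the smaller of the two.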

Re: intersection of row ids

2011-03-10 Thread Ted Dunning
You mean like write a map-reduce program that joins the key sets and outputs what you want?
On Thu, Mar 10, 2011 at 8:08 PM, Vishal Kapoor wrote:
> Friends,
> how do I best achieve intersection of sets of row ids
> suppose I have two tables with similar row ids
> how can I get the row ids present

intersection of row ids

2011-03-10 Thread Vishal Kapoor
Friends, how do I best achieve intersection of sets of row ids? Suppose I have two tables with similar row ids; how can I get the row ids present in one and not in the other? Do things get better if I have row ids as values in some qualifier, or as the qualifier itself? I hope the question is not too confusi