On Wed, 2005-11-30 at 16:27, John D. Burger wrote: > Scott Marlowe wrote: > > > select > > v1.pkey1, > > v1.field2, > > v1.field3, > > v1.field4, > > v2.pkey1, > > v2.field2, > > v2.field3, > > v2.field4, > > from > > view v1 > > join > > view v2 > > on ( > > v1.field2=v2.field2 and > > v1.field3=v2.field3 and > > v1.field3=v2.field3 and > > v1.pkey1<>v2.pkey > > ) > > > > How does that work? > > Won't this be a massive cross product of all pkey pairs that have the > same field values? >
Yes, assuming there are a lot of them. OTOH, if there are only a few duplicates you're looking for... How many are you expecting, percentage wise, to get back? > Here's what I'm currently using, in terms of your very helpful view: > > select v1.pkey1, v1.field2, v1.field3, v1.field4 > from view as v1 > join > (select v2.field1, v2.field2, v2.field3 > from view as v2 > group by v2.field2, v2.field3, v2.field4 > having count(*) = 1) > using (field2, field3, field4); > > This is the one that takes eight hours. :( Another way to express what > I want is this: > > select v1.pkey1, v1.field2, v1.field3, v1.field4 > from view as v1 > where not exists > (select true from view as v2 > where v1.field2 = v2.field2 > and v1.field3 = v2.field3 > and v1.field4 = v2.field4 > and v1.pkey1 <> v2.pkey1); > > That looks like a horrible nested loop, but I suppose I should try it > to make sure it is indeed slower then the previous query. If you can allocated enough shared memory for the set to fit in memory, you might be able to get a hash agg method, which is much faster than most other methods for this kind of thing, since it requires no sort. In a side point, I'm currently mushing 888,000,000 6 character codes up against each other to check for duplicates. I have 6 machines doing this, at 1 million codes compared to 1 million codes every 0.5 seconds aggregate. That gets me down to about 1 week. So, 8 hours is seeming quite fast. :) ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings