On Wed, 2005-11-30 at 16:27, John D. Burger wrote:
> Scott Marlowe wrote:
> 
> > select
> >     v1.pkey1,
> >     v1.field2,
> >     v1.field3,
> >     v1.field4,
> >     v2.pkey1,
> >     v2.field2,
> >     v2.field3,
> >     v2.field4,
> > from
> >     view v1
> > join
> >     view v2
> > on (
> >     v1.field2=v2.field2 and
> >     v1.field3=v2.field3 and
> >     v1.field3=v2.field3 and
> >     v1.pkey1<>v2.pkey
> > )
> >
> > How does that work?
> 
> Won't this be a massive cross product of all pkey pairs that have the 
> same field values?
> 

Yes, assuming there are a lot of them.  OTOH, if there are only a few
duplicates you're looking for...

How many are you expecting, percentage wise, to get back?

> Here's what I'm currently using, in terms of your very helpful view:
> 
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>    from view as v1
>    join
>    (select v2.field1, v2.field2, v2.field3
>       from view as v2
>       group by v2.field2,  v2.field3, v2.field4
>       having count(*) = 1)
>    using (field2, field3, field4);
> 
> This is the one that takes eight hours. :(  Another way to express what 
> I want is this:
> 
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>    from view as v1
>    where not exists
>      (select true from view as v2
>         where v1.field2 = v2.field2
>         and v1.field3 = v2.field3
>         and v1.field4 = v2.field4
>         and v1.pkey1 <> v2.pkey1);
> 
> That looks like a horrible nested loop, but I suppose I should try it 
> to make sure it is indeed slower then the previous query.

If you can allocated enough shared memory for the set to fit in memory,
you might be able to get a hash agg method, which is much faster than
most other methods for this kind of thing, since it requires no sort.

In a side point, I'm currently mushing 888,000,000 6 character codes up
against each other to check for duplicates.  I have 6 machines doing
this, at 1 million codes compared to 1 million codes every 0.5 seconds
aggregate.  That gets me down to about 1 week.  So, 8 hours is seeming
quite fast.  :)

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Reply via email to