I've been having my students post journal entries to a Drupal site, and we just 
noticed that some of them have been copying from each other (surprise, 
surprise).  If this was done with reports or other submissions, we'd just 
submit the deliverables to TurnItIn.com and go from there.

In this case I'd rather run the similarity checking on our local server, and my 
guess is that Full Text Search is the way to go.  I've found a couple of online 
posts about such things, but they're not quite as clear as I was hoping.  I'm 
hoping that the list can help me out (and keep this from becoming an O(n^4) 
challenge).

What I have is essentially a table containing an ID and some TEXT.

My first attempt used the similarity() function from pg_trgm as follows:

select
   aid
  ,bid
  ,similarity(source,dest) sim
from
(
   select 
      a.id aid
     ,a.body source
     ,b.id bid
     ,b.body dest 
   from
     posts a
   cross join 
     posts b
   where
     and a.id != b.id
) as inside 
order by sim desc;

I haven't been thinking in SQL terms for a while, so I didn't come up with a 
way to eliminate half of my checks by exploiting symmetry (a simple way to 
accomplish this would be helpful).  Unfortunately this solution will take way 
too long to execute.

I'm hoping that the full text features can be shoehorned into this problem, but 
from what little I've seen it will be tricky because comparing a tsvector to 
another tsvector isn't supported.  Plus there's weird (to the novice) indexing 
that will (I hope) help to keep the complexity down.

Thoughts from the list are most welcome, and I'll write up a HOWTO once I get 
this working.

Thanks!

Jason
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to