Duplicate documents in a corpus

2011-07-28 Thread Rich Heimann
either Lucene at ingestion or Mahout at post-processing? The Vector Space Model seems to be notionally similar to PCA or Factor Analysis, which both have similar ambitions. Thoughts? Thank you in advance. Regards, Rich Heimann Richard Heimann

Re: Please help me with a basic question...

2011-05-20 Thread Rich Heimann
> If you are convinced that length normalization is the culprit you could give a try to:
> - omitting norms altogether at indexing
> - using e.g. SweetSpotSimilarity, which does not favor shorter documents.
> Regards,
> Doron
>
> On Thu, May 19, 2011 at 5:20 PM, Rich Heimann

Re: Please help me with a basic question...

2011-05-19 Thread Rich Heimann
> re IDF (in similarity? in solrconfig?).
>
> paul
>
>
> On 18 May 2011 at 21:30, Rich Heimann wrote:
>
> > Hello all,
> >
> > This is my first time on the list and my first question...forgive me if this
> > has been hashed out in the past.

Please help me with a basic question...

2011-05-18 Thread Rich Heimann
Hello all, This is my first time on the list and my first question...forgive me if this has been hashed out in the past. We have set up Lucene/Solr and are getting somewhat spurious results. It appears to be a result of heterogeneous document sizes. In other words, the top results are sometimes (