Re: Lucene & LSA

Miles Efron Thu, 14 Dec 2006 18:26:15 -0800

U of Tennessee professor Michael Berry maintains a good site regarding software for computing SVD on large, sparse matrices:


        http://www.cs.utk.edu/~lsi/


The site also points to the LSI patent.

FWIW it's very easy to extract term-doc counts from a Lucene index and format them for software such as SVDPACK


        http://www.netlib.org/svdpack/index.html

or

        http://tedlab.mit.edu:16080/~dr/SVDLIBC/
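A minimal sketch of the formatting step, assuming the term counts have already been read out of the index into per-document token lists (the Lucene side is not shown, and the function name and input shape are hypothetical). The output layout follows what SVDLIBC describes as its sparse text format: a header line with rows, columns, and total non-zeros, then for each column its non-zero count followed by "row value" pairs. Check it against the SVDLIBC documentation before relying on it.

```python
from collections import Counter


def to_svdlibc_sparse_text(docs):
    """Return (text, vocab) for a term-by-document count matrix.

    docs: list of token lists, one per document (hypothetical input;
    in practice the counts would come from the Lucene index).
    Rows are terms, columns are documents.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    row_of = {term: i for i, term in enumerate(vocab)}
    columns = [Counter(doc) for doc in docs]
    nonzeros = sum(len(col) for col in columns)

    # Header: number of rows, number of columns, total non-zero entries.
    lines = [f"{len(vocab)} {len(docs)} {nonzeros}"]
    for col in columns:
        # Per column: non-zero count, then one "row value" pair per entry.
        lines.append(str(len(col)))
        for term in sorted(col, key=row_of.get):
            lines.append(f"{row_of[term]} {col[term]}")
    return "\n".join(lines) + "\n", vocab


if __name__ == "__main__":
    text, vocab = to_svdlibc_sparse_text(
        [["war", "napoleon", "war"], ["peace"]]
    )
    print(text, end="")
```

The returned vocab list maps row indices back to terms, which is needed later to interpret the term vectors the SVD produces.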

Of course using the resulting matrices isn't so trivial if you want to stay within Lucene. Also, they are dense, so even at relatively low dimensionality, you're still storing a lot of data.
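To put a rough number on the storage point, here is a back-of-the-envelope sketch. The collection size, the number of dimensions k, and the 4-byte float width are all hypothetical values chosen for illustration, not figures from any real index.

```python
def dense_lsa_bytes(num_terms, num_docs, k, bytes_per_value=4):
    """Bytes for the dense term (num_terms x k) and document
    (num_docs x k) factor matrices plus the k singular values."""
    return (num_terms * k + num_docs * k + k) * bytes_per_value


if __name__ == "__main__":
    # Hypothetical collection: 500,000 terms, 1,000,000 documents, k = 300.
    size = dense_lsa_bytes(500_000, 1_000_000, 300)
    print(f"{size / 2**30:.2f} GiB")  # prints "1.68 GiB"
```

Every entry of the factor matrices is stored whether or not the term occurs in the document, which is why the total grows with (terms + documents) times k rather than with the number of postings.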


-Miles

On Thu, 14 Dec 2006, Marvin Humphrey wrote:

On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:
it is possible to extract the matrix from the indexing file?
I don't know any API to extract the matrix from the index file directly.
How could we make it work to write an open source decomposed vector model search engine a la LSA without running afoul of the LSA patents? Maybe use an algorithm other than SVD for the decomposition?
I'm only superficially familiar with LSA, but I'm always looking for ways to improve relevance. In theory it would be nice to factor in a decomposed similarity measure, so that on a search for 'napoleonic war', documents which contained a lot of words which were similar to either 'napoleon' or 'war' would score higher than documents which had only a passing mention.
Personally, I'm less interested in "more like this" queries, because the precision of search results based solely on similar document vectors is so poor -- proper names and other rare tokens unrelated to the original query wreak havoc on the relevance scores. But maybe there's a way in the original keyword search to juice up the scores of documents which not only contain the original terms, but also a lot of terms which are similar to them.
I dunno if it would be worth the computational effort, though. A decomposed matrix is going to be inherently expensive to generate, because you have to start from a complete matrix. That doesn't jibe well with incremental indexing.
Also, it's not clear to me how much of a gain we'd get in relevance. My hunch is that shorter, tightly focused documents would benefit some and that longer more diffuse documents -- which might contain passages which were just as useful as those in a shorter document -- would lose. That wouldn't be helpful for a common case in naive web search, where impossible-to-exclude navigational and advertising text could end up diluting the scores of perfectly good material.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


__________________________________
Miles Efron
http://www.ibiblio.org/mefron
[EMAIL PROTECTED]

