U of Tennessee professor Michael Berry maintains a good site regarding
software for computing SVD on large, sparse matrices:
http://www.cs.utk.edu/~lsi/
The site also points to the LSI patent.
FWIW it's very easy to extract term-doc counts from a Lucene index and
format them for software such as SVDPACK
http://www.netlib.org/svdpack/index.html
or
http://tedlab.mit.edu:16080/~dr/SVDLIBC/
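
As a rough sketch of the formatting step: the function below takes
term-doc counts that have already been pulled out of the index and writes
them in what I understand to be SVDLIBC's sparse text format (a header of
"rows cols nonzeros", then for each column its nonzero count followed by
"row value" pairs). The function name and the dict-of-dicts input are
made up for illustration, and the Lucene extraction itself is not shown.
Columns are terms here because an inverted index is walked term by term;
that yields a doc-by-term matrix, i.e. the transpose of the usual
term-by-doc layout.

```python
def to_svdlibc_sparse_text(postings, num_docs):
    """Format term-doc counts as sparse text, one column per term.

    postings: dict mapping term -> dict mapping doc_id (0-based) -> count.
    This input shape is an assumption for the sketch, not a Lucene API.
    Returns (text, terms), where terms gives the column order.
    """
    terms = sorted(postings)
    nonzeros = sum(len(docs) for docs in postings.values())
    # Header: number of rows (docs), number of columns (terms), nonzeros.
    lines = [f"{num_docs} {len(terms)} {nonzeros}"]
    for term in terms:
        docs = postings[term]
        # Each column starts with its own nonzero count ...
        lines.append(str(len(docs)))
        # ... followed by "row value" pairs in ascending row order.
        for doc_id in sorted(docs):
            lines.append(f"{doc_id} {docs[doc_id]}")
    return "\n".join(lines) + "\n", terms


if __name__ == "__main__":
    toy = {"war": {0: 2, 2: 1}, "napoleon": {0: 1}}
    text, terms = to_svdlibc_sparse_text(toy, num_docs=3)
    print(text, end="")
```

Check the header order and index base against the documentation of
whichever package you feed this to before trusting it.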
Of course using the resulting matrices isn't so trivial, if you want to
stay within Lucene. Also, they are dense, so even at relatively low
dimensionality, you're still storing a lot of data.
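
To put a number on that, here is a back-of-the-envelope calculation. The
collection size, vocabulary size, and k below are invented for
illustration, and the estimate counts only the dense term and document
vectors (the k singular values are negligible by comparison).

```python
def dense_factor_bytes(num_terms, num_docs, k, bytes_per_value=4):
    """Bytes needed to hold one k-dimensional vector per term and per doc.

    bytes_per_value=4 assumes single-precision floats.
    """
    return (num_terms + num_docs) * k * bytes_per_value


if __name__ == "__main__":
    # Hypothetical collection: 500,000 terms, 1,000,000 docs, k = 300.
    size = dense_factor_bytes(500_000, 1_000_000, 300)
    print(f"{size / 1e9:.1f} GB")
```

Under those assumptions the vectors alone come to 1.8 GB, before any
indexing structure on top of them.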
-Miles
On Thu, 14 Dec 2006, Marvin Humphrey wrote:
On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote:
Is it possible to extract the matrix from the index file?
I don't know any API to extract the matrix from the index file directly.
How could we make it work to write an open source decomposed vector model
search engine a la LSA without running afoul of the LSA patents? Maybe use
an algorithm other than SVD for the decomposition?
I'm only superficially familiar with LSA, but I'm always looking for ways to
improve relevance. In theory it would be nice to factor in a decomposed
similarity measure, so that on a search for 'napoleonic war', documents which
contained a lot of words which were similar to either 'napoleon' or 'war'
would score higher than documents which had only a passing mention.
Personally, I'm less interested in "more like this" queries, because the
precision of search results based solely on similar document vectors is so
poor -- proper names and other rare tokens unrelated to the original query
wreak havoc on the relevance scores. But maybe there's a way in the original
keyword search to juice up the scores of documents which not only contain the
original terms, but also a lot of terms which are similar to them.
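
One way such a boost might look, purely as a sketch: take the ordinary
keyword score and scale it up by the cosine similarity between the query
and the document in the reduced space. The function names, the
multiplicative blend, and the weight parameter are all assumptions made
for illustration; none of this is an existing Lucene feature, and the
low-dimensional vectors are assumed to have been computed elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def boosted_score(keyword_score, query_vec, doc_vec, weight=0.5):
    """Scale a keyword score by similarity in the decomposed space.

    Negative similarities are clamped to zero so the boost never
    lowers a score. A document with keyword_score 0 stays at 0, so
    only documents matching the original terms can benefit.
    """
    similarity = max(0.0, cosine(query_vec, doc_vec))
    return keyword_score * (1.0 + weight * similarity)


if __name__ == "__main__":
    print(boosted_score(2.0, [1.0, 0.0], [1.0, 0.0]))
    print(boosted_score(2.0, [1.0, 0.0], [0.0, 1.0]))
```

Because the blend is multiplicative, a document that never matched the
keywords cannot be pulled into the results by similarity alone, which
sidesteps the "more like this" precision problem at the cost of recall.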
I dunno if it would be worth the computational effort, though. A decomposed
matrix is going to be inherently expensive to generate, because you have to
start from a complete matrix. That doesn't jibe well with incremental
indexing.
Also, it's not clear to me how much of a gain we'd get in relevance. My
hunch is that shorter, tightly focused documents would benefit some and that
longer more diffuse documents -- which might contain passages which were just
as useful as those in a shorter document -- would lose. That wouldn't be
helpful for a common case in naive web search, where impossible-to-exclude
navigational and advertising text could end up diluting the scores of
perfectly good material.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
__________________________________
Miles Efron
http://www.ibiblio.org/mefron
[EMAIL PROTECTED]