Re: Lucene + LSI

2005-12-13 Thread Dave Kor
On 12/13/05, Dave Kor <[EMAIL PROTECTED]> wrote: > On 12/13/05, Ian Soboroff <[EMAIL PROTECTED]> wrote: > > Paul Libbrecht <[EMAIL PROTECTED]> writes: > > > > > We're also thinking about implementing something similar to LSI within > > > ActiveMath which is lucene-powered where both formulae and te

Re: Impact of Term Vectors

2005-12-13 Thread Ira Goldstein
We've run into an issue with the term vectors. When indexing a small corpus (~3k docs, 1.3G) everything works fine, as it does with a small number of documents from TREC-6 (so we believe that our indexing code is ok). However, when we tried to index the full TREC-6 corpus (~300,000 docs, 2G) the t

Re: Top n Searches

2005-12-13 Thread Erik Hatcher
It would be if the PhraseQuery (or SpanNearQuery) were used with some slop specified. Erik On Dec 13, 2005, at 1:08 PM, gekkokid wrote: Would 'x y z' and 'y z x' give the same results? I didn't think that was the case - Original Message - From: "Paul Williams" <[EMAIL PROT

Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

2005-12-13 Thread Chris Hostetter
: The DistributingMultiFieldQueryParser would correctly generate a query : that would find fruit in one of the fields, but would only ensure that : apples did not appear in one field, not not appear in all the fields, : which was the behaviour I wanted. Hence negations didn't really work if : the

Re: Top n Searches

2005-12-13 Thread gekkokid
Would 'x y z' and 'y z x' give the same results? I didn't think that was the case - Original Message - From: "Paul Williams" <[EMAIL PROTECTED]> To: Sent: Tuesday, December 13, 2005 5:22 PM Subject: RE: Top n Searches That was the approach I was planning to take but I've been asked to

Impact of Term Vectors (was ApacheCon next week)

2005-12-13 Thread Dan Climan
Good question. I was wondering about the impact of adding term vectors with the various options. For example, is adding term vectors with both positions and offsets a significant impact? Which current parts of lucene (including contributions) take advantage of term vectors being present? I know tha

Re: ApacheCon next week

2005-12-13 Thread Grant Ingersoll
Thanks, Jeff. I have only done basic testing, so not completely sure on your question. However, one trade off is definitely in disk space. As far as searching, I don't think there should be any impact b/c you get the vector separate from a search via the IndexReader. Perhaps, the compound

RE: Top n Searches

2005-12-13 Thread Paul Williams
That was the approach I was planning to take but I've been asked to provide a more intelligent implementation. Basically I need to count search phrases like 'x y z' and 'y z x' as being the same. -Original Message- From: Cheolgoo Kang [mailto:[EMAIL PROTECTED] Sent: 08 December 2005 08:
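One way to treat 'x y z' and 'y z x' as the same search when counting is to reduce each logged query to an order-insensitive key before tallying. The following is only a minimal sketch, not code from the thread: the names `phrase_key` and `top_searches` are hypothetical, and it assumes queries are plain whitespace-separated terms with no Lucene operators or quoting.

```python
from collections import Counter


def phrase_key(query):
    """Build an order-insensitive key: lowercase, split on whitespace, sort."""
    return tuple(sorted(query.lower().split()))


def top_searches(queries, n):
    """Return the n most frequent queries, ignoring term order and case."""
    counts = Counter(phrase_key(q) for q in queries)
    return counts.most_common(n)


# Hypothetical search log: the first, second and fourth entries share one key.
log = ["x y z", "y z x", "a b", "Z Y X", "b a"]
print(top_searches(log, 2))
```

Sorting the terms discards order entirely, so this also merges phrases whose meaning differs by word order; whether that is acceptable depends on the application.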

Re: html parsers and numers of terms

2005-12-13 Thread J.J. Larrea
Glad that hint was useful. I was totally bit by that artifact myself. It turns out that there were XML numeric character references within VARCHAR fields in a database I was indexing, so I never suspected that the NCRs I was seeing in Luke had anything to do with the non-XML non-HTML (so I tho

Re: html parsers and numers of terms

2005-12-13 Thread Robert Watkins
So obvious I missed it (at least that's my excuse). I'm on the road at the moment and -- can you believe it? -- didn't bring my copy of Lucene In Action with me! Looks like I'll have to get the source code from lucenebook.com to crib the analyzer demo code. Much obliged, -- Robert On Tue, 13 Dec

lucene similarity value range

2005-12-13 Thread duiduder
Hi, I am wondering whether the range of the similarity values is guaranteed to be inside a well-defined range (e.g. between [0..1]). I use the DefaultSimilarity implementation from the SVN Lucene version and actually receive values of e.g. 1.84. Is this a bug? Is there any range guaranteed? Wha

Re: html parsers and numers of terms

2005-12-13 Thread Robert Watkins
Aha! I had, indeed, been fooled by Luke into thinking that the entities had been converted upon analysis, but you have set me straight. Thanks, -- Robert On Tue, 13 Dec 2005, J.J. Larrea wrote: Beware of HTML/XML entities in your input stream! The Lucene analyzers (including StandardAnalyzer

Re: html parsers and numers of terms

2005-12-13 Thread J.J. Larrea
Beware of HTML/XML entities in your input stream! The Lucene analyzers (including StandardAnalyzer) do not interpret these representation-specific encodings, and assume the & and ; delimiters are punctuation. How they deal with punctuation depends on the specific Analyzer logic. For example,
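The effect described here can be illustrated without Lucene. The sketch below is an assumption-laden stand-in: `tokenize` is a hypothetical regex tokenizer that treats `&`, `#` and `;` as punctuation, not any actual Lucene Analyzer, and the sample string is invented. It shows how undecoded entities turn into spurious terms, and how decoding them first (here with Python's standard `html.unescape`) avoids that.

```python
import html
import re


def tokenize(text):
    """Stand-in tokenizer: lowercase, keep runs of word characters only."""
    return re.findall(r"\w+", text.lower())


# Invented sample containing a numeric reference and two named entities.
raw = "caf&#233; &amp; cr&egrave;me"

# Without decoding, fragments of the entities become terms of their own.
print(tokenize(raw))

# Decoding the entities before tokenizing yields the intended words.
print(tokenize(html.unescape(raw)))
```

Spurious terms such as `233` or `amp` are one plausible reason two indexes built from the same documents could report different term counts.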

Re: :how to add int fileds to lucene:

2005-12-13 Thread Erik Hatcher
Ravi, A few great starting points... the code from http://www.lucenebook.com (and it wouldn't hurt my feelings if you picked up a copy of the book itself too :), the examples from the many articles that have been written on Lucene, and last but not least, the unit tests of Lucene itself w

Re: html parsers and numers of terms

2005-12-13 Thread Erik Hatcher
How about taking a single simple HTML file, running it through each parser, dumping the tokens into separate collections (or output to a single text file) and diff them? Erik On Dec 13, 2005, at 7:33 AM, Robert Watkins wrote: I have been experimenting with a couple of HTML parsers,
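The comparison suggested above could be sketched roughly as follows. This assumes the tokens from each parser have already been dumped into lists; `token_diff` and the two sample lists are hypothetical and only illustrate the set comparison, not any particular HTML parser.

```python
def token_diff(tokens_a, tokens_b):
    """Return (terms only in A, terms only in B), each sorted."""
    a, b = set(tokens_a), set(tokens_b)
    return sorted(a - b), sorted(b - a)


# Invented token dumps from two parsers run over the same HTML file.
parser_a = ["lucene", "index", "amp", "search"]
parser_b = ["lucene", "index", "search", "title"]

only_a, only_b = token_diff(parser_a, parser_b)
print("only in A:", only_a)
print("only in B:", only_b)
```

Comparing sets rather than raw sequences matches what Luke's term count measures, since repeated occurrences of a term do not add to the number of distinct terms.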

html parsers and numers of terms

2005-12-13 Thread Robert Watkins
I have been experimenting with a couple of HTML parsers, primarily to compare performance, but have discovered a difference in the index for which I haven't, with assurance, discovered the cause. The difference is in the number of terms reported by Luke. The indexes created with the content parsed

Re: Lucene + LSI

2005-12-13 Thread adasal
I'm sure your evaluation is correct from the inspection I have made of the NITLE effort, e.g. the need for tweaking. I will follow up your links when I have a chance, and thank you for them. I didn't know that about Maciej Ceglowski. People do move on. I wish him luck (in case he is reading this or

Re: Lucene + LSI

2005-12-13 Thread Sebastian Marius Kirsch
On Tue, Dec 13, 2005 at 10:53:42AM +, adasal wrote: > There seem to be quite a few alternatives around. I would be interested in > comments on the following:- > The work at NITLE > using Contextual > Network Search (CNS) a graph-based alternative

Re: Lucene + LSI

2005-12-13 Thread adasal
There seem to be quite a few alternatives around. I would be interested in comments on the following:- The work at NITLE using Contextual Network Search (CNS) a graph-based alternative to LSI. This work *[PDF]* An Introduction to *Random* Indexing

:Creating the search on last modified list value:

2005-12-13 Thread Ravi
Hi, I want to create the following search query: search and return results updated in the: any time, last 7 days, last 2 weeks, last 1 month, and so on. http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/javascript/queryConstructor/luceneQueryConst
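One possible approach is to express each "updated in the last N days" option as a range clause in Lucene's query syntax. The sketch below only builds the query string. It assumes a field named `modified` (hypothetical) that was indexed as a `yyyyMMdd` string so that lexicographic order matches date order; the helper name `modified_within` is likewise an assumption.

```python
from datetime import date, timedelta


def modified_within(days, today, field="modified"):
    """Build an inclusive range clause covering the last `days` days."""
    start = today - timedelta(days=days)
    return f"{field}:[{start:%Y%m%d} TO {today:%Y%m%d}]"


# Example with a fixed reference date so the output is reproducible.
ref = date(2005, 12, 13)
print(modified_within(7, ref))   # last 7 days
print(modified_within(14, ref))  # last 2 weeks
print(modified_within(30, ref))  # roughly last 1 month
```

For the "any time" option the clause would simply be left out of the query.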

newbie question

2005-12-13 Thread Dan Nicolici
Hi! For starters I want to apologize if this is the wrong place to post a Cocoon-Lucene question. This is my first encounter with Lucene. I am trying to integrate it with Cocoon. From what I read so far I prefer the approach with LuceneIndexTransformer. Here is where I need some assistance. Fro

DistributingMultiFieldQueryParser and DisjunctionMaxQuery

2005-12-13 Thread Miles Barr
On Mon, 2005-12-12 at 15:35 -0800, Chris Hostetter wrote: > : Oh, BTW: I just found the DisjunctionMaxQuery class, recently added it > : seems. Do you think this query structure could benefit from using it > : instead of the BooleanQuery? > > DisjunctionMaxQuery kicks ass (in my opinion), and It

Re: Integrating Lucene with hibernate3

2005-12-13 Thread Benjamin Reitzammer
Hi, Hibernate 3.1 has (rudimentary) builtin support for Lucene, via Annotations. See here http://www.hibernate.org/hib_docs/annotations/reference/en/html/lucene.html I haven't tested it extensively but it worked quite well in my basic testing. Though couldn't find much documentation. But the sour