Linear search using reader vs. scorer implementation

2006-08-07 Thread Mathias Lux
Hi! I'm working in my spare time on Lire, a content based image retrieval library (searching for similar looking images in other words, see http://www.semanticmetadata.net/lire) based on Lucene. As the cbir features are medium sized integer vectors I put them into fields, and read them with the I
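For readers unfamiliar with the idea, a linear search over integer feature vectors can be sketched as below. This is only an illustration and not Lire or Lucene code: the class name, the L1 distance measure, and the in-memory `int[][]` storage are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lire or Lucene.
final class LinearVectorSearch {

    /** One scored candidate: index of the stored vector and its distance to the query. */
    static final class Match {
        final int index;
        final long distance;

        Match(int index, long distance) {
            this.index = index;
            this.distance = distance;
        }
    }

    /** L1 (city-block) distance between two equal-length integer vectors (assumed measure). */
    static long l1Distance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((long) a[i] - b[i]);
        }
        return sum;
    }

    /** Scans every stored vector once and returns the k closest, nearest first. */
    static List<Match> nearest(int[][] stored, int[] query, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on distance: the worst of the current best k sits on top.
        PriorityQueue<Match> heap = new PriorityQueue<>(
                Comparator.comparingLong((Match m) -> m.distance).reversed());
        for (int i = 0; i < stored.length; i++) {
            long d = l1Distance(stored[i], query);
            if (heap.size() < k) {
                heap.add(new Match(i, d));
            } else if (d < heap.peek().distance) {
                heap.poll();
                heap.add(new Match(i, d));
            }
        }
        List<Match> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong((Match m) -> m.distance));
        return result;
    }
}
```

The bounded heap keeps memory at k entries however many vectors are scanned; the cost of the scan itself stays proportional to the number of stored vectors.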

Re: Stemmer Implementation Strategy - feedback?

2006-08-07 Thread Marvin Humphrey
On Aug 7, 2006, at 11:23 PM, Marios Skounakis wrote: I directed the question to the lucene list in order to find out what people think about the general case Martin Porter touches on some of the pros and cons of a dictionary-based approach to stemming at

Re: Stemmer Implementation Strategy - feedback?

2006-08-07 Thread Marios Skounakis
Hi Grant, Thanks for the interesting reply. Grant Ingersoll wrote: Hey Marios, It sounds like you have a reasonable plan and you have thought through the ideas. And the answer to many of your questions below is "it depends". Do you have enough memory to hold the whole lexicon in memory?

Poor performance "race condition" in FieldSortedHitQueue

2006-08-07 Thread hutchiko
Hey all, just want to run an issue that I've recently identified while looking at some performance issues we are having with our larger indexes past you all. Basically what we are seeing is that when there are a number of concurrent searches being executed over a new IndexSearcher, the quite expe

Re: About the use of HitCollector

2006-08-07 Thread hu andy
Hey, Simon, thanks for your reply. I have an ID Field in the index. To keep indexing fast, I put some fields in a database, because I found that the total number of fields in a Document badly degrades the indexing speed. So for the search, I will first query the database to get a list of ID,

Re: Classifieds rotation - weighting Lucene results by previous show frequency?

2006-08-07 Thread Grant Ingersoll
You could also write a custom sorter that does this, I think. -Grant On Aug 7, 2006, at 10:24 PM, Doron Cohen wrote: If the 'small classifieds index' is sufficiently small to be re-indexed every night, I think this would be a simple solution - just set the document boosts according to these

Re: Classifieds rotation - weighting Lucene results by previous show frequency?

2006-08-07 Thread Doron Cohen
If the 'small classifieds index' is sufficiently small to be re-indexed every night, I think this would be a simple solution - just set the document boosts according to these statistics - i.e. boost more down docs of classifieds that were shown more yesterday - http://lucene.apache.org/java/docs/ap
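As a rough illustration of the suggestion above, yesterday's show counts could be turned into boost values before the nightly re-index. The formula `1 / (1 + shows)` and every name below are assumptions chosen for the example; the message does not specify a formula. In Lucene of that era, a value like this could then be applied at index time with `Document.setBoost(float)`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the decay formula is an assumption, not taken from the thread.
final class ShowCountBoost {

    /** Maps yesterday's show count to a boost in (0, 1]; more shows give a lower boost. */
    static float boostFor(int showsYesterday) {
        if (showsYesterday < 0) {
            throw new IllegalArgumentException("show count must not be negative");
        }
        return 1.0f / (1.0f + showsYesterday);
    }

    /** Computes a boost per classified id from a map of id to show count. */
    static Map<String, Float> boosts(Map<String, Integer> showsById) {
        Map<String, Float> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : showsById.entrySet()) {
            result.put(e.getKey(), boostFor(e.getValue()));
        }
        return result;
    }
}
```

A classified that was never shown keeps a boost of 1.0, so it is neither promoted nor demoted relative to the default.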

Classifieds rotation - weighting Lucene results by previous show frequency?

2006-08-07 Thread Chun Wei Ho
We are starting to run a small index of classifieds alongside our main search items. The classifieds are also in a lucene index. We show classifieds that match the user's search criteria, which means we do a lucene search on that index and show the top few results. We also keep track of the number

Re: More like this returning similarities that are too generic

2006-08-07 Thread Chad Hardin
Thank you Erick, that was what I anticipated would be necessary. There's still the issue of the queries from MoreLikeThis not returning results for terms I had expected ("bikes"). For example, I have these four very short documents: "bikes are a handy tool for getting from diffrent locations

Re: query syntax problem

2006-08-07 Thread Yiqun "Eddie" Cao
Setting the field to Field.Index.UN_TOKENIZED works perfectly. Thanks to all. Regards, Eddie On 8/7/06, Nicolas Lalevée <[EMAIL PROTECTED]> wrote: On Monday, 07 August 2006 19:28, Yiqun "Eddie" Cao wrote: > Hi, > > We are using lucene in a chemistry database, and we are dealing with > special words

Re: More like this returning similarities that are too generic

2006-08-07 Thread Erick Erickson
Well, I expect that defining "less common" is tricky and doesn't lend itself to a canned answer. Would it work to create your own list of stop words (possibly very large) to use for indexing and/or searching? This would simply exclude the "less common" words (as you define them). StandardAnalyzer
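The effect of a custom stop-word list can be sketched without any Lucene classes. The snippet below is a plain-Java stand-in with hypothetical names: it lowercases the text, splits on whitespace, and drops every token found in the caller's stop set. A real setup would pass the stop words to an analyzer rather than filter strings by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for analyzer-level stop-word removal; not Lucene code.
final class StopWordFilterSketch {

    /** Returns the lowercased whitespace-separated tokens of text that are not stop words. */
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

The same stop set has to be used at indexing and at search time, otherwise query terms can refer to words that were never indexed.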

More like this returning similarities that are too generic

2006-08-07 Thread Chad Hardin
hi all, I'm new to lucene but I'm loving it! I'm writing a prototype that links documents together based upon similarities. Obviously the first thing I did was use MoreLikeThis. However, it seems to be finding matches based upon words that are too common, in this case the words "from"

Re: About the use of HitCollector

2006-08-07 Thread Simon Willnauer
Hey Andy, I don't know how you determine whether a document has to be displayed or not, but I use a filter to do this kind of job. We have an index for a specific website with personalized areas which should be searchable for users having corresponding usergroups. That works quite well and you c

Re: query syntax problem

2006-08-07 Thread Nicolas Lalevée
On Monday, 07 August 2006 19:28, Yiqun "Eddie" Cao wrote: > Hi, > > We are using lucene in a chemistry database, and we are dealing with > special words containing both digits and characters in English alphabets, > such as PFC-0234. To prevent lucene from cutting the word into two, we have > replace

Re: query syntax problem

2006-08-07 Thread Erick Erickson
When you say "we've tried the whitespace analyzer", did you mean for BOTH indexing and searching? If you only use it for one of those, you'd see results like this. And do you use Luke? It'll let you examine your index and see what's *actually* in it. It's the first place I go when I don't get resu

query syntax problem

2006-08-07 Thread Yiqun "Eddie" Cao
Hi, We are using lucene in a chemistry database, and we are dealing with special words containing both digits and characters in English alphabets, such as PFC-0234. To prevent lucene from cutting the word into two, we have replaced all dashes into underscores, so PFC-0234 is stored and indexed as

Re: About the use of HitCollector

2006-08-07 Thread hu andy
Martin, Thank you for your reply. But the Lucene API said: This is called in an inner search loop. For good search performance, implementations of this method should not call Searcher.doc(int) or IndexReader.document(int) on every document number encountered. Because I have to check a field in the

Re: Modify index on database update

2006-08-07 Thread vasu shah
Thanks Michael. You explained it very nicely. I will look into the third approach. The first and second approaches are not feasible for me. Thanks again. -Vasu Michael McCandless <[EMAIL PROTECTED]> wrote: > My application database can be updated outside the application also. Wheneve

RE: running a lucene indexing app as a windows service on xp, crashing

2006-08-07 Thread Mark Modrall
Oh, sorry forgot - jdk 1.5.0_06

RE: running a lucene indexing app as a windows service on xp, crashing

2006-08-07 Thread Mark Modrall
Hi Mike... Sorry I didn't respond over the weekend; I wasn't checking work email. A few more pieces of information about our circumstances: the indexing system is running Windows Server 2003 sp1, the directories that the indexer is using are shares (helped reproduce; we had a cou

Re: lengthNorm method of Similarity not being called

2006-08-07 Thread Michael McCandless
At this post Erik says: "Sure, you can subclass DefaultSimilarity and override and tweak just the lengthNorm() method. Be sure to use IndexWriter.setSimilarity() to get your custom one used." Well, I traced my own method lengthNorm and realized that this method is not being called. The leng

Re: Stemmer Implementation Strategy - feedback?

2006-08-07 Thread Grant Ingersoll
Hey Marios, It sounds like you have a reasonable plan and you have thought through the ideas. And the answer to many of your questions below is "it depends". Do you have enough memory to hold the whole lexicon in memory? Is this lexicon going to grow significantly over time? I have, in

lengthNorm method of Similarity not being called

2006-08-07 Thread Enrique Lamas
Hi, I want to execute a query and sort the results in a special way. Looking at the Explanation info returned, I've decided to alter the value that the Explanation reports as fieldNorm. Searching this mailing list, I found this post: http://www.mail-archive.com/java-user@lucene.apache.org/msg03304.htm

Re: About the use of HitCollector

2006-08-07 Thread Martin Braun
hi andy, > How can I use HitCollector to iterate over every returned document? You have to override the collect method of the HitCollector class and then store the retrieved data in an array or map. Here is just a rough source-code sketch (is = IndexSearcher) is.search(query, null
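The pattern described in the reply above, overriding `collect` and storing what it receives, can be illustrated with a self-contained stand-in. Lucene's `HitCollector` declares `collect(int doc, float score)`; the abstract class below only mirrors that shape so the example compiles without Lucene on the classpath, and both class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in that mirrors the shape of Lucene's HitCollector callback; not the real class.
abstract class HitCollectorStandIn {
    abstract void collect(int doc, float score);
}

// Hypothetical collector that records every hit it is handed.
final class CollectingSketch extends HitCollectorStandIn {
    final List<Integer> docIds = new ArrayList<>();
    final List<Float> scores = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        // Keep the callback cheap: remember only the document number and score.
        docIds.add(doc);
        scores.add(score);
    }
}
```

With the real API, an instance of such a collector would be passed to the searcher's `search` call, and the stored document numbers would be resolved to documents only after the search has finished.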

About the use of HitCollector

2006-08-07 Thread hu andy
How can I use HitCollector to iterate over every returned document? Thank you in advance.