Stemmers remove part of a query when using QueryParser

2008-01-25 Thread Jay Hill
I have added a stemming Analyzer to my indexing and searching. I've tried both Porter and KStem and have gotten very good results with both, with KStem being the best. The only problem is that, when analyzing on the search end using QueryParser, part of my query is being removed by QueryParser: +pub:game

Re: Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-25 Thread Michael Stoppelman
For anyone interested, I did work around this. The problem was that I wasn't warming the index enough. Now that I run full queries against the new index before switching to it, everything runs smoothly when the indices are swapped. There must be some lower level caches that need to be filled that I'm not

Re: OutOfMemoryError on small search in large, simple index

2008-01-25 Thread jm
I am very interested indeed. Do I understand correctly that the tweak you made reduces the memory used when searching if you have many docs in the index? I am omitting norms too. If that is the case, can someone point me to the change that is required? I understand from Yoniks comm

Re: Retain the index

2008-01-25 Thread Developer Developer
Check if there are any lock files in your index directory after the process is completed. There should be no lock files if the index was correctly closed. On Jan 25, 2008 8:59 AM, Erick Erickson <[EMAIL PROTECTED]> wrote: > This should not be happening. I've got to assume that you have > more t

Re: TermEnum trick

2008-01-25 Thread Grant Ingersoll
How about a QueryWrapperFilter.bits(IndexReader)? -Grant On Jan 25, 2008, at 10:49 AM, Cam Bazz wrote: Hello, How about getting which documents have that term as a bitset? In other words, now that I have field=a, field=b do I use regular query logic to get the bitsets with hitcollecto

RE: Lucene to index OCR text

2008-01-25 Thread Renaud Waldura
The author of the presentation I linked to earlier pointed me to this: http://wiki.apache.org/jakarta-lucene/SpellChecker Which is implemented by: http://www.marine-geo.org/services/oai/docs/javadoc/org/apache/lucene/spell/NGramSpeller.html -Original Message- From: [EMAIL PROTECTED

Re: TermEnum trick

2008-01-25 Thread Erick Erickson
Several things. 1> you're allocating a new bitset each time around. Do it outside the loop. 2> You want to use TermDocs. Something like (but I haven't tried it in this form) Term term = new Term(blah blah blah); TermDocs td = ir.termDocs(term); BitSet b
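The first point above (allocate the bitset once, outside the loop) can be sketched without Lucene at all. This is only a JDK-only model under stated assumptions: the `postings` map and the `fill` helper are hypothetical stand-ins for walking a term's documents, not Lucene's API. In Lucene 2.x the inner loop would presumably iterate a `TermDocs` with `next()` and `doc()` instead of an `int[]`.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class ReusedBitSetSketch {
    // Clears the shared BitSet, then sets one bit per document id in the posting list.
    // The int[] is a hypothetical stand-in for the documents a TermDocs would return.
    static void fill(BitSet bits, int[] docIds) {
        bits.clear();
        for (int doc : docIds) {
            bits.set(doc);
        }
    }

    public static void main(String[] args) {
        // Hypothetical postings: term text -> ids of the documents containing it.
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("books", new int[] {0, 2, 5});
        postings.put("games", new int[] {1, 2});

        int maxDoc = 6; // assumed index size for this sketch
        BitSet bits = new BitSet(maxDoc); // allocated once, outside the loop
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            fill(bits, e.getValue());
            System.out.println(e.getKey() + " -> " + bits);
        }
    }
}
```

Reusing one BitSet only works if each term's result is consumed before the next iteration; if the per-term bitsets must be kept, each one needs its own allocation (or a `clone()`).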

Re: TermEnum trick

2008-01-25 Thread Cam Bazz
Currently I am doing: do { term = te.term(); if ((term == null) || ! term.field().equals("cat")) { return; } final BitSet bits = new BitSet(reader.maxDoc()); searcher.search(new TermQuery(new Term("cat", term.text())), new HitCollector() { pub

Re: TermEnum trick

2008-01-25 Thread Erick Erickson
Can you show us what you've tried? Erick On Jan 25, 2008 10:49 AM, Cam Bazz <[EMAIL PROTECTED]> wrote: > Hello, > > How about getting which documents have that term as a bitset? > > In other words, now that I have field=a, field=b do I use regular query > logic to get the bitsets with hitcol

Re: TermEnum trick

2008-01-25 Thread Cam Bazz
Hello, How about getting which documents have that term as a bitset? In other words, now that I have field=a, field=b do I use regular query logic to get the bitsets with HitCollector, or can I do it with TermDocs (I could not figure it out with TermDocs)? Best, and thanks a lot for y

Re: Lucene to index OCR text

2008-01-25 Thread waldura
Thanks everyone for their ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million pages are a lot of thrust on any human-driven effort. I like Itamar's idea of doing "competing" OCR, and keeping the best result. Unfortunate

Re: TermEnum trick

2008-01-25 Thread Erick Erickson
Try this, where ir is an IndexReader. The trick is that starting with "" gives you the entire list. Note that you'll go off the end of the field sometime. TermEnum theTerms = ir.terms(new Term("field", "")); Term term = null; do { term = theTerms.term
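Why starting at "" works can be shown with a small JDK-only model. This is a sketch, not Lucene code: it assumes the term dictionary is sorted by field and then by text, and models each term as a hypothetical `field + '\u0000' + text` key in a `TreeSet`. Seeking to the empty text lands on the first term of the field, and the walk stops at the first key that belongs to another field (the "off the end of the field" case mentioned above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class TermWalkSketch {
    // Hypothetical key layout: field name, a NUL separator, then the term text.
    static String key(String field, String text) {
        return field + '\u0000' + text;
    }

    // Returns the distinct term texts of one field, in sorted order.
    static List<String> distinctValues(TreeSet<String> dict, String field) {
        String prefix = key(field, ""); // seeking to "" lands on the field's first term
        List<String> values = new ArrayList<>();
        for (String k : dict.tailSet(prefix, true)) {
            if (!k.startsWith(prefix)) {
                break; // walked off the end of this field
            }
            values.add(k.substring(prefix.length()));
        }
        return values;
    }

    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>();
        dict.add(key("category", "games"));
        dict.add(key("category", "books"));
        dict.add(key("title", "lucene"));
        dict.add(key("author", "smith"));
        System.out.println(distinctValues(dict, "category"));
    }
}
```

The field check inside the loop is the important part: without it the walk continues into whatever field sorts next.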

TermEnum trick

2008-01-25 Thread Cam Bazz
Hello, How do we get the TermEnum trick? I could not figure it out. Basically, I have a field called category, and I would like to learn what different values the category field takes (sort of like unique in SQL). Best Regards, -C.B.

Re: Lucene to index OCR text

2008-01-25 Thread Erick Erickson
That is brilliant! On Jan 25, 2008 6:12 AM, mark harwood <[EMAIL PROTECTED]> wrote: > Probably not a practical solution for you to set up but I love this idea: > http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html > > - Original Message > From: Renaud Waldura <[EMAIL PROTECTE

Re: Retain the index

2008-01-25 Thread Erick Erickson
This should not be happening. I've got to assume that you have more than one IndexWriter open at the same time. I've never seen any problem with updating an existing index; whenever I have hit similar errors, they turned out to be coding errors on my part. So first make absolutely sure that your

Mahout Machine Learning Project Launches

2008-01-25 Thread Grant Ingersoll
(Apologies for cross-posting) The Lucene PMC is pleased to announce the creation of the Mahout Machine Learning project, located at http://lucene.apache.org/mahout. Mahout's goal is to create a suite of practical, scalable machine learning libraries. Our initial plan is to utilize Hadoop

Re: Lucene to index OCR text

2008-01-25 Thread mark harwood
Probably not a practical solution for you to set up but I love this idea: http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html - Original Message From: Renaud Waldura <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 25 January, 2008 1:43:06 AM Subject: Lucene

Re: Threads blocking on isDeleted when swapping indices for a very long time...

2008-01-25 Thread Michael Stoppelman
BTW, I'm using Lucene 2.2.0. -M p.s. Congrats on the 2.3.0 release! On Jan 24, 2008 7:42 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote: > Hi all, > > I've been tracking down a problem happening in our production environment. > When we switch an index after doing deletes & adds, running some

RE: Lucene to index OCR text

2008-01-25 Thread Itamar Syn-Hershko
In our (very) small project (several thousand pages), we scan what we can scan (and type what is not scannable), and then have someone proofread the OCR'd material. Precision matters in our case, and this seemed to be the only way. One thought I had on your case - maybe there's an OCR librar

Re: Lucene to index OCR text

2008-01-25 Thread Paul Elschot
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote: > > I've been poking around the list archives and didn't really come up against > > anything interesting. Anyone using Lucene to index OCR text? Any > > strategies/algorithms/packages you recommend? > > > > I have a large collection (10^7 doc

Retain the index

2008-01-25 Thread anjana m
I want to retain the older index. I don't want to delete the older index. Please help me. Does the recent release have the option to update the indexes without deleting them? I am running the indexer on my Sun application server, and it's throwing exceptions like "cannot delete indexes". Now every time