I have added a stemming Analyzer to my indexing and searching. I've tried both
Porter and KStem and have gotten very good results with both, with KStem being
the best. The only problem is that, when analyzing on the search end, part of
my query is being removed by QueryParser:
+pub:game
For anyone interested, I did work around this. The problem was that I wasn't
warming the index enough. Once I ran full queries against the new index being
switched to, everything runs smoothly when the indices are swapped. There
must be some lower-level caches that need to be filled that I'm not
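The warm-before-swap pattern described above can be sketched in plain Java. This is a minimal illustration, not Lucene code: the `Searcher` interface and the warming queries below are stand-ins for an `IndexSearcher` over the new index copy and a set of representative production queries.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of "warm the new index before swapping it in".
// "Searcher" is a stand-in for an IndexSearcher over one index copy.
public class WarmSwap {
    interface Searcher { int search(String query); }

    // Live searcher that request threads read from.
    private final AtomicReference<Searcher> current = new AtomicReference<>();

    // Run representative queries against the NEW searcher first, so its
    // internal caches are filled, then atomically swap it in.
    void swapIn(Searcher fresh, String[] warmingQueries) {
        for (String q : warmingQueries) {
            fresh.search(q);   // exercise caches before any user query hits it
        }
        current.set(fresh);    // atomic swap; the old searcher can now be closed
    }

    int search(String query) { return current.get().search(query); }

    public static void main(String[] args) {
        WarmSwap ws = new WarmSwap();
        // Toy searcher: "relevance" is just the query length.
        ws.swapIn(q -> q.length(), new String[] {"pub:game"});
        System.out.println(ws.search("cat:a")); // prints 5
    }
}
```

The point is only the ordering: the expensive first-query work happens against the fresh searcher before the atomic reference is updated, so user threads never see a cold index.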
I am very interested indeed. Do I understand correctly that the tweak
you made reduces the memory used when searching if you have many docs in
the index? I am omitting norms too.
If that is the case, can someone point me to what is the required
change that should be done? I understand from Yonik's comm
Check if there are any lock files in your index directory after the process
is completed. There should be no lock files if the index was correctly
closed.
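A quick way to check for leftovers is to list lock files in the index directory. This is a generic sketch: the exact lock file name varies by Lucene version, so the `.lock` suffix used here is an assumption.

```java
import java.io.File;

// List any leftover *.lock files in an index directory after the writer
// should have closed. (Lock file naming differs across Lucene versions;
// the ".lock" suffix here is an assumption.)
public class LockCheck {
    static String[] lockFiles(File indexDir) {
        String[] found = indexDir.list((dir, name) -> name.endsWith(".lock"));
        return found == null ? new String[0] : found;
    }

    public static void main(String[] args) {
        String[] locks = lockFiles(new File("."));
        System.out.println(locks.length + " lock file(s) found");
    }
}
```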
On Jan 25, 2008 8:59 AM, Erick Erickson <[EMAIL PROTECTED]> wrote:
> This should not be happening. I've got to assume that you have
> more t
How about a QueryWrapperFilter.bits(IndexReader)?
-Grant
On Jan 25, 2008, at 10:49 AM, Cam Bazz wrote:
Hello,
How about getting which documents have that term as a bitset?
In other words, now that I have field=a, field=b, do I use regular
query
logic to get the bitsets with hitcollecto
The author of the presentation I linked to earlier pointed me to this:
http://wiki.apache.org/jakarta-lucene/SpellChecker
Which is implemented by:
http://www.marine-geo.org/services/oai/docs/javadoc/org/apache/lucene/spell/NGramSpeller.html
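The idea behind the NGramSpeller approach linked above is to index each dictionary word by its letter n-grams and match a misspelling against words sharing the most n-grams. A minimal stdlib sketch of that decomposition (the "shared count" similarity below is a crude stand-in for the real scoring):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the n-gram decomposition an n-gram spell checker is built on.
public class NGrams {
    // Break a word into its overlapping letter n-grams.
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Crude similarity: how many n-grams two words share.
    static int shared(String a, String b, int n) {
        List<String> gb = ngrams(b, n);
        int count = 0;
        for (String g : ngrams(a, n)) {
            if (gb.remove(g)) count++;   // consume each match once
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3)); // [luc, uce, cen, ene]
        System.out.println(shared("lucene", "lucen", 3)); // 3
    }
}
```

A real spell checker would index the n-grams as fields and rank candidate words by a weighted overlap score rather than a raw count.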
-Original Message-
From: [EMAIL PROTECTED
Several things.
1> you're allocating a new bitset each time around. Do it outside the loop.
2> You want to use TermDocs.
Something like (but I haven't tried it in this form):
Term term = new Term(blah blah blah);
TermDocs td = ir.termDocs(term);
BitSet bits = new BitSet(ir.maxDoc());
while (td.next()) bits.set(td.doc());
td.close();
Currently I am doing:
do {
    term = te.term();
    if ((term == null) || !term.field().equals("cat")) { return; }
    final BitSet bits = new BitSet(reader.maxDoc());
    searcher.search(new TermQuery(new Term("cat", term.text())),
        new HitCollector() {
            public void collect(int doc, float score) {
                bits.set(doc);
            }
        });
} while (te.next());
Can you show us what you've tried?
Erick
On Jan 25, 2008 10:49 AM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> Hello,
>
> How about getting which documents have that term as a bitset?
>
> In other words, now that I have field=a, field=b, do I use regular query
> logic to get the bitsets with hitcol
Hello,
How about getting which documents have that term as a bitset?
In other words, now that I have field=a, field=b, do I use regular query
logic to get the bitsets with a HitCollector, or can I do it with
TermDocs()? (I could not figure it out with termdocs.)
Best, and thanks a lot for y
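What Cam is asking for — a BitSet marking which documents contain a term — can be sketched without the Lucene API: a term's posting list is just a sorted list of doc IDs, and building the bitset is one pass over it. The `postings` array below is a stand-in for what a TermDocs iterator would walk.

```java
import java.util.BitSet;

// Sketch: turn a term's posting list (doc IDs) into a BitSet — the same
// shape of result the TermDocs loop produces, minus the index access.
public class TermBits {
    static BitSet toBits(int[] postings, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);   // one bit per document
        for (int doc : postings) {
            bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet b = toBits(new int[] {0, 3, 7}, 10);
        System.out.println(b); // {0, 3, 7}
    }
}
```

This is also why TermDocs beats a full query plus HitCollector for this job: there is no scoring, just the posting iteration and a bit set per doc.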
Thanks, everyone, for the ideas and suggestions! Some had occurred to us
but were discarded because we feel our solution needs to be automated --
45 million pages are a lot to thrust on any human-driven effort.
I like Itamar's idea of doing "competing" OCR, and keeping the best
result. Unfortunate
Try this, where ir is an IndexReader. The trick is that starting with ""
gives you the entire list.
Note that you'll go off the end of the field sometime.
TermEnum theTerms = ir.terms(new Term("field", ""));
Term term = null;
do {
    term = theTerms.term();
    if (term == null || !term.field().equals("field")) break;
    System.out.println(term.text());
} while (theTerms.next());
theTerms.close();
Hello,
How do we do the TermEnum trick? I could not figure it out. Basically, I
have a field called category, and I'd like to learn what different values the
category field takes (sort of like DISTINCT in SQL).
Best Regards,
-C.B.
That is brilliant!
On Jan 25, 2008 6:12 AM, mark harwood <[EMAIL PROTECTED]> wrote:
> Probably not a practical solution for you to set up but I love this idea:
> http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html
>
> - Original Message
> From: Renaud Waldura <[EMAIL PROTECTE
This should not be happening. I've got to assume that you have
more than one IndexWriter open at the same time.
There's no problem at all with updating an existing index that
I've ever seen; any similar errors have been coding
errors on my part.
So first make absolutely sure that your
(Apologies for cross-posting)
The Lucene PMC is pleased to announce the creation of the Mahout
Machine Learning project, located at http://lucene.apache.org/mahout.
Mahout's goal is to create a suite of practical, scalable machine
learning libraries. Our initial plan is to utilize Hadoop
Probably not a practical solution for you to set up but I love this idea:
http://blog.wired.com/monkeybites/2007/05/recaptcha_fight.html
- Original Message
From: Renaud Waldura <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 25 January, 2008 1:43:06 AM
Subject: Lucene
BTW, I'm using Lucene 2.2.0.
-M
p.s. Congrats on the 2.3.0 release!
On Jan 24, 2008 7:42 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've been tracking down a problem happening in our production environment.
> When we switch an index after doing deletes & adds, running some
In our (very) small project (several thousand pages), we scan what we
can scan (and type what is not scannable), and then have someone
proofread the OCR'd material. Precision matters in our case, and this seemed
to be the only way. One thought I had on your case - maybe there's an OCR
librar
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote:
> > I've been poking around the list archives and didn't really come up against
> > anything interesting. Anyone using Lucene to index OCR text? Any
> > strategies/algorithms/packages you recommend?
> >
> > I have a large collection (10^7 doc
I want to retain the older index;
I don't want to delete it.
Please help me.
Does the recent release have an option to update the indexes without
deleting them?
I am running the indexer on my Sun application server,
and it's throwing exceptions like "cannot delete indexes".
Now every time