There is quite a bit of literature available on this topic. This paper
presents a summary. Nothing immediately applicable, I'm afraid.
Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A Grossman
Illinois Institute of Technology
It lists a number of othe
dexes, becoming more
generalizable to a number of use cases including allowing it to support the
use case of one (or more) indexes and a high workload of queries that need
to be managed. It could use the same defaults if the TPE is not set
externally (is null).
-Glen
2008/4/22 Renaud Waldura &l
> one solution is to set up a ThreadPoolExecutor[2] with a fixed
> number of threads and a limited queue size (use a bounded BlockingQueue[3])
Yes, this is precisely how the ConcurrentMultiSearcher works.
https://issues.apache.org/jira/browse/LUCENE-423
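For reference, a minimal sketch of the setup quoted above, using only java.util.concurrent. It is not the LUCENE-423 code: the class name BoundedSearchPool, the helper countRejected and the pool sizes are made up for illustration. It shows that with a fixed number of threads and a bounded queue, work beyond threads + queue capacity gets rejected instead of piling up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedSearchPool {

    /** Fixed number of threads plus a bounded queue; excess work is rejected. */
    public static ThreadPoolExecutor newPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads,                        // core == max: fixed-size pool
                0L, TimeUnit.MILLISECONDS,               // keep-alive is irrelevant when core == max
                new ArrayBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());   // throw when threads and queue are both full
    }

    /** Submits `tasks` jobs that block until released; returns how many were rejected. */
    public static int countRejected(int threads, int queueSize, int tasks) {
        ThreadPoolExecutor pool = newPool(threads, queueSize);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Stand-in for a search: the job just waits until we release it.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + countRejected(2, 3, 10));
    }
}
```

Whether to reject, block the caller, or run the job in the calling thread (ThreadPoolExecutor.CallerRunsPolicy) when the queue is full is a design choice; AbortPolicy is used here only because it makes the limit visible.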
-Original Message-
From: Glen New
Anshum:
Have you looked into the ConcurrentMultiSearcher? It would have you split
your index into N sub-indices, and search each with a dedicated thread.
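To make the idea concrete, here is a rough sketch in plain Java, not the actual ConcurrentMultiSearcher code. Each "sub-index" is just a hypothetical in-memory list of strings and the "search" is a substring match; the class and method names are invented. It only illustrates the shape: one thread per sub-index, results merged at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSubIndexSearch {

    /** Searches every sub-index on its own thread and concatenates hits in sub-index order. */
    public static List<String> search(List<List<String>> subIndices, String term) {
        List<String> merged = new ArrayList<>();
        if (subIndices.isEmpty()) {
            return merged; // a fixed thread pool needs at least one thread
        }
        ExecutorService pool = Executors.newFixedThreadPool(subIndices.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> sub : subIndices) {
                // Stand-in for searching one sub-index: keep documents containing the term.
                Callable<List<String>> job = () -> {
                    List<String> hits = new ArrayList<>();
                    for (String doc : sub) {
                        if (doc.contains(term)) {
                            hits.add(doc);
                        }
                    }
                    return hits;
                };
                futures.add(pool.submit(job));
            }
            // Wait for each sub-search and merge its hits.
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A real implementation would merge by score rather than concatenate, and would reuse the thread pool across queries instead of creating one per search.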
--Renaud
-Original Message-
From: Anshum [mailto:[EMAIL PROTECTED]
Sent: Monday, April 21, 2008 9:10 PM
To: java-user@lucene.apache
The author of the presentation I linked to earlier pointed me to this:
http://wiki.apache.org/jakarta-lucene/SpellChecker
Which is implemented by:
http://www.marine-geo.org/services/oai/docs/javadoc/org/apache/lucene/spell/
NGramSpeller.html
-Original Message-
From: [EMAIL PROTECTED
Thanks everyone for their ideas and suggestions! Some had occurred to us
but were discarded because we feel our solution needs to be automated --
45 million pages are a lot of thrust on any human-driven effort.
I like Itamar's idea of doing "competing" OCR, and keeping the best
result. Unfortunate
tc. with Lucene without any trouble, but OCR errors are a
problem, when doing exact phrase matches in particular. I'm looking for
ideas on how to deal with this thorny problem.
--
Renaud Waldura
Applications Group Manager
Library and Center for Knowledge Management
University of California, San
Could someone who understands Lucene internals help me port
https://issues.apache.org/jira/browse/LUCENE-423 to Lucene 2.0? I have beefy
hardware (32 cores) and want to try this out, but it won't compile.
There are 2 issues:
1- maxScore
On line 412 TopFieldDocs constructor now needs a maxScore.
Being a French speaker, I will mention the following special cases:
- "plus ça change" -> "plus ca change"
- "œuf" -> "oeuf"
- "lætitia" -> "laetitia"
But I just looked, and it looks like ISOLatin1AccentFilter handles these.
Better test to be sure...
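One way to double-check the expected outputs independently of Lucene is a small JDK-only folding routine. This is a sketch, not how ISOLatin1AccentFilter is implemented; the class AccentFolding and its fold method are made-up names. The ligatures have to be mapped by hand because Unicode decomposition does not split them.

```java
import java.text.Normalizer;

public class AccentFolding {

    /** Folds accented Latin characters and the oe/ae ligatures to plain ASCII letters. */
    public static String fold(String input) {
        // Ligatures have no canonical decomposition, so expand them explicitly.
        // \u0153 = oe ligature, \u0152 = OE, \u00e6 = ae, \u00c6 = AE.
        String expanded = input
                .replace("\u0153", "oe").replace("\u0152", "OE")
                .replace("\u00e6", "ae").replace("\u00c6", "AE");
        // NFD splits e.g. c-cedilla into "c" plus a combining cedilla.
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("plus \u00e7a change"));
        System.out.println(fold("\u0153uf"));
        System.out.println(fold("l\u00e6titia"));
    }
}
```

Running the same inputs through the Lucene filter and comparing against this output would confirm whether the three cases are really covered.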
--Renaud
-Original Message-
From:
Yeah, it's a surprise, isn't it? I'm afraid there isn't a good answer.
http://wiki.apache.org/lucene-java/BooleanQuerySyntax
The "best practice" appears to be to require parens everywhere to force the
evaluation order. Not very satisfying, but it does work 100%.
-Original Message-
From
Often documents can be divided into "metadata" and "contents" sections. Say
you're indexing Web pages: you could index them with all the HEAD data in one
field and the BODY content in another, while also creating separate fields
for every HEAD field, e.g. TITLE.
At search time, you rewrite every qu
Regarding the Lucene Wiki, is there an editing policy or should I feel free
to change stuff as I see fit? E.g. I've added a page LuceneCaveats, and now
I want to edit http://wiki.apache.org/lucene-java/ConceptsAndDefinitions and
add a "Core Classes" section, and refactor that page.
--Renaud
Mark:
Thanks a million for this comprehensive analysis. This is going straight to
my manager. :)
--Renaud
-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Monday, July 02, 2007 2:11 PM
To: java-user@lucene.apache.org
Subject: Re: highlighting phrase query
There ha
the QueryParser to act as you want (generate
a PhraseQuery or MultiPhraseQuery when it sees <>).
Are you sure you need a PhraseQuery and not a Boolean query of Should
clauses?
- Mark
On 6/14/07, Renaud Waldura <[EMAIL PROTECTED]> wrote:
>
> Thanks guys, I like it! I'm already
s always to get out the sledgehammer...
- Mark
Erick Erickson wrote:
> Well, perhaps the simplest thing would be to pre-process the query and
> make the comma into a whitespace before sending anything to the query
> parser. I don't know how generalizable that sort of solution is in
>
My very simple analyzer produces tokens made of digits and/or letters only.
Anything else is discarded. E.g. the input "smith,anna" gets tokenized as 2
tokens, first "smith" then "anna".
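The tokenization rule described above can be sketched in plain Java as follows. This is not the actual analyzer (it does not use the Lucene Analyzer/TokenStream API); AlnumTokenizer and tokenize are hypothetical names, and it only mimics the "letters and digits only, discard everything else" behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class AlnumTokenizer {

    /** Splits the input into runs of letters and/or digits; every other character is a separator. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("smith,anna"));
        System.out.println(tokenize("smith,annanicole"));
    }
}
```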
Say I have indexed documents that contained both "smith,anna" and
"smith,annanicole". To find them, I enter th
;author:apple" clauses
QueryTermExtractor produced no strings for highlighting when you asked for
the field.
Cheers
Mark
On 3/2/07, Renaud Waldura <[EMAIL PROTECTED]> wrote:
>
> Hello Mark:
>
> I apologize for not responding earlier, more urgent stuff took over. I
just the "allContent"
field values and pass a TokenStream for the "title" field to the highlighter
and it would highlight the appropriate values in the title.
Do any of these options work?
Renaud Waldura wrote:
> The old highlighter code used to highlight found terms in an
The old highlighter code used to highlight found terms in any field (too
broad). The new highlighter lets one specify a field when highlighting, but
it highlights that field only (too narrow).
In my case we have an "all" field that is the concatenation of all data
about the document. When I high
you're feeling lazy.
This assumes that most good matches are at the start of the document, and
that the files on disk are not compressed.
moraleslos wrote:
> Maybe keeping the data in the DB would make it quicker? Seems like
> the I/O performance would cause most of the pe
: java-user@lucene.apache.org
Subject: RE: Text storing design and performance question
Maybe keeping the data in the DB would make it quicker? Seems like the I/O
performance would cause most of the performance issues you're seeing.
-los
Renaud Waldura-5 wrote:
>
> We used to store
We used to store a big text field for highlighting purposes too, and it
proved a big pain. The index was gigantic, it took forever to build, and the
search performance would sometimes suffer from it (just a hunch).
Now we keep this big text field on disk (in a file), and feed it to the
highlighter
Read my own complaints about QueryParser here:
http://marc.theaimsgroup.com/?l=lucene-user&m=116069469827270&w=2
You're in for a surprise. As Erick alluded to, the stock QP doesn't quite do
what one (legitimately IMO) expects.
--Renaud
-Original Message-
From: Erick Erickson [mailto:[
Greetings:
I read the mailing-list archives about this topic and found the PDFBox
solutions at: http://www.pdfbox.org/userguide/highlighting.html
Basically there are 3 options:
1- append query parameters to the PDF URL
2- generate a highlight XML document that Acrobat Reader will download
separa
igator.jspa?reset=true&mode=hide&pid=12310110&sorter/order=DESC&sorter/field=priority&resolution=-1&component=12310234
--Renaud
- Original Message -
From: "Renaud Waldura" <[EMAIL PROTECTED]>
To:
Sent: Thursday, October 12, 2006 4:11 PM
Subject: Quer
I'm developing an application used by scientists -- people who have a pretty
good idea of what logic is -- and they were shocked to find out that no two
of these queries return the same results:
1- banana AND apple OR orange
2- banana AND (apple OR orange)
3- (banana AND apple) OR orange
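To illustrate how the three can differ, here is a small simulation in plain Java. It rests on an assumption that is not stated in this message: that the stock QueryParser flattens query 1 into "+banana +apple orange" (banana and apple required, orange optional). The class and method names are made up, and documents are modeled as plain sets of terms rather than Lucene documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BooleanPrecedenceDemo {

    // Query 1, ASSUMED to be parsed as: +banana +apple orange
    // (orange is optional, so it cannot make a document match on its own).
    public static boolean q1(Set<String> d) {
        return d.contains("banana") && d.contains("apple");
    }

    // Query 2: banana AND (apple OR orange)
    public static boolean q2(Set<String> d) {
        return d.contains("banana") && (d.contains("apple") || d.contains("orange"));
    }

    // Query 3: (banana AND apple) OR orange
    public static boolean q3(Set<String> d) {
        return (d.contains("banana") && d.contains("apple")) || d.contains("orange");
    }

    /** Builds a toy document from its terms. */
    public static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    /** Returns the match results of the three queries, space-separated. */
    public static String matches(Set<String> d) {
        return q1(d) + " " + q2(d) + " " + q3(d);
    }

    public static void main(String[] args) {
        System.out.println("banana+orange: " + matches(doc("banana", "orange")));
        System.out.println("orange only:   " + matches(doc("orange")));
        System.out.println("banana+apple:  " + matches(doc("banana", "apple")));
    }
}
```

Under that assumption, a document containing only "banana" and "orange" separates query 1 from queries 2 and 3, and a document containing only "orange" separates query 2 from query 3.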
I'd
Erik:
I commend you for giving all the information that's relevant. For the sake
of simplicity, and because it covers the vast majority of use cases, could you
endorse the following as the simplest, most correct way (i.e. a best
practice) to implement Lucene for Web applications?
1- create an In