RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Renaud Waldura
There is quite a bit of literature available on this topic. This paper presents a summary. Nothing immediately applicable, I'm afraid. Retrieving OCR Text: A survey of current approaches Steven M. Beitzel, Eric C. Jensen, David A Grossman Illinois Institute of Technology It lists a number of othe

RE: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Renaud Waldura
dexes, becoming more generalizable to a number of use cases including allowing it to support the use case of one (or more indexes) and a high work load of queries that need to be managed. It could use the same defaults if the TPE is not set externally (is null). -Glen 2008/4/22 Renaud Waldura

RE: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Renaud Waldura
> one solution is to set-up a ThreadPoolExecutor[2] with a fixed > number of threads and a limited queue size (use a bound BlockingQueue[3]) Yes, this is precisely how the ConcurrentMultiSearcher works. https://issues.apache.org/jira/browse/LUCENE-423 -Original Message- From: Glen New
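A minimal sketch of the setup quoted above, using only java.util.concurrent: a ThreadPoolExecutor with a fixed number of threads and a bounded BlockingQueue. The class name BoundedSearchPool and the choice of CallerRunsPolicy for a full queue are assumptions made for illustration; they are not taken from LUCENE-423.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Lucene or of the LUCENE-423 patch.
public class BoundedSearchPool {

    /** Builds a pool with a fixed thread count and a bounded work queue. */
    public static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads,                    // core pool size
                threads,                    // maximum pool size (same value: fixed-size pool)
                0L, TimeUnit.MILLISECONDS,  // no keep-alive needed for a fixed pool
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                // Assumption: when the queue is full, the submitting thread runs
                // the task itself, which slows producers down instead of failing.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

Search tasks would then be submitted with pool.submit(...), one per sub-index; the bounded queue keeps a burst of queries from piling up without limit.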

RE: Binding lucene instance/threads to a particular processor(or core)

2008-04-22 Thread Renaud Waldura
Anshum: Have you looked into the ConcurrentMultiSearcher? It would have you split your index into N sub-indices, and search each with a dedicated thread. --Renaud -Original Message- From: Anshum [mailto:[EMAIL PROTECTED] Sent: Monday, April 21, 2008 9:10 PM To: java-user@lucene.apache

RE: Lucene to index OCR text

2008-01-25 Thread Renaud Waldura
The author of the presentation I linked to earlier pointed me to this: http://wiki.apache.org/jakarta-lucene/SpellChecker Which is implemented by: http://www.marine-geo.org/services/oai/docs/javadoc/org/apache/lucene/spell/NGramSpeller.html -Original Message- From: [EMAIL PROTECTED

Re: Lucene to index OCR text

2008-01-25 Thread waldura
Thanks everyone for their ideas and suggestions! Some had occurred to us but were discarded because we feel our solution needs to be automated -- 45 million pages are a lot of thrust on any human-driven effort. I like Itamar's idea of doing "competing" OCR, and keeping the best result. Unfortunate

Lucene to index OCR text

2008-01-24 Thread Renaud Waldura
tc. with Lucene without any trouble, but OCR errors are a problem, when doing exact phrase matches in particular. I'm looking for ideas on how to deal with this thorny problem. -- Renaud Waldura Applications Group Manager Library and Center for Knowledge Management University of California, San

LUCENE-423: thread pool implementation of parallel queries

2007-08-15 Thread Renaud Waldura
Could someone who understands Lucene internals help me port https://issues.apache.org/jira/browse/LUCENE-423 to Lucene 2.0? I have beefy hardware (32 cores) and want to try this out, but it won't compile. There are 2 issues: 1- maxScore On line 412 TopFieldDocs constructor now needs a maxScore.

RE: a question for french analyzer

2007-07-30 Thread Renaud Waldura
Being a French speaker, I will mention the following special cases: - "plus ça change" -> "plus ca change" - "œuf" -> "oeuf" - "lætitia" -> "laetitia" But I just looked, and it looks like ISOLatin1AccentFilter handles these. Better test to be sure... --Renaud -Original Message- From:
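To make the special cases above concrete, here is a rough sketch in plain Java. It does not use or reproduce ISOLatin1AccentFilter; the class FrenchFolding is hypothetical. It shows why the ligatures need explicit handling: Unicode decomposition alone strips the cedilla from "ç" but leaves "œ" and "æ" untouched.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not Lucene's ISOLatin1AccentFilter.
public class FrenchFolding {

    /** Folds accented letters and the oe/ae ligatures to plain ASCII letters. */
    public static String fold(String s) {
        // Ligatures have no canonical decomposition, so expand them by hand.
        String expanded = s
                .replace("\u0153", "oe")   // oe ligature, lowercase
                .replace("\u0152", "OE")   // oe ligature, uppercase
                .replace("\u00e6", "ae")   // ae ligature, lowercase
                .replace("\u00c6", "AE");  // ae ligature, uppercase
        // NFD splits a letter such as c-cedilla into 'c' plus a combining mark.
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Running the three examples from the message through fold() is a quick way to do the "better test to be sure" step against whatever filter is actually in use.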

RE: Question regarding boolean query

2007-07-30 Thread Renaud Waldura
Yeah, it's a surprise, isn't it? I'm afraid there isn't a good answer. http://wiki.apache.org/lucene-java/BooleanQuerySyntax The "best practice" appears to be to require parens everywhere to force the evaluation order. Not very satisfying, but it does work 100%. -Original Message- From

RE: search through all fields

2007-07-16 Thread Renaud Waldura
Often documents can be divided into "metadata" and "contents" sections. Say you're indexing Web pages, you could index them with HEAD data all in one field, and the BODY content in another. While also creating separate fields for every HEAD field, e.g. TITLE etc. At search time, you rewrite every qu

Lucene Wiki Editing Guidelines

2007-07-03 Thread Renaud Waldura
Regarding the Lucene Wiki, is there an editing policy or should I feel free to change stuff as I see fit? E.g. I've added a page LuceneCaveats, and now I want to edit http://wiki.apache.org/lucene-java/ConceptsAndDefinitions and add a "Core Classes" section, and refactor that page. --Renaud

RE: highlighting phrase query

2007-07-02 Thread Renaud Waldura
Mark: Thanks a million for this comprehensive analysis. This is going straight to my manager. :) --Renaud -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Monday, July 02, 2007 2:11 PM To: java-user@lucene.apache.org Subject: Re: highlighting phrase query There ha

RE: Wildcard query with untokenized punctuation (again)

2007-06-14 Thread Renaud Waldura
the QueryParser to act as you want (generate a PhraseQuery or MultiPhraseQuery when it sees <>). Are you sure you need a PhraseQuery and not a Boolean query of Should clauses? - Mark On 6/14/07, Renaud Waldura <[EMAIL PROTECTED]> wrote: > > Thanks guys, I like it! I'm already

RE: Wildcard query with untokenized punctuation (again)

2007-06-14 Thread Renaud Waldura
s always to get out the sledgehammer... - Mark Erick Erickson wrote: > Well, perhaps the simplest thing would be to pre-process the query and > make the comma into a whitespace before sending anything to the query > parser. I don't know how generalizable that sort of solution is in >

Wildcard query with untokenized punctuation (again)

2007-06-13 Thread Renaud Waldura
My very simple analyzer produces tokens made of digits and/or letters only. Anything else is discarded. E.g. the input "smith,anna" gets tokenized as 2 tokens, first "smith" then "anna". Say I have indexed documents that contained both "smith,anna" and "smith,annanicole". To find them, I enter th
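The tokenization rule described above can be sketched in plain Java as follows. This is not the poster's analyzer and does not use the Lucene Analyzer API; AlnumTokenizer is a hypothetical stand-in that keeps runs of letters and digits and discards everything else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the analyzer described in the message.
public class AlnumTokenizer {

    /** Splits input into tokens made of letters and/or digits only. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);               // extend the current token
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // any other character ends a token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());      // flush the last token
        }
        return tokens;
    }
}
```

With this rule the comma never reaches the index: "smith,anna" and "smith,annanicole" both start with the token "smith", followed by "anna" and "annanicole" respectively.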

RE: More Precise Highlighting (MY SOLUTION)

2007-03-29 Thread Renaud Waldura
"author:apple" clauses QueryTermExtractor produced no strings for highlighting when you asked for the field. Cheers Mark On 3/2/07, Renaud Waldura <[EMAIL PROTECTED]> wrote: > > Hello Mark: > > I apologize for not responding earlier, more urgent stuff took over. I

RE: More Precise Highlighting

2007-03-02 Thread Renaud Waldura
just the "allContent" field values and pass a TokenStream for the "title" field to the highlighter and it would highlight the appropriate values in the title. Do any of these options work? Renaud Waldura wrote: > The old highlighter code used to highlight found terms in an

More Precise Highlighting

2007-02-13 Thread Renaud Waldura
The old highlighter code used to highlight found terms in any field (too broad). The new highlighter lets one specify a field when highlighting, but it highlights that field only (too narrow). In my case we have an "all" field that is the concatenation of all data about the document. When I high

RE: Text storing design and performance question

2007-01-11 Thread Renaud Waldura
you're feeling lazy. This assumes that most good matches are at the start of the document, and that the files on disk are not compressed. moraleslos wrote: > Maybe keeping the data in the DB would make it quicker? Seems like > the I/O performance would cause most of the pe

RE: Text storing design and performance question

2007-01-10 Thread Renaud Waldura
: java-user@lucene.apache.org Subject: RE: Text storing design and performance question Maybe keeping the data in the DB would make it quicker? Seems like the I/O performance would cause most of the performance issues you're seeing. -los Renaud Waldura-5 wrote: > > We used to store

RE: Text storing design and performance question

2007-01-10 Thread Renaud Waldura
We used to store a big text field for highlighting purposes too, and it proved a big pain. The index was gigantic, it took forever to build, and the search performance would sometimes suffer from it (just a hunch). Now we keep this big text field on disk (in a file), and feed it to the highlighter

RE: BooleanQuery

2006-12-06 Thread Renaud Waldura
Read my own complaints about QueryParser here: http://marc.theaimsgroup.com/?l=lucene-user&m=116069469827270&w=2 You're in for a surprise. As Erick alluded, the stock QP doesn't quite do what one (legitimately IMO) expects. --Renaud -Original Message- From: Erick Erickson [mailto:[

PDF Highlighting Again

2006-11-09 Thread Renaud Waldura
Greetings: I read the mailing-list archives about this topic and found the PDFBox solutions at: http://www.pdfbox.org/userguide/highlighting.html Basically there are 3 options: 1- append query parameters to the PDF URL 2- generate a highlight XML document that Acrobat Reader will download separa

Re: QueryParser Is Badly Broken

2006-10-13 Thread Renaud Waldura
igator.jspa?reset=true&mode=hide&pid=12310110&sorter/order=DESC&sorter/field=priority&resolution=-1&component=12310234 --Renaud - Original Message - From: "Renaud Waldura" <[EMAIL PROTECTED]> To: Sent: Thursday, October 12, 2006 4:11 PM Subject: Quer

QueryParser Is Badly Broken

2006-10-12 Thread Renaud Waldura
I'm developing an application used by scientists -- people who have a pretty good idea of what logic is -- and they were shocked to find out that no two of these queries return the same results: 1- banana AND apple OR orange 2- banana AND (apple OR orange) 3- (banana AND apple) OR orange I'd
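As a small illustration in plain Java of why the two parenthesized forms are expected to differ under ordinary Boolean logic: a document containing only "orange" matches query 3 but not query 2. The class BooleanGrouping is hypothetical, and the sketch makes no claim about how QueryParser evaluates the unparenthesized query 1.

```java
import java.util.Set;

// Hypothetical illustration of textbook Boolean logic; this is not QueryParser.
public class BooleanGrouping {

    /** Query 2: banana AND (apple OR orange). */
    public static boolean query2(Set<String> doc) {
        return doc.contains("banana")
                && (doc.contains("apple") || doc.contains("orange"));
    }

    /** Query 3: (banana AND apple) OR orange. */
    public static boolean query3(Set<String> doc) {
        return (doc.contains("banana") && doc.contains("apple"))
                || doc.contains("orange");
    }
}
```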

Re: IndexSearcher in Servlet

2006-06-27 Thread Renaud Waldura
Erik: I commend you for giving all the information that's relevant. For the sake of simplicity, and because it is the vast majority of use cases, could you endorse the following as the simplest, most correct way (i.e. a best practice) to implement Lucene for Web applications. 1- create an In