Re: Indexing and Hit Highlighting OCR Data

2005-06-06 Thread Steven Rowe
There is a proposal to extend indexing (item #11 in the API Changes section): http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard An excerpt: 11. (Hard) Make indexing more flexible, so that one could e.g., not store positions or even frequencies, or alternately, to store extra inf

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Erik Hatcher
On Jun 3, 2005, at 8:50 AM, Corey Keith wrote: With this approach all work is done at the word level. When we have a phrase query the results will contain pages with the entire phrase but when we go to highlight the document _all_ words in the phrase regardless of being in the phrase will

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Richard Krenek
Corey, I have one off-the-wall approach that may or may not work for you. You could convert your scanned images to PDF, then use something like Acrobat to convert those PDFs into PDFs with hidden text (the OCR data). You can then tell Acrobat Reader via XML what to highlight when your user opens the

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Corey Keith
With this approach all work is done at the word level. When we have a phrase query the results will contain pages with the entire phrase but when we go to highlight the document _all_ words in the phrase regardless of being in the phrase will be highlighted. Is that correct? It would also be
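One way to picture the concern raised here is a minimal sketch that highlights only words taking part in a complete, in-order phrase match, rather than every occurrence of each phrase word. This is an illustration under stated assumptions, not code from the thread or from Lucene: the `phrase_boxes` function, the `(text, box)` word layout, and the `(x, y, width, height)` box tuples are all hypothetical.

```python
def phrase_boxes(words, phrase):
    """Return boxes only for words that belong to a full, in-order phrase match.

    `words` is a list of (text, box) pairs in reading order for one page.
    The pair layout and the box tuples are assumptions for this sketch.
    """
    terms = phrase.lower().split()
    tokens = [text.lower() for text, _ in words]
    hit_indexes = set()
    if terms:
        # Slide a window of len(terms) over the page and keep exact matches.
        for start in range(len(tokens) - len(terms) + 1):
            if tokens[start:start + len(terms)] == terms:
                hit_indexes.update(range(start, start + len(terms)))
    # A set avoids duplicate boxes when matches overlap.
    return [words[i][1] for i in sorted(hit_indexes)]


# Hypothetical page: "new" occurs twice, but only once inside the phrase.
PAGE = [
    ("new", (10, 10, 30, 12)),
    ("york", (45, 10, 40, 12)),
    ("is", (90, 10, 15, 12)),
    ("not", (110, 10, 25, 12)),
    ("new", (140, 10, 30, 12)),
    ("jersey", (175, 10, 50, 12)),
]

boxes = phrase_boxes(PAGE, "new york")
```

Here the second "new" (before "jersey") is left out of `boxes`, because it is not followed by "york". A word-level highlighter that only knows the query terms would have marked it as well.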

Re: Indexing and Hit Highlighting OCR Data

2005-06-03 Thread Erik Hatcher
On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote: This is a pretty interesting problem. I envy you. I would avoid the existing highlighter for your purposes -- highlighting in token space is a very different problem from "highlighting" in 2D space. Based on the XML sample you provided, it

Re: Indexing and Hit Highlighting OCR Data

2005-06-02 Thread Chris Hostetter
This is a pretty interesting problem. I envy you. I would avoid the existing highlighter for your purposes -- highlighting in token space is a very different problem from "highlighting" in 2D space. Based on the XML sample you provided, it looks like your XML files are already a "tokenized" fo

Indexing and Hit Highlighting OCR Data

2005-06-02 Thread Corey Keith
Hi, I am involved in a project which is trying to provide searching and hit highlighting on the scanned images of historical newspapers. We have an XML-based OCR format. A sample is below. We need to index the CONTENT attribute of the String element, which is the easy part. We would like to
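The "easy part" described above, pulling the CONTENT attribute out of each String element, might look roughly like the sketch below. The original XML sample is not reproduced in this excerpt, so only the String element and its CONTENT attribute come from the post. The `Page` and `TextLine` wrappers, the `HPOS`/`VPOS`/`WIDTH`/`HEIGHT` coordinate attributes, and the `extract_words` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Only <String CONTENT="..."> is taken from the post;
# the wrapper elements and the coordinate attributes are assumed.
SAMPLE = """<Page>
  <TextLine>
    <String CONTENT="Historic" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="12"/>
    <String CONTENT="newspapers" HPOS="190" VPOS="50" WIDTH="110" HEIGHT="12"/>
  </TextLine>
</Page>"""


def extract_words(xml_text):
    """Collect each String element's text, word position, and assumed box."""
    root = ET.fromstring(xml_text)
    words = []
    for position, node in enumerate(root.iter("String")):
        words.append({
            "position": position,  # word offset within the page
            "text": node.get("CONTENT", ""),
            # Assumed coordinate attributes; missing ones default to 0.
            "box": tuple(int(node.get(key, 0))
                         for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")),
        })
    return words


words = extract_words(SAMPLE)
# Text to hand to the indexer; positions map hits back to boxes later.
page_text = " ".join(word["text"] for word in words)
```

Keeping the word position next to each box is one possible design choice: a search hit reported as a word offset can then be mapped straight back to a rectangle on the scanned image.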