There is a proposal to extend indexing (item #11 in the API Changes
section):
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard
An excerpt:
11. (Hard) Make indexing more flexible, so that one could
e.g., not store positions or even frequencies, or alternately,
to store extra inf
On Jun 3, 2005, at 8:50 AM, Corey Keith wrote:
With this approach all work is done at the word level. When we
have a phrase query the results will contain pages with the entire
phrase, but when we go to highlight the document, _all_ occurrences of
the phrase's words, whether or not they are part of the phrase, will
Corey,
I have one off-the-wall approach that may or may not work for you.
You could convert your scanned images to PDF, then use something like
Acrobat to convert those PDFs into PDFs with hidden text (the OCR
data). You can then tell Acrobat Reader via XML what to highlight when
your user opens the
With this approach all work is done at the word level. When we have a phrase
query the results will contain pages with the entire phrase, but when we go to
highlight the document, _all_ occurrences of the phrase's words, whether or not
they are part of the phrase, will be highlighted. Is that correct? It would also be
On Jun 2, 2005, at 9:02 PM, Chris Hostetter wrote:
This is a pretty interesting problem. I envy you.
I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.
Based on the XML sample you provided, it
This is a pretty interesting problem. I envy you.
I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.
Based on the XML sample you provided, it looks like your XML files
are already a "tokenized" fo
Hi,
I am involved in a project that is trying to provide searching and hit
highlighting on scanned images of historical newspapers. We have an
XML-based OCR format. A sample is below. We need to index the CONTENT attribute
of the String element, which is the easy part. We would like to
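
The word-level versus phrase-level concern raised in this thread can be illustrated with a minimal sketch. It is not the approach any poster proposed and it does not use Lucene. It only shows how complete phrase occurrences could be selected from an ordered list of OCR words, so that stray occurrences of the same words are left unhighlighted. Only the String element and its CONTENT attribute come from the thread. The Page wrapper, the ID attribute, and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the String element and its CONTENT attribute
# come from the thread; the Page wrapper and ID attribute are assumptions.
SAMPLE = """
<Page>
  <String ID="s0" CONTENT="the"/>
  <String ID="s1" CONTENT="daily"/>
  <String ID="s2" CONTENT="news"/>
  <String ID="s3" CONTENT="reported"/>
  <String ID="s4" CONTENT="news"/>
</Page>
"""


def phrase_hits(xml_text, phrase):
    """Return one list of String elements per complete phrase occurrence."""
    strings = list(ET.fromstring(xml_text).iter("String"))
    words = [el.get("CONTENT", "").lower() for el in strings]
    target = phrase.lower().split()
    hits = []
    if not target:
        return hits
    # Slide a window over the words in document order and keep exact matches.
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            hits.append(strings[start:start + len(target)])
    return hits


def highlighted_ids(xml_text, phrase):
    """Map each phrase occurrence to the (hypothetical) IDs of its words."""
    return [[el.get("ID") for el in run] for run in phrase_hits(xml_text, phrase)]
```

For the sample above, a query for "daily news" selects only the adjacent pair s1 and s2, while the later standalone "news" (s4) is not selected. Whatever position attributes the real OCR format stores on each String element could then be read from the returned elements to draw the highlight boxes.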