On Jul 8, 2005, at 2:57 AM, Daniel Moldovan wrote:
My application must index a lot of books that are stored in XML files.
Each XML file represents one page of a book, so each page becomes a
Lucene Document.
Each page is organized in different sections and finally each section
contains lines.
What I need to do is give the user the possibility to search for a
phrase that starts at the end of one page and continues on the next
page. The span should have some limits, let's say 6 words on each page.
Has anyone had experience with this kind of search? Please share your
knowledge if you have.
You're lucky you get to represent your data so hierarchically! Try
getting scholars to represent a book in such a fashion!!! (I'm
dealing with scholarly works in XML format and sections do not fall
_within_ pages, they can span across pages).
In this case, one field of your document should probably index a page
+ 6 words on either side of it from the previous and next pages.
Maybe you also have a field that represents only the page as well.
Perhaps something at query time decides which field to search? Maybe
all phrase queries use the overlapped field and other query types use
the single page field?
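As a rough illustration of the two-field idea above, here is a minimal
sketch in plain Java. It is not Lucene code: the class OverlapFields, its
methods, and the field names "overlap" and "page" are all hypothetical,
and the whitespace split only stands in for whatever Analyzer you really
index with. Treating any query that contains a double quote as a phrase
query is likewise just an assumption for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class OverlapFields {

    // Naive whitespace tokenizer (assumption); a real indexer would
    // tokenize with the same Analyzer it uses for the field.
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Text for the overlapped field: the last n words of the previous
    // page, the whole current page, then the first n words of the next
    // page. Pass "" for a missing neighbour (first or last page).
    static String overlapped(String prev, String page, String next, int n) {
        List<String> out = new ArrayList<>();
        List<String> before = words(prev);
        out.addAll(before.subList(Math.max(0, before.size() - n), before.size()));
        out.addAll(words(page));
        List<String> after = words(next);
        out.addAll(after.subList(0, Math.min(n, after.size())));
        return String.join(" ", out);
    }

    // Query-time choice of field: quoted (phrase) queries go to the
    // overlapped field, everything else to the single-page field.
    static String fieldFor(String userQuery) {
        return userQuery.contains("\"") ? "overlap" : "page";
    }

    public static void main(String[] args) {
        String overlap = overlapped(
                "the quick brown fox jumps over the lazy dog",
                "and then it ran",
                "far away into the dark green forest beyond",
                6);
        System.out.println(overlap);
        System.out.println(fieldFor("\"lazy dog and then\""));
        System.out.println(fieldFor("forest"));
    }
}
```

The string returned by overlapped() would be indexed as one field of the
page's Document, and the unmodified page text as the other. One thing to
watch: a phrase that crosses the page boundary, or sits entirely within
the 6-word margin, ends up in the overlapped field of two adjacent
Documents, so it can come back as two hits that you may want to collapse.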
Erik