Re: Searching pdf, getting page number

Erick Erickson Mon, 16 Oct 2006 05:16:24 -0700

Well, anything's possible <G>.

There's nothing magic about Lucene and its interaction with, say, a PDF
document. What you put into the index is all you can get out. So...


You could index the PDF document by pages. That is, each page is a Lucene
"document", and the pages are related by some ID of your own (NOT the
Lucene doc_id, since that can change).
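
A minimal, library-independent sketch of the page-per-document idea, written in plain Python rather than against the Lucene API. The field names (`pdf_id`, `page`, `text`) and both helper functions are hypothetical. In a real index each record would become one Lucene document, and the naive substring match below only stands in for a real phrase query.

```python
def pages_to_records(pdf_id, pages):
    """Turn a list of page texts into one record per page.

    pdf_id is an application-level identifier (not Lucene's internal doc id).
    Page numbers are 1-based.
    """
    return [
        {"pdf_id": pdf_id, "page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]


def find_phrase(records, phrase):
    """Return (pdf_id, page) for every record whose text contains the phrase."""
    needle = phrase.lower()
    return [
        (r["pdf_id"], r["page"])
        for r in records
        if needle in r["text"].lower()
    ]


records = pages_to_records(
    "manual.pdf", ["Getting started", "Indexing PDF files", "Searching"]
)
hits = find_phrase(records, "indexing pdf")
```

One trade-off of this layout is that a phrase spanning a page break will not match either page.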

You could index the document and give the first term of each page a large
position increment, then reconstruct the page from the position of each hit.
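
One way to read this, sketched in plain Python rather than Lucene code: if the increment is chosen so that each page starts at a multiple of a fixed gap, the page falls out of integer division on the hit's position. The constant `PAGE_GAP` and both function names are assumptions, and the scheme only works if no page has more terms than the gap.

```python
PAGE_GAP = 100_000  # assumed to exceed the number of terms on any single page


def assign_positions(pages):
    """Yield (term, position) pairs, with page k (0-based) starting at k * PAGE_GAP."""
    for page_index, text in enumerate(pages):
        for offset, term in enumerate(text.split()):
            if offset >= PAGE_GAP:
                raise ValueError("page has more terms than PAGE_GAP allows")
            yield term, page_index * PAGE_GAP + offset


def page_from_position(position):
    """Recover the 1-based page number from a term position."""
    return position // PAGE_GAP + 1


positions = dict(assign_positions(["alpha beta", "gamma delta epsilon"]))
```

Here `page_from_position(positions["epsilon"])` gives page 2. A gap this large would also be expected to keep phrase matches from straddling a page break.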

You could index meta-data in a field of the document giving the term offsets
of each page start and reconstruct which page it came from.
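
A rough sketch of the lookup side of this option, in plain Python and independent of Lucene. It assumes the page-start offsets have already been read back from the stored field as a sorted list of integers. The function names are hypothetical.

```python
from bisect import bisect_right


def page_start_offsets(pages):
    """Return the term offset at which each page starts."""
    starts, total = [], 0
    for text in pages:
        starts.append(total)
        total += len(text.split())
    return starts


def page_for_offset(starts, term_offset):
    """Return the 1-based page containing the given term offset."""
    # bisect_right counts how many pages start at or before the offset.
    return bisect_right(starts, term_offset)


starts = page_start_offsets(["alpha beta", "gamma delta epsilon", "zeta"])
```

With these three pages the stored offsets are `[0, 2, 5]`, so a hit at term offset 4 maps to page 2.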

You could insert a special token at the beginning of each page. You'd have
to count those tokens up to the hit to get the page.
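
The counting step might look something like this, again as a plain Python sketch rather than Lucene code. `PAGE_MARKER` is a made-up token assumed never to occur in the real text, and the function names are hypothetical.

```python
PAGE_MARKER = "__PAGE_BREAK__"  # hypothetical token, assumed absent from real text


def tokens_with_markers(pages):
    """Flatten pages into one token list, with a marker before each page."""
    tokens = []
    for text in pages:
        tokens.append(PAGE_MARKER)
        tokens.extend(text.split())
    return tokens


def page_of_token(tokens, index):
    """Count markers at or before the index to get the 1-based page number."""
    return tokens[: index + 1].count(PAGE_MARKER)


tokens = tokens_with_markers(["alpha beta", "gamma delta"])
hit = tokens.index("gamma")
page = page_of_token(tokens, hit)
```

Counting from the start of the document is linear in the hit position, which is the cost this option trades for a simpler index.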

And on and on. The take-away here is that Lucene is a search *engine*, not a
package. You have to carefully construct your application around Lucene to
get this kind of meta-data out of it...

That said, there might already be a contribution and/or package out there
that does much of this for you, but I'm unaware of any...

Hope this helps at least a little.
Erick

On 10/16/06, Christoph Pächter <[EMAIL PROTECTED]> wrote:


Hi,

I know that I can index pdf-files (using a third-party library).

Is it possible to search the index for a phrase, getting not only the
document, but also the page number in the (pdf-)document?
Or is it even possible to get a bookmark, leading to this page?

I am thankful for any information you can provide me, either how to do
this indexing and searching, or where I can find further information or
example code.

Kind regards
Christoph

