Re: Indexing PDF documents with structure information

Mathieu Lecarme Tue, 14 Aug 2007 06:05:41 -0700

Thomas Arni wrote:
> Hello Luceners
>
> I have started a new project and need to index pdf documents.
> There are several projects around that can extract the content,
> such as pdfbox, xpdf and pjclassic.
>
> As far as I can tell from the FAQs and examples, all these
> tools allow simple text extraction.
>
> Which of these open source tools can you recommend the most?
pdftk or iText?
>
> My pdf documents are quite long (on average more than 60 pages).
> Therefore I would like to have additional structure information for
> indexing.
> That way the user gets not only the whole document as a result, but
> also additional information such as the page or the chapter where
> the relevant information is.
The page is simple to extract; the chapter is trickier and depends on
whether the document has internal links.
PDF readers accept an argument, as in an HTTP URL, to open a given page.
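
As a rough sketch of the page-argument idea: the `#page=N` fragment below
follows Adobe's documented PDF open parameters, but whether a given reader
or browser plugin honors it is an assumption, and `page_link` is a
hypothetical helper name, not part of any library.

```python
from urllib.parse import urldefrag


def page_link(pdf_url: str, page: int) -> str:
    """Return pdf_url with a '#page=N' fragment (pages are 1-based)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Drop any fragment already present so the result has exactly one.
    base, _ = urldefrag(pdf_url)
    return f"{base}#page={page}"
```

A search result that stores the page number of the hit could then link
straight to that page, e.g. `page_link("http://example.org/doc.pdf", 12)`.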
>
> Does anyone have similar requirements? Which of these tools
> best fits my requirements?


Have a look at "PDF Hacks" (ISBN: 0596006551). Once your document is
split, it will be easy to index.
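
One possible way to do the split, sketched under assumptions: the text was
extracted with a tool such as xpdf's pdftotext, which separates pages with a
form feed character, and each page then becomes its own record to index.
The function name and the `page` / `content` keys are hypothetical; in Lucene
each record would presumably map to one Document with a stored page field.

```python
def split_pages(text: str) -> list[dict]:
    """Split extracted PDF text into one record per page.

    Assumes pages are delimited by form feed characters ('\f').
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty chunk after the last page; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    # Number pages from 1; blank pages in the middle keep their slot so that
    # later page numbers still match the original document.
    return [
        {"page": number, "content": body.strip()}
        for number, body in enumerate(pages, start=1)
    ]
```

Indexing the records separately means a hit can report the page it came
from rather than only the whole document.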

M.

