Thomas Arni wrote:
> Hello Luceners
>
> I have started a new project and need to index PDF documents.
> There are several projects around that can extract the content,
> like pdfbox, xpdf and pjclassic.
>
> As far as I can tell from the FAQs and examples, all these
> tools allow simple text extraction.
>
> Which of these open source tools can you recommend the most?

pdftk or iText?

> My PDF documents are quite long (on average more than 60 pages).
> Therefore I would like to have additional structure information for
> indexing.
> That way the user not only gets the whole document as a result,
> but also additional information like the page or the chapter where
> the relevant information is.

Pages are simple to extract. Chapters are trickier and would rely on the document having internal links. PDF readers accept an argument, as in an HTTP URL, to open a document at a given page.

> Does anyone have similar requirements? Which of these tools
> best fits my requirements?
Have a look at "PDF Hacks" (ISBN: 0596006551). Once your document is split, it will be easy to index.

M.
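To make the page-level idea concrete, here is a minimal sketch in plain Java, not a definitive implementation. It assumes the text of each page has already been extracted by one of the tools mentioned above, and it uses a small in-memory map as a stand-in for a real Lucene index. The class name `PageIndexSketch`, its methods, and the example URL are all hypothetical. The `#page=N` fragment is the open parameter that Adobe's viewers understand; other readers may ignore it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for a real index: maps each term to the pages it occurs on.
class PageIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index the already-extracted text of one page; pageNumber is 1-based.
    void addPage(int pageNumber, String pageText) {
        // Lowercase, then split on anything that is not a letter or a digit.
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    // Return the pages containing the term, in ascending order (empty if none).
    List<Integer> search(String term) {
        TreeSet<Integer> pages = postings.get(term.toLowerCase(Locale.ROOT));
        return pages == null ? new ArrayList<>() : new ArrayList<>(pages);
    }

    // Build a link that asks the viewer to open the PDF at the given page.
    static String pageLink(String pdfUrl, int pageNumber) {
        return pdfUrl + "#page=" + pageNumber;
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addPage(1, "Introduction to Lucene");
        index.addPage(2, "Extracting text with PDFBox");
        index.addPage(3, "Indexing text with Lucene.");

        // Print one link per page that mentions the search term.
        for (int page : index.search("lucene")) {
            System.out.println(pageLink("http://example.org/manual.pdf", page));
        }
    }
}
```

With Lucene itself, the same effect would come from adding one document per page and storing the page number in a field, so each hit can be turned into a link like the one above.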