Hello Bill,

The problem I am having is that some of them have multiple columns and multiple word boxes. Does the xpdf patch extract the different columns and word boxes?
Best,
-C.B.

On Wed, May 14, 2008 at 6:35 PM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> > > the unix program pdf2text can convert keeping the text places, but I wanted
> > > to ask you guys if you know something better,
> >
> > AFAIK, PDFBox has a lower-level API that allows you to get hold of text
> > positions.
>
> In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
> font information for each word. You can get the xpdf sources from
> http://www.foolabs.com/xpdf/, and the patch file is at
> http://uplib.parc.com/misc/xpdf-3.02-PATCH. To extract the byte
> positions, use pdftotext with the "-wordboxes" switch, and see the
> pdftotext man page for more info. This is run automatically in UpLib
> before the indexing is done.
>
> Bill
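
For reference, here is a minimal sketch of how the patched pdftotext run described in the quoted message might be driven from Python. The function names `wordboxes_command` and `extract_wordboxes` are hypothetical, and the sketch assumes a pdftotext binary built from xpdf with the UpLib patch applied is on the PATH. An unpatched pdftotext will not recognise "-wordboxes". The output is returned unparsed, since its exact layout is documented in the patched man page rather than here.

```python
import shutil
import subprocess
from pathlib import Path


def wordboxes_command(pdf_path, out_path, binary="pdftotext"):
    """Build the argument list for a patched pdftotext run.

    "-wordboxes" is the switch named in the quoted message; it is assumed
    to exist only in an xpdf build with the UpLib patch applied.
    """
    # pdftotext usage is: pdftotext [options] PDF-file [text-file]
    return [binary, "-wordboxes", str(pdf_path), str(out_path)]


def extract_wordboxes(pdf_path, out_path, binary="pdftotext"):
    """Run the patched pdftotext and return its raw output, unparsed."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    # check=True raises CalledProcessError if pdftotext exits non-zero,
    # for example when the binary does not know the -wordboxes switch.
    subprocess.run(wordboxes_command(pdf_path, out_path, binary), check=True)
    return Path(out_path).read_text(errors="replace")
```

Calling `extract_wordboxes("paper.pdf", "paper.boxes")` would write the word-box output to `paper.boxes` and return it as a string; whether multi-column pages come out in reading order still has to be checked against that output.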