Mike O'Leary wrote:
Please forgive the laziness inherent in this question, as I haven't looked through the PDFBox code yet. I am wondering whether that code supports extracting text from PDF files while preserving things such as runs of whitespace between characters and other layout and formatting information. I am working on a project that extracts and operates on certain table-like blocks of text from PDF files, and a lot of freeware and shareware PDF-to-text converters seem either to ignore formatting or to try to preserve it without getting it quite right.
Even pdftohtml? The sample outputs I've seen from that application don't look too bad to me.
Daniel

-- 
Daniel Noll
Nuix Pty Ltd
Web: http://nuix.com/