Re: Extracting formatted text from PDF files

Soeren Pekrul Thu, 22 Mar 2007 11:34:26 -0800

Mike O'Leary wrote:

Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering if that code supports extracting
text from PDF files while preserving such things as sequences of whitespace
between characters and other layout and formatting information. I am working
with a project that extracts and operates on certain table-like blocks of
text from PDF files, and a lot of freeware and shareware PDF to text
converters seem to either ignore formatting or try to preserve formatting
and not get it quite right. I am wondering if PDFBox provides better support
for this kind of thing. Thanks.

That is not so simple. Usually this information is not inside a PDF file. PDF is an output file format. It contains just the instruction: print a character "a" at the position x and y. In many cases a PDF file doesn’t even know about words or white space. We read words due to the position of characters, we see paragraphs due to the position of characters, and we see tables due to the position of characters. The file doesn’t contain this information.

I found this code in a PDF file for the German word "Wuchsform" (growth form) and the colon ":":


/F1 1 Tf
-3.8801 -1.274 TD
[ (W) 29.60001 (uchsform:) ] TJ

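First line: Select the font F1 at size 1.
Second line: Move the cursor by the offset -3.8801, -1.274 to the start of the next line.
Third line: Print the character "W", shift the cursor by 29.60001 (in thousandths of a text space unit; a positive value pulls the next glyph to the left, so this is kerning) and print the characters "uchsform:".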
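To make the third line more concrete, here is a minimal Python sketch that splits the operand of a TJ operator into its strings and numbers and joins the strings back into text. It is not how PDFBox does it. The function names and the `gap_threshold` value are my own assumptions, and the parser ignores escaped or nested parentheses. It relies on the PDF convention that the numbers are subtracted from the current position, so only a large negative number opens a visible gap.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Assumption: the strings contain no escaped or nested parentheses.
_TJ_ITEM = re.compile(r"\(([^()]*)\)|(-?(?:\d+\.?\d*|\.\d+))")


def parse_tj_array(operand):
    """Split a TJ operand into strings (text) and floats (adjustments)."""
    items = []
    for text, number in _TJ_ITEM.findall(operand):
        if number:
            items.append(float(number))
        else:
            items.append(text)
    return items


def tj_to_text(operand, gap_threshold=-200.0):
    """Join the strings of a TJ operand into plain text.

    gap_threshold is a hypothetical heuristic: an adjustment below it
    (a shift to the right of more than 0.2 text space units) is treated
    as a word gap. Smaller adjustments are treated as kerning.
    """
    parts = []
    for item in parse_tj_array(operand):
        if isinstance(item, float):
            if item < gap_threshold:
                parts.append(" ")
        else:
            parts.append(item)
    return "".join(parts)


if __name__ == "__main__":
    print(parse_tj_array("[ (W) 29.60001 (uchsform:) ] TJ"))
    print(tj_to_text("[ (W) 29.60001 (uchsform:) ] TJ"))
```

The 29.60001 between "W" and "uchsform:" is far from the threshold, so the two strings are joined into one word. A real file would also need the font's glyph widths to decide reliably where a word ends.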

Extracting the words from a PDF file for indexing means you first have to build words from the character positions. Recognizing paragraphs, column text, tables, captions, lists, footnotes etc. is much more difficult.
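Building words from character positions could look roughly like the following Python sketch. The `Glyph` type, the `build_words` function and the `gap_ratio` value are hypothetical, made up only for illustration. The sketch assumes that each glyph already has a page position and a width, and that glyphs on the same line share exactly the same baseline y.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline of the glyph
    width: float  # horizontal advance of the glyph


def build_words(glyphs, gap_ratio=0.3):
    """Group positioned glyphs into words.

    A new word starts when the baseline changes or when the horizontal
    gap to the previous glyph exceeds gap_ratio times that glyph's width.
    gap_ratio is an assumed heuristic, not a value from any library.
    """
    words = []
    current = []
    prev = None
    # PDF y coordinates grow upward, so sort top to bottom, then left to right.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if prev is not None:
            same_line = g.y == prev.y
            gap = g.x - (prev.x + prev.width)
            if not same_line or gap > gap_ratio * prev.width:
                words.append("".join(c.char for c in current))
                current = []
        current.append(g)
        prev = g
    if current:
        words.append("".join(c.char for c in current))
    return words


if __name__ == "__main__":
    glyphs = [
        Glyph("c", 13.0, 100.0, 5.0),
        Glyph("a", 0.0, 100.0, 5.0),
        Glyph("e", 0.0, 90.0, 5.0),
        Glyph("b", 5.0, 100.0, 5.0),
        Glyph("d", 18.0, 100.0, 5.0),
    ]
    print(build_words(glyphs))
```

This only recovers words. It says nothing about which words form a paragraph, a column or a table cell, and that is the much harder part.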


Sören
