Mike O'Leary wrote:
Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering if that code supports extracting
text from PDF files while preserving such things as sequences of whitespace
between characters and other layout and formatting information. I am working
with a project that extracts and operates on certain table-like blocks of
text from PDF files, and a lot of freeware and shareware PDF to text
converters seem to either ignore formatting or try to preserve formatting
and not get it quite right. I am wondering if PDFBox provides better support
for this kind of thing. Thanks.
That is not so simple. Usually a PDF file does not contain this
information. PDF is an output format: it contains little more than
instructions such as "print the character 'a' at position x, y". In many
cases a PDF file does not even know about words or white space. We read
words because of the positions of the characters, we see paragraphs
because of the positions of the characters, and we see tables because of
the positions of the characters. The file itself doesn't contain this
information.
I found this code in a PDF file for the German word "Wuchsform" (growth
form) followed by a colon ":":
/F1 1 Tf
-3.8801 -1.274 TD
[ (W) 29.60001 (uchsform:) ] TJ
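First line: Select the font F1 at size 1.
Second line: Move to the start of the next line, offset by
(-3.8801, -1.274) from the start of the current line.
Third line: Print the character "W", shift the position by 29.60001
thousandths of a text-space unit (a positive value moves to the left,
i.e. it tightens the kerning), and print the characters "uchsform:".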
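To make the third line more concrete, here is a minimal Python sketch of
how the numbers in a TJ array turn into horizontal shifts. It is not
PDFBox code; the function name walk_tj is made up for illustration, and
it ignores glyph widths, character and word spacing, horizontal scaling
and the text matrix. It only applies the rule that a number is given in
thousandths of a text-space unit and is subtracted from the current
position.

```python
def walk_tj(elements, font_size):
    """Pair each string in a TJ array with the shift applied just before it.

    Hypothetical helper: numbers are thousandths of a text-space unit and
    are subtracted from the position, so a positive number moves left.
    """
    runs = []
    pending_shift = 0.0
    for element in elements:
        if isinstance(element, (int, float)):
            # Accumulate the adjustment, scaled by the font size.
            pending_shift += -(element / 1000.0) * font_size
        else:
            runs.append((element, pending_shift))
            pending_shift = 0.0
    return runs


# The TJ array from the example above, with the font size from "/F1 1 Tf".
runs = walk_tj(["W", 29.60001, "uchsform:"], font_size=1.0)
for text, shift in runs:
    print(repr(text), shift)
```

The shift before "uchsform:" comes out slightly negative, so the "u" is
pulled a little closer to the "W". Nothing in the array says that the two
strings belong to one word.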
Extracting the words from a PDF file for indexing means you first have
to build words from the character positions. Recognizing paragraphs,
column text, tables, captions, lists, footnotes etc. is much more difficult.
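As a rough illustration of "building words from character positions",
the following Python sketch groups glyphs on a single line into words by
looking at the gap between one glyph's right edge and the next glyph's
left edge. The Glyph class, the group_into_words function, the sample
coordinates and the gap_factor threshold of 0.3 are all assumptions made
for this example. This is a simple heuristic, not what PDFBox actually
does.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the line
    width: float  # advance width of the glyph


def group_into_words(glyphs, gap_factor=0.3):
    """Split one line of glyphs into words wherever the gap is large.

    A gap wider than gap_factor times the next glyph's width is treated
    as a word break (an assumed threshold, to be tuned per document).
    """
    words = []
    current = ""
    previous_end = None
    for glyph in sorted(glyphs, key=lambda g: g.x):
        if previous_end is not None:
            gap = glyph.x - previous_end
            if gap > gap_factor * glyph.width:
                words.append(current)
                current = ""
        current += glyph.char
        previous_end = glyph.x + glyph.width
    if current:
        words.append(current)
    return words


# Made-up positions: "a" and "b" touch, then a gap of 4 units before "c".
line = [Glyph("a", 0, 5), Glyph("b", 5, 5), Glyph("c", 14, 5), Glyph("d", 19, 5)]
words = group_into_words(line)
print(words)
```

Even this small example needs a threshold that depends on the font and
the layout. Tables and columns need the same kind of guessing in two
dimensions.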
Sören