Marcelo Modesto created PDFBOX-5969:
---------------------------------------
Summary: Support for text location information in the ExtractText
command-line tool
Key: PDFBOX-5969
URL: https://issues.apache.org/jira/browse/PDFBOX-5969
Project: PDFBox
Issue Type: New Feature
Components: Text extraction
Affects Versions: 3.0.4 PDFBox
Environment: Ubuntu 24.04.2 LTS
openjdk version "11.0.26" 2025-01-21
OpenJDK Runtime Environment (build 11.0.26+4-post-Ubuntu-1ubuntu124.04)
OpenJDK 64-Bit Server VM (build 11.0.26+4-post-Ubuntu-1ubuntu124.04, mixed
mode, sharing)
Reporter: Marcelo Modesto
Attachments: PDFText2JSONLine.java, json_line.diff, sample_output.txt
I've been using the ExtractText command-line tool to process lots of PDF files
successfully.
Basically, I use regular-expression filters that allow me to extract and
structure the information that I need.
Sometimes I could obtain a better result if I had some information about the
text location, for example when processing tabular text data.
I've read about the Tabula project and PDFBox text location features on Stack
Overflow, and I've inspected the PrintTextLocations and DrawPrintTextLocations
source code.
I decided to implement a new output format in the ExtractText command-line tool.
Basically, each line of text in the PDF produces one JSON object with some
location information.
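
To illustrate the idea for readers without the attachments, here is a minimal
sketch of how one line of text could be turned into one JSON object. It is not
the attached patch: the Fragment class is a hypothetical stand-in for the
positioned text that PDFBox reports (for example through TextPosition), and the
field names page, x, y, width, height and text are assumptions chosen for this
example only.

```java
import java.util.List;
import java.util.Locale;

class LineJsonSketch {

    /** Hypothetical stand-in for one positioned piece of text on a page. */
    static final class Fragment {
        final String text;
        final double x, y, width, height;

        Fragment(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Escapes the characters that must be escaped inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds one JSON object: the line's text plus the bounding box of its fragments. */
    static String toJsonLine(int page, List<Fragment> line) {
        if (line.isEmpty()) {
            throw new IllegalArgumentException("a line needs at least one fragment");
        }
        StringBuilder text = new StringBuilder();
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (Fragment f : line) {
            text.append(f.text);
            minX = Math.min(minX, f.x);
            minY = Math.min(minY, f.y);
            maxX = Math.max(maxX, f.x + f.width);
            maxY = Math.max(maxY, f.y + f.height);
        }
        // Locale.ROOT keeps the decimal separator a dot regardless of the system locale.
        return String.format(Locale.ROOT,
                "{\"page\":%d,\"x\":%.2f,\"y\":%.2f,\"width\":%.2f,\"height\":%.2f,\"text\":\"%s\"}",
                page, minX, minY, maxX - minX, maxY - minY, escape(text.toString()));
    }
}
```

Writing one object per output line keeps the result easy to process with
line-oriented tools, so existing regular-expression filters can still be
applied to the text field while the coordinates help tell columns apart.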
I'm attaching the changes I made and an example output (with some limitations I
noted).
I'm sending it with the hope that it might be useful to someone else.
Feel free to decline if you find the proposal useless or even outside the scope
of the ExtractText tool.
Thank you!