TL;DR: how can I integrate PDFBox into a Python framework so that
non-computer-scientists can install and use it?

I have used PDFBox for at least 10 years and love it and the community. I
use it to make (scientific) PDFs semantic, by trapping the events and
saving them as SVG, after which I can assemble structured text, vector
graphics objects and images located within it. This goes a long way towards
recreating semantic data from the original authors' tool (words, graphics,
and science).
I used PDFBox 1.*, and then upgraded to PDFBox 2.*, subclassing PageDrawer.
The code is in https://github.com/petermr/ami3 (no significant commits in
the last 2 years).

I have moved my functionality to Python (https://github.com/petermr/pyami)
because the scientific community now uses Python and knows how to install,
run and extend it. However, I didn't find a good Python equivalent to PDFBox
(neither pdfminer nor pdfplumber) and have struggled to create an equivalent
to ami3.

I'm therefore considering reverting to PDFBox for the PDF->SVG conversion,
which would be used as a black box. It would take a list of page numbers
and output a semantic representation of the PDF, initially in SVG. Since I
am 2 years adrift of PDFBox, I'd very much appreciate guidance to stop me
going down rabbit holes.
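
To make the "black box" concrete, here is a minimal sketch of how the Python
side might call it. It assumes a converter JAR that accepts an input PDF, an
output directory and a comma-separated page list. The function names and the
--input/--outdir/--pages options are placeholders I made up for
illustration; they are not an existing ami3 or PDFBox command line.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(jar, pdf, out_dir, pages, java="java"):
    """Build the argument list for a HYPOTHETICAL PDF->SVG converter JAR.

    The --input/--outdir/--pages flags are assumptions for this sketch,
    not options of any released tool.
    """
    page_arg = ",".join(str(p) for p in pages)
    return [
        java, "-jar", str(jar),
        "--input", str(pdf),
        "--outdir", str(out_dir),
        "--pages", page_arg,
    ]


def convert(jar, pdf, out_dir, pages):
    """Run the converter and return the SVG files it wrote."""
    java = shutil.which("java")
    if java is None:
        # Fail with a readable message rather than a stack trace from Java.
        raise RuntimeError("No Java runtime found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the JAR exits with an error.
    subprocess.run(build_command(jar, pdf, out_dir, pages, java), check=True)
    return sorted(Path(out_dir).glob("*.svg"))
```

With something like this the user only ever calls a Python function; the
remaining problem is getting the JAR and a Java runtime onto their machine.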

* Are there other PDFBox users who create semantic representations of pages?
(I need chars+screenxy+size+width+fill/stroke+font+style(if present)+matrix.)
* Does PDFBox3 have more functionality than PDFBox2 that would help?
* Is there any better way of representing a semantic page than SVG?
* Assuming I debug a new version, I would wish to install it (presumably a
JAR file) with
(Python) pip install pyami
It's too hard to get users to manipulate JARs themselves - many don't like
the command line.
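
For clarity, the per-character record I listed in the first question could
look roughly like this on the Python side. This is only a sketch: the Glyph
class, its field names and the data-width attribute are my own placeholders
(SVG has no standard attribute for a character's advance width), and the
SVG namespace is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import xml.etree.ElementTree as ET


@dataclass
class Glyph:
    """One character with the properties listed above (hypothetical names)."""
    char: str
    x: float            # screen x
    y: float            # screen y
    size: float         # font size
    width: float        # advance width
    font: str
    fill: str = "black"
    stroke: str = "none"
    style: Optional[str] = None   # e.g. "italic", if present
    matrix: Tuple[float, ...] = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def glyph_to_svg(g):
    """Serialise a Glyph as a single SVG <text> element."""
    attrib = {
        "x": f"{g.x:g}",
        "y": f"{g.y:g}",
        "font-size": f"{g.size:g}",
        "font-family": g.font,
        "fill": g.fill,
        "stroke": g.stroke,
        "transform": "matrix(" + " ".join(f"{v:g}" for v in g.matrix) + ")",
        "data-width": f"{g.width:g}",   # placeholder, not a standard SVG attribute
    }
    if g.style is not None:
        attrib["font-style"] = g.style
    el = ET.Element("text", attrib)
    el.text = g.char
    return el
```

For example, ET.tostring(glyph_to_svg(Glyph("H", 72, 100.5, 10, 7.2,
"Helvetica")), encoding="unicode") gives one <text> element per character,
which is the level of detail I need before assembling words and lines.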

Thanks!

P.

-- 
"I always retain copyright in my papers, and nothing in any contract I sign
with any publisher will override that fact. You should do the same".

Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Yusuf Hamied Department of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-336432
