TL;DR: How can I integrate PDFBox into a Python framework so that non-computer-scientists can install and use it?
I have used PDFBox for at least 10 years and love it and the community. I use it to make (scientific) PDFs semantic by trapping the events and saving them as SVG, after which I can assemble structured text, vector graphics objects and the images located within the page. This goes a long way towards recreating semantic data from the original authors' tool (words, graphics, and science).

I used PDFBox 1.* and then upgraded to PDFBox 2.*, subclassing PageDrawer. The code is in https://github.com/petermr/ami3 (no significant commits in the last 2 years).

I have moved my functionality to Python (https://github.com/petermr/pyami) because the scientific community now uses it and knows how to install, run and extend it. However, I didn't find a good equivalent to PDFBox (I looked at pdfminer and pdfplumber) and have struggled to create an equivalent to ami3.

I'm therefore considering reverting to PDFBox for the PDF->SVG conversion, which would be used as a black box. It would take a list of page numbers and output a semantic representation of the PDF, initially in SVG.

Since I am 2 years adrift of PDFBox, I'd very much appreciate guidance to stop me going down rabbit holes.

* Are there other PDFBox users who create semantic representations of pages? (I need chars + screen xy + size + width + fill/stroke + font + style (if present) + matrix.)
* Does PDFBox 3 have more functionality than PDFBox 2 that would help?
* Is there any better way of representing a semantic page than SVG?
* Assuming I debug a new version, I would wish to install it (presumably as a JAR file) with (Python)
      pip install pyami
  It's too hard to get users to manipulate JARs themselves; many don't like the CLI.

Thanks!

P.

-- 
"I always retain copyright in my papers, and nothing in any contract I sign with any publisher will override that fact. You should do the same".

Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Yusuf Hamied Department of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-336432
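
P.S. To make the packaging question concrete, here is a minimal sketch of the kind of black-box wrapper I have in mind on the Python side. Everything specific in it is an assumption for illustration: the JAR name pdf2svg.jar, the --input/--outdir/--pages flags and the function names are hypothetical and do not exist in ami3, pyami or PDFBox. The only real pieces are "java -jar" and the Python standard library. The idea is that pip would install the JAR as package data and the user would only ever call the Python function.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical location of a converter JAR shipped as package data (assumption).
DEFAULT_JAR = Path("pyami") / "resources" / "pdf2svg.jar"


def build_command(jar, pdf, outdir, pages, java="java"):
    """Assemble the argument list for a hypothetical PDF->SVG converter JAR.

    The --input/--outdir/--pages flags are invented for this sketch.
    """
    page_arg = ",".join(str(p) for p in pages)
    return [
        java, "-jar", str(jar),
        "--input", str(pdf),
        "--outdir", str(outdir),
        "--pages", page_arg,
    ]


def pdf_to_svg(pdf, outdir, pages, jar=DEFAULT_JAR):
    """Run the converter as a black box and return the SVG files it wrote."""
    java = shutil.which("java")
    if java is None:
        # Users still need a Java runtime; fail with a readable message.
        raise RuntimeError("No Java runtime found on PATH")
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(jar, pdf, outdir, pages, java=java), check=True)
    return sorted(Path(outdir).glob("*.svg"))
```

A user would then write something like pdf_to_svg("paper.pdf", "out", [1, 2, 5]) and never touch the JAR or the command line. Note that this still requires a Java runtime on the user's machine, which is part of what I am asking about.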