Package: poppler-utils
Version: 25.03.0-4
Severity: minor
The pdftohtml utility included with poppler-utils can print an XML format when
given the -xml command-line switch. Unlike a conversion to SVG, this format is
well suited to parsing and scraping afterwards, or to subsequent conversion.
To try it, one can run:
pdftohtml -l 1 -q -i -stdout -xml \
  /usr/share/doc/debian-reference-common/docs/debian-reference.en.pdf
and obtain something like
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="25.03.0">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892"/>
<outline>
<item page="28">GNU/Linux tutorials</item>
<outline>
<item page="28">Console basics</item>
...
That DTD file is in the source tree and can be used to validate the XML or to
help in understanding the format. It is not installed, but it would be helpful
if poppler-utils shipped it. Note that, if I recall correctly, Debian has a
specific packaging policy for SGML/XML DTDs such as this one, so that tools
wanting them can find them easily.
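
To illustrate the kind of parsing the XML output makes possible, here is a
minimal sketch in Python that lists the outline entries. It assumes the nested
<outline>/<item> layout seen in the excerpt above; the closing tags in the
sample are my own completion of the truncated output, and the function name
outline_items is only illustrative, not part of poppler.

```python
import xml.etree.ElementTree as ET


def outline_items(xml_text):
    """Return (depth, page, title) tuples for every outline <item>.

    Assumes nested <outline> elements sit beside the <item> elements of
    their parent level, as in the excerpt from the report.
    """
    root = ET.fromstring(xml_text)
    found = []

    def walk(node, depth):
        for child in node:
            if child.tag == "item":
                title = (child.text or "").strip()
                # The page attribute is kept as a string, exactly as emitted.
                found.append((depth, child.get("page"), title))
            elif child.tag == "outline":
                walk(child, depth + 1)

    for top in root.findall("outline"):
        walk(top, 0)
    return found


# Sample modelled on the excerpt; the closing tags are assumed.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="25.03.0">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892"/>
<outline>
<item page="28">GNU/Linux tutorials</item>
<outline>
<item page="28">Console basics</item>
</outline>
</outline>
</pdf2xml>"""

if __name__ == "__main__":
    for depth, page, title in outline_items(SAMPLE):
        print("  " * depth + f"{title} (page {page})")
```

The standard-library parser reads the DOCTYPE line but does not fetch or
validate against pdf2xml.dtd, so this works whether or not the DTD is
installed; validation would need a separate tool.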