Package: poppler-utils
Version: 25.03.0-4
Severity: minor

The pdftohtml utility included with poppler-utils can print an XML format
when given the -xml command-line switch. Unlike SVG output, this format is
convenient for parsing and scraping afterwards, or for subsequent
conversion. To try it, run:
        pdftohtml -l 1 -q -i -stdout -xml \
            /usr/share/doc/debian-reference-common/docs/debian-reference.en.pdf
and obtain something like
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="25.03.0">
        <page number="1" position="absolute" top="0" left="0" height="1262" width="892"/>
        <outline>
                <item page="28">GNU/Linux tutorials</item>
                <outline>
                        <item page="28">Console basics</item>
...
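
To illustrate the kind of parsing this format allows, here is a minimal
Python sketch that walks the nested <outline> elements using only the
standard library. The sample document is an abridged copy of the output
above with closing tags added by hand, and the function names are my own.
It assumes items sit directly inside their <outline> as shown.

```python
import xml.etree.ElementTree as ET

# Abridged sample modelled on the output above; the closing tags were
# added by hand so that the fragment is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="25.03.0">
  <page number="1" position="absolute" top="0" left="0" height="1262" width="892"/>
  <outline>
    <item page="28">GNU/Linux tutorials</item>
    <outline>
      <item page="28">Console basics</item>
    </outline>
  </outline>
</pdf2xml>
"""


def walk_outline(outline, depth=0):
    """Yield (depth, page, title) for each <item>, descending into nested <outline>s."""
    for child in outline:
        if child.tag == "item":
            page = child.get("page")
            yield depth, (int(page) if page is not None else None), (child.text or "").strip()
        elif child.tag == "outline":
            yield from walk_outline(child, depth + 1)


def outline_entries(xml_bytes):
    """Parse pdftohtml -xml output and return its outline as a flat list."""
    root = ET.fromstring(xml_bytes)
    top = root.find("outline")
    return list(walk_outline(top)) if top is not None else []


if __name__ == "__main__":
    for depth, page, title in outline_entries(SAMPLE):
        print("  " * depth + f"{title} (page {page})")
```

The bytes written to standard output by the command above could be passed
to outline_entries() in place of SAMPLE. The parser does not fetch the
external DTD, so this works even though pdf2xml.dtd is not installed.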

That DTD file (pdf2xml.dtd) is in the source tree and can be used to validate
the XML or to aid in understanding the format. It is not installed, but it
would be helpful if poppler-utils shipped it. Note that, if I recall
correctly, Debian has a specific packaging policy for SGML/XML DTDs such as
this one, so that tools wanting them can find them easily.
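
As a sketch of what validation could look like today, the following Python
helper shells out to xmllint (from libxml2-utils) with its --noout and
--dtdvalid options. The file names in the usage note are hypothetical, and
it assumes pdf2xml.dtd has been copied by hand from the poppler source tree.

```python
import shutil
import subprocess


def xmllint_command(xml_path, dtd_path):
    """Build an xmllint invocation that validates xml_path against dtd_path."""
    # --noout suppresses the re-serialised document;
    # --dtdvalid names the DTD to validate against.
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]


def validate(xml_path, dtd_path):
    """Return True if xmllint exits successfully; raise if xmllint is missing."""
    if shutil.which("xmllint") is None:
        raise FileNotFoundError("xmllint not found on PATH")
    result = subprocess.run(xmllint_command(xml_path, dtd_path))
    return result.returncode == 0
```

For example, validate("page1.xml", "pdf2xml.dtd") would check a saved copy
of the output against a local copy of the DTD. If the DTD were installed
in a standard location, the second argument would not need to be a
hand-copied file.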
