 sdext/source/pdfimport/README.md | 106 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

New commits:
commit 9f147a024c2efd88a3fadbf2f89800a1659f544e
Author:     Dr. David Alan Gilbert <d...@treblig.org>
AuthorDate: Sat Jan 11 21:53:50 2025 +0000
Commit:     David Gilbert <d...@treblig.org>
CommitDate: Sat Jan 18 01:11:32 2025 +0100

    sdext: Document the pdf import code

    Change-Id: I572d9a73a652df1f26cf4c6434be4ebe8c5bff00
    Reviewed-on: https://gerrit.libreoffice.org/c/core/+/180132
    Tested-by: Jenkins
    Reviewed-by: David Gilbert <freedesk...@treblig.org>

diff --git a/sdext/source/pdfimport/README.md b/sdext/source/pdfimport/README.md
new file mode 100644
index 000000000000..6ce7986005e9
--- /dev/null
+++ b/sdext/source/pdfimport/README.md
@@ -0,0 +1,106 @@
+# PDF import
+
+## Introduction
+
+The code in this directory parses a PDF file and builds a LibreOffice
+document containing similar elements, which can then be edited.
+It is invoked when opening a PDF file, but **not** when inserting
+a PDF into a document. Inserting a PDF file renders it and inserts
+a non-editable, rendered version.
+
+The parsing is done by the library [Poppler](https://poppler.freedesktop.org/)
+which then calls back into one layer of this code which is built as a
+Poppler output device implementation.
+
+The PDF format is specified by [this document](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf).
+
+Note that PDF is a language that describes how to **render** a page, not
+a language for describing an editable document, thus some of the conversion
+is a heuristic that doesn't always give good results.
+
+Indeed, PDF is Turing complete, and can embed JavaScript, which is also
+Turing complete, so it's a wonder that PDFs ever manage to display anything.
+
+## Current limitations
+
+- Not all elements have clipping implemented.
+
+- LibreOffice's clipping routines all use even-odd winding rules, whereas
+PDF can (and usually does) use non-zero winding rules, making some
+clipping operations incorrect.
+
+- In PDF, there's no concept of lines of text or paragraphs; each
+character can be entirely separate. The code has very simple heuristics
+for reassembling characters back into lines of text.
+Other programs, like *pdftotext*, have more complex heuristics that might be worth a try.
+
+- Some cheap PDF operations, like the more advanced fills, generate many
+hundreds of objects in LibreOffice, which can make the document painfully
+slow to open. At least some of these are possible to improve by adding
+more Poppler API implementations. Some may require expanding LibreOffice's
+set of fill types.
+
+- There can be differences between distributions' Poppler library builds
+and the builds LibreOffice makes when it doesn't have a distro build
+to use, e.g. in LibreOffice's own distributed builds or the bibisect
+builds. In particular the distro builds may include a library
+(supporting another embedded image type) that LibreOffice's build does not.
+
+## Fundamental limitations
+
+- The ordering of fonts embedded in PDF is often ASCII, but not always.
+Sometimes it's arbitrary. The font may then include a *ToUnicode* map allowing
+programs to map the arbitrary index back to Unicode. Alas not all PDFs
+include it, and some even use a bogus map to make it harder to copy/edit.
+If the same PDF renders correctly in other readers but fails to copy-and-paste,
+then this is probably the issue.
+
+- PDF can use complex programming in many places, for example a simple fill
+could be composed of a complex program to generate the fill tiles instead
+of an obvious simple item that can be encoded as a LibreOffice shading type.
+Rendering these down to image tiles works OK but can sometimes end up
+with a fuzzy image rather than a nice sharp vector representation.
+
+- Poppler's device interface API is not meant to be stable. The code
+thus has lots of ifdefs to deal with different Poppler versions.
+
+## Structure
+
+Note that the structure is dictated by Poppler being GPL licensed, whereas
+LibreOffice isn't.
+
+- *xpdfwrapper/* contains the GPL code that's linked with Poppler
+and forms the *xpdfimport* binary. That binary outputs a stream
+representing the PDF as simpler operations (lines, clipping operations,
+images, etc.). These form a series of commands on stdout, and binary
+data (mostly images) on stderr. This does make adding debugging output tricky.
+
+- *wrapper/* contains the LibreOffice glue that execs the *xpdfimport*
+binary and parses the stream. It also sets up password entry for
+protected PDFs. After parsing the keyword and then any data that
+should be with the keyword, this layer then calls into the following
+tree layer.
+
+- *tree/* forms internal tree objects for each of the calls from the
+wrapper layer. The tree is then 'visited' by optimisation layers
+(that do things like assemble individual characters into lines of text)
+and then by backend-specific XML generators (e.g. for Draw and Writer)
+that then generate an XML stream to be parsed by the core of LibreOffice.
+
+## Bug handling
+
+- Please tag bugs with *filter:pdf* in component *filters and storage*.
+
+- The *pdfseparate* utility, which is part of Poppler, is useful for splitting
+a PDF into individual pages to figure out which page is causing a crash
+or hang, or to shrink the problem down.
+
+- [qpdf](https://github.com/qpdf/qpdf) is useful for editing raw PDF
+files to really cut down the number of primitives, but takes some
+getting used to.
+
+- The xpdfimport binary can be run independently of the rest of LibreOffice
+to allow the translated stream to be examined:
+
+    ./instdir/program/xpdfimport problem.pdf < /dev/null > stream 2> binarystream
+
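
The standalone invocation that closes the README can also be scripted when many PDFs need checking. The following is a minimal Python sketch and is not part of the commit. The function name `capture_streams` and the output directory are hypothetical, and it assumes only what the README states: stdin comes from /dev/null, the commands arrive on stdout, and the binary data (mostly images) arrives on stderr. It saves the raw bytes and makes no attempt to interpret the stream format.

```python
import subprocess
from pathlib import Path


def capture_streams(command, out_dir):
    """Run `command` and save its stdout and stderr to separate files.

    Mirrors the shell line in the README:
        xpdfimport problem.pdf < /dev/null > stream 2> binarystream
    `command` is an argument list, e.g.
        ["./instdir/program/xpdfimport", "problem.pdf"]
    Returns (exit code, path to 'stream', path to 'binarystream').
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stream = out_dir / "stream"              # commands written to stdout
    binarystream = out_dir / "binarystream"  # binary data written to stderr

    # Both files are opened in binary mode so no byte is altered on the way.
    with open(stream, "wb") as out, open(binarystream, "wb") as err:
        result = subprocess.run(
            [str(part) for part in command],
            stdin=subprocess.DEVNULL,  # same as "< /dev/null"
            stdout=out,
            stderr=err,
            check=False,  # a crash is what is being investigated; don't raise
        )
    return result.returncode, stream, binarystream
```

Calling `capture_streams(["./instdir/program/xpdfimport", "problem.pdf"], "out")` would leave `out/stream` and `out/binarystream` for inspection. Combined with *pdfseparate*, the returned exit code gives a quick way to see which single-page file triggers a failure.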