Splitter does not include structure tree in documents past the first split

Alastair Porter Wed, 14 May 2025 09:37:46 -0700

Hi,
Apologies if my terminology is wrong on some of the following topics; I've
not worked with PDFs in much detail before.


When using the Splitter to split PDFs, it appears that any split that
doesn't start on the first page of the input document does not include
structure tree elements / accessibility tags.
I note the recent work in PDFBOX-2725 ([PATCH] Split pdf lose accessibility
tags) and PDFBOX-5929 (Remove orphan annotations in structure tree) which
may have affected some of this related code.

I can reproduce this with both the app cli:
    java -jar pdfbox/app/target/pdfbox-app-4.0.0-SNAPSHOT.jar split -i input.pdf -outputPrefix output-split

and also with the API:
    Splitter splitter = new Splitter();
    splitter.setSplitAtPage(20);
    List<PDDocument> documents = splitter.split(inputDocument);

I also checked PDFBox 3.0.3 (last release before PDFBOX-5929) and the
behaviour appears to be the same. That is, the patch does not appear to
have broken functionality that previously worked.

I am evaluating the resulting PDFs using the PAC PDF Accessibility Checker
(https://pac.pdf-accessibility.org/en) and also the PDFBox debugger. I
expect to see items in Root/StructTreeRoot/K in the debugger.

In the first file, I correctly see the /K element. What's more, this
element has correctly been pruned and doesn't include any items from the
input document which point to pages that are not in this split.
In subsequent split files, I see no /K element in the StructTreeRoot at all.
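
To make the expectation concrete, here is a toy model in plain Java of what I would expect the pruning to do for every split, not just the first. It does not use any PDFBox classes and is not how Splitter is implemented: `Node`, `prune` and the page numbers are all made up for illustration, and each element is assumed to reference at most one page.

```java
import java.util.ArrayList;
import java.util.List;

class StructTreePruneSketch {
    /** Hypothetical stand-in for a structure element; page is 1-based, 0 means "no page of its own". */
    static final class Node {
        final String type;
        final int page;
        final List<Node> kids = new ArrayList<>();

        Node(String type, int page, Node... children) {
            this.type = type;
            this.page = page;
            for (Node child : children) {
                kids.add(child);
            }
        }
    }

    /** Returns a pruned copy keeping only content on pages first..last, or null if nothing remains. */
    static Node prune(Node src, int first, int last) {
        Node copy = new Node(src.type, src.page);
        for (Node kid : src.kids) {
            Node kept = prune(kid, first, last);
            if (kept != null) {
                copy.kids.add(kept);
            }
        }
        boolean onPage = src.page >= first && src.page <= last;
        return (onPage || !copy.kids.isEmpty()) ? copy : null;
    }

    public static void main(String[] args) {
        // Made-up input: a 40-page document with one paragraph on each of four pages.
        Node root = new Node("Document", 0,
                new Node("P", 1), new Node("P", 20), new Node("P", 21), new Node("P", 40));

        Node firstSplit = prune(root, 1, 20);   // pages 1-20
        Node secondSplit = prune(root, 21, 40); // pages 21-40

        System.out.println("first split kids:  " + firstSplit.kids.size());
        System.out.println("second split kids: " + secondSplit.kids.size());
    }
}
```

In this model the second split keeps the two elements that point at pages 21 and 40, which corresponds to the non-empty /K I was hoping to see in the second output file.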

I attached a PDF which I've been using for simple testing, which exhibits
this behaviour.

I had a bit of a look through the existing code, and I see the following
in cloneStructureTree in Splitter.java:

    COSBase k1 = srcStructureTreeRoot.getK();
    COSBase k2 = new KCloner(dstPageTree).createClone(k1, dstStructureTreeRoot.getCOSObject(), null);
    dstStructureTreeRoot.setK(k2);

k2 is always null for every split after the first, so it seems like it may
not be created correctly.

Is this a known bug, or perhaps an issue with the way I'm using the API or
the format of the input documents?

Thanks,
Alastair

