[
https://issues.apache.org/jira/browse/PDFBOX-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mruthyunjaya S updated PDFBOX-6163:
-----------------------------------
Attachment: After Merge Of the PDFS.png
before Merge.png
DataLoss.png
> Technical Issue Report: Intermittent Data Loss and Font Missing Errors during
> PDF Merging
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-6163
> URL: https://issues.apache.org/jira/browse/PDFBOX-6163
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.33
> Environment: Production environment, Windows Server 2019
> Reporter: Mruthyunjaya S
> Priority: Critical
> Attachments: After Merge Of the PDFS.png, DataLoss.png, before
> Merge.png
>
>
> h3. *Technical Issue Report: Intermittent Data Loss and Font Missing Errors
> during PDF Merging*
> *Environment:*
> * *Library:* PDFBox 2.0.33
> * *Operating System:* Windows Server 2019
> * *Application Context:* Windchill MethodServer
> * *Affected Fonts:* Japanese fonts, specifically *MS Gothic* (Subsetted)
> *Issue Summary:* We are experiencing a high failure rate (*80% failure*)
> when merging technical CAD drawings into a single PDF package. Even though
> the source PDFs have fonts embedded, the generated "Merged PDF" frequently
> suffers from missing fonts or garbled text on specific pages.
> *Key Observations:*
> * *Reproducibility:* In a test of 10 consecutive merge attempts, the output
> was correct only *2 times*, while the remaining *8 attempts* resulted in
> missing fonts.
> * *Specific Error:* The issue is almost exclusively linked to Japanese *MS
> Gothic* variants (e.g., {{ACWDKT+MS Gothic}}).
> * *Error Logs:* We frequently encounter {{Format 14 cmap table is not
> supported}} and {{Format 12 cmap contains an invalid glyph index}} warnings
> during the process.
> * *Client Behavior:* Adobe Acrobat fails to render the text, displaying the
> error: _"Cannot extract the embedded font 'ACWDKT+MS Gothic'. Some
> characters may not display or print correctly."_
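
The name `ACWDKT+MS Gothic` follows the PDF convention for subsetted fonts: a tag of six uppercase letters, a plus sign, then the base font name. The following is a minimal sketch of how that convention could be checked in Java. The class name `SubsetFontName` and its methods are hypothetical and not part of PDFBox.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a PDF subset font name into tag and base name. */
final class SubsetFontName {
    // Subset tag convention: exactly six uppercase ASCII letters followed by '+'.
    private static final Pattern SUBSET = Pattern.compile("^([A-Z]{6})\\+(.+)$");

    private SubsetFontName() {}

    /** Returns true if the name carries a six-letter subset tag. */
    static boolean isSubset(String fontName) {
        return fontName != null && SUBSET.matcher(fontName).matches();
    }

    /** Returns the base font name without the tag, or empty if this is not a subset name. */
    static Optional<String> baseName(String fontName) {
        if (fontName == null) {
            return Optional.empty();
        }
        Matcher m = SUBSET.matcher(fontName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }
}
```

A check like this is narrower than testing whether the name merely contains a `+`, so it should be less likely to match fonts that are not subsets.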
> *Suspected Causes:*
> # *I/O Race Condition:* We intermittently receive {{java.io.IOException:
> Missing root object specification in trailer}}, suggesting the merger may
> be accessing files before they are fully flushed to disk or while they are
> still locked by the external converter.
> # *Resource Clashing:* We suspect that {{PDFMergerUtility}} may be producing
> clashes between font resource aliases (like {{/F1}}) across different
> subsetted drawings, leading to corrupted Character Maps (CMaps) in the final
> document.
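
If the first suspicion holds (sources being read while the converter is still writing them), one way to narrow it down is to gate each source file on a readiness check before it is handed to the merger. The sketch below uses only the JDK. The class `PdfReadyCheck`, the 1024-byte tail window, and the polling parameters are illustrative assumptions, and the check is only a heuristic (a `%PDF-` header and a trailing `%%EOF` marker), not a validation of the PDF.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-merge guard; names and thresholds are illustrative. */
final class PdfReadyCheck {
    private PdfReadyCheck() {}

    /** True if the file starts with "%PDF-" and has "%%EOF" within its last 1024 bytes. */
    static boolean looksComplete(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long len = raf.length();
            if (len < 16) {
                return false;
            }
            byte[] head = new byte[5];
            raf.readFully(head);
            if (!"%PDF-".equals(new String(head, StandardCharsets.ISO_8859_1))) {
                return false;
            }
            int tailLen = (int) Math.min(1024, len);
            byte[] tail = new byte[tailLen];
            raf.seek(len - tailLen);
            raf.readFully(tail);
            return new String(tail, StandardCharsets.ISO_8859_1).contains("%%EOF");
        }
    }

    /**
     * Polls until the size is unchanged between two consecutive polls and the
     * file looks complete, or until the attempts run out.
     */
    static boolean waitUntilReady(Path file, int maxAttempts, long pollMillis)
            throws InterruptedException {
        long lastSize = -1;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                long size = Files.size(file);
                if (size == lastSize && looksComplete(file)) {
                    return true;
                }
                lastSize = size;
            } catch (IOException e) {
                lastSize = -1; // missing or locked: treat as not ready yet
            }
            Thread.sleep(pollMillis);
        }
        return false;
    }
}
```

A caller could skip, retry, or fail the whole merge when `waitUntilReady` returns false. Failing loudly is probably preferable to silently dropping a drawing from the package.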
> *Current Code Implementation:* We have attempted to fix this by implementing
> a *Targeted Healing Pass*. We re-open the merged PDF, scan for corrupted
> subsets, and re-embed a full English *Century Gothic* font using
> {{PDResources.put()}} and {{PDPageContentStream.AppendMode.APPEND}}.
> Despite this, the inconsistency persists.
>
>
>
> # Is there a built-in mechanism in {{PDFMergerUtility}} to "flatten" or
> deduplicate subsetted fonts during the merge to prevent CMap clashing?
> # Given that Format 14/12 warnings are logged but don't throw exceptions, is
> there a recommended way to programmatically detect this "data loss" state
> before the file is saved?
> # Are there known issues with {{setupTempFileOnly()}} vs
> {{setupMainMemoryOnly()}} when dealing with large, complex vector drawings
> that might contribute to trailer parsing failures?
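
On the second question (detecting the warning state programmatically), one possible approach is to capture the FontBox warnings while the merge runs. PDFBox 2.0.x logs through Apache Commons Logging. The sketch below assumes that Commons Logging is bound to `java.util.logging` in this deployment, which may not be the case. With Log4j or another backend, an equivalent appender would be needed. The class `WarningCollector` is hypothetical. The logger name is taken from the log lines quoted in this report.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical collector for WARNING (and above) records under one logger name. */
final class WarningCollector extends Handler implements AutoCloseable {
    // Keep a strong reference so the logger is not garbage collected while attached.
    private final Logger logger;
    private final List<String> messages = Collections.synchronizedList(new ArrayList<>());

    WarningCollector(String loggerName) {
        this.logger = Logger.getLogger(loggerName);
        setLevel(Level.WARNING);
        logger.addHandler(this);
    }

    @Override
    public void publish(LogRecord record) {
        if (isLoggable(record)) {
            messages.add(String.valueOf(record.getMessage()));
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    /** Detaches the handler from the logger. */
    @Override
    public void close() {
        logger.removeHandler(this);
    }

    /** Returns a snapshot of the captured messages. */
    List<String> messages() {
        synchronized (messages) {
            return new ArrayList<>(messages);
        }
    }

    boolean hasWarnings() {
        return !messages.isEmpty();
    }
}
```

Wrapping the merge call in `try (WarningCollector c = new WarningCollector("org.apache.fontbox.ttf.CmapSubtable")) { ... }` and checking `c.hasWarnings()` afterwards would flag suspect outputs before they are published. Note that the handler is process-wide, so concurrent merges in the same JVM would see each other's warnings.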
>
>
>
> ======================= Code I used for the merge =====================
> private void mergeUsingPDFBox(List<String> pdfFiles, String outputFile)
>         throws IOException {
>     PDFMergerUtility merger = new PDFMergerUtility();
>     merger.setDestinationFileName(outputFile);
>     for (String file : pdfFiles) {
>         merger.addSource(new File(file));
>     }
>     // Buffer intermediate data in temporary files rather than on the heap
>     merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
> }
> =====================================================
>
>
>
> ===================== I tried to fix the missing embedded fonts by
> re-embedding them, but the issue persists. ===============================
> private void mergeUsingPDFBox(List<String> pdfFiles, String outputPath)
>         throws IOException {
>     org.apache.pdfbox.multipdf.PDFMergerUtility merger =
>             new org.apache.pdfbox.multipdf.PDFMergerUtility();
>     merger.setDestinationFileName(outputPath);
>     System.out.println("\n[PHASE 1] Initial PDF Merging...");
>     for (String filePath : pdfFiles) {
>         File sourceFile = new File(filePath);
>
>         // LOGIC: Prevent "Missing root object specification in trailer" error.
>         // This happens if we try to merge a file that is still 0 bytes
>         // (locked by converter).
>         if (sourceFile.exists() && sourceFile.length() > 100) {
>             merger.addSource(sourceFile);
>             System.out.println(" --> Added to merge queue: "
>                     + sourceFile.getName() + " [" + sourceFile.length() + " bytes]");
>         } else {
>             System.out.println(" WARN: Skipping empty/invalid file "
>                     + "(might be locked by conversion): " + filePath);
>         }
>     }
>     // Execute merge buffered in main memory (note: this uses the JVM heap;
>     // setupTempFileOnly() is the setting that buffers on disk instead)
>     merger.mergeDocuments(
>             org.apache.pdfbox.io.MemoryUsageSetting.setupMainMemoryOnly());
>     System.out.println("[PHASE 2] Merge Saved to Disk. Starting CMap "
>             + "Corruption Scan...");
>     // RE-OPEN the result to identify and heal Japanese encoding issues
>     try (PDDocument mergedDoc = PDDocument.load(new File(outputPath))) {
>
>         // LOGIC: Remove security. Modified content streams are blocked
>         // if owner password exists.
>         if (mergedDoc.isEncrypted()) {
>             System.out.println("DEBUG: Removing encryption to allow font "
>                     + "re-embedding...");
>             mergedDoc.setAllSecurityToBeRemoved(true);
>         }
>         // Load replacement font ONCE per document loop for memory efficiency
>         String fontPath = getFontFileFor("GOTHIC");
>         try (InputStream fontStream = new FileInputStream(new File(fontPath))) {
>
>             // Load FULL font (no subsetting) to ensure all English
>             // character indices exist
>             PDType0Font englishFont =
>                     PDType0Font.load(mergedDoc, fontStream, false);
>             int repairCount = 0;
>             for (int i = 0; i < mergedDoc.getNumberOfPages(); i++) {
>                 PDPage page = mergedDoc.getPage(i);
>                 PDResources res = page.getResources();
>                 if (res == null) continue;
>                 boolean pageHasError = false;
>
>                 // STEP A: Detect CMap Corruption (Format 12/14 warnings)
>                 for (COSName fontAlias : res.getFontNames()) {
>                     PDFont font = res.getFont(fontAlias);
>
>                     // Use our helper to force a check of the internal
>                     // font mapping
>                     if (isFontCorrupted(font)) {
>                         System.out.println(" ALERT: Page " + (i + 1)
>                                 + " contains corrupted Japanese CMap/Subsets. Repairing...");
>                         pageHasError = true;
>                         break;
>                     }
>                 }
>                 // STEP B: Heal the problematic page
>                 if (pageHasError) {
>                     for (COSName fontAlias : res.getFontNames()) {
>                         String name =
>                                 res.getFont(fontAlias).getName().toUpperCase();
>
>                         // LOGIC: Target Gothic subsets (+) that failed
>                         // the validation check
>                         if (name.contains("GOTHIC") || name.contains("+")
>                                 || name.contains("MS-")) {
>                             res.put(fontAlias, englishFont);
>                             repairCount++;
>                         }
>                     }
>
>                     // LOGIC: Re-render the operational stream.
>                     // Adding a space forces the PDF viewer to reload the
>                     // character map using our new font
>                     try (PDPageContentStream cs = new PDPageContentStream(
>                             mergedDoc, page,
>                             PDPageContentStream.AppendMode.APPEND, true, true)) {
>                         cs.beginText();
>                         cs.setFont(englishFont, 1);
>                         cs.newLineAtOffset(0, 0);
>                         cs.showText(" ");
>                         cs.endText();
>                     }
>                 }
>             }
>             System.out.println("INFO : Total font mappings repaired "
>                     + "during final pass: " + repairCount);
>         }
>         // Final Save: Overwrite the merged file with the high-fidelity
>         // English version
>         mergedDoc.save(outputPath);
>         System.out.println("[PHASE 3] Final healing pass complete. Output "
>                 + "verified.");
>     } catch (Exception e) {
>         System.err.println("CRITICAL ERROR: Failed to heal the merged PDF: "
>                 + e.getMessage());
>         e.printStackTrace();
>     }
>     System.out.println(">>> SUCCESS! High-Fidelity PDF saved to: "
>             + outputPath + "\n");
> }
> =================================================================
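
One detail in the healing pass above may be worth isolating: the document is loaded from `outputPath` and then saved back to the same path while it is still open. As far as I understand, PDFBox 2.0.x can read objects lazily from the source file, so overwriting it in place may be risky. Treat that as an assumption to verify against the PDFBox documentation. A minimal JDK-only sketch of the usual alternative (write to a sibling temporary file, then swap it into place) follows. The class `TempThenMove` and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: write to a sibling temp file, then move it over the target. */
final class TempThenMove {
    private TempThenMove() {}

    /** Creates an empty temp file in the same directory as the target. */
    static Path newSibling(Path target) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        return Files.createTempFile(dir, "healed-", ".pdf");
    }

    /** Replaces the target with the temp file; call this after the source is closed. */
    static void promote(Path tmp, Path target) throws IOException {
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

In the healing pass this would mean calling `mergedDoc.save(tmp.toFile())` inside the try-with-resources block and `TempThenMove.promote(tmp, Paths.get(outputPath))` only after the block has closed the document, since Windows generally refuses to replace a file that is still open.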
> --
> *Logs Captured during Merge:* Our internal diagnostic tools show the
> following warnings from FontBox during the failing merges:
> * {{org.apache.fontbox.ttf.CmapSubtable: Format 14 cmap table is not
> supported and will be ignored}}
> * {{org.apache.fontbox.ttf.CmapSubtable: Format 12 cmap contains an invalid
> glyph index}}
> *Questions for the Community:*
> # Why would the merger intermittently corrupt the CMap of a subsetted font
> that is already valid in the source document?
> # Is there a way to force {{PDFMergerUtility}} to *not* rename font subsets
> during merging, as we suspect alias clashing is causing the 80% failure rate?
> # Is there a more reliable way to "flatten" these Japanese fonts during the
> merge process to ensure 100% rendering success?
> *How can we resolve this issue? Please help us.*
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]