[ 
https://issues.apache.org/jira/browse/PDFBOX-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-6163:
------------------------------------
    Affects Version/s: 2.0.35

> Technical Issue Report: Intermittent Data Loss and Font Missing Errors during 
> PDF Merging
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6163
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6163
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.33, 2.0.35
>         Environment: Production environment, Windows Server 2019
>            Reporter: Mruthyunjaya S
>            Priority: Critical
>         Attachments: After Merge Of the PDFS.png, DataLoss.png, 
> Ezk00aGGaL8oi2A3.png, before Merge.png
>
>
> h3. *Technical Issue Report: Intermittent Data Loss and Font Missing Errors 
> during PDF Merging*
> *Environment:*
>  * *Library:* PDFBox 2.0.33
>  * *Operating System:* Windows Server 2019
>  * *Application Context:* Windchill MethodServer
>  * *Affected Fonts:* Japanese fonts, specifically *MS Gothic* (Subsetted)
> *Issue Summary:* We are experiencing a high failure rate ({*}80% failure{*}) 
> when merging technical CAD drawings into a single PDF package. Even though 
> the source PDFs have fonts embedded, the generated "Merged PDF" frequently 
> suffers from missing fonts or garbled text on specific pages.
> *Key Observations:*
>  * *Reproducibility:* In a test of 10 consecutive merge attempts, the output 
> was correct only {*}2 times{*}, while the remaining *8 attempts* resulted in 
> missing fonts.
>  * *Specific Error:* The issue is almost exclusively linked to Japanese *MS 
> Gothic* variants (e.g., {{ACWDKT+MS Gothic}}).
>  * *Error Logs:* We frequently encounter {{Format 14 cmap table is not supported}} 
> and {{Format 12 cmap contains an invalid glyph index}} warnings during the process.
>  * *Client Behavior:* Adobe Acrobat fails to render the text, displaying the 
> error: {_}"Cannot extract the embedded font 'ACWDKT+MS Gothic'. Some 
> characters may not display or print correctly."{_}.
> *Suspected Causes:*
>  # *I/O Race Condition:* We intermittently receive {{java.io.IOException: 
> Missing root object specification in trailer}}, suggesting the merger may 
> be accessing files before they are fully flushed to disk or while they are 
> still locked by the external converter.
>  # *Resource Clashing:* We suspect the {{PDFMergerUtility}} may be clashing 
> font resource aliases (like {{{}/F1{}}}) across different subsetted drawings, 
> leading to corrupted Character Maps (CMaps) in the final document.
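The suspected I/O race in item 1 could be narrowed down by checking each source file before it is handed to the merger. The following is only a minimal sketch using the JDK alone: `PdfReadyCheck`, `looksComplete` and `isStable` are hypothetical names, not PDFBox API, and the test (a `%PDF-` header, a `%%EOF` marker near the end, and an unchanged size after a short wait) is a heuristic rather than a guarantee that the converter has finished.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of PDFBox): heuristic check that a PDF
// has been completely written before it is added to a merge.
final class PdfReadyCheck {

    // True if the file starts with "%PDF-" and has "%%EOF" in its last 1024 bytes.
    static boolean looksComplete(Path file) throws IOException {
        long size = Files.size(file);
        if (size < 16) {
            return false; // empty or far too small to be a PDF
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] head = new byte[5];
            raf.readFully(head);
            if (!"%PDF-".equals(new String(head, StandardCharsets.ISO_8859_1))) {
                return false;
            }
            int tailLen = (int) Math.min(1024, size);
            byte[] tail = new byte[tailLen];
            raf.seek(size - tailLen);
            raf.readFully(tail);
            return new String(tail, StandardCharsets.ISO_8859_1).contains("%%EOF");
        }
    }

    // True if the size did not change during waitMillis and the file looks complete.
    static boolean isStable(Path file, long waitMillis)
            throws IOException, InterruptedException {
        long before = Files.size(file);
        Thread.sleep(waitMillis);
        return Files.size(file) == before && looksComplete(file);
    }
}
```

A caller might skip or retry any source for which `isStable(path, 500)` returns false or throws an `IOException` (which can happen on Windows while the converter still holds a lock). If the trailer errors disappear with such a guard in place, that would point at the converter hand-off rather than at the merge itself.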
> *Current Code Implementation:* We have attempted to fix this by implementing 
> a {*}Targeted Healing Pass{*}. We re-open the merged PDF, scan for corrupted 
> subsets, and re-embed a full English *Century Gothic* font using 
> {{PDResources.put()}} and {{{}PDPageContentStream.AppendMode.APPEND{}}}. 
> Despite this, the inconsistency persists.
>  
>  
>  
>  # Is there a built-in mechanism in {{PDFMergerUtility}} to "flatten" or 
> deduplicate subsetted fonts during the merge to prevent CMap clashing?
>  # Given that Format 14/12 warnings are logged but don't throw exceptions, is 
> there a recommended way to programmatically detect this "data loss" state 
> before the file is saved?
>  # Are there known issues with {{setupTempFileOnly()}} vs 
> {{setupMainMemoryOnly()}} when dealing with large, complex vector drawings 
> that might contribute to trailer parsing failures?
>  
>  
>  
> ======================= Code used for the merge =====================
> private void mergeUsingPDFBox(List<String> pdfFiles, String outputFile) throws IOException {
>         PDFMergerUtility merger = new PDFMergerUtility();
>         merger.setDestinationFileName(outputFile);
>         for (String file : pdfFiles) {
>             merger.addSource(new File(file));
>         }
>         merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
>     }
> =====================================================
>  
>  
>  
> ===================== I tried to fix the missing embedded fonts by re-embedding, 
> but the issue persists. ===============================
> private void mergeUsingPDFBox(List<String> pdfFiles, String outputPath) throws IOException {
>         org.apache.pdfbox.multipdf.PDFMergerUtility merger = new org.apache.pdfbox.multipdf.PDFMergerUtility();
>         merger.setDestinationFileName(outputPath);
>         System.out.println("\n[PHASE 1] Initial PDF Merging...");
>         for (String filePath : pdfFiles) {
>             File sourceFile = new File(filePath);
>             
>             // LOGIC: Prevent "Missing root object specification in trailer" error.
>             // This happens if we try to merge a file that is still 0-bytes (locked by converter).
>             if (sourceFile.exists() && sourceFile.length() > 100) {
>                 merger.addSource(sourceFile);
>                 System.out.println("  --> Added to merge queue: " + sourceFile.getName() + " [" + sourceFile.length() + " bytes]");
>             } else {
>                 System.out.println("  WARN: Skipping empty/invalid file (might be locked by conversion): " + filePath);
>             }
>         }
>         // Execute merge buffering everything in main memory. Note that this setting uses
>         // the JVM heap; setupTempFileOnly() is the one that buffers in temp files instead.
>         merger.mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting.setupMainMemoryOnly());
>         System.out.println("[PHASE 2] Merge Saved to Disk. Starting CMap Corruption Scan...");
>         // RE-OPEN the result to identify and heal Japanese encoding issues
>         try (PDDocument mergedDoc = PDDocument.load(new File(outputPath))) {
>             
>             // LOGIC: Remove security. Modified content streams are blocked if owner password exists.
>             if (mergedDoc.isEncrypted()) {
>                 System.out.println("DEBUG: Removing encryption to allow font re-embedding...");
>                 mergedDoc.setAllSecurityToBeRemoved(true);
>             }
>             // Load replacement font ONCE per document loop for memory efficiency
>             String fontPath = getFontFileFor("GOTHIC");
>             try (InputStream fontStream = new FileInputStream(new File(fontPath))) {
>                 
>                 // Load FULL font (no subsetting) to ensure all English character indices exist
>                 PDType0Font englishFont = PDType0Font.load(mergedDoc, fontStream, false);
>                 int repairCount = 0;
>                 for (int i = 0; i < mergedDoc.getNumberOfPages(); i++) {
>                     PDPage page = mergedDoc.getPage(i);
>                     PDResources res = page.getResources();
>                     if (res == null) continue;
>                     boolean pageHasError = false;
>                     
>                     // STEP A: Detect CMap Corruption (Format 12/14 warnings)
>                     for (COSName fontAlias : res.getFontNames()) {
>                         PDFont font = res.getFont(fontAlias);
>                         
>                         // Use our helper to force a check of the internal font mapping
>                         if (isFontCorrupted(font)) {
>                             System.out.println("  ALERT: Page " + (i + 1) + " contains corrupted Japanese CMap/Subsets. Repairing...");
>                             pageHasError = true;
>                             break; 
>                         }
>                     }
>                     // STEP B: Heal the problematic page
>                     if (pageHasError) {
>                         for (COSName fontAlias : res.getFontNames()) {
>                             String name = res.getFont(fontAlias).getName().toUpperCase();
>                             
>                             // LOGIC: Target Gothic subsets (+) that failed the validation check
>                             if (name.contains("GOTHIC") || name.contains("+") || name.contains("MS-")) {
>                                 res.put(fontAlias, englishFont);
>                                 repairCount++;
>                             }
>                         }
>                         
>                         // LOGIC: Re-render the operational stream
>                         // Adding a space forces the PDF viewer to reload the character map using our new font
>                         try (PDPageContentStream cs = new PDPageContentStream(mergedDoc, page, 
>                                 PDPageContentStream.AppendMode.APPEND, true, true)) {
>                             cs.beginText();
>                             cs.setFont(englishFont, 1);
>                             cs.newLineAtOffset(0, 0);
>                             cs.showText(" "); 
>                             cs.endText();
>                         }
>                     }
>                 }
>                 System.out.println("INFO : Total font mappings repaired during final pass: " + repairCount);
>             }
>             // Final Save: Overwrite the merged file with the high-fidelity English version
>             mergedDoc.save(outputPath);
>             System.out.println("[PHASE 3] Final healing pass complete. Output verified.");
>         } catch (Exception e) {
>             System.err.println("CRITICAL ERROR: Failed to heal the merged PDF: " + e.getMessage());
>             e.printStackTrace();
>         }
>         System.out.println(">>> SUCCESS! High-Fidelity PDF saved to: " + outputPath + "\n");
>     }
> =================================================================
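One detail in the healing pass above: `name.contains("+")` matches any font name with a plus sign in it, not only subsets. In PDF, a subset font name carries a tag of exactly six uppercase letters followed by `+` (as in `ACWDKT+MS Gothic`), so a stricter test is possible. The sketch below is illustrative only; `SubsetFontNames`, `isSubsetName` and `baseName` are hypothetical names, and it assumes the original, non-uppercased font name is passed in.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): recognise PDF subset font names,
// which start with a six-uppercase-letter tag and '+', e.g. "ACWDKT+MS Gothic".
final class SubsetFontNames {

    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // True only if the name starts with a subset tag such as "ACWDKT+".
    static boolean isSubsetName(String fontName) {
        return fontName != null && SUBSET_TAG.matcher(fontName).find();
    }

    // Strips the seven-character "XXXXXX+" prefix if present, else returns the name as-is.
    static String baseName(String fontName) {
        return isSubsetName(fontName) ? fontName.substring(7) : fontName;
    }
}
```

With this, `isSubsetName("ACWDKT+MS Gothic")` is true while a name such as `"MS-Gothic"` is not treated as a subset, which would keep the repair loop from replacing fonts that were never subsetted.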
> -- 
> *Logs Captured during Merge:* Our internal diagnostic tools show the 
> following warnings from FontBox during the failing merges:
>  * {{org.apache.fontbox.ttf.CmapSubtable: Format 14 cmap table is not supported and will be ignored}}
>  * {{org.apache.fontbox.ttf.CmapSubtable: Format 12 cmap contains an invalid glyph index}}
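Since these warnings are only logged and never thrown, one possible way to notice them programmatically is to attach a handler to the logger that emits them. The sketch below rests on an assumption that should be verified in the actual deployment: that the logging facade used by PDFBox 2.0.x (commons-logging) is routed to `java.util.logging`, and that the logger name equals the class name shown in the log lines above. `CmapWarningCounter` is a hypothetical class, not PDFBox API.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical helper: counts WARNING-or-higher records on the FontBox cmap logger.
// Assumes the log output reaches java.util.logging under this logger name.
final class CmapWarningCounter extends Handler {

    static final String LOGGER_NAME = "org.apache.fontbox.ttf.CmapSubtable";

    // Strong reference so the logger (and its handlers) is not garbage collected.
    private static final Logger LOGGER = Logger.getLogger(LOGGER_NAME);

    private final AtomicInteger warnings = new AtomicInteger();

    static CmapWarningCounter attach() {
        CmapWarningCounter handler = new CmapWarningCounter();
        LOGGER.addHandler(handler);
        return handler;
    }

    void detach() {
        LOGGER.removeHandler(this);
    }

    int count() {
        return warnings.get();
    }

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
            warnings.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // nothing buffered
    }

    @Override
    public void close() {
        // nothing to release
    }
}
```

A merge could then be wrapped as `attach()`, merge, `count()`, `detach()`, and any non-zero count treated as a reason to inspect the output before it is published. The counter is process-wide, so with concurrent merges in one JVM it cannot tell which merge triggered a warning.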
> *Questions for the Community:*
>  # Why would the merger intermittently corrupt the CMap of a subsetted font 
> that is already valid in the source document?
>  # Is there a way to force {{PDFMergerUtility}} to *not* rename font subsets 
> during merging, as we suspect alias clashing is causing the 80% failure rate?
>  # Is there a more reliable way to "flatten" these Japanese fonts during the 
> merge process to ensure 100% rendering success?
> *How can we resolve this issue? Please help us.*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
