[ https://issues.apache.org/jira/browse/PDFBOX-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951282#comment-17951282 ]
Greta commented on PDFBOX-6007:
-------------------------------

Thank you for your answer. After analyzing your suggestion, I would like to suggest a different approach: a new method that handles cases where a diacritic is incorrectly mapped as a space.
{code:java}
// A "space" that is much narrower than the font size and lies inside the
// previous glyph's bounds is treated as a misidentified diacritic.
private boolean isMisidentifiedDiacritic(TextPosition candidate, TextPosition previous) {
    return " ".equals(candidate.getUnicode())
            && candidate.getWidth() < candidate.getFontSize() * 0.1
            && previous.contains(candidate);
}{code}
This method would be called from _processTextPosition_ by adding an additional _else if_ branch:
{code:java}
else if (isMisidentifiedDiacritic(text, previousTextPosition)) {
    previousTextPosition.mergeDiacritic(text);
}{code}

> Incorrect Word Splitting During Text Extraction When Special Characters Are
> Rendered Using Fallback Fonts
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Greta
>            Priority: Trivial
>              Labels: newbie
>             Fix For: 3.0.6 PDFBox
>
>         Attachments: lithuanian_words.pdf
>
>
> When extracting text from PDFs where words contain special language
> characters (for example, ą, č, ę, ė, į, š, ų, ū, ž) not supported by the
> originally used font, these characters are rendered using a fallback/default
> font. This often results in slight visual gaps after the special character
> due to differing font metrics.
> During text extraction, PDFBox interprets these visual gaps as word
> boundaries, causing words to be incorrectly split. This behavior negatively
> affects natural language processing, search indexing, and text analysis on
> extracted content.
> *An example:*
> Words in PDF: žiema, šaltis, ąžuolas, važiavimas, žąsis
> Extracted text: ž iema, šaltis, ąž uolas, važ iavimas, ž ąsis
> I have uploaded a test PDF file that contains more Lithuanian words written
> with different fonts that do not support Lithuanian language special
> characters.
>
> To resolve the issue of unintended spaces being inserted during text
> extraction, I propose enhancing the current logic in {{PDFTextStripper.java}}
> that handles space glyphs.
> Current implementation:
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs()) {
>     continue;
> }{code}
> This logic only skips space characters if the
> {{ignoreContentStreamSpaceGlyphs}} flag is enabled, without considering the
> actual visual spacing.
>
> Proposed improvement:
>
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue)) {
>     if (getIgnoreContentStreamSpaceGlyphs()) {
>         continue;
>     }
>     float actualSpaceWidth = position.getWidth();
>     float expectedSpaceWidth = position.getWidthOfSpace();
>     float threshold = expectedSpaceWidth * 0.5f;
>     if (actualSpaceWidth < threshold) {
>         continue;
>     }
> }
> {code}
>
> The proposed fix skips space characters that are visually too narrow to be
> real word separators, preventing incorrect word splits caused by font
> fallback or character spacing differences.
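
For anyone who wants to experiment with the width threshold from the quoted proposal without building PDFBox, here is a minimal standalone sketch. It is not PDFBox code: the class SpaceWidthSketch, the Glyph stand-in, and the glyph widths are all hypothetical and chosen only for illustration. The 0.5 ratio is the one from the proposal, and the sample words are "žiema" and "šaltis" from the example, written with Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the width-threshold heuristic. It does not use PDFBox;
// Glyph is a hypothetical stand-in for the few TextPosition values involved.
final class SpaceWidthSketch {

    static final class Glyph {
        final String unicode;      // decoded text of the glyph
        final float width;         // rendered width of this glyph
        final float widthOfSpace;  // nominal space width of the glyph's font

        Glyph(String unicode, float width, float widthOfSpace) {
            this.unicode = unicode;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    // Ratio from the proposal: a space narrower than half the nominal space
    // width is assumed to be an artifact rather than a word separator.
    static final float THRESHOLD_RATIO = 0.5f;

    static boolean isTooNarrowForWordBreak(float actualWidth, float expectedSpaceWidth) {
        return actualWidth < expectedSpaceWidth * THRESHOLD_RATIO;
    }

    // Concatenates glyph text, dropping spaces that fail the width check.
    static String extract(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (" ".equals(g.unicode) && isTooNarrowForWordBreak(g.width, g.widthOfSpace)) {
                continue; // too narrow to be a real word separator
            }
            sb.append(g.unicode);
        }
        return sb.toString();
    }

    // Adds one glyph per character, using made-up metrics (width 5, space width 2.5).
    private static void addLetters(List<Glyph> out, String letters) {
        for (int i = 0; i < letters.length(); i++) {
            out.add(new Glyph(String.valueOf(letters.charAt(i)), 5.0f, 2.5f));
        }
    }

    // Builds "z-caron" + narrow space + "iema" + normal space + "s-caron" + "altis".
    static String demo() {
        List<Glyph> glyphs = new ArrayList<>();
        addLetters(glyphs, "\u017E");
        glyphs.add(new Glyph(" ", 0.8f, 2.5f)); // narrow gap after the fallback-font glyph
        addLetters(glyphs, "iema");
        glyphs.add(new Glyph(" ", 2.5f, 2.5f)); // genuine word separator
        addLetters(glyphs, "\u0161altis");
        return extract(glyphs);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the narrow space after the first letter is dropped while the full-width space between the two words is kept. Whether 0.5 is a suitable ratio for real documents would need to be checked against the attached lithuanian_words.pdf.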