[jira] [Commented] (PDFBOX-6007) Incorrect Word Splitting During Text Extraction When Special Characters Are Rendered Using Fallback Fonts

Greta (Jira) Tue, 13 May 2025 13:45:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951282#comment-17951282
 ]


Greta commented on PDFBOX-6007:
-------------------------------

Thank you for your answer.

After analyzing your suggestion, I would like to propose a different approach: 
a new method that handles cases where a diacritic is incorrectly mapped to a 
space.

 
{code:java}
private boolean isMisidentifiedDiacritic(TextPosition candidate, TextPosition previous)
{
    return " ".equals(candidate.getUnicode())
        && candidate.getWidth() < candidate.getFontSize() * 0.1
        && previous.contains(candidate);
}{code}
 

This method would be called from the _processTextPosition_ method by adding an 
additional _else if_ branch:

 

 
{code:java}
else if (isMisidentifiedDiacritic(text, previousTextPosition)) {
    previousTextPosition.mergeDiacritic(text);
}{code}
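
To illustrate the check outside PDFBox, here is a minimal self-contained sketch. 
The {{Glyph}} class is a hypothetical stand-in for {{TextPosition}} and carries 
only the fields the heuristic reads. Its {{contains}} is a simplified horizontal 
containment test, not PDFBox's actual logic, and the null guard on {{previous}} 
is an added assumption.

```java
class DiacriticSketch {
    /** Hypothetical stand-in for PDFBox's TextPosition (assumption, not the real class). */
    static final class Glyph {
        final String unicode;
        final float x;        // left edge of the glyph
        final float width;    // advance width
        final float fontSize;

        Glyph(String unicode, float x, float width, float fontSize) {
            this.unicode = unicode;
            this.x = x;
            this.width = width;
            this.fontSize = fontSize;
        }

        /** Simplified check: true when the other glyph lies horizontally inside this one. */
        boolean contains(Glyph other) {
            return other.x >= x && other.x + other.width <= x + width;
        }
    }

    /** A "space" that is very narrow and sits inside the previous glyph is likely a diacritic. */
    static boolean isMisidentifiedDiacritic(Glyph candidate, Glyph previous) {
        return previous != null
            && " ".equals(candidate.unicode)
            && candidate.width < candidate.fontSize * 0.1f
            && previous.contains(candidate);
    }

    public static void main(String[] args) {
        Glyph base = new Glyph("z", 100f, 5f, 12f);
        Glyph narrowInside = new Glyph(" ", 101f, 1f, 12f); // narrow, overlaps the base glyph
        Glyph realSpace = new Glyph(" ", 105f, 3f, 12f);    // normal-width space after the base glyph

        System.out.println(isMisidentifiedDiacritic(narrowInside, base)); // true
        System.out.println(isMisidentifiedDiacritic(realSpace, base));    // false
    }
}
```

The 0.1 factor is the threshold from the proposal above. Whether it holds across 
fonts would need to be checked against the attached test PDF.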
 

 

> Incorrect Word Splitting During Text Extraction When Special Characters Are 
> Rendered Using Fallback Fonts
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Greta
>            Priority: Trivial
>              Labels: newbie
>             Fix For: 3.0.6 PDFBox
>
>         Attachments: lithuanian_words.pdf
>
>
> When extracting text from PDFs where words contain special language 
> characters (for example, ą, č, ę, ė, į, š, ų, ū, ž) not supported by the 
> originally used font, these characters are rendered using a fallback/default 
> font. This often results in slight visual gaps after the special character 
> due to differing font metrics.
> During text extraction, PDFBox interprets these visual gaps as word 
> boundaries, causing words to be incorrectly split. This behavior negatively 
> affects natural language processing, search indexing, and text analysis on 
> extracted content.
> *An example:*
> Words in PDF: žiema, šaltis, ąžuolas, važiavimas, žąsis
> Extracted text: ž iema, šaltis, ąž uolas, važ iavimas, ž ąsis
> I have uploaded a test PDF file that contains more Lithuanian words written 
> with different fonts that do not support Lithuanian language special 
> characters.
>  
> To resolve the issue of unintended spaces being inserted during text 
> extraction, I propose enhancing the current logic in {{PDFTextStripper.java}} 
> that handles space glyphs.
> Current implementation:
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs()) {
>     continue;
> }{code}
> This logic only skips space characters if the 
> {{ignoreContentStreamSpaceGlyphs}} flag is enabled, without considering the 
> actual visual spacing.
>  
> Proposed improvement:
>  
> {code:java}
> // PDFBOX-3774: conditionally ignore spaces from the content stream
> if (" ".equals(characterValue)) {
>     if (getIgnoreContentStreamSpaceGlyphs()) {
>         continue;
>     }
>     float actualSpaceWidth = position.getWidth();
>     float expectedSpaceWidth = position.getWidthOfSpace();
>     float threshold = expectedSpaceWidth * 0.5f;
>     if (actualSpaceWidth < threshold) {
>         continue;
>     }
> }
> {code}
>  
> The proposed fix skips space characters that are visually too narrow to be 
> real word separators, preventing incorrect word splits caused by font 
> fallback or character spacing differences.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
