[
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873485#comment-17873485
]
Tilman Hausherr edited comment on PDFBOX-5868 at 8/14/24 10:30 AM:
-------------------------------------------------------------------
The /ActualText entry is in the content stream:
!screenshot-1.png!
Here's some code to detect it:
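{code:java}
import java.net.URL;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// Note: PDDocument.load(...) is the PDFBox 2.x API; in 3.x documents are
// loaded through org.apache.pdfbox.Loader instead.
String urlText =
        "https://issues.apache.org/jira/secure/attachment/13070873/multilingual_test.pdf";
try (PDDocument doc = PDDocument.load(new URL(urlText).openStream()))
{
    for (int p = 0; p < doc.getNumberOfPages(); ++p)
    {
        PDPage page = doc.getPage(p);
        // Walk the page's content stream token by token.
        PDFStreamParser pdfStreamParser = new PDFStreamParser(page);
        Object token = pdfStreamParser.parseNextToken();
        while (token != null)
        {
            // Look for an inline dictionary that carries an /ActualText key.
            if (token instanceof COSDictionary
                    && ((COSDictionary) token).containsKey(COSName.ACTUAL_TEXT))
            {
                System.out.println("/ActualText in page " + (p + 1));
                break; // one hit per page is enough
            }
            token = pdfStreamParser.parseNextToken();
        }
        pdfStreamParser.close();
    }
}
{code}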
However, the presence of /ActualText doesn't mean that the extraction will be
bad. For example, the extraction of the first page looks OK. There are also many
other possible reasons for a bad text extraction, e.g. obfuscation.
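Since finding /ActualText alone says little about extraction quality, a rough check on the extracted text itself may help decide which pages deserve a closer look. The following is a minimal sketch using only the Java standard library; it is not part of PDFBox. The names `ExtractionCheck` and `suspiciousRatio`, and the choice of which code points count as suspicious, are assumptions made for illustration. It flags only replacement characters and private-use, unassigned, control, or lone-surrogate code points, so text mapped to valid but wrong letters will pass unnoticed.

```java
final class ExtractionCheck {

    /**
     * Returns the fraction of non-whitespace code points that look like
     * extraction damage (heuristic, assumption: U+FFFD, private-use,
     * unassigned, control and lone surrogate code points are "suspicious").
     */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // line breaks and spaces are not counted at all
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.CONTROL
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    public static void main(String[] args) {
        // Hypothetical samples: a Tamil word, then text mixed with private-use code points.
        System.out.println(suspiciousRatio("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"));
        System.out.println(suspiciousRatio("\uE000\uE001ab"));
    }
}
```

A caller could run this on each page's extracted text and inspect pages whose ratio exceeds some threshold; any such threshold would need tuning against real documents.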
> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
> ---------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Priority: Major
> Attachments: adobe_out.txt, multilingual_test.pdf, pdfbox_out.txt,
> screenshot-1.png
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used
> the export:text command line tool to obtain the results.
> * multilingual_test.pdf is the original PDF I made to test multilingual
> text extraction.
> * pdfbox_out.txt is the text file produced by PDFBox.
> * adobe_out.txt is the text file created by Adobe Reader's "Save as Text"
> feature.
>
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows
> unexpected Unicode characters for Tamil and Bengali (for Hindi the characters
> are extracted but not overlapped; Japanese seems fine to me). In contrast, the
> text file obtained from Adobe Reader's "Save as Text" feature seems fine, and
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
> # Why are the outputs from PDFBox and Adobe Reader different?
> # What can I do to extract the text from a multilingual PDF correctly?
> # Is there a way to apply pattern matching to the text in a PDF file and
> report matches without extracting the text first (say, if the problem is with
> fonts and glyphs)?
> ---
> My use case, FYI:
> I am trying to extract text from files and run pattern matching to identify
> PII in them, as an app in which users can define their own patterns. I am
> using Apache Tika for parsing documents. I noticed a problem with the
> extracted PDF text (other file types parse fine). I used the executable PDFBox
> jar to conclude that the _problem is in PDFBox and not in Tika._ I tested with
> Adobe Reader's text extraction to confirm the problem is not with the PDF. I
> want to extract this multilingual text only to run pattern matching on it; I
> do not need to display the content, only to know whether a pattern is present.
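
For the pattern-matching step described in the use case, one factor that can be controlled independently of the extractor is Unicode handling in the matcher. The sketch below uses only the Java standard library: it normalizes both the text and the pattern to NFC and enables Unicode-aware character classes. `PatternCheck` and `containsPattern` are hypothetical names, and normalizing the pattern string assumes it holds mostly literal text. It does not repair text that was extracted incorrectly in the first place.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class PatternCheck {

    /**
     * Reports whether the regex matches anywhere in the extracted text,
     * after bringing both to NFC so that composed and decomposed forms
     * of the same character compare equal.
     */
    static boolean containsPattern(String extractedText, String regex) {
        String text = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        // Assumption: the pattern is mostly literal, so NFC does not change its meaning.
        String normalizedRegex = Normalizer.normalize(regex, Normalizer.Form.NFC);
        Pattern pattern = Pattern.compile(
                normalizedRegex,
                Pattern.CASE_INSENSITIVE
                        | Pattern.UNICODE_CASE
                        | Pattern.UNICODE_CHARACTER_CLASS); // \d, \w cover non-Latin scripts
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        // Decomposed "e" + combining acute in the text, precomposed letter in the pattern.
        System.out.println(containsPattern("Cafe\u0301 menu", "caf\u00E9"));
        // Devanagari digits matched by \d under UNICODE_CHARACTER_CLASS.
        System.out.println(containsPattern("\u0967\u0968\u0969\u096A", "\\d{4}"));
        // Tamil vowel sign written as two code points in the text, one in the pattern.
        System.out.println(containsPattern("\u0B95\u0BC6\u0BBE", "\u0B95\u0BCA"));
    }
}
```

Only a yes/no answer is returned, which matches the stated need to know whether a pattern is present without displaying the content.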