[
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874189#comment-17874189
]
Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------
[~manish003] So you're appealing to our pride and think that such a transparent
manipulation attempt would work 😂
I had a look at PDFMarkedContentExtractor and at
https://stackoverflow.com/questions/78705656/ and
https://stackoverflow.com/questions/44029191/ . Using parts of
PDFMarkedContentExtractor in the stripper helps:
1) add
{code}
addOperator(new BeginMarkedContentSequenceWithProperties(this));
addOperator(new BeginMarkedContentSequence(this));
addOperator(new EndMarkedContentSequence(this));
{code}
to the constructor of the stripper
2) add
{code}
boolean inActualText = false;
boolean firstActualText = false;
String actualText = null;

@Override
public void endMarkedContentSequence()
{
    inActualText = false;
    //TODO add the text
    super.endMarkedContentSequence();
}

@Override
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
{
    PDMarkedContent mc = PDMarkedContent.create(tag, properties);
    actualText = mc.getActualText();
    if (actualText != null)
    {
        actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
        inActualText = true;
        firstActualText = true;
        //System.out.println("actualText: " + actualText);
    }
    super.beginMarkedContentSequence(tag, properties);
}
{code}
anywhere in the stripper class
3) add
{code}
if (inActualText)
{
    if (firstActualText)
    {
        text.setUnicode(actualText);
        firstActualText = false;
    }
    else
    {
        text.setUnicode("");
    }
}
{code}
at the beginning of {{processTextPosition(TextPosition text)}}.
4) add
{code}
void setUnicode(String unicode)
{
    this.unicode = unicode;
}
{code}
in the {{TextPosition}} class.
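To make the interplay of the two flags easier to follow, here is a minimal stand-alone sketch of the substitution logic from steps 2 and 3. It does not call PDFBox at all: {{ActualTextDemo}}, {{Glyph}}, {{beginMarkedContent}}, {{endMarkedContent}} and {{processGlyph}} are hypothetical stand-ins for the stripper, {{TextPosition}} and the overridden methods, so treat it as an illustration of the idea rather than the actual patch.

```java
/**
 * Stand-alone simulation of the ActualText substitution described above.
 * Nothing here is PDFBox API; all names are hypothetical stand-ins.
 */
class ActualTextDemo
{
    /** Stand-in for TextPosition, including the setter from step 4. */
    static final class Glyph
    {
        private String unicode;

        Glyph(String unicode)
        {
            this.unicode = unicode;
        }

        void setUnicode(String unicode)
        {
            this.unicode = unicode;
        }

        String getUnicode()
        {
            return unicode;
        }
    }

    private boolean inActualText = false;
    private boolean firstActualText = false;
    private String actualText = null;
    private final StringBuilder out = new StringBuilder();

    /** Mirrors beginMarkedContentSequence(); pass null if there is no /ActualText. */
    void beginMarkedContent(String actualTextEntry)
    {
        actualText = actualTextEntry;
        if (actualText != null)
        {
            actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
            inActualText = true;
            firstActualText = true;
        }
    }

    /** Mirrors endMarkedContentSequence(). */
    void endMarkedContent()
    {
        inActualText = false;
    }

    /** Mirrors the check at the beginning of processTextPosition(). */
    void processGlyph(Glyph text)
    {
        if (inActualText)
        {
            if (firstActualText)
            {
                // the first glyph of the sequence carries the whole replacement
                text.setUnicode(actualText);
                firstActualText = false;
            }
            else
            {
                // the remaining glyphs of the sequence contribute nothing
                text.setUnicode("");
            }
        }
        out.append(text.getUnicode());
    }

    String result()
    {
        return out.toString();
    }

    public static void main(String[] args)
    {
        ActualTextDemo demo = new ActualTextDemo();
        demo.processGlyph(new Glyph("A"));
        demo.beginMarkedContent("co\u00adop"); // ActualText with a soft hyphen
        demo.processGlyph(new Glyph("x"));
        demo.processGlyph(new Glyph("y"));
        demo.processGlyph(new Glyph("z"));
        demo.endMarkedContent();
        demo.processGlyph(new Glyph("B"));
        System.out.println(demo.result()); // prints AcoopB
    }
}
```

The three glyphs inside the marked-content sequence collapse into the single ActualText string, and glyphs outside it pass through unchanged. Like the snippets above, the sketch keeps no nesting depth, so a marked-content sequence nested inside an ActualText sequence would end the substitution early.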
There are lots of differences in the build texts; most are better, some look
weird (lots of spaces). Your file is now extracted differently (non-Latin parts):
 हिंदी   (hindi):
  तूँ तूँ करता  तूँ भया , मुझ मैं रही  न हूँ।
  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥

 जी वा त्मा  कह रही  है कि  ‘तू है’ ‘तू है’ कहते−कहते मेरा  अहंका र समा प्त हो
गया । इस तरह भगवा न पर न्यौ छा वर
 हो ते−हो ते मैं पूर्णतया  समर्पि त हो  गई। अब तो  जि धर देखती  हूँ उधर तू ही
दि खा ई देता  है।

  தமிழ் (tamil):

  ஆக்கம் அதர்வினா ய்ச் செ ல்லும் அசை விலா
 ஊக்க முடை யா  னுழை
நா மா ர்க்குங் குடியல்லோ ம் நமனை  யஞ்சோ ம்
நரகத்தி லிடர்ப்படோ ம் நடலை  யில்லோ ம்
ஏமா ப்போ ம் பிணியறியோ ம் பணிவோ  மல்லோ ம்

இன்பமே எந்நா ளுந் துன்ப மில்லை
தா மா ர்க்குங் குடியல்லா த் தன்மை  யா ன
சங்கரனற் சங்கவெ ண் குழை யோ ர் கா திற்
கோ மா ற்கே  நா மெ ன்றும் மீளா  ஆளா ய்க்
 கொ ய்ம்மலர்ச்சே  வடியிணை யே  குறுகி னோ மே .

 Bengali:
আঠা রো  বছর বয়স কী  দুঃ সহ
র্স্পধা য় নে য় মা থা  তো লবা র ঝুঁ কি ,
আঠা রো  বছর বয়সে ই অহরহ
বি রা ট দুঃ সা হসে রা  দে য় যে  উঁকি ।
আঠা রো  বছর বয়সে র নে ই ভয়
পদা ঘা তে  চা য় ভা ঙতে  পা থর বা ধা ,
এ বয়সে  কে উ মা থা  নো য়া বা র নয়-
আঠা রো  বছর বয়স জা নে  না  কাঁ দা ।
এ বয়স জা নে  রক্তদা নে র পুণ্য
 বা ষ্পে র বে গে  স্টি মা রে র মতো  চলে ,

 Japnese:
 古池や 蛙飛び込む 水の音

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
> ---------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Priority: Major
> Attachments: adobe_out.txt, multilingual_test.pdf, okular_out.txt,
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used
> the export:text command line tool to obtain the results
> * the multilingual_test.pdf is the original pdf i made to test multilingual
> text extraction.
> * the pdfbox_out.txt is the text file produced by pdfbox
> * the adobe_out.txt is the text file created by adobe reader's save as text
> feature
>
> Observation:
> as you can see in the attachment, the text file obtained by pdfbox shows weird
> unicode characters for tamil and bengali (for hindi the characters are
> extracted but not overlapped; japanese seems fine to me). in contrast, the text
> file obtained from adobe reader's save as text feature seems fine, and copy
> pasting the text from my document viewer (evince) also works.
> Questions:
> # why are the outputs from pdfbox and adobe different?
> # what can i do to extract the text from a multilingual pdf correctly?
> # Is there a way to apply pattern matching to text in pdf file and declare
> matches without extracting the text first? (say if the problem is with fonts
> and glyphs)
> ---
> My Usecase fyi:
> i am trying to extract text from files and run pattern matching. I am using
> apache tika for parsing documents. I noticed problem with extracted PDF text
> (other filetypes parse fine). used executable pdfbox jar to conclude that the
> _problem is in pdfbox and not in tika._ tested with adobe reader's extract
> text to confirm the problem is not with the pdf. i want to extract these
> multilingual text to run pattern matching on them alone and do not need to
> display the content but only if the pattern is present or not (say if the
> problem is with fonts and glyphs)
>