Here's an excerpt of the ToUnicode CMap of that font, found at Root/Pages/Kids/[0]/Resources/Font/F480/ToUnicode:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (UCS) def
  /Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0000> <ffff>
endbfchar
2 beginbfrange
<0001> <005f> <f020>
<0060> <00d0> <f080>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end



This means that character codes in the content stream with a value between 0001 and 005f are converted to Unicode starting at f020, and those between 0060 and 00d0 are converted starting at f080 (see beginbfrange; search for this word in the PDF 32000 specification).
https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
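
As a rough illustration of how these entries are applied, here is a minimal Python sketch (not PDFBox code). The names BFCHAR, BFRANGE and to_unicode are made up for this example; the values are copied from the CMap excerpt above. Each bfrange line maps its first code to the given destination and every following code to the next code point.

```python
# Entries copied from the ToUnicode CMap excerpt above.
# BFCHAR, BFRANGE and to_unicode are hypothetical names for this sketch.
BFCHAR = {0x0000: 0xFFFF}
BFRANGE = [
    (0x0001, 0x005F, 0xF020),
    (0x0060, 0x00D0, 0xF080),
]


def to_unicode(code):
    """Return the Unicode code point for a 2-byte code, or None if unmapped."""
    if code in BFCHAR:
        return BFCHAR[code]
    for lo, hi, dst in BFRANGE:
        if lo <= code <= hi:
            # Offset within the source range is added to the destination start.
            return dst + (code - lo)
    return None


for code in (0x0000, 0x0001, 0x005F, 0x0060, 0x00D0):
    print(f"<{code:04x}> -> U+{to_unicode(code):04X}")
```

So the first range ends at U+F07E and the second at U+F0F0, all in the Private Use Area, while code 0000 is the only one that lands on U+FFFF.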

But the content stream also contains

    [ (\000\000) ] TJ

16 times. Both Adobe and PDFBox render this as a square. The beginbfchar section maps 0000 to Unicode ffff, which is a Unicode noncharacter. It becomes EF BF BF in UTF-8.

http://www.fileformat.info/info/unicode/char/ffff/index.htm
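
The byte sequence can be checked with a few lines of Python. This is only a sketch to confirm the encoding; is_noncharacter is a helper name invented for this example, and it follows the Unicode definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane).

```python
def is_noncharacter(cp):
    """True if cp is one of the 66 Unicode noncharacters (hypothetical helper)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# The two bytes \000\000 from the content stream form the character code 0000.
code = int.from_bytes(b"\x00\x00", "big")

# The bfchar entry <0000> <ffff> maps that code to U+FFFF.
mapped = 0xFFFF if code == 0x0000 else None

utf8 = chr(mapped).encode("utf-8")
print(utf8.hex(" ").upper())       # the bytes seen in the extracted text file
print(is_noncharacter(mapped))
```

The same check returns False for the Private Use Area code points produced by the bfrange entries, such as U+F020.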

QED

Tilman





On 23.06.2016 at 10:33, OYEBISI, Daniel wrote:
You can get the PDF file from this URL:

http://www.pdf-archive.com/2016/06/23/modele-tableau-wingdings-3/

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Wednesday, 22 June 2016 20:03
To: [email protected]
Subject: Re: Empty glyphs

From what I see, the "whitespace" characters are EF BF BF, which is not a
valid UTF-8 character. Please upload the PDF file somewhere.

Tilman

On 22.06.2016 at 18:39, OYEBISI, Daniel wrote:
The problem is with some of the whitespace characters, which appear empty in
Notepad but really are not.
Please try opening the text file with other text editors.
Thanks

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Wednesday, 22 June 2016 17:54
To: [email protected]
Subject: Re: Empty glyphs

Your PDF didn't get through (security) but this sounds like a N++ problem.

I could display your txt file with the normal Notepad by changing the font to
Wingdings.

Tilman

On 22.06.2016 at 16:58, OYEBISI, Daniel wrote:
Hello,

I came across an issue while trying to extract the text using
PDFTextStripper from the PDF file attached to this email.

When I open the generated txt document in Notepad, it appears
normal, but when I open it with Notepad++ it gives an interesting
result.

Please can you have a look at this?

Thanks



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
