Hello Gino,
Please tell us whether this happens with every font or only with that one.
Also check whether the encoding of your source file is the same one that
is passed to the javac compiler. I suspect your file is UTF-8 but the Java
compiler expects a single-byte encoding.
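To illustrate the suspected mismatch, here is a minimal sketch. The class and method names are made up for this example, and windows-1252 is only an assumption about which single-byte charset the compiler might have used. The string literal is written with \u escapes so that the example itself does not depend on the source encoding. It encodes "äöüß" as UTF-8 and decodes the bytes with the wrong charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with another charset,
    // which is what happens when a UTF-8 source file is read as single-byte text.
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut, sharp s, written as escapes (pure ASCII source)
        String original = "\u00e4\u00f6\u00fc\u00df";
        // Assumption: the compiler treated the file as windows-1252.
        String garbled = misdecode(original, Charset.forName("windows-1252"));
        // Four characters become eight, each pair starting with U+00C3.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

If this is the cause, compiling with `javac -encoding UTF-8`, or writing the literal as `"\u00e4\u00f6\u00fc\u00df"`, should avoid the problem.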
It works for me, I just tested it:
public static void main(String[] args) throws IOException
{
    try (PDDocument doc = new PDDocument())
    {
        PDFont font = PDType0Font.load(doc,
                new FileInputStream("XXXX/OpenSans-Regular.ttf"), false);
        PDPage page = new PDPage();
        doc.addPage(page);
        try (PDPageContentStream cs = new PDPageContentStream(doc, page))
        {
            cs.setFont(font, 20);
            cs.beginText();
            cs.newLineAtOffset(50, 650);
            cs.showText("äöüß");
            cs.endText();
        }
        doc.save("XXXX/gino.pdf");
    }
}
And this is the content stream:
/F1 20 Tf
BT
50 650 Td
(\000\246\000\270\000\276\000\241) Tj
ET
Tilman
On 30.01.2024 15:52, Gino G wrote:
Hello there,
I'm encountering an error in how certain characters are encoded using
PDFBox. The issue exists in all versions of PDFBox, but I'm currently
using 3.0.1.
contentStream.showText("äöüß");
The string "äöüß" is used as a test for Unicode characters that PDFBox
needs to render.
var resource = Processor.class.getResource("/OpenSans-Regular.ttf");
var file = Paths.get(resource.toURI()).toFile();
var targetStream = new FileInputStream(file);
var out = PDType0Font.load(PageAssembler.getDocument(), targetStream, false);
contentStream.setFont(out, 20);
To do so, I'm importing a font that I know has the glyphs for all four
special characters (OpenSans downloaded from Google Fonts).
However, this issue can be reproduced with any other font that
supports Unicode.
Executing the code, PDFBox renders the following character
sequence: Ã¤Ã¶Ã¼ÃŸ.
This is clearly an encoding issue.
The PDF Debugger shows the text in the content stream as:
/F1 20 Tf
BT
(\000\205\000f\000\205\000x\000\205\000~\000\205\0019) Tj
ET
Now, as far as I understand from what I've learned while debugging
this issue, \205 is an octal escape, and together with the preceding
\000 it selects the glyph at position 133 (octal 205 in decimal) of
the font with the ID F1.
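As a rough sketch of that arithmetic, the following reads the string from the content stream as pairs of bytes and prints the resulting two-byte codes. The class and method names are made up for this example, and the parser is deliberately simplified: it only understands octal escapes and plain ASCII characters, not the other escapes a PDF literal string may contain.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class ContentStreamCodes {
    // Simplified parser: a backslash is assumed to start an octal escape of
    // up to three digits; every other character is taken as one ASCII byte.
    public static byte[] parseLiteral(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\') {
                int value = 0;
                int digits = 0;
                i++;
                while (i < body.length() && digits < 3
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value); // only the low 8 bits are kept
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Combines each pair of bytes into one big-endian 16-bit code.
    public static List<Integer> toCodes(byte[] bytes) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            codes.add(((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF));
        }
        return codes;
    }

    public static void main(String[] args) {
        String body = "\\000\\205\\000f\\000\\205\\000x\\000\\205\\000~\\000\\205\\0019";
        // prints [133, 102, 133, 120, 133, 126, 133, 313]
        System.out.println(toCodes(parseLiteral(body)));
    }
}
```

So the stream contains eight codes for four intended characters, and every second one is 133.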
Again, looking at the F1 section in the PDF Debugger, the character
listed under the code / CID / GID 133 is indeed Ã, the first
"incorrect" character of the sequence, where "ä" is expected.
"ä", however, would be 166, not 133. How does PDFBox get this wrong?
As an aside, if I use showText and use toUnicode(166), PDFBox
correctly renders "ä" in the desired font!
Looking at the "ToUnicode" part of the F1 font, the string reproduced
at the end of this message is displayed.
Could someone please help me figure out what is going on? And
hopefully even help me fix this issue? For more help, I have attached
the PDF document.
Best,
Gino
ToUnicode:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
100 beginbfrange
<0001> <0001> <0000>
<0002> <0002> <000D>
<0003> <0061> <0020>
<0062> <00C1> <00A0>
<00C2> <00F2> <0100>
<00F3> <00FF> <0132>
<0100> <0122> <013F>
<0123> <0124> <021A>
<0125> <0140> <0164>
<0141> <0141> <0192>
<0142> <0147> <01FA>
<0148> <0149> <0218>
<014A> <014B> <02C6>
<014C> <014C> <02C9>
<014D> <0152> <02D8>
<0153> <0159> <0384>
<015A> <015A> <038C>
<015B> <016E> <038E>
<016F> <019A> <03A3>
<019B> <01A6> <0401>
<01A7> <01E8> <040E>
<01E9> <01F4> <0451>
<01F5> <01F6> <045E>
<01F7> <01F8> <0490>
<01F9> <01FE> <1E80>
<01FF> <01FF> <1EF2>
<0200> <0200> <1EF3>
<0201> <0203> <2013>
<0204> <020B> <2017>
<020C> <020E> <2020>
<020F> <020F> <2026>
<0210> <0210> <2030>
<0211> <0212> <2032>
<0213> <0214> <2039>
<0215> <0215> <203C>
<0216> <0216> <2044>
<0217> <0217> <207F>
<0218> <0219> <20A3>
<021A> <021A> <20A7>
<021B> <021B> <20AC>
<021C> <021C> <2105>
<021D> <021D> <2113>
<021E> <021E> <2116>
<021F> <021F> <2122>
<0220> <0220> <2126>
<0221> <0221> <212E>
<0222> <0225> <215B>
<0226> <0226> <2202>
<0227> <0227> <2206>
<0228> <0228> <220F>
<0229> <022A> <2211>
<022B> <022B> <221A>
<022C> <022C> <221E>
<022D> <022D> <222B>
<022E> <022E> <2248>
<022F> <022F> <2260>
<0230> <0231> <2264>
<0232> <0232> <25CA>
<0235> <0235> <0326>
<0237> <0238> <2074>
<0239> <023A> <2077>
<023B> <0246> <2000>
<0247> <0247> <FEFF>
<0248> <0249> <FFFC>
<024A> <024A> <01F0>
<024B> <024B> <02BC>
<024C> <024D> <03D1>
<024E> <024E> <03D6>
<024F> <0250> <1E3E>
<0251> <0252> <1E00>
<0253> <0253> <02F3>
<0254> <0255> <01A0>
<0256> <0257> <01AF>
<0259> <0259> <0400>
<025A> <025A> <040D>
<025B> <025B> <0450>
<025C> <025C> <045D>
<025D> <027F> <0460>
<0280> <0287> <0488>
<0288> <02F5> <0492>
<02F6> <02FF> <0500>
<0300> <0309> <050A>
<030A> <035B> <1EA0>
<035C> <0361> <1EF4>
<0362> <0362> <20AB>
<036D> <036E> <0162>
<036F> <0372> <01EA>
<0373> <0373> <0259>
<0374> <0374> <0309>
<0375> <0375> <1F4D>
<0376> <0376> <1FDE>
<0377> <0377> <2070>
<0378> <0378> <2076>
<0379> <0379> <2079>
<038A> <038E> <FB00>
<038F> <038F> <1E9E>
<0390> <0391> <A7B3>
<03AF> <03AF> <0131>
<03B0> <03B0> <0237>
<03B1> <03B1> <A7B5>
endbfrange
35 beginbfrange
<03B2> <03B2> <AB53>
<03C1> <03C8> <2095>
<03C9> <03E3> <05D0>
<03E4> <03F0> <FB2A>
<03F1> <03F5> <FB38>
<03F6> <03F6> <FB3E>
<03F7> <03F8> <FB40>
<03F9> <03FA> <FB43>
<03FB> <03FF> <FB46>
<0400> <0400> <FB4B>
<0401> <0405> <0300>
<0406> <0408> <0306>
<0409> <040B> <030A>
<040C> <040C> <030F>
<040D> <040D> <0312>
<040E> <040E> <0323>
<040F> <0410> <0327>
<0411> <0412> <0485>
<0413> <0414> <0483>
<0415> <0422> <05B0>
<0423> <0424> <05C1>
<0425> <0425> <05C7>
<0459> <0462> <2080>
<0463> <0463> <05BE>
<0464> <0464> <207D>
<0465> <0465> <208D>
<0466> <0466> <207E>
<0467> <0467> <208E>
<0468> <0468> <207A>
<0469> <0469> <207C>
<046A> <046A> <208A>
<046B> <046B> <208C>
<046C> <046C> <2215>
<046D> <046D> <20AA>
<046E> <046E> <2120>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
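For reference, a bfrange entry <first> <last> <dst> maps each code in first..last to dst plus the offset from first. The sketch below applies that rule to just two of the ranges above (<0062> <00C1> <00A0> and <0125> <0140> <0164>); the class and method names are made up for this example, and all other ranges are omitted.

```java
public class BfRangeLookup {
    // Each row is {firstCode, lastCode, firstUnicode}, copied from two
    // bfrange entries of the CMap above; all other ranges are left out.
    static final int[][] RANGES = {
        {0x0062, 0x00C1, 0x00A0},
        {0x0125, 0x0140, 0x0164},
    };

    // Returns the Unicode code point for a code, or -1 if neither range covers it.
    public static int toUnicode(int code) {
        for (int[] r : RANGES) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    // Maps a sequence of codes to text, using U+FFFD for uncovered codes.
    public static String decode(int... codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            int cp = toUnicode(code);
            sb.appendCodePoint(cp < 0 ? 0xFFFD : cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 133 -> U+00C3, 102 -> U+00A4, 313 -> U+0178, 166 -> U+00E4
        for (int code : new int[] {133, 102, 313, 166}) {
            System.out.printf("%d -> U+%04X%n", code, toUnicode(code));
        }
    }
}
```

Under this mapping, code 166 corresponds to U+00E4 (ä) and code 133 to U+00C3 (Ã), which matches what the PDF Debugger lists for F1.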
--
/*Gino*/