Hello Gino,

Please tell me whether it happens with every font or only with that one. Also check whether the encoding of your source file is the same one that is passed to the javac compiler. I suspect your file is saved as UTF-8 but the Java compiler expects a single-byte encoding.
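
To illustrate that suspicion, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a single-byte charset. The class and method names are made up for this example, and windows-1252 is only an assumption about which charset your compiler falls back to. If this is the cause, compiling with "javac -encoding UTF-8" or writing the literal with Unicode escapes should avoid it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Unicode escapes keep this file pure ASCII, so it compiles the same
    // way no matter which -encoding javac uses (a-umlaut, o-umlaut,
    // u-umlaut, sharp s).
    static final String EXPECTED = "\u00e4\u00f6\u00fc\u00df";

    /** Encodes text as UTF-8, then decodes those bytes with another charset. */
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: the compiler read the UTF-8 source as windows-1252.
        String garbled = misdecode(EXPECTED, Charset.forName("windows-1252"));
        System.out.println(EXPECTED.length() + " chars became " + garbled.length());
        // Print code points instead of characters to avoid console encoding issues.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X ", (int) garbled.charAt(i));
        }
        System.out.println();
    }
}
```

Each of the four characters turns into two: U+00C3 followed by U+00A4, U+00B6, U+00BC and U+0178 respectively, which is the sequence you see in your PDF.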

It works for me, I just tested it:

    public static void main(String[] args) throws IOException
    {
        try (PDDocument doc = new PDDocument())
        {
            PDFont font = PDType0Font.load(doc, new FileInputStream("XXXX/OpenSans-Regular.ttf"), false);
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page))
            {
                cs.setFont(font, 20);
                cs.beginText();
                cs.newLineAtOffset(50, 650);
                cs.showText("äöüß");
                cs.endText();
            }
            doc.save("XXXX/gino.pdf");
        }
    }

And this is the content stream:

/F1 20 Tf
BT
  50 650 Td
  (\000\246\000\270\000\276\000\241) Tj
ET
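
The two content streams can be compared by running their codes through the ToUnicode CMap quoted below. The following is only a sketch under stated assumptions: the class is made up for this mail, it copies just the three bfrange rows that these codes fall into, and it assumes every code in the string is two bytes long (so \000\246 is code 0x00A6, and \0019 is code 0x0139).

```java
public class BfRangeLookup {
    // {first code, last code, first Unicode value}; three rows copied from
    // the ToUnicode CMap in the quoted message, not the complete table.
    static final int[][] RANGES = {
        {0x0003, 0x0061, 0x0020},
        {0x0062, 0x00C1, 0x00A0},
        {0x0125, 0x0140, 0x0164},
    };

    /** Returns the Unicode code point for a 2-byte code, or -1 if no row covers it. */
    static int toUnicode(int code) {
        for (int[] r : RANGES) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    /** Decodes a sequence of codes; uncovered codes become U+FFFD. */
    static String decode(int... codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            int cp = toUnicode(code);
            sb.appendCodePoint(cp < 0 ? 0xFFFD : cp);
        }
        return sb.toString();
    }

    static void dump(String label, String s) {
        System.out.print(label + ":");
        for (int i = 0; i < s.length(); i++) {
            System.out.printf(" U+%04X", (int) s.charAt(i));
        }
        System.out.println();
    }

    public static void main(String[] args) {
        // (\000\246\000\270\000\276\000\241) from the stream above
        dump("working", decode(0x00A6, 0x00B8, 0x00BE, 0x00A1));
        // (\000\205\000f\000\205\000x\000\205\000~\000\205\0019) from the quoted mail
        dump("garbled", decode(0x0085, 0x0066, 0x0085, 0x0078,
                               0x0085, 0x007E, 0x0085, 0x0139));
    }
}
```

The working stream decodes to U+00E4 U+00F6 U+00FC U+00DF, four codes for four characters. The garbled stream holds eight codes that decode to U+00C3 U+00A4 U+00C3 U+00B6 U+00C3 U+00BC U+00C3 U+0178, which suggests the string already contained eight characters when it reached showText.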

Tilman

On 30.01.2024 15:52, Gino G wrote:
Hello there,

I'm encountering an error in how certain characters are encoded using PDFBox. The issue exists in all versions of PDFBox, but I'm currently using 3.0.1.

contentStream.showText("äöüß");

The string "äöüß" is used as a test for Unicode characters that PDFBox needs to render.

var resource = Processor.class.getResource("/OpenSans-Regular.ttf");
var file = Paths.get(resource.toURI()).toFile();
var targetStream = new FileInputStream(file);
var out = PDType0Font.load(PageAssembler.getDocument(), targetStream, false);
contentStream.setFont(out, 20);

To do so, I'm importing a font that I know has the glyphs for all four special characters (OpenSans downloaded from Google Fonts). However, this issue can be reproduced using any other Unicode-supported font.

Executing the code, PDFBox renders the following character sequence: Ã¤Ã¶Ã¼ÃŸ.
Clearly an encoding issue.

Using the PDF Debugger, it shows the text rendered as:

/F1 20 Tf
BT
  (\000\205\000f\000\205\000x\000\205\000~\000\205\0019) Tj
ET

Now, as far as I understand from what I've learned while debugging this issue, \205 is an octal escape for the value 133, which selects the glyph at position 133 of the font with the id F1. Again, looking at the F1 section in the PDF Debugger, the character listed under the code / CID / GID 133 is indeed Ã, the first "incorrect" character of the sequence, which is supposed to be "ä".
"ä", however, would be 166, not 133. How does PDFBox get this wrong?

As an aside, if I use showText and use toUnicode(166), PDFBox correctly renders "ä" in the desired font!

Looking at the "ToUnicode" part of the F1 font, the string shown at the end of this message is displayed.

Could someone please help me figure out what is going on? And hopefully even help me fix this issue? For more help, I have attached the PDF document.

Best,
Gino

ToUnicode:

/CIDInit /ProcSet findresource begin
12 dict begin

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def

/CMapName /Adobe-Identity-UCS def
/CMapType 2 def

1 begincodespacerange
<0000> <FFFF>
endcodespacerange

100 beginbfrange
<0001> <0001> <0000>
<0002> <0002> <000D>
<0003> <0061> <0020>
<0062> <00C1> <00A0>
<00C2> <00F2> <0100>
<00F3> <00FF> <0132>
<0100> <0122> <013F>
<0123> <0124> <021A>
<0125> <0140> <0164>
<0141> <0141> <0192>
<0142> <0147> <01FA>
<0148> <0149> <0218>
<014A> <014B> <02C6>
<014C> <014C> <02C9>
<014D> <0152> <02D8>
<0153> <0159> <0384>
<015A> <015A> <038C>
<015B> <016E> <038E>
<016F> <019A> <03A3>
<019B> <01A6> <0401>
<01A7> <01E8> <040E>
<01E9> <01F4> <0451>
<01F5> <01F6> <045E>
<01F7> <01F8> <0490>
<01F9> <01FE> <1E80>
<01FF> <01FF> <1EF2>
<0200> <0200> <1EF3>
<0201> <0203> <2013>
<0204> <020B> <2017>
<020C> <020E> <2020>
<020F> <020F> <2026>
<0210> <0210> <2030>
<0211> <0212> <2032>
<0213> <0214> <2039>
<0215> <0215> <203C>
<0216> <0216> <2044>
<0217> <0217> <207F>
<0218> <0219> <20A3>
<021A> <021A> <20A7>
<021B> <021B> <20AC>
<021C> <021C> <2105>
<021D> <021D> <2113>
<021E> <021E> <2116>
<021F> <021F> <2122>
<0220> <0220> <2126>
<0221> <0221> <212E>
<0222> <0225> <215B>
<0226> <0226> <2202>
<0227> <0227> <2206>
<0228> <0228> <220F>
<0229> <022A> <2211>
<022B> <022B> <221A>
<022C> <022C> <221E>
<022D> <022D> <222B>
<022E> <022E> <2248>
<022F> <022F> <2260>
<0230> <0231> <2264>
<0232> <0232> <25CA>
<0235> <0235> <0326>
<0237> <0238> <2074>
<0239> <023A> <2077>
<023B> <0246> <2000>
<0247> <0247> <FEFF>
<0248> <0249> <FFFC>
<024A> <024A> <01F0>
<024B> <024B> <02BC>
<024C> <024D> <03D1>
<024E> <024E> <03D6>
<024F> <0250> <1E3E>
<0251> <0252> <1E00>
<0253> <0253> <02F3>
<0254> <0255> <01A0>
<0256> <0257> <01AF>
<0259> <0259> <0400>
<025A> <025A> <040D>
<025B> <025B> <0450>
<025C> <025C> <045D>
<025D> <027F> <0460>
<0280> <0287> <0488>
<0288> <02F5> <0492>
<02F6> <02FF> <0500>
<0300> <0309> <050A>
<030A> <035B> <1EA0>
<035C> <0361> <1EF4>
<0362> <0362> <20AB>
<036D> <036E> <0162>
<036F> <0372> <01EA>
<0373> <0373> <0259>
<0374> <0374> <0309>
<0375> <0375> <1F4D>
<0376> <0376> <1FDE>
<0377> <0377> <2070>
<0378> <0378> <2076>
<0379> <0379> <2079>
<038A> <038E> <FB00>
<038F> <038F> <1E9E>
<0390> <0391> <A7B3>
<03AF> <03AF> <0131>
<03B0> <03B0> <0237>
<03B1> <03B1> <A7B5>
endbfrange

35 beginbfrange
<03B2> <03B2> <AB53>
<03C1> <03C8> <2095>
<03C9> <03E3> <05D0>
<03E4> <03F0> <FB2A>
<03F1> <03F5> <FB38>
<03F6> <03F6> <FB3E>
<03F7> <03F8> <FB40>
<03F9> <03FA> <FB43>
<03FB> <03FF> <FB46>
<0400> <0400> <FB4B>
<0401> <0405> <0300>
<0406> <0408> <0306>
<0409> <040B> <030A>
<040C> <040C> <030F>
<040D> <040D> <0312>
<040E> <040E> <0323>
<040F> <0410> <0327>
<0411> <0412> <0485>
<0413> <0414> <0483>
<0415> <0422> <05B0>
<0423> <0424> <05C1>
<0425> <0425> <05C7>
<0459> <0462> <2080>
<0463> <0463> <05BE>
<0464> <0464> <207D>
<0465> <0465> <208D>
<0466> <0466> <207E>
<0467> <0467> <208E>
<0468> <0468> <207A>
<0469> <0469> <207C>
<046A> <046A> <208A>
<046B> <046B> <208C>
<046C> <046C> <2215>
<046D> <046D> <20AA>
<046E> <046E> <2120>
endbfrange

endcmap
CMapName currentdict /CMap defineresource pop
end
end

--
/*Gino*/


