On Sat, Nov 05, 2011 at 02:45:32PM +0000, Jonathan Kew wrote:

> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
>
> > Dear Heiko,
> >
> >>>>>> Conclusion:
> >>>>>> * The encoding mess with 8-bit characters remains even with XeTeX.
> >
> > I have disabled the re-encoding of PDF strings to UTF-16 in xdvipdfmx:
> > TL trunk r24508. Now
> >
> >   /D<c3a46e6368c3b872>
> >
> > and
> >
> >   /Names[<c3a46e6368c3b872>7 0 R]
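(These hex strings are simply the UTF-8 bytes of the destination name;
a minimal Python 3 check, purely illustrative and not part of any TeX
tool:

    >>> data = bytes.fromhex("c3a46e6368c3b872")
    >>> data
    b'\xc3\xa4nch\xc3\xb8r'
    >>> data.decode("utf-8")
    'änchør'
    >>> "änchør".encode("latin1").hex()
    'e46e6368f872'

The shorter latin1 form turns up again in the suggestions below.)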
Thanks Akira. But caution: it could break bookmark strings that currently
work more or less accidentally, sometimes with warnings. Perhaps the
problem can be solved with a syntax extension, see below.

> Unfortunately, I have not had time to follow this thread in detail or
> investigate the issue properly, but I'm concerned this may break other
> things that currently work, and rely on this conversion between the
> encoding form in \specials, and the representation needed in PDF.
>
> However, by way of background: xetex was never intended to be a tool for
> reading and writing arbitrary binary files.

The PDF file format is a binary file format. To some degree US-ASCII can
be used, but at the cost of flexibility and some restrictions.

> It is a tool for processing text, and is specifically based on Unicode
> as the encoding for text, with UTF-8 being its default/preferred
> encoding form for Unicode, and (more importantly) the ONLY encoding
> form that it uses to write output files. It's possible to READ other
> encoding forms (UTF-16), or even other codepages, and have them mapped
> to Unicode internally, but output is always written as UTF-8.
>
> Now, this should include not only .log file and \write output, but also
> text embedded in the .xdv output using \special. Remember that \special
> basically writes a sequence of *characters* to the output, and in xetex
> those characters are *Unicode* characters. So my expectation would be
> that arbitrary Unicode text can be written using \special, and will be
> represented using UTF-8 in the argument of the xxxN operation in .xdv.

That means that arbitrary bytes cannot be written using \special, a
restriction that does not exist in vanilla TeX.

> If that \special is destined to be converted to a fragment of PDF data
> by the xdv-to-pdf output driver (xdvipdfmx), and needs a different
> encoding form, I'd expect the driver to be responsible for that
> conversion.

Suggestions for some of PDF's data structures:

* Strings: It seems that both the literal form (...) and the hex form
  <...> can be used. In the hex form spaces are ignored, thus a space
  right after the opening angle could be used for a syntax extension.
  In this case the driver unescapes the hex string to get the byte
  string without re-encoding to Unicode (see the sketch in the P.S.
  below). Examples:

    \special{pdf:dest < c3a46e6368c3b872> [...]}

  The destination name would be "änchør" as a byte string in UTF-8.

    \special{pdf:dest < e46e6368f872> [...]}

  The destination name would be "änchør" as a byte string in latin1.

    \special{pdf:dest <c3a46e6368c3b872> [...]}

  Without the space, the destination name would be the result of the
  current implementation.

* Streams (\special{pdf: object ...<<...>>stream...endstream}):
  Instead of the keyword "stream", a keyword "hexstream" could be
  introduced. The driver then takes a hex string and unhexes it to get
  the byte data for the stream, again without re-encoding to Unicode.

> What I would NOT expect to work is for a TeX macro package to generate
> arbitrary binary data (byte streams) and expect these to be passed
> unchanged to the output. I suspect that's what Heiko's macros probably
> do, and it worked in pdftex where "tex character" == "byte", but it's
> problematic when "tex character" == "Unicode character".

Yes, that's the problem. PDF is a binary format, not a Unicode text
format.

Yours sincerely
  Heiko Oberdiek
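P.S. To make the string suggestion concrete: below is a rough sketch,
in Python purely for illustration (xdvipdfmx itself is written in C,
and none of this is implemented anywhere yet), of how a driver could
handle the proposed space-after-the-opening-angle convention. The
function name is invented, and the exact form of the current UTF-16
conversion (BOM plus UTF-16BE) is my assumption of common PDF practice:

    def pdf_hex_string_bytes(token):
        # token: the text between '<' and '>' of a hex string in a
        # \special, e.g. ' c3a46e6368c3b872'.
        raw = token.startswith(" ")   # proposed syntax extension
        data = bytes.fromhex("".join(token.split()))
        if raw:
            # New behaviour: hand the bytes to the PDF file unchanged,
            # without any re-encoding to Unicode.
            return data
        # Roughly the current behaviour: interpret the bytes as UTF-8
        # text and re-encode as UTF-16BE with a byte order mark.
        return b"\xfe\xff" + data.decode("utf-8").encode("utf-16-be")

    pdf_hex_string_bytes(" c3a46e6368c3b872")
    # -> b'\xc3\xa4nch\xc3\xb8r'  ("änchør" as UTF-8 bytes)
    pdf_hex_string_bytes("c3a46e6368c3b872")
    # -> b'\xfe\xff\x00\xe4\x00n\x00c\x00h\x00\xf8\x00r'

The proposed "hexstream" keyword would work the same way: unhex the
data and write it out as the stream contents, byte for byte.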