On 5 Nov 2011, at 15:24, Heiko Oberdiek wrote:

> On Sat, Nov 05, 2011 at 02:45:32PM +0000, Jonathan Kew wrote:
>
>> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
>>
>>> Dear Heiko,
>>>
>>>>>>>> Conclusion:
>>>>>>>> * The encoding mess with 8-bit characters remains even with XeTeX.
>>>
>>> I have disabled the re-encoding of PDF strings to UTF-16 in xdvipdfmx:
>>> TL trunk r24508. Now
>>> /D<c3a46e6368c3b872>
>>> and
>>> /Names[<c3a46e6368c3b872>7 0 R]
>
> Thanks Akira. But caution: it could break bookmark strings that
> currently work more or less accidentally, sometimes with warnings.
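[For readers following along, here is a minimal Python sketch of the kind of re-encoding being discussed: a UTF-8 byte string from a \special is turned into a PDF text string in UTF-16BE with a leading byte-order mark. This is an illustration of the idea only, not the actual xdvipdfmx code; the function name is made up for the example.]

```python
# Illustrative sketch (NOT actual xdvipdfmx code): re-encode the UTF-8
# text of a special into a PDF hex text string in UTF-16BE, prefixed
# with the byte-order mark FE FF as PDF expects for Unicode strings.

def to_pdf_utf16_hex(utf8_bytes: bytes) -> str:
    text = utf8_bytes.decode("utf-8")
    pdf_bytes = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + pdf_bytes.hex() + ">"

# The destination name from the example above, "änchør", as UTF-8 bytes:
src = bytes.fromhex("c3a46e6368c3b872")
print(to_pdf_utf16_hex(src))   # re-encoded: <feff00e4006e0063006800f80072>

# With the re-encoding disabled (r24508), the bytes pass through as-is:
print("<" + src.hex() + ">")   # <c3a46e6368c3b872>
```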
IIRC (it's a while since I looked at any of this), I believe Unicode
bookmark strings work deliberately (not accidentally) - I think this
came up early on as an issue, and encoding-form conversion was
implemented to ensure that it works. (It's possible there are bugs, of
course, but it was _supposed_ to work!)

> Perhaps the problem can be solved with a syntax extension, see below.
>
>> Unfortunately, I have not had time to follow this thread in detail or
>> investigate the issue properly, but I'm concerned this may break other
>> things that currently work and rely on this conversion between the
>> encoding form in \specials and the representation needed in PDF.
>>
>> However, by way of background: xetex was never intended to be a tool
>> for reading and writing arbitrary binary files.
>
> The PDF file format is a binary file format. To some degree us-ascii
> can be used, but at the cost of flexibility and some restrictions.

Yes, PDF is a binary format; xetex was not designed to write PDF. It
writes its output as XDV - also a binary format, of course, but a very
specific one designed for this purpose - and XDV provides an extension
mechanism that involves writing "special" strings that a driver is
expected to understand. The key issue is that the "special" strings
xetex writes are Unicode strings, not byte strings.

>> It is a tool for processing text, and is specifically based on Unicode
>> as the encoding for text, with UTF-8 being its default/preferred
>> encoding form for Unicode, and (more importantly) the ONLY encoding
>> form that it uses to write output files. It's possible to READ other
>> encoding forms (UTF-16), or even other codepages, and have them mapped
>> to Unicode internally, but output is always written as UTF-8.
>>
>> Now, this should include not only .log file and \write output, but
>> also text embedded in the .xdv output using \special.
>> Remember that \special basically writes a sequence of *characters* to
>> the output, and in xetex those characters are *Unicode* characters. So
>> my expectation would be that arbitrary Unicode text can be written
>> using \special, and will be represented using UTF-8 in the argument of
>> the xxxN operation in .xdv.
>
> That means that arbitrary bytes can't be written using \special,
> a restriction that vanilla TeX does not have.

That's correct. Perhaps regrettable, but that was the design. The
argument of \special{....} is ultimately represented, after macro
expansion, etc., as (Unicode) text, and Unicode text != arbitrary bytes.

>> If that \special is destined to be converted to a fragment of PDF data
>> by the xdv-to-pdf output driver (xdvipdfmx), and needs a different
>> encoding form, I'd expect the driver to be responsible for that
>> conversion.
>
> Suggestions for some of PDF's data structures:
>
> * Strings: It seems that both (...) and the hex form <...> can be
>   used. In the hex form spaces are ignored, thus a space right
>   after the opening angle could be used for a syntax extension.
>   In this case the driver unescapes the hex string to get the
>   byte string without reencoding to Unicode.
>   Example:
>     \special{pdf:dest < c3a46e6368c3b872> [...]}
>   The destination name would be "änchør" as a byte string in UTF-8.
>     \special{pdf:dest < e46e6368f872> [...]}
>   The destination name would be "änchør" as a byte string in latin1.

I don't understand this proposal. How can you (or rather, a driver)
tell which encoding is the intended interpretation of an arbitrary
sequence of byte values?

>     \special{pdf:dest <c3a46e6368c3b872> [...]}
>   The destination name would be the result of the current
>   implementation.
>
> * Streams (\special{pdf: object ...<<...>>stream...endstream}):
>   Instead of the keyword "stream", "hexstream" could be introduced.
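[To make the byte values in Heiko's examples concrete, here is a small Python check - an editor's illustration, not part of any TeX tool. It shows that the two hex strings are the same six-character name "änchør" in two different encodings, and also shows the ambiguity Jonathan raises: the UTF-8 bytes are themselves a perfectly valid latin1 string, just a different (mojibake) one, so the bytes alone cannot reveal the intended encoding.]

```python
# The two hex strings from the proposal decode to the same name
# "änchør", once as UTF-8 (8 bytes) and once as latin1 (6 bytes).
import binascii

utf8_bytes = binascii.unhexlify("c3a46e6368c3b872")
latin1_bytes = binascii.unhexlify("e46e6368f872")

print(utf8_bytes.decode("utf-8"))      # änchør
print(latin1_bytes.decode("latin-1"))  # änchør

# The ambiguity: a driver handed only the bytes cannot tell which
# encoding was meant - the UTF-8 bytes also decode as valid latin1:
print(utf8_bytes.decode("latin-1"))    # Ã¤nchÃ¸r
```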
>   The driver then takes a hex string, unhexes it to get the byte
>   data for the stream, also without reencoding to Unicode.

I'm only vaguely aware of the various \special{}s that are supported by
xdvipdfmx (this stuff is inherited from DVIPDFMx), but yes, I think
that's where this issue should be fixed. But it _also_ needs the
cooperation of macro package authors, in that macros designed to
directly generate binary PDF streams and send them out via \special
cannot be expected to work unchanged - they're assuming that the
argument of \special{...} expands to a string of 8-bit bytes, not a
string of Unicode characters, and that's not true in xetex.

JK

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex