Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-07 Thread David Carlisle
On 7 May 2015 at 02:07, Ross Moore wrote: > Hi David, > > .. > > No disagreement to this. > > OK:-) > > In the current versions d835dc00 is two characters in luatex > and one character in xetex > as the implementation detail that xetex's underlying storage is mostly > UTF-16 is exp
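To make the "two characters versus one" point concrete, here is a small Python sketch. It is only an analogy for the behaviour described in the message, not a model of either engine's internals, and the helper name `from_surrogates` is invented for this example. It shows that the code units D835 DC00 can be held as two separate code points, or combined into one character when treated as a UTF-16 surrogate pair.

```python
def from_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Two separate code points, kept apart (the "two characters" reading).
as_two = chr(0xD835) + chr(0xDC00)

# The same four bytes decoded as big-endian UTF-16 (the "one character" reading).
as_one = bytes.fromhex("d835dc00").decode("utf-16-be")

if __name__ == "__main__":
    print(len(as_two), len(as_one))
    print(f"U+{from_surrogates(0xD835, 0xDC00):X}")
```

Which reading applies depends on whether the storage format's pairing rule is allowed to show through.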

Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Ross Moore
Hi David, On 07/05/2015, at 9:26 AM, David Carlisle wrote: >> The character itself, as bytes that is, is not wrong and users should be >> able to create these. >> But preferably through macros that ensure that they come correctly paired. > > placing two character tokens representing a surrogate

Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread David Carlisle
> The character itself, as bytes that is, is not wrong and users should be able > to create these. > But preferably through macros that ensure that they come correctly paired. placing two character tokens representing a surrogate pair should not though magically turn itself into a single characte

Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Ross Moore
Hi Arthur, On 07/05/2015, at 8:04, Arthur Reutenauer wrote: > While working on these bugs, we also discussed how surrogate > characters were handled in XeTeX. Surrogate characters are the 2048 > code points that are used in UTF-16 to encode characters with code > points above 65535: a pair of

Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread David Carlisle
On 6 May 2015 at 23:04, Arthur Reutenauer wrote: > While working on these bugs, we also discussed how surrogate > characters were handled in XeTeX. Surrogate characters are the 2048 > code points that are used in UTF-16 to encode characters with code > points above 65535: a pair of them makes u

Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Arthur Reutenauer
While working on these bugs, we also discussed how surrogate characters were handled in XeTeX. Surrogate characters are the 2048 code points that are used in UTF-16 to encode characters with code points above 65535: a pair of them makes up one Unicode character; however they're not meant to be u
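As a rough illustration of the pairing arithmetic (a sketch of the standard UTF-16 calculation, not XeTeX's code), the following Python splits a code point of U+10000 or higher into its high and low surrogates. The function name `to_surrogates` and the constant `SURROGATE_COUNT` are made up for this example.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not encoded with surrogates")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits select the low surrogate
    return high, low

# The surrogate range U+D800..U+DFFF contains 2048 code points.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1

if __name__ == "__main__":
    high, low = to_surrogates(0x10001)  # 65537 in decimal
    print(f"{high:04X} {low:04X}")
```

Code points up to U+FFFF take a single 16-bit unit and never go through this split.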

Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-05 Thread David Carlisle
On 4 May 2015 at 16:27, Jonathan Kew wrote: > ... > > A fix for this bug, so that \string generates single Unicode characters > even for values above U+, is currently on the utf16-issues branch in > the XeTeX repository on sourceforge.[1] > > A bug with characters above U+ within \scantok

[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-04 Thread Jonathan Kew
On 23/4/15 20:59, David Carlisle wrote: I can confirm that \string does convert character tokens to two tokens giving the UTF-16 representation. With the attached file luatex produces 90,33 34,33 233,33 233,33 65530,33 65537,33 65537,33 which is in each case the unicode value of the character