Hi Yuriy! Am Sun, 24 May 2015 23:08:12 +0300 schrieb "Yuriy M. Kaminskiy" <yum...@gmail.com>:
> utf-16 decoder in id3 parser improperly identifies surrogate pairs, > resulting in improper identification of characters in 0xf800-0xfffe > range as leading surrogate and decoding failure. > > E.g. attempt to decode title "「x」~y~" results in: > [id3.c:1065] error: Invalid UTF16 surrogate pair at 0 (0xff62). > and empty parsed title. Could you please send me (mpg123 upstream maintainer) a little (piece of an) example file to add as regression test for this? As ID3 tag writers also have a history of messing up encoding, I'd like to use the original and not a fake I did myself;-) Regarding the patch: Oh, yes, I see stupid me not getting the proper idea about bit masks back in 2006/2007 in this case. Let's recap to be on the safe side: high surrogate range: 0xD800 to 0xDBFF low suggogate range: 0xDC00 to 0xDFFF Do we agree on that or is my knowledge of UTF-16 outdated? I sense that the mask 0xf800 doesn't cover the first range properly, neither. We need to detect bit sequences between 0b1101100000000000 0b1101101111111111 We don't want to catch 0b110111xxxxxxxxxx in there. So a proper mask should be 0b1111110000000000 which is 0xfc00 in hex, too. Verifying the low surrogate range: 0b1101110000000000 0b1101111111111111 The mask 0b1111110000000000 seems appropriate here, too. How convenient. This smells of intelligent design, doesn't it? ;-) So 0xfc00 should be used both for low and high surrogates to properly tell them apart with the additional bit. I'm attaching a revised patch that should enter mpg123 trunk shortly. Feel free to yell and show the error in my current reasoning … Alrighty then, Thomas
Index: src/libmpg123/id3.c =================================================================== --- src/libmpg123/id3.c (Revision 3642) +++ src/libmpg123/id3.c (Arbeitskopie) @@ -1051,10 +1051,10 @@ for(i=0; i < n; i+=2) { unsigned long point = ((unsigned long) s[i+high]<<8) + s[i+low]; - if((point & 0xd800) == 0xd800) /* lead surrogate */ + if((point & 0xfc00) == 0xd800) /* lead surrogate */ { unsigned short second = (i+3 < l) ? (s[i+2+high]<<8) + s[i+2+low] : 0; - if((second & 0xdc00) == 0xdc00) /* good... */ + if((second & 0xfc00) == 0xdc00) /* good... */ { point = FULLPOINT(point,second); length += UTF8LEN(point); /* possibly 4 bytes */ @@ -1077,7 +1077,7 @@ for(i=0; i < n; i+=2) { unsigned long codepoint = ((unsigned long) s[i+high]<<8) + s[i+low]; - if((codepoint & 0xd800) == 0xd800) /* lead surrogate */ + if((codepoint & 0xfc00) == 0xd800) /* lead surrogate */ { unsigned short second = (s[i+2+high]<<8) + s[i+2+low]; codepoint = FULLPOINT(codepoint,second);
pgpqDpi42mv8d.pgp
Description: Digitale Signatur von OpenPGP
_______________________________________________ pkg-multimedia-maintainers mailing list pkg-multimedia-maintainers@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-multimedia-maintainers