> On 25 Feb 2018, at 04:41, Eric Sunshine <[email protected]> wrote:
>
> On Sat, Feb 24, 2018 at 11:27 AM, <[email protected]> wrote:
>> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
>> or UTF-32LE a BOM must not be used [1]. The function returns true if
>> this is the case.
>>
>> [1] http://unicode.org/faq/utf_bom.html#bom10
>>
>> Signed-off-by: Lars Schneider <[email protected]>
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
>> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> + return (
>> + (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
>> + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> + ) || (
>> + (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
>> + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> + );
>> +}
>
> Is this interpretation correct? When I read [1], I interpret it as
> saying that no BOM _of any sort_ should be present when the encoding
> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
Correct!
> This
> code, on the other hand, only checks for BOMs corresponding to the
> declared size (16 or 32 bits).
Hmm. Interesting thought. You are saying that my code won't complain if
a document declared as UTF-16LE has a UTF-32LE BOM, correct? I would say
this is the correct behavior in the context of this function. This function
assumes that the document is proper UTF-16/UTF-16BE/UTF-16LE but is wrongly
declared with respect to its BOM in .gitattributes. Would this
comment make it clearer to you?
/*
 * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
 * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
 * The function returns true if this rule is violated.
 *
 * [1] http://unicode.org/faq/utf_bom.html#bom10
 */
I think what you are referring to is a different class of error and
would therefore warrant its own checker function. Would you agree?
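
To illustrate the difference, here is a rough, self-contained sketch (not
the patch itself, and untested against the Git tree). The BOM tables and
has_bom_prefix() are assumed to behave like the helpers in utf8.c, and
has_any_utf_bom() is a hypothetical name for such a separate checker:

```c
#include <stddef.h>
#include <string.h>

/* BOM byte sequences; the names mirror those used in the patch. */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Assumed helper: true if data starts with the given BOM bytes. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && len >= bom_len && !memcmp(data, bom, bom_len);
}

/* As in the patch: only BOMs of the declared width are checked. */
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}

/* Hypothetical checker: any UTF-16/32 BOM under a BE/LE declaration. */
int has_any_utf_bom(const char *enc, const char *data, size_t len)
{
	int declared = !strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE") ||
		       !strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE");

	return declared &&
	       (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
		has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)) ||
		has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
		has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)));
}
```

With this sketch, a stream declared as UTF-16BE that starts with the
UTF-32BE BOM (00 00 FE FF) passes has_prohibited_utf_bom() but is flagged
by has_any_utf_bom(). Note that the UTF-32LE BOM (FF FE 00 00) begins with
the UTF-16LE BOM bytes, so a plain prefix check like the one assumed here
would flag that particular combination either way.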
> I suppose the intention of [1] is to detect a mismatch between the
> declared encoding and how the stream is actually encoded. The check
> implemented here will fail to detect a mismatch between, say, declared
> encoding UTF-16BE and actual encoding UTF-32BE.
As stated above, the intention is to detect wrong BOMs! I think we cannot
detect the "declared as UTF-16BE but actually UTF-32BE" error.
Consider this:
printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
0000000 \0 t \0 e \0 s \0 t
0000010
In the first step we "encode" the string to UTF-32BE and then we "decode"
it as UTF-16BE. The result is valid UTF-16BE, although it is not the
original text: every character is now preceded by a NUL. Does this make
sense?
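
The same point as a minimal C sketch. utf16be_is_well_formed() is a
hypothetical helper written only for this mail (it is not in utf8.c); it
checks nothing but the code unit size and surrogate pairing:

```c
#include <stddef.h>

/* "test" encoded as UTF-32BE, i.e. the output of the first iconv step. */
const unsigned char utf32be_test[16] = {
	0, 0, 0, 't',  0, 0, 0, 'e',  0, 0, 0, 's',  0, 0, 0, 't'
};

/* Hypothetical helper: returns 1 if data is well-formed UTF-16BE. */
int utf16be_is_well_formed(const unsigned char *data, size_t len)
{
	size_t i;

	if (len % 2)
		return 0; /* code units are two bytes each */

	for (i = 0; i < len; i += 2) {
		unsigned int unit = (data[i] << 8) | data[i + 1];

		if (unit >= 0xD800 && unit <= 0xDBFF) {
			/* A high surrogate needs a low surrogate after it. */
			unsigned int next;

			if (i + 4 > len)
				return 0;
			next = (data[i + 2] << 8) | data[i + 3];
			if (next < 0xDC00 || next > 0xDFFF)
				return 0;
			i += 2; /* skip the low surrogate */
		} else if (unit >= 0xDC00 && unit <= 0xDFFF) {
			return 0; /* lone low surrogate */
		}
	}
	return 1;
}
```

Passing utf32be_test to this helper returns 1: the bytes decode as the
UTF-16BE code units U+0000, U+0074, U+0000, U+0065 and so on, none of which
is a surrogate, so a validity check of this kind has nothing to reject.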
Thanks,
Lars