
Junio C Hamano Tue, 30 Jan 2018 11:15:53 -0800

[email protected] writes:

> From: Lars Schneider <[email protected]>
>
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> has_missing_utf_bom() function returns true if a required BOM is
> missing.
>
> The Unicode standard instructs to assume big-endian if there is no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
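
To make the ambiguity described in the quoted message concrete, here is a minimal sketch of how the same two bytes yield different UTF-16 code units depending on the assumed byte order. The helper names utf16_unit_be() and utf16_unit_le() are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical helper: read one UTF-16 code unit as big-endian. */
static uint16_t utf16_unit_be(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: read the same two bytes as little-endian. */
static uint16_t utf16_unit_le(const unsigned char *p)
{
	return (uint16_t)((p[1] << 8) | p[0]);
}
```

For the byte pair 0x00 0x41, the big-endian reading gives U+0041 ("A") while the little-endian reading gives U+4100, so without a BOM or an explicit endianness in the encoding name the content cannot be interpreted reliably.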


I do not have a strong opinion on encoding such policy-ish behaviour
as our default, but am I alone in finding that "has missing X" is a
confusing name for a helper function?  "is missing X" (or "lacks
X") is a bit more understandable, I guess.

> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +     return (
> +        !strcmp(enc, "UTF-16") &&
> +        !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +          has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +     ) || (
> +        !strcmp(enc, "UTF-32") &&
> +        !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +          has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +     );
> +}
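
For illustration only, here is a self-contained sketch of the same check under the "is missing X" name suggested above. The quoted hunk does not show the BOM tables or has_bom_prefix(), so their definitions below are assumptions (the BOM byte values themselves are the standard ones), and is_missing_utf_bom() is a hypothetical rename, not what the patch contains.

```c
#include <stddef.h>
#include <string.h>

/* Standard BOM byte sequences; the array names follow the quoted patch. */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Assumed definition: true if data starts with the given BOM. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Hypothetical rename of has_missing_utf_bom(): returns 1 only when the
 * encoding name leaves the endianness open ("UTF-16" or "UTF-32") and the
 * data carries neither the big-endian nor the little-endian BOM.
 */
static int is_missing_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	   !strcmp(enc, "UTF-16") &&
	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	   !strcmp(enc, "UTF-32") &&
	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}
```

With this sketch, a call such as is_missing_utf_bom("UTF-16", data, len) reads naturally at the call site, and names that already state the byte order (for example "UTF-16LE") never trigger the check because the strcmp() comparisons are exact.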

Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
