Re: Use cases for invalid-Unicode atoms

Henri Sivonen Wed, 21 Mar 2018 02:28:40 -0700

On Tue, Mar 20, 2018 at 8:22 PM, Steve Fink <sf...@mozilla.com> wrote:
> On 03/20/2018 06:49 AM, Henri Sivonen wrote:
>>
>> On Tue, Mar 20, 2018 at 12:44 PM, Henri Sivonen <hsivo...@hsivonen.fi>
>> wrote:
>>>
>>> On Tue, Mar 20, 2018 at 11:12 AM, Henri Sivonen <hsivo...@hsivonen.fi>
>>> wrote:
>>>>
>>>> OK. I'll leave the UTF-16 case unchanged and will make the minimal
>>>> changes on the UTF-8 side to retain the existing outward behavior
>>>> without burning the tree. Hopefully I can make the UTF-8 case faster
>>>> while at it. It depended on not-so-great code.
>>>
>>> I still have doubts about retaining the exact invalid-UTF-8 behavior.
>>> The current behavior appears to be that if we try to atomicize an
>>> invalid UTF-8 string, the returned atom is new atom representing the
>>> empty string--not the pre-existing atom for the empty string.
>>>
>>> Is there a reason why it's desirable behavior to potentially have
>>> multiple atoms representing the empty string? Is there a reason why we
>>> don't MOZ_CRASH on invalid UTF-8 if we are convinced enough that it's
>>> not supposed to happen to the point that we let go of the atomicity of
>>> atoms if it does happen?
>>
>> Furthermore, we validate UTF-8 strings anyway as a side effect of
>> hashing them as if they were UTF-16, so if we don't want to MOZ_CRASH,
>> we could at that point swap a valid string (invalid byte sequences
>> replaced with U+FFFD) in the input string's place and atomicize that.
>> It could be a MOZ_UNLIKELY branch on the validation result that we
>> compute anyway and would avoid the weirdness of non-atomic atoms that
>> have no resemblance to the input string.
>>
>> Thoughts?
>
> My only thought is that the atomize-to-non-canonical-empty-string behavior
> sounds like a crazy footgun. But changing it also sounds scary -- currently,
> it sounds like valid strings produce unique atoms, and invalid strings
> produce unique placeholder things that will never match anything, even
> themselves, sort of like a string version of NaN. Is that correct?


So it seems from reading the code.

> If so, then your proposed change would not only make invalid strings compare
> equal to themselves, but also to other *different* invalid strings with
> same-length invalid byte sequences. That also seems kind of footgunny.

I'm quite a bit more worried about the case of using the string value
of the atom than about comparing them. These kind of NaN atoms yield
the empty string as the string value, so if you atomicize an invalid
UTF-8 string and then obtain the string value from the atom, the
errors have collapsed into nothingness. While I don't have an idea how
to exploit this given how atoms are used in Gecko, this general
behavior is known-bad on the level that it's called out in the Unicode
Security Considerations:
https://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences
.

> I don't really know how Gecko atoms work or are used, though, so I may be
> totally off base here.

What I'm proposing is equivalent to converting the UTF-8 string to
UTF-16 "with replacement" (in the WHATWG sense) and then atomicizing
the UTF-16 string. AFAICT, the main user of atomicizing by UTF-8 is
the new style system, which has already performed the equivalent
substitution.

Looking at the other callers:
 * One case uselessly atomicizes by UTF-8, considering that it had a
UTF-16 string available a couple of lines earlier.
 * One case uses a contract ID as input, so the input is internal anyway.
 * Some uses are language atoms, where it would be fine to have U+FFFD
in bogus atoms, since they don't match anything useful anyway.
 * A bunch of things atomicize URL components. I'd hope that the URLs
were converted from UTF-16 to UTF-8 at some prior point ensuring UTF-8
validity, but it's hard to be sure. To the extent these are used for
security checks, having NaN atoms that match nothing could be safer
than having different inputs yield the same U+FFFD sequences to make
them match.

The query string can introduce invalid UTF-8 into a URL, but I believe
we never do security checks based on query part. I believe we're
supposed to be doing security checks on the scheme (always ASCII),
port (number) and the Punycode form of the host (always ASCII). Is
this true?

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Re: Use cases for invalid-Unicode atoms

Reply via email to