> On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
>>
>>> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.sn...@a8c.com> wrote:
>>>
>>> Greetings all,
>>>
>>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues
>>> that I’d like to correct.
>>>
>>> - It’s missing 720 of HTML5’s specified named character references.
>>> - 106 of these are named character references which do not require a
>>>   trailing semicolon, such as `&acute`.
>>> - It’s unaware of the ambiguous ampersand rule, which allows these 106 in
>>>   special circumstances.
>>>
>>> HTML5 asserts that the list of named character references will not expand
>>> in the future. It can be found authoritatively at the following URL:
>>>
>>> https://html.spec.whatwg.org/entities.json
>>>
>>> The ambiguous ampersand rule smooths over legacy behavior from before
>>> HTML5, where ampersands were not properly encoded in attribute values,
>>> specifically in URL values. For example, in a query string for a search,
>>> one might find `?q=dog&not=cat`. The `&not` in that value would decode to
>>> U+AC (¬), but since it’s in an attribute value it will be left as
>>> plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`.
>>> There are related nuances when numeric character references are found at
>>> the end of a string or boundary without the semicolon.
>>>
>>> The function signature of `html_entity_decode()` does not currently allow
>>> for correcting this behavior. I’d like to propose an RFC or a bug fix which
>>> either extends the function (perhaps by adding a new flag like
>>> `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the
>>> missing character references I wonder if it would be enough to add them to
>>> the list of default translatable references.
>>>
>>> One challenge with the existing function is that the concept of the
>>> translation table stands in contrast with the fixed and static nature of
>>> HTML5’s replacement tables. A new function or set of functions could open
>>> up spec-compliant decoding while providing helpful methods that are
>>> necessary in many common server-side operations:
>>>
>>> - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
>>> - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>> - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>>
>>> These methods are handy for inspecting things like encoded attribute values
>>> in a memory-efficient and processing-efficient way, when it’s not necessary
>>> to decode the entire value. In common situations, one encounters data URIs
>>> with potentially megabytes of image data, and processing only the first few
>>> or tens of bytes can save a lot of overhead.
>>>
>>> We’re exploring pure-PHP solutions to these problems in WordPress in an
>>> attempt to improve the reliability and safety of handling HTML. I’d love
>>> to hear your thoughts and know if anyone is willing to work with me to
>>> create an RFC or directly propose patches. We’ve created a step function
>>> which allows finding the next character reference and decoding it
>>> separately, enabling some novel features like highlighting the character
>>> references in source text.
>>>
>>> Should I propose an RFC for this?
>>>
>>> Warmly,
>>> Dennis Snell
>>> Automattic Inc.
>>
>> All,
>>
>> I have submitted an RFC draft for including the proposed feature from this
>> issue. Thanks to everyone who helped me in this process.
>> It’s my first RFC, so I apologize in advance for any mistakes I’ve made in
>> the process.
>>
>> https://wiki.php.net/rfc/decode_html
>>
>> This is proposed for a future PHP version after 8.4.
>>
>> Warmly,
>> Dennis Snell
>
> Hey Dennis,
Thanks for the question, Rob, I hope this finds you well!

> The RFC mentions that encoding must be utf-8. How are programmers supposed
> to work with this if the php file itself isn’t utf-8

In my experience it’s the opposite case that is more important to consider:
what happens when we mix UTF-8 source code with latin1 data, or UTF-8 source
HTML with the system-set locale. I tried to hint at this scenario in the
“Character encodings and UTF-8” section. Let’s examine the fundamental
breakdown case:

```php
"é" === decode_html( "&eacute;" );
```

If the source is UTF-8 there’s no problem. If the source is ISO-8859-1 this
will fail because 0xE9 is on the left while 0xC3 0xA9 is on the right.
_Except_ if `zend.multibyte=1` and (`zend.script_encoding=iso-8859-1` _or_
`declare(encoding='iso-8859-1')` is set). The source code may or may not be
converted into a different encoding based on configurations that most
developers won’t have access to, or won’t examine. Even with source code in
ISO-8859-1 and with `zend.script_encoding` and `zend.multibyte` set,
`html_entity_decode()` _still_ assumes UTF-8 unless `zend.default_charset`
is set _or_ one of the `iconv` or `mbstring` internal charsets is set.

The point I’m trying to make is that the current situation today is a
minefield due to a dizzying array of system-dependent settings. Most modern
code will either be running UTF-8 source code or will be converting source
code _to_ UTF-8, or many other things will already be hopelessly broken
beyond this one issue. UTF-8 is the unifier that lets us escape this by
having a defined and explicit encoding at the input and output.

> or the input is meaningless in utf-8 or if changing it to utf-8 and back
> would result in invalid text?

There shouldn’t be input that’s meaningless in UTF-8 if it’s valid in any
other encoding. Indeed, I have placed the burden on the calling code to
convert into UTF-8 beforehand, but that’s not altogether different than
asking someone to declare into what encoding the character references ought
to be decoded.

```diff
-html_entity_decode( $html, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+$html = mb_convert_encoding( $html, 'UTF-8', 'ISO-8859-1' );
+$html = decode_html( HTML_TEXT, $html );
+$html = mb_convert_encoding( $html, 'ISO-8859-1', 'UTF-8' );
```

If an encoding can go into UTF-8 (which it should) then it should also be
able to return for all supported inputs. That is, we cannot convert into
UTF-8 and produce a character that is unrepresentable in the source encoding,
because that would imply it was there in the source to begin with.
Furthermore, if the HTML decodes into a code point unsupported in the
destination encoding, it would be invalid either directly via decoding, or
indirectly via conversion.

```diff
-"\x1A" === html_entity_decode( "&#x1F170;", ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'ISO-8859-1' );
+"?" === mb_convert_encoding( decode_html( HTML_TEXT, "&#x1F170;" ), 'ISO-8859-1', 'UTF-8' );
```

This gets really confusing because neither of these outputs is a proper
decoding, as character encodings that don’t support the full Unicode code
space cannot adequately represent all valid HTML inputs. HTML decoding is
defined by the specification in terms of Unicode, so even in a browser with
`<meta charset="ISO-8859-1">&#x1F170;` the text content will still be `🅰`,
not `?` or the invisible ASCII control code SUB.

—

I’m sorry for being long-winded, but I think it’s necessary to frame these
questions in the context of the problem today.
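For what it’s worth, here is a runnable sketch of that convert-at-the-boundary
pattern using only functions that exist today, `mb_convert_encoding()` and
`html_entity_decode()`, standing in for the proposed `decode_html()`; the
sample input bytes are contrived for illustration:

```php
<?php
// Sketch: convert once at the input boundary, decode in UTF-8, convert
// back out. The input is a contrived ISO-8859-1 byte string.
$latin1_html = "caf\xE9 &amp; cr\xEApes &#x1F170;";

$utf8 = mb_convert_encoding( $latin1_html, 'UTF-8', 'ISO-8859-1' );
$text = html_entity_decode( $utf8, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );
$back = mb_convert_encoding( $text, 'ISO-8859-1', 'UTF-8' );

// U+1F170 has no ISO-8859-1 representation, so the conversion back
// substitutes mbstring's default replacement ('?' unless configured
// otherwise) instead of the decoder emitting the ASCII SUB control code.
var_dump( $back ); // "caf\xE9 & cr\xEApes ?" (15 bytes of ISO-8859-1)
```

The nice property is that the substitution happens at the explicit conversion
step, where the calling code chose the lossy target encoding, and not
invisibly inside the decoder.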
We have very frequent errors that result from having the wrong defaults and a
confusion of text encodings. I’ve seen far more problems from UTF-8 source
code assuming that its input is also UTF-8 than from source code in any other
encoding (likely ISO-8859-1, if not UTF-8) assuming that its input isn’t.

* It should be possible to convert any string into UTF-8 regardless of its
  origin character set, and then, transitively, it should be possible to
  convert back if the HTML represents text that is representable in the
  original character set.
* Converting at the boundaries of the application (as in the sketch above) is
  the way to escape the confusion of wrestling with an arbitrary number of
  different character sets.
* Proper HTML decoding requires a character set capable of representing all
  of Unicode, as the code points in numeric character references refer to
  Unicode code points and _not_ to any particular code units or byte
  sequences in any particular encoding.
* Almost every other character set is ASCII-compatible, including UTF-8,
  making the domain of problems where this arises even smaller than it might
  otherwise seem. For example, `&` is the same byte (0x26) in all of the
  common character sets.

Have a lovely weekend! And sorry for the potentially mis-threaded reply. I
couldn’t figure out how to reply to your message directly because the digest
emails were still stuck in 2020 for my account, and I didn’t switch
subscriptions until after your email went out, meaning I didn’t have a copy
of your email.

> — Rob