[PHP-DEV] Re: [RFC] Decoding HTML and the Ambiguous Ampersand

Dennis Snell Fri, 06 Sep 2024 12:02:03 -0700

All, I have updated the RFC document by adding a section on the proposed 
HtmlContext enum, which includes more contexts than were originally discussed 
(these were added to the implementation).


I’ve been a bit distracted, so this has taken a back seat, but I am still 
interested in keeping it moving forward.

https://wiki.php.net/rfc/decode_html

Warmly,
Dennis Snell

> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.sn...@a8c.com> wrote:
> 
> Greetings all,
> 
> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues 
> that I’d like to correct.
> 
>  - It’s missing 720 of HTML5’s specified named character references.
>  - 106 of these are named character references which do not require a 
> trailing semicolon, such as `&acute`
>  - It’s unaware of the ambiguous ampersand rule, which allows these 106 in 
> special circumstances.
> 
> HTML5 asserts that the list of named character references will not expand in 
> the future. It can be found authoritatively at the following URL:
> 
> https://html.spec.whatwg.org/entities.json
> 
> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 
> where ampersands were not properly encoded in attribute values, specifically 
> in URL values. For example, in a query string for a search, one might find 
> `?q=dog&not=cat`. The `&not` in that value would otherwise decode to U+00AC 
> (¬), but inside an attribute value, a reference that lacks its semicolon and 
> is followed by `=` or an alphanumeric character is left as plaintext. Inside 
> normal HTML markup it would transform into `?q=dog¬=cat`. There are related 
> nuances when numeric character references are found at the end of a string 
> or boundary without the semicolon.
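
To make the attribute-versus-markup difference concrete, here is a minimal
sketch of the rule, not the RFC's implementation. The function name
`sketch_html_decode`, the context strings, and the four-entry table are all
assumptions for illustration. The real table is the one in entities.json, and
the real rule matches the longest reference name, which this sketch ignores.

```php
<?php
// Hypothetical four-entry subset; the real HTML5 table is far larger.
const SKETCH_REFS = ['amp' => '&', 'not' => "\u{AC}", 'acute' => "\u{B4}", 'copy' => "\u{A9}"];

function sketch_html_decode(string $context, string $raw_text): string
{
    $names = implode('|', array_keys(SKETCH_REFS));

    // In an attribute, a reference without its semicolon is only decoded
    // when the next character is neither "=" nor an ASCII alphanumeric.
    // In data (normal markup), the semicolon is simply optional.
    $pattern = $context === 'attribute'
        ? "/&($names)(?:;|(?![=A-Za-z0-9]))/"
        : "/&($names);?/";

    return preg_replace_callback(
        $pattern,
        static fn (array $m): string => SKETCH_REFS[$m[1]],
        $raw_text
    ) ?? $raw_text;
}

echo sketch_html_decode('attribute', '?q=dog&not=cat'), "\n"; // left unchanged
echo sketch_html_decode('data', '?q=dog&not=cat'), "\n";      // &not becomes U+00AC
```

With the semicolon present (`&not;`), both contexts decode the reference. Only
the semicolon-less form is sensitive to what follows it.
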
> 
> The function signature of `html_entity_decode()` does not currently allow for 
> correcting this behavior. I’d like to propose an RFC or a bug fix which 
> either extends the function (perhaps by adding a new flag like 
> `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the 
> missing character references I wonder if it would be enough to add them to 
> the list of default translatable references.
> 
> One challenge with the existing function is that the concept of the 
> translation table stands in contrast with the fixed and static nature of 
> HTML5’s replacement tables. A new function or set of functions could open up 
> spec-compliant decoding while providing helpful methods that are necessary in 
> many common server-side operations:
> 
>   - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' 
> )`
>   - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, 
> $input_encoding = 'utf-8' )`
>   - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, 
> $input_encoding = 'utf-8' )`
> 
> These methods are handy for inspecting things like encoded attribute values 
> in a memory-efficient and processing-efficient way, when it’s not necessary 
> to decode the entire value. In common situations, one encounters data-URIs 
> with potentially megabytes of image data and processing only the first few or 
> tens of bytes can save a lot of overhead.
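
The early-exit idea behind a prefix check can be sketched in pure PHP. This is
an illustration under assumptions, not the proposed API:
`sketch_html_text_starts_with` is a hypothetical name, it drops the context
and encoding parameters, and it only understands four semicolon-terminated
references. The point is that it decodes one unit at a time and stops at the
first mismatch, so the rest of the value is never touched.

```php
<?php
// Hypothetical subset: only four semicolon-terminated references are known.
const SKETCH_PREFIX_REFS = ['amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"'];

function sketch_html_text_starts_with(string $raw_haystack, string $needle): bool
{
    $at      = 0; // byte offset into the still-encoded haystack
    $matched = 0; // bytes of $needle confirmed so far

    while ($matched < strlen($needle)) {
        if ($at >= strlen($raw_haystack)) {
            return false; // haystack ended before the needle did
        }

        // Decode exactly one unit: a known reference, or a single raw byte.
        $chunk = $raw_haystack[$at];
        $step  = 1;
        if ($chunk === '&' && preg_match('/\G&(amp|lt|gt|quot);/', $raw_haystack, $m, 0, $at) === 1) {
            $chunk = SKETCH_PREFIX_REFS[$m[1]];
            $step  = strlen($m[0]);
        }

        if (substr($needle, $matched, strlen($chunk)) !== $chunk) {
            return false; // first mismatch: stop without reading further
        }
        $matched += strlen($chunk);
        $at      += $step;
    }

    return true;
}

// Roughly a megabyte of payload, yet only the first 11 bytes are examined.
$data_uri = 'data:image/png;base64,' . str_repeat('A', 1000000);
var_dump(sketch_html_text_starts_with($data_uri, 'data:image/'));
var_dump(sketch_html_text_starts_with('&lt;svg&gt;', '<svg'));
```

Comparing against the decoded text also avoids false positives: the raw value
`&amp;lt;` decodes to the literal text `&lt;`, so it does not start with `<`.
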
> 
> We’re exploring pure-PHP solutions to these problems in WordPress in attempts 
> to improve the reliability and safety of handling HTML. I’d love to hear your 
> thoughts and know if anyone is willing to work with me to create an RFC or 
> directly propose patches. We’ve created a step function which allows finding 
> the next character reference and decoding it separately, enabling some novel 
> features like highlighting the character references in source text.
> 
> Should I propose an RFC for this?
> 
> Warmly,
> Dennis Snell
> Automattic Inc.
