[PHP-DEV] Decoding HTML and the Ambiguous Ampersand

Dennis Snell Tue, 09 Jul 2024 17:01:32 -0700

Greetings all,


The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues that 
I’d like to correct.


 - It’s missing 720 of HTML5’s specified named character references.
 - 106 of these are named character references which do not require a trailing 
semicolon, such as `&acute`
 - It’s unaware of the ambiguous ampersand rule, which determines when these 
106 are decoded without their semicolon and when they are left as plain text.


HTML5 asserts that the list of named character references will not expand in 
the future. It can be found authoritatively at the following URL:


https://html.spec.whatwg.org/entities.json

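To illustrate how that list is shaped, here is a minimal sketch that parses a 
three-entry excerpt written in the same form as `entities.json` and picks out 
the names that omit the trailing semicolon. The excerpt is hand-copied for 
illustration only; a real check would load the full file. It assumes PHP 8 for 
`str_ends_with()`.

```php
<?php
// Hand-written excerpt in the shape of entities.json: keys are reference
// names including the leading "&"; legacy names appear both with and
// without the trailing semicolon.
$json = <<<'JSON'
{
  "&acute": { "codepoints": [180], "characters": "\u00B4" },
  "&acute;": { "codepoints": [180], "characters": "\u00B4" },
  "&notin;": { "codepoints": [8713], "characters": "\u2209" }
}
JSON;

$references = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Names that do not end in ";" are the ones that may appear unterminated.
$legacy = array_values(array_filter(
    array_keys($references),
    static fn (string $name): bool => !str_ends_with($name, ';')
));

print_r($legacy); // lists only "&acute"
```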

The ambiguous ampersand rule smooths over legacy behavior from before HTML5, 
where ampersands were not properly encoded in attribute values, specifically in 
URL values. For example, in a query string for a search, one might find 
`?q=dog&not=cat`. The `&not` in that value would ordinarily decode to U+AC (¬), 
but inside an attribute value, where it is followed by `=`, it is left as 
plaintext. Inside normal HTML markup the same text would transform into 
`?q=dog¬=cat`. There are related nuances when numeric character references are 
found at the end of a string or boundary without the semicolon.

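The rule described above can be sketched in pure PHP. This is only an 
illustration under stated assumptions: `sketch_decode()` and `SKETCH_REFS` are 
hypothetical names, the table holds a handful of entries instead of the full 
HTML5 list, and numeric character references are ignored entirely.

```php
<?php
// Tiny illustrative subset; the real HTML5 table has over 2,000 names.
const SKETCH_REFS = [
    'not'    => "\u{AC}",   // legacy name, valid without a semicolon
    'not;'   => "\u{AC}",
    'notin;' => "\u{2209}",
    'amp'    => '&',
    'amp;'   => '&',
];

/** Hypothetical helper: $context is 'attribute' or 'data'. */
function sketch_decode(string $context, string $text): string
{
    $out = '';
    $at  = 0;
    while (($amp = strpos($text, '&', $at)) !== false) {
        $out .= substr($text, $at, $amp - $at);
        $rest = substr($text, $amp + 1);

        // Find the longest known name that starts right after the ampersand.
        $match = null;
        foreach (SKETCH_REFS as $name => $replacement) {
            if (strncmp($rest, $name, strlen($name)) === 0
                && ($match === null || strlen($name) > strlen($match))) {
                $match = $name;
            }
        }

        $after        = $match === null ? '' : ($rest[strlen($match)] ?? '');
        $unterminated = $match !== null && substr($match, -1) !== ';';
        $keep_as_text = $context === 'attribute' && $unterminated
            && preg_match('/^[=A-Za-z0-9]$/', $after) === 1;

        if ($match === null || $keep_as_text) {
            // Not a reference here: emit the ampersand and move on.
            $out .= '&';
            $at   = $amp + 1;
            continue;
        }

        $out .= SKETCH_REFS[$match];
        $at   = $amp + 1 + strlen($match);
    }
    return $out . substr($text, $at);
}

echo sketch_decode('attribute', '?q=dog&not=cat'), "\n"; // ?q=dog&not=cat
echo sketch_decode('data', '?q=dog&not=cat'), "\n";      // ?q=dog¬=cat
```

The only difference between the two contexts is the check on the character 
following an unterminated name; everything else is shared.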

The function signature of `html_entity_decode()` does not currently allow for 
correcting this behavior. I’d like to propose an RFC or a bug fix which either 
extends the function (perhaps by adding a new flag like 
`ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the 
missing character references I wonder if it would be enough to add them to the 
list of default translatable references.


One challenge with the existing function is that the concept of the translation 
table stands in contrast with the fixed and static nature of HTML5’s 
replacement tables. A new function or set of functions could open up 
spec-compliant decoding while providing helpful methods that are necessary in 
many common server-side operations:


  - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
  - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, 
$input_encoding = 'utf-8' )`
  - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, 
$input_encoding = 'utf-8' )`


These functions are handy for inspecting things like encoded attribute values 
in a memory-efficient and processing-efficient way, when it’s not necessary to 
decode the entire value. Data URIs are a common case: they can carry megabytes 
of image data, and processing only the first few or tens of bytes can save a 
lot of overhead.

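As a rough sketch of the prefix check described above, the following decodes 
only as many leading bytes as the comparison needs and then stops. The name 
`sketch_text_starts_with()` is hypothetical and simpler than the proposed 
signature: it assumes UTF-8 input, ignores the attribute/data distinction, and 
handles only semicolon-terminated references by passing them one at a time to 
the existing `html_entity_decode()`.

```php
<?php
/**
 * Hypothetical sketch: does the decoded form of $raw start with $needle?
 * Decoding stops as soon as the answer is known.
 */
function sketch_text_starts_with(string $raw, string $needle): bool
{
    $decoded = '';
    $at      = 0;
    $end     = strlen($raw);
    $want    = strlen($needle);

    while (strlen($decoded) < $want && $at < $end) {
        if ($raw[$at] === '&'
            && preg_match('/\G&#?[A-Za-z0-9]+;/', $raw, $m, 0, $at) === 1) {
            // Decode this single reference; unknown names come back unchanged.
            $decoded .= html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $at      += strlen($m[0]);
        } else {
            // Plain byte: copy it through.
            $decoded .= $raw[$at];
            $at++;
        }

        // Bail out early once the decoded bytes diverge from the needle.
        $n = min(strlen($decoded), $want);
        if (strncmp($decoded, $needle, $n) !== 0) {
            return false;
        }
    }

    return strlen($decoded) >= $want && strncmp($decoded, $needle, $want) === 0;
}

// A large attribute value whose first few bytes are all that matter.
$value = 'data&#x3A;image/png;base64,' . str_repeat('A', 1000000);
var_dump(sketch_text_starts_with($value, 'data:image/png')); // bool(true)
```

The loop touches only the first 19 bytes of the one-megabyte value in this 
example, which is the saving the paragraph above describes.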

We’re exploring pure-PHP solutions to these problems in WordPress in attempts 
to improve the reliability and safety of handling HTML. I’d love to hear your 
thoughts and know if anyone is willing to work with me to create an RFC or 
directly propose patches. We’ve created a step function which allows finding 
the next character reference and decoding it separately, enabling some novel 
features like highlighting the character references in source text.

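To give a feel for what such a step function enables, here is a minimal 
sketch. It is not the WordPress implementation: `sketch_next_reference()` is a 
hypothetical name, and it only recognizes the syntax of semicolon-terminated 
references without checking names against the HTML5 table.

```php
<?php
/**
 * Hypothetical step function: finds the next semicolon-terminated character
 * reference at or after $offset. Returns [start, length], or null if none.
 */
function sketch_next_reference(string $text, int $offset = 0): ?array
{
    $pattern = '/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/';
    if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        return null;
    }
    return [$m[0][1], strlen($m[0][0])];
}

// "Highlight" each reference by wrapping it in square brackets.
$source = 'Tom &amp; Jerry &#169; 2024';
$out    = '';
$at     = 0;
while (($found = sketch_next_reference($source, $at)) !== null) {
    [$start, $length] = $found;
    $out .= substr($source, $at, $start - $at)
          . '[' . substr($source, $start, $length) . ']';
    $at = $start + $length;
}
$out .= substr($source, $at);

echo $out, "\n"; // Tom [&amp;] Jerry [&#169;] 2024
```

Because the caller controls the loop, it can stop early, decode each match 
separately, or leave the source untouched and only record the offsets.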

Should I propose an RFC for this?


Warmly,
Dennis Snell
Automattic Inc.
