> On Aug 22, 2024, at 5:01 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
> 
> On 20/08/2024 00:45, Dennis Snell wrote:
>> 
>>> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.sn...@a8c.com> wrote:
>>> 
>>> Greetings all,
>>> 
>>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues 
>>> that I’d like to correct.
>>> 
>>>  - It’s missing 720 of HTML5’s specified named character references.
>>>  - 106 of these are named character references which do not require a 
>>> trailing semicolon, such as `&acute`
>>>  - It’s unaware of the ambiguous ampersand rule, which governs when these 
>>> 106 may decode and when they must remain plaintext.
>>> 
>>> HTML5 asserts that the list of named character references will not expand 
>>> in the future. It can be found authoritatively at the following URL:
>>> 
>>> https://html.spec.whatwg.org/entities.json
>>> 
>>> The ambiguous ampersand rule smoothes over legacy behavior from before 
>>> HTML5 where ampersands were not properly encoded in attribute values, 
>>> specifically in URL values. For example, in a query string for a search, 
>>> one might find `?q=dog&not=cat`. The `&not` in that value would decode to 
>>> U+00AC (¬), but since it’s in an attribute value it will be left as 
>>> plaintext. Inside normal HTML markup it would transform into `?q=dog¬=cat`. 
>>> There are related nuances when numeric character references appear at 
>>> the end of a string or at a boundary without the semicolon.
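>>> 
>>> As a sketch of the expected behavior (using the html_decode() form 
>>> proposed below; names here are illustrative, not final):
>>> 
>>>    html_decode( 'data', '?q=dog&not=cat' );       // "?q=dog¬=cat"
>>>    html_decode( 'attribute', '?q=dog&not=cat' );  // "?q=dog&not=cat"
>>>    html_decode( 'attribute', '?q=dog&not;=cat' ); // "?q=dog¬=cat"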
>>> 
>>> The function signature of `html_entity_decode()` does not currently allow 
>>> for correcting this behavior. I’d like to propose an RFC or a bug fix which 
>>> either extends the function (perhaps by adding a new flag like 
>>> `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the 
>>> missing character references I wonder if it would be enough to add them to 
>>> the list of default translatable references.
>>> 
>>> One challenge with the existing function is that the concept of the 
>>> translation table stands in contrast with the fixed and static nature of 
>>> HTML5’s replacement tables. A new function or set of functions could open 
>>> up spec-compliant decoding while providing helpful methods that are 
>>> necessary in many common server-side operations:
>>> 
>>>   - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
>>>   - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>>   - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>>> 
>>> These methods are handy for inspecting things like encoded attribute values 
>>> in a memory-efficient and processing-efficient way, when it’s not necessary 
>>> to decode the entire value. In common situations, one encounters data-URIs 
>>> with potentially megabytes of image data and processing only the first few 
>>> or tens of bytes can save a lot of overhead.
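>>> 
>>> For instance (using the proposed names above, purely as a sketch):
>>> 
>>>    // Examine only the leading bytes of a potentially huge attribute
>>>    // value without allocating a fully-decoded copy of it.
>>>    $is_image = html_text_starts_with( 'attribute', $src, 'data:image/' );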
>>> 
>>> We’re exploring pure-PHP solutions to these problems in WordPress in 
>>> attempts to improve the reliability and safety of handling HTML. I’d love 
>>> to hear your thoughts and know if anyone is willing to work with me to 
>>> create an RFC or directly propose patches. We’ve created a step function 
>>> which allows finding the next character reference and decoding it 
>>> separately, enabling some novel features like highlighting the character 
>>> references in source text.
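>>> 
>>> Roughly, it can be used like this (hypothetical names, simplified):
>>> 
>>>    $at = 0;
>>>    while ( null !== ( $ref = next_character_reference( $text, $at ) ) ) {
>>>        // Each match reports where it starts, how many bytes of source
>>>        // text it spans, and what it decodes to, so a caller can decode,
>>>        // skip, or highlight each reference independently.
>>>        $at = $ref->start + $ref->length;
>>>    }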
>>> 
>>> Should I propose an RFC for this?
>>> 
>>> Warmly,
>>> Dennis Snell
>>> Automattic Inc.
>> 
>> Thanks everyone for your feedback so far on the `decode_html()` RFC 
>> [https://wiki.php.net/rfc/decode_html]
>> 
>> I’ve updated it, replacing the new constants with a new `HtmlContext` enum, 
>> and the interface seems much nicer this way. I particularly like how PHP 
>> enforces passing a valid value, vs. hoping that the right flag is used.
>> 
>> Additionally I added a section that I previously forgot, which highlights 
>> the source of the infamous mojibake/gremlins. HTML has special rules for 
>> remapping the C1 control characters, as if they had been stored or recorded 
>> as Windows-1252.
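>> 
>> For example (the remapping itself is spec-defined; the enum case name 
>> here is illustrative, not final):
>> 
>>    decode_html( HtmlContext::Text, '&#x80;' ); // "€" (U+20AC, not U+0080)
>>    decode_html( HtmlContext::Text, '&#x99;' ); // "™" (U+2122, not U+0099)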
>> 
>> Warmly,
>> Dennis Snell
>> 
> 
> Hi Dennis
> 
> +1 on the concept.
> I just have two concerns:

Thanks Niels. I appreciate the help you’ve already provided on this process, 
and the work you’ve done with lexbor.

> 
> 1) I'm not so sure that the name "decode_html" is self-descriptive enough; it 
> sounds very generic.

The name is not very important to me. For the sake of history, I chose 
“decode HTML” because, unlike an HTML parser, this is focused on taking a 
snippet of HTML “text” content and decoding it into a “plain PHP string.”

The existing `html_entity_decode()` is very close in naming but ties this 
concept into _entities_, and overlooks other basic text decoding concerns 
(newline normalization and NULL byte handling).
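
To make this concrete, a spec-compliant text decoder also performs steps 
roughly like the following (a simplified sketch, not the full rule set):

   // Newline normalization, per the input-stream preprocessing
   // in the HTML parsing specification.
   $text = str_replace( [ "\r\n", "\r" ], "\n", $text );

   // NULL bytes never survive decoding as-is: depending on the
   // context they are dropped or replaced with U+FFFD.

`html_entity_decode()` does neither of these.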

Originally I had “utf8” in the name but someone else thought it was too long 
and specific. I want the name to educate developers and also be terse. Naming 
is hard.

> 2) I would strongly suggest exploring an implementation based on Lexbor. I'm 
> pretty confident that it can be done by reusing the internal APIs. The 
> advantage is that it will be less code to maintain. You pull off some fancy 
> tricks in your implementation for performance reasons, but that also adds to 
> complexity and maintenance burden. Also since this is C, we must be extra 
> careful when implementing tricks.

Yeah I agree and I’ll share more below. The tricks I’m using in my PR 
implementing the RFC are partly there to propose adoption into PHP and partly 
there to get a real sense of my algorithm vs. those found in Chrome, Firefox, 
Safari, and lexbor. I’ve attempted to build a search algorithm for named 
character references that optimizes for cache locality in contrast to 
algorithmic complexity where RAM access is assumed to be free.

My code isn’t currently well documented and doesn’t meet the php-src coding 
standards, but the algorithm is pretty basic and easy to explain. It’s also 
mostly “unoptimized” C. I think there are still large gains to be made that 
so far I’ve been unable to see how to incorporate into the lexbor parser. 
For example, `decode_html()` assumes we’re starting already with a span of text 
that is HTML text. We’re not making conditional decisions on whether the next 
byte produces a token that escapes out of the text parsing mode.
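
To give a flavor of the approach (a simplified toy in PHP, not the actual 
data layout or the C implementation in the PR): all names sharing a first 
byte are packed into one contiguous length-prefixed string, sorted 
longest-first so the first hit is the longest match, and a lookup scans a 
short, cache-warm run of bytes instead of chasing pointers through a trie 
or hash table.

   function find_in_group( string $packed, string $text, int $at ): ?string {
       for ( $i = 0, $end = strlen( $packed ); $i < $end; $i += 1 + $len ) {
           // Each record is a one-byte length followed by the name bytes.
           $len  = ord( $packed[ $i ] );
           $name = substr( $packed, $i + 1, $len );
           if ( substr( $text, $at, $len ) === $name ) {
               return $name;
           }
       }
       return null;
   }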

> If we could have a single implementation, that would be great. I do 
> understand of course your concern that DOM is not a required extension, and 
> therefore basing the internals on Lexbor makes it tied to the DOM extension 
> which may not be available. I however suspect that a large chunk of people 
> needing a function like this have DOM available (as DOM is required by many 
> HTML-processing-related packages). I can also look into it sometime soon if 
> you want; anyway feel free to ping me.

I’m also very open to lexbor-based approaches, but I’ve so far found it more 
complicated than I expected. In part this is because it involves setting up 
the parser and state machine for the full HTML specification, while much of 
the actual decoding can be safely done without them.

The other part is the extension aspect. I hear you that you would expect 
calling code to have the DOM extension available, but that’s simply not the 
case when developing a platform like WordPress, which I do. We don’t have 
control over the servers or environments where people deploy it, and the 
availability of the DOM extension is low enough that WordPress code simply 
cannot use `DOMDocument` (even though it shouldn’t anyway, given the wild 
problems `DOMDocument` has when attempting to parse HTML).

People resort to `html_entity_decode()` because that’s the only option. In 
WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP 
its performance is far below what’s possible.

I’d love your help in setting up lexbor’s state machine to decode text nodes. 
I’d love it even more if this could be part of the PHP language. It constantly 
surprises me that _the language of the web_ (PHP) doesn’t have the tools to 
speak _the language of the web_ (HTML). This RFC is all about taking a step 
towards ensuring that PHP developers can rely on PHP to be a reliable 
middle-man between the HTML domain and the PHP domain.

In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such 
a non-starter for WordPress (accounting for 43% of the web today) that the new 
function would be completely unavailable to it.

> 
> And I do have the following thoughts:
> 1) We should already amend the ENT_HTML5-related docs to note that it's not compliant.
> 2) Perhaps ENT_HTML5 should be deprecated. E.g. you could say in your RFC 
> that ENT_HTML5 will be deprecated in the release after the version that will 
> have decode_html(). The reason I suggest the release _after_ and not the 
> _same_ release is because I strongly believe that we should have at least one 
> version where the proper alternative is available without forcing a 
> deprecation to users already.

I love this suggestion. Just for reference, since I’ve looked before and not 
found it: can someone indicate where the source of the PHP function 
documentation lives? There are a number of updates I would love to propose, 
but I don’t know where to find the content that appears at 
https://www.php.net/manual/en/function.html-entity-decode.php, for instance.

> 
> Kind regards
> Niels

Mad respect to the work you’ve brought to lexbor and to PHP. I’m excited to 
start relying on `\DOM\HtmlDocument` and have started using it in my benchmarks 
and HTML analysis as we develop the WordPress HTML API (a streaming, low 
memory-overhead, reentrant HTML parsing and manipulation framework in 
user-space PHP).

Dennis Snell
