All,

I have updated the RFC document by adding a section on the proposed HtmlContext enum. It covers more contexts than were originally discussed; the additional ones were already added to the implementation.
As I’ve been a bit distracted, this has taken a bit of a backseat, but I am still interested in keeping it moving forward.

https://wiki.php.net/rfc/decode_html

Warmly,
Dennis Snell

> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.sn...@a8c.com> wrote:
>
> Greetings all,
>
> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues
> that I’d like to correct.
>
> - It’s missing 720 of HTML5’s specified named character references.
> - 106 of these are named character references which do not require a
> trailing semicolon, such as `&acute`
> - It’s unaware of the ambiguous ampersand rule, which allows these 106 in
> special circumstances.
>
> HTML5 asserts that the list of named character references will not expand in
> the future. It can be found authoritatively at the following URL:
>
> https://html.spec.whatwg.org/entities.json
>
> The ambiguous ampersand rule smoothes over legacy behavior from before HTML5
> where ampersands were not properly encoded in attribute values, specifically
> in URL values. For example, in a query string for a search, one might find
> `?q=dog&not=cat`. The `&not` in that value would decode to U+AC (¬), but
> since it’s in an attribute value it will be left as plaintext. Inside normal
> HTML markup it would transform into `?q=dog¬=cat`. There are related nuances
> when numeric character references are found at the end of a string or
> boundary without the semicolon.
>
> The function signature of `html_entity_decode()` does not currently allow for
> correcting this behavior. I’d like to propose an RFC or a bug fix which
> either extends the function (perhaps by adding a new flag like
> `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the
> missing character references I wonder if it would be enough to add them to
> the list of default translatable references.
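
To make the `?q=dog&not=cat` example concrete, here is a minimal pure-PHP sketch of the ambiguous ampersand rule. The function name `sketch_decode_legacy_refs()` and its three-entry lookup table are assumptions made for illustration only. This is neither the proposed API nor the WordPress implementation, and a real decoder would need the full list from entities.json plus longest-match handling (for example `&notin;`).

```php
<?php
// Hypothetical helper, not a PHP built-in. It decodes only a tiny hard-coded
// subset of the legacy named character references that may omit the semicolon.
function sketch_decode_legacy_refs( string $context, string $text ): string {
    // Illustrative subset; the authoritative list lives in entities.json.
    $legacy = array(
        'acute' => "\u{B4}",
        'amp'   => '&',
        'not'   => "\u{AC}",
    );

    return preg_replace_callback(
        // Reference name, optional semicolon, then a lookahead that captures
        // the following byte (empty at the end of the string).
        '/&(acute|amp|not)(;?)(?=(.?))/s',
        static function ( array $m ) use ( $context, $legacy ): string {
            $has_semicolon = ';' === $m[2];
            $next          = $m[3] ?? '';

            // Ambiguous ampersand rule: inside an attribute value, a reference
            // without a semicolon that is followed by "=" or an ASCII
            // alphanumeric is left as plain text.
            if (
                'attribute' === $context &&
                ! $has_semicolon &&
                1 === preg_match( '/^[=0-9A-Za-z]$/', $next )
            ) {
                return $m[0];
            }

            return $legacy[ $m[1] ];
        },
        $text
    );
}
```

With this sketch, `sketch_decode_legacy_refs( 'attribute', '?q=dog&not=cat' )` returns its input unchanged, while the `'data'` context yields `?q=dog¬=cat`. A trailing `&acute` at the very end of an attribute value is still decoded, because nothing follows it.
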
>
> One challenge with the existing function is that the concept of the
> translation table stands in contrast with the fixed and static nature of
> HTML5’s replacement tables. A new function or set of functions could open up
> spec-compliant decoding while providing helpful methods that are necessary in
> many common server-side operations:
>
> - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )`
> - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
> - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )`
>
> These methods are handy for inspecting things like encoded attribute values
> in a memory-efficient and processing-efficient way, when it’s not necessary
> to decode the entire value. In common situations, one encounters data-URIs
> with potentially megabytes of image data and processing only the first few or
> tens of bytes can save a lot of overhead.
>
> We’re exploring pure-PHP solutions to these problems in WordPress in attempts
> to improve the reliability and safety of handling HTML. I’d love to hear your
> thoughts and know if anyone is willing to work with me to create an RFC or
> directly propose patches. We’ve created a step function which allows finding
> the next character reference and decoding it separately, enabling some novel
> features like highlighting the character references in source text.
>
> Should I propose an RFC for this?
>
> Warmly,
> Dennis Snell
> Automattic Inc.
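
For anyone skimming, the prefix-check idea from the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_html_text_starts_with()` is a hypothetical name, it ignores the attribute/data distinction, and it leans on the existing `html_entity_decode()`, so it only recognizes semicolon-terminated references. The point is that it decodes one reference at a time and stops as soon as the answer is known, so a multi-megabyte value is never decoded in full.

```php
<?php
// Hypothetical sketch, not the proposed API: reports whether the decoded form
// of $raw_haystack begins with $needle, decoding only as much as is needed.
function sketch_html_text_starts_with( string $raw_haystack, string $needle ): bool {
    $at      = 0;                        // Byte offset into the raw haystack.
    $matched = 0;                        // Bytes of the needle matched so far.
    $end     = strlen( $raw_haystack );
    $goal    = strlen( $needle );

    while ( $matched < $goal ) {
        if ( $at >= $end ) {
            return false;                // The haystack ran out first.
        }

        // Step: consume one semicolon-terminated reference or one raw byte.
        $chunk = $raw_haystack[ $at ];
        $step  = 1;
        if (
            '&' === $chunk &&
            1 === preg_match( '/\G&#?[0-9A-Za-z]+;/', $raw_haystack, $m, 0, $at )
        ) {
            $decoded = html_entity_decode( $m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8' );
            if ( $decoded !== $m[0] ) {  // Unknown references stay as raw text.
                $chunk = $decoded;
                $step  = strlen( $m[0] );
            }
        }

        // Compare the decoded chunk with the next bytes of the needle.
        if ( substr( $needle, $matched, strlen( $chunk ) ) !== $chunk ) {
            return false;
        }

        $matched += strlen( $chunk );
        $at      += $step;
    }

    return true;
}
```

For example, `sketch_html_text_starts_with( 'java&#x73;cript&#58;alert(1)', 'javascript:' )` returns `true` after examining only the leading bytes of the raw value.
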