On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
>
>> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.sn...@a8c.com> wrote:
>>
>> Greetings all,
>>
>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues
>> that I’d like to correct.
>>
>> - It’s missing 720 of HTML5’s specified named character references.
>> - 106 of these are named character references which do not require a
>> trailing semicolon, such as `&acute`
>> - It’s unaware of the ambiguous ampersand rule, which allows these 106 in
>> special circumstances.
>>
>> HTML5 asserts that the list of named character references will not expand in
>> the future. It can be found authoritatively at the following URL:
>>
>> https://html.spec.whatwg.org/entities.json
>>
>> The ambiguous ampersand rule smoothes over legacy behavior from before HTML5
>> where ampersands were not properly encoded in attribute values, specifically
>> in URL values. For example, in a query string for a search, one might find
>> `?q=dog&not=cat`. The `&not` in that value would decode to U+AC (¬), but
>> since it’s in an attribute value it will be left as plaintext. Inside normal
>> HTML markup it would transform into `?q=dog¬=cat`. There are related nuances
>> when numeric character references are found at the end of a string or
>> boundary without the semicolon.
>>
>> The function signature of `html_entity_decode()` does not currently allow
>> for correcting this behavior. I’d like to propose an RFC or a bug fix which
>> either extends the function (perhaps by adding a new flag like
>> `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the
>> missing character references I wonder if it would be enough to add them to
>> the list of default translatable references.
>>
>> One challenge with the existing function is that the concept of the
>> translation table stands in contrast with the fixed and static nature of
>> HTML5’s replacement tables. A new function or set of functions could open up
>> spec-compliant decoding while providing helpful methods that are necessary
>> in many common server-side operations:
>>
>> - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8'
>> )`
>> - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle,
>> $input_encoding = 'utf-8' )`
>> - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle,
>> $input_encoding = 'utf-8' )`
>>
>> These methods are handy for inspecting things like encoded attribute values
>> in a memory-efficient and processing-efficient way, when it’s not necessary
>> to decode the entire value. In common situations, one encounters data-URIs
>> with potentially megabytes of image data and processing only the first few
>> or tens of bytes can save a lot of overhead.
>>
>> We’re exploring pure-PHP solutions to these problems in WordPress in
>> attempts to improve the reliability and safety of handling HTML. I’d love to
>> hear your thoughts and know if anyone is willing to work with me to create
>> an RFC or directly propose patches. We’ve created a step function which
>> allows finding the next character reference and decoding it separately,
>> enabling some novel features like highlighting the character references in
>> source text.
>>
>> Should I propose an RFC for this?
>>
>> Warmly,
>> Dennis Snell
>> Automattic Inc.
>
> All,
>
> I have submitted an RFC draft for including the proposed feature from this
> issue. Thanks to everyone who helped me in this process. It’s my first RFC,
> so I apologize in advance for any mistakes I’ve made in the process.
>
> https://wiki.php.net/rfc/decode_html
>
> This is proposed for a future PHP version after 8.4.
>
> Warmly,
> Dennis Snell

Hey Dennis,

The RFC mentions that the encoding must be UTF-8. How are programmers
supposed to work with this if the PHP file itself isn’t UTF-8, if the input
is meaningless in UTF-8, or if converting it to UTF-8 and back would result
in invalid text?

- Rob
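
For anyone following along, the ambiguous ampersand rule from the quoted proposal can be sketched in a few lines of PHP. This is only a rough illustration, not the RFC's implementation: `sketch_decode_legacy()` is a hypothetical name, its three-entry table is an assumed subset of the 106 semicolon-optional references, and it ignores numeric references and longest-name matching. It leaves a semicolon-less reference alone in an attribute value when the next character is `=` or an ASCII alphanumeric, and decodes it otherwise.

```php
<?php
// Hypothetical helper for illustration only; this is NOT html_decode() from
// the RFC, and it knows just three of the legacy names.
function sketch_decode_legacy(string $context, string $text): string
{
    // Assumed subset of the semicolon-optional named references.
    $legacy = ['not' => "\u{AC}", 'acute' => "\u{B4}", 'amp' => '&'];

    return preg_replace_callback(
        // Group 1: name. Group 2: optional semicolon. Group 3 (lookahead):
        // the next byte if it is "=" or an ASCII alphanumeric, else empty.
        '/&(not|acute|amp)(;?)(?=([=A-Za-z0-9]?))/',
        function (array $m) use ($context, $legacy): string {
            $missing_semicolon = $m[2] === '';
            $next_is_ambiguous = ($m[3] ?? '') !== '';
            if ($context === 'attribute' && $missing_semicolon && $next_is_ambiguous) {
                return $m[0]; // ambiguous ampersand: keep as plain text
            }
            return $legacy[$m[1]];
        },
        $text
    );
}

echo sketch_decode_legacy('attribute', '?q=dog&not=cat'), "\n"; // ?q=dog&not=cat
echo sketch_decode_legacy('data', '?q=dog&not=cat'), "\n";      // ?q=dog¬=cat

// The existing function requires the semicolon, so it never decodes this input.
echo html_entity_decode('?q=dog&not=cat', ENT_QUOTES | ENT_HTML5, 'UTF-8'), "\n"; // ?q=dog&not=cat
```

A full decoder would need the complete table from entities.json and would have to prefer the longest matching name (for example `&notin;` over `&not`), which this sketch does not attempt.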