On Fri, Aug 16, 2024, at 02:59, Dennis Snell wrote:
> 
>> On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.sn...@a8c.com> wrote:
>> 
>> Greetings all,
>> 
>> The `html_entity_decode( … ENT_HTML5 … )` function has a number of issues 
>> that I’d like to correct.
>> 
>>  - It’s missing 720 of HTML5’s specified named character references.
>>  - 106 of these are named character references which do not require a 
>> trailing semicolon, such as `&acute`
>>  - It’s unaware of the ambiguous ampersand rule, which allows these 106 in 
>> special circumstances.
>> 
>> HTML5 asserts that the list of named character references will not expand in 
>> the future. It can be found authoritatively at the following URL:
>> 
>> https://html.spec.whatwg.org/entities.json
>> 
>> The ambiguous ampersand rule smooths over legacy behavior from before HTML5 
>> where ampersands were not properly encoded in attribute values, specifically 
>> in URL values. For example, in a query string for a search, one might find 
>> `?q=dog&not=cat`. The `&not` in that value would normally decode to U+00AC 
>> (¬), but since it’s in an attribute value and is followed by `=`, it will be 
>> left as plaintext. Inside normal HTML markup it would transform into 
>> `?q=dog¬=cat`. There are related nuances when numeric character references 
>> are found at the end of a string or boundary without the semicolon.
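>> 
>> To make the rule above concrete, here is a minimal sketch in PHP. It is 
>> not a proposed implementation: `sketch_html_decode()` is a hypothetical 
>> name, and it only knows three of the 106 semicolon-optional references 
>> (`&amp`, `&not`, `&acute`), so longer names such as `&notin;` are outside 
>> its scope.
>> 
>> ```php
>> <?php
>> // Hypothetical sketch; not an existing PHP function.
>> function sketch_html_decode(string $context, string $text): string
>> {
>>     // Assumption: a tiny subset of the semicolon-optional references.
>>     $legacy = ['amp' => '&', 'not' => "\u{AC}", 'acute' => "\u{B4}"];
>> 
>>     $out = preg_replace_callback(
>>         // Group 2: optional semicolon. Group 3 (lookahead only): a
>>         // following "=" or ASCII alphanumeric, if there is one.
>>         '/&(amp|not|acute)(;?)(?=([=A-Za-z0-9]?))/',
>>         function (array $m) use ($context, $legacy): string {
>>             $no_semicolon = $m[2] === '';
>>             $followed     = ($m[3] ?? '') !== '';
>>             if ($context === 'attribute' && $no_semicolon && $followed) {
>>                 return $m[0]; // Ambiguous ampersand: keep as plaintext.
>>             }
>>             return $legacy[$m[1]];
>>         },
>>         $text
>>     );
>> 
>>     return $out ?? $text; // Fall back to the input on a regex error.
>> }
>> 
>> var_dump(sketch_html_decode('attribute', '?q=dog&not=cat'));
>> // string "?q=dog&not=cat" (unchanged)
>> var_dump(sketch_html_decode('data', '?q=dog&not=cat'));
>> // string "?q=dog¬=cat"
>> ```
>> 
>> The same input therefore decodes differently depending on whether it came 
>> from an attribute value or from normal markup, which is the distinction 
>> the current function signature cannot express.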
>> 
>> The function signature of `html_entity_decode()` does not currently allow 
>> for correcting this behavior. I’d like to propose an RFC or a bug fix which 
>> either extends the function (perhaps by adding a new flag like 
>> `ENT_AMBIGUOUS_AMPERSAND`) or preferably creates a new function. For the 
>> missing character references I wonder if it would be enough to add them to 
>> the list of default translatable references.
>> 
>> One challenge with the existing function is that the concept of the 
>> translation table stands in contrast with the fixed and static nature of 
>> HTML5’s replacement tables. A new function or set of functions could open up 
>> spec-compliant decoding while providing helpful methods that are necessary 
>> in many common server-side operations:
>> 
>>   - `html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' 
>> )`
>>   - `html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, 
>> $input_encoding = 'utf-8' )`
>>   - `html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, 
>> $input_encoding = 'utf-8' )`
>> 
>> These methods are handy for inspecting things like encoded attribute values 
>> in a memory-efficient and processing-efficient way, when it’s not necessary 
>> to decode the entire value. In common situations, one encounters data-URIs 
>> with potentially megabytes of image data and processing only the first few 
>> or tens of bytes can save a lot of overhead.
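>> 
>> As a rough illustration of the prefix case, the sketch below decodes only 
>> as much of the raw value as is needed to compare against the needle. The 
>> name `sketch_text_starts_with()` is hypothetical, the context argument is 
>> omitted, and single references are handed to the existing 
>> `html_entity_decode()`, so it inherits that function's gaps (including no 
>> support for references without a trailing semicolon). It assumes UTF-8 
>> input and PHP 8.0+ for `str_starts_with()`.
>> 
>> ```php
>> <?php
>> // Hypothetical sketch; html_text_starts_with() does not exist in PHP.
>> function sketch_text_starts_with(string $raw, string $needle): bool
>> {
>>     $decoded = '';
>>     $at      = 0;
>>     $end     = strlen($raw);
>>     $want    = strlen($needle);
>>     $ref     = '/\G&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z0-9]+);/';
>> 
>>     // Stop as soon as enough decoded bytes exist to compare.
>>     while ($at < $end && strlen($decoded) < $want) {
>>         if ($raw[$at] === '&' && preg_match($ref, $raw, $m, 0, $at)) {
>>             // Decode just this one semicolon-terminated reference.
>>             $decoded .= html_entity_decode(
>>                 $m[0],
>>                 ENT_QUOTES | ENT_HTML5,
>>                 'UTF-8'
>>             );
>>             $at += strlen($m[0]);
>>         } else {
>>             $decoded .= $raw[$at++]; // Literal byte, copied as-is.
>>         }
>>     }
>> 
>>     return str_starts_with($decoded, $needle);
>> }
>> 
>> // A large payload after the prefix is never scanned.
>> $uri = 'data&#x3A;image/png;base64,' . str_repeat('A', 1 << 20);
>> var_dump(sketch_text_starts_with($uri, 'data:image/')); // bool(true)
>> ```
>> 
>> The loop touches only the first few bytes of `$uri`, regardless of how 
>> much image data follows.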
>> 
>> We’re exploring pure-PHP solutions to these problems in WordPress in 
>> attempts to improve the reliability and safety of handling HTML. I’d love to 
>> hear your thoughts and know if anyone is willing to work with me to create 
>> an RFC or directly propose patches. We’ve created a step function which 
>> allows finding the next character reference and decoding it separately, 
>> enabling some novel features like highlighting the character references in 
>> source text.
>> 
>> Should I propose an RFC for this?
>> 
>> Warmly,
>> Dennis Snell
>> Automattic Inc.
> 
> All,
> 
> I have submitted an RFC draft for including the feature proposed in this 
> thread. Thanks to everyone who helped me in this process. It’s my first RFC, 
> so I apologize in advance for any mistakes I’ve made.
> 
> https://wiki.php.net/rfc/decode_html
> 
> This is proposed for a future PHP version after 8.4.
> 
> Warmly,
> Dennis Snell

Hey Dennis,

The RFC mentions that the encoding must be UTF-8. How are programmers supposed 
to work with this if the PHP file itself isn’t UTF-8, if the input is 
meaningless in UTF-8, or if converting it to UTF-8 and back would result in 
invalid text?

— Rob
