Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand

Dennis Snell Sat, 24 Aug 2024 13:32:46 -0700

On Aug 24, 2024, at 2:56 PM, Jakob Givoni <ja...@givoni.dk> wrote:
> 
> Hi Dennis,
> 
> Overall it sounds like a reasonable RFC.
>   
> > Dennis:
> >
> > > Niels:
> > >
> > > I'm not so sure that the name "decode_html" is self-descriptive enough, 
> > > it sounds very generic.
> >
> > The name is not very important to me. For the sake of history, the reason I 
> > have chosen “decode HTML” is because, unlike an HTML parser, this is 
> > focused on taking a snippet of HTML “text” content and decoding it into a 
> > “plain PHP string.”
> 
> Why not make it two methods called "decode_html_text" and 
> "decode_html_attribute"?
> Consider the following reasons:
> 1. The function doesn't actually decode html as such, it decodes either an 
> html text node string or an html attribute string.


Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum 
(due to PHP version support requirements). The Enum solution feels very nice 
too.

> 2. Saves the $context parameter and the constants/enums, making the call 
> significantly shorter. 

In my PR I’ve actually expanded the Enum to include a few other contexts. I 
feel like there’s a balance we have to do if we want to ride the line between 
fully reliable and fully convenient. On one hand, we could say “don’t send the 
text content of a SCRIPT element to this function!” But on the other hand, that 
kind of forces people to expect that SCRIPT content is different.

With the Enum there is that in-built training material when someone looks and 
finds `Attribute | BodyText | ForeignText | Script | Style` (the contexts I’ve 
explored in my PR). 

We could make the same argument for `decode_html_script()` and 
`decode_foreign_text_node()` and `decode_html_style()`. Somehow the context 
feels cleaner to me, and like a single entry point for learning instead of five.

> 3. It feels like decoding either text or attribute are two significantly 
> different things. I admit I could be wrong, if code like 
> decode_html($e->isAttritbute() ? HtmlContext::Attribute : HtmlContext::Text, 
> $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major 
dangers of using `html_entity_decode()`. The results will look just about right 
most of the time. It’s the subtle differences that matter most, I suppose. 
Thankfully, in most places I’ve seen them blurred together, the intent of the 
code someone is writing understands which is which.

    preg_replace_callback(
        ‘~<a[^>]+href=“([^”]+)”[^>]*>([^<]+)</a>~’,
        function ( $m ) {
            $title = str_replace( ‘]’, ‘\]’, html_entity_decode( $m[2] ) );
            $url = str_replace( ‘)’, ‘\)’, html_entity_decode( $m[1] ) );
            return “[{$title}]({$url})”;
        }
        $post_content
    );

The lesson I have drawn is that people frequently have what they understand to 
be a text node or an attribute value, but they aren’t aware that they are 
supposed to decode differently, and they also aren’t reaching to interact with 
a full parser to get these values. If PHP could train people as they use these 
functions, purely through their interfaces, I think that could help elevate the 
level of reliability out there in the wild, as long as they aren’t too 
cumbersome (hence explicitly no default context argument _or_ using 
separately-named functions).

Having the Enum I think enhances the ease with which people can reliably also 
decode things like SCRIPT and STYLE nodes. “I know `html_decode_text()` but I 
don’t know what the rules for SCRIPT are or if they’re different so I’ll just 
stick with that.” vs “My IDE suggests that `Script` is a different context, 
that’s interesting, I’ll try that and see how it’s different."

>  But I somehow don't foresee a lot of situations where text and attribute 
> strings end up in the same code path?

The underlying reason I started this work was in support of building an HTML 
parser. We have a streaming parser which relies on a different parsing model 
than those built purely on the state machine in the specification, taking 
advantage of what we can to eek out performance in PHP code. For this, the 
strings are in the same path, and in this work I’ve come across a number of 
other common use-cases where the flow is the same but the decoder needs to know 
the context.

 - Normalizing HTML from “tag soup” to standard serialized form.
 - Sanitizing code wanting to inspect values from different parts of the markup.
 - Sanitizing rules engines providing configurations or DSLs for sanitization.
 - Live optimizers or analyzers to improve the output HTML leaving a server.

It’s one of those things that when it becomes trivial to start getting reliable 
transforms from the HTML syntax to the decoded text, more opportunities appear 
that never seemed practical before.

> 
> A couple of other options that would silence anyone opposed to implicitly 
> favouring utf-8:
> html_text_to_utf8 and html_attribute_to_utf8

The names started with these 😀. I do agree that it gets a bit excessive though 
to the point where it risks people not adopting them purely because they don’t 
want to type that long of a name every time they use it. Perhaps some of these 🙃

    str_from_html( HtmlContext $context, string $html ): string {}

    utf8_from_html( HtmlContext $context, string $html ): string {}

    html_to_utf8( HtmlContext $context, string $html ): string {}

> 
> Best,
> Jakob
>  

Thanks for your input. I’m grateful for the discussions and that people are 
sharing.

Dennis Snell

Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand

Reply via email to