Re: [PHP-DEV] Sanitize filters

Thomas Hruska Tue, 11 Oct 2022 08:53:32 -0700

On 10/6/2022 1:19 AM, Rowan Tommins wrote:

On 05/10/2022 22:35, David Gebler wrote:
There are multiple RFC standards for email address format but AFAIK PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.
FILTER_SANITIZE_EMAIL is a very short list of characters which claims to be based on RFC 822 section 6:
https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295
FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting to adhere to; it's one of many long unreadable regexes I've seen online claiming to cover all possible addresses. (Actually, there are now two regexes there, because there's a different version to support FILTER_FLAG_EMAIL_UNICODE).
https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651
The idea behind my suggestion for something like is_valid_email (whatever it might be named) is as a step towards deprecating and removing the entire existing filter API, which I think many of us agree is a mess.
You described FILTER_VALIDATE_EMAIL as "notorious for being next to useless"; that gives us two possibilities:
a) A new function will be just as useless, because it will be based on the same implementation

b) There is a better implementation out there, which we should start using in ext/filter right now

For (b), well, there is always the option of handling email addresses the way the IETF intended instead of using regexes.


For example, SMTP::MakeValidEmailAddress() from:

https://github.com/cubiclesoft/ultimate-email

Does three things quite differently from ext/filter:

1) It uses a custom state engine to implement half of the relevant IETF EBNF grammars and then cheats for the other half. The very complex specifications that the IETF (and W3C) produces should generally be implemented as custom state engines (finite state machines or FSMs) in software. A custom state engine can correctly identify certain common input errors and both transparently and correctly fix those errors in very specific instances as it processes the input (e.g. gmail,com -> gmail.com happens often). State engines can also accurately and correctly do things such as remove CFWS (comments and folding whitespace) from email addresses; CFWS is not a necessary component of an email address and causes all kinds of issues. State engines, when done right, can even outperform all other functional implementations. State engines can also read partial input and maintain their internal state while using few resources to process very large inputs (not particularly relevant in this case). The current regex-based approach in ext/filter is obviously causing some problems that can probably be fixed by using a custom state engine.

Important caveat: Custom state engines do run the risk of winding up in an infinite loop when forgetting to properly transition between states or forgetting to move pointers through the input, resulting in DoS issues. Been there, done that - they are both very easy things to do.

2) It parses email addresses in reverse: Domain part first, local part second. The EBNF grammars for the domain part are simpler and less contentious than the grammars for the local part. Also, IIRC, the domain portion can't contain '@' while the local portion can - it's been a while since I looked at the specs though.

3) It considers sanitization and validation as being the same function. There is no separate SMTP::IsValidEmailAddress() in the library because there is no need for one. If MakeValidEmailAddress() can't turn an input into a valid email address string, it returns an error. If the returned email address is not the same as the one that was input, the original address can be viewed as technically "invalid." One shared internal function for both FILTER_SANITIZE_EMAIL and FILTER_VALIDATE_EMAIL would produce consistent output/results.
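
To make the three points above a little more concrete, here is a rough userland sketch of how they could fit together. It is only an illustration under stated assumptions: the function names (sketch_clean_domain, sketch_make_valid_email, sketch_is_valid_email) are invented for this example and are not the API of ultimate-email or ext/filter, the domain engine only handles plain dot-separated ASCII labels (no IP literals, no IDN, no CFWS, no length limits), and the local part is limited to the unquoted dot-atom form.

```php
<?php
// Hypothetical sketch. These function names are invented for illustration;
// they are not the API of ext/filter or of the ultimate-email library.

const SKETCH_ALNUM = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';

// Small state engine for the domain part. Assumption: plain dot-separated
// ASCII labels only (no IP literals, no IDN, no CFWS, no length limits).
function sketch_clean_domain(string $domain): ?string
{
    $state = 'label_start';
    $out = '';
    $labels = 0;
    $len = strlen($domain);

    // The for loop advances $i on every pass, so the pointer cannot stall.
    for ($i = 0; $i < $len; $i++) {
        $ch = $domain[$i];
        if ($ch === ',') {
            $ch = '.'; // the one repair this sketch makes: gmail,com -> gmail.com
        }
        $alnum = (strspn($ch, SKETCH_ALNUM) === 1);

        switch ($state) {
            case 'label_start': // a label must begin with a letter or digit
                if (!$alnum) {
                    return null;
                }
                $labels++;
                $state = 'in_label';
                break;
            case 'in_label':
                if ($ch === '.') {
                    $state = 'label_start';
                } elseif ($ch === '-') {
                    $state = 'after_hyphen';
                } elseif (!$alnum) {
                    return null;
                }
                break;
            case 'after_hyphen': // a label must not end with a hyphen
                if ($alnum) {
                    $state = 'in_label';
                } elseif ($ch !== '-') {
                    return null;
                }
                break;
        }
        $out .= $ch;
    }

    // Accept only if the input ended inside a label and had at least two labels.
    return ($state === 'in_label' && $labels >= 2) ? $out : null;
}

// Sanitize and validate in one place: returns the cleaned address or an error.
function sketch_make_valid_email(string $input): array
{
    $input = trim($input);

    // Work in reverse: split on the LAST '@' and handle the domain first.
    $pos = strrpos($input, '@');
    if ($pos === false) {
        return ['success' => false, 'error' => 'missing @'];
    }
    $domain = sketch_clean_domain(substr($input, $pos + 1));
    if ($domain === null) {
        return ['success' => false, 'error' => 'invalid domain'];
    }

    // Local part: the "cheat". Unquoted dot-atom form only, checked with strspn().
    $local = substr($input, 0, $pos);
    $allowed = SKETCH_ALNUM . '!#$%&\'*+-/=?^_`{|}~.';
    if ($local === '' || strspn($local, $allowed) !== strlen($local)
        || $local[0] === '.' || substr($local, -1) === '.'
        || strpos($local, '..') !== false) {
        return ['success' => false, 'error' => 'invalid local part'];
    }

    return ['success' => true, 'email' => $local . '@' . $domain];
}

// Validation is just sanitization plus an equality check against the input.
function sketch_is_valid_email(string $input): bool
{
    $result = sketch_make_valid_email($input);
    return $result['success'] && $result['email'] === $input;
}
```

With this sketch, sketch_make_valid_email('user@gmail,com') returns the repaired address 'user@gmail.com', while sketch_is_valid_email('user@gmail,com') returns false because the address had to be changed to become valid. A real implementation would need far more states than this, which is exactly where the infinite-loop caveat above starts to bite.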

Other thoughts: I'm aware that a regex is effectively defining a state engine as a compact string. However, as evidenced by the two Perl CPAN regexes for email addresses currently in use, regexes are limited in utility/function and are somewhat inflexible, get more difficult to read and comprehend once they get longer than a few dozen bytes, and can't readily correct errors or other problems in complex input strings. The ~250 lines of userland code referenced above is also not perfect (e.g. extracting characters using substr() is rather inefficient) but it works well enough. The userland code also performs a DNS MX record check by default, but that is its own complex can of worms and was probably not the best idea I've ever had. However, the three main concepts are the important takeaways here, not the referenced userland code.

My gut feel is that (a) is true, and there is no point considering what a new function would be called, because we don't know how to implement it.

Perhaps the above will help to at least provide some new ideas to think about/ponder.


--
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
