On 10/6/2022 1:19 AM, Rowan Tommins wrote:
On 05/10/2022 22:35, David Gebler wrote:
There are multiple RFC standards for email address format but AFAIK PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.

FILTER_SANITIZE_EMAIL is a very short list of characters which claims to be based on RFC 822 section 6:
https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295

FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting to adhere to; it's one of many long unreadable regexes I've seen online claiming to cover all possible addresses. (Actually, there are now two regexes there, because there's a different version to support FILTER_FLAG_EMAIL_UNICODE.)
https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651

The idea behind my suggestion for something like is_valid_email (whatever it might be named) is as a step towards deprecating and removing the entire existing filter API, which I think many of us agree is a mess.

You described FILTER_VALIDATE_EMAIL as "notorious for being next to useless"; that gives us two possibilities:

a) A new function will be just as useless, because it will be based on the same implementation


b) There is a better implementation out there, which we should start using in ext/filter right now

For (b), well, there is always the option of handling email addresses the way the IETF intended instead of using regexes.

For example, SMTP::MakeValidEmailAddress() from:

https://github.com/cubiclesoft/ultimate-email

Does three things quite differently from ext/filter:

1) It uses a custom state engine to implement half of the relevant IETF ABNF grammars and then cheats for the other half. Very complex specifications such as those the IETF (and W3C) produce should generally be implemented in software as custom state engines (finite state machines, or FSMs). A custom state engine can identify certain common input errors and transparently fix them in very specific instances as it processes the input (e.g. gmail,com -> gmail.com happens often). State engines can also accurately remove CFWS (comments and folding whitespace) from email addresses; CFWS is not a necessary component of an address and causes all kinds of issues. When done right, a state engine can even outperform other implementations. State engines can also read partial input and maintain their internal state while using few resources, which lets them process very large inputs (not particularly relevant in this case). The current regex-based approach in ext/filter is evidently causing some problems that can probably be fixed by using a custom state engine.

Important caveat: Custom state engines run the risk of ending up in an infinite loop if the code fails to transition between states properly or fails to advance its pointer through the input, which results in DoS issues. Been there, done that - both are very easy mistakes to make.

2) It parses email addresses in reverse: domain part first, local part second. The ABNF grammars for the domain part are simpler and less contentious than those for the local part. Also, IIRC, the domain part can't contain '@' while the local part can - it's been a while since I looked at the specs though.

3) It considers sanitization and validation as being the same function. There is no separate SMTP::IsValidEmailAddress() in the library because there is no need for one. If MakeValidEmailAddress() can't turn an input into a valid email address string, it returns an error. If the returned email address is not the same as the one that was input, the original address can be viewed as technically "invalid." One shared internal function for both FILTER_SANITIZE_EMAIL and FILTER_VALIDATE_EMAIL would produce consistent output/results.
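
A minimal sketch of how the three ideas above could fit together in PHP. Everything here is hypothetical: make_valid_email() and is_valid_email() are illustrative names, not the API of ultimate-email or ext/filter, and the grammar is drastically simplified (no quoted local parts, no IP literals, no IDN, at least one dot required in the domain).

```php
<?php
/**
 * Hypothetical sketch: returns a cleaned-up address, or null if the input
 * can't be turned into one. NOT the ultimate-email implementation.
 */
function make_valid_email(string $input): ?string
{
    $input = trim($input);

    // Idea 2: split at the LAST '@' and deal with the domain first.
    $at = strrpos($input, '@');
    if ($at === false || $at === 0 || $at === strlen($input) - 1) {
        return null;
    }
    $local  = substr($input, 0, $at);
    $domain = strtolower(substr($input, $at + 1));

    // Idea 1: a tiny state engine over the domain. The for loop advances
    // the position on every iteration, so it can't spin forever.
    $state = 'label_start';
    $out   = '';
    $len   = strlen($domain);
    for ($i = 0; $i < $len; $i++) {
        $c = $domain[$i];
        if ($c === ',') {
            $c = '.'; // fix the common "gmail,com" typo
        }
        if (ctype_alnum($c)) {
            $out  .= $c;
            $state = 'in_label';
        } elseif ($c === '-' && $state !== 'label_start') {
            $out  .= $c;
            $state = 'after_hyphen';
        } elseif ($c === '.' && $state === 'in_label') {
            $out  .= $c;
            $state = 'label_start';
        } else {
            return null; // character not allowed in the current state
        }
    }
    // Must end inside a label; requiring a dot is a simplification.
    if ($state !== 'in_label' || strpos($out, '.') === false) {
        return null;
    }

    // The "cheat" for the local part: unquoted dot-atom characters only.
    $allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
             . '!#$%&\'*+-/=?^_`{|}~.';
    if (strspn($local, $allowed) !== strlen($local)
        || $local[0] === '.'
        || substr($local, -1) === '.'
        || strpos($local, '..') !== false) {
        return null;
    }

    return $local . '@' . $out;
}

// Idea 3: validation is just "sanitizing changed nothing".
function is_valid_email(string $input): bool
{
    return make_valid_email($input) === $input;
}
```

With this sketch, make_valid_email('user@Gmail,com') yields 'user@gmail.com', while is_valid_email('user@Gmail,com') is false because the input had to be changed.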


Other thoughts: I'm aware that a regex is effectively a state engine defined as a compact string. However, as evidenced by the two Perl CPAN regexes for email addresses currently in use, regexes are limited in utility and somewhat inflexible, become difficult to read and comprehend once they grow beyond a few dozen bytes, and can't readily correct errors or other problems in complex input strings. The ~250 lines of userland code referenced above are also not perfect (e.g. extracting characters using substr() is rather inefficient), but they work well enough. The userland code also performs a DNS MX record check by default, but that is its own complex can of worms and was probably not the best idea I've ever had. In any case, the three main concepts are the important takeaways here, not the referenced userland code.
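
One concrete case where a hand-written state engine is easier than a classic regex: comments in RFC 5322 can nest, so stripping them needs a depth counter. The sketch below is only an illustration under simplifying assumptions. strip_cfws() is a hypothetical name (not a function in ext/filter or ultimate-email), and it removes all whitespace outside quoted strings, which is looser than the real grammar.

```php
<?php
/**
 * Hypothetical sketch: remove RFC 5322-style comments and unquoted
 * whitespace. Returns null on unbalanced parentheses, an unterminated
 * quoted string, or a dangling backslash.
 */
function strip_cfws(string $input): ?string
{
    $out      = '';
    $depth    = 0;     // current comment nesting depth
    $inQuotes = false; // inside a "quoted string"?
    $len      = strlen($input);

    // Every path through the loop body moves $i forward at least once.
    for ($i = 0; $i < $len; $i++) {
        $c = $input[$i];

        if ($c === '\\' && ($inQuotes || $depth > 0)) {
            // quoted-pair: the backslash escapes the next character
            if ($i + 1 >= $len) {
                return null;
            }
            $i++;
            if ($inQuotes) {
                $out .= '\\' . $input[$i];
            }
            continue;
        }

        if ($inQuotes) {
            $out .= $c;
            if ($c === '"') {
                $inQuotes = false;
            }
        } elseif ($depth > 0) {
            // inside a comment: track nesting, emit nothing
            if ($c === '(') {
                $depth++;
            } elseif ($c === ')') {
                $depth--;
            }
        } elseif ($c === '(') {
            $depth = 1;
        } elseif ($c === ')') {
            return null; // closing parenthesis without an opener
        } elseif ($c === '"') {
            $inQuotes = true;
            $out .= $c;
        } elseif ($c !== ' ' && $c !== "\t" && $c !== "\r" && $c !== "\n") {
            $out .= $c;
        }
    }

    return ($depth === 0 && !$inQuotes) ? $out : null;
}
```

For example, strip_cfws('john(a (nested) comment)@example.org') gives 'john@example.org', while parentheses inside a quoted local part are left alone.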


My gut feel is that (a) is true, and there is no point considering what a new function would be called, because we don't know how to implement it.

Perhaps the above will help to at least provide some new ideas to think about/ponder.

--
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?
