On 10/6/2022 1:19 AM, Rowan Tommins wrote:
> On 05/10/2022 22:35, David Gebler wrote:
>> There are multiple RFC standards for email address format but AFAIK
>> PHP's FILTER_SANITIZE_EMAIL doesn't conform to any of them.
>
> FILTER_SANITIZE_EMAIL is a very short list of characters which claims to
> be based on RFC 822 section 6:
> https://heap.space/xref/php-src/ext/filter/sanitizing_filters.c?r=4df3dd76#295
>
> FILTER_VALIDATE_EMAIL doesn't say exactly which standard it's attempting
> to adhere to; it's one of many long unreadable regexes I've seen online
> claiming to cover all possible addresses. (Actually, there are now two
> regexes there, because there's a different version to support
> FILTER_FLAG_EMAIL_UNICODE).
> https://heap.space/xref/php-src/ext/filter/logical_filters.c?r=d8fc05c0#651
>
>> The idea behind my suggestion for something like is_valid_email
>> (whatever it might be named) is as a step towards deprecating and
>> removing the entire existing filter API, which I think many of us
>> agree is a mess.
>
> You described FILTER_VALIDATE_EMAIL as "notorious for being next to
> useless"; that gives us two possibilities:
>
> a) A new function will be just as useless, because it will be based on
> the same implementation
> b) There is a better implementation out there, which we should start
> using in ext/filter right now
For (b), well, there is always the option of handling email addresses
the way the IETF intended instead of using regexes.
For example, SMTP::MakeValidEmailAddress() from:
https://github.com/cubiclesoft/ultimate-email
does three things quite differently from ext/filter:
1) It uses a custom state engine to implement half of the relevant IETF
ABNF grammars and then cheats for the other half. The very complex
specifications that the IETF (and W3C) produce should generally be
implemented in software as custom state engines (finite state machines,
or FSMs). A custom state engine can identify certain common input
errors and, in very specific instances, transparently fix them as it
processes the input (e.g. gmail,com -> gmail.com happens often). State
engines can also accurately remove CFWS (comments and folding
whitespace) from email addresses. CFWS is not a necessary component of
an email address and causes all kinds of issues. State engines, when
done right, can even outperform other implementations. They can also
read partial input and maintain their internal state while using few
resources to process very large inputs (not particularly relevant in
this case). The current regex-based approach in ext/filter is evidently
causing some problems that can probably be fixed by using a custom
state engine.
Important caveat: Custom state engines do run the risk of winding up in
an infinite loop if the author forgets to properly transition between
states or to advance pointers through the input, resulting in DoS
issues. Been there, done that - both mistakes are very easy to make.
2) It parses email addresses in reverse: domain part first, local part
second. The ABNF grammars for the domain part are simpler and less
contentious than the grammars for the local part. Also, IIRC, the
domain portion can't contain '@' while the local portion can - it's been
a while since I looked at the specs though.
3) It considers sanitization and validation as being the same function.
There is no separate SMTP::IsValidEmailAddress() in the library
because there is no need for one. If MakeValidEmailAddress() can't turn
an input into a valid email address string, it returns an error. If the
returned email address is not the same as the one that was input, the
original address can be viewed as technically "invalid." One shared
internal function for both FILTER_SANITIZE_EMAIL and
FILTER_VALIDATE_EMAIL would produce consistent output/results.
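
To make the three points above a bit more concrete, here is a rough
sketch in Python (chosen only for brevity). It is not the code from the
ultimate-email library and not a proposal for ext/filter. The names
strip_cfws(), make_valid_email_address() and is_valid_email_address()
are made up for illustration, quoted local parts and domain literals
are simply rejected, and no DNS lookup is done.

```python
import string

# Characters allowed in an unquoted local part ("atext" in RFC 5322).
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")
# Characters allowed in a hostname label.
LABEL_CHARS = set(string.ascii_letters + string.digits + "-")


def strip_cfws(text):
    """Drop comments and whitespace outside quoted strings (tiny FSM).

    Returns None if a comment or quoted string is never closed.
    """
    out = []
    state = "text"  # one of: "text", "quoted", "comment"
    depth = 0       # comments may nest
    i = 0
    while i < len(text):
        ch = text[i]
        if state == "text":
            if ch == "(":
                state, depth = "comment", 1
            elif ch == '"':
                state = "quoted"
                out.append(ch)
            elif ch not in " \t\r\n":
                out.append(ch)
        elif state == "quoted":
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                i += 1
                out.append(text[i])  # keep the escaped character as-is
            elif ch == '"':
                state = "text"
        else:  # inside a comment
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    state = "text"
        i += 1  # always advance, so the loop cannot stall
    return "".join(out) if state == "text" else None


def make_valid_email_address(raw):
    """Return (True, cleaned_address) or (False, reason)."""
    cleaned = strip_cfws(raw)
    if cleaned is None:
        return False, "unterminated comment or quoted string"
    # Domain first: split on the LAST '@'.
    local, at, domain = cleaned.rpartition("@")
    if not at:
        return False, "missing '@'"
    domain = domain.replace(",", ".").lower()  # repair a common typo
    labels = domain.split(".")
    if len(labels) < 2:
        return False, "invalid domain"
    for label in labels:
        if (not label or len(label) > 63 or not set(label) <= LABEL_CHARS
                or label[0] == "-" or label[-1] == "-"):
            return False, "invalid domain"
    # Local part: plain dot-atom only in this sketch (assumption).
    atoms = local.split(".")
    if not local or len(local) > 64:
        return False, "invalid local part"
    if any(not atom or not set(atom) <= ATEXT for atom in atoms):
        return False, "invalid local part"
    return True, local + "@" + domain


def is_valid_email_address(raw):
    """Validation is just 'sanitizing changed nothing'."""
    ok, result = make_valid_email_address(raw)
    return ok and result == raw
```

With this sketch, make_valid_email_address("user@gmail,com") comes back
as (True, "user@gmail.com"), while is_valid_email_address() on the same
input is False because the input had to be repaired. Both answers come
from the same code path, which is the point of item 3.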
Other thoughts: I'm aware that a regex is effectively defining a state
engine as a compact string. However, as evidenced by the two Perl CPAN
regexes for email addresses currently in use, regexes are limited in
utility/function, are somewhat inflexible, get more difficult to read
and comprehend once they get longer than a few dozen bytes, and can't
readily correct errors or other problems in complex input strings. The
~250 lines of userland code referenced above are also not perfect (e.g.
extracting characters using substr() is rather inefficient) but they
work well enough. The userland code also performs a DNS MX record check
by default, but that is its own complex can of worms and was probably
not the best idea I've ever had. However, the three main concepts are
the important takeaways here, not the referenced userland code.

> My gut feel is that (a) is true, and there is no point considering what
> a new function would be called, because we don't know how to implement it.

Perhaps the above will help to at least provide some new ideas to think
about/ponder.
--
Thomas Hruska
CubicleSoft President
CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.
What software are you looking to build?
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php