On Aug 7, 2009, at 8:13 AM, Carl wrote:

> This is an excellent article on the traps to beware of when regex'ing
> email address formats
>
> http://www.regular-expressions.info/email.html
>
> This may ignite a debate though :)

A discussion, maybe.

In the abstract, I like the idea of verifying the RFC verbatim, but we *should* be clear on what we're trying to do. Guard against typos? Prevent some kind of attack? How much do we care about false positives?

The article objects (to RFC-style checking) that j...@aol.com.nospam, for example, will validate. I'm not too concerned about that, in that there are lots of ways that a user can enter a wrong but (syntactically) valid address. We deal with that through active validation, not a syntax check.

Might there be a security concern? The quoted variation of the RFC checker is very permissive:

    "([^"\r\\]|\\["\r\\])*"

Could that open the door to some kind of injection attack? Presumably we sanitize it for display; how about when we actually use it to send mail? Any consumer that doesn't understand quoted names could end up very confused.

I take false positives as a very bad thing: if a user enters a real and valid address, I do not want to reject it. So I don't much like the explicit list of TLDs (below), on the grounds that it's bound to expand, and at some point it'll break.

From the Wikipedia TLD article:

> During the 32nd International Public ICANN Meeting in Paris in 2008,
> ICANN started a new process of TLD naming policy to take a
> "significant step forward on the introduction of new generic top-level
> domains." This program envisions the availability of many new
> or already proposed domains, as well as a new application and
> implementation process. Observers believed that the new rules could
> result in hundreds of new gTLDs to be registered. Proposed TLDs
> include music, berlin and nyc.

I think I'd favor the RFC-style pattern without the quoted-name alternation.

One thing we could do is to give the developer an option: IS_EMAIL(something or other) that lets them select one of a small number of regexes. And of course the developer can always use IS_MATCH if they don't like our choice of email filters.
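
To make the option idea concrete, here is a minimal sketch in plain Python. The function name is_email, the mode keyword, and the mode names are made up for illustration; this is not web2py's IS_EMAIL API. The "rfc_unquoted" pattern is only a stand-in for an RFC-style check without quoted names: it reuses the local part and domain labels of Carl's pattern (quoted below) and allows any syntactically valid final label. It is not the regex from earlier in the thread.

```python
import re

# Shared pieces: unquoted local part and a single domain label.
ATOM = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+"
LOCAL = ATOM + r"(?:\." + ATOM + r")*"
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
DOMAIN_PREFIX = r"(?:" + LABEL + r"\.)+"

# Final label restricted to a two-letter code or a fixed list (Carl's variation).
TLD_LIST = r"(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b"

# Hypothetical mode names, for illustration only.
PATTERNS = {
    "rfc_unquoted": LOCAL + "@" + DOMAIN_PREFIX + LABEL,
    "tld_list": LOCAL + "@" + DOMAIN_PREFIX + TLD_LIST,
}


def is_email(value, mode="rfc_unquoted"):
    """Return True if the whole of value matches the pattern selected by mode."""
    return re.fullmatch(PATTERNS[mode], value, re.IGNORECASE) is not None


if __name__ == "__main__":
    for address in ("someone@example.com", "someone@example.berlin"):
        print(address, is_email(address), is_email(address, mode="tld_list"))
```

With the TLD list, someone@example.berlin is rejected while the stand-in pattern accepts it, which is exactly the kind of false positive I'd like to avoid once new gTLDs appear.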

If we permitted a choice, I'd suggest:

1. default to the RFC regex, but without quoted names
2. RFC including quoted names
3. something like the pattern below, including the TLD filter (maybe)

> I favour this variation...
> [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
>
> C
>
> On Aug 7, 8:25 am, Jonathan Lundell <jlund...@pobox.com> wrote:
>> On Aug 7, 2009, at 12:22 AM, mdipierro wrote:
>>
>>> I will take a patch for this.
>>
>> If nobody else gets to it first, I'll work up a patch over the
>> weekend.
>>
>>> Massimo
>>
>>> On Aug 7, 1:33 am, Jonathan Lundell <jlund...@pobox.com> wrote:
>>>> On Aug 6, 2009, at 9:32 PM, DenesL wrote:
>>
>>>>> IS_EMAIL does not follow the RFC specs for valid email addresses
>>>>> (see http://en.wikipedia.org/wiki/E-mail_address)
>>
>>>>> even a simple a...@b.com fails
>>
>>>>> it is kinda late to work on the regex now, maybe tomorrow.
>>
>>>> The RFC is fairly hard to validate. If that's what we really want, I
>>>> found this one on the web that looks about right:
>>
>>>> ^(?!\.)("([^"\r\\]|\\["\r\\])*"|([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?...@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$
>>
>>>> It assumes the case-insensitive flag.
>>
>>>> http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email...
>>
>>>> Overkill? Or, what the heck?

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "web2py-users" group.
To post to this group, send email to web2py@googlegroups.com
To unsubscribe from this group, send email to web2py+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/web2py?hl=en
-~----------~----~----~----~------~----~------~--~---