On Mon, Dec 18, 2017 at 2:29 AM, Peng Yu <pengyu...@gmail.com> wrote:
> Hi,
>
> I would like to extract "a...@efg.hij.xyz". But it only shows ".hij".
> Does anybody see what is wrong with it? Thanks.
>
> $ cat main.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
>
> import re
> email_regex = re.compile('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')
> s = 'a...@efg.hij.xyz.'
> for email in re.findall(email_regex, s):
>     print email
>
> $ ./main.py
> .hij

What is the goal of your email address extraction? There are two
likely goals: one cannot be done perfectly but doesn't need to be, and
the other cannot be done perfectly and is thus virtually useless. If
you want to detect email addresses in text and turn them into mailto:
links, it's okay to miss out some edge cases, and for that, I would
recommend keeping your regex REALLY simple - something like you have
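above, but maybe even simpler. (And I wouldn't have the parentheses in
there: when the pattern contains a capturing group, re.findall returns
only that group's text rather than the whole match, which I think is
what you're getting tripped up on.) But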
if you're trying to *validate* an email address - for instance, if you
receive a form submission and want to know if there was an email
address included - then my recommendation is simply DON'T. You can't
get all the edge cases right; it is actually impossible for a regex to
perfectly match every valid email address and no invalid addresses.
And that's only counting *syntactically* valid - it doesn't take into
account the fact that "b...@junk.example.com" is not going to get
anywhere. So if you're trying to do validation, basically just don't.
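
For the detection case, here is a rough sketch of what I mean, not a
definitive pattern. The sample text and the address in it are made up for
illustration, and the non-capturing, repeated group is just one possible
way to adjust the regex from the question. It still misses plenty of
valid addresses, which is fine if all you want is mailto: links.

```python
import re

# Hypothetical sample text; the address is invented for this example.
text = 'Write to alice@efg.hij.xyz.'

# Pattern from the question: one capturing group, matched exactly once.
with_group = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')

# Assumed adjustment: a non-capturing group that may repeat, so every
# ".label" in the domain is consumed and findall returns whole matches.
no_group = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+')

captured = with_group.findall(text)  # only the text of the group
whole = no_group.findall(text)       # the full matched addresses

# Turn each detected address into a mailto: link; \g<0> is the whole match.
linked = no_group.sub(r'<a href="mailto:\g<0>">\g<0></a>', text)

print(captured)
print(whole)
print(linked)
```

The trailing full stop in the text is left out of the match because each
dot in the domain has to be followed by at least one label character.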

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
