I want to retrieve all URLs in a string. When I use re.findall, I get a list of tuples instead of a list of strings. My code is below:
[code]
import re

url = unicode(r"((http|ftp)://)?(((([\d]+\.)+){3}[\d]+(/[\w./]+)?)|([a-z]\w*((\.\w+)+){2,})([/][\w.~]*)*)")
m = re.findall(url, html)
for i in m:
    print i
[/code]

html is a string variable that contains many URLs. The code prints many tuples, and none of them looks like a URL. For example, one of them is:

(u'http://', u'http', u'image.zhongsou.com/image/netchina.gif', u'', u'', u'', u'', u'image.zhongsou.com', u'.com', u'.com', u'/netchina.gif')

Why are there two "http" entries in it, and why are there so many empty strings in the tuple above? It's obviously not a URL. How can I get the URLs correctly?

Thanks in advance.
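For context on the tuples: when a pattern contains more than one capturing group, re.findall returns one tuple per match, with one entry per group. The pattern above has 11 groups, which matches the 11 entries in the tuple shown. The two "http" entries come from the nested groups ((http|ftp)://) and (http|ftp), and groups in the alternative that did not take part in the match come back as empty strings. The following is a minimal sketch of one way to get whole matches, using re.finditer and group(0). The sample html string and the names URL_PATTERN and urls are assumptions made for illustration, and the sketch does not try to improve the pattern itself.

```python
import re

# Pattern copied unchanged from the post above; it contains 11 capturing groups.
URL_PATTERN = re.compile(
    r"((http|ftp)://)?(((([\d]+\.)+){3}[\d]+(/[\w./]+)?)|"
    r"([a-z]\w*((\.\w+)+){2,})([/][\w.~]*)*)"
)

# Hypothetical sample input standing in for the real `html` variable.
html = '<img src="http://image.zhongsou.com/image/netchina.gif"> and www.python.org'

# group(0) is the text of the whole match, no matter how many
# capturing groups the pattern contains.
urls = [m.group(0) for m in URL_PATTERN.finditer(html)]

for url in urls:
    print(url)
```

An alternative would be to rewrite every group as a non-capturing group, (?:...), in which case re.findall returns plain strings. finditer is used here only because it leaves the original pattern untouched.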