[EMAIL PROTECTED] wrote:
On Apr 14, 10:36 am, [EMAIL PROTECTED] wrote:
  
On Apr 14, 12:02 am, Michael Bentley <[EMAIL PROTECTED]> wrote:
    
On Apr 13, 2007, at 11:49 PM, [EMAIL PROTECTED] wrote:
      
Hi,
        
I have a list of url names like this, and I am trying to strip out the
domain name using the following code:
        
http://www.cnn.com
www.yahoo.com
http://www.ebay.co.uk
        
pattern = re.compile("http:\\\\(.*)\.(.*)", re.S)
match = re.findall(pattern, line)
        
if (match):
        s1, s2 = match[0]
        
        print s2
        
but none of the sites matched. Can you please tell me what I am
missing?
        
change re.compile("http:\\\\(.*)\.(.*)", re.S) to re.compile("http:\/\/(.*)\.(.*)", re.S)
      
Thanks. I tried the code below, but when 'line' is http://www.cnn.com,
I get 'com' for s2. I want 'cnn.com' (everything after the first '.').
How can I do that?

pattern = re.compile("http:\/\/(.*)\.(.*)", re.S)
match = re.findall(pattern, line)

if (match):
    s1, s2 = match[0]
    print s2
    

Can anyone please help me with my problem?  I still can't solve it.

Basically, I want to keep only the text after the first '.' in the URL:

http://www.cnn.com -> cnn.com

  
More generally, you'll want to add a check (a 'try' block or a similar test) so that you don't return just a root domain ('com', 'edu', 'net', etc.).

Start with splitting?
>>> url = 'http://www.cnn.com'
>>> url.split('//')
['http:', 'www.cnn.com']

>>> site = url.split('//')[1:][0]
>>> site
'www.cnn.com'

>>> site.find('.')
3
>>> site[site.find('.')+1:]
'cnn.com'
>>> domain = site[site.find('.')+1:]

def getDomain(url=''):
    # Drop the scheme if present; [-1] also handles URLs without '//',
    # such as 'www.yahoo.com'.
    site = url.split('//')[-1]
    # Keep everything after the first '.'.
    domain = site[site.find('.')+1:]
    return domain
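
If you would rather stay with a regex: the pattern from the question returns 'com' because the first (.*) is greedy and swallows everything up to the last '.'. Making that group non-greedy stops it at the first '.' instead. The following is only a minimal sketch, not code from the posts above. The names strip_first_label and _PATTERN are made up for illustration, and the fallback for hosts like 'cnn.com' (return the host unchanged instead of a bare 'com') is an assumption about what you would want.

```python
import re

# Hypothetical pattern: optional "http://" or "https://", then a
# NON-greedy first label, a literal dot, and everything after it.
_PATTERN = re.compile(r"^(?:https?://)?(.*?)\.(.*)$")


def strip_first_label(line):
    """Return the text after the first '.', or None if there is no dot."""
    match = _PATTERN.match(line.strip())
    if match is None:
        return None
    rest = match.group(2)
    # Assumed guard: never return a bare root domain such as 'com'.
    # If no dot remains, hand back the host without its scheme.
    if "." not in rest:
        return match.group(0).split("//")[-1]
    return rest
```

With the three sample lines from the original question, this gives 'cnn.com', 'yahoo.com' and 'ebay.co.uk'. A line with no dot at all returns None, so the caller can decide what to do with it.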

-- 
http://mail.python.org/mailman/listinfo/python-list
