On 11/12/07, Michel Albert ([EMAIL PROTECTED]) wrote:
>On Nov 9, 11:45 pm, Bruno Desthuilliers
><[EMAIL PROTECTED]> wrote:
>> [EMAIL PROTECTED] wrote:
>>
>> > Hi,
>>
>> > I have to get a list of URLs one by one and find the URLs that
>> > appear more than once (it can't be more than twice).
>>
>> > I thought to put them into a binary search tree; this way they'd
>> > be sorted and I'd be able to check if a URL already exists.
>>
>> What about a set?
>>
>> s = set()
>> for url in urls:
>>     if url in s:
>>         print "already have ", url
>>     else:
>>         s.add(url)
>
>Interesting. For this case I usually used dicts. As in:
>
>d = {}
>for url in urls:
>    if url in d:
>        print "already have ", url
>    else:
>        d[url] = 1
>
>Now, I can see that this method carries some superfluous data (the `1`
>that is assigned to each key). So I suppose it is less memory
>efficient. But is it slower, then? Both implementations use hashes
>of the URL to access the data. Just asking out of curiosity ;)
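To the curiosity question above: both approaches do one hash lookup per URL, so the difference is mostly memory, not speed. A minimal sketch of the two, side by side, in modern Python 3 syntax (the sample `urls` list and the function names here are my own, just for illustration):

```python
def dupes_with_set(urls):
    # Bruno's approach: a set remembers only the keys.
    seen = set()
    dupes = []
    for url in urls:
        if url in seen:
            dupes.append(url)
        else:
            seen.add(url)
    return dupes

def dupes_with_dict(urls):
    # Michel's approach: a dict stores a throwaway value per key.
    seen = {}
    dupes = []
    for url in urls:
        if url in seen:
            dupes.append(url)
        else:
            seen[url] = 1
    return dupes

# Made-up sample data: 750 distinct URLs, the last 250 repeated once.
urls = ['www.example.com/%d' % (i % 750) for i in range(1000)]

# Same duplicates either way; the set just avoids storing the values.
assert dupes_with_set(urls) == dupes_with_dict(urls)
```

Both loops hash each URL exactly once per membership test, so timing differences are marginal; the set wins on memory and reads more clearly for pure membership.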
I'm pretty new at Python but found this recipe in a cookbook (thanks
O'Reilly for a useful book). I also found it online later, when the
cookbook wasn't handy:

<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66516>
<http://mail.python.org/pipermail/tutor/2005-March/036966.html>
... google for more

It uses the 'setdefault' method and works _really_ well.
<http://docs.python.org/lib/typesmapping.html>

Basically it looks like this, given:

urls = ['www.somewhere.com', 'www.somewherelse.com', 'www.somewherelse.com']

unique_urls = {}
for url in urls:
    unique_urls.setdefault(url)

... creates a dict of None's*:

{'www.somewhere.com': None, 'www.somewherelse.com': None}

and "harvesting" is trivial:

url_keys = unique_urls.keys()

If you would find an index and count useful, you could do something
like this:

for idx in range(len(urls)):
    unique_urls.setdefault(urls[idx], []).append(idx)

...

{'www.somewhere.com': [0], 'www.somewherelse.com': [1, 2]}

Later, the URLs, their index values, and counts can be found by:

url_keys = unique_urls.keys()
for url_key in url_keys:
    url_idxs = unique_urls[url_key]
    url_hits = len(url_idxs)

Depending upon the data you may have associated with your URLs, or
whatever else you're working with, this is a very sweet 'shortcut'.

I hope this helps,
Scott

* To be truly Python-esque, 'habits' would be employed and the value
would have been 'Nun', but I guess the namesake can only be carried so
far (^8
--
http://mail.python.org/mailman/listinfo/python-list
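For what it's worth, the index-tracking setdefault recipe above can also be written with the `collections` module, which expresses the same idea more directly (the variable names here are my own; `defaultdict` and `Counter` are standard library):

```python
from collections import Counter, defaultdict

urls = ['www.somewhere.com', 'www.somewherelse.com', 'www.somewherelse.com']

# Same index-tracking trick as the setdefault recipe: map each URL
# to the list of positions where it occurs.
positions = defaultdict(list)
for idx, url in enumerate(urls):
    positions[url].append(idx)

# Counter skips the positions and gives the hit counts directly.
hits = Counter(urls)

# URLs seen more than once.
duplicates = [url for url, n in hits.items() if n > 1]
```

Here `positions` ends up as `{'www.somewhere.com': [0], 'www.somewherelse.com': [1, 2]}`, matching the recipe's output, and `duplicates` is `['www.somewherelse.com']`.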