Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 13:33, Dave Angel wrote: > Te birthday paradox could have been important had the OP stated his goal > differently. What he said was: > > """Ideally I would want to avoid collisions altogether. But if that means > significant extra CPU time then 1 collision in 10 million hashes wou

Re: Generate unique ID for URL

2012-11-14 Thread Dave Angel
On 11/14/2012 06:29 AM, Johannes Bauer wrote: > > > When doing these calculations, it's important to keep the birthday > paradox in mind (this is kind of counter-intuitive): The chance of a > collission raises tremendously when we're looking for *any* arbitrary > two hashes colliding within a cert

Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 02:39, Roy Smith wrote: > The next step is to reduce the number of bits you are encoding. You > said in another post that "1 collision in 10 million hashes would be > tolerable". So you need: > math.log(10*1000*1000, 2) > 23.25349666421154 > > 24 bits worth of key. Nope

Re: Generate unique ID for URL

2012-11-14 Thread Richard
thanks for perspective! -- http://mail.python.org/mailman/listinfo/python-list

Re: Generate unique ID for URL

2012-11-14 Thread Johannes Bauer
On 14.11.2012 01:41, Richard Baron Penman wrote: > I found the MD5 and SHA hashes slow to calculate. Slow? For URLs? Are you kidding? How many URLs per second do you want to calculate? > The builtin hash is fast but I was concerned about collisions. What > rate of collisions could I expect? MD5

Re: Generate unique ID for URL

2012-11-13 Thread Richard
yeah good point - I have gone with md5 for now. On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote: > On Wed, Nov 14, 2012 at 2:25 PM, Richard wrote: > > > So the use case - I'm storing webpages on disk and want a quick retrieval > > system based on URL. > > > I can't sto

Re: Generate unique ID for URL

2012-11-13 Thread Chris Angelico
On Wed, Nov 14, 2012 at 2:25 PM, Richard wrote: > So the use case - I'm storing webpages on disk and want a quick retrieval > system based on URL. > I can't store the files in a single directory because of OS limitations so > have been using a sub folder structure. > For example to store data at

Re: Generate unique ID for URL

2012-11-13 Thread Richard
thanks for pointer to Varnish. I found MongoDB had a lot of size overhead so that it ended up using 4x the data stored. -- http://mail.python.org/mailman/listinfo/python-list

Re: Generate unique ID for URL

2012-11-13 Thread Roy Smith
In article <1ce88f36-bfc7-4a55-89f8-70d1645d2...@googlegroups.com>, Richard wrote: > So the use case - I'm storing webpages on disk and want a quick retrieval > system based on URL. > I can't store the files in a single directory because of OS limitations so > have been using a sub folder str

Re: Generate unique ID for URL

2012-11-13 Thread Richard
> The next step is to reduce the number of bits you are encoding. You > > said in another post that "1 collision in 10 million hashes would be > > tolerable". So you need: > > > > >>> math.log(10*1000*1000, 2) > > 23.25349666421154 I think a difficulty would be finding a hash algorithm

Re: Generate unique ID for URL

2012-11-13 Thread Richard
So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL. I can't store the files in a single directory because of OS limitations so have been using a sub folder structure. For example to store data at URL "abc": a/b/c/index.html This data is also viewed loca

Re: Generate unique ID for URL

2012-11-13 Thread Richard
I am dealing with URL's rather than integers -- http://mail.python.org/mailman/listinfo/python-list

Re: Generate unique ID for URL

2012-11-13 Thread Roy Smith
In article <0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com>, Richard wrote: > Hello, > > I want to create a URL-safe unique ID for URL's. > Currently I use: > url_id = base64.urlsafe_b64encode(url) > > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') > 'ZG9jcy5weXRob24u

Re: Generate unique ID for URL

2012-11-13 Thread Richard
I found md5 / sha 4-5 times slower than hash. And base64 a lot slower. No database or else I would just use their ID. On Wednesday, November 14, 2012 11:59:55 AM UTC+11, Christian Heimes wrote: > Am 14.11.2012 01:41, schrieb Richard Baron Penman: > > > I found the MD5 and SHA hashes slow to cal

Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
Am 14.11.2012 01:50, schrieb Richard: > These URL ID's would just be used internally for quick lookups, not exposed > publicly in a web application. > > Ideally I would want to avoid collisions altogether. But if that means > significant extra CPU time then 1 collision in 10 million hashes would

Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
Am 14.11.2012 01:41, schrieb Richard Baron Penman: > I found the MD5 and SHA hashes slow to calculate. > The builtin hash is fast but I was concerned about collisions. What > rate of collisions could I expect? Seriously? It takes about 1-5msec to sha1() one MB of data on a modern CPU, 1.5 on my bo

Re: Generate unique ID for URL

2012-11-13 Thread Richard
These URL ID's would just be used internally for quick lookups, not exposed publicly in a web application. Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable. -- http://mail.python.org/mailm

Re: Generate unique ID for URL

2012-11-13 Thread Christian Heimes
Am 14.11.2012 01:26, schrieb Chris Kaynor: > One option would be using a hash. Python's built-in hash, a 32-bit > CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist, > depending on the needs. Higher bit counts will reduce the odds of > accidental collisions; cryptographically secure

Re: Generate unique ID for URL

2012-11-13 Thread Richard Baron Penman
I found the MD5 and SHA hashes slow to calculate. The builtin hash is fast but I was concerned about collisions. What rate of collisions could I expect? Outside attacks not an issue and multiple processes would be used. On Wed, Nov 14, 2012 at 11:26 AM, Chris Kaynor wrote: > One option would be

Re: Generate unique ID for URL

2012-11-13 Thread Chris Kaynor
One option would be using a hash. Python's built-in hash, a 32-bit CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist, depending on the needs. Higher bit counts will reduce the odds of accidental collisions; cryptographically secure ones if outside attacks matter. In such a case, yo

Re: Generate unique ID for URL

2012-11-13 Thread Richard
Good point - one way encoding would be fine. Also this is performed millions of times so ideally efficient. On Wednesday, November 14, 2012 10:34:03 AM UTC+11, John Gordon wrote: > In <0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com> Richard > writes: > > > > > I want to create a URL-

Re: Generate unique ID for URL

2012-11-13 Thread John Gordon
In <0692e6a2-343c-4eb0-be57-fe5c815ef...@googlegroups.com> Richard writes: > I want to create a URL-safe unique ID for URL's. > Currently I use: > url_id = base64.urlsafe_b64encode(url) > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') > 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC