Danny Yoo wrote:
> On Thu, 20 Oct 2005, Tomas Markus wrote:
>
>> what is the most effective way to check a file for not allowed
>> characters or how to check it for allowed only characters (which might
>> be i.e. ASCII only).
>
> If the file is small enough to fit into memory, you might use regular
> expressions as a sledgehammer.  See:
>
>     http://www.amk.ca/python/howto/regex/
>
> for a small tutorial on regular expressions.  But unless performance is a
> real concern, doing a character-by-character scan shouldn't be too
> horrendous.

Hi Danny,

I was going to ask why you think regex is a sledgehammer for this one.
Then I decided to try the two alternatives and found out it is actually
faster to scan for individual characters than to use a regex and look
for them all at once!

Here is a program that scans a string for test chars, either using a
single regex search or by individually searching for the test chars. The
test data set doesn't include any of the test chars, so it is a worst
case (neither scan terminates early):

    # FindAny.py
    import re, string

    # ~130,000 letters, none of which is a test char
    # (string.letters is Python 2; on Python 3 use string.ascii_letters)
    data = string.letters * 2500
    testchars = string.digits + string.whitespace
    # One character class containing every test char
    testRe = re.compile('[' + testchars + ']')

    def findRe():
        # Single regex pass over the data
        return testRe.search(data) is not None

    def findScan():
        # One substring search over the data per test char
        for c in testchars:
            if c in data:
                return True
        return False

and here are the results of timing calls to findRe() and findScan():

    F:\Tutor>python -m timeit -s "from FindAny import findRe, findScan" "findRe()"
    100 loops, best of 3: 2.29 msec per loop

    F:\Tutor>python -m timeit -s "from FindAny import findRe, findScan" "findScan()"
    100 loops, best of 3: 2.04 msec per loop

Surprised the heck out of me! When in doubt, measure! When you think you
know, measure anyway; you are probably wrong!

Kent
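
For what it's worth, here is a minimal sketch of how both approaches might be applied to Tomas's original question (finding the first character that is not allowed). It is written for Python 3, unlike FindAny.py above, and it assumes that "allowed" means 7-bit ASCII and that the file fits in memory. The names first_bad_re, first_bad_scan and check_file are made up for illustration; they do not come from the thread.

```python
import re

# Assumption: "allowed" means 7-bit ASCII (code points 0-127).
NON_ASCII_RE = re.compile(r'[^\x00-\x7f]')


def first_bad_re(text):
    """Return (offset, char) of the first non-ASCII char, or None (regex version)."""
    match = NON_ASCII_RE.search(text)
    if match is None:
        return None
    return match.start(), match.group()


def first_bad_scan(text):
    """Return (offset, char) of the first non-ASCII char, or None (char-by-char version)."""
    for offset, char in enumerate(text):
        if ord(char) > 0x7f:
            return offset, char
    return None


def check_file(path, finder=first_bad_re):
    """Read the whole file and report its first disallowed character.

    Decoding as latin-1 maps every byte to one code point, and newline=''
    disables newline translation, so the offset is a byte offset.
    """
    with open(path, encoding='latin-1', newline='') as f:
        return finder(f.read())
```

Either finder can be passed to check_file(), e.g. check_file("somefile.txt", finder=first_bad_scan). Which one is faster will depend on the data and the Python version, so in the spirit of the above: time both with timeit on your own files before choosing.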