On 4/05/2006 1:36 PM, Edward Elliott wrote:
> I'm looking for the "best" way to strip a large set of chars from a filename
> string (my definition of best usually means succinct and readable). I
> only want to allow alphanumeric chars, dashes, and periods. This is what I
> would write in **** (bless me father, for I have sinned...):

[expletives deleted] and it was wrong anyway (according to your
requirements); using \w would keep '_' which is *NOT* alphanumeric.

> I could just use re.sub like the second example, but that's a bit overkill.
> I'm trying to figure out if there's a good way to do the same thing with
> string methods. string.translate seems to do what I want, the problem is
> specifying the set of chars to remove. Obviously hardcoding them all is a
> non-starter.
>
> Working with chars seems to be a bit of a pain. There's no equivalent of
> the range function, one has to do something like this:
>
>>>> [chr(x) for x in range(ord('a'), ord('z')+1)]
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
> 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

>>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought required!! Monkey see, monkey type.
>>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
>>> fixer = lambda x: ''.join(c for c in x if c in keepchars)
>>> fixer('[EMAIL PROTECTED]')
'qwe456.--Howzat'
>>>

>
> Do that twice for letters, once for numbers, add in a few others, and I get
> the chars I want to keep. Then I'd invert the set and call translate.
> It's a mess and not worth the trouble. Unless there's some way to expand a
> compact representation of a char list and obtain its complement, it looks
> like I'll have to use a regex.
>
> Ideally, there would be a mythical charset module that works like this:
>
>>>> keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'

Where'd that '_' come from?

>>>> toss = charset.invert (keep)
>
> Sadly I can find no such beast. Anyone have any insight? As of now,
> regexes look like the best solution.
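
For what it's worth, the expand-and-invert idea quoted above can be roughed out without a regex. This is only a sketch, assuming Python 3 (where the three-argument `str.maketrans` builds a deletion table) and a keep set limited to ASCII letters, digits, dash, and period. `KEEP`, `TOSS`, `DELETE_TABLE`, and `strip_filename` are made-up names for illustration, not an existing charset module.

```python
import string

# Characters to keep: ASCII letters, digits, dash and period (no '_').
KEEP = set(string.ascii_letters + string.digits + "-.")

# "Invert" the set over the first 256 code points only (an assumption made
# to keep the table small); characters above 255 are NOT removed.
TOSS = "".join(chr(i) for i in range(256) if chr(i) not in KEEP)

# With three arguments, str.maketrans maps every character in the third
# argument to None, which tells translate() to delete it.
DELETE_TABLE = str.maketrans("", "", TOSS)


def strip_filename(name):
    """Hypothetical helper: delete every character listed in TOSS."""
    return name.translate(DELETE_TABLE)


print(strip_filename("my file_v2 (final).txt"))  # myfilev2final.txt
```

The string module constants play the part of the compact representation, and the comprehension over `range(256)` is the complement step.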

I'll leave it to somebody else to dredge up the standard riposte to your
last sentence :-)

One point on your requirements: replacing unwanted characters instead of
deleting them may be better -- theoretically possible problems with
deleting are:
(1) duplicates (foo and foo_ become the same)
(2) '_' becomes '' which is not a valid filename.
And a legibility problem: if you hate '_' and ' ' so much, why not change
them to '-'?

Oh and just in case the fix was accidentally applied to a path:

import os
keepchars.update(os.sep)
if os.altsep: keepchars.update(os.altsep)

HTH,
John
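
A minimal sketch of the replace-instead-of-delete suggestion, combined with the path-separator tweak. It assumes '-' as the replacement character; the `replacement` parameter and this version of `fixer` are illustrative, not the code from the interpreter session above.

```python
import os

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
keepchars = set(ALPHABET + ALPHABET.upper() + "1234567890-.")

# In case the fix is applied to a whole path, keep the separators too.
keepchars.update(os.sep)
if os.altsep:
    keepchars.update(os.altsep)


def fixer(name, replacement="-"):
    """Replace (rather than delete) each character outside keepchars."""
    return "".join(c if c in keepchars else replacement for c in name)


print(fixer("foo"))      # foo
print(fixer("foo_"))     # foo-
print(fixer("_"))        # -
print(fixer("my file"))  # my-file
```

With replacement, foo and foo_ stay distinct and a lone '_' no longer collapses to an empty name. Collisions are still possible (foo_ and "foo " both map to foo-), just less likely than with deletion.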