This is very nice :-) Thank you, Tony. I think this will be the way to go. My concern at the moment is where best to convert to unicode. After this step the data goes into a dict, through a few processes, and then into a database. Where the input source does not state an explicit encoding, I believe I will have to assume ISO-8859-1, though it could well be cp1252 for the most part (the format says to strip the ASCII control characters 0-30 but allows characters 128-254, and most of the users are on Windows). I am thinking of stripping these characters and validating the text first, then converting to unicode (utf-8), so that everything in the dict is unicode. The other processes should then see uniform data, and it will go into the database as unicode. I think this should be ok.
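A minimal sketch of that order of operations, written for Python 3 (the code quoted below is Python 2). The helper name clean_field, the sample record, and the choice to try cp1252 first and fall back to ISO-8859-1 are assumptions for illustration, not something the format prescribes:

```python
# Bytes to delete: ASCII control characters 0-30 plus the pipe character (124).
DELETE_BYTES = bytes(range(31)) + b"|"


def clean_field(raw: bytes) -> str:
    """Strip the unwanted bytes, then decode to text (hypothetical helper)."""
    # With table=None, bytes.translate only deletes the listed bytes.
    stripped = raw.translate(None, DELETE_BYTES)
    try:
        # Assumption: most input comes from Windows users, so try cp1252 first.
        return stripped.decode("cp1252")
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined (e.g. 0x81);
        # ISO-8859-1 maps every byte, so this decode cannot fail.
        return stripped.decode("iso-8859-1")


# Everything stored in the dict is already text, so later steps see uniform data.
record = {"name": clean_field(b"Caf\xe9|\x00 Ren\xe9e\x07")}
```

Because both candidate encodings are single-byte and agree with ASCII below 128, stripping before or after decoding should give the same result here; encoding to UTF-8 would then happen only where the text is written to the database.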
Regards,
David

On Monday, October 17, 2005, at 01:48 PM, Tony Nelson wrote:

> In article <[EMAIL PROTECTED]>,
>  David Pratt <[EMAIL PROTECTED]> wrote:
>
>> I am working with a text format that advises to strip any ascii control
>> characters (0 - 30) as part of parsing data and also the ascii pipe
>> character (124) from the data. I think many of these characters are
>> from a different time. Since I have never seen most of these characters
>> in text I am not sure how these first 30 control characters are all
>> represented (other than say tab (\t), newline(\n), line return(\r) ) so
>> what should I do to remove these characters if they are ever
>> encountered. Many thanks.
>
> Most of those characters are hard to see.
>
> Represent arbitrary characters in a string in hex: "\x00\x01\x02" or
> with chr(n).
>
> If you just want to remove some characters, look into "".translate().
>
> nullxlate = "".join([chr(n) for n in xrange(256)])
> delchars = nullxlate[:31] + chr(124)
> outputstr = inputstr.translate(nullxlate, delchars)
> ________________________________________________________________________
> TonyN.:'                        [EMAIL PROTECTED]
>       '                         <http://www.georgeanelson.com/>
> --
> http://mail.python.org/mailman/listinfo/python-list