This is very nice :-)  Thank you, Tony.  I think this will be the way to
go.  My concern at the moment is where it is best to convert to unicode.
After this step the data goes into a dict, through a few processes, and
then into a database. Where the input source gives no explicit encoding,
I will have to assume ISO-8859-1, I believe, though it could well be
cp1252 for the most part (the format forbids ASCII control characters
0-30 but allows characters 128-254, and most of the users are on
Windows).  I am thinking of stripping these characters and validating
the text first, then converting to unicode so that it is unicode in the
dict. The other processes should then see uniform data, and it will go
into the database as unicode (with UTF-8 as the encoding when it is
written out).  I think this should be OK.
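
As a minimal sketch of that order of operations (Python 3 syntax, not
tested against the real data): it assumes the input arrives as bytes,
tries cp1252 first and falls back to ISO-8859-1, and deletes codes 0-30
and the pipe (124) as the format advises. The names DELETE_TABLE,
to_text and clean are made up for illustration. Because codes 0-30 and
124 are the same in both encodings, stripping before or after decoding
gives the same result here.

```python
# Illustrative sketch only; the function and table names are hypothetical.

# Map code points 0-30 and the pipe (124) to None so translate() deletes them.
DELETE_TABLE = dict.fromkeys(list(range(31)) + [124])


def to_text(raw: bytes) -> str:
    """Decode bytes to text, preferring cp1252 and falling back to ISO-8859-1."""
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values (such as 0x81) undefined, while
        # ISO-8859-1 maps every byte, so this fallback cannot fail.
        return raw.decode("iso-8859-1")


def clean(raw: bytes) -> str:
    """Decode the input, then drop control characters 0-30 and the pipe."""
    return to_text(raw).translate(DELETE_TABLE)
```

With this shape the dict and the later processes only ever see text, and
an explicit clean(raw).encode("utf-8") would be needed only if the
database layer wants bytes rather than text.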


On Monday, October 17, 2005, at 01:48 PM, Tony Nelson wrote:

> In article <[EMAIL PROTECTED]>,
>  David Pratt <[EMAIL PROTECTED]> wrote:
>> I am working with a text format that advises stripping any ASCII
>> control characters (0 - 30), and also the ASCII pipe character (124),
>> from the data as part of parsing. I think many of these characters are
>> from a different time. Since I have never seen most of them in text, I
>> am not sure how these control characters are all represented (other
>> than, say, tab (\t), newline (\n) and carriage return (\r)), so what
>> should I do to remove them if they are ever encountered? Many thanks.
> Most of those characters are hard to see.
>
> Represent arbitrary characters in a string in hex: "\x00\x01\x02" or
> with chr(n).
>
> If you just want to remove some characters, look into "".translate().
>
>     # Identity table: byte n maps to itself (Python 2 str.translate)
>     nullxlate = "".join([chr(n) for n in xrange(256)])
>     # Characters to delete: codes 0-30 plus the pipe (124)
>     delchars = nullxlate[:31] + chr(124)
>     outputstr = inputstr.translate(nullxlate, delchars)
