Unicode support in python
Hi,

I am using Python 2.4.1.

I need to pass Russian text into Python and validate it. Can you please guide me on how to make my existing code support Russian text?

Is there any module that can be used for Unicode support in Python?

In the case of decimal numbers, how do I handle a comma used as the decimal point within a number?

Currently the existing code is working fine for English text. Please help.

Thanks in advance.

Regards,
Sonal
--
http://mail.python.org/mailman/listinfo/python-list
Re: Unicode support in python
Fredrik Lundh wrote:
>
> http://www.google.com/search?q=python+unicode
>
> (and before anyone starts screaming about how they hate RTFM replies, look
> at the search result)
>

Thanks! But I have already tried this, so let me tell you what I am trying now.

I have added the following line to the script:

    # -*- coding: utf-8 -*-

I have also modified site.py in ./Python24/Lib as follows:

    def setencoding():
        """Set the string encoding used by the Unicode implementation.  The
        default is 'ascii', but if you're willing to experiment, you can
        change this."""
        encoding = "utf-8" # Default value set by _PyUnicode_Init()
        if 0:
            # Enable to support locale aware default string encodings.
            import locale
            loc = locale.getdefaultlocale()
            if loc[1]:
                encoding = loc[1]
        if 0:
            # Enable to switch off string to Unicode coercion and implicit
            # Unicode to string conversion.
            encoding = "undefined"
        if encoding != "ascii":
            # On Non-Unicode builds this will raise an AttributeError...
            sys.setdefaultencoding(encoding) # Needs Python Unicode build !

Now when I try to validate the data in a text file, say abc.txt (saved with UTF-8 encoding), containing either English or Russian text, a junk character (box-like) is added as the first character. What could be the reason for this, and how do I handle it?
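One likely cause of the box-like first character (an assumption, since the file itself is not shown) is a UTF-8 byte order mark: some editors, Windows Notepad among them, write the three bytes EF BB BF at the start of a file saved as UTF-8. The sketch below is written for Python 3; strip_bom is a hypothetical helper name, and the sample record is made up. The "utf-8-sig" codec was only added in Python 2.5, so on Python 2.4 the manual comparison against codecs.BOM_UTF8 would be the route to take.

```python
import codecs


def strip_bom(raw):
    """Return raw bytes without a leading UTF-8 byte order mark, if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Simulate the bytes of a file "saved as UTF-8" by an editor that adds a BOM.
# The record content is invented for illustration.
raw = codecs.BOM_UTF8 + u"A001, тест, 10".encode("utf-8")

# Manual approach: drop the BOM bytes, then decode explicitly.
text = strip_bom(raw).decode("utf-8")

# Codec approach (Python 2.5+ and Python 3): "utf-8-sig" skips the BOM itself.
text_sig = raw.decode("utf-8-sig")
```

Either way the decoded string starts with the first real character of the record, so the first field no longer fails validation because of an invisible prefix.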
Re: Unicode support in python
Fredrik Lundh wrote:
>
> what does the word "validate" mean here?
>

Let me explain our module. We receive text files (with comma-separated values, in a predefined format) from a third party. For example, an account file comes as "abc.acc" (.acc is the extension for account files in our code). It must contain account_code, account_description and account_balance, in that order.

So the text file ("abc.acc") we receive, for 2 or more records, will look like:

    A001, test account1, 10
    A002, test account2, 50

We may have multiple .acc files.

Our job is to validate the incoming data on the basis of its datatype, field number, etc., and copy all the error-free records to acc.txt.

For this, we use a schema as follows:
--
    if account_flg == 1:
        start = time()

        # the input fields
        acct_schema = {
            0: Text('AccountCode', 50),
            1: Text('AccountDescription', 100),
            2: Text('AccountBalance', 50)
        }

        validate( schema = acct_schema,
                  primary_keys = [acct_pk],
                  infile = '../data/ACC/*.acc',
                  outfile = '../data/acc.txt',
                  update_freq = 1)
--
In core.py, we have defined a function validate, which checks the datatypes and performs other validations. All the erroneous records are copied to an error log file, and the correct records are copied to a clean acc.txt file.

The validate function is given below:
---
    def validate(infile, outfile, schema, primary_keys=[], foreign_keys=[],
                 record_checks=[], buffer_size=0, update_freq=0):

        show("initializing ... ")

        # find matching input files
        all_files = glob.glob(infile)
        if not all_files:
            raise ValueError('No input files were found.')

        # initialize data structures
        freq = update_freq or DEFAULT_UPDATE
        input = fileinput.FileInput(all_files, bufsize = buffer_size or DEFAULT_BUFFER)
        output = open(outfile, 'wb+')
        logs = {}
        for name in all_files:
            logs[name] = open(name + DEFAULT_SUFFIX, 'wb+')
            #logs[name] = open(name + DEFAULT_SUFFIX, 'a+')

        errors = []
        num_fields = len(schema)
        pk_length = range(len(primary_keys))
        fk_length = range(len(foreign_keys))
        rc_length = range(len(record_checks))

        # initialize the PKs and FKs with the given schema
        for idx in primary_keys:
            idx.setup(schema)
        for idx in foreign_keys:
            idx.setup(schema)

        # start processing: collect all lines which have errors
        for line in input:
            rec_num = input.lineno()
            if rec_num % freq == 0:
                show("processed %d records ... " % (rec_num))
                for idx in primary_keys:
                    idx.flush()
                for idx in foreign_keys:
                    idx.flush()

            if BLANK_LINE.match(line):
                continue

            try:
                data = csv.parse(line)

                # check number of fields
                if len(data) != num_fields:
                    errors.append( (rec_num, LINE_ERROR, 'incorrect number of fields') )
                    continue

                # check for well-formed fields
                fields_ok = True
                for i in range(num_fields):
                    if not schema[i].validate(data[i]):
                        errors.append( (rec_num, FIELD_ERROR, i) )
                        fields_ok = False
                        break

                # check the PKs
                for i in pk_length:
                    if fields_ok and not primary_keys[i].valid(rec_num, data):
                        errors.append( (rec_num, PK_ERROR, i) )
                        break

                # check the FKs
                for i in fk_length:
                    if fields_ok and not foreign_keys[i].valid(rec_num, data):
                        #print 'here ---> %s, rec_num : %d'%(data,rec_num)
                        errors.append( (rec_num, FK_ERROR, i) )
                        break

                # perform record-level checks
                for i in rc_length:
                    if fields_ok and not record_checks[i](schema, data):
                        errors.append( (rec_num, REC_ERROR, i) )
                        break

            except fastcsv.Error, err:
                errors.append( (rec_num, LINE_ERROR, err.__str__()) )

        # finalize the indexes to check for any more errors
        for i in pk_length:
            error_list = primary_keys[i].finalize()
            primary_keys[i].save()
            if error_list:
                errors.extend( [ (rec_num, PK_ERROR, i) for rec_num in error_list ] )

        for i in fk_length:
            error_list = foreign_keys[i].finalize()
            if error_list:
                errors.extend( [ (rec_num,
Re: Unicode support in python
Hi,

Can you please tell me if there is any package or class that I can import for internationalization or Unicode support?

This module is just a small part of our application, and we are not really supposed to alter the code. We do not have anybody here to help us with Python, and we are supposed to just try to understand the program. Today I am in a position where I can fix the bugs arising from the code, but cannot really attempt something like internationalization on my own.

Can you help? Do you want me to post the complete code for your reference? Please let me know as soon as possible.

John Roth wrote:
> sonald wrote:
> > Hi,
> > I am using python2.4.1
> >
> > I need to pass russian text into python and validate the same.
> > Can u plz guide me on how to make my existing code support the
> > russian text.
> >
> > Is there any module that can be used for unicode support in python?
> >
> > Incase of decimal numbers, how to handle "comma as a decimal point"
> > within a number
> >
> > Currently the existing code is woking fine for English text
> > Please help.
> >
> > Thanks in advance.
> >
> > regards
> > sonal
>
> As both of the other responders have said, the
> coding comment at the front only affects source
> text; it has absolutely no effect at run time. In
> particular, it's not even necessary to use it to
> handle non-English languages as long as you
> don't want to write literals in those languages.
>
> What seems to be missing is the notion that
> external files are _always_ byte files, and have to
> be _explicitly_ decoded into unicode strings,
> and then encoded back to whatever the external
> encoding needs to be, each and every time you
> read or write a file, or copy string data from
> byte strings to unicode strings and back.
> There is no good way of handling this implicitly:
> you can't simply say "utf-8" or "iso-8859-whatever"
> in one place and expect it to work.
>
> You've got to specify the encoding on each and
> every open, or else use the encode and decode
> string methods. This is a great motivation for
> eliminating duplication and centralizing your
> code!
>
> For your other question: the general words
> are localization and locale. Look up locale in
> the index. It's a strange subject which I don't
> know much about, but that should get you
> started.
>
> John Roth
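The advice above (decode on every read, encode on every write, and keep that logic in one place) can be sketched roughly as follows. This is only an illustration under stated assumptions: read_records, write_records and ENCODING are hypothetical names, the files are assumed to be UTF-8, and the sample record is invented. The code uses io.open, which exists in Python 2.6+ and Python 3; on Python 2.4, codecs.open(path, mode, encoding) plays the same role.

```python
import io
import os
import tempfile

# Assumption: the third party delivers UTF-8 files. Keeping the encoding in
# one constant is the "centralize your code" part of the advice.
ENCODING = "utf-8"


def read_records(path, encoding=ENCODING):
    """Decode a byte file into a list of unicode lines, newlines stripped."""
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip(u"\r\n") for line in handle]


def write_records(path, records, encoding=ENCODING):
    """Encode unicode lines back to bytes when writing the output file."""
    with io.open(path, "w", encoding=encoding) as handle:
        for record in records:
            handle.write(record + u"\n")


# Round trip through a temporary directory; file names mirror the thread.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "abc.acc")
clean = os.path.join(workdir, "acc.txt")

write_records(source, [u"A001, тестовый счёт, 10"])
records = read_records(source)   # unicode strings inside the program
write_records(clean, records)    # bytes again on the way out
```

With this shape, every validation routine in between sees unicode strings only, and the choice of external encoding is made in exactly one place.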
how can i change the text delimiter
Hi,

Can anybody tell me how to change the text delimiter in the FastCSV parser? By default the text delimiter is the double quote ("). I want to change it to something else, say a pipe (|). Can anyone please tell me how to go about it?
Re: how can i change the text delimiter
Hi Amit,

Thanks for the quick response.

E.g. the record is: "askin"em"

This entire text is meant to be extracted as one string, but since the qualifier is the double quote ("), the fastcsv parser is unable to parse it. If we could change the text qualifier to a pipe (|), the string would look like this: |askin"em|

But for this, the default text qualifier in the fastcsv parser needs to be changed to a pipe (|). How can I do this?

Also, please note that the string cannot be modified at all.

Thanks.

Amit Khemka wrote:
> sonald wrote:
> > Hi,
> > Can anybody tell me how to change the text delimiter in FastCSV Parser
> > ?
> > By default the text delimiter is double quotes(")
> > I want to change it to anything else... say a pipe (|)..
> > can anyone please tell me how do i go about it?
>
> You can use the parser constructor to specify the field separator:
> Python >>> parser(ms_double_quote = 1, field_sep = ',', auto_clear = 1)
>
> cheers,
> amit.
>
> --
> Amit Khemka -- onyomo.com
> Home Page: www.cse.iitd.ernet.in/~csd00377
> Endless the world's turn, endless the sun's Spinning, Endless the quest;
> I turn again, back to my own beginning, And here, find rest.
Re: how can i change the text delimiter
Hi,

Thanks for the reply.

fastcsv is the csv module for Python. The string cannot be modified because it is received from a third party, and we are not supposed to modify the data in any way.

For details on the fast CSV module, please visit

www.object-craft.com.au/projects/csv/

or see how it is used in our code. This is part of the configuration:

    import fastcsv
    csv = fastcsv.parser(strict = 1,field_sep = ',')

and somewhere in the code we use:

    data = csv.parse(line)

All I mean to say is that csv.reader is nowhere in the code, and somehow we have to modify the existing code.

Looking forward to your kind reply.

Fredrik Lundh wrote:
> "sonald" wrote:
>
> > Thanks for a quick response...
> > E.g record is: "askin"em"
>
> that's usually stored as "askin""em" in a CSV file, and the csv module
> has no problem handling that:
>
> >>> import csv, StringIO
> >>> source = StringIO.StringIO('"askin""em"\n')
> >>> list(csv.reader(source))
> [['askin"em']]
>
> to use another quote character, use the quotechar option to the reader
> function:
>
> >>> source = StringIO.StringIO('|askin"em|\n')
> >>> list(csv.reader(source, quotechar='|'))
> [['askin"em']]
>
> > Also please note that the string cannot be modified at all.
>
> not even by the Python program that reads the data? sounds scary.
>
> what's fastcsv, btw? the only thing google finds with that name is a
> Ruby library...
>
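For code that parses one line at a time, as the csv.parse(line) call above does, the standard csv module can be used in the same style, because csv.reader accepts any iterable of strings. The following is a minimal sketch in Python 3 syntax; parse_line is a hypothetical helper, not part of fastcsv or of the csv module, and the second sample record is invented.

```python
import csv


def parse_line(line, quotechar='"', delimiter=","):
    """Parse a single CSV record with the standard csv module.

    Hypothetical stand-in for a line-at-a-time parse(line) call.
    """
    # csv.reader takes any iterable of lines, so a one-element list works.
    return next(csv.reader([line], quotechar=quotechar, delimiter=delimiter))


# Default quoting: an embedded double quote must be doubled in the data.
doubled = parse_line('"askin""em"')

# Pipe as the text qualifier: the embedded double quote is ordinary data.
piped = parse_line('|askin"em|,A001', quotechar="|")
```

The quotechar argument here corresponds to what the thread calls the text qualifier; the data itself is left untouched.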
Re: how can i change the text delimiter
Hi,

I am using Python version 2.4.1, and along with it there are other installables:

1. fastcsv-1.0.1.win32-py2.4.exe
2. psyco-1.4.win32-py2.4.exe
3. scite-1.63-setup.exe

We are freshers here, newly joined, and are now handling this module, which validates data files provided by the third party in a predefined format. The data files are provided in comma-separated format.

The fastcsv package is imported in the code:

    import fastcsv

and

    csv = fastcsv.parser(strict = 1,field_sep = ',')

Can you please tell me where to find the definition of the parser function used above, so that, if possible, I can provide a parameter for the text qualifier (or text separator, or text delimiter), just like field_sep = ',' above?

I want to handle a string containing double quotes ("), but the problem is that the default text qualifier is the double quote. If I can change the default text qualifier to, say, a pipe (|), the double quote inside the string may be ignored. Please refer to the example given in my previous query.

Thanks.

Fredrik Lundh wrote:
> "sonald" wrote:
>
> > fast csv is the the csv module for Python...
>
> no, it's not. the csv module for Python is called "csv".
>
> > and actually the string cannot be modified because
> > it is received from a third party and we are not supposed to modify the
> > data in any way..
>
> that doesn't prevent you from using Python to modify it before you pass it to
> the csv parser, though.
>
> > for details on the fast CSV module please visit
> >
> > www.object-craft.com.au/projects/csv/ or
>
> that module is called "csv", not "fastcsv". and as it says on that page, a
> much improved version of that module was added to Python in version 2.3.
>
> what Python version are you using?
>
Re: how can i change the text delimiter
Hi,

Thanks a lot for the snippets you included in your post; they were quite helpful.

About the third-party data: we receive the data in CSV format, but we are not supposed to modify the files provided by the user directly. Instead we create another file with the same name and a different extension, and use the new files created by Python for further processing.

> quote_char
> Defines the character used to quote fields that
> contain the field separator or newlines. If set to None
> special characters will be escaped using the escape_char.
> # That's what you are looking for #

Yes, you got me right. I was indeed looking for quote_char.

> Aha!! Looks like some misguided person has got a copy of the
> object-craft code, renamed it fastcsv, and compiled it to run with
> Python 2.4 ... so you want some docs. The simplest thing to do is to
> ask it, e.g. like this, but with Python 2.4 (not 2.2) and call it
> fastcsv (not csv):

I guess that's true ;)

Thank you very much, and thanks a lot for the response.

John Machin wrote:
> sonald wrote:
> > Hi,
> > I am using
> > Python version python-2.4.1 and along with this there are other
> > installables
> > like:
> > 1. fastcsv-1.0.1.win32-py2.4.exe
>
> Well, you certainly didn't get that from the object-craft website --
> just go and look at their download page
> http://www.object-craft.com.au/projects/csv/download.html -- stops dead
> in 2002 and the latest windows kit is a .pyd for Python 2.2. As you
> have already been told and as the object-craft csv home-page says,
> their csv was the precursor of the Python csv module.
>
> > 2. psyco-1.4.win32-py2.4.exe
> > 3. scite-1.63-setup.exe
> >
> > We are freshers here, joined new... and are now into handling this
> > module which validates the data files, which are provided in some
> > predefined format from the third party.
> > The data files are provided in the comma separated format.
> >
> > The fastcsv package is imported in the code...
> > import fastcsv
> > and
> > csv = fastcsv.parser(strict = 1,field_sep = ',')
>
> Aha!! Looks like some misguided person has got a copy of the
> object-craft code, renamed it fastcsv, and compiled it to run with
> Python 2.4 ... so you want some docs. The simplest thing to do is to
> ask it, e.g. like this, but with Python 2.4 (not 2.2) and call it
> fastcsv (not csv):
>
> ... command-prompt...>\python22\python
> Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import csv
> >>> help(csv.parser)
> Help on built-in function parser:
>
> parser(...)
>     parser(ms_double_quote = 1, field_sep = ',',
>            auto_clear = 1, strict = 0,
>            quote_char = '"', escape_char = None) -> Parser
>
>     Constructs a CSV parser object.
>
>     ms_double_quote
>         When True, quotes in a fields must be doubled up.
>
>     field_sep
>         Defines the character that will be used to separate
>         fields in the CSV record.
>
>     auto_clear
>         When True, calling parse() will automatically call
>         the clear() method if the previous call to parse() raised an
>         exception during parsing.
>
>     strict
>         When True, the parser will raise an exception on
>         malformed fields rather than attempting to guess the right
>         behavior.
>
>     quote_char
>         Defines the character used to quote fields that
>         contain the field separator or newlines. If set to None
>         special characters will be escaped using the escape_char.
> # That's what you are looking for #
>     escape_char
>         Defines the character used to escape special
>         characters. Only used if quote_char is None.
>
> >>> help(csv)
> Help on module csv:
>
> NAME
>     csv - This module provides class for performing CSV parsing and
>     writing.
>
> FILE
>     SOMEWHERE\csv.pyd
>
> DESCRIPTION
>     The CSV parser object (returned by the parser() function) supports the
>     following methods:
>         clear()
>             Discards all fields parsed so far. If auto_clear is set to
>             zero. You should call this after a parser exception.
>
>         parse(string) -> list of strings
>             Extracts fields from the (partial) CSV re
How to allow special characters like ï, ù, acute e etc...
Dear All,

I am working on a module that validates CSV data supplied as text, which must be in a predefined format. We check for:

1. The number of fields provided in the text file.
2. Text checks for the maximum length of the field and whether the field is mandatory or optional.
   Example: Text('Description', 100, optional=True)
   Parameters:
     "Name of the field" => 'Description'
     "Max length" => 100
     "Optional" => 'True' (the field is not mandatory)
3. Valid-text expressions.
   Example: ValidText('Minor', '[yYnN]')
   Parameters:
     name => field name
     regex => the regular expression (y/Y for Yes and n/N for No)

Recently we have been getting data where the name contains non-English characters, like ' ATHUMANIù ', ' LUCIANA S. SENGïONGO ', etc.

Using the Text function, these names are not validated, as they contain special or non-English characters (ï, ù). But the data is correct.

Is there any function that allows such special characters but not numbers?

Secondly, if I were to get the data in Russian text, are there any language packages available so that I can use the same module for validation, such that I just have to import the package and the module can then validate Russian or Japanese text?

Regards,
Sonal.
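One possible way to accept letters from any language while still rejecting digits is a Unicode-aware regular expression. The sketch below is in Python 3, where str patterns are Unicode-aware by default; on Python 2 the input would first have to be decoded to unicode and the pattern compiled with the re.UNICODE flag. NAME_PATTERN and is_valid_name are hypothetical names, not part of the module described above, and the set of allowed separators (space, dot, apostrophe, hyphen) is an assumption about what a name may contain.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", which leaves letters of any script (Latin, Cyrillic, ...).
# Letter runs may be joined by spaces, dots, apostrophes or hyphens.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ .'-]+[^\W\d_]+)*\.?")


def is_valid_name(value, max_length=100):
    """Return True for names made of letters in any language, without digits."""
    text = value.strip()
    if not text or len(text) > max_length:
        return False
    return NAME_PATTERN.fullmatch(text) is not None


checks = {
    u" ATHUMANIù ": is_valid_name(u" ATHUMANIù "),
    u" LUCIANA S. SENGïONGO ": is_valid_name(u" LUCIANA S. SENGïONGO "),
    u"Иванов": is_valid_name(u"Иванов"),
    u"ACC123": is_valid_name(u"ACC123"),
}
```

No language-specific package is needed for this kind of check: once the input has been decoded to unicode, the same pattern covers Russian text as well as accented Latin letters.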