sonald wrote:
> Dear All,
> I am working on a module that validates the provided CSV data in a
> text format, which must be in a predefined format.
> We check for the:
> [snip]
> 3. valid-text expressions,
> Example:
>     ValidText('Minor', '[yYnN]')
>
> Parameters:
>     name  => field name
>     regex => the regular expression, y/Y for Yes & n/N for No
>
> Recently we are getting data where the name contains non-English
> characters, like: ' ATHUMANIù ', ' LUCIANA S. SENGïONGO ', etc.
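[For concreteness, here is a guess at what a check like the poster's ValidText might boil down to -- the function name and signature are the poster's, the implementation is invented:]

```python
import re

def valid_text(name, value, pattern):
    """Hypothetical sketch of a ValidText-style check: the field value
    must fully match the given regular expression."""
    if not re.fullmatch(pattern, value):
        raise ValueError("field %r: %r does not match %r"
                         % (name, value, pattern))
    return True

valid_text("Minor", "y", "[yYnN]")   # passes
```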
The offending characters are (unusually) lowercase in otherwise
uppercase strings; is this actual data, or are you typing what you
think you see instead of copying/pasting?

> Using the Text function, these names are not validated as they
> contain special characters or non-English characters (ï, ù). But the
> data is correct.

It would help a great deal if you were to tell us:
(1) what regex you are using;
(2) what encoding you believe/know the data is written in;
(3) whether your app calls locale.setlocale() at start-up.

If the following guesses are wrong, please say so.

Guess (1):
(a) you are using the pattern "[A-Za-z]" to check for alphabetic
    characters, or
(b) you are using the "\w" pattern to check for alphanumeric
    characters and then using "[\d_]" to reject digits and
    underscores.

Guess (2): "cp1252" or "latin1" or "unknown" -- all pretty much
equivalent :-)

Guess (3): No.

If guess (1b) is correct: the so-called "special" characters are not
being interpreted as alphabetic because the re module is
locale-dependent. Here is what the re docs have to say:

"""
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current
locale. If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
"""

If you are not using (1b) or something like it, you need to move in
that direction. Please bear this in mind: the locale is meant to be an
attribute/property of the *user* of your application; it is *not*
meant to be an attribute of the input data. Read the docs of the
locale module -- switching locales on the fly is *not* a good idea.

> Is there any function that can allow such special characters but not
> numbers...?
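There is no single built-in function for exactly that, but the "(1b)" combination above gets you there: match on \w, then reject digits and underscores. A sketch in Python 3 syntax, where str is already Unicode and \w is Unicode-aware by default (in Python 2 you would decode the bytes first and pass re.UNICODE explicitly); the function name is made up:

```python
import re

# Letters in any script, plus whitespace and the usual name punctuation.
WORD = re.compile(r"^[\w\s.'-]+$", re.UNICODE)
# \w also matches digits and underscore, so reject those separately.
REJECT = re.compile(r"[\d_]")

def is_alpha_field(field):
    """True if the field is made of letters (any script), spaces and
    common name punctuation, with digits and underscores rejected."""
    return bool(WORD.match(field)) and not REJECT.search(field)

print(is_alpha_field("ATHUMANI\u00f9"))   # True: accented letter accepted
print(is_alpha_field("O'Brien-Smith"))    # True
print(is_alpha_field("4567"))             # False: digits rejected
```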
The righteous way of handling your problem is:

(1) Decode each field in the incoming 8-bit string data to Unicode,
using what you know/guess to be the correct encoding. Then string
methods like isalpha() and isdigit() will use the Unicode character
properties, and your "special" characters will be recognised for what
they are.

(2) Use the UNICODE flag in re.

> Secondly, if I were to get the data in Russian text, are there any
> (lingual) packages available so that I can use the same module for
> validation?

If you are getting the data as 8-bit strings, then the above approach
should still "work" at the basic level ... you decode it using
'cp1251' or whatever, and the Cyrillic-letter equivalent of "Ivanov"
would pass muster as alphabetic.

> Such that I just have to import the package and the module can be
> used for validating Russian text or Japanese text....

Chinese, Japanese and Korean ("CJK") names are written natively in
characters that are not alphabetic in the linguistic sense. The number
of characters that could possibly appear in a name is rather large.
However, the CJK characters are classified as Unicode category "Lo"
(Letter, other) and do actually match \w in re. So with a minimal
amount of work, you can provide a basic level of validation across the
board. Anything fancier needs local knowledge [not a c.l.py topic].

Some points for consideration:

(1) You may wish not to reject digits irrevocably -- some
jurisdictions do permit people to change their legal name to "4567" or
whatever.

(2) You are of course allowing space, hyphen and apostrophe as valid
characters in "English" names, e.g. "mac Intyre", "O'Brien-Smith".
Bear in mind that other punctuation characters may be valid in other
languages -- see "local knowledge" above.

(3) If you are given data encoded as utf16* or utf32, you won't be
able to use the csv module (neither the ObjectCraft one nor the Python
one (read the docs)) directly.
You will need to recode the file as UTF-8, read it using the csv
module, and *then* decode each text field from UTF-8.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
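[Putting the recipe above together -- a minimal sketch in Python 3 syntax, where the csv module reads decoded text directly, so the recode step matters less than it did with the Python 2 csv module; the file names and field layout are invented for illustration:]

```python
import codecs
import csv
import os
import tempfile

# Pretend we were handed a UTF-16 encoded CSV file.
fd, utf16_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write("name,minor\nATHUMANI\u00f9,y\nИванов,n\n".encode("utf-16"))

# Step 1: recode the file as UTF-8 ...
utf8_path = utf16_path + ".utf8"
with codecs.open(utf16_path, encoding="utf-16") as src, \
        codecs.open(utf8_path, "w", encoding="utf-8") as dst:
    dst.write(src.read())

# Step 2: ... read it with the csv module ...
with open(utf8_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

# Step 3: validate each decoded name using Unicode character
# properties; isalpha() recognises accented Latin, Cyrillic and CJK
# (category Lo) letters alike. Space, period, apostrophe and hyphen
# are allowed too, per point (2) above.
ALLOWED_PUNCT = set(" .'-")

def looks_like_name(s):
    return bool(s) and all(ch.isalpha() or ch in ALLOWED_PUNCT
                           for ch in s)

for name, minor in rows[1:]:
    print(name, looks_like_name(name))   # both names validate

os.unlink(utf16_path)
os.unlink(utf8_path)
```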