On Sat, 23 Dec 2006 19:28:48 -0800, Belize wrote:

> Hi.
> Essence of problem in the following:
> Here is lines in utf8 of this form "BZ???TV%??DVD"
> Is it possible to split them into the fragments that contain only latin
> printable symbols (alphabet + "?#" etc)

Of course it is possible, but there probably isn't a built-in function to
do it. Write a program to do it.

> and fragments with the hieroglyphs, so it could be like this
> ['BZ?', '\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaa', 'TV%',
> '\xe3\x83\x84\xe3\x82\xad', 'DVD'] ?

    def split_fragments(s):
        """Split a string s into Latin and non-Latin fragments."""
        # Warning -- untested.
        fragments = []  # hold the string fragments
        latin = []      # temporary accumulator for Latin fragment
        nonlatin = []   # temporary accumulator for non-Latin fragment
        for c in s:
            if islatin(c):
                if nonlatin:
                    fragments.append(''.join(nonlatin))
                    nonlatin = []
                latin.append(c)
            else:
                if latin:
                    fragments.append(''.join(latin))
                    latin = []
                nonlatin.append(c)
        # Flush whichever accumulator still holds the final fragment.
        if latin:
            fragments.append(''.join(latin))
        if nonlatin:
            fragments.append(''.join(nonlatin))
        return fragments

I leave it to you to write the function islatin.

Hints:

There is a Perl module to guess the encoding:
http://search.cpan.org/~dankogai/Encode-2.18/lib/Encode/Guess.pm

You might like to read this too:
http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm

I also recommend you read this recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/251871

And look at the module unicodedata.

> Then, after translate of hieroglyphs, necessary to join line, so it
> could be like this
> "BZ? navigation TV% display DVD"

    def join_fragments(fragments):
        accumulator = []
        for fragment in fragments:
            # islatin must accept a whole fragment here, not just one character.
            if islatin(fragment):
                accumulator.append(fragment)
            else:
                accumulator.append(translate_hieroglyphics(fragment))
        # Join with spaces to match the desired "BZ? navigation TV% ..." output.
        return ' '.join(accumulator)

I leave it to you to write the function translate_hieroglyphics.

-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list
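
For anyone who wants a starting point for islatin, here is a minimal sketch. It is written for Python 3 and assumes the data has already been decoded from UTF-8 to str. It also assumes that "Latin printable" means the printable ASCII range, which the question does not pin down, so adjust the test if accented letters should count as Latin. split_with_groupby is a hypothetical name for a more compact splitter built on itertools.groupby, and unicodedata.name is used only to inspect what kind of character starts each fragment.

```python
import unicodedata
from itertools import groupby


def islatin(text):
    """Return True if every character of text is printable ASCII.

    Assumption: "Latin printable" means U+0020 (space) through U+007E (~).
    Works for a single character or for a whole fragment.
    """
    return all(' ' <= c <= '~' for c in text)


def split_with_groupby(s):
    """Split s into maximal runs of Latin and non-Latin characters."""
    return [''.join(run) for _, run in groupby(s, key=islatin)]


# The sample from the question, decoded from UTF-8 bytes to str.
raw = b'BZ?\xe3\x83\x84\xe3\x83\xbc\xe3\x83\xaaTV%\xe3\x83\x84\xe3\x82\xadDVD'
fragments = split_with_groupby(raw.decode('utf-8'))

for fragment in fragments:
    # unicodedata.name() reports the official name of a character,
    # which helps when deciding how to classify it.
    print(repr(fragment), unicodedata.name(fragment[0], 'UNNAMED'))
```

If byte strings like those in the question are needed afterwards, each fragment can be turned back with fragment.encode('utf-8').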