japanese encoding iso-2022-jp in python vs. perl
Hi,

I am rather new to python, and am currently struggling with some encoding issues. I have some utf-8-encoded text which I need to encode as iso-2022-jp before sending it out to the world. I am using python's encode functions:

--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple of utf-8 japanese characters which python can't correctly convert to iso-2022-jp. The output looks like this:

↓東京???日比谷線?北千住行

However, if I use perl's Encode module to re-encode the exact same bit of text:

--
$var = encode("iso-2022-jp", decode("utf8", $var));
print $var;
--

I get proper output (no unsightly question marks):

↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these characters? I know there are a host of different iso-2022-jp variants; could it be using a different one than I think (the default)?

I'm quite liking python at the moment for a variety of different reasons (I suspect perl will forever win when it comes to regular expressions but everything else is pretty darn nice), but this is a bit worrying.

-Joe
-- http://mail.python.org/mailman/listinfo/python-list
Re: japanese encoding iso-2022-jp in python vs. perl
Thanks Leo, and everyone else, these were very helpful replies. The issue was exactly as Leo described, and I apologize for not being aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and full-width kana. I only need to be able to rely on any particular kana character being translated correctly to its half-width or full-width equivalent, and I need the Japanese I send out to be readable.

I appreciate the 'implicit versus explicit' point, and have read about it in a few different python mailing lists. In this instance it seems that perl perhaps ought to flash a warning notification regarding what it is doing, but as this conversion between half-width and full-width characters is by far the most logical one available, it also seems reasonable that python might perhaps include such capabilities by default, just as it currently includes the 'replace' option for mapping missed characters generically to '?'.

I still haven't worked out the entire mapping routine, but Leo's hint is probably sufficient to get it working with a bit more effort.

Again, thanks for the help.

-Joe

> Thanks that I have my crystal ball working. I can see clearly that the fourth
> character of the input is 'HALFWIDTH KATAKANA LETTER ME' (U+FF92) which is
> not present in ISO-2022-JP as defined by RFC 1468 so python converts it into
> question mark as you requested. Meanwhile perl as usual is trying to guess what
> you want and silently converts that character into 'KATAKANA LETTER ME' (U+30E1)
> which is present in ISO-2022-JP.
>
> > Why can't python properly encode some of these
> > characters?
>
> Because "Explicit is better than implicit". Do you care about roundtripping?
> Do you care about width of characters? What about full-width " (U+FF02)? Python
> doesn't know answers to these questions so it doesn't do anything with your
> input. You have to do it yourself. Assuming you don't care about roundtripping
> and width here is an example demonstrating how to deal with narrow characters:
>
>     from unicodedata import normalize
>     iso2022_squeezing = dict((i, normalize('NFKC', unichr(i))) for i in range(0xFF61, 0xFFE0))
>     print repr(u'\uFF92'.translate(iso2022_squeezing))
>
> It prints u'\u30e1'. Feel free to ask questions if something is not clear.
>
> Note, this is just an example, I *don't* claim it does what you want for any character
> in FF61-FFDF range. You may want to carefully review the whole unicode
> block: http://www.unicode.org/charts/PDF/UFF00.pdf
>
> -- Leo.
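For readers on a current interpreter, Leo's snippet can be sketched roughly as follows. This assumes Python 3 (where `unichr` is spelled `chr` and all strings are unicode), and the helper names `squeeze_halfwidth` and `to_iso2022jp` are made up for illustration. The extra NFC pass is an addition that is not in Leo's example: translating one character at a time leaves a voiced kana as base letter plus combining mark, and NFC joins the two again. Leo's caveat still applies, so review the FF61-FFDF block before relying on it.

```python
import unicodedata

# Same range Leo used (U+FF61..U+FFDF); each code point maps to its NFKC form.
HALFWIDTH_TABLE = {
    cp: unicodedata.normalize("NFKC", chr(cp)) for cp in range(0xFF61, 0xFFE0)
}

def squeeze_halfwidth(text):
    """Replace half-width forms, then recompose voiced-sound marks."""
    widened = text.translate(HALFWIDTH_TABLE)
    # Character-by-character mapping leaves e.g. KA + combining dakuten as
    # two code points; NFC joins them into the single precomposed GA.
    return unicodedata.normalize("NFC", widened)

def to_iso2022jp(text):
    """Hypothetical helper: widen first, then encode strictly (no 'replace')."""
    return squeeze_halfwidth(text).encode("iso-2022-jp")

sample = "\uff92\uff84\uff9b"      # half-width ME, TO, RO
print(squeeze_halfwidth(sample))   # the same three kana, full-width
print(to_iso2022jp(sample))
```

Encoding without 'replace' means any character that is still unmappable raises UnicodeEncodeError instead of quietly turning into '?', which makes leftover problem characters easier to spot.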
dictionary of dictionaries
Hi,

I'm wondering what the best practice is for creating an extensible dictionary-of-dictionaries in python?

In perl I would just do something like:

    my %hash_of_hashes;
    for(my $i=0;$i<10;$i++){
        for(my $j=0;$j<10;$j++){
            ${$hash_of_hashes{$i}}{$j} = int(rand(10));
        }
    }

but it seems to be more hassle to replicate this in python. I've found a couple of references around the web but they seem cumbersome. I'd like something compact.

-joe
Re: dictionary of dictionaries
On Dec 9, 5:49 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Sun, 09 Dec 2007 00:35:18 -0800, kettle wrote:
> > Hi,
> > I'm wondering what the best practice is for creating an extensible
> > dictionary-of-dictionaries in python?
> >
> > In perl I would just do something like:
> >
> >     my %hash_of_hashes;
> >     for(my $i=0;$i<10;$i++){
> >         for(my $j=0;$j<10;$j++){
> >             ${$hash_of_hashes{$i}}{$j} = int(rand(10));
> >         }
> >     }
> >
> > but it seems to be more hassle to replicate this in python. I've
> > found a couple of references around the web but they seem cumbersome.
> > I'd like something compact.
>
> Use `collections.defaultdict`:
>
>     from collections import defaultdict
>     from random import randint
>
>     data = defaultdict(dict)
>     for i in xrange(11):
>         for j in xrange(11):
>             data[i][j] = randint(0, 10)
>
> If the keys `i` and `j` are not "independent" you might use a "flat"
> dictionary with a tuple of both as keys:
>
>     data = dict(((i, j), randint(0, 10)) for i in xrange(11) for j in xrange(11))
>
> And just for completeness: The given data in the example can be stored in a
> list of lists of course:
>
>     data = [[randint(0, 10) for dummy in xrange(11)] for dummy in xrange(11)]
>
> Ciao,
> Marc 'BlackJack' Rintsch

Thanks for the heads up. Indeed it's just as nice as perl. One more question though: this defaultdict seems to only work with python2.5+. In the case of python < 2.5 it seems I have to do something like:

    #!/usr/bin/python
    from random import randint

    dict_dict = {}
    for x in xrange(10):
        for y in xrange(10):
            r = randint(0,10)
            try:
                dict_dict[x][y] = r
            except:
                if x in dict_dict:
                    dict_dict[x][y] = r
                else:
                    dict_dict[x] = {}
                    dict_dict[x][y] = r

What I really want to / need to be able to do is autoincrement the values when I hit another word. Again in perl I'd just do something like:

    my %my_hash;
    while(){
        chomp;
        @_ = split(/\s+/);
        grep{$my_hash{$_}++} @_;
    }

and this generalizes transparently to a hash of hashes or hash of a hash of hashes etc. In python < 2.5 this seems to require something like:

    for line in file:
        words = line.split()
        for word in words:
            my_dict[word] = 1 + my_dict.get(word, 0)

which I guess I can generalize to a dict of dicts, but it seems it will require more if/else statements to check whether or not the higher-level keys exist. I guess the real answer is that I should just migrate to python2.5...!

-joe
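With `collections.defaultdict` available, the autoincrementing "hash of hashes" asked about above might look something like the sketch below. It assumes Python 3 syntax, and the function name `count_pairs` and the choice of counting adjacent word pairs are invented purely for illustration; nothing in the thread says this is the actual data being counted.

```python
from collections import defaultdict

def count_pairs(lines):
    """Count how often each word is followed by another: counts[prev][word]."""
    # The outer dict creates an inner counter on first access, and the inner
    # counter starts every new key at int() == 0, so no existence checks are needed.
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        words = line.split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

counts = count_pairs(["a b a b", "a c"])
print({key: dict(inner) for key, inner in counts.items()})
```

One thing to keep in mind with this design is that merely reading `counts[x][y]` for a pair that was never seen creates the entry, which is the price of the convenience.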
Re: dictionary of dictionaries
On Dec 10, 6:58 pm, Peter Otten <[EMAIL PROTECTED]> wrote:
> kettle wrote:
> > On Dec 9, 5:49 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> >> On Sun, 09 Dec 2007 00:35:18 -0800, kettle wrote:
> >> > Hi,
> >> > I'm wondering what the best practice is for creating an extensible
> >> > dictionary-of-dictionaries in python?
> >> >
> >> > In perl I would just do something like:
> >> >
> >> >     my %hash_of_hashes;
> >> >     for(my $i=0;$i<10;$i++){
> >> >         for(my $j=0;$j<10;$j++){
> >> >             ${$hash_of_hashes{$i}}{$j} = int(rand(10));
> >> >         }
> >> >     }
> >> >
> >> > but it seems to be more hassle to replicate this in python. I've
> >> > found a couple of references around the web but they seem cumbersome.
> >> > I'd like something compact.
> >>
> >> Use `collections.defaultdict`:
> >>
> >>     from collections import defaultdict
> >>     from random import randint
> >>
> >>     data = defaultdict(dict)
> >>     for i in xrange(11):
> >>         for j in xrange(11):
> >>             data[i][j] = randint(0, 10)
> >>
> >> If the keys `i` and `j` are not "independent" you might use a "flat"
> >> dictionary with a tuple of both as keys:
> >>
> >>     data = dict(((i, j), randint(0, 10)) for i in xrange(11) for j in xrange(11))
> >>
> >> And just for completeness: The given data in the example can be stored in a
> >> list of lists of course:
> >>
> >>     data = [[randint(0, 10) for dummy in xrange(11)] for dummy in xrange(11)]
> >>
> >> Ciao,
> >> Marc 'BlackJack' Rintsch
> >
> > Thanks for the heads up. Indeed it's just as nice as perl. One more
> > question though, this defaultdict seems to only work with python2.5+
> > in the case of python < 2.5 it seems I have to do something like:
> >
> >     #!/usr/bin/python
> >     from random import randint
> >
> >     dict_dict = {}
> >     for x in xrange(10):
> >         for y in xrange(10):
> >             r = randint(0,10)
> >             try:
> >                 dict_dict[x][y] = r
> >             except:
> >                 if x in dict_dict:
> >                     dict_dict[x][y] = r
> >                 else:
> >                     dict_dict[x] = {}
> >                     dict_dict[x][y] = r
>
> You can clean that up a bit:
>
>     from random import randrange
>
>     dict_dict = {}
>     for x in xrange(10):
>         dict_dict[x] = dict((y, randrange(11)) for y in xrange(10))
>
> > what I really want to / need to be able to do is autoincrement the
> > values when I hit another word. Again in perl I'd just do something
> > like:
> >
> >     my %my_hash;
> >     while(){
> >         chomp;
> >         @_ = split(/\s+/);
> >         grep{$my_hash{$_}++} @_;
> >     }
> >
> > and this generalizes transparently to a hash of hashes or hash of a
> > hash of hashes etc. In python < 2.5 this seems to require something
> > like:
> >
> >     for line in file:
> >         words = line.split()
> >         for word in words:
> >             my_dict[word] = 1 + my_dict.get(word, 0)
> >
> > which I guess I can generalize to a dict of dicts but it seems it will
> > require more if/else statements to check whether or not the higher-
> > level keys exist. I guess the real answer is that I should just
> > migrate to python2.5...!
>
> Well, there's also dict.setdefault()
>
>     >>> pairs = ["ab", "ab", "ac", "bc"]
>     >>> outer = {}
>     >>> for a, b in pairs:
>     ...     inner = outer.setdefault(a, {})
>     ...     inner[b] = inner.get(b, 0) + 1
>     ...
>     >>> outer
>     {'a': {'c': 1, 'b': 2}, 'b': {'c': 1}}
>
> and it's not hard to write your own defaultdict
>
>     >>> class Dict(dict):
>     ...     def __getitem__(self, key):
>     ...         return self.get(key, 0)
>     ...
>     >>> d = Dict()
>     >>> for c in "abbbcdeafgh": d[c] += 1
>     ...
>     >>> d
>     {'a': 2, 'c': 1, 'b': 3, 'e': 1, 'd': 1, 'g': 1, 'f': 1, 'h': 1}
>
> Peter

Nice, thanks for all the tips! I knew there had to be some handier python ways to do these things. My initial attempts were just what occurred to me first given my still limited knowledge of the language and its idioms. Thanks again!

-joe
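Peter's `dict.setdefault()` session can be wrapped into a small reusable helper, sketched here in Python 3 syntax. The name `increment` and its `amount` parameter are hypothetical and only serve to show how the two-level update avoids any if/else checks on the outer key.

```python
def increment(outer, first, second, amount=1):
    """Add `amount` to outer[first][second], creating both levels as needed."""
    # setdefault returns the existing inner dict, or stores and returns a new one.
    inner = outer.setdefault(first, {})
    inner[second] = inner.get(second, 0) + amount
    return outer

outer = {}
for a, b in ["ab", "ab", "ac", "bc"]:   # same pairs as in Peter's session
    increment(outer, a, b)
print(outer)
```

Because it uses only plain dicts, this approach also works on interpreters that predate `collections.defaultdict`, which was the constraint raised earlier in the thread.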
Re: dictionary of dictionaries
On Dec 10, 6:58 pm, Peter Otten <[EMAIL PROTECTED]> wrote:
> [...]
> and it's not hard to write your own defaultdict
>
>     >>> class Dict(dict):
>     ...     def __getitem__(self, key):
>     ...         return self.get(key, 0)
>     ...
>     >>> d = Dict()
>     >>> for c in "abbbcdeafgh": d[c] += 1
>     ...
>     >>> d
>     {'a': 2, 'c': 1, 'b': 3, 'e': 1, 'd': 1, 'g': 1, 'f': 1, 'h': 1}
>
> Peter

One last question. I've heard the 'Explicit vs. Implicit' argument, but this seems to boil down to a question of typical use cases and what most people 'expect' as default behavior. The Dict subclass above, which overrides __getitem__, seems more generally useful than the real default. What is the reasoning behind NOT using this as the default implementation for a dict in python?
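The thread does not record an answer to this, but one commonly cited trade-off can at least be demonstrated: a dict that invents a value for every missing key also hides misspelled keys. The sketch below assumes Python 3 and re-declares Peter's subclass under the hypothetical name `ZeroDict`; the `stock` data is invented for illustration.

```python
class ZeroDict(dict):
    """Peter's subclass under a hypothetical name: missing keys read as 0."""
    def __getitem__(self, key):
        return self.get(key, 0)

stock = {"apple": 3}
lenient = ZeroDict(stock)

# A misspelled key is silently treated as zero...
typo_result = lenient["aple"]
# ...and the read stores nothing, so a membership test still reports it missing.
stored_after_read = "aple" in lenient

# A plain dict reports the same mistake immediately.
try:
    stock["aple"]
    raised = False
except KeyError:
    raised = True

print(typo_result, stored_after_read, raised)
```

Whether that safety net matters more than the convenience depends on the use case, which is presumably why the defaulting behaviour is offered as an opt-in rather than built into every dict.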
socket script from perl -> python
Hi,

I have a socket script, written in perl, which I use to send audio data from one server to another. I would like to rewrite this in python, replicating exactly the functionality of the perl script, so that I can incorporate it into a larger python program. Unfortunately I still don't really have the hang of socket programming in python. The relevant parts of the perl script are below:

    $host = '127.0.0.1';
    my $port = 3482;
    my $proto = getprotobyname('tcp');
    my $iaddr = inet_aton($host);
    my $paddr = sockaddr_in($port, $iaddr);

    # create the socket, connect to the port
    socket(SOCKET, PF_INET, SOCK_STREAM, $proto) or die "socket: $!";
    connect(SOCKET, $paddr) or die "connect: $!";

    my $length = length($converted_audio);

    # pack $length as a 32-bit network-independent long
    my $len = pack('N', $length);

    #print STDERR "LENGTH: $length\n";

    SOCKET->autoflush();
    print SOCKET "r";
    print SOCKET $len;
    print SOCKET "$converted_audio\n";

    while(defined($line = )) {
        do something here...
    }

I've used python's socket library to connect to the server, and verified that the first piece of data, 'r', is read correctly; the sticking point seems to be the $len variable. I've tried using socket.htonl() and the other less likely variants, but nothing seems to produce the desired result, which would be to have the server-side message print the same 'length' as the length printed by the client. The python I've tried looked like this:

    from socket import *

    host = '127.0.0.1'
    port = 3482
    addr = (host, port)
    s = socket(AF_INET, SOCK_STREAM)
    s.connect(addr)

    f = open('/home/myuname/socket.wav','rb')
    audio = ""
    for line in f:
        audio += line

    leng = htonl(len(audio))
    print leng
    s.send('r')
    s.send(leng)
    s.send(audio)
    s.send("\n")
    s.flush()

Of course I'd also like to s.recv() the results from the server, but first I need to properly calculate the length and send it as a network-independent long. Any tips on how to do this would be greatly appreciated!
Re: socket script from perl -> python
On Feb 8, 12:08 am, Bjoern Schliessmann wrote:
> kettle wrote:
> > Hi I have a socket script, written in perl, which I use to send
> > audio data from one server to another. I would like to rewrite
> > this in python so as to replicate exactly the functionality of the
> > perl script, so as to incorporate this into a larger python
> > program. Unfortunately I still don't really have the hang of
> > socket programming in python.
>
> Socket programming in Python is just like socket programming in C. I
> suppose with Perl it's the same.

True, but I'm not talking about the concepts, I'm talking about the idioms, which in python I don't know.

> > # pack $length as a 32-bit network-independent long
> > my $len = pack('N', $length);
> > [...]
> > I've used python's socket library to connect to the server, and
> > verified that the first piece of data, 'r', is read correctly; the
> > sticking point seems to be the $len variable. I've tried using
> > socket.htonl() and the other less likely variants, but nothing
> > seems to produce the desired result, which would be to have the
> > server-side message print the same 'length' as the length printed
> > by the client.
>
> Try struct.calcsize.

Thanks for the suggestion, I hadn't tried that one.

-joe

> Regards,
>
> Björn
Re: socket script from perl -> python
On Feb 8, 4:01 am, Hrvoje Niksic <[EMAIL PROTECTED]> wrote:
> kettle <[EMAIL PROTECTED]> writes:
> > # pack $length as a 32-bit network-independent long
> > my $len = pack('N', $length);
> [...]
> > the sticking point seems to be the $len variable.
>
> Use len = struct.pack('!L', length) in Python.
> See http://docs.python.org/lib/module-struct.html for details.

Thanks, that was exactly what I was missing. And thanks for the link as well!

-joe
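To make the `struct.pack('!L', ...)` suggestion concrete, here is a minimal sketch of the framing the perl script appears to use ('r', a 4-byte big-endian length, the payload, a newline). It assumes Python 3, where socket data is bytes; the function names `frame_audio` and `send_audio` and the 4096-byte read are illustrative choices, not something taken from the original scripts or the server's protocol.

```python
import socket
import struct

def frame_audio(audio):
    """Build b'r' + 4-byte big-endian length + payload + newline."""
    # '!L' is an unsigned 32-bit integer in network byte order,
    # the same layout perl's pack('N', ...) produces.
    header = struct.pack("!L", len(audio))
    return b"r" + header + audio + b"\n"

def send_audio(host, port, audio):
    """Hypothetical client: connect, send one framed message, read one reply chunk."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(frame_audio(audio))   # sendall keeps writing until all bytes are sent
        return conn.recv(4096)             # a real client would loop until the reply is complete

message = frame_audio(b"abc")
(length,) = struct.unpack("!L", message[1:5])
print(message, length)
```

For comparison, `socket.htonl()` returns an integer rather than four bytes, and `struct.calcsize()` only reports how many bytes a format occupies, which is why neither produces something that can be written to the socket as the length field.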
python tr equivalent (non-ascii)
Hi,

I was wondering how I ought to be handling character range translations in python.

What I want to do is translate fullwidth numbers and roman alphabet characters into their halfwidth ascii equivalents. In perl I can do this pretty easily with tr:

    tr/\x{ff00}-\x{ff5e}/\x{0020}-\x{007e}/;

and I think the string.translate method is what I need to use to achieve the equivalent in python. Unfortunately the maketrans method doesn't seem to accept character ranges and I'm also having trouble with its interpretation of length. What I came up with was to first fudge the ranges:

    my_test_string = u"ABCDEFG"
    f_range = "".join([unichr(x) for x in range(ord(u"\uff00"),ord(u"\uff5e"))])
    t_range = "".join([unichr(x) for x in range(ord(u"\u0020"),ord(u"\u007e"))])

then use these as input to maketrans:

    my_trans_string = my_test_string.translate(string.maketrans(f_range,t_range))
    Traceback (most recent call last):
      File "", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-93: ordinal not in range(128)

but it generates an encoding error... and if I encode the ranges in utf8 before passing them on I get a length error, because maketrans is counting bytes, not characters, and utf8 is variable width...

    my_trans_string = my_test_string.translate(string.maketrans(f_range.encode("utf8"),t_range.encode("utf8")))
    Traceback (most recent call last):
      File "", line 1, in ?
    ValueError: maketrans arguments must have same length
Re: python tr equivalent (non-ascii)
On Aug 13, 5:18 pm, kettle <[EMAIL PROTECTED]> wrote:
> [...]
> but it generates an encoding error... and if I encodethe ranges in
> utf8 before passing them on I get a length error because maketrans is
> counting bytes not characters and utf8 is variable width...
> [...]
> ValueError: maketrans arguments must have same length

Ok so I guess I was barking up the wrong tree. Searching for

    python 全角 半角

quickly brought up a solution:

    >>> import unicodedata
    >>> my_test_string=u"[EMAIL PROTECTED]"
    >>> print unicodedata.normalize('NFKC', my_test_string.decode("utf8"))
    [EMAIL PROTECTED]@123
    >>>

Still, it would be nice if there was a more general solution, or if maketrans actually looked at chars instead of bytes, methinks.
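The NFKC approach found above, restated as a small sketch for Python 3 (an assumption, since the session in the message is Python 2). The wrapper name `to_halfwidth` is hypothetical, and the sample string is written with escapes so that the full-width characters survive copying.

```python
import unicodedata

def to_halfwidth(text):
    """Fold compatibility forms, including full-width ASCII, via NFKC."""
    return unicodedata.normalize("NFKC", text)

fullwidth_sample = "\uff21\uff22\uff23\uff11\uff12\uff13"   # full-width A B C 1 2 3
print(to_halfwidth(fullwidth_sample))
```

NFKC is broader than a plain full-width to half-width swap: it also turns half-width katakana into full-width katakana and folds other compatibility characters such as circled digits, so it may change more than intended on mixed text.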
Re: python tr equivalent (non-ascii)
On Aug 13, 5:33 pm, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> kettle wrote:
> > I was wondering how I ought to be handling character range
> > translations in python.
> >
> > What I want to do is translate fullwidth numbers and roman alphabet
> > characters into their halfwidth ascii equivalents.
> > In perl I can do this pretty easily with tr:
> >
> >     tr/\x{ff00}-\x{ff5e}/\x{0020}-\x{007e}/;
> >
> > and I think the string.translate method is what I need to use to
> > achieve the equivalent in python. Unfortunately the maketrans method
> > doesn't seem to accept character ranges and I'm also having trouble
> > with its interpretation of length. What I came up with was to first
> > fudge the ranges:
> >
> >     my_test_string = u"ABCDEFG"
> >     f_range = "".join([unichr(x) for x in range(ord(u"\uff00"),ord(u"\uff5e"))])
> >     t_range = "".join([unichr(x) for x in range(ord(u"\u0020"),ord(u"\u007e"))])
> >
> > then use these as input to maketrans:
> >
> >     my_trans_string = my_test_string.translate(string.maketrans(f_range,t_range))
> >     Traceback (most recent call last):
> >       File "", line 1, in ?
> >     UnicodeEncodeError: 'ascii' codec can't encode characters in position
> >     0-93: ordinal not in range(128)
>
> maketrans only works for byte strings.
>
> as for translate itself, it has different signatures for byte strings
> and unicode strings; in the former case, it takes lookup table
> represented as a 256-byte string (e.g. created by maketrans), in the
> latter case, it takes a dictionary mapping from ordinals to ordinals or
> unicode strings.
>
> something like
>
>     lut = dict((0xff00 + ch, 0x0020 + ch) for ch in range(0x80))
>
>     new_string = old_string.translate(lut)
>
> could work (untested).

Excellent. I didn't realize from the docs that I could do that. Thanks.
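Fredrik's untested lookup-table idea can be tightened into the sketch below, assuming Python 3, where `str.translate` takes the same kind of ordinal-to-ordinal dictionary. The names `FULLWIDTH_TO_ASCII` and `fullwidth_to_ascii` are made up for illustration. Two details differ from the quoted one-liner: the table stops at U+FF5E, because `range(0x80)` as written would also remap U+FF5F to U+FF7F (which includes part of the half-width katakana block), and the full-width space is added separately because it sits at U+3000 rather than U+FF00.

```python
# Full-width forms U+FF01..U+FF5E line up one-to-one with ASCII U+0021..U+007E.
FULLWIDTH_TO_ASCII = {0xFF01 + i: 0x21 + i for i in range(0x5E)}
# The full-width space is IDEOGRAPHIC SPACE, U+3000.
FULLWIDTH_TO_ASCII[0x3000] = 0x20

def fullwidth_to_ascii(text):
    """Translate only full-width ASCII variants; leave everything else alone."""
    return text.translate(FULLWIDTH_TO_ASCII)

sample = "\uff21\uff22\uff23\u3000\uff11\uff12\uff13"   # full-width "ABC 123"
print(fullwidth_to_ascii(sample))
```

Unlike NFKC normalization, this table touches nothing outside the listed code points, so half-width katakana and other compatibility characters pass through unchanged.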