Anton, Sean,

Anton brings up a pretty interesting problem.
At first, I thought it might be easy to remedy with:

    import json
    import functools
    antonjson = functools.partial(json.dumps, ensure_ascii=False)

    from riak import RiakClient
    R = RiakClient()
    R.set_encoder('application/json', antonjson)

…however, upon testing this out, it seems likely that the underlying
transport channels use the default encoding, 'ascii', and choke on the
8-bit data we now pass them, in socket.py (for the HTTP client) or
protobuf.internal.type_checkers (for PBC).

Maybe that's a suitable hint for Anton's further investigation, but I'll
try to spend some time with it to see what I can find, as well.

As to the OP's question: Yes, you've summarized the state of affairs
quite nicely. IMHO it was a reasonable default (you can't be sure other
Riak clients are as good as Python at 8-bit/Unicode!), but the
underlying implementation definitely shows a bug that (again, IMHO)
should and can be fixed.

--
Adam Lindsay

On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:

> Anton,
>
> I don't see any reason why this can't be fixed. However, since I'm not
> familiar with the specifics of the JSON implementation, I'll need
> assistance. Please open an issue or pull-request on the Python client:
> https://github.com/basho/riak-python-client/issues. We are open to
> major, breaking changes for the next release.
>
> On Fri, Feb 1, 2013 at 8:06 AM, Anton <theati...@gmail.com> wrote:
> > Let's talk Python and Unicode (yey!)
> >
> > The objects that I want to store will have non-ASCII strings in them.
> > Potentially a lot. How much is a lot? "Very many millions" should be a
> > good estimate.
> >
> > Now, the default behaviour for storing a Python object (ok, a dict of
> > stuff) using the PBC transport is to pass it to json and encode
> > it. I'm ok with that, I like JSON, and the fact that I can read out
> > an object in JSON, using a browser, helps a lot. It's really great for
> > developing project-specific tools, say debugging tools.
> >
> > But here is where the fun part starts. The JSON encoder in Python is
> > not a simple thing, and takes a lot of parameters. And by default it
> > works. So well that people rarely look at what's going on. When you
> > look at what's going on, however, things get more entertaining.
> >
> > The JSON encoder works on unicode objects, not strings. When you pass
> > it unicode objects, it's happy. When you pass it strings, it decodes
> > them, using a specified encoding. By default this is set to 'utf-8',
> > which makes everything quite ok. So far so good. However, there's
> > another option: 'ensure_ascii'. This is set to True by default and it
> > means that the JSON encoder will spew out an ASCII-only string.
> > That is, in the result, every non-ASCII code point is encoded as an
> > escape like \u0123, for a total of 6 bytes.
> >
> > Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> > using UTF-*. Also, even if much of the data will require 3 bytes in
> > UTF-8, that's still only half the bytes that the Python default would
> > take.
> >
> > Now, consider this elementary example. It already gives a significant
> > (in bytes) difference for a short string:
> > http://pastie.org/6011147
> >
> > Please tell me I'm not going crazy and all this is the state of
> > affairs and it is, in fact, wrong and can/should be fixed.
>
> --
> Sean Cribbs <s...@basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/
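
For reference, the size difference Anton describes in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions: it targets Python 3 (where json.dumps returns text that must be encoded to bytes explicitly, unlike the Python 2 setup discussed in this thread), the names dumps_utf8, encoded_sizes and sample are made up for the example, and it does not touch the Riak client or its transports at all.

```python
import functools
import json

# Same idea as the functools.partial encoder earlier in this message:
# keep non-ASCII characters as-is instead of emitting \uXXXX escapes.
dumps_utf8 = functools.partial(json.dumps, ensure_ascii=False)


def encoded_sizes(obj):
    """Return (escaped_len, utf8_len): payload size in bytes for each style."""
    # Default behaviour: ensure_ascii=True, so the result is pure ASCII.
    escaped = json.dumps(obj).encode("ascii")
    # Assumption: Python 3, so the text is encoded to UTF-8 by hand.
    raw = dumps_utf8(obj).encode("utf-8")
    return len(escaped), len(raw)


# Hypothetical sample document with a short Cyrillic string.
sample = {"name": "Антон"}
escaped_len, utf8_len = encoded_sizes(sample)
print("escaped:", escaped_len, "bytes; utf-8:", utf8_len, "bytes")

# Both forms decode back to the same object, so only the size differs.
round_trip = json.loads(dumps_utf8(sample).encode("utf-8").decode("utf-8"))
```

With a sample like this the escaped payload is the larger of the two, since each Cyrillic letter takes a 6-byte escape instead of 2 bytes of UTF-8. Whether the UTF-8 bytes then survive the HTTP or PBC transport is a separate question, which is the bug discussed above.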
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com