On Thursday, 11 July 2013 20:42:26 UTC+2, wxjm...@gmail.com wrote:
> On Thursday, 11 July 2013 15:32:00 UTC+2, Chris Angelico wrote:
> > On Thu, Jul 11, 2013 at 11:18 PM, <wxjmfa...@gmail.com> wrote:
> > > Just to stick with this funny character ẞ, a ucs-2 char
> > > in the Flexible String Representation nomenclature.
> > >
> > > It seems to me that, when one needs more than ten bytes
> > > to encode it,
> > >
> > > >>> sys.getsizeof('a')
> > > 26
> > > >>> sys.getsizeof('ẞ')
> > > 40
> > >
> > > this is far from perfection.
> >
> > A better comparison is to see how much space is used by one copy of it,
> > and how much by two copies:
> >
> > >>> sys.getsizeof('aa')-sys.getsizeof('a')
> > 1
> > >>> sys.getsizeof('ẞẞ')-sys.getsizeof('ẞ')
> > 2
> >
> > String objects have overhead. Big deal.
> >
> > > BTW, for a modern language, has ucs2 not been considered
> > > obsolete for many, many years?
> >
> > Clearly. And similarly, the 16-bit integer has been completely
> > obsoleted, as there is no reason anyone should ever bother to use it.
> > Same with the float type - everyone uses double or better these days,
> > right?
> >
> > http://www.postgresql.org/docs/current/static/datatype-numeric.html
> > http://www.cplusplus.com/doc/tutorial/variables/
> >
> > Nope, nobody uses small integers any more, they're clearly completely
> > obsolete.
>
> Sure, there is some overhead because a str is a class.
> It still remains that a "ẞ" weighs 14 bytes more than
> an "a".
>
> In "aẞ", the ẞ weighs 6 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('aẞ')
> 42
>
> and in "aẞẞ", the ẞ weighs 2 bytes
>
> >>> sys.getsizeof('aẞẞ')
>
> And what to say about this "ucs4" char/string '\U0001d11e', which
> weighs 18 bytes more than an "a"?
>
> >>> sys.getsizeof('\U0001d11e')
> 44
>
> A total absurdity. How does it come about? Very simple: once you
> split Unicode into subsets, not only do you have to handle these
> subsets, you have to create "markers" to differentiate them.
> Not only do you produce "markers", you have to handle the
> mess generated by these "markers". Hiding these markers
> in the overhead of the class does not mean that they should
> not be counted as part of the coding scheme. BTW, since
> when does a serious coding scheme need an external marker?
>
> >>> sys.getsizeof('aa') - sys.getsizeof('a')
> 1
>
> In short, if my algebra is still correct:
>
> (overhead + marker + 2*'a') - (overhead + marker + 'a')
> = (overhead + marker + 2*'a') - overhead - marker - 'a'
> = overhead - overhead + marker - marker + 2*'a' - 'a'
> = 0 + 0 + 'a'
> = 1
>
> The "marker" has magically disappeared.
>
> jmf
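
The subtraction argued over above can be checked directly rather than by hand. What follows is only a minimal sketch, assuming CPython 3.3 or later (the Flexible String Representation of PEP 393); the helper names marginal_bytes and fixed_overhead are made up for this illustration and are not part of any library. It separates the per-character cost from whatever fixed part the object carries, and prints the figures for the local build instead of assuming the 26/40/44 quoted in the thread, since those depend on platform and version.

```python
import sys

def marginal_bytes(ch, n=1000):
    """Bytes added per extra copy of ch (hypothetical helper).

    Two freshly built strings are compared so that any fixed per-object
    cost cancels out in the subtraction.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

def fixed_overhead(ch, length):
    """Bytes of a string of `length` copies of ch that do not scale with length."""
    return sys.getsizeof(ch * length) - length * marginal_bytes(ch)

# 'a' (ASCII), U+1E9E capital sharp s (the "ucs-2" case), U+1D11E (the "ucs4" case)
for ch in ("a", "\u1e9e", "\U0001d11e"):
    print(ascii(ch),
          "per char:", marginal_bytes(ch),
          "fixed part at len 10:", fixed_overhead(ch, 10),
          "fixed part at len 100:", fixed_overhead(ch, 100))
```

On such an interpreter the per-character figure should come out as 1, 2 and 4 bytes for the three cases, while the fixed part stays the same whether the string has 10 or 100 characters. That is the sense in which the subtraction cancels the fixed part: it is paid once per string object, not once per character.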
--
http://mail.python.org/mailman/listinfo/python-list