>willie wrote:
>> Marc 'BlackJack' Rintsch:
>>
>>  >In <[EMAIL PROTECTED]>, willie wrote:
>>  >> # What's the correct way to get the
>>  >> # byte count of a unicode (UTF-8) string?
>>  >> # I couldn't find a builtin method
>>  >> # and the following is memory inefficient.
>>
>>  >> ustr = "example\xC2\x9D".decode('UTF-8')
>>  >> num_chars = len(ustr)    # 8
>>  >> buf = ustr.encode('UTF-8')
>>  >> num_bytes = len(buf)     # 9
>>
>>  >That is the correct way.
>>
>> # Apologies if I'm being dense, but it seems
>> # unusual that I'd have to make a copy of a
>> # unicode string, converting it into a byte
>> # string, before I can determine the size (in bytes)
>> # of the unicode string. Can someone provide the rationale
>> # for that or correct my misunderstanding?
>
>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes, not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

name = post.input('name') # utf-8 string

# preferable
if bytes(name) > 50:
    send_http_headers()
    display_page_begin()
    display_error_msg('the name is too long')
    display_form(name)
    display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before I can see how many bytes make up the
# unicode string.
# That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

buf = name.encode('UTF-8')
num_bytes = len(buf)

# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.
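
For what it's worth, the byte count can be worked out from the code points alone, since UTF-8 uses a fixed number of bytes per code point range. The following is only a rough sketch, not a builtin: `utf8_len` and `fits_in_column` are hypothetical names, and it is written in Python 3 syntax rather than the Python 2 used earlier in the thread. It avoids the encoded copy, but a pure-Python loop is likely slower than `len(s.encode('UTF-8'))`, so it only pays off if the extra memory really is the problem. It also counts a lone surrogate as 3 bytes, where a strict encode would raise an error.

```python
def utf8_len(text):
    """Return the number of bytes `text` would occupy in UTF-8,
    without building the encoded byte string."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # U+0000..U+007F   -> 1 byte
            total += 1
        elif cp < 0x800:     # U+0080..U+07FF   -> 2 bytes
            total += 2
        elif cp < 0x10000:   # U+0800..U+FFFF   -> 3 bytes
            total += 3
        else:                # U+10000 and up   -> 4 bytes
            total += 4
    return total


def fits_in_column(text, max_bytes):
    """True if `text` fits in a column limited to `max_bytes` bytes of UTF-8."""
    return utf8_len(text) <= max_bytes


ustr = "example\u009d"       # same string as in the original question
print(len(ustr))             # 8 characters
print(utf8_len(ustr))        # 9 bytes
```

With something like this, the varchar(50) check would read `if not fits_in_column(name, 50):` followed by the error page calls.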