willie wrote:
> >willie wrote:
> >> Marc 'BlackJack' Rintsch:
> >>
> >> >In <[EMAIL PROTECTED]>, willie wrote:
> >> >> # What's the correct way to get the
> >> >> # byte count of a unicode (UTF-8) string?
> >> >> # I couldn't find a builtin method
> >> >> # and the following is memory inefficient.
>
> >> >> ustr = "example\xC2\x9D".decode('UTF-8')
>
> >> >> num_chars = len(ustr) # 8
>
> >> >> buf = ustr.encode('UTF-8')
>
> >> >> num_bytes = len(buf) # 9
>
> >> >That is the correct way.
>
> >> # Apologies if I'm being dense, but it seems
> >> # unusual that I'd have to make a copy of a
> >> # unicode string, converting it into a byte
> >> # string, before I can determine the size (in bytes)
> >> # of the unicode string. Can someone provide the rationale
> >> # for that or correct my misunderstanding?
>
> >You initially asked "What's the correct way to get the byte count of a
> >unicode (UTF-8) string".
> >
> >It appears you meant "How can I find how many bytes there are in the
> >UTF-8 representation of a Unicode string without manifesting the UTF-8
> >representation?".
> >
> >The answer is, "You can't", and the rationale would have to be that
> >nobody thought of a use case for counting the length of the UTF-8 form
> >but not creating the UTF-8 form. What is your use case?
>
> # Sorry for the confusion. My use case is a web app that
> # only deals with UTF-8 strings. I want to prevent silent
> # truncation of the data, so I want to validate the number
> # of bytes that make up the unicode string before sending
> # it to the database to be written.
>
> # For instance, say I have a name column that is varchar(50).
> # The 50 is in bytes not characters. So I can't use the length of
> # the unicode string to check if it's over the maximum allowed bytes.

What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in utf-8)?

>
> name = post.input('name') # utf-8 string

You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!! If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8. If it says the type is unicode, then please explain
"web app that only deals with UTF-8 strings" ...

>
> # preferable
> if bytes(name) > 50:
>     send_http_headers()
>     display_page_begin()
>     display_error_msg('the name is too long')
>     display_form(name)
>     display_page_end()
>
> # If I have a form with many input elements,
> # I have to convert each to a byte string
> # before i can see how many bytes make up the
> # unicode string. That's very memory inefficient
> # with large text fields - having to duplicate each
> # one to get its size in bytes:

They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?
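
For what it's worth, here is a minimal sketch of the length check discussed
in this thread. It is written for Python 3, where str is Unicode text and
bytes is the encoded form (the thread itself uses Python 2's unicode/str).
The names utf8_len and fits_in_column are hypothetical, made up for this
illustration; neither is a builtin. The first helper counts UTF-8 bytes
from the code points without building the encoded copy; the second simply
encodes and measures, which is the approach already called correct above.

```python
def utf8_len(text):
    """Count the bytes in the UTF-8 encoding of text without creating it.

    Hypothetical helper, not a builtin. Lone surrogates are counted as
    3 bytes here, whereas str.encode('utf-8') would raise on them.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # ASCII range: 1 byte
            total += 1
        elif cp < 0x800:     # up to U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # rest of the Basic Multilingual Plane: 3 bytes
            total += 3
        else:                # supplementary planes: 4 bytes
            total += 4
    return total


def fits_in_column(text, max_bytes):
    """Return True if text fits in max_bytes once encoded as UTF-8.

    The temporary encoded copy becomes garbage as soon as len() returns.
    """
    return len(text.encode("utf-8")) <= max_bytes


name = "example\x9d"         # same text as the original post
num_chars = len(name)        # length in characters
num_bytes = utf8_len(name)   # length in UTF-8 bytes
```

In CPython the explicit loop is likely to be much slower than encoding
and calling len(), so it only seems worth considering if the temporary
copy is a measured problem.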