Breaking with established convention is a dangerous thing to do. Being too
opinionated in ways that deviate from the norm tends to put people off the
language unless there's a clear benefit to forcing the alternative behavior.

In this case, there's no compelling benefit to naming the thing .byte_len() 
over merely documenting that .len() is in code units. Everything else that 
doesn't explicitly say "char" on strings is in code units too, so it's sensible 
that .len() is too. But having strings that don't have an inherent "length" is 
confusing to anyone who hasn't already memorized this difference.

Today we only need to teach the simple concept that strings are UTF-8 encoded,
and the corresponding notion that all of the accessor methods on strings
(including indexing using []) use code units unless they specify otherwise
(e.g. unless they contain the word "char").
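
As a rough illustration of that rule, here is a minimal sketch written against
current Rust, where the char_len() mentioned elsewhere in this thread is spelled
chars().count(). The `lengths` helper is a hypothetical name used only for this
example; it is not a standard library function.

```rust
// Hypothetical helper for illustration only; not part of the standard library.
// Returns (number of UTF-8 code units, number of code points).
fn lengths(s: &str) -> (usize, usize) {
    // .len() reports UTF-8 code units (bytes) and is O(1).
    // .chars().count() walks the whole string to count code points, so it is O(n).
    (s.len(), s.chars().count())
}

fn main() {
    let ascii = "hello";
    let accented = "héllo"; // 'é' (U+00E9) occupies two bytes in UTF-8

    let (bytes, chars) = lengths(ascii);
    println!("{}: {} bytes, {} code points", ascii, bytes, chars);

    let (bytes, chars) = lengths(accented);
    println!("{}: {} bytes, {} code points", accented, bytes, chars);

    // Slicing takes byte indexes, which is why it pairs with .len().
    // Bytes 0..3 cover 'h' (1 byte) and 'é' (2 bytes).
    let head = &accented[..3];
    println!("first three bytes: {}", head);
}
```

For ASCII-only input the two counts agree, which is exactly the case where the
difference is easy to overlook.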

-Kevin

On May 28, 2014, at 10:54 AM, Benjamin Striegel <ben.strie...@gmail.com> wrote:

> > People expect there to be a .len()
> 
> This is the assumption that I object to. People expect there to be a .len() 
> because strings have been fundamentally broken since time immemorial. Make 
> people type .byte_len() and be explicit about their desire to index via code 
> units.
> 
> 
> On Wed, May 28, 2014 at 1:12 PM, Kevin Ballard <ke...@sb.org> wrote:
> It's .len() because slicing and other related functions work on byte indexes.
> 
> We've had this discussion before. People expect there to be a
> .len(), and the only sensible .len() is byte length (because char length is
> not O(1) and not appropriate for use with most string-manipulation functions).
> 
> Since Rust strings are UTF-8 encoded text, it makes sense for .len() to be 
> the number of UTF-8 code units. Which happens to be the number of bytes.
> 
> -Kevin
> 
> On May 28, 2014, at 7:07 AM, Benjamin Striegel <ben.strie...@gmail.com> wrote:
> 
>> I think that the naming of `len` here is dangerously misleading. Naive 
>> ASCII-users will be free to assume that this is counting codepoints rather 
>> than bytes. I'd prefer the name `byte_len` in order to make the behavior 
>> here explicit.
>> 
>> 
>> On Wed, May 28, 2014 at 5:55 AM, Simon Sapin <simon.sa...@exyr.org> wrote:
>> On 28/05/2014 10:46, Aravinda VK wrote:
>> Thanks. I didn't know about char_len.
>> `unicode_str.as_slice().char_len()` is giving number of code points.
>> 
>> Sorry for the confusion, I was referring to a code point as a character in my
>> mail. char_len gives the correct output for my requirement. I have
>> written a JavaScript script to convert from string length to grapheme
>> cluster length for the Kannada language.
>> 
>> Be careful, JavaScript’s String.length counts UCS-2 code units, not code 
>> points…
>> 
>> 
>> -- 
>> Simon Sapin
>> _______________________________________________
>> Rust-dev mailing list
>> Rust-dev@mozilla.org
>> https://mail.mozilla.org/listinfo/rust-dev
>> 
> 
> 

