Thanks Steven, that clears things up.
On Monday, January 12, 2015 at 8:51:43 AM UTC-8, Steven G. Johnson wrote:
>
>
> For sequential string processing, you can use the nextind and prevind
> functions to find the next/previous valid indices in the string. e.g.
> nextind(s,1) in your example above yields 3, and s[3] gives '('. In
> practice, virtually all string processing is sequential (starting at the
> beginning of the string or at previously computed indices), so UTF-8 string
> processing is efficient.
>
> Alternatively, you can use the chr2ind function to convert a character
> index into a byte index. e.g. chr2ind(s,2) gives the byte index of the
> start of the second codepoint in s, which in your case gives 3. However,
> in practice this is rarely needed, which is good because it is relatively
> slow (it requires Julia to loop through the string). (The only time I've
> needed it was to convert indices from one encoding to another.)
>
> Regex matching returns the byte index, which is what you want: that lets
> you efficiently jump to that point in the string. That is why
> match(r"a",s).offset
> returns 4: this is correct, because the character 'a' indeed starts at the 4th
> byte of s and s[4] == 'a'.
>
> See also
> http://docs.julialang.org/en/latest/manual/strings/#unicode-and-utf-8
>
> There is some discussion of using a special string indexing type to hide
> this complexity, but it raises some subtle tradeoffs and nothing has been
> decided yet: https://github.com/JuliaLang/julia/issues/9297
>
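
For anyone reading along, here is a minimal sketch of the calls described above. The original string s is not shown in this message, so the sketch assumes a hypothetical stand-in, "α(a)", whose first character takes 2 bytes in UTF-8 so that the indices match the ones quoted. The helper name valid_indices is made up for illustration. Note also that chr2ind was removed in Julia 1.0; as far as I know the equivalent there is nextind(s, 0, n), which is what the sketch uses.

```julia
# Hypothetical stand-in for the string discussed above (an assumption):
# 'α' occupies 2 bytes in UTF-8, so the valid indices are 1, 3, 4, 5.
s = "α(a)"

# Sequential processing: step between valid indices.
i = nextind(s, 1)      # index of the character after the one at index 1
c = s[i]               # the character stored at that index
j = prevind(s, i)      # step back to the previous valid index

# Collect every valid index by walking the string with nextind.
function valid_indices(str::AbstractString)
    out = Int[]
    k = firstindex(str)
    while k <= lastindex(str)
        push!(out, k)
        k = nextind(str, k)
    end
    return out
end

idx = valid_indices(s)

# Character index -> byte index (the role chr2ind(s, 2) played in older Julia).
second_char_index = nextind(s, 0, 2)

# Regex matches report byte offsets, so they can be used to index directly.
m = match(r"a", s)
offset = m.offset
```

With this stand-in string, i and second_char_index both come out as 3 and offset as 4, in line with the values quoted above. The while loop is spelled out only to show nextind at work; eachindex(s) yields the same valid indices.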