Thanks Steven, that clears things up.


On Monday, January 12, 2015 at 8:51:43 AM UTC-8, Steven G. Johnson wrote:
>
>
> For sequential string processing, you can use the nextind and prevind 
> functions to find the next/previous valid indices in the string.  e.g. 
>  nextind(s,1) in your example above yields 3, and s[3] gives '('.   In 
> practice, virtually all string processing is sequential (starting at the 
> beginning of the string or at previously computed indices), so UTF-8 string 
> processing is efficient.
>
> Alternatively, you can use the chr2ind function to convert a character 
> index into a byte index.  e.g. chr2ind(s,2) gives the byte index of the 
> start of the second codepoint in s, which in your case gives 3.  However, 
> in practice this is rarely needed, which is good because it is relatively 
> slow (it requires Julia to loop through the string).  (The only time I've 
> needed it was to convert indices from one encoding to another.)
>
> Regex matching returns the byte index, which is what you want: that lets 
> you efficiently jump to that point in the string.  That is why 
> match(r"a",s).offset 
> returns 4: this is correct, because the character 'a' indeed starts at the 4th 
> byte of s and s[4] == 'a'.   
>
> See also 
> http://docs.julialang.org/en/latest/manual/strings/#unicode-and-utf-8
>
> There is some discussion of using a special string indexing type to hide 
> this complexity, but it raises some subtle tradeoffs and nothing has been 
> decided yet: https://github.com/JuliaLang/julia/issues/9297
>
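
For anyone finding this thread later, here is a small sketch of the three points above. The string is a hypothetical stand-in for the one discussed (a 2-byte character, then '(' and 'a'), and char_to_byte is a helper name made up for illustration, not a Base function. It assumes a recent Julia (1.x) and only shows roughly what a character-to-byte conversion has to do.

```julia
# Hypothetical stand-in for the string s discussed above:
# a 2-byte character, then '(' at byte 3 and 'a' at byte 4.
s = "\u00e9(a)"

# Sequential processing: step from one valid byte index to the next.
i = nextind(s, 1)   # 'é' occupies bytes 1 and 2, so the next valid index is 3
c = s[i]            # the character stored at byte index 3
j = prevind(s, i)   # step back to the previous valid index

# Character index -> byte index by walking the string from the start.
# This is O(n), which is why the conversion is relatively slow.
# (char_to_byte is an illustrative helper, not part of Base.)
function char_to_byte(str, n)
    idx = 1
    for _ in 2:n
        idx = nextind(str, idx)
    end
    return idx
end
k = char_to_byte(s, 2)

# Regex matches report byte offsets, which can index the string directly.
m = match(r"a", s)
off = m.offset
```

With this string, i and k are both 3, s[3] is '(' and the match offset is 4, the same numbers as in the quoted message. Note that chr2ind itself was removed in Julia 1.0, which is why the sketch walks the string with nextind instead.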
