Chris Angelico writes:

> On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote:
>> Steve D'Aprano writes:
>>
>> [snip]
>>
>>> You could avoid that error by increasing the offset by the right
>>> amount:
>>>
>>> stuff = text[offset + len("ф".encode('utf-8')):]
>>>
>>> which is awful. I believe that's what Go and Julia expect you to do.
>>
>> Julia provides a method to get the next index.
>>
>> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>>     while offset <= endof(text)
>>         print(text[offset], ".")
>>         offset = nextind(text, offset)
>>     end
>>     println()
>> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.
>
> This implies that regular iteration isn't good enough, though.

It doesn't. Here's the straightforward iteration over the whole string:

    let text = "ἐπὶ οἴνοπα πόντον"
        for c in text
            print(c, ".")
        end
        println()
    end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

One can also join any iterable whose elements can be converted to
strings, and characters can:

    let text = "ἐπὶ οἴνοπα πόντον"
        println(join(text, "."), ".")
    end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

And strings, trivially, can:

    let text = "ἐπὶ οἴνοπα πόντον"
        println(join(split(text), "."), ".")
    end # prints: ἐπὶ.οἴνοπα.πόντον.

> Here's a function that creates a numbered list:
>
> def print_list(items):
>     width = len(str(len(items)))
>     for idx, item in enumerate(items, 1):
>         print("%*d: %s" % (width, idx, item))
>
> In Python, this will happily accept anything that is iterable and has
> a known length. Could be a list or tuple, obviously, but can also just
> as easily be a dict view (keys or items), a range object, or.... a
> string. It's perfectly acceptable to enumerate the characters of a
> string. And enumerate() itself is implemented entirely generically.

I'll skip the formatting - I don't know off-hand how to do it - but keep
the width calculation, and I cut the character iterator short at 10
items to save some space. There, it's much the same in Julia:

    let text = "ἐπὶ οἴνοπα πόντον"
        function print_list(items)
            width = endof(string(length(items)))
            println("width = ", width)
            for (idx, item) in enumerate(items)
                println(idx, '\t', item)
            end
        end
        print_list(take(text, 10))
        print_list([text, text, text])
        print_list(split(text))
    end

That prints this:

    width = 2
    1    ἐ
    2    π
    3    ὶ
    4     
    5    ο
    6    ἴ
    7    ν
    8    ο
    9    π
    10   α
    width = 1
    1    ἐπὶ οἴνοπα πόντον
    2    ἐπὶ οἴνοπα πόντον
    3    ἐπὶ οἴνοπα πόντον
    width = 1
    1    ἐπὶ
    2    οἴνοπα
    3    πόντον

> If you have to call nextind() to get the next character, you've made
> it impossible to do any kind of generic operation on the text.
> You can't do a windowed view by slicing while iterating, you can't
> have a "lag" or "lead" value, you can't do any of those kinds of
> simple and obvious index-based operations.

Yet Julia does with ease many things that you seem to think it cannot
possibly do at all. The iteration system works on types that have
methods for certain generic functions. For strings, the default is to
iterate over something like its characters; I think another iterator
over valid indexes is available, or wouldn't be hard to write; it could
be forward or backward, and in Julia many of these things are often
peekable by default (because the iteration protocol itself does not have
state - see below at "more magic").

The usual things work fine:

    let text = "ἐπὶ οἴνοπα πόντον"
        foreach(print, enumerate(zip(text, split(text))))
    end # prints: (1,('ἐ',"ἐπὶ"))(2,('π',"οἴνοπα"))(3,('ὶ',"πόντον"))

How is that bad?

More magic:

    let text = "ἐπὶ οἴνοπα πόντον"
        let ever = cycle(split(text))
            println(first(ever))
            println(first(ever))
            for n in 2:6
                println(join(take(ever, n), " "))
            end
        end
    end

This prints the following. The cycle iterator, ever, produces an endless
repetition of the three words, but it doesn't have state like Python
iterators do, so it's possible to look at the first word twice (and then
five more times).

    ἐπὶ
    ἐπὶ
    ἐπὶ οἴνοπα
    ἐπὶ οἴνοπα πόντον
    ἐπὶ οἴνοπα πόντον ἐπὶ
    ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα
    ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα πόντον

> Oh, and Python 3.3 wasn't the first programming language to use this
> flexible string representation. Pike introduced an extremely similar
> string representation back in 1998:
>
> https://github.com/pikelang/Pike/commit/db4a4

Ok. Is GitHub that old?

> So yes, UTF-8 has its advantages. But it also has its costs, and for a
> text processing language like Pike or Python, they significantly
> outweigh the benefits.

I process text in my work but I really don't use character indexes much
at all.
Rather split, join, startswith, endswith, that kind of thing, and
whether a string contains some character or substring anywhere.

-- 
https://mail.python.org/mailman/listinfo/python-list
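
To make the index-free style mentioned above concrete, here is a minimal Python sketch on the same sample text. It uses only split, join, startswith, endswith and the "in" operator, with no character offsets anywhere. The variable names are made up for illustration and are not from the thread.

```python
# Illustrative sketch only; all names here are assumptions, not from the thread.
text = "ἐπὶ οἴνοπα πόντον"

# Split on whitespace, then join with dots (same shape as the Julia join example).
words = text.split()
dotted = ".".join(words) + "."

# Prefix and suffix tests on whole words, no index arithmetic.
p_words = [w for w in words if w.startswith("π")]
n_words = [w for w in words if w.endswith("ν")]

# Membership tests for a single character and for a substring.
has_nu = "ν" in text
has_sub = "νοπα" in text

print(words)
print(dotted)
print(p_words, n_words)
print(has_nu, has_sub)
```

None of these operations depends on how the string is stored internally, so they behave the same whether the representation is UTF-8 or fixed-width.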