On Jan 5, 2012, at 5:07 PM, Andy Fingerhut wrote:

> I realize that with variable-length multi-byte character encodings like
> UTF-8, it would be a bad idea to seek to a random byte position and start
> trying to decode a UTF-8 character starting at that byte position. I'm
> thinking of cases where you have an index of byte positions of interest you
> want to jump to in the future that are known to be the first byte of a
> character in the appropriate encoding. I also realize that one must be very
> cautious in writing to the middle of such a file, since byte lengths of
> strings are variable.

I can't help too much, but the comment about UTF-8 rang a bell. It's actually not that hard to find a valid character by jumping to a random position. You just need to be able to back up a few bytes.

http://en.wikipedia.org/wiki/UTF-8

> * All continuation bytes (byte nos. 2-6 in the table above) have 10 as
> their two most-significant bits (bits 7-6); in contrast, the first byte never
> has 10 as its two most-significant bits. As a result, it is immediately
> obvious whether any given byte anywhere in a (valid) UTF-8 stream represents
> the first byte of a byte sequence corresponding to a single character, or a
> continuation byte of such a byte sequence.
> * As a consequence of no. 3 above, starting with any arbitrary byte
> anywhere in a (valid) UTF-8 stream, it is necessary to back up by only at
> most five bytes in order to get to the beginning of the byte sequence
> corresponding to a single character (three bytes in actual UTF-8 as explained
> in the next section). If it is not possible to back up, or a byte is missing
> because of e.g. a communication failure, one single character can be
> discarded, and the next character be correctly read.
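
For what it's worth, here is a rough Python sketch of the back-up rule from the Wikipedia excerpt, not a tested library routine. The function name `char_start` and the sample string are made up for illustration. It assumes the bytes are valid modern UTF-8, where a character has at most three continuation bytes, so it never backs up more than three positions.

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the UTF-8 character that
    contains data[pos]. Hypothetical helper; assumes valid UTF-8 input."""
    start = pos
    # Continuation bytes have the bit pattern 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    # Step back while we are on one, but never more than three bytes.
    while start > 0 and pos - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1
    if (data[start] & 0xC0) == 0x80:
        # Still on a continuation byte: the input was not valid UTF-8.
        raise ValueError("no character start within 3 bytes of position %d" % pos)
    return start


# Sample text with 1-, 2-, 3- and 4-byte characters: a, e-acute, euro sign, emoji, z.
data = "a\u00e9\u20ac\U0001F600z".encode("utf-8")

# Byte 9 is the last byte of the 4-byte emoji, which starts at byte 6.
start = char_start(data, 9)
tail = data[start:].decode("utf-8")  # decodes cleanly from the character start
```

The same check could be applied to a file by seeking to the position of interest, reading the byte there, and stepping back one byte at a time while the test holds. That is an extension of the sketch above, not something it implements.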