On Thursday, November 7, 2013 2:57:03 PM UTC-5, Andy Fingerhut wrote:
>
> Very cool.
>
> I read through your README (thank you for that), and did not notice an
> answer to the question of: does seq on a utf8 string return a sequence of
> Unicode code points? Java UTF-16 code points (with pairs of them for
> characters outside the BMP)? Something else? It would be great to have
> the answer in the README.
>
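To make the question concrete, here is a small sketch on a plain java.lang.String (not the utf8 library; the class name CharVsCodePoint is made up for illustration). It shows how the number of UTF-16 char values differs from the number of code points once a character outside the BMP is involved:

```java
// Illustrative sketch on a plain java.lang.String, not the utf8 library itself.
// The class and method names are hypothetical.
final class CharVsCodePoint {
    /** Number of UTF-16 char values, i.e. what CharSequence.length() reports. */
    static int utf16Units(CharSequence s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(CharSequence s) {
        return Character.codePointCount(s, 0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
        System.out.println(utf16Units(s));                          // 4
        System.out.println(codePoints(s));                          // 3
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```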
The utf8 strings implement the CharSequence interface, and when seq'ed produce a sequence of characters. In both cases "characters" refers to Java UTF-16 char values (code units), which means anything outside the BMP is represented as a surrogate pair. Maybe I was assuming too much by using those terms? I can clarify the README.

> One idea I had for such a thing was: if any operation ever traversed part
> of such a variable-bytes-per-char string for any reason (e.g. counting its
> length in Unicode code points, or indexing), maintain some kind of data
> structure mapping a few selected index values to their byte offset within
> the byte array. For example, a string containing 100 Unicode code points
> might have byte offsets for the start of the UTF-8 encodings of every 32
> code points, or 64. This limits any sequential scanning to be from the
> most recent cached byte offset. Not a trivial amount of implementation
> work, I know, but would be cool.
>

Even if you used full 32-bit integers to represent code points (even though code points are only 21-bit values...*sigh*) you still can't count "characters" in constant time, unless you define characters to mean code points (like you say above). A single user-perceived character can consist of several code points, for example a base letter followed by one or more combining marks.

You could create and maintain some kind of index for an encoded sequence of code points, but there's always a space/time tradeoff, and ultimately it will depend on your use case. If you mostly access strings sequentially, then utf8 isn't so bad. You can index into the middle of the byte stream, find your way to the beginning of the next code point, and skip through code points from there.

That's not to say there's no merit to your idea. I've had the same idea myself. It would be nice to be able to jump to the nth encoded code point in a utf8 string.

I have a book called "Unicode Demystified" and recommend it. It talks about what terms people use (like character vs. code point vs. glyph).
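As a rough sketch of the cached-offset idea from the quoted paragraph, combined with the "find the beginning of the next code point" trick, something like the following could work. Everything here is hypothetical: the class Utf8OffsetIndex and its methods are not part of the utf8 library, the input is assumed to be well-formed UTF-8, and the offset table is built eagerly in the constructor rather than lazily on first traversal.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: remembers the byte offset of every `stride`-th code point
// so that locating the nth code point scans at most stride - 1 code points.
final class Utf8OffsetIndex {
    private final byte[] bytes;
    private final int stride;
    private final int[] checkpoints; // checkpoints[k] = byte offset of code point k * stride
    private final int codePointCount;

    Utf8OffsetIndex(byte[] utf8, int stride) {
        if (stride <= 0) {
            throw new IllegalArgumentException("stride must be positive");
        }
        this.bytes = utf8;
        this.stride = stride;
        List<Integer> offsets = new ArrayList<>();
        int count = 0;
        for (int i = 0; i < utf8.length; i++) {
            if (isLeadByte(utf8[i])) {
                if (count % stride == 0) {
                    offsets.add(i);
                }
                count++;
            }
        }
        this.checkpoints = offsets.stream().mapToInt(Integer::intValue).toArray();
        this.codePointCount = count;
    }

    /** In UTF-8, continuation bytes look like 10xxxxxx; every other byte starts a code point. */
    static boolean isLeadByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    /** From any byte position, moves forward to the start of a code point (or the end). */
    static int nextBoundary(byte[] utf8, int pos) {
        while (pos < utf8.length && !isLeadByte(utf8[pos])) {
            pos++;
        }
        return pos;
    }

    int codePointCount() {
        return codePointCount;
    }

    /** Byte offset of the code point with the given index, scanning from the nearest checkpoint. */
    int byteOffsetOf(int codePointIndex) {
        if (codePointIndex < 0 || codePointIndex >= codePointCount) {
            throw new IndexOutOfBoundsException("code point index: " + codePointIndex);
        }
        int pos = checkpoints[codePointIndex / stride];
        for (int remaining = codePointIndex % stride; remaining > 0; remaining--) {
            pos = nextBoundary(bytes, pos + 1);
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1D11E (4 bytes), 'b' (1 byte)
        String s = "a\u00e9\u20ac" + new String(Character.toChars(0x1D11E)) + "b";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        Utf8OffsetIndex index = new Utf8OffsetIndex(utf8, 2);
        System.out.println(index.codePointCount()); // 5
        System.out.println(index.byteOffsetOf(3));  // 6
        System.out.println(nextBoundary(utf8, 7));  // 10
    }
}
```

A smaller stride means less scanning per lookup but a larger table, which is the space/time tradeoff mentioned above. Note that this still counts code points, not user-perceived characters.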
The book also covers some of the history of Unicode, and offers practical tips such as data structures that can be used to represent code points in memory for different types of tasks. Anyway, Unicode is much more complex than any of us could hope, but I digress.

Paul

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en