Re: ANN: byte vector backed, utf8 strings for Clojure

Paul Stadig Thu, 07 Nov 2013 13:24:34 -0800

On Thursday, November 7, 2013 2:57:03 PM UTC-5, Andy Fingerhut wrote:
>
> Very cool.
>
> I read through your README (thank you for that), and did not notice an 
> answer to the question of: does seq on a utf8 string return a sequence of 
> Unicode code points?  Java UTF-16 code points (with pairs of them for 
> characters outside the BMP)?  Something else?  It would be great to have 
> the answer in the README.
>


The utf8 strings implement the CharSequence interface, and when seq'ed 
produce a sequence of characters. In both those cases "characters" refers 
to Java UTF-16 character values, which means anything outside the BMP would 
be represented as surrogate pairs. Maybe I was assuming too much using 
those terms? I can clarify the README.
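
To make the distinction concrete, here is a small Java sketch (not taken from the library; the class and method names are made up for illustration) showing that a code point outside the BMP occupies two Java char values but counts as a single code point:

```java
// Illustrative only: SurrogateDemo and its methods are hypothetical names.
class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the sequence.
    static int charCount(CharSequence s) {
        return s.length();
    }

    // Number of Unicode code points in the sequence.
    static int codePointCount(CharSequence s) {
        return Character.codePointCount(s, 0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
        String clef = new String(Character.toChars(0x1D11E));
        String s = "a" + clef;

        System.out.println(charCount(s));      // 3: 'a' plus a surrogate pair
        System.out.println(codePointCount(s)); // 2: 'a' plus one code point
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true
    }
}
```

Anything that walks a CharSequence char by char will see the two surrogates separately, which is the behavior described above.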

> One idea I had for such a thing was: if any operation ever traversed part 
> of such a variable-bytes-per-char string for any reason (e.g. counting its 
> length in Unicode code points, or indexing), maintain some kind of data 
> structure mapping a few selected index values to their byte offset within 
> the byte array.  For example, a string containing 100 Unicode code points 
> might have byte offsets for the start of the UTF-8 encodings of every 32 
> code points, or 64.  This limits any sequential scanning to be from the 
> most recent cached byte offset.  Not a trivial amount of implementation 
> work, I know, but would be cool.
>
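
For what it's worth, the kind of index described in the quoted text could be sketched roughly as below. This is only an illustration under some assumptions: the class name IndexedUtf8 and the stride of 32 are invented here, the bytes are assumed to be well-formed UTF-8, and the checkpoints are built eagerly in the constructor rather than lazily on first traversal as the quoted idea suggests.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a UTF-8 byte array plus the byte offset of every
// STRIDE-th code point. Assumes well-formed UTF-8.
final class IndexedUtf8 {
    static final int STRIDE = 32; // illustrative choice, trades space for scan length

    private final byte[] bytes;
    private final int[] checkpoints; // checkpoints[k] = byte offset of code point k * STRIDE
    private final int codePointCount;

    IndexedUtf8(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
        List<Integer> offsets = new ArrayList<>();
        int count = 0;
        // One pass over the lead bytes, recording a checkpoint every STRIDE code points.
        for (int i = 0; i < bytes.length; i += sequenceLength(bytes[i])) {
            if (count % STRIDE == 0) {
                offsets.add(i);
            }
            count++;
        }
        int[] cp = new int[offsets.size()];
        for (int k = 0; k < cp.length; k++) {
            cp[k] = offsets.get(k);
        }
        this.checkpoints = cp;
        this.codePointCount = count;
    }

    // Length in bytes of the UTF-8 sequence that starts with lead byte b.
    static int sequenceLength(byte b) {
        int u = b & 0xFF;
        if (u < 0x80) return 1;
        if (u < 0xE0) return 2;
        if (u < 0xF0) return 3;
        return 4;
    }

    int codePointCount() {
        return codePointCount;
    }

    // Byte offset of the n-th code point (0-based): jump to the nearest
    // checkpoint at or before n, then scan at most STRIDE - 1 code points.
    int byteOffsetOf(int n) {
        if (n < 0 || n >= codePointCount) {
            throw new IndexOutOfBoundsException("code point index " + n);
        }
        int offset = checkpoints[n / STRIDE];
        for (int k = n % STRIDE; k > 0; k--) {
            offset += sequenceLength(bytes[offset]);
        }
        return offset;
    }

    // Decode the n-th code point by handing its bytes to the JDK decoder.
    int codePointAt(int n) {
        int off = byteOffsetOf(n);
        int len = sequenceLength(bytes[off]);
        return new String(bytes, off, len, StandardCharsets.UTF_8).codePointAt(0);
    }
}
```

The index costs one int per 32 code points in this sketch; a larger stride shrinks it at the price of longer scans.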

Even if you used full 32-bit integers to represent code points (even though 
code points are only 21-bit values...*sigh*) you still can't count 
"characters" in constant time, unless you define characters to mean code 
points (like you say above), because code points can represent combining 
marks which together represent a single character. You could 
create/maintain some kind of index for an encoded sequence of code points, 
but there's always a space/time tradeoff, and ultimately it will depend on 
your use case. If you mostly access strings sequentially, then utf8 isn't 
so bad. You can index into the middle of the stream, find your way to the 
beginning of the next code point, and skip through code points from there. 
Not to say there's no merit to your idea. I've had the same myself. It 
would be nice to be able to jump to the nth encoded code point in a utf8 
string.
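
As a rough illustration of the "find the next code point and skip from there" approach, here is a minimal Java sketch. The class and method names are hypothetical, and it assumes the byte array holds well-formed UTF-8; it relies only on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```java
// Hypothetical helper, assuming well-formed UTF-8 input.
final class Utf8Scan {
    // A continuation byte has the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // From an arbitrary byte position, move forward to the start of the
    // next code point (or to the end of the array). A position that is
    // already on a boundary is returned unchanged.
    static int nextBoundary(byte[] utf8, int pos) {
        while (pos < utf8.length && isContinuation(utf8[pos])) {
            pos++;
        }
        return pos;
    }

    // Starting on a code point boundary, skip forward over up to n code
    // points and return the resulting byte offset.
    static int skipCodePoints(byte[] utf8, int pos, int n) {
        while (n > 0 && pos < utf8.length) {
            pos++;                         // step past the lead byte
            pos = nextBoundary(utf8, pos); // then past its continuation bytes
            n--;
        }
        return pos;
    }
}
```

Resynchronizing never looks at more than three continuation bytes, so landing in the middle of the stream is cheap; it is only finding the nth code point from the start that requires a linear scan.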

I have a book called "Unicode Demystified" and recommend it. It talks about 
what terms people use (like character vs. code point vs. glyph). It talks 
about some of the history of Unicode, and some practical tips like data 
structures that can be used to represent code points in memory for 
different types of tasks.

Anyway, Unicode is much more complex than any of us could hope, but I 
digress.


Paul
