Re: Using libunistring for string comparisons et al

2011-03-31 Thread Ludovic Courtès
Hi, Peter Brett writes: > l...@gnu.org (Ludovic Courtès) writes: [...] >> My impression is that GLib & co. are well-equipped to deal with UTF-8 >> whereas other C libraries and programs would rather work with locale >> encoding or ‘wchar_t’ using the standard C APIs. >> > > Yep, lots of GLib.

Re: Using libunistring for string comparisons et al

2011-03-31 Thread Peter Brett
l...@gnu.org (Ludovic Courtès) writes: > Hi Peter, > > Peter Brett writes: > >> It would certainly make my life as a downstream application maintainer >> much, much easier if all Guile API functions that accept a C string >> argument expected UTF-8. > > Out of curiosity, what kind of C libraries

Re: Using libunistring for string comparisons et al

2011-03-31 Thread Ludovic Courtès
Hello, Andy Wingo writes: > Any change to Guile's internal character encoding should not start from > the premise that string-ref is obsolete or unimportant, especially > considering that there is no other standard "string pointer" mechanism. +1 There are idioms like: (let ((start (string-i

Re: Using libunistring for string comparisons et al

2011-03-30 Thread Andy Wingo
On Sun 13 Mar 2011 22:30, l...@gnu.org (Ludovic Courtès) writes: > So yes, the current implementation has bugs, but I think most if not all > can be fixed with minimal changes. Would you like to look into it > for 2.0.x? I very much agree with this sentiment for 2.0.x. Let's not let the perfect

Re: Using libunistring for string comparisons et al

2011-03-30 Thread Andy Wingo
On Tue 15 Mar 2011 23:49, Mark H Weaver writes: >> Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store >> additional indexing information of some sort.  There are various schemes, >> but, depending on the scheme, you lose some of the memory advantage of UTF-8 >> vs UTF-32.  You can
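
To make the O(1) vs O(n) point concrete, here is a minimal sketch of character indexing over a plain UTF-8 buffer with no side index. The function name `utf8_offset_of_index` is hypothetical and is not part of Guile or libunistring; the sketch assumes valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, not a Guile or libunistring function.
   Returns the byte offset of the code point at character index `idx`
   in the valid, NUL-terminated UTF-8 string `s`, or (size_t)-1 when
   `idx` is past the end.  Every call scans from the start, so the
   cost grows with `idx`: this is the O(n) string-ref. */
size_t utf8_offset_of_index(const char *s, size_t idx)
{
    size_t off = 0;   /* current byte offset */
    size_t seen = 0;  /* code points passed so far */

    while (s[off] != '\0') {
        /* Any byte that is not of the form 10xxxxxx starts a code point. */
        if (((unsigned char)s[off] & 0xC0) != 0x80) {
            if (seen == idx)
                return off;
            seen++;
        }
        off++;
    }
    return (size_t)-1;
}
```

An index table that records the byte offset of, say, every 64th character would bound the scan, at the price of the extra memory mentioned in the quoted message.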

Re: Using libunistring for string comparisons et al

2011-03-30 Thread Andy Wingo
On Sun 20 Mar 2011 23:12, l...@gnu.org (Ludovic Courtès) writes: > For 2.1.x, things are different. I’m happy to revisit not only the > internal storage approach but also the O(1) ref/set! (the latter should > be discussed in light of the trend in other Schemes, though.) [...] > Again, if you w

Re: Using libunistring for string comparisons et al

2011-03-30 Thread Andy Wingo
Hi Mark! I think UTF-8 could be a good plan for 2.1/2.2, but I wanted to make sure we understand what string-ref is good for... On Sat 12 Mar 2011 00:09, Mark H Weaver writes: > I claim that any reasonable code which currently uses string-ref and > string-set! could be more cleanly written usin

Re: Using libunistring for string comparisons et al

2011-03-30 Thread Andy Wingo
On Sat 19 Mar 2011 15:06, Mark H Weaver writes: > Let me ask you this: why would you oppose changing the scm_c_ functions > to use UTF-8 by default? If you're comfortable with ASCII-only names, > then UTF-8 will work fine for you, since ASCII strings are unchanged in > UTF-8. This is true. Wel

Re: Using libunistring for string comparisons et al

2011-03-29 Thread Ludovic Courtès
Hi Peter, Peter Brett writes: > It would certainly make my life as a downstream application maintainer > much, much easier if all Guile API functions that accept a C string > argument expected UTF-8. Out of curiosity, what kind of C libraries and tools do you use? My impression is that GLib &

Re: Using libunistring for string comparisons et al

2011-03-29 Thread Andy Wingo
Hi Peter, On Tue 29 Mar 2011 14:39, Peter Brett writes: > Andy Wingo writes: > >> Finally, users are moving away from these functions anyway. The thing >> to do now is to write Scheme, not C: and in Scheme we do the Right >> Thing. > > I hope I'm misinterpreting this statement. Using Guile as

Re: Using libunistring for string comparisons et al

2011-03-29 Thread Peter Brett
Andy Wingo writes: > Finally, users are moving away from these functions anyway. The thing > to do now is to write Scheme, not C: and in Scheme we do the Right > Thing. I hope I'm misinterpreting this statement. Using Guile as an extension language for applications depends *heavily* on writing

Re: Using libunistring for string comparisons et al

2011-03-20 Thread Ludovic Courtès
Hi Mark, Mark H Weaver writes: > l...@gnu.org (Ludovic Courtès) writes: >>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs >>> to UTF-8, along with a flag that indicates whether it is known to be >>> ASCII-only. >> >> The whole point of the narrow/wide distinction was to

Re: Using libunistring for string comparisons et al

2011-03-20 Thread Ludovic Courtès
Hello Unicode fellows! :-) Mark H Weaver writes: > Andy Wingo writes: >>> Ludovic, Andy and I discussed this on IRC, and came to the conclusion >>> that UTF-8 should be the encoding assumed by functions such as >>> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic, >>> scm_c_ex

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Mark H Weaver
Andy Wingo writes: > Have patience :) We will get there in time. The process of > consensus-building is work. This is a very important decision to make, > and its engineering implications are large. We've only been discussing > it for a week :) Words of wisdom, to be sure. For what it's wort

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Mark H Weaver
Andy Wingo writes: > I am quite sensitive to the "justice" argument -- that we not restrict > the names our users give to Scheme identifiers, or the characters they > use in their strings. But these values typically come from literals in > C source code, which has no portable superset of ASCII.

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Mark H Weaver
Noah Lavine writes: > I think there are two questions being conflated here: what Guile's > internal string representation should be, and what convenience > functions should be provided for users to easily make symbols. Yes, you are absolutely right. They are two separate questions. They were cl

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Andy Wingo
Hello! I have been sitting in the sun pondering all this for an hour or so now, and there is a lot to say, I think. But we should get this out of the way first: On Sat 19 Mar 2011 15:06, Mark H Weaver writes: > As a meta-comment: I've grown rather weary from fighting this battle > alone. My h

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Noah Lavine
Hello all, >> Furthermore, such a default would not restrict our users at all -- they >> can always use the non-_c_ variants with a symbol explicitly constructed >> with (e.g.) scm_from_utf8_symbol. > > We have those convenience functions for a reason.  You recently proposed > several more conveni

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Mark H Weaver
Andy Wingo writes: >> Ludovic, Andy and I discussed this on IRC, and came to the conclusion >> that UTF-8 should be the encoding assumed by functions such as >> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic, >> scm_c_export, scm_c_define_module, scm_c_resolve_module, >> scm_c_u

Re: Using libunistring for string comparisons et al

2011-03-19 Thread Andy Wingo
Greetings, On Wed 16 Mar 2011 02:12, Mark H Weaver writes: > Ludovic, Andy and I discussed this on IRC, and came to the conclusion > that UTF-8 should be the encoding assumed by functions such as > scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic, > scm_c_export, scm_c_define_mo

Re: Using libunistring for string comparisons et al

2011-03-18 Thread Mark H Weaver
Thien-Thi Nguyen writes: > () Mark H Weaver > () Thu, 17 Mar 2011 21:38:28 -0400 > >If we may assume that the searched string is valid UTF-8, and when only >ASCII characters are excluded (e.g. "."), then three additional states >are required in the generated DFA. Let us call them S1

Re: Using libunistring for string comparisons et al

2011-03-18 Thread Thien-Thi Nguyen
() Mark H Weaver () Thu, 17 Mar 2011 21:38:28 -0400 If we may assume that the searched string is valid UTF-8, and when only ASCII characters are excluded (e.g. "."), then three additional states are required in the generated DFA. Let us call them S1, S2, and S3. [handling these stat
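
The snippet is cut off before it says how S1, S2, and S3 are handled, so the following is only one plausible reading: each extra state records how many continuation bytes are still owed. Under that assumption, and assuming valid UTF-8 input as the message does, a bytewise DFA for "." (any character except newline) could be sketched like this. All names are hypothetical.

```c
#include <stddef.h>

/* Assumed meaning of the extra states (not confirmed by the thread):
   S1, S2, S3 = one, two, three continuation bytes still expected. */
enum state { START, S1, S2, S3, ACCEPT, REJECT };

static enum state step(enum state st, unsigned char b)
{
    switch (st) {
    case START:
        if (b == '\n') return REJECT;            /* the excluded ASCII character */
        if (b < 0x80) return ACCEPT;             /* any other ASCII byte */
        if (b >= 0xC0 && b < 0xE0) return S1;    /* lead byte of a 2-byte sequence */
        if (b >= 0xE0 && b < 0xF0) return S2;    /* lead byte of a 3-byte sequence */
        if (b >= 0xF0 && b < 0xF8) return S3;    /* lead byte of a 4-byte sequence */
        return REJECT;                           /* stray continuation byte */
    case S3: return (b & 0xC0) == 0x80 ? S2 : REJECT;
    case S2: return (b & 0xC0) == 0x80 ? S1 : REJECT;
    case S1: return (b & 0xC0) == 0x80 ? ACCEPT : REJECT;
    default: return REJECT;
    }
}

/* Returns the number of bytes matched by "." at the start of `s`,
   or 0 if there is no match (newline, empty, or truncated input). */
size_t match_dot(const unsigned char *s, size_t len)
{
    enum state st = START;
    for (size_t i = 0; i < len; i++) {
        st = step(st, s[i]);
        if (st == ACCEPT) return i + 1;
        if (st == REJECT) return 0;
    }
    return 0;
}
```

Because the input is assumed valid, the sketch does not reject overlong forms or surrogates; a matcher for untrusted input would need more states.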

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Mark H Weaver
Thien-Thi Nguyen writes: > In unibyte land, "." matches a byte. OK. > > In multibyte land done "bytewise", "." matches . > (What goes in the blank?) "." (and more generally [^...]) is equivalent to (a|b|c|d|...) where every valid UTF-8 character is present in the disjunction except f

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Thien-Thi Nguyen
() Mark H Weaver () Thu, 17 Mar 2011 13:58:42 -0400 * regexp search: The search itself can be implemented bytewise, exactly as if it was a fixed-width encoding. Compiling the regexp can _almost_ be implemented as if the UTF-8-encoded regexp was in a fixed-width encoding, with j

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Ludovic Courtès
Hi! Mark H Weaver writes: > (string-upcase "Straße") => "STRAßE" (should be "STRASSE") > (string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαoσς") > (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαoς σ") > (string-ci=? "Straße" "Strasse") => #f (should

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Mike Gran
> From:Ludovic Courtès > >> Can we first check what would need to be done to fix this in 2.0.x? > >> > >> At first glance: > >> > >>   - “Straße” is normally stored as a Latin1 string, so it would need to > >>     be converted to UTF-* before it can be passed to one of the > >>     unicase.h fun

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Mark H Weaver
l...@gnu.org (Ludovic Courtès) writes: >> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs >> to UTF-8, along with a flag that indicates whether it is known to be >> ASCII-only. > > The whole point of the narrow/wide distinction was to avoid > variable-width encodings. In ad

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Ludovic Courtès
Hi Mark, Mark H Weaver writes: > I have a compromise proposal, which could be implemented for 2.0.x: > > We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs > to UTF-8, along with a flag that indicates whether it is known to be > ASCII-only. The whole point of the narrow/wid

Re: Using libunistring for string comparisons et al

2011-03-17 Thread Mark H Weaver
I have a compromise proposal, which could be implemented for 2.0.x: We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs to UTF-8, along with a flag that indicates whether it is known to be ASCII-only. Applying string-ref or string-set! to a narrow stringbuf would upgrade it to
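
As a rough illustration of the "known to be ASCII-only" flag in this proposal: if every byte of a UTF-8 buffer is below 0x80, character index and byte offset coincide, so indexing stays O(1) without any conversion. The check below is a minimal sketch with a hypothetical name; it says nothing about how Guile's stringbufs would actually store or maintain the flag.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first `len` bytes of `s` are all
   ASCII (below 0x80), else 0.  For such a buffer, the character at
   index i is simply s[i]. */
int utf8_is_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return 0;
    return 1;
}
```

The flag would presumably be computed once when the buffer is created and cleared when a non-ASCII character is stored, though the truncated message does not spell that out.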

Re: Using libunistring for string comparisons et al

2011-03-16 Thread Ludovic Courtès
Hi, Mike Gran writes: >> From:Ludovic Courtès > >> > I know of two categories of bugs.  One has to do with case conversions >> > and case-insensitive comparisons, which must be done on entire strings >> > but are currently done for each character.  Here are some examples: >> > >> >  (string-up

Re: Using libunistring for string comparisons et al

2011-03-16 Thread Mike Gran
> From:Ludovic Courtès > > I know of two categories of bugs.  One has to do with case conversions > > and case-insensitive comparisons, which must be done on entire strings > > but are currently done for each character.  Here are some examples: > > > >  (string-upcase "Straße")        => "STRAß

Re: Using libunistring for string comparisons et al

2011-03-16 Thread Ludovic Courtès
Hello Mark, Mark H Weaver writes: > Mike Gran writes: >>> The reason I am still arguing this point is because I have looked >>> seriously at what I would need to do to (A) fix our i18n problems and >>> (B) make the code efficient.  I very much want to fix these things, >>> but the pain of tryin

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mike Gran
> From:Alex Shinn > > Keep in mind that the UTF-8 forward iterator operation has conditional > > branches.  Merely the act of advancing from one character to another > > could take one of four paths, or more if you include the possibility > > of invalid UTF-8 sequences. > > No, technically you d
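
For reference, the four paths Mike mentions look roughly like this in a straightforward decoder. This is a sketch with a hypothetical function name, assuming the input is already known to be valid UTF-8; handling invalid sequences would add further branches, and the truncated reply above may well describe a branch-free alternative.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from valid UTF-8 at `p`,
   stores it in *cp, and returns the number of bytes consumed (1 to 4).
   One branch per sequence length. */
size_t utf8_next(const unsigned char *p, uint32_t *cp)
{
    if (p[0] < 0x80) {                      /* 0xxxxxxx */
        *cp = p[0];
        return 1;
    }
    if (p[0] < 0xE0) {                      /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if (p[0] < 0xF0) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6)
            | (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    *cp = ((uint32_t)(p[0] & 0x07) << 18)
        | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6)
        | (uint32_t)(p[3] & 0x3F);
    return 4;
}
```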

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mike Gran
> (string-upcase "Straße") => "STRAßE" (should be "STRASSE") > (string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαoσς") > (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαoς σ") Well, yes and no.  R6RS yes.  SRFI-13 no.

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mark H Weaver
Mike Gran writes: >> The reason I am still arguing this point is because I have looked >> seriously at what I would need to do to (A) fix our i18n problems and >> (B) make the code efficient.  I very much want to fix these things, >> but the pain of trying to do this with our current scheme is too

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Alex Shinn
On Wed, Mar 16, 2011 at 5:39 AM, Mike Gran wrote: >> From: Mark H Weaver >> >> Mike Gran writes: >> > We do, in a matter of speaking, have a single string representation: >> > UTF-32.  The 'narrow' encoding is UTF-32 with the initial 3 bytes of >> > zero removed. >> >> Despite the similarity o

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mike Gran
> The reason I am still arguing this point is because I have looked > seriously at what I would need to do to (A) fix our i18n problems and > (B) make the code efficient.  I very much want to fix these things, > but the pain of trying to do this with our current scheme is too much > for me to bear.

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mark H Weaver
Mike Gran writes: >> From:Mark H Weaver >> Despite the similarity of these two representations, they are >> sufficiently different that they cannot be handled by the same machine >> code.  That means you must either implement multiple inner loops, one >> for each combination of string parameter r

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mike Gran
> From: Mark H Weaver > > Mike Gran writes: > > We do, in a matter of speaking, have a single string representation: > > UTF-32.  The 'narrow' encoding is UTF-32 with the initial 3 bytes of > > zero removed. > > Despite the similarity of these two representations, they are > sufficiently diff

Re: Using libunistring for string comparisons et al

2011-03-15 Thread Mark H Weaver
Mike Gran writes: > We do, in a matter of speaking, have a single string representation: > UTF-32. The 'narrow' encoding is UTF-32 with the initial 3 bytes of > zero removed. Despite the similarity of these two representations, they are sufficiently different that they cannot be handled by the s
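
Both views in this exchange can be seen in a small sketch. The struct below is hypothetical and much simpler than Guile's actual stringbuf; it only assumes a one-byte-per-character narrow form and a four-byte-per-character wide form. Reading a narrow element is a zero-extension of the wide one (Mike's point), yet the element width differs, so code either dispatches on the representation or is duplicated per representation (Mark's point).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-representation buffer, for illustration only. */
struct strbuf {
    int is_wide;                  /* 0: one byte per char, 1: four bytes per char */
    size_t len;                   /* length in characters */
    union {
        const uint8_t *narrow;    /* code points 0..255 */
        const uint32_t *wide;     /* full UTF-32 code points */
    } u;
};

/* O(1) character access for either representation.  The conditional
   on is_wide is the per-access dispatch under discussion. */
uint32_t strbuf_ref(const struct strbuf *b, size_t i)
{
    return b->is_wide ? b->u.wide[i] : (uint32_t)b->u.narrow[i];
}

/* Widening a narrow buffer: each byte is zero-extended to 32 bits. */
void strbuf_widen(const uint8_t *narrow, uint32_t *wide, size_t len)
{
    for (size_t i = 0; i < len; i++)
        wide[i] = narrow[i];
}
```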

Re: Using libunistring for string comparisons et al

2011-03-13 Thread Ludovic Courtès
Hi Mark, Mark H Weaver writes: > Unfortunately, the alternatives are not pleasant. We have a bunch of > bugs in our string handling functions. Currently, our case-insensitive > string comparisons and case conversions are not correct for several > languages including German, according to the R6

Re: Using libunistring for string comparisons et al

2011-03-12 Thread Mike Gran
> From:Mark H Weaver > > l...@gnu.org (Ludovic Courtès) writes: > > I find Cowan’s proposal for string iteration and the R6RS editors > > response interesting: > > > >  http://www.r6rs.org/formal-comments/comment-235.txt > > Cowan was proposing a complex new API.  I am not, nor did Gauche. > A

Re: Using libunistring for string comparisons et al

2011-03-12 Thread Mark H Weaver
l...@gnu.org (Ludovic Courtès) writes: > I find Cowan’s proposal for string iteration and the R6RS editors > response interesting: > > http://www.r6rs.org/formal-comments/comment-235.txt Cowan was proposing a complex new API. I am not, nor did Gauche. An efficient implementation of string ports

Re: Using libunistring for string comparisons et al

2011-03-12 Thread Ludovic Courtès
Hello! Mark H Weaver writes: > I claim that any reasonable code which currently uses string-ref and > string-set! could be more cleanly written using string ports or > string-{fold,unfold}{,-right}. I agree, and we should encourage this. However... I find Cowan’s proposal for string iteration

Re: Using libunistring for string comparisons et al

2011-03-12 Thread Ludovic Courtès
Hello! Mark H Weaver writes: > I'm aware that this proposal will be very controversial, but starting in > Guile 2.2, I think we ought to consider storing strings internally in > UTF-8, as is done in Gauche. I don’t think so. Thanks, Ludo’.

Re: Using libunistring for string comparisons et al

2011-03-11 Thread Mark H Weaver
I wrote: > I'm aware that this proposal will be very controversial, but starting in > Guile 2.2, I think we ought to consider storing strings internally in > UTF-8, as is done in Gauche. This would of course make string-ref and > string-set! into O(n) operations. However, I claim that any code th

Re: Using libunistring for string comparisons et al

2011-03-11 Thread Mark H Weaver
Sorry, I accidentally sent out an only partly-written draft message. Please disregard for now; I will finish writing it later. Mark