Hi,
Peter Brett writes:
> l...@gnu.org (Ludovic Courtès) writes:
[...]
>> My impression is that GLib & co. are well-equipped to deal with UTF-8
>> whereas other C libraries and programs would rather work with locale
>> encoding or ‘wchar_t’ using the standard C APIs.
>>
>
> Yep, lots of GLib.
l...@gnu.org (Ludovic Courtès) writes:
> Hi Peter,
>
> Peter Brett writes:
>
>> It would certainly make my life as a downstream application maintainer
>> much, much easier if all Guile API functions that accept a C string
>> argument expected UTF-8.
>
> Out of curiosity, what kind of C libraries
Hello,
Andy Wingo writes:
> Any change to Guile's internal character encoding should not start from
> the premise that string-ref is obsolete or unimportant, especially
> considering that there is no other standard "string pointer" mechanism.
+1
There are idioms like:
(let ((start (string-i
On Sun 13 Mar 2011 22:30, l...@gnu.org (Ludovic Courtès) writes:
> So yes, the current implementation has bugs, but I think most if not all
> can be fixed with minimal changes. Would you like to look into it
> for 2.0.x?
I very much agree with this sentiment for 2.0.x. Let's not let the
perfect
On Tue 15 Mar 2011 23:49, Mark H Weaver writes:
>> Well, we covered O(1) vs O(n). To make UTF-8 O(1), you need to store
>> additional indexing information of some sort. There are various schemes,
>> but, depending on the scheme, you lose some of the memory advantage of UTF-8
>> vs UTF-32. You can
On Sun 20 Mar 2011 23:12, l...@gnu.org (Ludovic Courtès) writes:
> For 2.1.x, things are different. I’m happy to revisit not only the
> internal storage approach but also the O(1) ref/set! (the latter should
> be discussed in light of the trend in other Schemes, though.)
[...]
> Again, if you w
Hi Mark!
I think UTF-8 could be a good plan for 2.1/2.2, but I wanted to make
sure we understand what string-ref is good for...
On Sat 12 Mar 2011 00:09, Mark H Weaver writes:
> I claim that any reasonable code which currently uses string-ref and
> string-set! could be more cleanly written usin
On Sat 19 Mar 2011 15:06, Mark H Weaver writes:
> Let me ask you this: why would you oppose changing the scm_c_ functions
> to use UTF-8 by default? If you're comfortable with ASCII-only names,
> then UTF-8 will work fine for you, since ASCII strings are unchanged in
> UTF-8.
This is true. Wel
Hi Peter,
Peter Brett writes:
> It would certainly make my life as a downstream application maintainer
> much, much easier if all Guile API functions that accept a C string
> argument expected UTF-8.
Out of curiosity, what kind of C libraries and tools do you use?
My impression is that GLib &
Hi Peter,
On Tue 29 Mar 2011 14:39, Peter Brett writes:
> Andy Wingo writes:
>
>> Finally, users are moving away from these functions anyway. The thing
>> to do now is to write Scheme, not C: and in Scheme we do the Right
>> Thing.
>
> I hope I'm misinterpreting this statement. Using Guile as
Andy Wingo writes:
> Finally, users are moving away from these functions anyway. The thing
> to do now is to write Scheme, not C: and in Scheme we do the Right
> Thing.
I hope I'm misinterpreting this statement. Using Guile as an extension
language for applications depends *heavily* on writing
Hi Mark,
Mark H Weaver writes:
> l...@gnu.org (Ludovic Courtès) writes:
>>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
>>> to UTF-8, along with a flag that indicates whether it is known to be
>>> ASCII-only.
>>
>> The whole point of the narrow/wide distinction was to
Hello Unicode fellows! :-)
Mark H Weaver writes:
> Andy Wingo writes:
>>> Ludovic, Andy and I discussed this on IRC, and came to the conclusion
>>> that UTF-8 should be the encoding assumed by functions such as
>>> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
>>> scm_c_ex
Andy Wingo writes:
> Have patience :) We will get there in time. The process of
> consensus-building is work. This is a very important decision to make,
> and its engineering implications are large. We've only been discussing
> it for a week :)
Words of wisdom, to be sure. For what it's wort
Andy Wingo writes:
> I am quite sensitive to the "justice" argument -- that we not restrict
> the names our users give to Scheme identifiers, or the characters they
> use in their strings. But these values typically come from literals in
> C source code, which has no portable superset of ASCII.
Noah Lavine writes:
> I think there are two questions being conflated here: what Guile's
> internal string representation should be, and what convenience
> functions should be provided for users to easily make symbols.
Yes, you are absolutely right. They are two separate questions. They
were cl
Hello!
I have been sitting in the sun pondering all this for an hour or so now,
and there is a lot to say, I think. But we should get this out of the
way first:
On Sat 19 Mar 2011 15:06, Mark H Weaver writes:
> As a meta-comment: I've grown rather weary from fighting this battle
> alone. My h
Hello all,
>> Furthermore, such a default would not restrict our users at all -- they
>> can always use the non-_c_ variants with a symbol explicitly constructed
>> with (e.g.) scm_from_utf8_symbol.
>
> We have those convenience functions for a reason. You recently proposed
> several more conveni
Andy Wingo writes:
>> Ludovic, Andy and I discussed this on IRC, and came to the conclusion
>> that UTF-8 should be the encoding assumed by functions such as
>> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
>> scm_c_export, scm_c_define_module, scm_c_resolve_module,
>> scm_c_u
Greetings,
On Wed 16 Mar 2011 02:12, Mark H Weaver writes:
> Ludovic, Andy and I discussed this on IRC, and came to the conclusion
> that UTF-8 should be the encoding assumed by functions such as
> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
> scm_c_export, scm_c_define_mo
Thien-Thi Nguyen writes:
> () Mark H Weaver
> () Thu, 17 Mar 2011 21:38:28 -0400
>
>If we may assume that the searched string is valid UTF-8, and when only
>ASCII characters are excluded (e.g. "."), then three additional states
>are required in the generated DFA. Let us call them S1
() Mark H Weaver
() Thu, 17 Mar 2011 21:38:28 -0400
If we may assume that the searched string is valid UTF-8, and when only
ASCII characters are excluded (e.g. "."), then three additional states
are required in the generated DFA. Let us call them S1, S2, and S3.
[handling these stat
Thien-Thi Nguyen writes:
> In unibyte land, "." matches a byte. OK.
>
> In multibyte land done "bytewise", "." matches ____.
> (What goes in the blank?)
"." (and more generally [^...]) is equivalent to (a|b|c|d|...) where
every valid UTF-8 character is present in the disjunction except f
() Mark H Weaver
() Thu, 17 Mar 2011 13:58:42 -0400
* regexp search: The search itself can be implemented bytewise, exactly
as if it was a fixed-width encoding. Compiling the regexp can
_almost_ be implemented as if the UTF-8-encoded regexp was in a
fixed-width encoding, with j
Hi!
Mark H Weaver writes:
> (string-upcase "Straße") => "STRAßE" (should be "STRASSE")
> (string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαoσς")
> (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαoς σ")
> (string-ci=? "Straße" "Strasse") => #f (should
> From:Ludovic Courtès
> >> Can we first check what would need to be done to fix this in 2.0.x?
> >>
> >> At first glance:
> >>
> >> - “Straße” is normally stored as a Latin1 string, so it would need to
> >> be converted to UTF-* before it can be passed to one of the
> >> unicase.h fun
l...@gnu.org (Ludovic Courtès) writes:
>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
>> to UTF-8, along with a flag that indicates whether it is known to be
>> ASCII-only.
>
> The whole point of the narrow/wide distinction was to avoid
> variable-width encodings. In ad
Hi Mark,
Mark H Weaver writes:
> I have a compromise proposal, which could be implemented for 2.0.x:
>
> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
> to UTF-8, along with a flag that indicates whether it is known to be
> ASCII-only.
The whole point of the narrow/wid
I have a compromise proposal, which could be implemented for 2.0.x:
We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
to UTF-8, along with a flag that indicates whether it is known to be
ASCII-only.
Applying string-ref or string-set! to a narrow stringbuf would upgrade
it to
Hi,
Mike Gran writes:
>> From:Ludovic Courtès
>
>> > I know of two categories of bugs. One has to do with case conversions
>> > and case-insensitive comparisons, which must be done on entire strings
>> > but are currently done for each character. Here are some examples:
>> >
>> > (string-up
> From:Ludovic Courtès
> > I know of two categories of bugs. One has to do with case conversions
> > and case-insensitive comparisons, which must be done on entire strings
> > but are currently done for each character. Here are some examples:
> >
> > (string-upcase "Straße") => "STRAß
Hello Mark,
Mark H Weaver writes:
> Mike Gran writes:
>>> The reason I am still arguing this point is because I have looked
>>> seriously at what I would need to do to (A) fix our i18n problems and
>>> (B) make the code efficient. I very much want to fix these things,
>>> but the pain of tryin
> From:Alex Shinn
> > Keep in mind that the UTF-8 forward iterator operation has conditional
> > branches. Merely the act of advancing from one character to another
> > could take one of four paths, or more if you include the possibility
> > of invalid UTF-8 sequences.
>
> No, technically you d
> (string-upcase "Straße") => "STRAßE" (should be "STRASSE")
> (string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαoσς")
> (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαoς σ")
Well, yes and no. R6RS yes. SRFI-13 no.
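
To illustrate why the examples quoted above cannot be fixed one character at a time: upcasing "ß" yields two characters, so the result can be longer than the input. The following is a minimal sketch in plain C, not Guile's implementation. The name `sketch_upcase_utf8` is hypothetical, and the function handles only ASCII letters and U+00DF; a real implementation would rely on a full Unicode case-mapping library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Upcase the NUL-terminated UTF-8 string `in` into `out`, whose capacity
   `cap` includes room for the terminating NUL.  Only ASCII letters and
   U+00DF (ß, bytes 0xC3 0x9F) are mapped; every other byte is copied
   unchanged.  Returns the number of bytes written, excluding the NUL,
   or (size_t) -1 if `out` is too small.  */
size_t
sketch_upcase_utf8 (const char *in, char *out, size_t cap)
{
  size_t n = 0;

  assert (in != NULL && out != NULL);
  if (cap == 0)
    return (size_t) -1;

  while (*in != '\0')
    {
      const unsigned char c = (unsigned char) in[0];

      if (c == 0xC3 && (unsigned char) in[1] == 0x9F)
        {
          /* One character in, two characters out: ß -> SS.  */
          if (n + 2 >= cap)
            return (size_t) -1;
          out[n++] = 'S';
          out[n++] = 'S';
          in += 2;
        }
      else
        {
          if (n + 1 >= cap)
            return (size_t) -1;
          out[n++] = (c >= 'a' && c <= 'z') ? (char) (c - 'a' + 'A') : in[0];
          in += 1;
        }
    }

  out[n] = '\0';
  return n;
}
```

Called on the UTF-8 bytes of "Straße", this writes "STRASSE": six characters in, seven out, which a mapping applied to each character in isolation cannot produce.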
Mike Gran writes:
>> The reason I am still arguing this point is because I have looked
>> seriously at what I would need to do to (A) fix our i18n problems and
>> (B) make the code efficient. I very much want to fix these things,
>> but the pain of trying to do this with our current scheme is too
On Wed, Mar 16, 2011 at 5:39 AM, Mike Gran wrote:
>> From:Mark H Weaver
>>
>> Mike Gran writes:
>> > We do, in a manner of speaking, have a single string representation:
>> > UTF-32. The 'narrow' encoding is UTF-32 with the initial 3 bytes of
>> > zero removed.
>>
>> Despite the similarity o
> The reason I am still arguing this point is because I have looked
> seriously at what I would need to do to (A) fix our i18n problems and
> (B) make the code efficient. I very much want to fix these things,
> but the pain of trying to do this with our current scheme is too much
> for me to bear.
Mike Gran writes:
>> From:Mark H Weaver
>> Despite the similarity of these two representations, they are
>> sufficiently different that they cannot be handled by the same machine
>> code. That means you must either implement multiple inner loops, one
>> for each combination of string parameter r
> From:Mark H Weaver
>
> Mike Gran writes:
> > We do, in a manner of speaking, have a single string representation:
> > UTF-32. The 'narrow' encoding is UTF-32 with the initial 3 bytes of
> > zero removed.
>
> Despite the similarity of these two representations, they are
> sufficiently diff
Mike Gran writes:
> We do, in a manner of speaking, have a single string representation:
> UTF-32. The 'narrow' encoding is UTF-32 with the initial 3 bytes of
> zero removed.
Despite the similarity of these two representations, they are
sufficiently different that they cannot be handled by the s
Hi Mark,
Mark H Weaver writes:
> Unfortunately, the alternatives are not pleasant. We have a bunch of
> bugs in our string handling functions. Currently, our case-insensitive
> string comparisons and case conversions are not correct for several
> languages including German, according to the R6
> From:Mark H Weaver
>
> l...@gnu.org (Ludovic Courtès) writes:
> > I find Cowan’s proposal for string iteration and the R6RS editors
> > response interesting:
> >
> > http://www.r6rs.org/formal-comments/comment-235.txt
>
> Cowan was proposing a complex new API. I am not, nor did Gauche.
> A
l...@gnu.org (Ludovic Courtès) writes:
> I find Cowan’s proposal for string iteration and the R6RS editors
> response interesting:
>
> http://www.r6rs.org/formal-comments/comment-235.txt
Cowan was proposing a complex new API. I am not, nor did Gauche.
An efficient implementation of string ports
Hello!
Mark H Weaver writes:
> I claim that any reasonable code which currently uses string-ref and
> string-set! could be more cleanly written using string ports or
> string-{fold,unfold}{,-right}.
I agree, and we should encourage this. However...
I find Cowan’s proposal for string iteration
Hello!
Mark H Weaver writes:
> I'm aware that this proposal will be very controversial, but starting in
> Guile 2.2, I think we ought to consider storing strings internally in
> UTF-8, as is done in Gauche.
I don’t think so.
Thanks,
Ludo’.
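
For context on what is at stake in the UTF-8 storage proposal: with a variable-width encoding, finding the k-th character means scanning from the start of the string, so a naive string-ref costs O(k) rather than O(1). The following is a minimal sketch in plain C, assuming valid UTF-8 input. The name `sketch_utf8_ref` is hypothetical; it is not a Guile or Gauche function.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the first byte of the k-th character (0-based) of
   the NUL-terminated, valid UTF-8 string `s`, or NULL if the string has
   k or fewer characters.  The scan visits every byte before the target,
   so the cost grows linearly with k.  */
const char *
sketch_utf8_ref (const char *s, size_t k)
{
  assert (s != NULL);

  for (; *s != '\0'; s++)
    {
      /* Continuation bytes have the form 10xxxxxx; any other byte
         begins a new character.  */
      if (((unsigned char) *s & 0xC0) != 0x80)
        {
          if (k == 0)
            return s;
          k--;
        }
    }

  return NULL;
}
```

Schemes that restore O(1) access keep extra index information alongside the bytes, which is the memory trade-off raised elsewhere in this thread.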
I wrote:
> I'm aware that this proposal will be very controversial, but starting in
> Guile 2.2, I think we ought to consider storing strings internally in
> UTF-8, as is done in Gauche. This would of course make string-ref and
> string-set! into O(n) operations. However, I claim that any code th
Sorry, I accidentally sent out an only partly-written draft message.
Please disregard for now; I will finish writing it later.
Mark