Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7

2009-09-09 Thread Ludovic Courtès
Hi,

Mike Gran  writes:

> On Wed, 2009-09-09 at 01:00 +0200, Ludovic Courtès wrote:

[...]

>> > -  return scm_getc (input_port);
>> > +  return scm_get_byte_or_eof (input_port);
>> 
>> This is actually an earlier change, but the prototype of scm_getc is now
>> different from that in 1.8.  Presumably, this means that it’s not
>> source-compatible with 1.8, e.g., on platforms where
>> sizeof (int) < sizeof (scm_t_wchar), right?

I was actually referring to the fact that 1.8 has:

  SCM_API int scm_getc (SCM port);

whereas 1.9 has:

  SCM_API scm_t_wchar scm_getc (SCM port);

What do you think?

>> > --- a/libguile/strings.h
>> > +++ b/libguile/strings.h
>> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM start, 
>> > SCM end);
>> >  SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end);
>> >  SCM_API SCM scm_string_append (SCM args);
>> >  
>> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, 
>> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, 
>> >   const char *encoding,
>> >   
>> > scm_t_string_failed_conversion_handler 
>> >   handler);
>> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar 
>> > *scm_i_string_wide_chars (SCM str);
>> >  SCM_INTERNAL SCM scm_i_string_start_writing (SCM str);
>> >  SCM_INTERNAL void scm_i_string_stop_writing (void);
>> >  SCM_INTERNAL int scm_i_is_narrow_string (SCM str);
>> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x);
>> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x);
>> 
>> Were these changes intended?
>
> Well, one of the two of them was intended.  :)

Shouldn’t both of them remain internal given that they have an ‘_i_’ in
their name?

>> > +  (with-locale "en_US.iso88591"
>> > +(pass-if-exception "no args" exception:wrong-num-args
>> > +  (regexp-quote))
>> 
>> Is the locale part of the API?  That is, should programs that use
>> regexps explicitly ask for a locale with 8-bit encoding?
>
> Basically yes. The libc regex is 8-bit, and it uses
> scm_to/from_locale_string to convert regex's input and output.

That’s unfortunate but OTOH it’s the same as in 1.8, so I guess it’s OK.

> Until libunistring comes with Unicode regex, I think this is the best we
> can do.

Yes, that would be neat!

Thanks,
Ludo’.





Re: [BDW-GC] "Inlined" storage; `scm_take_' functions

2009-09-09 Thread Ludovic Courtès
Hi Neil!

Neil Jerram  writes:

> l...@gnu.org (Ludovic Courtès) writes:

>> Stringbufs and bytevectors are now always "inlined" in the BDW-GC
>> branch [0, 1], which means that there's no cell->buffer indirection,
>> which greatly simplifies code (it also takes less room and may slightly
>> improve performance).
>>
>> The `scm_take_' functions for strings/symbols/bytevectors are now
>> essentially aliases to the corresponding `scm_from_' because we cannot
>> advantageously reuse the provided storage.
>
> That seems a bit of a shame.  (i.e. that we can't advantageously keep
> the caller's string or vector data)

It’s not such a shame IMO because:

  * You have to allocate anyway, to store the (double) cell, and
    allocating the whole thing may be just as costly as allocating the
    cell, at least for small stringbufs/bytevectors.

  * For stringbufs, the user-provided buffer can be reused only if it’s
    either Latin-1 or UCS-4, anyway.

  * Removing the indirection and using only GC-managed memory is
    beneficial for Scheme code (which doesn’t use ‘scm_take’).

  * Reusing the malloc(3)-allocated buffer means that we have to
    register a finalizer to later free(3) that buffer (see, e.g., commit
    d7e7a02a6251c8ed4f76933d9d30baeee3f599c0), which is costly (see, e.g.,
    http://www.hpl.hp.com/personal/Hans_Boehm/popl03/web/html/slide_7.html).

That said...

> Did you consider the option of
>
> - always having an indirection from the stringbuf/bytevector object to
> the underlying data

... this may be valuable (Andy pointed it out as well), at least for
bytevectors.  The indirection is a requirement for Andy’s
SRFI-4-on-bytevector patch set, so that ‘scm_take_u8vector ()’ can still
be supported; it’s also required if we want to provide mmap(3) bindings,
for instance, that return a bytevector.

For stringbufs, though, I’m happy if we can leave the code as it is.
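
To make the trade-off concrete, here is a minimal C sketch.  It is not
Guile's code: `inline_buf', `indirect_buf' and the three helpers are
hypothetical names, and plain malloc(3) stands in for GC-managed
allocation.  With the inlined layout, "take" can only copy the caller's
buffer and then free it; with the indirection, it can adopt the buffer
without copying.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "inlined" layout: header and bytes share one allocation.  */
struct inline_buf
{
  size_t len;
  unsigned char data[];         /* C99 flexible array member */
};

/* Hypothetical "indirect" layout: the header points at separate storage.  */
struct indirect_buf
{
  size_t len;
  unsigned char *data;
};

/* Like a `scm_from_' function: copy LEN bytes from SRC into a new object.  */
struct inline_buf *
inline_from (const unsigned char *src, size_t len)
{
  struct inline_buf *b;

  assert (src != NULL || len == 0);
  b = malloc (sizeof *b + len);
  if (b == NULL)
    return NULL;
  b->len = len;
  if (len > 0)
    memcpy (b->data, src, len);
  return b;
}

/* Like a `scm_take_' function on the inlined layout: the caller's buffer
   cannot be reused, so copy it and release the original.  */
struct inline_buf *
inline_take (unsigned char *src, size_t len)
{
  struct inline_buf *b = inline_from (src, len);

  if (b != NULL)
    free (src);
  return b;
}

/* "Take" on the indirect layout: adopt SRC as-is, without copying.
   Whoever frees the object must also free its data.  */
struct indirect_buf *
indirect_take (unsigned char *src, size_t len)
{
  struct indirect_buf *b = malloc (sizeof *b);

  if (b == NULL)
    return NULL;
  b->len = len;
  b->data = src;
  return b;
}
```

In the indirect case the adopted buffer still has to be released at some
point, which is where the finalizer cost mentioned in the list comes in.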

Thanks,
Ludo’.





Re: make check fails if no en_US.iso88591 locale

2009-09-09 Thread Ludovic Courtès
Hi,

Neil Jerram  writes:

> because I don't have an en_US.iso88591 locale installed, and so
>
>   (with-locale "en_US.iso88591" ...)
>
> throws an 'unresolved exception.

I’d suggest using ‘with-latin1-locale’ as in ‘bytevectors.test’ to
mitigate this problem.

(Something akin to Gnulib’s ‘locale-*.m4’ could be a good starting
point, too.)
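
For illustration only, the "try several names, use the first that works"
idea looks roughly like this at the C level.  This is a sketch, not what
the test suite does: `first_available_locale' is a hypothetical helper,
and which locale names exist depends entirely on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Try each name in NAMES in order and install the first one that
   setlocale(3) accepts.  Return that name, or NULL if none of them is
   available, in which case the current locale is left unchanged.  */
const char *
first_available_locale (const char *const *names, size_t count)
{
  size_t i;

  assert (names != NULL || count == 0);
  for (i = 0; i < count; i++)
    if (setlocale (LC_ALL, names[i]) != NULL)
      return names[i];
  return NULL;
}
```

A caller would pass a few candidate spellings of a Latin-1 locale and
report the test as unresolved, rather than failed, when NULL comes back.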

Thanks,
Ludo’.





Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7

2009-09-09 Thread Mike Gran
On Wed, 2009-09-09 at 09:42 +0200, Ludovic Courtès wrote:
> Hi,
> >> > -  return scm_getc (input_port);
> >> > +  return scm_get_byte_or_eof (input_port);
> >> 
> >> This is actually an earlier change, but the prototype of scm_getc is now
> >> different from that in 1.8.  Presumably, this means that it’s not
> >> source-compatible with 1.8, e.g., on platforms where
> >> sizeof (int) < sizeof (scm_t_wchar), right?
> 
> I was actually referring to the fact that 1.8 has:
> 
>   SCM_API int scm_getc (SCM port);
> 
> whereas 1.9 has:
> 
>   SCM_API scm_t_wchar scm_getc (SCM port);
> 
> What do you think?

Sorry, I misunderstood.  It is, as you say, incompatible.
scm_t_wchar is scm_t_int32, not int, so 16-bit int platforms
and 64-bit int platforms would notice the change.  I'm fairly
sure Guile doesn't run on 16-bit int platforms, but 64-bit int
platforms would notice the change.

I'd like to leave it scm_t_wchar == scm_t_int32.  Do you think that's a
problem?

> >> > --- a/libguile/strings.h
> >> > +++ b/libguile/strings.h
> >> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM 
> >> > start, SCM end);
> >> >  SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end);
> >> >  SCM_API SCM scm_string_append (SCM args);
> >> >  
> >> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, 
> >> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, 
> >> >   const char *encoding,
> >> >   
> >> > scm_t_string_failed_conversion_handler 
> >> >   handler);
> >> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar 
> >> > *scm_i_string_wide_chars (SCM str);
> >> >  SCM_INTERNAL SCM scm_i_string_start_writing (SCM str);
> >> >  SCM_INTERNAL void scm_i_string_stop_writing (void);
> >> >  SCM_INTERNAL int scm_i_is_narrow_string (SCM str);
> >> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x);
> >> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x);
> >> 
> >> Were these changes intended?
> >
> > Well, one of the two of them was intended.  :)
> 
> Shouldn’t both of them remain internal given that they have an ‘_i_’ in
> their name?

I seemed to need to make scm_i_from_stringn into SCM_API so that I could
use it in libguilereadline.  Pragmatically, it is now functioning as
'SCM_API scm_from_stringn'.  The gray area is if libguilereadline is
philosophically 'internal' or 'external'.  If libguilereadline is
philosophically 'internal' it could keep the name scm_i_from_stringn,
but, if that is just confusing, it should probably become
scm_from_stringn.


> > Until libunistring comes with Unicode regex, I think this is the best we
> > can do.
> 
> Yes, that would be neat!

It is on their todo.  They have header files preallocated for it.  It's a
big job, though.

Thanks,
Mike





Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7

2009-09-09 Thread Ludovic Courtès
Mike Gran  writes:

> On Wed, 2009-09-09 at 09:42 +0200, Ludovic Courtès wrote:

>> I was actually referring to the fact that 1.8 has:
>> 
>>   SCM_API int scm_getc (SCM port);
>> 
>> whereas 1.9 has:
>> 
>>   SCM_API scm_t_wchar scm_getc (SCM port);
>> 
>> What do you think?
>
> Sorry, I misunderstood.  It is, as you say, incompatible.
> scm_t_wchar is scm_t_int32, not int, so 16-bit int platforms
> and 64-bit int platforms would notice the change.  I'm fairly
> sure Guile doesn't run in 16-bit int platforms, but, 64-bit 
> platforms would notice the change.
>
> I'd like to leave it scm_t_wchar == scm_t_int32.  Do you think that's a
> problem?

I checked on {powerpc64,sparc64,mips64el,ia64}-linux-gnu:

  * sizeof (int) == 4 on all of them;

  * sizeof (long) == 4 on all of them,
    except on ia64 where sizeof (long) == 8.

So presumably we shouldn't worry?
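
The same check can be written down as code.  A minimal sketch, assuming
(as stated above) that scm_t_wchar is a signed 32-bit integer; `wchar32'
and the function names are hypothetical stand-ins, not Guile
identifiers.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for scm_t_wchar, described in this thread as a
   32-bit signed integer.  */
typedef int32_t wchar32;

/* Return 1 if every wchar32 value fits in a plain `int', i.e. if code
   that stores the result of the new scm_getc in an `int' loses nothing.  */
int
int_holds_wchar32 (void)
{
  return INT_MAX >= INT32_MAX && INT_MIN <= INT32_MIN;
}

/* Return 1 if `int' and wchar32 have the same size, as reported for the
   platforms checked above.  */
int
int_same_size_as_wchar32 (void)
{
  return sizeof (int) == sizeof (wchar32);
}

/* Abort (unless NDEBUG is defined) on a platform where `int' is too
   narrow, such as one with a 16-bit `int'.  */
void
check_getc_source_compat (void)
{
  assert (int_holds_wchar32 ());
}
```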

>> >> > --- a/libguile/strings.h
>> >> > +++ b/libguile/strings.h
>> >> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM 
>> >> > start, SCM end);
>> >> >  SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end);
>> >> >  SCM_API SCM scm_string_append (SCM args);
>> >> >  
>> >> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, 
>> >> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, 
>> >> >   const char *encoding,
>> >> >   
>> >> > scm_t_string_failed_conversion_handler 
>> >> >   handler);
>> >> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar 
>> >> > *scm_i_string_wide_chars (SCM str);
>> >> >  SCM_INTERNAL SCM scm_i_string_start_writing (SCM str);
>> >> >  SCM_INTERNAL void scm_i_string_stop_writing (void);
>> >> >  SCM_INTERNAL int scm_i_is_narrow_string (SCM str);
>> >> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x);
>> >> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x);
>> >> 
>> >> Were these changes intended?
>> >
>> > Well, one of the two of them was intended.  :)
>> 
>> Shouldn’t both of them remain internal given that they have an ‘_i_’ in
>> their name?
>
> I seemed to need to make scm_i_from_stringn into SCM_API so that I could
> use it in libguilereadline.  Pragmatically, it is now functioning as
> 'SCM_API scm_from_stringn'.

Cool.

> The gray area is if libguilereadline is philosophically 'internal' or
> 'external'.  If libguilereadline is philosophically 'internal' it
> could keep the name scm_i_from_stringn, but, if that is just
> confusing, it should probably become scm_from_stringn.

It's external.  If it needs something like `scm_from_stringn' then
potentially other users will need it as well, so we should have a public
API.
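
As a rough sketch of that convention (hypothetical names throughout; the
real SCM_API and SCM_INTERNAL definitions are not shown in this thread,
and the visibility attributes below assume a GCC-compatible compiler):
the `_i_' helper stays hidden, and external users such as a separate
library go through an exported function without the `_i_'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical visibility macros in the spirit of SCM_API and
   SCM_INTERNAL.  */
#if defined __GNUC__
# define DEMO_API      __attribute__ ((visibility ("default")))
# define DEMO_INTERNAL __attribute__ ((visibility ("hidden")))
#else
# define DEMO_API
# define DEMO_INTERNAL
#endif

/* Internal helper: the `_i_' marks it as private to the library.
   Return the number of bytes in STR before the first NUL, at most LEN.  */
DEMO_INTERNAL size_t
demo_i_strnlen (const char *str, size_t len)
{
  const char *nul;

  assert (str != NULL);
  nul = memchr (str, '\0', len);
  return nul != NULL ? (size_t) (nul - str) : len;
}

/* Public entry point: same behavior, exported under a name without
   `_i_', so that code outside the library can rely on it.  */
DEMO_API size_t
demo_strnlen (const char *str, size_t len)
{
  return demo_i_strnlen (str, len);
}
```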

Thanks,
Ludo'.





Re: compiling with -DSCM_DEBUG=1

2009-09-09 Thread Ken Raeburn

On Sep 7, 2009, at 05:22, Ludovic Courtès wrote:
>> Non-pair accessed with SCM_C[AD]R: `ERROR: In procedure symbol->string:
>>
>> ERROR: Wrong type argument in position 1 (expecting symbol):
>> #
>
> Does that mean it’s this whole string that’s accessed with SCM_C[AD]R?

I'm not sure... it should be printing a value after the quote; I guess
it's encountering an error trying to print, as well.

>> I use a modified scm_error_pair_access() that prints the function's
>> name (as seen above)
>
> Hmm, I don’t see the function name, except ‘symbol->string’ above, but
> I’d expect it to be part of the string that’s accessed as a pair.

Sorry, I meant scm_error_pair_access() prints out
"scm_error_pair_access" to let me know it's been called.

> I cannot reproduce it here without SCM_DEBUG but with this simple patch
> instead:
>
> Any hints?

I would think that would do it, but I'm not seeing it either.  Still
looking...
(BTW, for SCM_DEBUG=1 I also had to comment out a debugging check
using SCM_GC_MARK_P in gc.c, since the macro doesn't exist any more.)


Ken



Some leftover bugs for this release

2009-09-09 Thread Mike Gran
Hi-

I guess according to the schedule there is another point release tomorrow.

Just a couple of notes.

As it stands, we know the netbsd amd64 build will fail for reasons
discussed in 
http://lists.gnu.org/archive/html/guile-devel/2009-08/msg00213.html

Also, the netbsd build will likely fail because there is a new
'condition is always true' warning at array-handle.c:103

100 SCM
101 scm_array_handle_element_type (scm_t_array_handle *h)
102 {
103   if (h->element_type < 0 || h->element_type > SCM_ARRAY_ELEMENT_TYPE_LAST)
104     abort (); /* guile programming error */
105   return scm_i_array_element_types[h->element_type];
106 }
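
One possible way to sidestep that kind of diagnostic, sketched under
assumptions: the thread does not say which half of the test triggers
it, so this guesses that the compiler treats the enum as unsigned and
objects to the `< 0' comparison.  The enum and function names are
hypothetical, not the ones in array-handle.c.

```c
#include <assert.h>

/* Hypothetical element-type enum standing in for the real one.  */
enum elt_type
{
  ELT_SCM = 0,
  ELT_U8,
  ELT_S8,
  ELT_LAST = ELT_S8
};

/* Range check with a single comparison: converting to an unsigned type
   turns any negative value into a very large one, so no separate
   `t < 0' test is needed.  */
int
elt_type_in_range (enum elt_type t)
{
  return (unsigned long) t <= (unsigned long) ELT_LAST;
}

/* Look up a printable name; an out-of-range type is treated as a
   programming error and trips the assertion.  */
const char *
elt_type_name (enum elt_type t)
{
  static const char *const names[] = { "scm", "u8", "s8" };

  assert (elt_type_in_range (t));
  return names[t];
}
```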

I'd fix it myself, but, I'm away from non-work keyboard.

Thanks,

Mike




‘boehm-demers-weiser-gc’ branch merged in ‘master’

2009-09-09 Thread Ludovic Courtès
Hello!

The ‘boehm-demers-weiser-gc’ branch has now been merged in ‘master’:

  
http://git.savannah.gnu.org/cgit/guile.git/commit/?id=6dc797eee9041498eec7053d32d8721c3660fb51

It means it’s time for testing, and time to cross fingers too!

I’d also appreciate feedback on the documentation of GC-related things,
which I updated in recent commits.

Thanks,
Ludo’.




Re: [BDW-GC] "Inlined" storage; `scm_take_' functions

2009-09-09 Thread Neil Jerram
l...@gnu.org (Ludovic Courtès) writes:

> It’s not such a shame IMO because:
>
>   * You have to allocate anyway, to store the (double) cell, and
> allocating the whole thing may be just as costly as allocating the
> cell, at least for small stringbufs/bytevectors.
>
>   * For stringbufs, the user-provided buffer can be reused only if it’s
> either Latin-1 or UCS-4, anyway.
>
>   * Removing the indirection and using only GC-managed memory is
> beneficial for Scheme code (which doesn’t use ‘scm_take’).
>
>   * Reusing the malloc(3)-allocated buffer means that we have to
> register a finalizer to later free(3) that buffer (see, e.g., commit
> d7e7a02a6251c8ed4f76933d9d30baeee3f599c0), which is costly (see, e.g.,
> http://www.hpl.hp.com/personal/Hans_Boehm/popl03/web/html/slide_7.html).

All good points.

> That said...
>
>> Did you consider the option of
>>
>> - always having an indirection from the stringbuf/bytevector object to
>> the underlying data
>
> ... this may be valuable (Andy pointed it out as well), at least for
> bytevectors.  The indirection is a requirement for Andy’s
> SRFI-4-on-bytevector patch set, so that ‘scm_take_u8vector ()’ can still
> be supported; it’s also required if we want to provide mmap(3) bindings,
> for instance, that return a bytevector.

OK, cool.  It was actually large bytevectors that I was mostly
thinking about, and IIUC it sounds quite likely that we will end up
keeping meaningful scm_take_... functions there.

> For stringbufs, though, I’m happy if we can leave the code as it is.

Yes, fine.  For stringbufs reallocating feels less painful, especially
given the encoding restriction.

Thanks!
Neil




Re: make check fails if no en_US.iso88591 locale

2009-09-09 Thread Neil Jerram
Mike Gran  writes:

> My bad.  Actually, I should have enclosed the 'with-locale' in the
> context of a 'pass-if', which would have caught the exception.

Yes, but at the cost of not running the tests...

>> I can allow make check to complete by changing that line to
>> 
>>   (false-if-exception (with-locale "en_US.iso88591"
>> 
>> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
>> actually important for the enclosed tests?
>
> It is important.  This is one of the problems with the whole Unicode
> effort.  There is no Unicode-capable regex library.  The regexp.test
> tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> to prep the string for dispatch to the libc regex calls and
> scm_from_locale_string to send them back.  
>
> If the current locale is C or ASCII, bytes above 127 will cause errors.
> If the current locale is UTF-8, bytes above 127 will be converted into
> multibyte sequences that won't be matched by the regular expression
> being tested.  To pass the test in regexp.test, we need to use the 
> encoding that matches all of the codepoints 0 to 255 to single byte
> characters, which is ISO-8859-1.
>
> So until a better regex comes along, wrapping regex in an
> 8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
> errors when encoding arbitrary 8-bit data like the test does.
>
> The reason why this problem is cropping up now and didn't occur before
> is because the old scm_to_locale_string was just a stub that passed
> 8-bit data through unmodified.

Thanks for explaining; I think I understand now.  So then Ludovic's
suggestion of with-latin1-locale should work, shouldn't it?
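
To check my understanding of the encoding point, here is a small C
sketch (hypothetical helper names, independent of Guile) of why Latin-1
keeps every value 0..255 in one byte while UTF-8 needs two bytes for
anything above 127:

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point CP (0..255 only) as UTF-8 into OUT, which must have
   room for 2 bytes.  Return the number of bytes written.  */
size_t
utf8_encode_low (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFF && out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;              /* ASCII: one byte */
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));  /* lead byte 110xxxxx */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F)); /* continuation 10xxxxxx */
  return 2;
}

/* In ISO-8859-1, every code point 0..255 is the byte of the same value.  */
size_t
latin1_encode (unsigned int cp, unsigned char *out)
{
  assert (cp <= 0xFF && out != NULL);
  out[0] = (unsigned char) cp;
  return 1;
}
```

So a pattern built from the single value 233 (é) is one byte under
Latin-1 but the two bytes 0xC3 0xA9 under UTF-8.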

> This regex library actually can be used with arbitrary Unicode data
> but it takes extra care.  UTF-8 can be used as the locale, and, then
> regular expression must be written keeping in mind that each non-ASCII
> character is really a multibyte string.

Can you give an example of what that ("keeping in mind...") means?  Is
it being careful with repetition counts (as in "[a-z]{3}"), for
example?

Thanks,
Neil




Re: make check fails if no en_US.iso88591 locale

2009-09-09 Thread Mike Gran
On Wed, 2009-09-09 at 22:53 +0100, Neil Jerram wrote:
> > It is important.  This is one of the problems with the whole Unicode
> > effort.  There is no Unicode-capable regex library.  The regexp.test
> > tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> > to prep the string for dispatch to the libc regex calls and
> > scm_from_locale_string to send them back.  

[...]

> Thanks for explaining; I think I understand now.  So then Ludovic's
> suggestion of with-latin1-locale should work, shouldn't it?

Yeah.  I went with that idea.

> 
> > This regex library actually can be used with arbitrary Unicode data
> > but it takes extra care.  UTF-8 can be used as the locale, and, then
> > regular expression must be written keeping in mind that each non-ASCII
> > character is really a multibyte string.
> 
> Can you give an example of what that ("keeping in mind...") means?  Is
> it being careful with repetition counts (as in "[a-z]{3}"), for
> example?

I'm not much of a regex guy, but here are a couple of examples.  First,
one that sort of works as expected.

guile> (string-match "sé" "José") 
==> #("José" (2 . 5))

Regex properly matches the word, but, the match struct (2 . 5) is
referring to the bytes of the string, not the characters of the string.
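
If character positions are wanted, the byte offsets can be converted
afterwards.  A minimal sketch in C, assuming the string is valid UTF-8
and the offset falls on a character boundary; `utf8_char_index' is a
hypothetical helper, not a Guile function.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the first BYTE_OFFSET bytes of the UTF-8
   string S, i.e. convert a byte offset into a character index.
   Continuation bytes (10xxxxxx) do not start a new character.  */
size_t
utf8_char_index (const char *s, size_t byte_offset)
{
  size_t i, chars = 0;

  assert (s != NULL);
  for (i = 0; i < byte_offset; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}
```

For "José" encoded as UTF-8, the byte range (2 . 5) corresponds to the
character range (2 . 4).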

Here's one that doesn't work as expected.

guile> (string-match "[:lower:]" "Hi, mom")
==> #("Hi, mom" (5 . 6))
guile> (string-match "[:lower:]" "Hí, móm")
==> #f

Once you add accents on the vowels, nothing matches.
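
A caveat on this second example, offered as a possible explanation
rather than something verified against Guile: in POSIX regex syntax a
character class has to sit inside a bracket expression, as in
"[[:lower:]]".  Written as "[:lower:]", the pattern is an ordinary
bracket expression matching any one of the characters `:', `l', `o',
`w', `e' and `r', which would account for the match landing on the "o"
at (5 . 6) and disappearing once "o" becomes "ó".  The sketch below
uses the C library's regcomp/regexec directly in the default "C" locale;
`first_match' is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile PATTERN as a POSIX extended regular expression, run it on TEXT
   and store the byte offsets of the first match in *START and *END.
   Return 1 on a match, 0 on no match, -1 if PATTERN does not compile.  */
int
first_match (const char *pattern, const char *text, int *start, int *end)
{
  regex_t re;
  regmatch_t m;
  int found;

  assert (pattern != NULL && text != NULL && start != NULL && end != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  found = regexec (&re, text, 1, &m, 0) == 0;
  if (found)
    {
      *start = (int) m.rm_so;
      *end = (int) m.rm_eo;
    }
  regfree (&re);
  return found;
}
```

With "[[:lower:]]" the first match in "Hi, mom" is the "i"; how the
accented string behaves with that pattern depends on the locale and the
regex implementation, so it is left out here.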

Thanks,

Mike