Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7
Hi, Mike Gran writes: > On Wed, 2009-09-09 at 01:00 +0200, Ludovic Courtès wrote: [...] >> > - return scm_getc (input_port); >> > + return scm_get_byte_or_eof (input_port); >> >> This is actually an earlier change, but the prototype of scm_getc is now >> different from that in 1.8. Presumably, this means that it’s not >> source-compatible with 1.8, e.g., on platforms where >> sizeof (int) < sizeof (scm_t_wchar), right? I was actually referring to the fact that 1.8 has: SCM_API int scm_getc (SCM port); whereas 1.9 has: SCM_API scm_t_wchar scm_getc (SCM port); What do you think? >> > --- a/libguile/strings.h >> > +++ b/libguile/strings.h >> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM start, >> > SCM end); >> > SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end); >> > SCM_API SCM scm_string_append (SCM args); >> > >> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, >> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, >> > const char *encoding, >> > >> > scm_t_string_failed_conversion_handler >> > handler); >> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar >> > *scm_i_string_wide_chars (SCM str); >> > SCM_INTERNAL SCM scm_i_string_start_writing (SCM str); >> > SCM_INTERNAL void scm_i_string_stop_writing (void); >> > SCM_INTERNAL int scm_i_is_narrow_string (SCM str); >> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x); >> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x); >> >> Were these changes intended? > > Well, one of the two of them was intended. :) Shouldn’t both of them remain internal given that they have an ‘_i_’ in their name? >> > + (with-locale "en_US.iso88591" >> > +(pass-if-exception "no args" exception:wrong-num-args >> > + (regexp-quote)) >> >> Is the locale part of the API? That is, should programs that use >> regexps explicitly ask for a locale with 8-bit encoding? > > Basically yes. 
The libc regex is 8-bit, and it uses > scm_to/from_locale_string to convert regex's input and output. That’s unfortunate but OTOH it’s the same as in 1.8, so I guess it’s OK. > Until libunistring comes with Unicode regex, I think this is the best we > can do. Yes, that would be neat! Thanks, Ludo’.
Re: [BDW-GC] "Inlined" storage; `scm_take_' functions
Hi Neil! Neil Jerram writes: > l...@gnu.org (Ludovic Courtès) writes: >> Stringbufs and bytevectors are now always "inlined" in the BDW-GC >> branch [0, 1], which means that there's no cell->buffer indirection, >> which greatly simplifies code (it also takes less room and may slightly >> improve performance). >> >> The `scm_take_' functions for strings/symbols/bytevectors are now >> essentially aliases to the corresponding `scm_from_' because we cannot >> advantageously reuse the provided storage. > > That seems a bit of a shame. (i.e. that we can't advantageously keep > the caller's string or vector data) It’s not such a shame IMO because: * You have to allocate anyway, to store the (double) cell, and allocating the whole thing may be just as costly as allocating the cell, at least for small stringbufs/bytevectors. * For stringbufs, the user-provided buffer can be reused only if it’s either Latin-1 or UCS-4, anyway. * Removing the indirection and using only GC-managed memory is beneficial for Scheme code (which doesn’t use ‘scm_take’). * Reusing the malloc(3)-allocated buffer means that we have to register a finalizer to later free(3) that buffer (see, e.g., commit d7e7a02a6251c8ed4f76933d9d30baeee3f599c0), which is costly (see, e.g., http://www.hpl.hp.com/personal/Hans_Boehm/popl03/web/html/slide_7.html). That said... > Did you consider the option of > > - always having an indirection from the stringbuf/bytevector object to > the underlying data ... this may be valuable (Andy pointed it out as well), at least for bytevectors. The indirection is a requirement for Andy’s SRFI-4-on-bytevector patch set, so that ‘scm_take_u8vector ()’ can still be supported; it’s also required if we want to provide mmap(3) bindings, for instance, that return a bytevector. For stringbufs, though, I’m happy if we can leave the code as it is. Thanks, Ludo’.
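To make the "inlined" layout discussed above concrete, here is a minimal sketch in C. It is not Guile's code: `struct inline_buf`, `buf_from_bytes` and `buf_take_bytes` are hypothetical names, and plain malloc(3) stands in for the collector's allocator. It only illustrates why a `take' function degenerates into a copy once the header and payload share one allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "inlined" byte buffer: header and payload live in one
   allocation, so there is no header->data pointer to follow.  */
struct inline_buf
{
  size_t len;
  unsigned char data[];         /* C99 flexible array member */
};

/* Copy LEN bytes from SRC into a freshly allocated inline buffer.
   malloc stands in for a GC allocator (an assumption of this sketch).
   Returns NULL if the allocation fails.  */
static struct inline_buf *
buf_from_bytes (const unsigned char *src, size_t len)
{
  struct inline_buf *b;

  assert (src != NULL || len == 0);
  b = malloc (sizeof *b + len);
  if (b == NULL)
    return NULL;
  b->len = len;
  if (len > 0)
    memcpy (b->data, src, len);
  return b;
}

/* "Take" variant: the caller hands over ownership of SRC, but because
   the payload is inlined we cannot keep SRC itself.  We copy it and
   then release the caller's buffer.  */
static struct inline_buf *
buf_take_bytes (unsigned char *src, size_t len)
{
  struct inline_buf *b = buf_from_bytes (src, len);
  free (src);
  return b;
}
```

With an indirection instead (a `data' pointer in the header), `buf_take_bytes` could store SRC directly, at the price of a finalizer to free it later, which is the trade-off weighed in the message above.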
Re: make check fails if no en_US.iso88591 locale
Hi, Neil Jerram writes: > because I don't have an en_US.iso88591 locale installed, and so > > (with-locale "en_US.iso88591" ...) > > throws an 'unresolved exception. I’d suggest using ‘with-latin1-locale’ as in ‘bytevectors.test’ to mitigate this problem. (Something akin to Gnulib’s ‘locale-*.m4’ could be a good starting point, too.) Thanks, Ludo’.
Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7
On Wed, 2009-09-09 at 09:42 +0200, Ludovic Courtès wrote: > Hi, > >> > - return scm_getc (input_port); > >> > + return scm_get_byte_or_eof (input_port); > >> > >> This is actually an earlier change, but the prototype of scm_getc is now > >> different from that in 1.8. Presumably, this means that it’s not > >> source-compatible with 1.8, e.g., on platforms where > >> sizeof (int) < sizeof (scm_t_wchar), right? > > I was actually referring to the fact that 1.8 has: > > SCM_API int scm_getc (SCM port); > > whereas 1.9 has: > > SCM_API scm_t_wchar scm_getc (SCM port); > > What do you think? Sorry, I misunderstood. It is, as you say, incompatible. scm_t_wchar is scm_t_int32, not int, so 16-bit int platforms and 64-bit int platforms would notice the change. I'm fairly sure Guile doesn't run in 16-bit int platforms, but, 64-bit platforms would notice the change. I'd like to leave it scm_t_wchar == scm_t_int32. Do you think that's a problem? > >> > --- a/libguile/strings.h > >> > +++ b/libguile/strings.h > >> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM > >> > start, SCM end); > >> > SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end); > >> > SCM_API SCM scm_string_append (SCM args); > >> > > >> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, > >> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, > >> > const char *encoding, > >> > > >> > scm_t_string_failed_conversion_handler > >> > handler); > >> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar > >> > *scm_i_string_wide_chars (SCM str); > >> > SCM_INTERNAL SCM scm_i_string_start_writing (SCM str); > >> > SCM_INTERNAL void scm_i_string_stop_writing (void); > >> > SCM_INTERNAL int scm_i_is_narrow_string (SCM str); > >> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x); > >> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x); > >> > >> Were these changes intended? > > > > Well, one of the two of them was intended. 
:) > > Shouldn’t both of them remain internal given that they have an ‘_i_’ in > their name? I seemed to need to make scm_i_from_stringn into SCM_API so that I could use it in libguilereadline. Pragmatically, it is now functioning as 'SCM_API scm_from_stringn'. The gray area is whether libguilereadline is philosophically 'internal' or 'external'. If libguilereadline is philosophically 'internal' it could keep the name scm_i_from_stringn, but, if that is just confusing, it should probably become scm_from_stringn. > > Until libunistring comes with Unicode regex, I think this is the best we > > can do. > > Yes, that would be neat! It is on their to-do list. They have header files preallocated for it. It's a big job, though. Thanks, Mike
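The `int' versus 32-bit return-type question above can be checked per platform with a few lines of C. This is only a sketch: `wchar32_t` is a hypothetical stand-in for scm_t_wchar (which the thread says is scm_t_int32), and the two helper names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for scm_t_wchar, described in the thread as
   scm_t_int32.  */
typedef int32_t wchar32_t;

/* Nonzero if every value of the 32-bit character type fits in a plain
   int, i.e. old callers that store the result in an int lose nothing.  */
static int
int_can_hold_wchar32 (void)
{
  return INT_MAX >= INT32_MAX && INT_MIN <= INT32_MIN;
}

/* Nonzero if int and the 32-bit character type have the same size, the
   case in which the prototype change is least likely to be noticed.  */
static int
int_same_size_as_wchar32 (void)
{
  return sizeof (int) == sizeof (wchar32_t);
}
```

On a platform where `sizeof (int) == 4`, both helpers return nonzero; a 16-bit or 64-bit int would make the second one return zero.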
Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7
Mike Gran writes: > On Wed, 2009-09-09 at 09:42 +0200, Ludovic Courtès wrote: >> I was actually referring to the fact that 1.8 has: >> >> SCM_API int scm_getc (SCM port); >> >> whereas 1.9 has: >> >> SCM_API scm_t_wchar scm_getc (SCM port); >> >> What do you think? > > Sorry, I misunderstood. It is, as you say, incompatible. > scm_t_wchar is scm_t_int32, not int, so 16-bit int platforms > and 64-bit int platforms would notice the change. I'm fairly > sure Guile doesn't run in 16-bit int platforms, but, 64-bit > platforms would notice the change. > > I'd like to leave it scm_t_wchar == scm_t_int32. Do you think that's a > problem? I checked on {powerpc64,sparc64,mips64el,ia64}-linux-gnu: * sizeof (int) == 4 on all of them; * sizeof (long) == 4 on all of them, except on ia64 where sizeof (long) == 8. So presumably we shouldn't worry? >> >> > --- a/libguile/strings.h >> >> > +++ b/libguile/strings.h >> >> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM >> >> > start, SCM end); >> >> > SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end); >> >> > SCM_API SCM scm_string_append (SCM args); >> >> > >> >> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, >> >> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, >> >> > const char *encoding, >> >> > >> >> > scm_t_string_failed_conversion_handler >> >> > handler); >> >> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar >> >> > *scm_i_string_wide_chars (SCM str); >> >> > SCM_INTERNAL SCM scm_i_string_start_writing (SCM str); >> >> > SCM_INTERNAL void scm_i_string_stop_writing (void); >> >> > SCM_INTERNAL int scm_i_is_narrow_string (SCM str); >> >> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x); >> >> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x); >> >> >> >> Were these changes intended? >> > >> > Well, one of the two of them was intended. :) >> >> Shouldn’t both of them remain internal given that they have an ‘_i_’ in >> their name? 
> > I seemed to need to make scm_i_from_stringn into SCM_API so that I could > use it in libguilereadline. Pragmatically, it is now functioning as > 'SCM_API scm_from_stringn'. Cool. > The gray area is if libguilereadline is philosophically 'internal' or > 'external'. If libguilereadline is philosophically 'internal' it > could keep the name scm_i_from_stringn, but, if that is just > confusing, it should probably become scm_from_stringn. It's external. If it needs something like `scm_from_stringn', then potentially other users will need it as well, so we should have a public API. Thanks, Ludo'.
Re: compiling with -DSCM_DEBUG=1
On Sep 7, 2009, at 05:22, Ludovic Courtès wrote: Non-pair accessed with SCM_C[AD]R: `ERROR: In procedure symbol->string: ERROR: Wrong type argument in position 1 (expecting symbol): # Does that mean it’s this whole string that’s accessed with SCM_C[AD]R? I'm not sure... it should be printing a value after the quote; I guess it's encountering an error trying to print, as well. I use a modified scm_error_pair_access() that prints the function's name (as seen above) Hmm, I don’t see the function name, except ‘symbol->string’ above, but I’d expect it to be part of the string that’s accessed as a pair. Sorry, I meant scm_error_pair_access() prints out "scm_error_pair_access" to let me know it's been called. I cannot reproduce it here without SCM_DEBUG but with this simple patch instead: Any hints? I would think that would do it, but I'm not seeing it either. Still looking... (BTW, for SCM_DEBUG=1 I also had to comment out a debugging check using SCM_GC_MARK_P in gc.c, since the macro doesn't exist any more.) Ken
Some leftover bugs for this release
Hi-

I guess according to the schedule there is another point release tomorrow. Just a couple of notes.

As it stands, we know the netbsd amd64 build will fail for reasons discussed in http://lists.gnu.org/archive/html/guile-devel/2009-08/msg00213.html

Also, the netbsd build will likely fail because there is a new 'condition is always true' condition at array-handle.c:103

100 SCM
101 scm_array_handle_element_type (scm_t_array_handle *h)
102 {
103   if (h->element_type < 0 || h->element_type > SCM_ARRAY_ELEMENT_TYPE_LAST)
104     abort (); /* guile programming error */
105   return scm_i_array_element_types[h->element_type];
106 }

I'd fix it myself, but, I'm away from my non-work keyboard.

Thanks, Mike
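One possible way to write such a range check so that neither half of it is trivially constant is sketched below. This is an assumption about the cause, not a confirmed diagnosis: it supposes the element type is an enum for which the compiler picks an unsigned representation, which makes `< 0` always false. `enum elem_type` and `elem_type_valid` are hypothetical names, not Guile's.

```c
#include <assert.h>

/* Hypothetical element-type enum standing in for the real one.  */
enum elem_type
{
  ELEM_SCM = 0,
  ELEM_U8,
  ELEM_F64,
  ELEM_LAST = ELEM_F64
};

/* Return 1 if T is a valid index into a table of ELEM_LAST + 1 entries.
   Converting to unsigned folds the "negative" and "too large" cases
   into a single comparison, so the test stays meaningful whether the
   compiler represents the enum as signed or unsigned.  */
static int
elem_type_valid (enum elem_type t)
{
  return (unsigned int) t <= (unsigned int) ELEM_LAST;
}
```

A caller would then do `if (!elem_type_valid (t)) abort ();` before indexing the table.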
‘boehm-demers-weiser-gc’ branch merged in ‘master’
Hello! The ‘boehm-demers-weiser-gc’ has now been merged in ‘master’: http://git.savannah.gnu.org/cgit/guile.git/commit/?id=6dc797eee9041498eec7053d32d8721c3660fb51 It means it’s time for testing, and time to cross fingers too! I’d also appreciate feedback on the documentation of GC-related things, which I updated in recent commits. Thanks, Ludo’.
Re: [BDW-GC] "Inlined" storage; `scm_take_' functions
l...@gnu.org (Ludovic Courtès) writes: > It’s not such a shame IMO because: > > * You have to allocate anyway, to store the (double) cell, and > allocating the whole thing may be just as costly as allocating the > cell, at least for small stringbufs/bytevectors. > > * For stringbufs, the user-provided buffer can be reused only if it’s > either Latin-1 or UCS-4, anyway. > > * Removing the indirection and using only GC-managed memory is > beneficial for Scheme code (which doesn’t use ‘scm_take’). > > * Reusing the malloc(3)-allocated buffer means that we have to > register a finalizer to later free(3) that buffer (see, e.g., commit > d7e7a02a6251c8ed4f76933d9d30baeee3f599c0), which is costly (see, e.g., > http://www.hpl.hp.com/personal/Hans_Boehm/popl03/web/html/slide_7.html). All good points. > That said... > >> Did you consider the option of >> >> - always having an indirection from the stringbuf/bytevector object to >> the underlying data > > ... this may be valuable (Andy pointed it out as well), at least for > bytevectors. The indirection is a requirement for Andy’s > SRFI-4-on-bytevector patch set, so that ‘scm_take_u8vector ()’ can still > be supported; it’s also required if we want to provide mmap(3) bindings, > for instance, that return a bytevector. OK, cool. It was actually large bytevectors that I was mostly thinking about, and IIUC it sounds quite likely that we will end up keeping meaningful scm_take_... functions there. > For stringbufs, though, I’m happy if we can leave the code as it is. Yes, fine. For stringbufs reallocating feels less painful, especially given the encoding restriction. Thanks! Neil
Re: make check fails if no en_US.iso88591 locale
Mike Gran writes: > My bad. Actually, I should have enclosed the 'with-locale' in the > context of a 'pass-if', which would have caught the exception. Yes, but at the cost of not running the tests... >> I can allow make check to complete by changing that line to >> >> (false-if-exception (with-locale "en_US.iso88591" >> >> but I doubt that's the best fix. Is the "en_US.iso88591" locale >> actually important for the enclosed tests? > > It is important. This is one of the problems with the whole Unicode > effort. There is no Unicode-capable regex library. The regexp.test > tries matching all bytes from 0 to 255, and it uses scm_to_locale_string > to prep the string for dispatch to the libc regex calls and > scm_from_locale_string to send them back. > > If the current locale is C or ASCII, bytes above 127 will cause errors. > If the current locale is UTF-8, bytes above 127 will be converted into > multibyte sequences that won't be matched by the regular expression > being tested. To pass the test in regexp.test, we need to use the > encoding that matches all of the codepoints 0 to 255 to single byte > characters, which is ISO-8859-1. > > So until a better regex comes along, wrapping regex in an > 8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding > errors when encoding arbitrary 8-bit data like the test does. > > The reason why this problem is cropping up now and didn't occur before > is because the old scm_to_locale_string was just a stub that passed > 8-bit data through unmodified. Thanks for explaining; I think I understand now. So then Ludovic's suggestion of with-latin1-locale should work, shouldn't it? > This regex library actually can be used with arbitrary Unicode data > but it takes extra care. UTF-8 can be used as the locale, and, then > regular expression must be written keeping in mind that each non-ASCII > character is really a multibyte string. Can you give an example of what that ("keeping in mind...") means? 
Is it being careful with repetition counts (as in "[a-z]{3}"), for example? Thanks, Neil
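The point quoted above, that bytes above 127 turn into multibyte sequences under a UTF-8 locale while Latin-1 keeps them as single bytes, can be seen in a small C sketch. `latin1_to_utf8` is a hypothetical helper written for this illustration; it is not part of Guile or libc.

```c
#include <assert.h>
#include <stddef.h>

/* Encode the Latin-1 byte C as UTF-8 into OUT, which must have room
   for 2 bytes.  Returns the number of bytes written: 1 for ASCII,
   2 for anything in the range 0x80..0xFF.  */
static size_t
latin1_to_utf8 (unsigned char c, unsigned char out[2])
{
  assert (out != NULL);
  if (c < 0x80)
    {
      out[0] = c;                                  /* ASCII: unchanged */
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (c >> 6));      /* lead byte 110xxxxx */
  out[1] = (unsigned char) (0x80 | (c & 0x3F));    /* continuation 10xxxxxx */
  return 2;
}
```

For example, the Latin-1 byte 0xE9 (é) becomes the two bytes 0xC3 0xA9, so a byte-oriented regex that expects one byte per character no longer sees what the pattern author intended.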
Re: make check fails if no en_US.iso88591 locale
On Wed, 2009-09-09 at 22:53 +0100, Neil Jerram wrote: > > It is important. This is one of the problems with the whole Unicode > > effort. There is no Unicode-capable regex library. The regexp.test > > tries matching all bytes from 0 to 255, and it uses scm_to_locale_string > > to prep the string for dispatch to the libc regex calls and > > scm_from_locale_string to send them back. [...] > Thanks for explaining; I think I understand now. So then Ludovic's > suggestion of with-latin1-locale should work, shouldn't it? Yeah. I went with that idea. > > > This regex library actually can be used with arbitrary Unicode data > > but it takes extra care. UTF-8 can be used as the locale, and, then > > regular expression must be written keeping in mind that each non-ASCII > > character is really a multibyte string. > > Can you give an example of what that ("keeping in mind...") means? Is > it being careful with repetition counts (as in "[a-z]{3}"), for > example? I'm not much of a regex guy, but, here's a couple of examples. First one that sort of works as expected. guile> (string-match "sé" "José") ==> #("José" (2 . 5)) Regex properly matches the word, but, the match struct (2 . 5) is referring to the bytes of the string, not the characters of the string. Here's one that doesn't work as expected. guile> (string-match "[:lower:]" "Hi, mom") ==> #("Hi, mom" (5 . 6)) guile> (string-match "[:lower:]" "Hí, móm") ==> #f Once you add accents on the vowels, nothing matches. Thanks, Mike
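Since the match offsets above are byte offsets into the UTF-8 encoding, a caller that wants character positions has to convert them. The sketch below shows one way to do that in C; `utf8_char_index` is a hypothetical helper, and it assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of characters that start before BYTE_OFFSET in the
   NUL-terminated UTF-8 string S.  Assumes S is valid UTF-8.  */
static size_t
utf8_char_index (const char *s, size_t byte_offset)
{
  size_t chars = 0;
  size_t i;

  assert (s != NULL);
  for (i = 0; i < byte_offset && s[i] != '\0'; i++)
    /* Continuation bytes have the form 10xxxxxx; every other byte
       starts a new character.  */
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      chars++;
  return chars;
}
```

Applied to the "José" example, the byte range (2 . 5) maps to the character range (2 . 4), because é occupies two bytes.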