Hi Taylan, taylanbayi...@gmail.com (Taylan Ulrich "Bayırlı/Kammer") writes:
> Andy Wingo <wi...@pobox.com> writes: > >> Adopting the behavior is more or less fine. If it can be done while >> relying on the existing behavior, that is better than something ad-hoc >> in a module. In general, I agree with Andy's sentiment that it would be better to avoid redundant BOM handling code, and moreover, I appreciate his reluctance to apply a fix without careful consideration of our existing BOM semantics. However, as Taylan discovered, Guile does not provide a mechanism to specify a default endianness of a UTF-16 or UTF-32 port in case a BOM is not found. I see no straightforward way to implement these R6RS interfaces using ports. We could certainly add such a mechanism if needed, but I see another problem with this approach: the expense of creating and later collecting a bytevector port object would be a very heavy burden to place on these otherwise fairly lightweight operations. Therefore, I would prefer to avoid that implementation strategy for these operations. Although BOM handling for ports is quite complex with many subtle points to consider, detecting a BOM at the beginning of a bytevector is so trivial that I personally have no objection to this tiny duplication of logic. Therefore, my preference would be to adopt code similar to that proposed by Taylan, although I believe it can, and should, be further simplified: > diff --git a/module/rnrs/bytevectors.scm b/module/rnrs/bytevectors.scm > index 9744359f0..997a8c9cb 100644 > --- a/module/rnrs/bytevectors.scm > +++ b/module/rnrs/bytevectors.scm > @@ -69,7 +69,9 @@ > bytevector-ieee-double-native-set! > > string->utf8 string->utf16 string->utf32 > - utf8->string utf16->string utf32->string)) > + utf8->string > + (r6rs-utf16->string . utf16->string) > + (r6rs-utf32->string . utf32->string))) > > > (load-extension (string-append "libguile-" (effective-version)) > @@ -80,4 +82,52 @@ > `(quote ,sym) > (error "unsupported endianness" sym))) > > +(define (read-bom16 bv) > + (let ((c0 (bytevector-u8-ref bv 0)) > + (c1 (bytevector-u8-ref bv 1))) > + (cond > + ((and (= c0 #xFE) (= c1 #xFF)) > + 'big) > + ((and (= c0 #xFF) (= c1 #xFE)) > + 'little) > + (else > + #f)))) We should gracefully handle the case of an empty bytevector, returning an empty string without error in that case. Also, we should use a single 'bytevector-u16-ref' operation to check for the BOM. Pick an arbitrary endianness for the operation (big-endian?), and compare the resulting integer with both #xFEFF and #xFFFE. That way, the code will be simpler and more efficient. Note that our VM has dedicated instructions for these multi-byte bytevector accessors, and there will be fewer comparison operations as well. Similarly for the utf32 case. What do you think? > +(define r6rs-utf16->string > + (case-lambda > + ((bv default-endianness) > + (let ((bom-endianness (read-bom16 bv))) > + (if (not bom-endianness) > + (utf16->string bv default-endianness) > + (substring/shared (utf16->string bv bom-endianness) 1)))) Better to use plain 'substring' here, I think. The machinery of shared substrings is more expensive, and unnecessary in this case. Otherwise, it looks good to me. Would you like to propose a revised patch? Andy, what do you think? Mark