Hi Ludovic, Ludovic Courtès <l...@gnu.org> writes:
> Mark H Weaver <m...@netris.org> skribis: > >> Ludovic Courtès <l...@gnu.org> writes: > > [...] > >>> So there are two things. To fix the issue you reported (build output >>> that goes through), I think we must simply turn off UTF-8 decoding from >>> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’. >> >> Can we assume that UTF-8 is the appropriate encoding for >> (current-build-output-port)? My interpretation of the Guix manual entry >> for 'current-build-output-port' suggests that the answer should be "no". > > What goes to ‘current-build-output-port’ comes from builds processes. > It’s usually UTF-8 but it can be anything, including binary garbage, > which should be gracefully handled. > > That’s why ‘process-stderr’ currently uses ‘read-maybe-utf8-string’. I agree that we should (permissively) interpret the build process output as UTF-8, regardless of locale settings. However, the encoding of 'current-build-output-port' is orthogonal, and I see no reason to assume that it's UTF-8. As 'process-stderr' is currently implemented, it makes no assumptions about the encoding of 'current-build-output-port'. That's because it uses only textual I/O on it. The end result is that the UTF-8 build output is effectively converted into the port encoding of 'current-build-output-port', whatever it might be. I think that's how it should be, no? >> Also, in your previous message you wrote: >> >> The problem is the first layer of UTF-8 decoding that happens in >> ‘process-stderr’, in the ‘%stderr-next’ case. We would need to >> disable it, but only if the build output port is >> ‘build-event-output-port’ (i.e., it’s capable of interpreting >> “multiplexed build output” correctly.) >> >> It sounds like you're suggesting that 'process-stderr' should look to >> see if (current-build-output-port) is a 'build-event-output-port', and >> in that case it should use binary I/O primitives to write raw binary >> data to it, otherwise it should use text I/O primitives and write >> characters to it. Do I understand correctly? > > Yes. (Actually, rather than guessing if (current-build-output-port) is > a ‘build-event-output-port’, there could be a fluid to ask for the use > of raw binary primitives.) > >> IMO, it would be cleaner to treat 'build-event-output-port' uniformly, >> and specifically as a textual port of unknown encoding. > > (You mean ‘current-build-output-port’, right?) Yes, indeed. > I think you’re right. I’m not yet entirely sure what the implications > are. There’s a couple of tests in tests/store.scm for UTF-8 > interpretation that describe behavior that I think we should preserve. I certainly agree that we should preserve those tests. I would go further and add two more tests that bind 'current-build-output-port' to a port with a non-UTF-8 encoding (e.g. UTF-16) and verify that the λ gets converted correctly. The test build process would output the λ as UTF-8, but it should be written to 'current-build-output-port' as e.g. UTF-16. What do you think? >> I would suggest changing 'build-event-output-port' to create an R6RS >> custom *textual* output port, so that it wouldn't have to worry about >> encodings at all, and it would only be given whole characters. >> Internally, it would be doing exactly what you suggest above, but those >> details would be encapsulated within the custom textual port. >> >> However, I don't think we can use Guile's current implementation of R6RS >> custom textual output ports, which are currently built on Guile's legacy >> soft ports, which I suspect have a similar bug with multibyte characters >> sometimes being split (see 'soft_port_write' in vports.c). >> >> Having said all of this, my suggestions would ultimately entail having >> two separate places along the stderr pipeline where 'utf8->string!' >> would be used, and maybe that's too much until we have a more optimized >> C implementation of it. > > Yeah it looks like we don’t yet have custom textual output ports that we > could rely on, do we? > > I support your work to add that in Guile proper! For now, I can offer a new implementation of custom textual output ports built upon custom binary ports and the 'utf8->string!' that I previously sent. See attached. Thanks, Mark --8<---------------cut here---------------start------------->8--- GNU Guile 2.2.4 Copyright (C) 1995-2017 Free Software Foundation, Inc. Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'. This program is free software, and you are welcome to redistribute it under certain conditions; type `,show c' for details. Enter `,help' for help. scheme@(guile-user)> (load "utf8-decoder.scm") scheme@(guile-user)> (load "guile-new-custom-textual-ports.scm") scheme@(guile-user)> (define (my-write! str start count) (pk 'my-write! (substring str start (+ start count))) count) scheme@(guile-user)> (define port (make-custom-textual-output-port "test1" my-write! #f #f #f)) scheme@(guile-user)> (display "Hello λ world!" port) scheme@(guile-user)> (force-output port) ;;; (my-write! "Hello λ world!") scheme@(guile-user)> (string->utf8 "λ") $2 = #vu8(206 187) scheme@(guile-user)> (string->utf8 "Hello λ world!") $3 = #vu8(72 101 108 108 111 32 206 187 32 119 111 114 108 100 33) scheme@(guile-user)> (put-bytevector port #vu8(72 101 108 108 111 32 206)) scheme@(guile-user)> (force-output port) ;;; (my-write! "Hello ") scheme@(guile-user)> (put-bytevector port #vu8(187 32 119 111 114 108 100 33)) scheme@(guile-user)> (force-output port) ;;; (my-write! "λ world!") scheme@(guile-user)> --8<---------------cut here---------------end--------------->8---
;;; Copyright © 2019 Mark H Weaver <m...@netris.org> ;;; ;;; This program is free software: you can redistribute it and/or modify ;;; it under the terms of the GNU General Public License as published by ;;; the Free Software Foundation, either version 3 of the License, or ;;; (at your option) any later version. ;;; ;;; This program is distributed in the hope that it will be useful, ;;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;;; GNU General Public License for more details. ;;; ;;; You should have received a copy of the GNU General Public License ;;; along with this program. If not, see <http://www.gnu.org/licenses/>. (use-modules (rnrs io ports)) (define (make-custom-textual-output-port id write! get-position set-position! close) (let (;; Allocate a per-port string buffer which will be used as a ;; temporary buffer for decoding, to avoid heap allocation ;; during normal operation. (buffer (make-string 4096)) ;; 'state' is the UTF-8 decoder state, which represents a ;; proper prefix of a well-formed UTF-8 byte sequence. These ;; are bytes that 'binary-write!' has accepted and reported as ;; having been written, although we are not able to decode ;; them into a character to pass to (textual) 'write!' until ;; more bytes arrive. (state 0)) (define (binary-write! bv start count) (call-with-values (lambda () ;; XXX FIXME: Consider performing this ;; decoding strictly. (utf8->string! state bv start (+ start count) buffer 0 (string-length buffer))) (lambda (new-state bv-pos char-count) (let* (;; Avoid calling write! with (char-count = 0) unless ;; (count = 0) was passed to us, because calling ;; 'write!' with count=0 has a special meaning: it ;; means to pass an EOF object to the byte/character ;; sink. (chars-accepted (if (and (zero? char-count) (not (zero? count))) 0 (write! buffer 0 char-count))) ;; Compute 'bytes-accepted' in such a way that the ;; bytes from STATE are not included, because they ;; were passed to us in previous calls, and are not ;; part of the bytevector range that we are now being ;; asked to write. However, it's important to note ;; that if 'write!' did not accept the bytes from ;; STATE, 'bytes-accepted' will be negative. We must ;; handle that case specially below. (bytes-accepted (- count (string-utf8-length (substring buffer chars-accepted char-count))))) ;; If 'bytes-accepted' is negative, that means the bytes ;; from STATE were not written. This can only happen if ;; 'chars-accepted' is 0, because 'write!' can only accept ;; whole code points, and the bytes from STATE are part of ;; at most a single code point. In this case, we must ;; leave STATE unchanged and return 0. (if (negative? bytes-accepted) 0 (begin (set! state new-state) bytes-accepted)))))) (define (binary-close) (set! buffer #f) (when close (close))) (define port (make-custom-binary-output-port id binary-write! get-position set-position! binary-close)) ;; Always use UTF-8 as the encoding for custom textual ports, as ;; an internal implementation detail, to ensure that all Unicode ;; characters will pass through regardless of the current locale. (set-port-encoding! port "UTF-8") port))