Hi Ludovic,

Ludovic Courtès <l...@gnu.org> writes:

> Mark H Weaver <m...@netris.org> skribis:
>
>> Ludovic Courtès <l...@gnu.org> writes:
>>
>>> The third read(2) call here ends on a partial UTF-8 sequence for LEFT
>>> SINGLE QUOTATION MARK (we get the first two bytes of a three byte
>>> sequence.)
>>>
>>> What happens is that ‘process-stderr’ in (guix store) gets that byte
>>> string from the daemon, passes it through ‘read-maybe-utf8-string’,
>>> which replaces the last two bytes with REPLACEMENT CHARACTER, which is
>>> itself a 3-byte sequence.
>>
>> It seems to me that what's needed here is to save the UTF-8 decoder
>> state between calls to 'process-stderr'.
>
> So there are two things.  To fix the issue you reported (build output
> that goes through), I think we must simply turn off UTF-8 decoding from
> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’.

Can we assume that UTF-8 is the appropriate encoding for
(current-build-output-port)?  My interpretation of the Guix manual entry
for 'current-build-output-port' suggests that the answer should be "no".

Also, in your previous message you wrote:

  The problem is the first layer of UTF-8 decoding that happens in
  ‘process-stderr’, in the ‘%stderr-next’ case.  We would need to
  disable it, but only if the build output port is
  ‘build-event-output-port’ (i.e., it’s capable of interpreting
  “multiplexed build output” correctly.)

It sounds like you're suggesting that 'process-stderr' should look to
see if (current-build-output-port) is a 'build-event-output-port', and
in that case it should use binary I/O primitives to write raw binary
data to it, otherwise it should use text I/O primitives and write
characters to it.  Do I understand correctly?

IMO, it would be cleaner to treat 'build-event-output-port' uniformly,
and specifically as a textual port of unknown encoding.  What do you
think?

> However, ‘build-event-output-port’ would still fail to properly decode
> split UTF-8 sequences, and for that we’d need to preserve decoder
> state as you describe.

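For illustration, the split-sequence failure and the effect of
preserving decoder state can be sketched outside Guile.  The Python
snippet below is only an analogy, not Guix code: the sample string and
the chunk boundary are made up, and Python's incremental decoder stands
in for a decoder whose state is saved between calls to 'process-stderr'.

```python
import codecs

# LEFT SINGLE QUOTATION MARK (U+2018) is three bytes in UTF-8: E2 80 98.
# The message text is a made-up example, not actual daemon output.
original = "build of \u2018foo\u2019 failed"
data = original.encode("utf-8")

# Hypothetical chunk boundary: the first chunk ends after the first two
# bytes of the three-byte sequence, as in the read(2) trace described above.
cut = data.index(b"\xe2") + 2
chunks = [data[:cut], data[cut:]]

# Stateless decoding: each chunk is decoded on its own, so the partial
# sequence at the end of the first chunk is replaced with U+FFFD.
stateless = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Stateful decoding: the incremental decoder buffers the incomplete
# sequence and finishes it when the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
stateful = "".join(decoder.decode(c) for c in chunks)
stateful += decoder.decode(b"", final=True)  # flush at end of stream

print(repr(stateless))
print(repr(stateful))
```

With the stateless approach the quotation mark is lost to REPLACEMENT
CHARACTER, whereas the stateful decoder reproduces the original string.
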
I would suggest changing 'build-event-output-port' to create an R6RS
custom *textual* output port, so that it wouldn't have to worry about
encodings at all, and it would only be given whole characters.
Internally, it would be doing exactly what you suggest above, but those
details would be encapsulated within the custom textual port.

However, I don't think we can use Guile's current implementation of R6RS
custom textual output ports, which are currently built on Guile's legacy
soft ports, which I suspect have a similar bug with multibyte characters
sometimes being split (see 'soft_port_write' in vports.c).

Having said all of this, my suggestions would ultimately entail having
two separate places along the stderr pipeline where 'utf8->string!'
would be used, and maybe that's too much until we have a more optimized
C implementation of it.

Thoughts?

      Mark