On Fri, Jan 21, 2022 at 11:38:56PM +0100, Steinar H. Gunderson wrote:
> On Fri, Jan 21, 2022 at 09:48:06PM +0000, Colin Watson wrote:
> > So the current behaviour isn't a bug as such, but there's definitely
> > room for optimization here: when operating in-process, and in the common
> > case where the target encoding is UTF-8, the UTF-8 to UTF-8 trial
> > decoding path could be changed to just do a read-only "is this UTF-8"
> > test rather than effectively copying everything to a new buffer via
> > iconv.  I don't know how much faster that would be, though it seems
> > likely to be an improvement.
>
> Technically, UTF-8 validation can be done at a few gigabytes per second
> per core:
>
>   https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
>
> but that is probably overkill. :-)

Quite :-)

> > I'll see if I can make time for this, though I think a reasonable
> > priority for me is to finish working on your existing MR comments first
> > and get this ready to land.
>
> Sure, I agree this is a good prioritization. I saw a new patch set landed,
> but I'm not sure if you wanted me to look at it again yet? (Fundamentally,
> though, almost everything I have is style nits; if the patch went into man-db
> as-is, I would still be happy about it.)

Not yet - that was just trivial rebasing after I found and fixed a few
unrelated things I'd broken on main and wanted to get them into this
tree to simplify my own testing.  I have a larger pile of rearrangements
in progress, but I'll post replies on the MR when they're ready.

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]
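
For illustration only, the read-only "is this UTF-8" test discussed above could look roughly like the following plain scalar check. This is a minimal sketch, not code from man-db or from the linked article: the function name `is_valid_utf8` and its signature are hypothetical, and it makes no attempt at the SIMD techniques that reach gigabytes per second. It only reads the buffer, so nothing is copied the way a trial conversion through iconv would.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a man-db function): returns true if the
 * len bytes at buf are well-formed UTF-8.  The buffer is only read,
 * never copied or modified. */
static bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp;       /* decoded code point */
        unsigned long min;      /* smallest value allowed at this length */

        if (c < 0x80) {         /* ASCII fast path */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return false;       /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;       /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;       /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;       /* UTF-16 surrogate, not valid in UTF-8 */

        i += need + 1;
    }
    return true;
}
```

A caller would run this over the input first and fall back to the existing iconv-based path only when it returns false; how that would fit into the actual decoding code is not something this sketch tries to show.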