Control: tag -1 fixed-upstream

On Sun, Jan 23, 2022 at 06:11:19PM +0100, Steinar H. Gunderson wrote:
> On Sat, Jan 22, 2022 at 12:41:56AM +0000, Colin Watson wrote:
> >> Technically, UTF-8 validation can be done at a few gigabytes per second
> >> per core:
> >>
> >>   https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
> >>
> >> but that is probably overkill. :-)
> > Quite :-)
>
> It struck me that it can probably be folded for free into the lexer.
> If you add symbols for all invalid UTF-8 sequences, I believe it should just
> go into the state machine. But I'm fine with those 20%; the perfect need not
> be the enemy of the good here.

Mm.  I'd somewhat prefer not to put it in the lexer though, because in
general the next stage after encoding conversion can be something other
than the lexer, and I don't want to store up too much confusion for my
future self.

I grabbed glib's UTF-8 validator on the basis that it was a simple,
portable, and compatibly-licensed one that I could verify by eye and
dropped it in, replacing the trial conversion pass if source_encoding !=
UTF-8 and target_encoding == UTF-8.  This saves about 8% on my test
system on top of the previous optimizations (10.589s → 9.791s, median of
nine runs), so it might be possible to do better with a faster
validator, but this seems likely to be good enough and we're probably
approaching diminishing returns.

> In general, I don't think I need to look at it again now, unless there are
> any special questions. Thanks for taking care of this! Looking forward to
> bookworm being faster (and of course sid before that), and then I'll happily
> live with this on bullseye, knowing that it's transient.

Thanks a lot for the initial prod and the review comments - they've
definitely improved things.  I've gone ahead and merged all this.  I'll
need to do a call for translations before releasing since there are some
other changes that will necessitate that, but I expect to produce a new
upstream release in a couple of weeks.

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]
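
For anyone reading along who wants to see the general shape of the idea
(checking the bytes directly rather than running a throwaway conversion
over them), here is a rough sketch in Python.  It is not glib's code and
not what was merged; the function name is_valid_utf8 is made up for
illustration, and the byte ranges are the standard well-formed UTF-8
ranges (no overlong forms, no surrogates, nothing above U+10FFFF).

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    Illustrative sketch only (hypothetical helper, not glib's validator).
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range
        # for the FIRST continuation byte, based on the lead byte.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes UTF-16 surrogates
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes code points > U+10FFFF
        else:
            # 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
            return False
        if i + need > n - 1:
            # Sequence is truncated at the end of the buffer.
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        # Any remaining continuation bytes must be in 0x80..0xBF.
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

The point of a check like this is that it only reads the input and never
builds an output buffer, which is where the saving over a trial
conversion would come from.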