On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchamp...@vmware.com> wrote: > > On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote: > > On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchamp...@vmware.com> wrote: > > > I guess it just depends on what the end result looks/performs like. > > > We'd save seven hops or so in the worst case? > > > > Something like that. Attached is what I had in mind (using real > > patches to see what the CF bot thinks): > > > > 0001 is a simple renaming > > 0002 puts the char width inside the mbinterval so we can put arbitrary values there > > 0002 introduces a mixed declarations/statements warning for > ucs_wcwidth(). Other than that, LGTM overall.
I didn't see that warning with clang 12, either with or without assertions, but I do see it on gcc 11. Fixed, and pushed 0001 and 0002. I decided against squashing them, since my original instinct was correct -- the header changes too much for git to consider it the same file, which may make archeology harder. > > --- a/src/common/wchar.c > > +++ b/src/common/wchar.c > > @@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s) > > > > struct mbinterval > > { > > - unsigned short first; > > - unsigned short last; > > - signed short width; > > + unsigned int first; > > + unsigned int last:21; > > + signed int width:4; > > }; > > Oh, right -- my patch moved mbinterval from short to int, but should I > have used uint32 instead? It would only matter in theory for the > `first` member now that the bitfields are there. I'm not sure it would matter, but the usual type for codepoints is unsigned. > > I think the adjustments to 0003 result in a cleaner and more > > extensible design, but a case could be made otherwise. The former > > combining table script is a bit more complex than the sum of its > > former self and Jacob's proposed new script, but just slightly. > > The microbenchmark says it's also more performant, so +1 to your > version. > > Does there need to be any sanity check for overlapping ranges between > the combining and fullwidth sets? The Unicode data on a dev's machine > would have to be broken somehow for that to happen, but it could > potentially go undetected for a while if it did. Thanks for testing again! The sanity check sounds like a good idea, so I'll work on that and push soon. -- John Naylor EDB: http://www.enterprisedb.com