On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote: > On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchamp...@vmware.com> wrote: > > I guess it just depends on what the end result looks/performs like. > > We'd save seven hops or so in the worst case? > > Something like that. Attached is what I had in mind (using real > patches to see what the CF bot thinks): > > 0001 is a simple renaming > 0002 puts the char width inside the mbinterval so we can put arbitrary values > there
0002 introduces a mixed declarations/statements warning for ucs_wcwidth(). Other than that, LGTM overall. > --- a/src/common/wchar.c > +++ b/src/common/wchar.c > @@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s) > > struct mbinterval > { > - unsigned short first; > - unsigned short last; > - signed short width; > + unsigned int first; > + unsigned int last:21; > + signed int width:4; > }; Oh, right -- my patch moved mbinterval from short to int, but should I have used uint32 instead? It would only matter in theory for the `first` member now that the bitfields are there. > I think the adjustments to 0003 result in a cleaner and more > extensible design, but a case could be made otherwise. The former > combining table script is a bit more complex than the sum of its > former self and Jacob's proposed new script, but just slightly. The microbenchmark says it's also more performant, so +1 to your version. Does there need to be any sanity check for overlapping ranges between the combining and fullwidth sets? The Unicode data on a dev's machine would have to be broken somehow for that to happen, but it could potentially go undetected for a while if it did. > Also, I checked the behavior of this comment that I proposed to remove > upthread: > > - * - Other format characters (general category code Cf in the Unicode > - * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0. > > We don't handle the latter in our current setup: > > SELECT U&'foo\200Bbar'; > +----------+ > | ?column? | > +----------+ > | foobar | > +----------+ > (1 row) > > Not sure if we should do anything about this. It was an explicit > exception years ago in our vendored manual table, but is not labeled > as such in the official Unicode files. I'm wary of changing too many things at once, but it does seem like we should be giving that codepoint a width of 0. On Tue, 2021-08-24 at 12:05 -0400, John Naylor wrote: > I plan to commit my proposed v2 this week unless I hear reservations > about my adjustments, or bikeshedding. I'm thinking of squashing 0001 > and 0002. +1 Thanks! --Jacob