On 1/31/17 16:59, Thomas Munro wrote: > I assume (but haven't checked) that ucol_nextSortKeyPart accesses only > the start of the string via the UCharIterator passed in, unless you > have the rare reverse-accent-sort feature enabled (maybe used only in > fr_CA, it looks like it is required to scan the whole string looking > for the last accent). So I assume that uiter_setUTF8 and > ucol_nextSortKeyPart would usually do a small fixed amount of work, > whereas this patch's icu_to_uchar allocates space and converts the > whole variable length string every time.
I have changed it to use ucol_nextSortKeyPart() when appropriate. > That's about abbreviation, but I note that you can also compare > strings using iterators with ucol_strcollIter, avoiding the need to > allocate and transcode up front. I have no idea whether that'd pay > off. These iterators don't actually transcode on the fly. You can only set up iterators for UTF-8 or UTF-16 strings. And for UTF-8, we already have a special code path using ucol_strcollUTF8(), which uses an interator internally. > Clearly when you upgrade your system from (say) Debian 8 to Debian 9 > and the ICU major version changes we expect to have to REINDEX, but > does anyone know if such data changes are ever pulled into the minor > version package upgrades you get from regular apt-get update of (say) > a Debian 8 or CentOS 7 or FreeBSD 11 system? In other words, do we > expect collversion changes to happen basically any time in response to > regular system updates, or only when you're doing a major upgrade of > some kind, as the above-quoted documentation patch implies? Stable operating systems shouldn't do major library upgrades, so I feel pretty confident about this. > > + collversion = ntohl(*((uint32 *) versioninfo)); > > UVersionInfo is an array of four uint8_t. That doesn't sound like > something that needs to be bit-swizzled... is it really? Even if it > is arranged differently on different architectures, I'm not sure why > you care since we only ever use it to compare for equality on the same > system. But aside from that, I don't love this cast to uint32. I > wonder if we should use u_versionToString to respect boundaries a bit > better? Makes sense, changed the column type to text. > I have another motivation for wanting to model collation versions as > strings: I have been contemplating a version check for system-provided > collations too, piggy-backing on your work here. Obviously PostgreSQL > can't directly know anything about system collation versions, but I'm > thinking of a GUC system_collation_version_command which you could > optionally set to a script that knows how to inspect the local > operating system. For example, a package maintainer might be > interested in writing such a script that knows how to ask the package > system for a locale data version. A brute force approach that works > on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'. That sounds like an idea worth pursuing. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers