I've decided I'm not quite comfortable with the additional complexity in the build system introduced by the SIMD portion of the previous patches. That complexity would be easier to justify if the pure C portion were unchanged, but with the shift-based DFA plus the bitwise ASCII check, we have a portable implementation that's still a substantial improvement over the current validator. In v24, I've included only that much, and the diff is only about 1/3 as many lines. If future improvements to COPY FROM put additional pressure on this path, we can always add SIMD support later.
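For readers who haven't followed the thread, here is a minimal sketch of the idea behind a bitwise ASCII check. It is not the code from the patch: the function name is_ascii_chunked() is hypothetical, and it only illustrates the technique of OR-ing bytes together a word at a time and testing the high bits once, with no SIMD intrinsics.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only (not the function in the patch).
 * Returns true if none of the len bytes at s has its high bit set.
 */
static bool
is_ascii_chunked(const unsigned char *s, size_t len)
{
	uint64_t	highbits = 0;

	/* Accumulate eight bytes at a time; memcpy avoids unaligned loads. */
	while (len >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		memcpy(&chunk, s, sizeof(chunk));
		highbits |= chunk;
		s += sizeof(chunk);
		len -= sizeof(chunk);
	}

	/* Fold any remaining tail bytes into the low byte of the accumulator. */
	while (len > 0)
	{
		highbits |= *s;
		s++;
		len--;
	}

	/* A set high bit in any byte lane means a non-ASCII byte was seen. */
	return (highbits & UINT64_C(0x8080808080808080)) == 0;
}
```

A real validator has more to do than this sketch: it also has to reject embedded zero bytes, and when the ASCII check fails it has to hand the input to the full UTF-8 state machine rather than report an error.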
One thing not in this patch is a possible improvement to pg_utf8_verifychar() that Heikki and I worked on upthread as part of earlier attempts to rewrite pg_utf8_verifystr(). That's worth looking into separately.

On Thu, Aug 26, 2021 at 12:09 PM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> >Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments
>
> 32bit table is nice.

Thanks for taking a look!

> Would you please replace https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL with
> https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
> in the header of src/port/pg_utf8_fallback.c?
>
> It would make the URL more stable in case the file gets renamed.
>
> Vladimir
>

Makes sense, so done that way.

--
John Naylor
EDB: http://www.enterprisedb.com
v24-0001-Add-fast-path-for-validating-UTF-8-text.patch