On Fri, Nov 17, 2023 at 2:26 AM John Naylor <johncnaylo...@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 5:54 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
> >
> > It looks like is_valid_ascii() was originally added to pg_wchar.h so that
> > it could easily be used elsewhere [0] [1], but that doesn't seem to have
> > happened yet.
> >
> > Would moving this definition to a separate header file be a viable option?
>
> Seems fine to me. (I believe the original motivation for making it an
> inline function was for in pg_mbstrlen_with_len(), but trying that
> hasn't been a priority.)

In that case, I took a look across the codebase and saw a utils/ascii.h that doesn't seem to have gotten much love, but I suppose one could argue that it's intended to be a backend-only header file?

As the codebase is growing some enhanced UTF-8 support, you'll want somewhere that contains the optimized US-ASCII routines, because, as US-ASCII is a subset of UTF-8, and often faster to handle, it's typical for such codepaths to look like

```c
while (i < len && no_multibyte_chars) {
    i = i + ascii_op_version(i, buffer, &no_multibyte_chars);
}

while (i < len) {
    i = i + utf8_op_version(i, buffer);
}
```

So it should probably end up living somewhere near the UTF-8 support, and the easiest way to make it not go into something pgrx currently includes would be to make it a new header file, though there's a fair amount of API we don't touch.

From the pgrx / Rust perspective, Postgres function calls are passed via callback to a "guard function" that guarantees that longjmp and setjmp don't cause trouble (and makes sure we participate in that). So we only want to call Postgres functions if we "can't replace" them, as the overhead is quite a lot. That means UTF-8-per-se functions aren't very interesting to us as the Rust language already supports it, but we do benefit from access to transcoding to/from UTF-8.

—Jubilee
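
P.S. To make the two-phase loop shape concrete, here is a minimal standalone sketch in C. It is only an illustration, not Postgres code: ascii_prefix_len() and count_codepoints() are hypothetical names, and the second phase assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a Postgres function): return how many leading
 * bytes of s[0..len) are US-ASCII.  Eight bytes are tested at a time by
 * checking whether any byte in the chunk has its high bit set.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    while (len - i >= sizeof(uint64_t))
    {
        uint64_t    chunk;

        memcpy(&chunk, s + i, sizeof(chunk));   /* safe unaligned load */
        if (chunk & UINT64_C(0x8080808080808080))
            break;              /* a non-ASCII byte is somewhere in here */
        i += sizeof(chunk);
    }

    /* Finish byte by byte: the tail, or the chunk that failed the test. */
    while (i < len && s[i] < 0x80)
        i++;

    return i;
}

/*
 * Hypothetical example operation: count code points in a buffer that is
 * assumed to be valid UTF-8.  Phase one skips the ASCII prefix cheaply;
 * phase two handles the rest by counting every non-continuation byte.
 */
static size_t
count_codepoints(const unsigned char *s, size_t len)
{
    size_t      i = ascii_prefix_len(s, len);
    size_t      n = i;          /* each ASCII byte is one code point */

    for (; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)  /* not a 10xxxxxx continuation byte */
            n++;
    }

    return n;
}
```

Unlike the loop above, this sketch never switches back to the fast path after the first multibyte character; a real implementation might want to.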