Re: Optimize UUID parse using SIMD

Haibo Yan Mon, 29 Jun 2026 17:53:39 -0700

On Mon, Jun 29, 2026 at 2:55 PM Masahiko Sawada <[email protected]> wrote:
>
> On Sun, Jun 28, 2026 at 7:20 PM Haibo Yan <[email protected]> wrote:
> >
> > On Thu, Jun 25, 2026 at 3:16 PM Masahiko Sawada <[email protected]> wrote:
> > >
> > > On Thu, Jun 25, 2026 at 2:31 PM Haibo Yan <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Thu, Jun 25, 2026 at 11:28 AM Masahiko Sawada <[email protected]> wrote:
> > > >>
> > > >> Hi all,
> > > >>
> > > >> I'd like to propose the $subject.
> > > >>
> > > >> Since commit ec8719ccbfcd made hex_decode_safe() SIMD-aware, decoding
> > > >> a run of hex digits is now fast. The attached patch reuses
> > > >> hex_decode_safe() in the UUID input function to speed up parsing.
> > > >>
> > > >> We accept several textual forms of a UUID[1]. The fast path handles
> > > >> the common ones: 32 hex digits, the canonical 8x-4x-4x-4x-12x form
> > > >> (where "nx" means n hex digits), and either of those wrapped in
> > > >> braces. Otherwise, it falls back to the ordinary scalar UUID parse.
> > > >>
> > > >> I've benchmarked the parse speed using the following query:
> > > >>
> > > >> CREATE TEMP TABLE u AS SELECT gen_random_uuid()::text AS t FROM
> > > >> generate_series(1, 1000000);
> > > >> EXPLAIN (ANALYZE, TIMING OFF) SELECT t::uuid FROM u;
> > > >>
> > > >> I compared the execution time of the second query, which measures
> > > >> uuid_in() alone, with/without SIMD optimization. Here are results (the
> > > >> median of 5 runs):
> > > >>
> > > >> HEAD: 208.879 ms
> > > >> Patched: 40.983 ms
> > > >>
> > > >> The improvements look promising to me. But in a realistic pipeline the
> > > >> parse is a small fraction of the work, so end-to-end gains could be
> > > >> much smaller.
> > > >>
> > > >> Feedback is very welcome.
> > > >>
> > > > > I may be missing something, but I wonder whether the fast path is
> > > > > relying on slightly different input semantics from the existing UUID
> > > > > parser.
> > > > >
> > > > > In particular, hex_decode_safe() is not a strict “32 hex characters
> > > > > only” decoder.  It skips whitespace, which is fine for its existing
> > > > > callers, but I don’t think UUID input should treat whitespace inside
> > > > > the UUID body as ignorable.
> > >
> > > Good catch! hex_decode_safe() skips whitespaces so the patch accepts
> > > the following UUID value, which is bad:
> > >
> > > select '019f00b5-7f8a-722f-b707-59f0ed25cd  '::uuid;
> > >                  uuid
> > > --------------------------------------
> > >  019f00b5-7f8a-722f-b707-59f0ed25cd00
> > > (1 row)
> > >
> > > > Also, since hex_decode_safe() returns void, the UUID fast path
> > > > cannot verify that exactly UUID_LEN bytes were produced.
> > >
> > > IIUC hex_decode_safe() does return the output length in bytes. So I
> > > think we can fallback to the scalar UUID parser if
> > > esctx.error_occurred is true or if the returned value is not 16.
> > >
> >
> > You’re right, I misread that part.  Checking both esctx.error_occurred and
> > the returned length sounds good to me.
> >
> > > >
> > > > > So I think it would be safer either to pre-validate that the 32 source
> > > > > characters are all hex digits before calling hex_decode_safe(), or to
> > > > > use a UUID-specific strict hex decoder for this path.  After that, a
> > > > > comment explaining why hex_decode_safe() is safe here would make the
> > > > > invariant much clearer.
> > >
> > > IIUC hex_decode_simd_helper() accepts only hex digits so we could
> > > re-use it for UUID parsing. Let me check if the above idea of using
> > > the return value works for us first.
> > >
> >
> > > That sounds reasonable.  My main concern was to keep the fast path’s
> > > accepted input set identical to the scalar UUID parser.  Falling back
> > > when the decoded length is not UUID_LEN, together with regression tests
> > > for whitespace cases, should address that.
> >
> > > >
> > > > > Could you also add a few regression tests for invalid inputs that
> > > > > contain whitespace inside otherwise fast-path-looking UUID strings?
> > > > > For example:
> > > >
> > > > ---------------------------------------------------------------
> > > >
> > > > SELECT 'a0eebc99 9c0b4ef8bb6d6bb9bd380a11'::uuid;
> > > > SELECT 'a0eebc999c0b4ef8bb6d6bb9bd380a1 '::uuid;
> > > > SELECT '{a0eebc999c0b4ef8bb6d6bb9bd380a1 }'::uuid;
> > > > SELECT 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a1 '::uuid;
> > > > ---------------------------------------------------------------
> > > >
> > > > > These should continue to be rejected in the same way as the scalar
> > > > > parser.
> > > > Regards,
> > >
> > > Agreed.
> > >
>
> I've attached the updated patch.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
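
For reference, the strictness requirement discussed upthread (no whitespace
accepted inside the UUID body, exactly UUID_LEN bytes produced) can be
illustrated with a small standalone sketch. This is only an illustration under
my own assumptions, not the code in the attached patch: hex_nibble() and
uuid_decode_strict32() are hypothetical names, and the sketch covers only the
plain 32-hex-digit form, without hyphens or braces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UUID_LEN 16

/* Map one ASCII hex digit to its value, or return -1 if it is not one. */
static int
hex_nibble(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical strict decoder (illustration only): succeeds only when the
 * input is exactly 32 hex digits.  Whitespace or any other non-hex byte makes
 * it fail, so a caller could fall back to the scalar parser in that case.
 */
static bool
uuid_decode_strict32(const char *src, size_t len, uint8_t out[UUID_LEN])
{
	if (len != 2 * UUID_LEN)
		return false;

	for (int i = 0; i < UUID_LEN; i++)
	{
		int			hi = hex_nibble((unsigned char) src[2 * i]);
		int			lo = hex_nibble((unsigned char) src[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return false;		/* reject instead of skipping the byte */
		out[i] = (uint8_t) ((hi << 4) | lo);
	}
	return true;
}
```

With this shape, an input such as 'a0eebc999c0b4ef8bb6d6bb9bd380a1 ' is
rejected rather than decoded to a shorter byte string, which is the behavior
the regression tests above are meant to pin down.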


I noticed a few typos in the comments:

src/backend/utils/adt/uuid.c
line 56: “scalar implmentation” -> “scalar implementation”
line 109: “swalled” -> “swallowed”
line 110: “kepping” -> “keeping”
line 118: “grammer” -> “grammar”
line 119: “whitespaces” -> “whitespace”

Could you fix them?
Thank you.

Regards,
Haibo
