https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121908

            Bug ID: 121908
           Summary: Hot loop in xz not vectorized
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: jeffreyalaw at gmail dot com, rguenth at gcc dot gnu.org,
                    tamar.christina at arm dot com
  Target Milestone: ---

The following is a simplified version of the hot loop in bt_skip_func from
557.xz.  It could be vectorized but currently is not.  It's a search loop that
looks for (dis)similarities in an array and, depending on the input, we're
seeing double-digit improvements when it is vectorized.

#define uint8_t unsigned char
#define uint32_t unsigned int

int foo (const uint8_t *const cur, uint32_t n)
{
  uint32_t i = 15;

  while (i++ != n)
    if (cur[i] != cur[i - 15])
      break;

  return i;
}

We give up analyzing the data references (DRs) because n may be < 15:

Creating dr for *_2
analyze_innermost: bla2.c:15:12: missed:  failed: evolution of base is not
affine.
        base_address: 
        offset from base address: 
        constant offset from base address: 
        step: 
        base alignment: 0
        base misalignment: 0
        offset alignment: 0
        step alignment: 0
        base_object: *_2
Creating dr for *_6
analyze_innermost: bla2.c:15:22: missed:  failed: evolution of base is not
affine.
        base_address: 
        offset from base address: 
        constant offset from base address: 
        step: 
        base alignment: 0
        base misalignment: 0
        offset alignment: 0
        step alignment: 0
        base_object: *_6
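
To make the n < 15 case concrete, here is a small sketch of how many times the
loop body of the first example can run if the break is never taken.  The helper
name max_trip_count is hypothetical and not part of the testcase; the reading
that the wrapping counter is what prevents an affine evolution is my
interpretation of the dump above, not something the dump states explicitly.

```c
#include <stdint.h>

/* Hypothetical helper: upper bound on the number of body iterations of
     uint32_t i = 15;
     while (i++ != n) ...
   when the early break never triggers.  The subtraction is done in
   unsigned 32-bit arithmetic, so it wraps for n < 15.  */
static uint32_t
max_trip_count (uint32_t n)
{
  return n - 15u;
}
```

For n = 20 this gives 5 iterations (i = 16 .. 20).  For n = 14 it gives
0xffffffff, i.e. i has to wrap around through zero before it can equal n, so
the index cannot be treated as a simple linear function of the iteration
number without first excluding that case.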

A more complex example, closer to the real loop, is:

#define uint32_t unsigned int
#define uint8_t unsigned char

int foo (const uint8_t *const cur, uint32_t len, uint32_t len_limit,
         uint32_t pos, uint32_t cur_match)
{
  const uint32_t delta = pos - cur_match;
  const uint8_t *pb = cur - delta;

  while (++len != len_limit)
    if (pb[len] != cur[len])
      break;

  return len;
}

My idea was to "version"/partition the loop (or the vectorization) on whether
len > len_limit or len <= len_limit.
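
A hand-written sketch of what such versioning could look like at the source
level follows.  This is only an illustration of the idea, not what the
vectorizer would emit.  The function names match_len_orig and
match_len_versioned are hypothetical, and I assume the split is placed at
len < len_limit, since len == len_limit also makes the counter wrap before it
can hit the limit.

```c
#include <stdint.h>

/* Loop shape from the report.  If len >= len_limit on entry, the counter
   has to wrap around before it can equal len_limit.  */
static uint32_t
match_len_orig (const uint8_t *pb, const uint8_t *cur,
                uint32_t len, uint32_t len_limit)
{
  while (++len != len_limit)
    if (pb[len] != cur[len])
      break;

  return len;
}

/* Hypothetical versioned variant: the guarded path has a counter that
   cannot wrap and a known iteration bound (len_limit - len - 1), the
   other path keeps the original scalar loop.  */
static uint32_t
match_len_versioned (const uint8_t *pb, const uint8_t *cur,
                     uint32_t len, uint32_t len_limit)
{
  if (len < len_limit)
    {
      /* No wrap-around possible here: i only takes values in
         [len + 1, len_limit).  */
      for (uint32_t i = len + 1; i < len_limit; i++)
        if (pb[i] != cur[i])
          return i;

      return len_limit;
    }

  /* Wrapping case: fall back to the original loop.  */
  return match_len_orig (pb, cur, len, len_limit);
}
```

For inputs with len < len_limit both functions return the same value; only the
guarded path would be a candidate for early-break vectorization.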
