This is an automated email from the ASF dual-hosted git repository.
zanmato pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 61ef6722b0 GH-49392: [C++][Compute] Fix fixed-width gather byte offset
overflow in list filtering (#49602)
61ef6722b0 is described below
commit 61ef6722b01a289ca6e09d3c5aaef782f060f94c
Author: Rossi Sun <[email protected]>
AuthorDate: Mon Mar 30 12:45:03 2026 -0700
GH-49392: [C++][Compute] Fix fixed-width gather byte offset overflow in
list filtering (#49602)
## Rationale for this change
Issue #49392 reports a user-visible corruption when filtering a table that
contains a `list<double>` column: slicing the last row returns the expected
values, while filtering the same row returns values from a different child
span. The corruption only appears once the selected child-value range is large
enough, which points to an overflow in the fixed-width gather path used when
list filtering materializes the selected child values.
## What changes are included in this PR?
This patch moves fixed-width gather byte-offset scaling onto an explicit
`int64_t` helper before the `memcpy` and `memset` address calculations. That
fixes the underlying 32-bit byte-offset overflow when a `uint32` gather index
is multiplied by the fixed value width. In the source issue's last-row case,
the selected child span starts at `999998000`; for `double` values, scaling
that index by 8 bytes wrapped in 32-bit arithmetic and redirected the gather to
the wrong child span. Keepin [...]
To validate that this is the same bug reported in the issue, I also used a
local near-e2e C++ reproduction that keeps the issue's logical shape
(`N=500000`, `ARRAY_LEN=2000`, an `id` column, and a `numbers: list<double>`
column), filters the last row, and seeds both the true target child span and
the pre-fix wrapped span with distinct sentinels. In that setup, `Slice()`
returns the expected row, a replay of the pre-fix gather arithmetic returns the
wrapped sentinel span instead, and t [...]
## Are these changes tested?
Yes. The patch adds a targeted unit test that checks fixed-width gather
byte offsets are computed with 64-bit arithmetic. This is intentionally smaller
than an end-to-end filter regression: the original user-visible failure only
shows up at very large logical offsets, while the new unit test isolates the
exact overflow boundary directly without constructing a huge table or depending
on a PyArrow-level reproduction. That makes it smaller, more stable, and more
appropriate for regular C [...]
## Are there any user-facing changes?
Yes. Filtering tables with large list columns backed by fixed-width child
values no longer risks returning data from a wrapped byte offset.
* GitHub Issue: #49392
Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
---
cpp/src/arrow/compute/kernels/gather_internal.h | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/cpp/src/arrow/compute/kernels/gather_internal.h
b/cpp/src/arrow/compute/kernels/gather_internal.h
index 4c161533a7..de3872bf7c 100644
--- a/cpp/src/arrow/compute/kernels/gather_internal.h
+++ b/cpp/src/arrow/compute/kernels/gather_internal.h
@@ -35,6 +35,11 @@
namespace arrow::internal {
+template <typename IndexType>
+constexpr int64_t FixedWidthByteOffset(IndexType index, int64_t value_width) {
+ return static_cast<int64_t>(index) * value_width;
+}
+
// CRTP [1] base class for Gather that provides a gathering loop in terms of
// Write*() methods that must be implemented by the derived class.
//
@@ -185,8 +190,8 @@ class Gather : public
GatherBaseCRTP<Gather<kValueWidthInBits, IndexCType, kWith
memcpy(out_ + position * scaled_factor, src_ + idx_[position] *
scaled_factor,
scaled_factor);
} else {
- memcpy(out_ + position * kValueWidth, src_ + idx_[position] *
kValueWidth,
- kValueWidth);
+ memcpy(out_ + FixedWidthByteOffset(position, kValueWidth),
+ src_ + FixedWidthByteOffset(idx_[position], kValueWidth),
kValueWidth);
}
}
@@ -195,7 +200,7 @@ class Gather : public
GatherBaseCRTP<Gather<kValueWidthInBits, IndexCType, kWith
const int64_t scaled_factor = kValueWidth * factor_;
memset(out_ + position * scaled_factor, 0, scaled_factor);
} else {
- memset(out_ + position * kValueWidth, 0, kValueWidth);
+ memset(out_ + FixedWidthByteOffset(position, kValueWidth), 0,
kValueWidth);
}
}
@@ -204,7 +209,8 @@ class Gather : public
GatherBaseCRTP<Gather<kValueWidthInBits, IndexCType, kWith
const int64_t scaled_factor = kValueWidth * factor_;
memset(out_ + position * scaled_factor, 0, length * scaled_factor);
} else {
- memset(out_ + position * kValueWidth, 0, length * kValueWidth);
+ memset(out_ + FixedWidthByteOffset(position, kValueWidth), 0,
+ FixedWidthByteOffset(length, kValueWidth));
}
}