Oh wait, I see now that you're incrementing with a uint8_t*. That could be fine for your own use, but you might want to make sure it aligns with the type of your output (Int64Array vs Int32Array).

Sent from Proton Mail for iOS


On Mon, Jul 17, 2023 at 06:20, Aldrin <octalene....@pm.me.INVALID> wrote:
Hi Wenbo,

An ArraySpan is like an ArrayData but does not own the data, so the ColumnarFormat doc that Jon shared is relevant for both.

In the case of a binary format, the output ArraySpan must have at least 2 buffers: the offsets and the contiguous binary data (values). If the output of your UDF is something like an Int32Array with no nulls, then I think you're writing output correctly.

But, since your pointer is a uint8_t, I think Jin is right and `++` is going to move your pointer 1 byte instead of 32 bytes like you intend.

Sent from Proton Mail for iOS


On Mon, Jul 17, 2023 at 05:06, Wenbo Hu <huwenbo1...@gmail.com> wrote:
Hi Jin,

> but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards?
I'm implementing the sha256 function as a scalar function, but it
always inputs with an array, so on visitor pattern, I'll write a 32
byte hash into the pointer and move to the next for next visit.
Something like:
```

struct BinarySha256Visitor {
BinarySha256Visitor(uint8_t **out) {
this->out = out;
}
arrow::Status VisitNull() {
return arrow::Status::OK();
}

arrow::Status VisitValue(std::string_view v) {

uint8_t hash[32];
sha256(v, hash);

memcpy(*out++, hash, 32);

return arrow::Status::OK();
}

uint8_t ** out;
};

arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan
&batch, cp::ExecResult *out) {
arrow::ArraySpanVisitor<arrow::BinaryType> visitor;

auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1);
BinarySha256Visitor visit(out_values);
ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit));

return arrow::Status::OK();
}
```
Is it as expected?

Jin Shang <shangjin1...@gmail.com> 于2023年7月17日周一 19:44写道:
>
> Hi Wenbo,
>
> I'd like to known what's the *three* `buffers` are in ArraySpan. What are
> > `1` means when `GetValues` called?
>
> The meaning of buffers in an ArraySpan depends on the layout of its data
> type. FixedSizeBinary is a fixed-size primitive type, so it has two
> buffers, one validity buffer and one data buffer. So GetValues(1) would
> return a pointer to the data buffer.
> Layouts of data types can be found here[1].
>
> what is the actual type should I get from `GetValues`?
> >
> Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as
> any type to suit your need. The template parameter for GetValue is simply
> forwarded to reinterpret_cast. There are discussions[2] on the soundness of
> using uint8_t to represent bytes but it is what we use now. Since you are
> only doing a memcpy, uint8_t should be good.
>
> Maybe, `auto *out_values = out->array_span_mutable()->GetValues(uint8_t
> > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`?
> >
> I may be missing something, but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards? Otherwise I agree this is
> the way to go.
>
> [1]
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> [2] https://github.com/apache/arrow/issues/36123
>
>
> On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <huwenbo1...@gmail.com> wrote:
>
> > Hi,
> > I'm using Acero as the stream executor to run large scale data
> > transformation. The core data used in UDF is `ArraySpan` in
> > `ExecSpan`, but not much document on ArraySpan. I'd like to known
> > what's the *three* `buffers` are in ArraySpan. What are `1` means when
> > `GetValues` called?
> > For input data, I can use a `ArraySpanVisitor` to iterator over
> > different input types. But for output data, I don't know how to write
> > to the`array_span_mutable()` if it is not a simple c_type.
> > For example, I'm implementing a sha256 udf, which input is
> > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`, then
> > how can I directly write to the out buffers and what is the actual
> > type should I get from `GetValues`?
> > Maybe, `auto *out_values =
> > out->array_span_mutable()->GetValues(uint8_t *>(1);` and
> > `memcpy(*out_values++, some_ptr, 32);`?
> >
> > --
> > ---------------------
> > Best Regards,
> > Wenbo Hu,
> >



--
---------------------
Best Regards,
Wenbo Hu,

Attachment: publicKey - octalene.dev@pm.me - 0x21969656.asc
Description: application/pgp-keys

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to