Hi Will,
I'll also note that, while float16 is a first-class datatype, I'm not
sure any Arrow implementation is currently able to do anything other
than transport it.
You're right that we'd probably want extension number types to be based
on fixed-size-binary. A complication is endianness, though. Currently,
we have logic (for example in Arrow C++) to optionally byte-swap number
data at the edge (when receiving non-native-endian data). How would it
work with extension types based on fixed-size-binary? There is a risk
that implementations recognizing the bfloat16 extension type would
byte-swap, but others would not, leading to corrupt data streams.
The bfloat16 extension type would then have to be parametrized with its
endianness, or mandate a fixed endianness (probably little endian).
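To make the risk concrete, here is a minimal sketch in plain Python (no Arrow APIs involved). It assumes a bfloat16 value is the top 16 bits of a float32 and is stored as a 2-byte fixed-size-binary value in little-endian order. The helper names are hypothetical, and the conversion truncates instead of rounding, purely for illustration.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16: int) -> float:
    """Widen a bfloat16 bit pattern back to a float32 value."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x


# Producer: writes 1.0 as two little-endian bytes (the assumed fixed layout).
wire = float_to_bfloat16_bits(1.0).to_bytes(2, "little")

# Reader A follows the assumed little-endian layout.
reader_a = bfloat16_bits_to_float(int.from_bytes(wire, "little"))

# Reader B interprets the same two bytes in the opposite byte order,
# which is what happens if only one side of a stream byte-swaps.
reader_b = bfloat16_bits_to_float(int.from_bytes(wire, "big"))

print(reader_a, reader_b)
```

Reader A recovers 1.0, while reader B sees a different, meaningless value from the same two bytes. That mismatch is the corruption scenario described above.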
For bigints, I think the situation is simpler. Little-endian is, I
think, a much more convenient representation for bigints (at the cost of
some potential runtime byte-shuffling on big-endian systems).
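As a rough illustration, a 256-bit unsigned integer could be laid out as 32 little-endian bytes in a fixed-size-binary value. The sketch below uses only Python's built-in int conversions. The function names and the 64-bit limb view are assumptions made for the example, not part of any Arrow specification.

```python
WIDTH = 32  # bytes in a 256-bit unsigned integer


def u256_to_le_bytes(value: int) -> bytes:
    """Encode an unsigned 256-bit integer as 32 little-endian bytes."""
    return value.to_bytes(WIDTH, "little", signed=False)


def u256_from_le_bytes(raw: bytes) -> int:
    """Decode 32 little-endian bytes back into a Python int."""
    if len(raw) != WIDTH:
        raise ValueError(f"expected {WIDTH} bytes, got {len(raw)}")
    return int.from_bytes(raw, "little", signed=False)


value = 2**255 + 12345
raw = u256_to_le_bytes(value)
roundtrip = u256_from_le_bytes(raw)

# With a little-endian layout, the i-th 64-bit limb sits at byte offset 8 * i,
# so the least significant limb is simply the first 8 bytes.
low_limb = int.from_bytes(raw[:8], "little")
```

On a big-endian machine each limb would still need a byte swap when loaded into a native register, which is the runtime cost mentioned above.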
Regards
Antoine.
On 23/05/2023 at 23:47, Will Jones wrote:
I'm just starting to look at this, so not yet sure what the pros and cons
are of implementing it as an extension type versus a native Arrow type. My
initial ideas:
Pros of an extension type:
* It can be moved through Arrow-native systems that don't implement it, as
long as they preserve extension type information.
Pros of a native type:
* We have established patterns for writing compute kernels for natively
supported types.
If we were to implement these as extension types, I think bfloat16 and the
number types Ian Joiner mentions would be best implemented as extension
types based on fixed-size binary. We have a native float16 type already,
but I think that if bfloat16 were an extension type based on it, it could
get accidentally manipulated as a float16, which IIUC would be invalid.
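A small sketch of why that would be invalid, using only the Python standard library: the two formats split their 16 bits differently (bfloat16 has 8 exponent bits and 7 mantissa bits, float16 has 5 and 10), so the same bit pattern decodes to different numbers. The variable names are illustrative only.

```python
import struct

# bfloat16 bit pattern for 1.0: the top 16 bits of the float32 encoding of 1.0.
bf16_one = 0x3F80

# Correct interpretation: widen to float32 by appending 16 zero bits.
(as_bfloat16,) = struct.unpack("<f", struct.pack("<I", bf16_one << 16))

# Incorrect interpretation: treat the same 16 bits as an IEEE 754 half float.
(as_float16,) = struct.unpack("<e", struct.pack("<H", bf16_one))

print(as_bfloat16, as_float16)
```

The same 16 bits decode to 1.0 as bfloat16 but to 1.875 as float16, so any float16 kernel applied to such data would silently produce wrong results.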
If anyone has any advice from our work thus far on extension types, I'd
welcome your input.
Best,
Will Jones
[1] https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
[2] https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
On Tue, May 23, 2023 at 10:49 AM Antoine Pitrou <anto...@python.org> wrote:
Your question seems unspecific, but we now have the possibility of
standardizing canonical extension types (which are, of course, optional
to implement and support):
https://arrow.apache.org/docs/format/CanonicalExtensions.html
On 23/05/2023 at 19:45, Ian Joiner wrote:
That’s a possibility. Do we consider officially supporting them?
On Tuesday, May 23, 2023, Antoine Pitrou <anto...@python.org> wrote:
I'm not sure what you're actually proposing here. A new extension type
perhaps?
On 23/05/2023 at 19:13, Ian Joiner wrote:
Hi,
We need to have really large integers (with 128, 256 and 512 bits) as well
as decimals (up to at least decimal1024) because they do actually exist in
crypto / web3 space.
See https://docs.rs/primitive-types/latest/primitive_types/ for an example
of what needs to be supported.
If accepted we can implement the types for C++/Python and Rust.
Thanks,
Ian