> It might help this discussion and future discussions like it if we could define how it is determined whether a type should be part of the Arrow format, an extension type (and what does it mean to say there is a "canonical" extension type), or just something that a language implementation or downstream library builds for itself with metadata. I feel like this has come up before but I don't recall a resolution.

There seemed to be consensus, but I guess we never formally voted on the decision points here:
https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
Applying the criteria to complex types:
1. Is the type a new parameterization of an existing type? No
2. Does the type itself have its own specification for processing (e.g. JSON, BSON, Thrift, Avro, Protobuf)? No
3. Is the underlying encoding of the type already semantically supported by a type? Yes. Two have been mentioned in this thread, and I would also support adding a new packed struct type, but it appears that isn't necessary for this. Note that FixedSizeList has some limitations regarding Parquet compatibility around nullability; there might be a few other sharp edges.

So if we use these criteria we would lean towards an extension type.
We never converged on a standard for "canonical" extension types. I would propose it be roughly the same criteria as for a first-class type:
1. A specification/document update PR that describes the representation
2. An implementation showing working integration tests across two languages (for canonical types I think this can be any two languages instead of C++ and Java)
3. A formal vote accepting the canonical type.
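For concreteness, this is roughly what the extension route looks like at the metadata level. The `ARROW:extension:name` and `ARROW:extension:metadata` keys are the ones the format reserves for extension types; the name `arrow.complex128` and the storage choice here are hypothetical, purely for illustration:

```python
import json

# An extension type is just a storage type plus two reserved
# field-metadata keys. A complex extension stored as
# fixed_size_list<float64>[2] might be described like this.
def complex_field_metadata():
    return {
        # Reserved key identifying the extension (name is hypothetical).
        "ARROW:extension:name": "arrow.complex128",
        # Reserved key carrying extension-specific parameters, if any.
        "ARROW:extension:metadata": json.dumps(
            {"storage": "fixed_size_list<float64>[2]"}
        ),
    }

# An implementation that doesn't recognize the name falls back to the
# storage type and preserves the metadata for round-tripping.
meta = complex_field_metadata()
print(meta["ARROW:extension:name"])
```

A canonical extension would essentially pin down these two values in the spec document from step 1.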
Thanks,
Micah
On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
Isn't an array of complexes represented by what arrow already supports? In particular, I see at least two valid in-memory representations to use, that depend on what we are going to do with it:
* Struct[re, im]
* FixedList[2]
In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...]; in the second case we have one buffer, [x0, y0, x1, y1, ...].
The first representation is useful for column-based operations (e.g. taking the real part in case 1 is trivial; it requires a copy in the second case); the second representation is useful for row-based operations (e.g. "take" and "filter" require a single pass over buffer 1). Case 2 does not support Re and Im of different physical types (arguably an issue). Both cases support nullability of individual items or combined.
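A tiny pure-Python sketch of the two layouts (plain lists standing in for Arrow buffers; validity bitmaps omitted) makes the copy argument visible:

```python
# Struct[re, im]: two separate buffers (struct-of-arrays).
re_buf = [1.0, 3.0, 5.0]
im_buf = [2.0, 4.0, 6.0]

# FixedList[2]: one interleaved buffer (array-of-structs).
packed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Case 1: the real part is already a contiguous buffer; no copy needed.
real_case1 = re_buf

# Case 2: the real part must be gathered with a stride-2 copy.
real_case2 = packed[0::2]

assert real_case1 == real_case2 == [1.0, 3.0, 5.0]

# Row-based ops like "take" favor the interleaved layout: one slice
# per row instead of one slice per child buffer.
def take_row(i):
    return packed[2 * i : 2 * i + 2]

print(take_row(1))  # the complex value 3.0 + 4.0j
```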
What I conclude is that this does not seem to be a problem about a base in-memory representation, but rather about whether we agree on a representation that justifies adding associated metadata to the spec.
The case for the complex interval type recently proposed [1] is more compelling to me because ops over intervals usually require all parts of the interval (and thus the "FixedList" representation is more compelling), but each part has a different type. I.e. it is like a "FixedTypedList[int32, int32, int64]", which we do not natively support.
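To illustrate the shape being described, Python's `struct` module can sketch one value of such a heterogeneous packed layout. The (months, days, nanoseconds) = (int32, int32, int64) fields follow the interval proposal in [1]; everything else here is illustrative:

```python
import struct

# One "month-day-nanosecond" interval laid out as a packed struct of
# (int32, int32, int64): 16 bytes, little-endian, mirroring the
# FixedTypedList[int32, int32, int64] shape described above.
FMT = "<iiq"

def pack_interval(months, days, nanoseconds):
    return struct.pack(FMT, months, days, nanoseconds)

def unpack_interval(buf):
    return struct.unpack(FMT, buf)

buf = pack_interval(1, 15, 30 * 10**9)
assert len(buf) == struct.calcsize(FMT) == 16
assert unpack_interval(buf) == (1, 15, 30_000_000_000)
```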
[1] https://github.com/apache/arrow/pull/10177
Best,
Jorge
On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
It might help this discussion and future discussions like it if we could define how it is determined whether a type should be part of the Arrow format, an extension type (and what does it mean to say there is a "canonical" extension type), or just something that a language implementation or downstream library builds for itself with metadata. I feel like this has come up before but I don't recall a resolution.

Examples might also help: are there examples of "canonical extension types"?
Neal
On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> My understanding is that it means having COMPLEX as an entry in the arrow/type_fwd.h Type enum. I agree this would make implementation work in the C++ library much more straightforward.
>
> One idea I proposed would be to do that, and implement the serialization of the complex metadata using Extension types.

If this is a maintainable strategy for Canonical types it sounds good to me.
On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com> wrote:
My understanding is that it means having COMPLEX as an entry in the arrow/type_fwd.h Type enum. I agree this would make implementation work in the C++ library much more straightforward.

One idea I proposed would be to do that, and implement the serialization of the complex metadata using Extension types.
On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <weston.p...@gmail.com> wrote:
> While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type.

@pitrou
This is perhaps a naive question (and admittedly, I'm not up to speed on my compute kernels) but why is this the case? For example, if adding a complex addition kernel it seems we would be talking about...
dest_scalar.real = scalar1.real + scalar2.real;
dest_scalar.im = scalar1.im + scalar2.im;
vs...
dest_scalar[0] = scalar1[0] + scalar2[0];
dest_scalar[1] = scalar1[1] + scalar2[1];
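For what it's worth, vectorized versions of both variants are equally short; a pure-Python sketch (lists standing in for buffers, validity bitmaps ignored):

```python
# Layout 1: Struct[re, im] — add each child buffer independently.
def add_struct(re1, im1, re2, im2):
    re_out = [a + b for a, b in zip(re1, re2)]
    im_out = [a + b for a, b in zip(im1, im2)]
    return re_out, im_out

# Layout 2: FixedList[2] — one pass over the interleaved buffer;
# the real and imaginary parts need no special-casing at all.
def add_packed(buf1, buf2):
    return [a + b for a, b in zip(buf1, buf2)]

re_out, im_out = add_struct([1.0, 3.0], [2.0, 4.0], [10.0, 30.0], [20.0, 40.0])
assert (re_out, im_out) == ([11.0, 33.0], [22.0, 44.0])
assert add_packed([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]) == [11.0, 22.0, 33.0, 44.0]
```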
On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <wesmck...@gmail.com> wrote:
I'd be supportive of starting with this as a "canonical" extension type so that all implementations are not expected to support complex types; this would encourage us to build sufficient integration e.g. with NumPy to get things working end-to-end with the on-wire representation being an extension type. We could certainly choose to treat the type as "first class" in the C++ library without it being "top level" in the Type union in Flatbuffers.

I agree that the use cases are more specialized, and the fact that we haven't needed it until now (or at least, its absence suggests this) shows that this is the case.
On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> I'm convinced now that first-class types seem to be the way to go and I'm happy to take this approach.

I agree from an implementation effort it is simpler, but I'm still not convinced that we should be adding this as a first-class type. As noted in the survey below, it appears complex numbers are not a core concept in many general-purpose programming languages, and it doesn't appear to be a common type in SQL systems either.

The reason why I am being nit-picky here is that I think having a first-class type indicates it should eventually be supported by all reference implementations. A "well known" extension type, I think, offers fewer guarantees, which makes it seem more suitable for niche types.
> I don't immediately see a Packed Struct type. Would this need to be implemented?

> Not necessarily (*). But before thinking about implementation, this proposal must be accepted into the format.

Yes, this is a type that has been proposed in the past, and I think it handles a lot of types not yet in Arrow that have been requested (e.g. IP addresses, geo coordinates), etc.
On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <simon.perk...@gmail.com> wrote:
On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <anto...@python.org> wrote:
On 09/06/2021 17:52, Micah Kornfield wrote:
> Adding a new first-class type in Arrow requires working integration tests between the C++ and Java libraries (once the idea is informally agreed upon) and then a final vote for approval. We haven't formalized extension types, but I imagine a similar cross-language requirement would be agreed upon.
>
> Implementation of computation wouldn't be required for adding a new type. Different language bindings have taken different approaches on how many additional computational elements are packaged in them.

While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type.

Since complex numbers are quite common in some domains, and since they are conceptually simple, IMHO it would make sense to add them to the native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
I'm convinced now that first-class types seem to be the way to go and I'm happy to take this approach.

Regarding compute functions, it looks like the standard set of scalar arithmetic and reduction functionality is desirable for complex numbers:
https://arrow.apache.org/docs/cpp/compute.html#

Perhaps it would be better to split the addition of the Types and the addition of the Compute functionality into separate PRs?

Regarding the process for managing this PR, it sounds like a proposal must be voted on? I.e. is this proposal still in this phase:
http://arrow.apache.org/docs/developers/contributing.html#before-starting

Regards
Simon