> It might help this discussion and future discussions like it if we could define how it is determined whether a type should be part of the Arrow format, an extension type (and what does it mean to say there is a "canonical" extension type), or just something that a language implementation or downstream library builds for itself with metadata. I feel like this has come up before but I don't recall a resolution.

There seemed to be consensus, but I guess we never formally voted on the decision points here:
https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E
Applying the criteria to complex types:
1. Is the type a new parameterization of an existing type? No
2. Does the type itself have its own specification for processing (e.g. JSON, BSON, Thrift, Avro, Protobuf)? No
3. Is the underlying encoding of the type already semantically supported by a type? Yes. Two have been mentioned in this thread, and I would also support adding a new packed struct type, but it appears that isn't necessary for this. Note that FixedSizeList has some limitations regarding Parquet compatibility around nullability; there might be a few other sharp edges.

So if we use these criteria we would lean towards an extension type.
We never converged on a standard for "canonical" extension types. I would propose it be roughly the same criteria as for a first-class type:
1. A specification/document update PR that describes the representation
2. An implementation showing working integration tests across two languages (for canonical types I think this can be any two languages instead of C++ and Java)
3. A formal vote accepting the canonical type.
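For concreteness, this is roughly what the extension route looks like at the metadata level. The `ARROW:extension:name` and `ARROW:extension:metadata` keys are the ones the format reserves for extension types; the name `arrow.complex128` and the storage choice here are hypothetical, purely for illustration:

```python
import json

# An extension type is just a storage type plus two reserved
# field-metadata keys. A complex extension stored as
# fixed_size_list<float64>[2] might be described like this.
def complex_field_metadata():
    return {
        # Reserved key identifying the extension (name is hypothetical).
        "ARROW:extension:name": "arrow.complex128",
        # Reserved key carrying extension-specific parameters, if any.
        "ARROW:extension:metadata": json.dumps(
            {"storage": "fixed_size_list<float64>[2]"}
        ),
    }

# An implementation that doesn't recognize the name falls back to the
# storage type and preserves the metadata for round-tripping.
meta = complex_field_metadata()
print(meta["ARROW:extension:name"])
```

A canonical extension would essentially pin down these two values in the spec document from step 1.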
Thanks,
Micah
On Thu, Jun 10, 2021 at 9:34 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
Isn't an array of complexes represented by what arrow already supports? In particular, I see at least two valid in-memory representations to use, that depend on what we are going to do with it:
* Struct[re, im]
* FixedList[2]
In the first case, we have two buffers, [x0, x1, ...] and [y0, y1, ...]; in the second case we have one buffer, [x0, y0, x1, y1, ...].
The first representation is useful for column-based operations (e.g. taking the real part in case 1 is trivial; it requires a copy in the second case); the second representation is useful for row-based operations (e.g. "take" and "filter" require a single pass over buffer 1). Case 2 does not support Re and Im of different physical types (arguably an issue). Both cases support nullability of individual items or combined.
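A tiny pure-Python sketch of the two layouts (plain lists standing in for Arrow buffers; validity bitmaps omitted) makes the copy argument visible:

```python
# Struct[re, im]: two separate buffers (struct-of-arrays).
re_buf = [1.0, 3.0, 5.0]
im_buf = [2.0, 4.0, 6.0]

# FixedList[2]: one interleaved buffer (array-of-structs).
packed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Case 1: the real part is already a contiguous buffer; no copy needed.
real_case1 = re_buf

# Case 2: the real part must be gathered with a stride-2 copy.
real_case2 = packed[0::2]

assert real_case1 == real_case2 == [1.0, 3.0, 5.0]

# Row-based ops like "take" favor the interleaved layout: one slice
# per row instead of one slice per child buffer.
def take_row(i):
    return packed[2 * i : 2 * i + 2]

print(take_row(1))  # the complex value 3.0 + 4.0j
```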
What I conclude is that this does not seem to be a problem about a base in-memory representation, but rather about whether we agree on a representation that justifies adding associated metadata to the spec.
The case for the complex interval type recently proposed [1] is more compelling to me because ops over intervals usually require all parts of the interval (and thus the "FixedList" representation is more compelling), but each part has a different type. I.e. it is like a "FixedTypedList[int32, int32, int64]", which we do not natively support.
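To illustrate the shape being described, Python's `struct` module can sketch one value of such a heterogeneous packed layout. The (months, days, nanoseconds) = (int32, int32, int64) fields follow the interval proposal in [1]; everything else here is illustrative:

```python
import struct

# One "month-day-nanosecond" interval laid out as a packed struct of
# (int32, int32, int64): 16 bytes, little-endian, mirroring the
# FixedTypedList[int32, int32, int64] shape described above.
FMT = "<iiq"

def pack_interval(months, days, nanoseconds):
    return struct.pack(FMT, months, days, nanoseconds)

def unpack_interval(buf):
    return struct.unpack(FMT, buf)

buf = pack_interval(1, 15, 30 * 10**9)
assert len(buf) == struct.calcsize(FMT) == 16
assert unpack_interval(buf) == (1, 15, 30_000_000_000)
```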
[1] https://github.com/apache/arrow/pull/10177
Best,
Jorge
On Fri, Jun 11, 2021 at 1:48 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
It might help this discussion and future discussions like it if we could define how it is determined whether a type should be part of the Arrow format, an extension type (and what does it mean to say there is a "canonical" extension type), or just something that a language implementation or downstream library builds for itself with metadata. I feel like this has come up before but I don't recall a resolution.

Examples might also help: are there examples of "canonical extension types"?
Neal
On Thu, Jun 10, 2021 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> My understanding is that it means having COMPLEX as an entry in the arrow/type_fwd.h Type enum. I agree this would make implementation work in the C++ library much more straightforward.
>
> One idea I proposed would be to do that, and implement the serialization of the complex metadata using Extension types.

If this is a maintainable strategy for Canonical types it sounds good to me.
On Thu, Jun 10, 2021 at 4:02 PM Wes McKinney <wesmck...@gmail.com> wrote:
My understanding is that it means having COMPLEX as an entry in the arrow/type_fwd.h Type enum. I agree this would make implementation work in the C++ library much more straightforward.

One idea I proposed would be to do that, and implement the serialization of the complex metadata using Extension types.
On Thu, Jun 10, 2021 at 5:47 PM Weston Pace <weston.p...@gmail.com> wrote:
> While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type.

@pitrou
This is perhaps a naive question (and admittedly, I'm not up to speed on my compute kernels) but why is this the case? For example, if adding a complex addition kernel it seems we would be talking about...
dest_scalar.real = scalar1.real + scalar2.real;
dest_scalar.im = scalar1.im + scalar2.im;
vs...
dest_scalar[0] = scalar1[0] + scalar2[0];
dest_scalar[1] = scalar1[1] + scalar2[1];
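For what it's worth, vectorized versions of both variants are equally short; a pure-Python sketch (lists standing in for buffers, validity bitmaps ignored):

```python
# Layout 1: Struct[re, im] — add each child buffer independently.
def add_struct(re1, im1, re2, im2):
    re_out = [a + b for a, b in zip(re1, re2)]
    im_out = [a + b for a, b in zip(im1, im2)]
    return re_out, im_out

# Layout 2: FixedList[2] — one pass over the interleaved buffer;
# the real and imaginary parts need no special-casing at all.
def add_packed(buf1, buf2):
    return [a + b for a, b in zip(buf1, buf2)]

re_out, im_out = add_struct([1.0, 3.0], [2.0, 4.0], [10.0, 30.0], [20.0, 40.0])
assert (re_out, im_out) == ([11.0, 33.0], [22.0, 44.0])
assert add_packed([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]) == [11.0, 22.0, 33.0, 44.0]
```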
On Thu, Jun 10, 2021 at 11:27 AM Wes McKinney <wesmck...@gmail.com> wrote:
I'd be supportive of starting with this as a "canonical" extension type so that all implementations are not expected to support complex types; this would encourage us to build sufficient integration e.g. with NumPy to get things working end-to-end with the on-wire representation being an extension type. We could certainly choose to treat the type as "first class" in the C++ library without it being "top level" in the Type union in Flatbuffers.

I agree that the use cases are more specialized, and the fact that we haven't needed it until now (or at least, its absence suggests this) shows that this is the case.
On Thu, Jun 10, 2021 at 4:17 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> I'm convinced now that first-class types seem to be the way to go and I'm happy to take this approach.

I agree from an implementation effort it is simpler, but I'm still not convinced that we should be adding this as a first-class type. As noted in the survey below, it appears complex numbers are not a core concept in many general-purpose programming languages, and it doesn't appear to be a common type in SQL systems either.

The reason why I am being nit-picky here is that I think having a first-class type indicates it should eventually be supported by all reference implementations. A "well known" extension type, I think, offers fewer guarantees, which makes it seem more suitable for niche types.
> I don't immediately see a Packed Struct type. Would this need to be implemented?

> Not necessarily (*). But before thinking about implementation, this proposal must be accepted into the format.

Yes, this is a type that has been proposed in the past, and I think it handles a lot of types not yet in Arrow that have been requested (e.g. IP addresses, geo coordinates), etc.
On Thu, Jun 10, 2021 at 1:06 AM Simon Perkins <simon.perk...@gmail.com> wrote:
On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou <anto...@python.org> wrote:
On 09/06/2021 17:52, Micah Kornfield wrote:
> Adding a new first-class type in Arrow requires working integration tests between the C++ and Java libraries (once the idea is informally agreed upon) and then a final vote for approval. We haven't formalized extension types, but I imagine a similar cross-language requirement would be agreed upon.
>
> Implementation of computation wouldn't be required for adding a new type. Different language bindings have taken different approaches on how many additional computational elements are packaged in them.

While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type.

Since complex numbers are quite common in some domains, and since they are conceptually simple, IMHO it would make sense to add them to the native Arrow datatypes (at least COMPLEX64 and COMPLEX128).
I'm convinced now that first-class types seem to be the way to go and I'm happy to take this approach.

Regarding compute functions, it looks like the standard set of scalar arithmetic and reduction functionality is desirable for complex numbers:
https://arrow.apache.org/docs/cpp/compute.html#

Perhaps it would be better to split the addition of the Types and the addition of the Compute functionality into separate PRs?

Regarding the process for managing this PR, it sounds like a proposal must be voted on? I.e. is this proposal still in this phase:
http://arrow.apache.org/docs/developers/contributing.html#before-starting

Regards
Simon