Hi Simon,

Please see a recent discussion on adding new types [1].

>    - Adding first class complex types seems to involve modifying
>    cpp/src/arrow/ipc/feather.fbs which may change the protocol and
>    introduce breaking changes. I'm not sure about this and seek advice on
>    how invasive this approach is and whether it's worth pursuing.


My understanding is that feather.fbs is for Feather V1 files and probably
shouldn't be touched.  Only schema.fbs should need updating, and adding the
type should be doable in a backwards/forwards compatible way (we've added
types without bumping the metadata version and are in the process of adding
more).

>    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
>    imagine a struct([real, imag]) might offer more in terms of affordance
>    to the user. I'd imagine the underlying memory layout would be the same.


What notation is this using (are 32 and 64 meant to be substitutable
parameters)?  I would think FixedSizeList would be more appropriate than
list.

It seems like what we would want for this is a "Packed Struct" type, with
an extension type to wrap it. The existing structs in Arrow have a very
different memory layout from lists (with structs, the real and imaginary
components would not be adjacent in memory).  All of the representations
also have trade-offs in how they would be mapped to Parquet and the
relevant feature set there.

>    - I don't have a clear understanding of whether adding either a
>    First-Class or ExtensionType involves supporting numeric operations on
>    that type (e.g. Complex Exponential, Absolutes, Min or Max operations)
>    or whether Arrow is merely concerned with the underlying data
>    representation.


Adding a new first-class type in Arrow requires working integration tests
between C++ and Java libraries (once the idea is informally agreed upon)
and then a final vote for approval.  We haven't formalized extension types
but I imagine a similar cross language requirement would be agreed upon.
Implementation of computation wouldn't be required for adding a new type.
Different language bindings have taken different approaches to how much
additional computational functionality they package.

-Micah

[1]
https://lists.apache.org/thread.html/r7ba08aed2809fa64537e6f44bce38b2cf740acbef0e91cfaa7c19767%40%3Cdev.arrow.apache.org%3E

On Tue, Jun 8, 2021 at 1:27 AM Simon Perkins <simon.perk...@gmail.com>
wrote:

> Greetings Apache Dev Mailing List
>
> I'm interested in adding complex number support to Arrow. The use case is
> Radio Astronomy data, which is represented by complex values.
>
> xref https://issues.apache.org/jira/browse/ARROW-638
> xref https://github.com/apache/arrow/pull/10452
>
> It's fairly easy to support Complex Numbers as a Python Extension -- see,
> for example, how I've done it here using a list(float{32,64}):
>
>
> https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173
>
> The above seems to work with the standard NumPy complex memory layout
> (consecutive pairs of [real, imag] values) and should work with the C++
> std::complex layout. Note that C complex and C++ std::complex should also
> have the same layout https://stackoverflow.com/a/10540346.
>
> However, this constrains this representation of Complex Numbers to
> dask-ms only. I think that it would be better to add support for this at a
> base level in Arrow, especially since this will open up the ability for
> other packages to understand the Complex Number Type. For example, it would
> be useful to:
>
>    1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow -> Pandas
>    roundtrip. Currently there's no Pandas -> Arrow conversion for
>    np.complex{64, 128}.
>    2. Support complex number types in query engines like DataFusion and
>    BlazingSQL, if only initially via selection on indexing columns.
>
>
> I started up a PR in https://github.com/apache/arrow/pull/10452 adding
> Complex Numbers as a first-class Arrow type, although I note that
>
> https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
> suggests implementing this as a C++ Extension Type on a first pass. Initial
> experiments suggest this is pretty doable -- I've got some test cases
> running already.
>
> I have some questions going forward:
>
>    - Adding first class complex types seems to involve modifying
>    cpp/src/arrow/ipc/feather.fbs which may change the protocol and
>    introduce breaking changes. I'm not sure about this and seek advice on
>    how invasive this approach is and whether it's worth pursuing.
>    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
>    imagine a struct([real, imag]) might offer more in terms of affordance
>    to the user. I'd imagine the underlying memory layout would be the same.
>    - I don't have a clear understanding of whether adding either a
>    First-Class or ExtensionType involves supporting numeric operations on
>    that type (e.g. Complex Exponential, Absolutes, Min or Max operations)
>    or whether Arrow is merely concerned with the underlying data
>    representation.
>
> Thanks for considering this.
>   Simon Perkins
>
