Hi Micah

> Please see a recent discussion on adding new types [1]
>

Thanks, this is useful.



> My understanding is that feather.fbs is for V1 feather files and probably
> shouldn't be touched.  Only updating schema.fbs should be required and the
> type should be doable in a backwards/forwards compatible way (we've added
> types without bumping the metadata version and are in the process of adding
> more).
>

This is good to know. I'm still getting to know the code base, but I'll
work from Schema.fbs going forward.



> >    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
> >    imagine a struct([real, imag]) might offer more in terms of affordance
> >    to the user. I'd imagine the underlying memory layout would be the same.
>
>
> What notation is this using (are 32, 64 meant to be substitution
> parameters)?  I would think FixedSizeList might be more appropriate than
> list.
>

This should read list(float32()) or list(float64()) in Python/C++
notation.
As you say, fixed_size_list(float32(), 2) and fixed_size_list(float64(), 2)
are more appropriate.
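
To make the fixed-size-list idea concrete, here is a minimal sketch in plain Python (standard library only, so it models the layout rather than calling the Arrow API). The names LIST_SIZE, pack_complex and get_complex are hypothetical helpers invented for this illustration. It assumes each complex value is stored as a [real, imag] pair in one flat float32 buffer, so element i starts at offset 2 * i and no offsets buffer is needed:

```python
from array import array

LIST_SIZE = 2  # one [real, imag] pair per logical element (assumed layout)


def pack_complex(values, typecode="f"):
    """Flatten complex numbers into one [re, im, re, im, ...] buffer."""
    flat = array(typecode)  # "f" is float32, "d" would be float64
    for z in values:
        flat.extend((z.real, z.imag))
    return flat


def get_complex(flat, i):
    """Read logical element i; its position is computed, not looked up."""
    start = i * LIST_SIZE
    re, im = flat[start:start + LIST_SIZE]
    return complex(re, im)


# Values chosen to be exactly representable in float32.
data = [1 + 2j, 3.5 - 4j, -0.25 + 0j]
flat = pack_complex(data)
second = get_complex(flat, 1)
```

A variable-size list would additionally need an offsets buffer to locate each element, which is why the fixed-size variant looks like the better fit here.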


> It seems like what we would want for this is a "Packed Struct" type and
> then have an extension type to wrap it. The existing structs in arrow have
> a very different memory layout than lists (the real and imaginary
> components would not be adjacent in memory with Structs).  All the
> representations also have trade-offs on how they would be mapped to parquet
> and the relevant feature set there.
>

Ah, so Arrow Structs are represented as a Struct of Arrays (SoA) rather than
an Array of Structs (AoS)?
I don't immediately see a Packed Struct type. Would this need to be
implemented?
Alternatively, std::complex<float> and std::complex<double> seem to work and
implicitly provide a Packed Struct.
The base C types "float complex" and "double complex" don't seem to be
accepted by the C++ templating system as template parameters in types.h.
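
For reference, here is how I understand the difference between the two layouts, as a small plain-Python sketch using only the struct module (it does not touch Arrow itself, and unpack_floats is a hypothetical helper for this illustration). It assumes little-endian float32 components:

```python
import struct

values = [1 + 2j, 3 + 4j]

# Struct of Arrays: one child buffer per field, as I understand Arrow
# structs to work. Real and imaginary parts live in separate buffers.
real_buf = struct.pack("<2f", *(z.real for z in values))
imag_buf = struct.pack("<2f", *(z.imag for z in values))

# Array of Structs ("packed"): the real and imaginary parts of each
# value are adjacent, matching the NumPy / std::complex layout.
packed_buf = struct.pack(
    "<4f", *(part for z in values for part in (z.real, z.imag))
)


def unpack_floats(buf):
    """Decode a buffer of little-endian float32 values into a list."""
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))


soa = (unpack_floats(real_buf), unpack_floats(imag_buf))
aos = unpack_floats(packed_buf)
```

Both layouts hold the same four floats, but only the packed buffer could be handed to code expecting consecutive [real, imag] pairs without copying.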


> Adding a new first-class type in Arrow requires working integration tests
> between C++ and Java libraries (once the idea is informally agreed upon)
> and then a final vote for approval.  We haven't formalized extension types
> but I imagine a similar cross language requirement would be agreed upon.
> Implementation of computation wouldn't be required for adding a new type.
> Different language bindings have taken different approaches on how much
> additional computational elements are packaged in them.
>

Agreed, Complex Types should be covered by integration tests.

regards,

Simon




> On Tue, Jun 8, 2021 at 1:27 AM Simon Perkins <simon.perk...@gmail.com>
> wrote:
>
> > Greetings Apache Dev Mailing List
> >
> > I'm interested in adding complex number support to Arrow. The use case is
> > Radio Astronomy data, which is represented by complex values.
> >
> > xref https://issues.apache.org/jira/browse/ARROW-638
> > xref https://github.com/apache/arrow/pull/10452
> >
> > It's fairly easy to support Complex Numbers as a Python Extension -- see,
> > e.g., how I've done it here using a list(float{32,64}):
> >
> > https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173
> >
> > The above seems to work with the standard NumPy complex memory layout
> > (consecutive pairs of [real, imag] values) and should work with the C++
> > std::complex layout. Note that C complex and C++ std::complex should also
> > have the same layout https://stackoverflow.com/a/10540346.
> >
> > However, this constrains this representation of Complex Numbers to
> > dask-ms only. I think that it would be better to add support for this at
> > a base level in Arrow, especially since this will open up the ability for
> > other packages to understand the Complex Number Type. For example, it
> > would be useful to:
> >
> >    1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow ->
> >    Pandas roundtrip. Currently there's no Pandas -> Arrow conversion for
> >    np.complex{64, 128}.
> >    2. Support complex number types in query engines like DataFusion and
> >    BlazingSQL, if only initially via selection on indexing columns.
> >
> >
> > I started up a PR in https://github.com/apache/arrow/pull/10452 adding
> > Complex Numbers as a first-class Arrow type, although I note that
> >
> > https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
> >
> > suggests implementing this as a C++ Extension Type on a first pass.
> > Initial experiments suggest this is pretty doable -- I've got some test
> > cases running already.
> >
> > I have some questions going forward:
> >
> >    - Adding first class complex types seems to involve modifying
> >    cpp/src/arrow/ipc/feather.fbs which may change the protocol and
> >    introduce breaking changes. I'm not sure about this and seek advice on
> >    how invasive this approach is and whether it's worth pursuing.
> >    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
> >    imagine a struct([real, imag]) might offer more in terms of affordance
> >    to the user. I'd imagine the underlying memory layout would be the same.
> >    - I don't have a clear understanding of whether adding either a
> >    First-Class or ExtensionType involves supporting numeric operations on
> >    that type (e.g. Complex Exponential, Absolutes, Min or Max operations)
> >    or whether Arrow is merely concerned with the underlying data
> >    representation.
> >
> > Thanks for considering this.
> >   Simon Perkins
> >
>
