Hi Micah

> Please see a recent discussion on adding new types [1]
>
Thanks, this is useful.

> My understanding is that feather.fbs is for V1 feather files and probably
> shouldn't be touched. Only updating schema.fbs should be required and the
> type should be doable in a backwards/forwards compatible way (we've added
> types without bumping the metadata version and are in the process of adding
> more).
>

This is good to know. I'm still getting to know the code base, but will work
from Schema.fbs going forward.

> > - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
> >   imagine a struct([real, imag]) might offer more in terms of affordance
> >   to the user. I'd imagine the underlying memory layout would be the same.
> >
> What notation is this using (are 32, 64 meant to be substitutable
> parameters)? I would think FixedSizeList might be more appropriate than
> list.
>

This should read list(float32()) or list(float64()) in Python/C++ notation.
As you say, fixed_size_list(float32(), 2) and fixed_size_list(float64(), 2)
are more appropriate.

> It seems like what we would want for this is a "Packed Struct" type and
> then have an extension type to wrap it. The existing structs in arrow have
> a very different memory layout than lists (the real and imaginary
> components would not be adjacent in memory with Structs). All the
> representations also have trade-offs on how they would be mapped to parquet
> and the relevant feature set there.
>

Ah, so Arrow Structs are represented as a Struct of Arrays (SoA) rather than
an Array of Structs (AoS)? I don't immediately see a Packed Struct type. Would
this need to be implemented? Alternatively, std::complex<float> and
std::complex<double> seem to work and implicitly provide a Packed Struct. The
base C types "float complex" and "double complex" don't seem to be accepted by
the C++ templating system as template parameters in types.h.
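To make the AoS vs SoA distinction above concrete, here is a minimal sketch
using only the Python standard library. It does not use any Arrow API; the
names aos, soa_real, soa_imag and complex_at are made up for this
illustration, and the 4-byte float size is an assumption that holds on
mainstream platforms.

```python
import struct
from array import array

# Three complex values written as (real, imag) pairs of float32.
values = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Array of Structs (AoS): real and imaginary parts are interleaved in one
# buffer, as in a NumPy complex array or a fixed-size list of length 2.
aos = array("f", [part for pair in values for part in pair])

# Struct of Arrays (SoA): one contiguous buffer per field, so the real and
# imaginary parts of a single value are not adjacent in memory.
soa_real = array("f", [re for re, _ in values])
soa_imag = array("f", [im for _, im in values])


def complex_at(buf: bytes, i: int) -> complex:
    """Read the i-th (real, imag) float32 pair from an interleaved buffer."""
    re, im = struct.unpack_from("=2f", buf, i * 8)
    return complex(re, im)


aos_bytes = aos.tobytes()
print(list(aos))                 # interleaved: re0, im0, re1, im1, ...
print(list(soa_real), list(soa_imag))
print(complex_at(aos_bytes, 1))  # second value, read straight from the buffer
```

With the interleaved buffer, each value can be read as one contiguous 8-byte
pair, which is what makes it compatible with std::complex<float>. With the
split buffers, the two halves have to be gathered from separate allocations.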
> Adding a new first-class type in Arrow requires working integration tests
> between C++ and Java libraries (once the idea is informally agreed upon)
> and then a final vote for approval. We haven't formalized extension types
> but I imagine a similar cross language requirement would be agreed upon.
> Implementation of computation wouldn't be required for adding a new type.
> Different language bindings have taken different approaches on how much
> additional computational elements are packaged in them.
>

Agreed, Complex Types should be covered by integration tests.

regards,
Simon

> On Tue, Jun 8, 2021 at 1:27 AM Simon Perkins <simon.perk...@gmail.com>
> wrote:
>
> > Greetings Apache Dev Mailing List
> >
> > I'm interested in adding complex number support to Arrow. The use case is
> > Radio Astronomy data, which is represented by complex values.
> >
> > xref https://issues.apache.org/jira/browse/ARROW-638
> > xref https://github.com/apache/arrow/pull/10452
> >
> > It's fairly easy to support Complex Numbers as a Python Extension -- see
> > for e.g. how I've done it here using a list(float{32,64}):
> >
> > https://github.com/ska-sa/dask-ms/blob/a5bd8538ea3de9fabb8fe74e89c3a75c4043f813/daskms/experimental/arrow/extension_types.py#L144-L173
> >
> > The above seems to work with the standard NumPy complex memory layout
> > (consecutive pairs of [real, imag] values) and should work with the C++
> > std::complex layout. Note that C complex and C++ std::complex should also
> > have the same layout https://stackoverflow.com/a/10540346.
> >
> > However, this constrains this representation of Complex Numbers to the
> > dask-ms only. I think that it would be better to add support for this at a
> > base level in Arrow, especially since this will open up the ability for
> > other packages to understand the Complex Number Type. For example, it would
> > be useful to:
> >
> >    1. Have a clearly defined Pandas -> Arrow -> Parquet -> Arrow -> Pandas
> >    roundtrip.
> >    Currently there's no Pandas -> Arrow conversion for
> >    np.complex{64, 128}.
> >    2. Support complex number types in query engines like DataFusion and
> >    BlazingSQL, if only initially via selection on indexing columns.
> >
> > I started up a PR in https://github.com/apache/arrow/pull/10452 adding
> > Complex Numbers as a first-class Arrow type, although I note that
> >
> > https://issues.apache.org/jira/browse/ARROW-638?focusedCommentId=16912456&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16912456
> >
> > suggests implementing this as a C++ Extension Type on a first pass. Initial
> > experiments suggest this is pretty doable -- I've got some test cases
> > running already.
> >
> > I have some questions going forward:
> >
> >    - Adding first class complex types seems to involve modifying
> >    cpp/src/arrow/ipc/feather.fbs which may change the protocol and
> >    introduce breaking changes. I'm not sure about this and seek advice on
> >    how invasive this approach is and whether it's worth pursuing.
> >    - list(float{32,64}) seems to work fine as an ExtensionType, but I'd
> >    imagine a struct([real, imag]) might offer more in terms of affordance
> >    to the user. I'd imagine the underlying memory layout would be the same.
> >    - I don't have a clear understanding of whether adding either a
> >    First-Class or ExtensionType involves supporting numeric operations on
> >    that type (e.g. Complex Exponential, Absolutes, Min or Max operations)
> >    or whether Arrow is merely concerned with the underlying data
> >    representation.
> >
> > Thanks for considering this.
> > Simon Perkins
> >