hi all, Thank you for all the comments on this mailing list thread and in the Google document. There is definitely a lot of work to take some next steps from here, so I think it would make sense to fork off each of the proposed additions into dedicated discussions. The most contentious issue, it seems, is whether to maintain a 1-to-1 relationship between the IPC format and the C ABI, which would make it rather difficult to implement the "string view" data type in a way that is flexible and useful to applications (for example, giving them control over their own memory management as opposed to forcing data to be "pre-serialized" into buffers that are referenced by offsets).
I tend to be of the "practicality beats purity" mindset, where sufficiently beneficial changes to the in-memory format (and C ABI) may be worth breaking the implicit contract where the IPC format and the in-memory data structures have a strict 1-to-1 relationship. I suggest to help reach some consensus around this that I will create a new document focused only on the "string/binary view" type and the different implementation considerations (like what happens when you write it to the IPC format), as well as the different variants of the data structure itself that have been discussed with the associated trade-offs. Does this sound like a good approach? Thanks, Wes On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote: > > Fair enough (wrt to deprecation). Think that the sequence view is a > replacement for our existing (that allows O(N) selections), but I agree > with the sentiment that preserving compatibility is more important than a > single way of doing it. Thanks for that angle! > > Imo the Arrow format is already composed of 3 specifications: > > * C data interface (intra-process communication) > * IPC format (inter-process communication) > * Flight (RPC protocol) > > E.g. > * IPC requires a `dict_id` in the fields declaration, but the C data > interface has no such requirement (because, pointers) > * IPC accepts endian and compression, the C data interface does not > * DataFusion does not support IPC (yet ^_^), but its Python bindings > leverage the C data interface to pass data to pyarrow > > This to say that imo as long as we document the different specifications > that compose Arrow and their intended purposes, it is ok. Because the c > data interface is the one with the highest constraints (zero-copy, higher > chance of out of bound reads, etc.), it makes sense for proposals (and > implementations) first be written against it. > > > I agree with Neal's point wrt to the IPC. For extra context, many `async` > implementations use cooperative scheduling, which are vulnerable to DOS if > they need to perform heavy CPU-bound tasks (as the p-thread is blocked and > can't switch). QP Hou and I have summarized a broader version of this > statement here [1]. > > In async contexts, If deserializing from IPC requires a significant amount > of compute, that task should (to avoid blocking) be sent to a separate > thread pool to avoid blocking the p-threads assigned to the runtime's > thread pool. If the format is O(1) in CPU-bounded work, its execution can > be done in an async context without a separate thread pool. Arrow's IPC > format is quite unique there in that it requires almost always O(1) CPU > work to be loaded to memory (at the expense of more disk usage). > > I believe that atm we have two O(N) blocking tasks in reading IPC format > (decompression and byte swapping (big <-> little endian)), and three O(N) > blocking tasks in writing (compression, de-offset bitmaps, byte swapping). > The more prevalent O(N) CPU-bound tasks are at the IPC interface, the less > compelling it becomes vs e.g. parquet (file) or avro (stream), which have > an expectation of CPU-bound work. In this context, keeping the IPC format > compatible with the ABI spec is imo an important characteristic of Apache > Arrow that we should strive to preserve. Alternatively, we could also just > abandon this idea and say that the format expects CPU-bound tasks to > deserialize (even if considerably smaller than avro or parquet), so that > people can design the APIs accordingly. > > Best, > Jorge > > [1] > https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c > > > On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org> wrote: > > > > > > > Le 23/12/2021 à 17:59, Neal Richardson a écrit : > > >> I think in this particular case, we should consider the C ABI / > > >> in-memory representation and IPC format as separate beasts. If an > > >> implementation of Arrow does not want to use this string-view array > > >> type at all (for example, if it created memory safety issues in Rust), > > >> then it can choose to convert to the existing string array > > >> representation when receiving a C ABI payload. Whether or not there is > > >> an alternate IPC format for this data type seems like a separate > > >> question -- my preference actually would be to support this for > > >> in-memory / C ABI use but not to alter the IPC format. > > >> > > > > > > I think this idea deserves some clarification or at least more > > exposition. > > > On first reading, it was not clear to me that we might add things to the > > > in-memory Arrow format but not IPC, that that was even an option. I'm > > > guessing I'm not the only one who missed that. > > > > > > If these new types are only part of the Arrow in-memory format, then it's > > > not the case that reading/writing IPC files involves no serialization > > > overhead. I recognize that that's technically already the case since IPC > > > supports compression now, but it's not generally how we talk about the > > > relationship between the IPC and in-memory formats (see our own FAQ [1], > > > for example). If we go forward with these changes, it would be a good > > > opportunity for us to clarify in our docs/website that the "Arrow format" > > > is not a single thing. > > > > I'm worried that making the "Arrow format" polysemic/context-dependent > > would spread a lot of confusion among potential users of Arrow. > > > > Regards > > > > Antoine. > >