I also agree that splitting the StringView proposal into its own thing would be beneficial for discussion clarity
On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou <anto...@python.org> wrote: > > Le 12/01/2022 à 01:49, Wes McKinney a écrit : > > hi all, > > > > Thank you for all the comments on this mailing list thread and in the > > Google document. There is definitely a lot of work to take some next > > steps from here, so I think it would make sense to fork off each of > > the proposed additions into dedicated discussions. The most > > contentious issue, it seems, is whether to maintain a 1-to-1 > > relationship between the IPC format and the C ABI, which would make it > > rather difficult to implement the "string view" data type in a way > > that is flexible and useful to applications (for example, giving them > > control over their own memory management as opposed to forcing data to > > be "pre-serialized" into buffers that are referenced by offsets). > > > > I tend to be of the "practicality beats purity" mindset, where > > sufficiently beneficial changes to the in-memory format (and C ABI) > > may be worth breaking the implicit contract where the IPC format and > > the in-memory data structures have a strict 1-to-1 relationship. I > > suggest to help reach some consensus around this that I will create a > > new document focused only on the "string/binary view" type and the > > different implementation considerations (like what happens when you > > write it to the IPC format), as well as the different variants of the > > data structure itself that have been discussed with the associated > > trade-offs. Does this sound like a good approach? > > Indeed, this sounds like it will help making a decision. > > Personally, I am still very concerned by the idea of adding pointers to > the in-memory representation. Besides the loss of equivalence with the > IPC format, a representation using embedded pointers cannot be fully > validated for safety or correctness (how do you decide whether a pointer > is correct and doesn't reveal unrelated data?). > > I think we should discuss this with the DuckDB folks (and possibly the > Velox folks, but I guess that it might socio-politically more difficult) > so as to measure how much of an inconvenience it would be for them to > switch to a purely offsets-based approach. > > Regards > > Antoine. > > > > > > > Thanks, > > Wes > > > > > > On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão > > <jorgecarlei...@gmail.com> wrote: > >> > >> Fair enough (wrt to deprecation). Think that the sequence view is a > >> replacement for our existing (that allows O(N) selections), but I agree > >> with the sentiment that preserving compatibility is more important than > a > >> single way of doing it. Thanks for that angle! > >> > >> Imo the Arrow format is already composed of 3 specifications: > >> > >> * C data interface (intra-process communication) > >> * IPC format (inter-process communication) > >> * Flight (RPC protocol) > >> > >> E.g. > >> * IPC requires a `dict_id` in the fields declaration, but the C data > >> interface has no such requirement (because, pointers) > >> * IPC accepts endian and compression, the C data interface does not > >> * DataFusion does not support IPC (yet ^_^), but its Python bindings > >> leverage the C data interface to pass data to pyarrow > >> > >> This to say that imo as long as we document the different specifications > >> that compose Arrow and their intended purposes, it is ok. Because the c > >> data interface is the one with the highest constraints (zero-copy, > higher > >> chance of out of bound reads, etc.), it makes sense for proposals (and > >> implementations) first be written against it. > >> > >> > >> I agree with Neal's point wrt to the IPC. For extra context, many > `async` > >> implementations use cooperative scheduling, which are vulnerable to DOS > if > >> they need to perform heavy CPU-bound tasks (as the p-thread is blocked > and > >> can't switch). QP Hou and I have summarized a broader version of this > >> statement here [1]. > >> > >> In async contexts, If deserializing from IPC requires a significant > amount > >> of compute, that task should (to avoid blocking) be sent to a separate > >> thread pool to avoid blocking the p-threads assigned to the runtime's > >> thread pool. If the format is O(1) in CPU-bounded work, its execution > can > >> be done in an async context without a separate thread pool. Arrow's IPC > >> format is quite unique there in that it requires almost always O(1) CPU > >> work to be loaded to memory (at the expense of more disk usage). > >> > >> I believe that atm we have two O(N) blocking tasks in reading IPC format > >> (decompression and byte swapping (big <-> little endian)), and three > O(N) > >> blocking tasks in writing (compression, de-offset bitmaps, byte > swapping). > >> The more prevalent O(N) CPU-bound tasks are at the IPC interface, the > less > >> compelling it becomes vs e.g. parquet (file) or avro (stream), which > have > >> an expectation of CPU-bound work. In this context, keeping the IPC > format > >> compatible with the ABI spec is imo an important characteristic of > Apache > >> Arrow that we should strive to preserve. Alternatively, we could also > just > >> abandon this idea and say that the format expects CPU-bound tasks to > >> deserialize (even if considerably smaller than avro or parquet), so that > >> people can design the APIs accordingly. > >> > >> Best, > >> Jorge > >> > >> [1] > >> > https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c > >> > >> > >> On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org> > wrote: > >> > >>> > >>> > >>> Le 23/12/2021 à 17:59, Neal Richardson a écrit : > >>>>> I think in this particular case, we should consider the C ABI / > >>>>> in-memory representation and IPC format as separate beasts. If an > >>>>> implementation of Arrow does not want to use this string-view array > >>>>> type at all (for example, if it created memory safety issues in > Rust), > >>>>> then it can choose to convert to the existing string array > >>>>> representation when receiving a C ABI payload. Whether or not there > is > >>>>> an alternate IPC format for this data type seems like a separate > >>>>> question -- my preference actually would be to support this for > >>>>> in-memory / C ABI use but not to alter the IPC format. > >>>>> > >>>> > >>>> I think this idea deserves some clarification or at least more > >>> exposition. > >>>> On first reading, it was not clear to me that we might add things to > the > >>>> in-memory Arrow format but not IPC, that that was even an option. I'm > >>>> guessing I'm not the only one who missed that. > >>>> > >>>> If these new types are only part of the Arrow in-memory format, then > it's > >>>> not the case that reading/writing IPC files involves no serialization > >>>> overhead. I recognize that that's technically already the case since > IPC > >>>> supports compression now, but it's not generally how we talk about the > >>>> relationship between the IPC and in-memory formats (see our own FAQ > [1], > >>>> for example). If we go forward with these changes, it would be a good > >>>> opportunity for us to clarify in our docs/website that the "Arrow > format" > >>>> is not a single thing. > >>> > >>> I'm worried that making the "Arrow format" polysemic/context-dependent > >>> would spread a lot of confusion among potential users of Arrow. > >>> > >>> Regards > >>> > >>> Antoine. > >>> >