Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Andrew Lamb Wed, 12 Jan 2022 05:34:13 -0800

I also agree that splitting the StringView proposal into its own thing
would be beneficial for discussion clarity


On Wed, Jan 12, 2022 at 5:34 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 12/01/2022 à 01:49, Wes McKinney a écrit :
> > hi all,
> >
> > Thank you for all the comments on this mailing list thread and in the
> > Google document. There is definitely a lot of work to take some next
> > steps from here, so I think it would make sense to fork off each of
> > the proposed additions into dedicated discussions. The most
> > contentious issue, it seems, is whether to maintain a 1-to-1
> > relationship between the IPC format and the C ABI, which would make it
> > rather difficult to implement the "string view" data type in a way
> > that is flexible and useful to applications (for example, giving them
> > control over their own memory management as opposed to forcing data to
> > be "pre-serialized" into buffers that are referenced by offsets).
> >
> > I tend to be of the "practicality beats purity" mindset, where
> > sufficiently beneficial changes to the in-memory format (and C ABI)
> > may be worth breaking the implicit contract where the IPC format and
> > the in-memory data structures have a strict 1-to-1 relationship. I
> > suggest to help reach some consensus around this that I will create a
> > new document focused only on the "string/binary view" type and the
> > different implementation considerations (like what happens when you
> > write it to the IPC format), as well as the different variants of the
> > data structure itself that have been discussed with the associated
> > trade-offs. Does this sound like a good approach?
>
> Indeed, this sounds like it will help making a decision.
>
> Personally, I am still very concerned by the idea of adding pointers to
> the in-memory representation. Besides the loss of equivalence with the
> IPC format, a representation using embedded pointers cannot be fully
> validated for safety or correctness (how do you decide whether a pointer
> is correct and doesn't reveal unrelated data?).
>
> I think we should discuss this with the DuckDB folks (and possibly the
> Velox folks, but I guess that it might socio-politically more difficult)
> so as to measure how much of an inconvenience it would be for them to
> switch to a purely offsets-based approach.
>
> Regards
>
> Antoine.
>
>
>
> >
> > Thanks,
> > Wes
> >
> >
> > On Sat, Jan 8, 2022 at 7:30 AM Jorge Cardoso Leitão
> > <jorgecarlei...@gmail.com> wrote:
> >>
> >> Fair enough (wrt to deprecation). Think that the sequence view is a
> >> replacement for our existing (that allows O(N) selections), but I agree
> >> with the sentiment that preserving compatibility is more important than
> a
> >> single way of doing it. Thanks for that angle!
> >>
> >> Imo the Arrow format is already composed of 3 specifications:
> >>
> >> * C data interface (intra-process communication)
> >> * IPC format (inter-process communication)
> >> * Flight (RPC protocol)
> >>
> >> E.g.
> >> * IPC requires a `dict_id` in the fields declaration, but the C data
> >> interface has no such requirement (because, pointers)
> >> * IPC accepts endian and compression, the C data interface does not
> >> * DataFusion does not support IPC (yet ^_^), but its Python bindings
> >> leverage the C data interface to pass data to pyarrow
> >>
> >> This to say that imo as long as we document the different specifications
> >> that compose Arrow and their intended purposes, it is ok. Because the c
> >> data interface is the one with the highest constraints (zero-copy,
> higher
> >> chance of out of bound reads, etc.), it makes sense for proposals (and
> >> implementations) first be written against it.
> >>
> >>
> >> I agree with Neal's point wrt to the IPC. For extra context, many
> `async`
> >> implementations use cooperative scheduling, which are vulnerable to DOS
> if
> >> they need to perform heavy CPU-bound tasks (as the p-thread is blocked
> and
> >> can't switch). QP Hou and I have summarized a broader version of this
> >> statement here [1].
> >>
> >> In async contexts, If deserializing from IPC requires a significant
> amount
> >> of compute, that task should (to avoid blocking) be sent to a separate
> >> thread pool to avoid blocking the p-threads assigned to the runtime's
> >> thread pool. If the format is O(1) in CPU-bounded work, its execution
> can
> >> be done in an async context without a separate thread pool. Arrow's IPC
> >> format is quite unique there in that it requires almost always O(1) CPU
> >> work to be loaded to memory (at the expense of more disk usage).
> >>
> >> I believe that atm we have two O(N) blocking tasks in reading IPC format
> >> (decompression and byte swapping (big <-> little endian)), and three
> O(N)
> >> blocking tasks in writing (compression, de-offset bitmaps, byte
> swapping).
> >> The more prevalent O(N) CPU-bound tasks are at the IPC interface, the
> less
> >> compelling it becomes vs e.g. parquet (file) or avro (stream), which
> have
> >> an expectation of CPU-bound work. In this context, keeping the IPC
> format
> >> compatible with the ABI spec is imo an important characteristic of
> Apache
> >> Arrow that we should strive to preserve. Alternatively, we could also
> just
> >> abandon this idea and say that the format expects CPU-bound tasks to
> >> deserialize (even if considerably smaller than avro or parquet), so that
> >> people can design the APIs accordingly.
> >>
> >> Best,
> >> Jorge
> >>
> >> [1]
> >>
> https://jorgecarleitao.medium.com/how-to-efficiently-load-data-to-memory-d65ee359196c
> >>
> >>
> >> On Sun, Dec 26, 2021 at 5:31 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >>
> >>>
> >>>
> >>> Le 23/12/2021 à 17:59, Neal Richardson a écrit :
> >>>>> I think in this particular case, we should consider the C ABI /
> >>>>> in-memory representation and IPC format as separate beasts. If an
> >>>>> implementation of Arrow does not want to use this string-view array
> >>>>> type at all (for example, if it created memory safety issues in
> Rust),
> >>>>> then it can choose to convert to the existing string array
> >>>>> representation when receiving a C ABI payload. Whether or not there
> is
> >>>>> an alternate IPC format for this data type seems like a separate
> >>>>> question -- my preference actually would be to support this for
> >>>>> in-memory / C ABI use but not to alter the IPC format.
> >>>>>
> >>>>
> >>>> I think this idea deserves some clarification or at least more
> >>> exposition.
> >>>> On first reading, it was not clear to me that we might add things to
> the
> >>>> in-memory Arrow format but not IPC, that that was even an option. I'm
> >>>> guessing I'm not the only one who missed that.
> >>>>
> >>>> If these new types are only part of the Arrow in-memory format, then
> it's
> >>>> not the case that reading/writing IPC files involves no serialization
> >>>> overhead. I recognize that that's technically already the case since
> IPC
> >>>> supports compression now, but it's not generally how we talk about the
> >>>> relationship between the IPC and in-memory formats (see our own FAQ
> [1],
> >>>> for example). If we go forward with these changes, it would be a good
> >>>> opportunity for us to clarify in our docs/website that the "Arrow
> format"
> >>>> is not a single thing.
> >>>
> >>> I'm worried that making the "Arrow format" polysemic/context-dependent
> >>> would spread a lot of confusion among potential users of Arrow.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
>

Re: [DISCUSS] Adding new columnar memory layouts to Arrow (in-memory, IPC, C ABI)

Reply via email to