Re: [DISCUSS] IPC buffer layout for Null type

Micah Kornfield Fri, 20 Sep 2019 02:38:11 -0700

I think committing as is, is the better of the two options.

On Thu, Sep 19, 2019 at 12:35 PM Wes McKinney <[email protected]> wrote:


> OK, my preference, therefore, would be to rebase and merge my patch
> without bothering with backwards compatibility code. The situations
> where there would be an issue are fairly esoteric.
>
> https://github.com/apache/arrow/pull/5287
>
> On Thu, Sep 19, 2019 at 2:29 PM Antoine Pitrou <[email protected]>
> wrote:
> >
> >
> > Well, this is an incompatible IPC change, so ideally it should be done
> > now, not later.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 19 Sep 2019 14:08:37 -0500
> > Wes McKinney <[email protected]> wrote:
> >
> > > I'm concerned about rushing through any patch for this for 0.15.0, but
> > > each release with the status quo increases the risk of making changes.
> > > Thoughts?
> > >
> > > On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney <[email protected]>
> wrote:
> > > >
> > > > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield <
> [email protected]> wrote:
> > > > >
> > > > > >
> > > > > > We can't because the buffer layout is not transmitted --
> implementations
> > > > > > make assumptions about what Buffer values correspond to each
> field. The
> > > > > > only thing we could do to signal the change would be to increase
> the
> > > > > > metadata version from V4 to V5.
> > > > >
> > > > > If we do this within 0.15.0 we could infer from the padding of
> messages.
> > > > >
> > > >
> > > > That's true. I'd be OK adding backward compatibility code (that we
> can
> > > > probably remove later) to my patch...
> > > >
> > > > I'm not sure about the other implementations. I think for non-C++
> > > > implementations because they don't have much application code that
> can
> > > > produce Null arrays that they should simply use the no-buffers layout
> > > >
> > > > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney <[email protected]>
> wrote:
> > > > >
> > > > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou <[email protected]>
> wrote:
> > > > > >
> > > > > > >
> > > > > > > Null can also come up when converting a column with only NA
> values in a
> > > > > > > CSV file.  I don't remember for sure, but I think the same can
> happen
> > > > > > > with JSON files as well.
> > > > > > >
> > > > > > > Can't we accept both forms when reading?  It sounds like it
> should be
> > > > > > > reasonably easy.
> > > > > > >
> > > > > >
> > > > > > We can't because the buffer layout is not transmitted --
> implementations
> > > > > > make assumptions about what Buffer values correspond to each
> field. The
> > > > > > only thing we could do to signal the change would be to increase
> the
> > > > > > metadata version from V4 to V5.
> > > > > >
> > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > Le 06/09/2019 à 17:36, Wes McKinney a écrit :
> > > > > > > > hi Micah,
> > > > > > > >
> > > > > > > > Null wouldn't come up that often in practice. It could
> happen when
> > > > > > > > converting from pandas, for example
> > > > > > > >
> > > > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10,
> > > > > > > dtype=object)})
> > > > > > > >
> > > > > > > > In [9]: t = pa.table(df)
> > > > > > > >
> > > > > > > > In [10]: t
> > > > > > > > Out[10]:
> > > > > > > > pyarrow.Table
> > > > > > > > col1: null
> > > > > > > > metadata
> > > > > > > > --------
> > > > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name":
> null,
> > > > > > > "start": 0, "'
> > > > > > > >             b'stop": 10, "step": 1}], "column_indexes":
> [{"name": null,
> > > > > > > "field'
> > > > > > > >             b'_name": null, "pandas_type": "unicode",
> "numpy_type":
> > > > > > > "object", '
> > > > > > > >             b'"metadata": {"encoding": "UTF-8"}}],
> "columns": [{"name":
> > > > > > > "col1"'
> > > > > > > >             b', "field_name": "col1", "pandas_type": "empty",
> > > > > > > "numpy_type": "o'
> > > > > > > >             b'bject", "metadata": null}], "creator":
> {"library":
> > > > > > > "pyarrow", "v'
> > > > > > > >             b'ersion": "0.14.1.dev464+g40d08a751"},
> "pandas_version":
> > > > > > > "0.24.2"'
> > > > > > > >             b'}'}
> > > > > > > >
> > > > > > > > I'm inclined to make the change without worrying about
> backwards
> > > > > > > > compatibility. If people have been persisting data against
> the
> > > > > > > > recommendations of the project, the remedy is to use an
> older version
> > > > > > > > of the library to read the files and write them to something
> else
> > > > > > > > (like Parquet format) in the meantime.
> > > > > > > >
> > > > > > > > Obviously come 1.0.0 we'll begin to make compatibility
> guarantees so
> > > > > > > > this will be less of an issue.
> > > > > > > >
> > > > > > > > - Wes
> > > > > > > >
> > > > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield <
> [email protected]
> > > > > > >
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> Hi Wes and others,
> > > > > > > >> I don't have a sense of where Null arrays get created in
> the existing
> > > > > > > code
> > > > > > > >> base?
> > > > > > > >>
> > > > > > > >> Also, do you think it is worth the effort make this
> backwards
> > > > > > > compatible.
> > > > > > > >> We could in theory tie the buffer count to having the
> continuation
> > > > > > value
> > > > > > > >> for alignment.
> > > > > > > >>
> > > > > > > >> The one area were I'm slightly concerned is we seem to have
> users in
> > > > > > the
> > > > > > > >> wild who are depending on backwards compatibility, and I'm
> try to
> > > > > > better
> > > > > > > >> understand the odds that we break them.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Micah
> > > > > > > >>
> > > > > > > >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney <
> [email protected]>
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >>> hi folks,
> > > > > > > >>>
> > > > > > > >>> One of the as-yet-untested (in integration tests) parts of
> the
> > > > > > > >>> columnar specification is the Null layout. In C++ we
> additionally
> > > > > > > >>> implemented this by writing two length-0 "placeholder"
> buffers in the
> > > > > > > >>> RecordBatch data header, but since the Null layout has no
> memory
> > > > > > > >>> allocated nor any buffers in-memory it may be more proper
> to write no
> > > > > > > >>> buffers (since the length of the Null layout is all you
> need to
> > > > > > > >>> reconstruct it). There are 3 implementations of the
> placeholder
> > > > > > > >>> version (C++, Go, JS, maybe also C#) but it never got
> implemented in
> > > > > > > >>> Java. While technically this would break old serialized
> data, I would
> > > > > > > >>> not expect this to be very frequently occurring in many of
> the
> > > > > > > >>> currently-deployed Arrow applications
> > > > > > > >>>
> > > > > > > >>> Here is my C++ patch
> > > > > > > >>>
> > > > > > > >>> https://github.com/apache/arrow/pull/5287
> > > > > > > >>>
> > > > > > > >>> I'm not sure we need to formalize this with a vote but I'm
> interested
> > > > > > > >>> in the community's feedback on how to proceed here.
> > > > > > > >>>
> > > > > > > >>> - Wes
> > > > > > > >>>
> > > > > > >
> > > > > >
> > >
> >
> >
> >
>

Re: [DISCUSS] IPC buffer layout for Null type

Reply via email to