I think committing as is, is the better of the two options. On Thu, Sep 19, 2019 at 12:35 PM Wes McKinney <wesmck...@gmail.com> wrote:
> OK, my preference, therefore, would be to rebase and merge my patch > without bothering with backwards compatibility code. The situations > where there would be an issue are fairly esoteric. > > https://github.com/apache/arrow/pull/5287 > > On Thu, Sep 19, 2019 at 2:29 PM Antoine Pitrou <solip...@pitrou.net> > wrote: > > > > > > Well, this is an incompatible IPC change, so ideally it should be done > > now, not later. > > > > Regards > > > > Antoine. > > > > > > On Thu, 19 Sep 2019 14:08:37 -0500 > > Wes McKinney <wesmck...@gmail.com> wrote: > > > > > I'm concerned about rushing through any patch for this for 0.15.0, but > > > each release with the status quo increases the risk of making changes. > > > Thoughts? > > > > > > On Fri, Sep 6, 2019 at 12:59 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > > > > On Fri, Sep 6, 2019 at 12:57 PM Micah Kornfield < > emkornfi...@gmail.com> wrote: > > > > > > > > > > > > > > > > > We can't because the buffer layout is not transmitted -- > implementations > > > > > > make assumptions about what Buffer values correspond to each > field. The > > > > > > only thing we could do to signal the change would be to increase > the > > > > > > metadata version from V4 to V5. > > > > > > > > > > If we do this within 0.15.0 we could infer from the padding of > messages. > > > > > > > > > > > > > That's true. I'd be OK adding backward compatibility code (that we > can > > > > probably remove later) to my patch... > > > > > > > > I'm not sure about the other implementations. I think for non-C++ > > > > implementations because they don't have much application code that > can > > > > produce Null arrays that they should simply use the no-buffers layout > > > > > > > > > On Fri, Sep 6, 2019 at 10:16 AM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > > > > > > > On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou <anto...@python.org> > wrote: > > > > > > > > > > > > > > > > > > > > Null can also come up when converting a column with only NA > values in a > > > > > > > CSV file. I don't remember for sure, but I think the same can > happen > > > > > > > with JSON files as well. > > > > > > > > > > > > > > Can't we accept both forms when reading? It sounds like it > should be > > > > > > > reasonably easy. > > > > > > > > > > > > > > > > > > > We can't because the buffer layout is not transmitted -- > implementations > > > > > > make assumptions about what Buffer values correspond to each > field. The > > > > > > only thing we could do to signal the change would be to increase > the > > > > > > metadata version from V4 to V5. > > > > > > > > > > > > > > > > > > > Regards > > > > > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > > > > > > > Le 06/09/2019 à 17:36, Wes McKinney a écrit : > > > > > > > > hi Micah, > > > > > > > > > > > > > > > > Null wouldn't come up that often in practice. It could > happen when > > > > > > > > converting from pandas, for example > > > > > > > > > > > > > > > > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10, > > > > > > > dtype=object)}) > > > > > > > > > > > > > > > > In [9]: t = pa.table(df) > > > > > > > > > > > > > > > > In [10]: t > > > > > > > > Out[10]: > > > > > > > > pyarrow.Table > > > > > > > > col1: null > > > > > > > > metadata > > > > > > > > -------- > > > > > > > > {b'pandas': b'{"index_columns": [{"kind": "range", "name": > null, > > > > > > > "start": 0, "' > > > > > > > > b'stop": 10, "step": 1}], "column_indexes": > [{"name": null, > > > > > > > "field' > > > > > > > > b'_name": null, "pandas_type": "unicode", > "numpy_type": > > > > > > > "object", ' > > > > > > > > b'"metadata": {"encoding": "UTF-8"}}], > "columns": [{"name": > > > > > > > "col1"' > > > > > > > > b', "field_name": "col1", "pandas_type": "empty", > > > > > > > "numpy_type": "o' > > > > > > > > b'bject", "metadata": null}], "creator": > {"library": > > > > > > > "pyarrow", "v' > > > > > > > > b'ersion": "0.14.1.dev464+g40d08a751"}, > "pandas_version": > > > > > > > "0.24.2"' > > > > > > > > b'}'} > > > > > > > > > > > > > > > > I'm inclined to make the change without worrying about > backwards > > > > > > > > compatibility. If people have been persisting data against > the > > > > > > > > recommendations of the project, the remedy is to use an > older version > > > > > > > > of the library to read the files and write them to something > else > > > > > > > > (like Parquet format) in the meantime. > > > > > > > > > > > > > > > > Obviously come 1.0.0 we'll begin to make compatibility > guarantees so > > > > > > > > this will be less of an issue. > > > > > > > > > > > > > > > > - Wes > > > > > > > > > > > > > > > > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield < > emkornfi...@gmail.com > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > >> Hi Wes and others, > > > > > > > >> I don't have a sense of where Null arrays get created in > the existing > > > > > > > code > > > > > > > >> base? > > > > > > > >> > > > > > > > >> Also, do you think it is worth the effort make this > backwards > > > > > > > compatible. > > > > > > > >> We could in theory tie the buffer count to having the > continuation > > > > > > value > > > > > > > >> for alignment. > > > > > > > >> > > > > > > > >> The one area were I'm slightly concerned is we seem to have > users in > > > > > > the > > > > > > > >> wild who are depending on backwards compatibility, and I'm > try to > > > > > > better > > > > > > > >> understand the odds that we break them. > > > > > > > >> > > > > > > > >> Thanks, > > > > > > > >> Micah > > > > > > > >> > > > > > > > >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney < > wesmck...@gmail.com> > > > > > > > wrote: > > > > > > > >> > > > > > > > >>> hi folks, > > > > > > > >>> > > > > > > > >>> One of the as-yet-untested (in integration tests) parts of > the > > > > > > > >>> columnar specification is the Null layout. In C++ we > additionally > > > > > > > >>> implemented this by writing two length-0 "placeholder" > buffers in the > > > > > > > >>> RecordBatch data header, but since the Null layout has no > memory > > > > > > > >>> allocated nor any buffers in-memory it may be more proper > to write no > > > > > > > >>> buffers (since the length of the Null layout is all you > need to > > > > > > > >>> reconstruct it). There are 3 implementations of the > placeholder > > > > > > > >>> version (C++, Go, JS, maybe also C#) but it never got > implemented in > > > > > > > >>> Java. While technically this would break old serialized > data, I would > > > > > > > >>> not expect this to be very frequently occurring in many of > the > > > > > > > >>> currently-deployed Arrow applications > > > > > > > >>> > > > > > > > >>> Here is my C++ patch > > > > > > > >>> > > > > > > > >>> https://github.com/apache/arrow/pull/5287 > > > > > > > >>> > > > > > > > >>> I'm not sure we need to formalize this with a vote but I'm > interested > > > > > > > >>> in the community's feedback on how to proceed here. > > > > > > > >>> > > > > > > > >>> - Wes > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > >