On Fri, Sep 6, 2019, 12:08 PM Antoine Pitrou <anto...@python.org> wrote:
>
> Null can also come up when converting a column with only NA values in a
> CSV file. I don't remember for sure, but I think the same can happen
> with JSON files as well.
>
> Can't we accept both forms when reading? It sounds like it should be
> reasonably easy.
>

We can't, because the buffer layout is not transmitted -- implementations
make assumptions about which Buffer values correspond to each field. The
only thing we could do to signal the change would be to increase the
metadata version from V4 to V5. (A rough sketch of that reader-side
constraint is below the quoted thread.)

> Regards
>
> Antoine.
>
>
> On 06/09/2019 17:36, Wes McKinney wrote:
> > hi Micah,
> >
> > Null wouldn't come up that often in practice. It could happen when
> > converting from pandas, for example:
> >
> > In [8]: df = pd.DataFrame({'col1': np.array([np.nan] * 10, dtype=object)})
> >
> > In [9]: t = pa.table(df)
> >
> > In [10]: t
> > Out[10]:
> > pyarrow.Table
> > col1: null
> > metadata
> > --------
> > {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
> >             b'stop": 10, "step": 1}], "column_indexes": [{"name": null, "field'
> >             b'_name": null, "pandas_type": "unicode", "numpy_type": "object", '
> >             b'"metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "col1"'
> >             b', "field_name": "col1", "pandas_type": "empty", "numpy_type": "o'
> >             b'bject", "metadata": null}], "creator": {"library": "pyarrow", "v'
> >             b'ersion": "0.14.1.dev464+g40d08a751"}, "pandas_version": "0.24.2"'
> >             b'}'}
> >
> > I'm inclined to make the change without worrying about backwards
> > compatibility. If people have been persisting data against the
> > recommendations of the project, the remedy is to use an older version
> > of the library to read the files and write them to something else
> > (like Parquet format) in the meantime.
> >
> > Obviously, come 1.0.0 we'll begin to make compatibility guarantees, so
> > this will be less of an issue.
> >
> > - Wes
> >
> > On Thu, Sep 5, 2019 at 11:14 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >>
> >> Hi Wes and others,
> >> I don't have a sense of where Null arrays get created in the existing
> >> code base.
> >>
> >> Also, do you think it is worth the effort to make this backwards
> >> compatible? We could in theory tie the buffer count to having the
> >> continuation value for alignment.
> >>
> >> The one area where I'm slightly concerned is that we seem to have users
> >> in the wild who are depending on backwards compatibility, and I'm trying
> >> to better understand the odds that we break them.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >>> hi folks,
> >>>
> >>> One of the as-yet-untested (in integration tests) parts of the
> >>> columnar specification is the Null layout. In C++ we implemented this
> >>> by additionally writing two length-0 "placeholder" buffers in the
> >>> RecordBatch data header, but since the Null layout has no memory
> >>> allocated nor any buffers in-memory, it may be more proper to write no
> >>> buffers (since the length of the Null layout is all you need to
> >>> reconstruct it). There are 3 implementations of the placeholder
> >>> version (C++, Go, JS, maybe also C#), but it never got implemented in
> >>> Java. While technically this would break old serialized data, I would
> >>> not expect this to occur very frequently in currently-deployed Arrow
> >>> applications.
> >>>
> >>> Here is my C++ patch
> >>>
> >>> https://github.com/apache/arrow/pull/5287
> >>>
> >>> I'm not sure we need to formalize this with a vote, but I'm interested
> >>> in the community's feedback on how to proceed here.
> >>>
> >>> - Wes
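To make the reader-side constraint above concrete, here is a minimal sketch
(not the actual pyarrow implementation) of how an IPC reader pairs the
flattened Buffer list from a RecordBatch header with the schema's fields.
The function names and the handful of types handled are made up for
illustration; the point is that the number of buffers consumed per field is
a convention derived from the type (and, with this change, the metadata
version), not something recorded in the message.

# Minimal sketch, not pyarrow internals: pairing the flattened Buffer list
# from a RecordBatch header with the schema fields. All names here
# (buffers_for_type, reconstruct_arrays) are hypothetical.

def buffers_for_type(type_name, metadata_version):
    # Number of buffers the reader expects for one array of this type.
    if type_name == "null":
        # Old convention: two length-0 placeholder buffers.
        # Proposed convention (V5): no buffers at all.
        return 2 if metadata_version <= 4 else 0
    if type_name in ("int32", "float64"):
        return 2   # validity bitmap + values
    if type_name == "utf8":
        return 3   # validity bitmap + offsets + data
    raise NotImplementedError(type_name)

def reconstruct_arrays(fields, flat_buffers, metadata_version):
    # fields: list of (name, type_name); flat_buffers: the Buffer entries
    # from the RecordBatch header, in field order. Nothing in the message
    # says how many buffers each field owns -- the count is pure convention,
    # which is why both Null forms can't coexist at one metadata version.
    arrays, pos = [], 0
    for name, type_name in fields:
        n = buffers_for_type(type_name, metadata_version)
        arrays.append((name, flat_buffers[pos:pos + n]))
        pos += n
    return arrays

Under V4 a null field would consume two placeholder entries from the list;
under V5 it would consume none, so a reader built for one convention would
silently misalign every field that follows if handed the other.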
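For completeness, the alternative Micah mentions -- tying the buffer count
to the presence of the continuation value for alignment -- would roughly
amount to sniffing the message framing instead of bumping the metadata
version. A hedged sketch, assuming the 0xFFFFFFFF continuation indicator
from the 8-byte-alignment framing change; this is illustrative only, not an
existing pyarrow code path:

import struct

# Sketch of the heuristic: messages that start with the 0xFFFFFFFF
# continuation marker (the newer, aligned framing) are assumed to come from
# writers new enough to omit Null buffers; older framing gets the two
# placeholders. Hypothetical names, for illustration only.

CONTINUATION_MARKER = 0xFFFFFFFF

def null_buffer_count(message_prefix: bytes) -> int:
    (first_word,) = struct.unpack("<I", message_prefix[:4])
    return 0 if first_word == CONTINUATION_MARKER else 2

The trade-off is that this couples a type's buffer layout to the stream
framing, whereas a metadata version bump signals the change explicitly.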