[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662050#comment-17662050 ]
Rok Mihevc commented on ARROW-5028:
-----------------------------------

This issue has been migrated to [issue #21524|https://github.com/apache/arrow/issues/21524] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Python][C++] Creating list<string> with pyarrow.array can overflow child builder
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-5028
>                 URL: https://issues.apache.org/jira/browse/ARROW-5028
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1, 0.13.0
>         Environment: python 3.6
>            Reporter: Marco Neumann
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.15.0
>
>         Attachments: dct.json.gz, dct.pickle.gz
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> I am sorry if this bug report feels rather long and the reproduction data is large, but I was not able to reduce the data any further while still triggering the problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
>
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
>
>
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
>
>
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     # check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
>
>
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
>
> # this does NOT help:
> # pa.set_cpu_count(1)
> # import gc; gc.disable()
>
> table = dct_to_table(dct)
>
> # this fixes the issue:
> # table = pa.Table.from_pandas(table.to_pandas())
>
> table2 = roundtrip(table)
>
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
>
> # if table2 is converted to pandas, you can also observe that some values at
> # the end of column b are `['']`, which clearly are not present in the original data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on how to reduce the test case.