Jeremy Heffner created ARROW-3138:
-------------------------------------

             Summary: 'Couldn't deserialize thrift' error when reading large binary column
                 Key: ARROW-3138
                 URL: https://issues.apache.org/jira/browse/ARROW-3138
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.10.0
         Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3
            Reporter: Jeremy Heffner
         Attachments: parquet-issue-example.py
We've run into issues reading Parquet files that contain long binary columns (utf8 strings). In particular, we were generating WKT representations of polygons that contained ~34 million characters when we ran into the issue.

The attached example generates a dataframe with one record and one column containing a random string of 10^7 characters. Pandas (using the default pyarrow engine) successfully writes the file, but fails when reading it back:

{code:java}
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-25-25d21204cbad> in <module>()
----> 1 df_read_in = pd.read_parquet('test.parquet')

~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    286
    287     impl = get_engine(engine)
--> 288     return impl.read(path, columns=columns, **kwargs)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    129         kwargs['use_pandas_metadata'] = True
    130         result = self.api.parquet.read_table(path, columns=columns,
--> 131                                              **kwargs).to_pandas()
    132         if should_close:
    133             try:

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
   1044         fs = _get_fs_from_path(source)
   1045         return fs.read_parquet(source, columns=columns, metadata=metadata,
-> 1046                                use_pandas_metadata=use_pandas_metadata)
   1047
   1048     pf = ParquetFile(source, metadata=metadata)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
    175                                  filesystem=self)
    176         return dataset.read(columns=columns, nthreads=nthreads,
--> 177                             use_pandas_metadata=use_pandas_metadata)
    178
    179     def open(self, path, mode='rb'):

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
    896                                     partitions=self.partitions,
    897                                     open_file_func=open_file,
--> 898                                     use_pandas_metadata=use_pandas_metadata)
    899             tables.append(table)
    900

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
    459             table = reader.read_row_group(self.row_group, **options)
    460         else:
--> 461             table = reader.read(**options)
    462
    463         if len(self.partition_keys) > 0:

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
    150             columns, use_pandas_metadata=use_pandas_metadata)
    151         return self.reader.read_all(column_indices=column_indices,
--> 152                                     nthreads=nthreads)
    153
    154     def scan_contents(self, columns=None, batch_size=65536):

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.
{code}
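For reference, since the attached parquet-issue-example.py isn't inlined in this message, a minimal sketch of a reproduction along the lines described above; the column name and the way the random string is built are assumptions for illustration, not the attachment's exact code:

{code:python}
import numpy as np
import pandas as pd

# One record, one column holding a random string of 10**7 characters
# (column name 'wkt' is assumed for illustration).
rng = np.random.RandomState(0)
big_string = ''.join(rng.choice(list('ABCDEFGHIJ'), size=10**7))
df = pd.DataFrame({'wkt': [big_string]})

# Writing succeeds with the default pyarrow engine...
df.to_parquet('test.parquet')

# ...but reading the file back fails with
# ArrowIOError: Couldn't deserialize thrift: No more data to read.
df_read_in = pd.read_parquet('test.parquet')
{code}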