[ https://issues.apache.org/jira/browse/ARROW-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-11638:
------------------------------------------
    Component/s: Python

> [Python][Parquet] Can't read directory of Parquet data saved by PySpark via
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-11638
>                 URL: https://issues.apache.org/jira/browse/ARROW-11638
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>         Environment: Ubuntu 18.04
>            Reporter: Russell Jurney
>            Priority: Major
>
> I saved a large Parquet dataset with 200 partitions in PySpark. I am trying
> to load it for use in Dask. Dask's own utilities for this don't work, so I am
> trying pyarrow, which has worked in the past when the data is converted into a
> Table and then a dataframe. I have verified that one of the file partitions
> loads just fine.
>
> This code:
> {code:java}
> import gcsfs
> import pyarrow.parquet as pq
>
> fs_gcs = gcsfs.GCSFileSystem(project="my-space-1234")
> table = pq.ParquetDataset("gs://open-corporates/parquet/2020_10/officers.parquet",
>                           filesystem=fs_gcs)
> table
> {code}
> I get this exception:
> {code:java}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-69-234baf611ba5> in <module>
>       5
>       6
> ----> 7 table = pq.ParquetDataset("gs://open-corporates/parquet/2020_10/officers.parquet", filesystem=fs_gcs)
>       8 table
>
> /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, read_dictionary, memory_map, buffer_size, partitioning, use_legacy_dataset)
>    1274
>    1275         if validate_schema:
> ->  1276             self.validate_schemas()
>    1277
>    1278     def equals(self, other):
>
> /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/parquet.py in validate_schemas(self)
>    1302                 self.schema = self.common_metadata.schema
>    1303             else:
> ->  1304                 self.schema = self.pieces[0].get_metadata().schema
>    1305         elif self.schema is None:
>    1306             self.schema = self.metadata.schema
>
> /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/parquet.py in get_metadata(self)
>     734         metadata : FileMetaData
>     735         """
> --> 736         f = self.open()
>     737         return f.metadata
>     738
>
> /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/parquet.py in open(self)
>     741         Return instance of ParquetFile.
742 """ --> 743 reader = > self.open_file_func(self.path) 744 if not isinstance(reader, ParquetFile): > 745 reader = ParquetFile(reader, **self.file_options) > /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/parquet.py > in _open_dataset_file(dataset, path, meta) 1111 not isinstance(dataset.fs, > legacyfs.LocalFileSystem)): 1112 path = dataset.fs.open(path, mode='rb') -> > 1113 return ParquetFile( 1114 path, 1115 metadata=meta, > /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/parquet.py > in __init__(self, source, metadata, common_metadata, read_dictionary, > memory_map, buffer_size) 215 read_dictionary=None, memory_map=False, > buffer_size=0): 216 self.reader = ParquetReader() --> 217 > self.reader.open(source, use_memory_map=memory_map, 218 > buffer_size=buffer_size, 219 read_dictionary=read_dictionary, > metadata=metadata) > /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/_parquet.pyx > in pyarrow._parquet.ParquetReader.open() > /opt/conda/envs/deep_discovery/lib/python3.8/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status() ArrowInvalid: Parquet file size is 0 bytes > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)