[ https://issues.apache.org/jira/browse/ARROW-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rok Mihevc updated ARROW-4516:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21067

> [Python] Error while creating a ParquetDataset on a path without
> `_common_dataset` but with an empty `_tempfile`
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-4516
>                 URL: https://issues.apache.org/jira/browse/ARROW-4516
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0
>            Reporter: yogesh garg
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> I suspect that there's an error in this line of code:
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L926
> While validating the schema in the initialisation of a {{ParquetDataset}}, we
> assume that if the {{_common_metadata}} file does not exist, the schema should
> be inferred from the first piece of that dataset. The first piece, in my
> experience, could refer to a file whose name starts with an underscore. Such a
> file does not necessarily contain the schema, and could even be an empty file,
> e.g. {{_tempfile}}.
> {code:bash}
> /tmp/pq/
> ├── part1.parquet
> └── _tempfile
> {code}
> This behavior is allowed by the parquet specification, and we should probably
> ignore such pieces.
> On a cursory look, we could do any of the following.
> 1. Choose the first piece with a path that does not start with "_".
> 2. Sort pieces by name, but put all the "_" pieces later while making the
> manifest:
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L729
> 3. Silently exclude all the files starting with "_" here, but this will need
> to be tested:
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L770
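
To make the three options in the report concrete, here is a minimal sketch in plain Python. It works on a list of path strings and does not touch pyarrow at all; the helper names (`is_underscore_file`, `first_data_piece`, `sort_underscore_last`, `exclude_underscore`) are hypothetical and are not taken from `parquet.py`. It assumes that "starts with `_`" refers to the file name rather than the full path.

```python
import os


def is_underscore_file(path):
    """Return True if the file name (not the full path) starts with '_'."""
    return os.path.basename(path).startswith("_")


def first_data_piece(paths):
    """Option 1: return the first path whose file name does not start with '_'."""
    for path in paths:
        if not is_underscore_file(path):
            return path
    return None  # only underscore files (or nothing) were found


def sort_underscore_last(paths):
    """Option 2: sort by name, but move all '_' files to the end."""
    # False sorts before True, so regular files come first.
    return sorted(paths, key=lambda p: (is_underscore_file(p), p))


def exclude_underscore(paths):
    """Option 3: drop every file whose name starts with '_'."""
    return [p for p in paths if not is_underscore_file(p)]


# The layout from the report, in plain lexicographic order
# ('_' sorts before lowercase letters in ASCII).
paths = sorted(["/tmp/pq/part1.parquet", "/tmp/pq/_tempfile"])

print(first_data_piece(paths))
print(sort_underscore_last(paths))
print(exclude_underscore(paths))
```

With a plain lexicographic sort, `_tempfile` lands before `part1.parquet`, which is consistent with the empty file being picked as the "first piece" in the report. Whether any of these helpers maps cleanly onto the manifest code linked above would still need to be checked against the actual implementation.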