Vladimir created ARROW-7867:
-------------------------------

             Summary: ArrowIOError: Invalid Parquet file size is 0 bytes on 
reading from S3
                 Key: ARROW-7867
                 URL: https://issues.apache.org/jira/browse/ARROW-7867
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1, 0.16.0
            Reporter: Vladimir


I'm not sure whether this issue belongs here or in the s3fs library.

The error occurs when reading a partitioned Parquet dataset from S3, in the case 
where the "root folder" of the dataset was created manually before the Parquet 
files were written there.

The steps to reproduce:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

# 1. Create the "folder" s3://bucket.name/data.parquet manually,
#    e.g. in the Cyberduck app

# 2. Write (via pq.write_to_dataset, since pq.write_table does not
#    accept partition_cols or filesystem)
df = pd.DataFrame({'value': [1, 2, 3]})  # any DataFrame reproduces this
table = pa.Table.from_pandas(df)
fs = s3fs.S3FileSystem()
pq.write_to_dataset(table, 's3://bucket.name/data.parquet',
                    partition_cols=[], filesystem=fs)

# 3. Read
pq.read_table('s3://bucket.name/data.parquet', filesystem=fs)
# ArrowIOError: Invalid Parquet file size is 0 bytes
{code}
When the table is partitioned by a non-empty set of columns, the error instead 
reads: "ValueError: Found files in an intermediate directory".
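For illustration, a minimal sketch of that partitioned variant (the DataFrame 
contents and the 'part' column here are hypothetical, not from the original 
report):
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

df = pd.DataFrame({'part': ['a', 'b'], 'value': [1, 2]})
table = pa.Table.from_pandas(df)
fs = s3fs.S3FileSystem()

# Same manually pre-created "folder" as above, but now with a
# non-empty partition column
pq.write_to_dataset(table, 's3://bucket.name/data.parquet',
                    partition_cols=['part'], filesystem=fs)
pq.read_table('s3://bucket.name/data.parquet', filesystem=fs)
# ValueError: Found files in an intermediate directory: ...
{code}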

This is likely because S3 does not have "folders" per se: various tools mimic 
creating an empty folder by writing an empty (zero-size) object to S3, and 
Parquet then confuses this marker object with the actual contents of a Parquet 
file.
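A sketch of how that marker looks when listing the prefix (assuming the 
bucket/key names from the repro above; the result field names may differ 
across s3fs versions):
{code:python}
import s3fs

fs = s3fs.S3FileSystem()

# The manually created "folder" shows up as a zero-byte key next to
# the real data file written by write_to_dataset
for obj in fs.ls('bucket.name/data.parquet', detail=True):
    print(obj['Key'], obj['Size'])
# bucket.name/data.parquet/                    0  <- empty marker object
# bucket.name/data.parquet/<uuid>.parquet   1234  <- actual data
{code}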

At the same time, the s3fs library correctly identifies the key as a folder:
{code:python}
import s3fs

s3fs.S3FileSystem().isdir('s3://bucket.name/data.parquet')  # Returns True
{code}
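A possible workaround (only a sketch under the same assumptions, not verified 
in this report) would be to delete the zero-byte marker object before reading, 
so that only real Parquet files remain under the prefix:
{code:python}
import s3fs

fs = s3fs.S3FileSystem()
marker = 'bucket.name/data.parquet/'  # trailing slash: the empty marker key

# Remove the marker only if it exists and is actually empty
if fs.exists(marker) and fs.info(marker).get('Size') == 0:
    fs.rm(marker)
{code}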

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
