Young-Jun Ko created ARROW-1842:
-----------------------------------

             Summary: ParquetDataset.read(): selectively reading array column
                 Key: ARROW-1842
                 URL: https://issues.apache.org/jira/browse/ARROW-1842
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.1
            Reporter: Young-Jun Ko

Scenario:
- created a dataframe in Spark and saved it as Parquet
- columns include simple types, e.g. String, but also an array of doubles

Issue:
I can read the whole dataset using ParquetDataset in pyarrow.
Selectively reading a simple-type column works.
Selectively reading the array column raises a KeyError in the following place:

    KeyError: 'c'
    /home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
        513             self.column_idx_map[col_bytes] = i
        514
    --> 515         return self.column_idx_map[tobytes(column_name)]

When I just read the whole dataset, I get the correct metadata:

    pyarrow.Table
    a: string
    b: string
    c: list<element: double not null>
      child 0, element: double
    d: int64
    metadata
    --------
    {'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}

I might just be missing the correct naming convention for the array column, but in that case the name should be reflected in the metadata.

Thanks!
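
To make the mismatch concrete, here is a minimal sketch using only the Python standard library, so it does not depend on a particular pyarrow version. The first part parses the Spark row metadata quoted in the report and lists the top-level field names, which do include `c`. The second part is a toy stand-in for the lookup shown in the traceback. The leaf path `c.list.element` and the helper `toy_column_name_idx` are assumptions made for illustration; the report does not confirm how the reader keys its column map.

```python
import json

# Spark row metadata copied verbatim from the table metadata in the report.
SPARK_ROW_METADATA = (
    '{"type":"struct","fields":['
    '{"name":"a","type":"string","nullable":true,"metadata":{}},'
    '{"name":"b","type":"string","nullable":true,"metadata":{}},'
    '{"name":"c","type":{"type":"array","elementType":"double",'
    '"containsNull":false},"nullable":true,"metadata":{}},'
    '{"name":"d","type":"long","nullable":false,"metadata":{}}]}'
)


def top_level_fields(row_metadata):
    """Return (name, type description) pairs for each top-level Spark field."""
    fields = json.loads(row_metadata)["fields"]
    result = []
    for field in fields:
        spark_type = field["type"]
        if isinstance(spark_type, dict):
            # Nested types are JSON objects, e.g. {"type": "array", ...}.
            description = "{}<{}>".format(
                spark_type["type"], spark_type.get("elementType")
            )
        else:
            description = spark_type
        result.append((field["name"], description))
    return result


# ASSUMPTION (not confirmed by the report): the reader's lookup map is keyed
# by Parquet leaf column paths, and the array leaf is stored under a nested
# path such as "c.list.element" rather than the top-level name "c".
HYPOTHETICAL_LEAF_PATHS = ["a", "b", "c.list.element", "d"]


def toy_column_name_idx(column_name, leaf_paths=HYPOTHETICAL_LEAF_PATHS):
    """Hypothetical stand-in for the lookup in the traceback, not pyarrow code.

    Builds a name -> index map and raises KeyError when the name is absent.
    """
    column_idx_map = {path: i for i, path in enumerate(leaf_paths)}
    return column_idx_map[column_name]


if __name__ == "__main__":
    for name, description in top_level_fields(SPARK_ROW_METADATA):
        print(name, description)
```

Under that assumption, `toy_column_name_idx("a")` succeeds while `toy_column_name_idx("c")` raises `KeyError: 'c'`, matching the symptom: the top-level name appears in the Spark metadata but not among the keys used for the lookup.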