Hi, I've been doing some work trying to get the parquet read path going for the python iceberg <https://github.com/apache/incubator-iceberg> library. I have two questions that I couldn't figure out, and was hoping I could get some guidance from the list here.
First, I'd like to create a ParquetSchema -> IcebergSchema converter, but it appears that only limited information is available in the ColumnSchema objects passed back to the python client (this is for the table in [2]):

    <ParquetColumnSchema>
      name: key
      path: m.map.key
      max_definition_level: 2
      max_repetition_level: 1
      physical_type: BYTE_ARRAY
      logical_type: UTF8

    <ParquetColumnSchema>
      name: key
      path: m.map.value.map.key
      max_definition_level: 4
      max_repetition_level: 2
      physical_type: BYTE_ARRAY
      logical_type: UTF8

    <ParquetColumnSchema>
      name: value
      path: m.map.value.map.value
      max_definition_level: 5
      max_repetition_level: 2
      physical_type: BYTE_ARRAY
      logical_type: UTF8

where physical_type and logical_type are both strings [1]. The arrow schema I can get from to_arrow_schema() looks to be more expressive (although maybe I just don't understand the parquet format well enough):

    m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
      child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
        child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
          child 0, key: string
          child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
            child 0, map: list<map: struct<key: string, value: string> not null>
              child 0, map: struct<key: string, value: string>
                child 0, key: string
                child 1, value: string

It seems like I can infer the info from the name/path, but is there a more direct way of getting the detailed parquet schema information?

Second question: is there a way to push record-level filtering into the parquet reader, so that the reader only reads in values that match a given predicate expression? Predicate expressions would be simple field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not null) connected with logical operators (AND, OR, NOT). I've seen that after reading in I can use the filtering language in gandiva [3] to get filtered record batches, but I was looking for somewhere lower in the stack if possible.

To make both questions concrete, I've put two small sketches below the references.

[1] https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
[2] Spark/Hive table DDL for this parquet file looks like:
    CREATE TABLE `iceberg`.`nested_map` (m map<string,map<string,string>>)
[3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
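For the first question, the two views above come from something like the following sketch ("nested_map.parquet" is just a stand-in name for a file written from the DDL in [2]):

    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("nested_map.parquet")  # stand-in for a file matching [2]

    # Flat leaf-column view: one ColumnSchema per leaf column, carrying only
    # the stringly-typed physical_type/logical_type info from [1].
    for i in range(pf.metadata.num_columns):
        col = pf.schema.column(i)
        print(col.path, col.physical_type, col.logical_type,
              col.max_definition_level, col.max_repetition_level)

    # Nested view: walk the arrow schema to recover the struct/list shape
    # that the flat ColumnSchema view doesn't expose directly.
    def describe(typ, indent=1):
        if pa.types.is_struct(typ):
            for i in range(typ.num_fields):
                child = typ.field(i)
                print("%s%s: %s" % ("  " * indent, child.name, child.type))
                describe(child.type, indent + 1)
        elif pa.types.is_list(typ):
            describe(typ.value_type, indent)

    for field in pf.schema.to_arrow_schema():
        print("%s: %s" % (field.name, field.type))
        describe(field.type)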
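For the second question, this is the kind of post-read filtering I mean, adapted from the test in [3]; the predicate is evaluated on record batches only after the parquet read, which is the step I'd like to push lower:

    import pyarrow as pa
    import pyarrow.gandiva as gandiva

    # Stand-in for record batches already materialized by the parquet reader.
    table = pa.Table.from_arrays([pa.array([1.0, 2.0, 3.0, 4.0])], ["a"])

    # Build the predicate a < 3.0; real predicates would be field-to-literal
    # comparisons composed with AND/OR/NOT.
    builder = gandiva.TreeExprBuilder()
    node_a = builder.make_field(table.schema.field_by_name("a"))
    three = builder.make_literal(3.0, pa.float64())
    condition = builder.make_condition(
        builder.make_function("less_than", [node_a, three], pa.bool_()))

    row_filter = gandiva.make_filter(table.schema, condition)
    for batch in table.to_batches():
        # SelectionVector of the row indices that satisfy the predicate.
        selected = row_filter.evaluate(batch, pa.default_memory_pool())
        print(selected.to_array())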