Hi,

I've been doing some work to get the Parquet read path going for the
Python Iceberg <https://github.com/apache/incubator-iceberg> library.  I
have two questions that I couldn't figure out, and was hoping I could
get some guidance from the list here.

First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
appears that only limited information is available in the ColumnSchema
objects passed back to the Python client (the file in question is described
in [2]):

<ParquetColumnSchema>
  name: key
  path: m.map.key
  max_definition_level: 2
  max_repetition_level: 1
  physical_type: BYTE_ARRAY
  logical_type: UTF8
<ParquetColumnSchema>
  name: key
  path: m.map.value.map.key
  max_definition_level: 4
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8
<ParquetColumnSchema>
  name: value
  path: m.map.value.map.value
  max_definition_level: 5
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8


where physical_type and logical_type are both plain strings [1].  The Arrow
schema I can get from to_arrow_schema looks to be more expressive (although
maybe I just don't understand the Parquet format well enough):

m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
  child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
      child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
          child 0, key: string
          child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
              child 0, map: list<map: struct<key: string, value: string> not null>
                  child 0, map: struct<key: string, value: string>
                      child 0, key: string
                      child 1, value: string


It seems like I can infer this information from the name/path, but is there
a more direct way of getting at the detailed Parquet schema information?
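
For reference, this is roughly how I'm pulling both views out at the
moment (a minimal sketch; the file name is just a placeholder for my test
file, and I may well be missing a better entry point):

import pyarrow.parquet as pq

pf = pq.ParquetFile('nested_map.parquet')   # placeholder path
parquet_schema = pf.schema                  # ParquetSchema

# Each column comes back as a ColumnSchema like the ones printed above,
# with physical_type/logical_type exposed only as strings.
for i in range(len(parquet_schema)):
    col = parquet_schema.column(i)
    print(col)
    print(col.path, col.physical_type, col.logical_type)

# The richer, nested view shown above comes from the Arrow schema.
print(parquet_schema.to_arrow_schema())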

Second question: is there a way to push record-level filtering down into the
Parquet reader, so that it only reads in values that match a given predicate
expression? Predicate expressions would be simple field-to-literal
comparisons (>, >=, ==, <=, <, !=, is null, is not null) connected with
logical operators (AND, OR, NOT).

I've seen that, after reading the data in, I can use the filter expressions
in Gandiva [3] to get filtered record batches, but I was hoping for
something lower in the stack if possible.
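
To make it concrete, this is the kind of post-read filtering I mean,
adapted from the linked test (API names are from memory, so details may be
off, and 'amount' is just a made-up column):

import pyarrow as pa
import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1.0, 5.0, 10.0, 50.0])], ['amount'])

# Build the predicate: amount < 10.0
builder = gandiva.TreeExprBuilder()
field_node = builder.make_field(table.schema.field_by_name('amount'))
literal = builder.make_literal(10.0, pa.float64())
less_than = builder.make_function('less_than', [field_node, literal],
                                  pa.bool_())
condition = builder.make_condition(less_than)

# This filters record batches that have already been read into memory --
# the step I'd like to avoid for non-matching records.
filter_ = gandiva.make_filter(table.schema, condition)
selection = filter_.evaluate(table.to_batches()[0], pa.default_memory_pool())
print(selection.to_array())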



[1]
https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
[2] Spark/Hive table DDL for this Parquet file:
CREATE TABLE `iceberg`.`nested_map` (
  m map<string,map<string,string>>)
[3]
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
