Thanks Micah and Wes. Definitely interested in the *Predicate Pushdown* and *Schema inference, schema-on-read, and schema normalization* sections.
On Fri, May 17, 2019 at 12:47 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Please see also
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
>
> and prior mailing list discussion. I will comment in more detail on the
> other items later.
>
> On Fri, May 17, 2019, 2:44 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > I can't help on the first question.
> >
> > Regarding push-down predicates, there is an open JIRA [1] to do just
> > that.
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-473
> > <https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22>
> >
> > On Fri, May 17, 2019 at 11:48 AM Ted Gooch <tedgo...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I've been doing some work trying to get the parquet read path going
> > > for the python iceberg <https://github.com/apache/incubator-iceberg>
> > > library. I have two questions that I couldn't figure out, and was
> > > hoping I could get some guidance from the list here.
> > >
> > > First, I'd like to create a ParquetSchema->IcebergSchema converter,
> > > but it appears that only limited information is available in the
> > > ColumnSchema passed back to the python client[2]:
> > >
> > > <ParquetColumnSchema>
> > >   name: key
> > >   path: m.map.key
> > >   max_definition_level: 2
> > >   max_repetition_level: 1
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > > <ParquetColumnSchema>
> > >   name: key
> > >   path: m.map.value.map.key
> > >   max_definition_level: 4
> > >   max_repetition_level: 2
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > > <ParquetColumnSchema>
> > >   name: value
> > >   path: m.map.value.map.value
> > >   max_definition_level: 5
> > >   max_repetition_level: 2
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > >
> > > where physical_type and logical_type are both strings[1]. The arrow
> > > schema I can get from *to_arrow_schema* looks to be more expressive
> > > (although maybe I just don't understand the parquet format well
> > > enough):
> > >
> > > m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
> > >   child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
> > >     child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
> > >       child 0, key: string
> > >       child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
> > >         child 0, map: list<map: struct<key: string, value: string> not null>
> > >           child 0, map: struct<key: string, value: string>
> > >             child 0, key: string
> > >             child 1, value: string
> > >
> > > It seems like I can infer the info from the name/path, but is there a
> > > more direct way of getting the detailed parquet schema information?
> > >
> > > Second question: is there a way to push record-level filtering into
> > > the parquet reader, so that it only reads in values that match a
> > > given predicate expression? Predicate expressions would be simple
> > > field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not
> > > null) connected with logical operators (AND, OR, NOT).
> > >
> > > I've seen that after reading in I can use the filtering language in
> > > gandiva[3] to get filtered record batches, but I was looking for
> > > somewhere lower in the stack if possible.
> > >
> > > [1]
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > > [2] Spark/Hive Table DDL for this parquet file looks like:
> > >     CREATE TABLE `iceberg`.`nested_map` (
> > >       m map<string,map<string,string>>)
> > > [3]
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
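To make the first question more concrete, here is a minimal sketch of the
two schema views side by side. ParquetSchema.column, the ColumnSchema
attributes, and to_arrow_schema are the APIs cited in [1]; num_fields and
value_field are the newer pyarrow spellings of the child accessors (older
releases spell these num_children/child), and "nested_map.parquet" is just
a stand-in for a file matching the DDL in [2]:

    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("nested_map.parquet")  # stand-in for the table in [2]

    # The flat, string-typed view that ColumnSchema exposes today:
    for i in range(pf.metadata.num_columns):
        col = pf.schema.column(i)
        print(col.path, col.physical_type, col.logical_type)

    # The nested, typed view via the arrow schema:
    def walk(field, indent=""):
        # Recursively print a pyarrow Field tree; struct and list types
        # carry their child fields on the DataType itself.
        print("%s%s: %s" % (indent, field.name, field.type))
        t = field.type
        if pa.types.is_struct(t):
            for j in range(t.num_fields):
                walk(t.field(j), indent + "  ")
        elif pa.types.is_list(t):
            walk(t.value_field, indent + "  ")

    for f in pf.schema.to_arrow_schema():
        walk(f)

The arrow Field tree carries nullability and nesting directly, which is why
it looks like the better starting point for an Iceberg converter than the
flat ColumnSchema strings.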
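On the second question, until something like PARQUET-473 lands, the closest
I can see from Python is pruning whole row groups by hand using the min/max
statistics in the file metadata. A rough sketch for a `column == value`
predicate; read_matching_row_groups is just an illustrative name, and I'm
assuming a flat column so the arrow field index lines up with the parquet
column index:

    import pyarrow.parquet as pq

    def read_matching_row_groups(path, column, value):
        # Hand-rolled, coarse pushdown for `column == value`: skip any
        # row group whose min/max statistics prove it cannot match.
        pf = pq.ParquetFile(path)
        col_idx = pf.schema.to_arrow_schema().get_field_index(column)
        pieces = []
        for rg in range(pf.metadata.num_row_groups):
            stats = pf.metadata.row_group(rg).column(col_idx).statistics
            if stats is not None and stats.has_min_max:
                if value < stats.min or value > stats.max:
                    continue  # provably no matching rows in this group
            pieces.append(pf.read_row_group(rg))
        return pieces

Statistics only prune at row-group granularity, so an exact record-level
filter still has to run over whatever comes back; I believe ParquetDataset's
filters= argument is similar in spirit, but prunes at the partition level
rather than per record.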
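And for completeness, the gandiva route from [3] looks roughly like the
following (adapted from that test; it assumes a pyarrow build with gandiva
enabled, and newer pyarrow spells field_by_name as schema.field):

    import pyarrow as pa
    import pyarrow.gandiva as gandiva

    arr = pa.array([1.0, 5.0, 10.0, 20.0], type=pa.float64())
    batch = pa.RecordBatch.from_arrays([arr], ["a"])

    # Build the expression tree for: a < 10.0
    builder = gandiva.TreeExprBuilder()
    node_a = builder.make_field(batch.schema.field_by_name("a"))
    ten = builder.make_literal(10.0, pa.float64())
    condition = builder.make_condition(
        builder.make_function("less_than", [node_a, ten], pa.bool_()))

    fltr = gandiva.make_filter(batch.schema, condition)
    # evaluate() returns a SelectionVector of row indices satisfying the
    # predicate (here [0, 1]); the rows still have to be gathered after.
    result = fltr.evaluate(batch, pa.default_memory_pool())
    print(result.to_array())

So it works strictly as a post-read filter over record batches, which is
exactly why I was hoping for something lower in the stack.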