Thanks Micah and Wes. Definitely interested in the *Predicate Pushdown* and *Schema inference, schema-on-read, and schema normalization* sections.
On Fri, May 17, 2019 at 12:47 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Please see also
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
>
> and prior mailing list discussion. I will comment in more detail on the
> other items later.
>
> On Fri, May 17, 2019, 2:44 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > I can't help on the first question.
> >
> > Regarding push-down predicates, there is an open JIRA [1] to do just
> > that.
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-473
> > <https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22>
> >
> > On Fri, May 17, 2019 at 11:48 AM Ted Gooch <tedgo...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I've been doing some work trying to get the parquet read path going
> > > for the python iceberg <https://github.com/apache/incubator-iceberg>
> > > library. I have two questions that I couldn't figure out, and was
> > > hoping I could get some guidance from the list here.
> > >
> > > First, I'd like to create a ParquetSchema->IcebergSchema converter,
> > > but it appears that only limited information is available in the
> > > ColumnSchema passed back to the python client[2]:
> > >
> > > <ParquetColumnSchema>
> > >   name: key
> > >   path: m.map.key
> > >   max_definition_level: 2
> > >   max_repetition_level: 1
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > > <ParquetColumnSchema>
> > >   name: key
> > >   path: m.map.value.map.key
> > >   max_definition_level: 4
> > >   max_repetition_level: 2
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > > <ParquetColumnSchema>
> > >   name: value
> > >   path: m.map.value.map.value
> > >   max_definition_level: 5
> > >   max_repetition_level: 2
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > >
> > > where physical_type and logical_type are both strings[1]. The arrow
> > > schema I can get from *to_arrow_schema* looks to be more expressive
> > > (although maybe I just don't understand the parquet format well
> > > enough):
> > >
> > > m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
> > >   child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
> > >     child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
> > >       child 0, key: string
> > >       child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
> > >         child 0, map: list<map: struct<key: string, value: string> not null>
> > >           child 0, map: struct<key: string, value: string>
> > >             child 0, key: string
> > >             child 1, value: string
> > >
> > > It seems like I can infer the info from the name/path, but is there a
> > > more direct way of getting the detailed parquet schema information?
> > >
> > > Second question: is there a way to push record-level filtering into
> > > the parquet reader, so that it only reads in values that match a
> > > given predicate expression? Predicate expressions would be simple
> > > field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not
> > > null) connected with logical operators (AND, OR, NOT).
> > >
> > > I've seen that after reading in I can use the filtering language in
> > > gandiva[3] to get filtered record batches, but I was looking for
> > > somewhere lower in the stack if possible.
> > >
> > > [1]
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > > [2] Spark/Hive Table DDL for this parquet file looks like:
> > >     CREATE TABLE `iceberg`.`nested_map` (
> > >       m map<string,map<string,string>>)
> > > [3]
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
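To make the first question more concrete, here is a minimal sketch of the
two schema views side by side. ParquetSchema.column, the ColumnSchema
attributes, and to_arrow_schema are the APIs cited in [1]; num_fields and
value_field are the newer pyarrow spellings of the child accessors (older
releases spell these num_children/child), and "nested_map.parquet" is just
a stand-in for a file matching the DDL in [2]:

    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("nested_map.parquet")  # stand-in for the table in [2]

    # The flat, string-typed view that ColumnSchema exposes today:
    for i in range(pf.metadata.num_columns):
        col = pf.schema.column(i)
        print(col.path, col.physical_type, col.logical_type)

    # The nested, typed view via the arrow schema:
    def walk(field, indent=""):
        # Recursively print a pyarrow Field tree; struct and list types
        # carry their child fields on the DataType itself.
        print("%s%s: %s" % (indent, field.name, field.type))
        t = field.type
        if pa.types.is_struct(t):
            for j in range(t.num_fields):
                walk(t.field(j), indent + "  ")
        elif pa.types.is_list(t):
            walk(t.value_field, indent + "  ")

    for f in pf.schema.to_arrow_schema():
        walk(f)

The arrow Field tree carries nullability and nesting directly, which is why
it looks like the better starting point for an Iceberg converter than the
flat ColumnSchema strings.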
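On the second question, until something like PARQUET-473 lands, the closest
I can see from Python is pruning whole row groups by hand using the min/max
statistics in the file metadata. A rough sketch for a `column == value`
predicate; read_matching_row_groups is just an illustrative name, and I'm
assuming a flat column so the arrow field index lines up with the parquet
column index:

    import pyarrow.parquet as pq

    def read_matching_row_groups(path, column, value):
        # Hand-rolled, coarse pushdown for `column == value`: skip any
        # row group whose min/max statistics prove it cannot match.
        pf = pq.ParquetFile(path)
        col_idx = pf.schema.to_arrow_schema().get_field_index(column)
        pieces = []
        for rg in range(pf.metadata.num_row_groups):
            stats = pf.metadata.row_group(rg).column(col_idx).statistics
            if stats is not None and stats.has_min_max:
                if value < stats.min or value > stats.max:
                    continue  # provably no matching rows in this group
            pieces.append(pf.read_row_group(rg))
        return pieces

Statistics only prune at row-group granularity, so an exact record-level
filter still has to run over whatever comes back; I believe ParquetDataset's
filters= argument is similar in spirit, but prunes at the partition level
rather than per record.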
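And for completeness, the gandiva route from [3] looks roughly like the
following (adapted from that test; it assumes a pyarrow build with gandiva
enabled, and newer pyarrow spells field_by_name as schema.field):

    import pyarrow as pa
    import pyarrow.gandiva as gandiva

    arr = pa.array([1.0, 5.0, 10.0, 20.0], type=pa.float64())
    batch = pa.RecordBatch.from_arrays([arr], ["a"])

    # Build the expression tree for: a < 10.0
    builder = gandiva.TreeExprBuilder()
    node_a = builder.make_field(batch.schema.field_by_name("a"))
    ten = builder.make_literal(10.0, pa.float64())
    condition = builder.make_condition(
        builder.make_function("less_than", [node_a, ten], pa.bool_()))

    fltr = gandiva.make_filter(batch.schema, condition)
    # evaluate() returns a SelectionVector of row indices satisfying the
    # predicate (here [0, 1]); the rows still have to be gathered after.
    result = fltr.evaluate(batch, pa.default_memory_pool())
    print(result.to_array())

So it works strictly as a post-read filter over record batches, which is
exactly why I was hoping for something lower in the stack.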