Hello Ted,

Regarding predicate pushdown in Python, have a look at my unfinished PR at
https://github.com/apache/arrow/pull/2623. It was put on hold because we were
missing a native filter kernel in Arrow. The requirements for that have now
been implemented, so we could probably reactivate the PR.
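
For reference, the native filter is now exposed in newer pyarrow through the
compute module. A minimal sketch of what the reactivated pushdown would sit
on top of (the exact API surface depends on your pyarrow version):

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({"x": [1, 5, 10], "y": ["a", "b", "c"]})
    mask = pc.greater(table["x"], 3)  # boolean mask, computed natively in C++
    filtered = table.filter(mask)     # keeps only the rows where x > 3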

Uwe

On Sat, May 18, 2019, at 3:53 AM, Ted Gooch wrote:
> Thanks Micah and Wes.
> 
> Definitely interested in the *Predicate Pushdown* and *Schema inference,
> schema-on-read, and schema normalization* sections.
> 
> On Fri, May 17, 2019 at 12:47 PM Wes McKinney <wesmck...@gmail.com> wrote:
> 
> > Please see also
> >
> > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
> >
> > and the prior mailing list discussion. I will comment in more detail on
> > the other items later.
> >
> > On Fri, May 17, 2019, 2:44 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > I can't help on the first question.
> > >
> > > Regarding push-down predicates, there is an open JIRA [1] to do just that.
> > >
> > > [1] https://issues.apache.org/jira/browse/PARQUET-473
> > > <https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22>
> > >
> > > On Fri, May 17, 2019 at 11:48 AM Ted Gooch <tedgo...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I've been doing some work trying to get the Parquet read path going
> > > > for the Python iceberg <https://github.com/apache/incubator-iceberg>
> > > > library. I have two questions that I couldn't figure out, and was
> > > > hoping I could get some guidance from the list here.
> > > >
> > > > First, I'd like to create a ParquetSchema->IcebergSchema converter, but
> > > > it appears that only limited information is available in the
> > > > ColumnSchema passed back to the Python client [2]:
> > > >
> > > > <ParquetColumnSchema>
> > > >   name: key
> > > >   path: m.map.key
> > > >   max_definition_level: 2
> > > >   max_repetition_level: 1
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > > <ParquetColumnSchema>
> > > >   name: key
> > > >   path: m.map.value.map.key
> > > >   max_definition_level: 4
> > > >   max_repetition_level: 2
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > > <ParquetColumnSchema>
> > > >   name: value
> > > >   path: m.map.value.map.value
> > > >   max_definition_level: 5
> > > >   max_repetition_level: 2
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > >
> > > >
> > > > where physical_type and logical_type are both strings [1]. The Arrow
> > > > schema I can get from *to_arrow_schema* looks to be more expressive
> > > > (although maybe I just don't understand the Parquet format well enough):
> > > >
> > > > m: struct<map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>>
> > > >   child 0, map: list<map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>> not null>
> > > >       child 0, map: struct<key: string, value: struct<map: list<map: struct<key: string, value: string> not null>>>
> > > >           child 0, key: string
> > > >           child 1, value: struct<map: list<map: struct<key: string, value: string> not null>>
> > > >               child 0, map: list<map: struct<key: string, value: string> not null>
> > > >                   child 0, map: struct<key: string, value: string>
> > > >                       child 0, key: string
> > > >                       child 1, value: string
> > > >
> > > >
> > > > It seems like I can infer the info from the name/path, but is there a
> > > > more direct way of getting the detailed Parquet schema information?
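> > > >
> > > > (For what it's worth, the closest I've found so far is walking the
> > > > Arrow schema recursively. A rough sketch, assuming a recent pyarrow
> > > > with the generic nested-type accessors and a made-up file name:
> > > >
> > > >     import pyarrow as pa
> > > >     import pyarrow.parquet as pq
> > > >
> > > >     pf = pq.ParquetFile("nested_map.parquet")  # hypothetical path
> > > >     arrow_schema = pf.schema.to_arrow_schema() # nested, fully typed
> > > >
> > > >     def walk(field, depth=0):
> > > >         # Print each field with its Arrow type, recursing into children.
> > > >         print("  " * depth + field.name + ": " + str(field.type))
> > > >         if pa.types.is_nested(field.type):
> > > >             for i in range(field.type.num_fields):
> > > >                 walk(field.type.field(i), depth + 1)
> > > >
> > > >     for f in arrow_schema:
> > > >         walk(f)
> > > >
> > > > This recovers the nesting, though the per-column physical details
> > > > still only come from the flat ColumnSchema view.)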
> > > >
> > > > Second question: is there a way to push record-level filtering into
> > > > the Parquet reader, so that it only reads values that match a given
> > > > predicate expression? Predicate expressions would be simple
> > > > field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not
> > > > null) connected with logical operators (AND, OR, NOT).
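> > > >
> > > > (What I can do today is coarser: skip whole row groups using the
> > > > min/max statistics in the footer, then filter the survivors after the
> > > > read. A sketch; the file name, column index, and x > 100 predicate are
> > > > all made up:
> > > >
> > > >     import pyarrow as pa
> > > >     import pyarrow.parquet as pq
> > > >
> > > >     pf = pq.ParquetFile("data.parquet")  # hypothetical path
> > > >     pieces = []
> > > >     for i in range(pf.num_row_groups):
> > > >         # Column 0 stands in for the filter column x.
> > > >         stats = pf.metadata.row_group(i).column(0).statistics
> > > >         if stats is None or not stats.has_min_max or stats.max > 100:
> > > >             pieces.append(pf.read_row_group(i))  # may contain matches
> > > >     table = pa.concat_tables(pieces)  # still needs a row-level filter
> > > >
> > > > That saves I/O per row group, but it isn't the record-level pushdown
> > > > I'm after.)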
> > > >
> > > > I've seen that after reading in the data I can use the filtering
> > > > language in Gandiva [3] to get filtered record batches, but I was
> > > > looking for somewhere lower in the stack if possible.
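> > > >
> > > > (Concretely, the Gandiva route from [3] looks roughly like this; a
> > > > sketch based on my reading of that test, so treat the exact function
> > > > names as assumptions:
> > > >
> > > >     import pyarrow as pa
> > > >     import pyarrow.gandiva as gandiva
> > > >
> > > >     table = pa.table({"x": pa.array([1.0, 5.0, 10.0])})
> > > >     builder = gandiva.TreeExprBuilder()
> > > >     node_x = builder.make_field(table.schema.field("x"))
> > > >     three = builder.make_literal(3.0, pa.float64())
> > > >     cond = builder.make_condition(
> > > >         builder.make_function("greater_than", [node_x, three],
> > > >                               pa.bool_()))
> > > >     fltr = gandiva.make_filter(table.schema, cond)
> > > >     # evaluate() yields a selection vector of matching row indices
> > > >     selection = fltr.evaluate(table.to_batches()[0],
> > > >                               pa.default_memory_pool())
> > > >     result = table.take(selection.to_array())
> > > >
> > > > Since this filters after decoding, it doesn't cut any read I/O, hence
> > > > the wish for something lower in the stack.)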
> > > >
> > > >
> > > >
> > > > [1] https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > > > [2] The Spark/Hive table DDL for this Parquet file looks like:
> > > >     CREATE TABLE `iceberg`.`nested_map` (
> > > >       m map<string,map<string,string>>)
> > > > [3] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
