Columns/Field index semantic for parquet FileReader

Radu Teodorescu Mon, 16 Nov 2020 16:18:51 -0800

Hi,
(my apologies if this has already been discussed) 
I just took a stab at the struct support in the parquet FileReader, and I am a
bit confused by the column index semantics when trying to read a subset of
columns from a subset of row groups:


Say I have a single-column Arrow table:

top: struct {
        a: string
        b: int
}

parquet::arrow::FileReader::ReadRowGroup(someRG,{0},out) returns an Arrow table 
that has one column:

top: struct {
        a: string
}

If I call parquet::arrow::FileReader::ReadRowGroup(someRG,{0,1},out) instead, 
it returns an Arrow table that has one column of the original type.

Initially I thought that was a bug, but it seems that was the intended 
behavior (especially given the existence of ReadSchemaField).
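
To make the behavior described above concrete, here is a toy model of it. It
does not call into Arrow or Parquet: Field and SelectLeaves are hypothetical
names introduced only for illustration, and the model assumes leaves are
numbered depth-first across the whole schema and that selected leaves are
regrouped under their top-level field.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a schema: each top-level field owns a list of leaf names.
// Leaves are assumed to be numbered consecutively across the whole schema.
struct Field {
  std::string name;
  std::vector<std::string> leaves;
};

// Keeps only the leaves whose global leaf index is in `leaf_indices`, regrouped
// under their top-level field. Fields with no selected leaf are dropped.
std::vector<Field> SelectLeaves(const std::vector<Field>& schema,
                                const std::vector<int>& leaf_indices) {
  std::set<int> wanted(leaf_indices.begin(), leaf_indices.end());
  std::vector<Field> result;
  int next_leaf = 0;  // running leaf index across the whole schema
  for (const Field& field : schema) {
    Field pruned{field.name, {}};
    for (const std::string& leaf : field.leaves) {
      if (wanted.count(next_leaf) > 0) pruned.leaves.push_back(leaf);
      ++next_leaf;
    }
    if (!pruned.leaves.empty()) result.push_back(pruned);
  }
  return result;
}
```

With the schema {top: {a, b}}, selecting {0} yields one field "top" holding only
"a", and selecting {0, 1} yields one field "top" holding both leaves. In both
cases the number of returned columns is one, whatever the number of indices.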
 
This feels a bit confusing:
1. I do understand the subtleties of Parquet columns vs. Arrow columns, but I 
would have expected the reader to hide all that and effectively talk both ways 
in terms of Arrow columns, as per the associated Arrow schema.
2. The number of columns returned by ReadRowGroup doesn’t (always) match the 
number of indices. If the indexing reflects the Parquet leaf nodes, it would 
make more sense to receive the denormalized columns (one for each indexed 
leaf).
3. The ReadSchemaField method seems to do what I was hoping ReadRowGroup would 
do in terms of top-level schema indexing, but it doesn’t allow for chunked, 
row-group-based access.

My perspective is certainly a first impression and a single data point, so I am 
happy to come around to the existing design philosophy. 

For my immediate selfish needs though, what is the cleanest way to go from 
top-level indices to the leaf indices expected by ReadRowGroup? I can build a 
utility if none is available; it seems like that should allow one to select a 
subset of top-level columns while ensuring they get all the leaves.
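
A minimal sketch of what such a utility might look like, under stated
assumptions: LeafIndicesForFields is a hypothetical helper, not an Arrow API,
and it assumes the number of leaves under each top-level field is already
known and that leaves are numbered consecutively in schema order. In real code
those counts would have to be derived from the file's Parquet schema (for
example by walking parquet::SchemaDescriptor), which is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expands top-level field indices into the leaf indices
// of every leaf under those fields.
// leaves_per_field[i] is the number of leaf columns under top-level field i.
std::vector<int> LeafIndicesForFields(const std::vector<int>& leaves_per_field,
                                      const std::vector<int>& field_indices) {
  // first_leaf[i] is the leaf index of the first leaf under field i;
  // first_leaf.back() is the total number of leaves.
  std::vector<int> first_leaf(leaves_per_field.size() + 1, 0);
  for (std::size_t i = 0; i < leaves_per_field.size(); ++i) {
    first_leaf[i + 1] = first_leaf[i] + leaves_per_field[i];
  }

  std::vector<int> leaf_indices;
  for (int f : field_indices) {
    // Precondition: every requested field index must exist in the schema.
    assert(f >= 0 && static_cast<std::size_t>(f) < leaves_per_field.size());
    for (int leaf = first_leaf[f]; leaf < first_leaf[f + 1]; ++leaf) {
      leaf_indices.push_back(leaf);
    }
  }
  return leaf_indices;
}
```

For the single struct column above (two leaves), LeafIndicesForFields({2}, {0})
gives {0, 1}, which is the index list that returns the column with its
original type.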
 
Thank you
Radu
