Re: Columns/Field index semantic for parquet FileReader

Radu Teodorescu Mon, 16 Nov 2020 18:58:22 -0800

ok,
I retract my last question on mapping arrow fields to parquet leaf nodes; all 
the pieces are there and it’s a 5-line function.
I still feel a bit thrown off by the column index semantics, but I see how it 
can open up more interesting requests where one would want a subset of a 
struct.
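
For anyone who finds this thread later, here is a rough sketch of the mapping.
It deliberately does not use the Arrow/Parquet API: Field, CountLeaves and
TopLevelToLeafIndices are made-up names for illustration, and the sketch
assumes leaf indices follow a depth-first walk of the schema, which matches
the behavior described in the quoted message below.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested schema node (NOT an Arrow/Parquet type).
struct Field {
    std::string name;
    std::vector<Field> children;  // empty means a primitive leaf column
};

// Number of leaf columns under a field; a primitive field counts as one.
inline int CountLeaves(const Field& f) {
    if (f.children.empty()) return 1;
    int n = 0;
    for (const Field& c : f.children) n += CountLeaves(c);
    return n;
}

// Expand top-level field indices into depth-first leaf indices, so that
// selecting a struct selects all of its leaves.
inline std::vector<int> TopLevelToLeafIndices(const std::vector<Field>& schema,
                                              const std::vector<int>& top_level) {
    // first_leaf[i] is the index of the first leaf of top-level field i.
    std::vector<int> first_leaf(schema.size() + 1, 0);
    for (std::size_t i = 0; i < schema.size(); ++i) {
        first_leaf[i + 1] = first_leaf[i] + CountLeaves(schema[i]);
    }
    std::vector<int> leaves;
    for (int t : top_level) {
        // at() throws std::out_of_range on an invalid top-level index.
        const int begin = first_leaf.at(static_cast<std::size_t>(t));
        const int end = first_leaf.at(static_cast<std::size_t>(t) + 1);
        for (int leaf = begin; leaf < end; ++leaf) leaves.push_back(leaf);
    }
    return leaves;
}
```

With the single-column table from my original message (top: struct {a, b}),
top-level index 0 expands to leaf indices 0 and 1, which is what
ReadRowGroup needs in order to return the full struct.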


> On Nov 16, 2020, at 7:18 PM, Radu Teodorescu <radukay...@yahoo.com.INVALID> 
> wrote:
> 
> Hi,
> (my apologies if this has already been discussed) 
> I just took a stab at the struct support in parquet FileReader and I am a bit 
> confused by the column index semantic when trying to read a subset of columns 
> from a subset of row groups:
> 
> Say I have a single column arrow table 
> 
> top: struct {
>       a: string
>       b: int
> }
> 
> parquet::arrow::FileReader::ReadRowGroup(someRG,{0},out) returns an arrow 
> table that has one column
> 
> top: struct {
>       a: string
> }
> 
> If I call parquet::arrow::FileReader::ReadRowGroup(someRG,{0,1},out), it 
> returns an arrow table that has one column of the original type.
> 
> Initially I thought that was a bug but seems like that was the intended 
> behavior (especially based on the existence of ReadSchemaField)
> 
> This feels a bit confusing:
> 1. I do understand the subtleties of parquet columns vs arrow columns but I 
> would have expected the reader to hide all that and effectively talk both 
> ways in terms of Arrow columns as per the associated Arrow schema
> 2. The number of columns returned by ReadRowGroup doesn’t (always) match the 
> number of indices. If the indexing reflects the parquet leaf nodes, it would 
> make more sense to receive the denormalized columns (one for each indexed 
> leaf)
> 3. The ReadSchemaField method seems to do what I was hoping ReadRowGroup 
> would do in terms of using top level schema indexing, but it doesn’t allow 
> for chunked rowgroup based access 
> 
> My perspective is certainly a first impression and a single data point, so I 
> am happy to come around to the existing design philosophy. 
> 
> For my immediate selfish needs though, what is the cleanest way to go from 
> top level indices to leaf indices expected by ReadRowGroup? I can build a 
> utility if none is available, and it seems like that should allow one to 
> select a subset of top level columns while ensuring they are getting all the 
> leaves
> 
> Thank you
> Radu
