Hi,

(My apologies if this has already been discussed.) I just took a stab at the struct support in the parquet FileReader, and I am a bit confused by the column index semantics when trying to read a subset of columns from a subset of row groups:
Say I have a single-column Arrow table:

    top: struct { a: string, b: int }

Calling parquet::arrow::FileReader::ReadRowGroup(someRG, {0}, out) returns an Arrow table that has one column:

    top: struct { a: string }

Calling parquet::arrow::FileReader::ReadRowGroup(someRG, {0,1}, out) returns an Arrow table that has one column of the original type.

Initially I thought that was a bug, but it seems that this is the intended behavior (especially given the existence of ReadSchemaField). This feels a bit confusing:

1. I do understand the subtleties of Parquet columns vs. Arrow columns, but I would have expected the reader to hide all that and effectively talk both ways in terms of Arrow columns, as per the associated Arrow schema.
2. The number of columns returned by ReadRowGroup doesn't (always) match the number of indices. If the indexing reflects the Parquet leaf nodes, it would make more sense to receive the denormalized columns (one for each indexed leaf).
3. The ReadSchemaField method seems to do what I was hoping ReadRowGroup would do in terms of using top-level schema indexing, but it doesn't allow for chunked, row-group-based access.

My perspective is certainly a first impression and a single data point, so I am happy to come around to the existing design philosophy. For my immediate selfish needs, though: what is the cleanest way to go from top-level indices to the leaf indices expected by ReadRowGroup? I can build a utility if none is available. It seems like that should allow one to select a subset of top-level columns while ensuring they get all the leaves.

Thank you,
Radu
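
P.S. To make the question concrete, here is a rough sketch of the kind of utility I have in mind. It is not written against the Arrow or Parquet API: Field is a hypothetical stand-in for a schema node, and TopLevelToLeafIndices and CountLeaves are made-up names. It assumes leaf indices are assigned in depth-first order over the schema and that a field with no children counts as a single leaf.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema node (NOT an Arrow/Parquet type).
// A field with no children is treated as a primitive leaf.
struct Field {
  std::string name;
  std::vector<Field> children;
};

// Number of leaf nodes in the subtree rooted at `field`.
int CountLeaves(const Field& field) {
  if (field.children.empty()) return 1;
  int count = 0;
  for (const Field& child : field.children) count += CountLeaves(child);
  return count;
}

// Expand top-level field indices into the leaf indices they cover,
// assuming leaves are numbered depth-first across the whole schema.
std::vector<int> TopLevelToLeafIndices(const std::vector<Field>& schema,
                                       const std::vector<int>& top_level) {
  // first_leaf[i] is the index of the first leaf under top-level field i;
  // first_leaf[i + 1] is one past its last leaf.
  std::vector<int> first_leaf(schema.size() + 1, 0);
  for (std::size_t i = 0; i < schema.size(); ++i) {
    first_leaf[i + 1] = first_leaf[i] + CountLeaves(schema[i]);
  }

  std::vector<int> leaves;
  for (int t : top_level) {
    if (t < 0 || static_cast<std::size_t>(t) >= schema.size()) {
      throw std::out_of_range("top-level index out of range");
    }
    for (int leaf = first_leaf[t]; leaf < first_leaf[t + 1]; ++leaf) {
      leaves.push_back(leaf);
    }
  }
  return leaves;
}
```

With something like this, I would expand my top-level selection first and pass the resulting leaf indices to ReadRowGroup, so that every selected struct comes back with all of its children.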