OK, I retract my last question on mapping arrow fields to parquet leaf nodes: all the pieces are there and it’s a 5-line function. I still feel a bit thrown off by the column index semantics, but I see how they open the door to more interesting requests where one would want a subset of a struct.
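
For anyone reading this thread later, here is a rough sketch of that kind of mapping. It is not the function referred to above and it does not use the real Arrow or Parquet schema classes: `Field`, `CountLeaves` and `TopLevelToLeafIndices` are hypothetical names invented for illustration. It only assumes that leaf columns are numbered depth-first across the schema, which matches the struct example in the quoted message (`{0,1}` covers both `a` and `b`).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema node; NOT an Arrow or Parquet type.
struct Field {
  std::string name;
  std::vector<Field> children;  // empty for a primitive (leaf) field
};

// Number of leaf columns found under a field (1 for a primitive field).
std::size_t CountLeaves(const Field& f) {
  if (f.children.empty()) return 1;
  std::size_t n = 0;
  for (const Field& c : f.children) n += CountLeaves(c);
  return n;
}

// Expand top-level field indices into the leaf indices they cover,
// assuming leaves are numbered depth-first across the whole schema.
std::vector<int> TopLevelToLeafIndices(const std::vector<Field>& schema,
                                       const std::vector<int>& top_level) {
  std::vector<int> first_leaf;  // index of the first leaf of each top-level field
  std::vector<int> leaf_count;  // number of leaves under each top-level field
  int next = 0;
  for (const Field& f : schema) {
    first_leaf.push_back(next);
    const int n = static_cast<int>(CountLeaves(f));
    leaf_count.push_back(n);
    next += n;
  }
  std::vector<int> out;
  for (int i : top_level) {
    // .at() throws std::out_of_range on an invalid top-level index.
    for (int k = 0; k < leaf_count.at(i); ++k) {
      out.push_back(first_leaf.at(i) + k);
    }
  }
  return out;
}
```

With a schema of `top: struct { a, b }` followed by a primitive field, asking for top-level index 0 yields leaf indices 0 and 1, which is the list one would then hand to a leaf-indexed call such as ReadRowGroup.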

> On Nov 16, 2020, at 7:18 PM, Radu Teodorescu <radukay...@yahoo.com.INVALID> wrote:
>
> Hi,
> (my apologies if this has already been discussed)
> I just took a stab at the struct support in parquet FileReader and I am a bit
> confused by the column index semantics when trying to read a subset of columns
> from a subset of row groups:
>
> Say I have a single-column arrow table
>
> top: struct {
>   a: string
>   b: int
> }
>
> parquet::arrow::FileReader::ReadRowGroup(someRG,{0},out) returns an arrow
> table that has one column
>
> top: struct {
>   a: string
> }
>
> If I call it with parquet::arrow::FileReader::ReadRowGroup(someRG,{0,1},out),
> it returns an arrow table that has one column of the original type.
>
> Initially I thought that was a bug, but it seems that was the intended
> behavior (especially based on the existence of ReadSchemaField).
>
> This feels a bit confusing:
> 1. I do understand the subtleties of parquet columns vs arrow columns, but I
>    would have expected the reader to hide all that and effectively talk both
>    ways in terms of Arrow columns as per the associated Arrow schema.
> 2. The number of columns returned by ReadRowGroup doesn’t (always) match the
>    number of indices. If the indexing reflects the parquet leaf nodes, it
>    would make more sense to receive the denormalized columns (one for each
>    indexed leaf).
> 3. The ReadSchemaField method seems to do what I was hoping ReadRowGroup would
>    do in terms of using top-level schema indexing, but it doesn’t allow for
>    chunked, row-group-based access.
>
> My perspective is certainly a first impression and a single data point, so I
> am happy to come around to the existing design philosophy.
>
> For my immediate selfish needs though, what is the cleanest way to go from
> top-level indices to the leaf indices expected by ReadRowGroup? I can build a
> utility if none is available, and it seems like that should allow one to
> select a subset of top-level columns while ensuring they are getting all the
> leaves.
>
> Thank you
> Radu