Hello everybody,

I have been trying to use complex types (stored in Parquet) with Spark SQL and ended up with an issue that I can't seem to solve cleanly. I was hoping, through this mail, to get some insights from the community; maybe I'm just missing something obvious in the way I'm using Spark :)
It seems that Spark only pushes down projections for columns at the root level of the records. This is a big I/O issue when complex types are used heavily: in my samples I ended up reading 100GB of data while using only a single field out of a struct (it should most likely have read only 1GB).

I already saw this PR, which sounds promising: https://github.com/apache/spark/pull/16578

However, it seems that it won't be applicable when there are multiple levels of array nesting. The main reason is that I can't find a way to reference fields deeply nested inside arrays in a Column expression. I can do everything within lambdas, but I think the optimizer won't drill into them to understand that I'm only accessing a few fields.

If I take the following (simplified) example:

    {
      trip_legs: [{
        from: "LHR",
        to: "NYC",
        taxes: [{
          type: "gov",
          amount: 12,
          currency: "USD"
        }]
      }]
    }

col("trip_legs.from") will return an array of the "from" fields of all the trip_leg objects.
col("trip_legs.taxes.type") will throw an exception.

So my questions are:
* Is there a way to reference these deeply nested fields in a Column expression without having to specify an array index?
* If not, is there an API to force the projection of a given set of fields so that Parquet only reads this specific set of columns?

In addition, regarding the handling of arrays of structs within Spark SQL:
* Has it been discussed to have a way to "reshape" an array of structs without using lambdas? (Like the $map/$filter/etc. operators of MongoDB, for example.)
* If not, and I'm willing to dedicate time to code these features, could someone familiar with the code base tell me how disruptive this would be, and whether it would be a welcome change? (Most likely more appropriate for the dev mailing list, though.)

Regards,
Antoine
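P.S. To make this concrete, here is a rough, self-contained sketch of what I described above. The path and the exact lambda-based version are simplified/hypothetical, but the data has the shape of the JSON example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("nested-field-example").getOrCreate()
    import spark.implicits._

    // hypothetical path; the real dataset is the one described above
    val trips = spark.read.parquet("/data/trips")

    // Works: the field is behind a single array level, returns an array<string> per row
    trips.select(col("trip_legs.from")).show()

    // Throws an exception: the field is behind two levels of arrays
    // trips.select(col("trip_legs.taxes.type")).show()

    // What I do today instead: typed lambdas. It runs, but as far as I can tell
    // the optimizer cannot see that only `type` is accessed, so Parquet still
    // reads the whole trip_legs column.
    case class Tax(`type`: String, amount: Double, currency: String)
    case class TripLeg(from: String, to: String, taxes: Seq[Tax])
    case class Trip(trip_legs: Seq[TripLeg])

    trips.as[Trip]
      .flatMap(_.trip_legs.flatMap(_.taxes.map(_.`type`)))
      .show()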