Hello everybody,

I have been trying to use complex types (stored in Parquet) with
Spark SQL and ran into an issue that I can't seem to solve cleanly.
I was hoping to get some insight from the community through this
mail; maybe I'm just missing something obvious in the way I'm using
Spark :)

It seems that Spark only pushes down projections for columns at the
root level of the records.
This is a big I/O issue depending on how heavily you use complex
types: in my samples I ended up reading 100GB of data when using
only a single field out of a struct (it should most likely have read
only 1GB).
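
To illustrate, a minimal sketch of the kind of query I mean (the
path and column names are placeholders for my real data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Placeholder dataset with a top-level struct column "meta".
val df = spark.read.parquet("/data/events")

// Only one leaf of the struct is referenced...
val ids = df.select("meta.id")

// ...but in my tests explain() shows the scan requesting the whole
// "meta" struct from Parquet, not just the single leaf.
ids.explain()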

I already saw this PR, which sounds promising:
https://github.com/apache/spark/pull/16578

However, it seems that it won't be applicable if you have multiple
levels of array nesting; the main reason is that I can't find a way
to reference fields deeply nested in arrays in a Column expression.
I can do everything within lambdas, but I think the optimizer won't
drill into them to understand that I'm only accessing a few fields.

If I take the following (simplified) example:

{
    trip_legs:[{
        from: "LHR",
        to: "NYC",
        taxes: [{
            type: "gov",
            amount: 12,
            currency: "USD"
        }]
    }]
}

col("trip_legs.from") will return an array containing the from field
of each trip_leg object.
col("trip_legs.taxes.type") will throw an exception.
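
In code, against the schema above (df being my trip data), the
lambda route looks like this sketch; the UDF is a black box, so the
optimizer cannot tell that only trip_legs.taxes.type is used, and
the whole trip_legs column gets read and deserialized:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Collect the "type" of every tax of every leg through a lambda.
// The optimizer cannot see which fields the closure touches.
val taxTypes = udf { legs: Seq[Row] =>
  legs.flatMap { leg =>
    leg.getAs[Seq[Row]]("taxes").map(_.getAs[String]("type"))
  }
}

df.select(taxTypes(col("trip_legs")))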

So my questions are:
  * Is there a way to reference these deeply nested fields in a
Column expression without having to specify an array index?
  * If not, is there an API to force the projection of a given set
of fields so that Parquet reads only this specific set of columns?
(see the sketch right after this list for what I mean)
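
The closest I've found so far is handing the reader a manually
pruned schema. In my tests this seems to limit what the Parquet
reader materializes, but it has to be rewritten by hand for every
query, and I'm not sure how reliable it is across nesting shapes:

import org.apache.spark.sql.types._

// Hand-pruned schema keeping only trip_legs.taxes.type.
val prunedSchema = StructType(Seq(
  StructField("trip_legs", ArrayType(StructType(Seq(
    StructField("taxes", ArrayType(StructType(Seq(
      StructField("type", StringType)
    ))))
  ))))
))

// Reading with an explicit schema; as far as I can tell, only the
// listed leaves are then requested from Parquet.
val prunedDf = spark.read.schema(prunedSchema).parquet("/data/trips")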

In addition, regarding the handling of arrays of structs within Spark SQL:
  * Has it been discussed to have a way to "reshape" an array of
structs without using lambdas (like the $map/$filter/etc. operators
of MongoDB, for example)? See the sketch below for what I have in
mind.
  * If not, and I'm willing to dedicate time to coding these
features, could someone familiar with the code base tell me how
disruptive this would be, and whether it would be a welcome change?
(most likely more appropriate for the dev mailing list, though)
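
To make the "reshape" point concrete, this is roughly what I have in
mind (the transform operator here is purely hypothetical, modeled on
MongoDB's $map; leg would be a Column the optimizer can inspect):

import org.apache.spark.sql.functions.{col, struct}

// Hypothetical: a declarative map over an array of structs that the
// optimizer could see through, unlike an opaque lambda/UDF.
df.select(
  transform(col("trip_legs"), leg =>
    struct(leg.getField("from"), leg.getField("to"))
  )
)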

Regards,
Antoine
