[PR] [WIP] Support deep schema pruning and projection [datafusion]

via GitHub Wed, 31 Jul 2024 05:47:27 -0700


adragomir opened a new pull request, #11747:
URL: https://github.com/apache/datafusion/pull/11747


   ## Which issue does this PR close?
   
   Closes #11745
   
   ## What changes are included in this PR?
   
   This PR is currently opened as a discussion support for this feature. 
   
   1. We need a way to add the deep projection. This is currently represented 
(for ease of merging etc) as an extra parameter, `projection_deep` added after 
a `projection` parameter before. This parameter needs to be propagated all the 
way down to the physical data source, in our case parquet. 
   
   The representation of the deep projection is a `HashMap<usize, 
Vec<String>>`. The key represents the index of the top level column, while the 
`Vec<String>` is a list of "paths". Each path represents a way to navigate 
inside a deep field separated by dots, like 
`top_level_struct.subfield1.*.subfield2`. 
   
   ** Current problems / questions **
   
   * Duplicated information - for ease of merging, at the moment, we have both 
the `projection` and `projection_deep` parameters, even though we could only 
pass the second one, that has in the keys the information currently in 
projection
   * Untyped representation for path - maybe we need a better / safer way to 
represent the deep projections ? due to unfamiliarity and ease of development, 
currently it's just a String, but maybe we need to have it represented as a 
typed thing ? 
   
   3. We have a set of functions that can: rewrite schemas and record batches 
from an input to an output schema, rewrite an input schema to a (potentially) 
much smaller output schema that contains ONLY the necessary columns. This is 
the first patch in the PR (). These functions can be tested in isolation we 
need to add more unit tests, etc. 
   
   ** Current problems / questions **
   
   * We have a lot of seemingly duplicated code that recurses through schemas. 
We tried making it more generic, but it was pretty complicated, so for the 
moment there is some duplication in the code, and not really sure how to make 
it better
   * Deep Schema test cases - creating deep schemas in code is very verbose, so 
we don't have enough test cases. One way would be to use pre-generated parquet 
files created with other tools and use those for the schema. 
   
   4. We need to detect the deep projections from the input logical plan, that 
means navigating the expressions, down to the columns and transforming index 
and field access expressions to a map 
`field[0]["some_sub_field"]["other_sub_field"] -> 0: 
Vec<String>{"*.some_sub_field.other_sub_field"}`
   
   ** Current problems / questions **
   
   * Uniform field accesses need 2 accesses through the fields: 
       * At the moment, we recurse through the input expressions like 
`a["b"]["c"]` to compute a path like `b.c`, or `*.c`, or `b.*`.  The `["b"]` 
could mean either a map access or a substruct access, same for the c
       * So, we go through the paths twice - once we create them and after that 
we iterate them again, WITH the input schems so we can detect map accesses and 
replace actual field names that are actual map or list accesses with `*` - so 
we also introduce a magic string here :(
   * Implementing a new `scan_deep` function in the TableProvider
       * We can get rid of this. It's done this way for now, so that we don't 
break existing TableProvider implementations 
   
   6. We need to push these through the physical layer and finally implement 
the reading in the parquet layer. 
   
   ** Current problems / questions **
   
   * Named field in map key value or list element
       * In the parquet schema, a list has a field inside it with a specific 
name. However, we don't have this name anywhere, except at the parquet layer. 
The arrow reader as far as we can tell loads these and replaces the names with 
standard names  ("element"  in the list case, "key_value.key", 
"key_value".value). BUT, if the arrow loader behavior changes, this 
functionality will break. At the moment we detect these hardcoded names and we 
replace them with a wildcard "*" that signifies that we don't care about this 
name. We also do this in the INPUT paths, because we don't have these actual 
names at that level. 
   
   ## Are these changes tested?
   
   A version of this is tested on top of Datafusion-40, the rebase on main has 
not been tested. The PR is opened as a discussion support, but the patch 
applied mostly cleanly apart from some refactorings in data fucion
   
   ## Are there any user-facing changes?
   
   Possible API change for implementers of `TableProvider`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [WIP] Support deep schema pruning and projection [datafusion]

Reply via email to