adragomir opened a new pull request, #11747:
URL: https://github.com/apache/datafusion/pull/11747
## Which issue does this PR close?
Closes #11745
## What changes are included in this PR?
This PR is currently opened to support discussion of this feature.
1. We need a way to pass the deep projection down. For ease of merging, it is currently represented as an extra parameter, `projection_deep`, added after the existing `projection` parameter. This parameter needs to be propagated all the way down to the physical data source, in our case Parquet.
The deep projection is represented as a `HashMap<usize, Vec<String>>`. The key is the index of the top-level column, and the `Vec<String>` is a list of "paths". Each path describes how to navigate inside a nested field, with segments separated by dots, e.g. `top_level_struct.subfield1.*.subfield2`. A sketch of this representation follows the questions below.
**Current problems / questions**
* Duplicated information - for ease of merging we currently have both the `projection` and `projection_deep` parameters, even though passing only the second one would be enough: its keys carry the same information as `projection`.
* Untyped representation for paths - maybe we need a better / safer way to represent the deep projections? Out of unfamiliarity and for ease of development each path is currently just a `String`, but perhaps it should be a properly typed structure.
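
To make the shape of this concrete, here is a minimal sketch of what such a map could look like for a made-up query touching two subfields of column 0. The `PathSegment` enum at the end is only one idea for the typed representation asked about above; neither it nor the field names are taken from the PR.

```rust
use std::collections::HashMap;

fn main() {
    // Deep projection as described above: top-level column index -> dotted paths.
    // All column and field names here are invented for illustration.
    let mut projection_deep: HashMap<usize, Vec<String>> = HashMap::new();
    projection_deep.insert(
        0,
        vec![
            "top_level_struct.subfield1".to_string(),
            // "*" marks a map/list access whose key or index is irrelevant.
            "top_level_struct.*.subfield2".to_string(),
        ],
    );

    // The flat `projection` that is currently passed alongside `projection_deep`
    // carries the same top-level information as the map's keys.
    let mut projection: Vec<usize> = projection_deep.keys().copied().collect();
    projection.sort();
    assert_eq!(projection, vec![0]);
}

// One possible typed alternative to the dotted `String` paths (not in the PR):
// a path becomes a sequence of segments instead of a string with a magic "*".
#[allow(dead_code)]
enum PathSegment {
    /// Access a named struct subfield.
    Field(String),
    /// Access through a map value or list element; the name is irrelevant.
    Any,
}
```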
2. We have a set of functions that can rewrite schemas and record batches from an input to an output schema, and rewrite an input schema to a (potentially) much smaller output schema that contains ONLY the necessary columns. This is the first patch in the PR. These functions can be tested in isolation; we still need to add more unit tests, etc. A pruning sketch follows the questions below.
**Current problems / questions**
* We have a lot of seemingly duplicated code that recurses through schemas. We tried making it more generic, but that got pretty complicated, so for the moment there is some duplication and we are not really sure how to improve it.
* Deep schema test cases - building deep schemas in code is very verbose, so we don't have enough test cases. One option would be to use pre-generated Parquet files created with other tools and take the schemas from those.
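
To illustrate the kind of schema rewriting involved (and the verbosity of building deep schemas by hand), here is a minimal sketch using arrow's schema types. It is not the PR's code: it only handles struct columns, ignores maps, lists, and `*` segments, and the field names are invented.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Fields, Schema};

/// Simplified sketch: keep only the subfields of a struct column whose name
/// matches the first segment of some projection path, recursing with the
/// remaining segments. Non-struct types are kept as-is.
fn prune_struct(field: &Field, paths: &[Vec<&str>]) -> Field {
    match field.data_type() {
        DataType::Struct(children) => {
            let kept: Vec<Arc<Field>> = children
                .iter()
                .filter_map(|child| {
                    // Paths whose first segment names this child (or is "*").
                    let rest: Vec<Vec<&str>> = paths
                        .iter()
                        .filter(|p| {
                            matches!(p.first(), Some(&s) if s == child.name().as_str() || s == "*")
                        })
                        .map(|p| p[1..].to_vec())
                        .collect();
                    if rest.is_empty() {
                        None // no path needs this child: drop it
                    } else if rest.iter().all(|p| p.is_empty()) {
                        Some(child.clone()) // all paths end here: keep it whole
                    } else {
                        Some(Arc::new(prune_struct(child, &rest)))
                    }
                })
                .collect();
            Field::new(
                field.name().clone(),
                DataType::Struct(Fields::from(kept)),
                field.is_nullable(),
            )
        }
        _ => field.clone(),
    }
}

fn main() {
    // Building even this tiny deep schema by hand is already verbose, which is
    // exactly the testing pain point mentioned above.
    let schema = Schema::new(vec![Field::new(
        "top_level_struct",
        DataType::Struct(Fields::from(vec![
            Field::new("subfield1", DataType::Int64, true),
            Field::new("subfield2", DataType::Utf8, true),
            Field::new("unused", DataType::Utf8, true),
        ])),
        true,
    )]);

    // Deep projection for column 0; paths here are taken relative to the column.
    let deep: HashMap<usize, Vec<String>> =
        HashMap::from([(0usize, vec!["subfield1".to_string()])]);
    let paths: Vec<Vec<&str>> =
        deep[&0].iter().map(|p| p.split('.').collect()).collect();

    let pruned = prune_struct(schema.field(0), &paths);
    println!("{pruned:?}"); // only top_level_struct.subfield1 remains
}
```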
3. We need to detect the deep projections from the input logical plan. That means walking the expressions down to the columns and transforming index and field access expressions into a map, e.g. `field[0]["some_sub_field"]["other_sub_field"] -> 0: Vec<String>{"*.some_sub_field.other_sub_field"}`. A path-normalization sketch follows the questions below.
**Current problems / questions**
* Uniform handling of field accesses needs two passes over the fields:
  * At the moment we recurse through input expressions like `a["b"]["c"]` to compute a path like `b.c`, `*.c`, or `b.*`. The `["b"]` could mean either a map access or a struct subfield access, and the same goes for `["c"]`.
  * So we go through the paths twice: once to create them, and then again WITH the input schema so we can detect map accesses and replace the field names that are actually map or list accesses with `*` - which also introduces a magic string here :(
* Implementing a new `scan_deep` function in the `TableProvider`
  * We can get rid of this. It is done this way for now so that we don't break existing `TableProvider` implementations.
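
To show what the second pass could look like, here is a minimal sketch (not the PR's code) that re-walks one extracted path against the type of its top-level column and swaps the segments that turn out to be map or list accesses for the `*` wildcard; only a few `DataType` variants are handled.

```rust
use arrow::datatypes::{DataType, Field};

/// Simplified sketch of the second pass: walk a dotted path against the
/// top-level column's type, keep struct subfield names, and replace segments
/// that are really map or list accesses with "*".
fn normalize_path(top_level: &Field, path: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut dt = top_level.data_type();
    for segment in path.split('.') {
        match dt {
            DataType::Struct(children) => {
                // A named subfield access: keep the name and descend into it.
                out.push(segment.to_string());
                if let Some(child) = children.iter().find(|c| c.name() == segment) {
                    dt = child.data_type();
                }
            }
            DataType::List(child) | DataType::LargeList(child) => {
                // A list element access: the "name" is an index, so wildcard it.
                out.push("*".to_string());
                dt = child.data_type();
            }
            DataType::Map(entries, _) => {
                // A map access: the "name" is a key value, so wildcard it and
                // descend into the map's value type.
                out.push("*".to_string());
                if let DataType::Struct(kv) = entries.data_type() {
                    if kv.len() == 2 {
                        dt = kv[1].data_type();
                    }
                }
            }
            // Anything else: keep the segment unchanged in this sketch.
            _ => out.push(segment.to_string()),
        }
    }
    out.join(".")
}
```

With this sketch, the `a["b"]["c"]` example above would come out as `*.c` when `b` resolves to a map access, matching the mapping described in item 3.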
4. We need to push these through the physical layer and finally implement the reading in the Parquet layer.
**Current problems / questions**
* Named fields in map key/value entries or list elements
  * In the Parquet schema, a list has a field inside it with a specific name. However, we don't have that name anywhere except at the Parquet layer. As far as we can tell, the Arrow reader loads these and replaces the names with standard names ("element" in the list case, "key_value.key" and "key_value.value" for maps). BUT, if the Arrow loader behavior changes, this functionality will break. At the moment we detect these hardcoded names and replace them with a wildcard "*" that signifies we don't care about the name. We also do this in the INPUT paths, because we don't have the actual names at that level. A sketch of this substitution follows below.
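
A minimal sketch of that hardcoded-name substitution (again not the PR's code, and covering only the standard names mentioned above):

```rust
/// The standard names the Arrow Parquet reader currently produces for list
/// elements and map entries, as described above. If the reader's behavior
/// ever changes, this check (and the real implementation) breaks.
fn is_standard_nested_name(name: &str) -> bool {
    matches!(name, "element" | "key_value" | "key" | "value")
}

/// Replace the standard container field names in a dotted path with "*",
/// e.g. "col.key_value.value.subfield" -> "col.*.*.subfield".
fn wildcard_standard_names(path: &str) -> String {
    path.split('.')
        .map(|seg| if is_standard_nested_name(seg) { "*" } else { seg })
        .collect::<Vec<_>>()
        .join(".")
}
```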
## Are these changes tested?
A version of this was tested on top of DataFusion 40; the rebase on main has not been tested. The PR is opened to support discussion, but the patch applied mostly cleanly apart from some refactorings in DataFusion.
## Are there any user-facing changes?
Possible API change for implementers of `TableProvider`
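
For reference, the compatibility idea behind `scan_deep` can be sketched as a defaulted trait method. The trait and signature below are invented for illustration only; DataFusion's real `TableProvider::scan` takes more arguments and returns an `ExecutionPlan`.

```rust
use std::collections::HashMap;

/// Stand-in for the real plan type (Arc<dyn ExecutionPlan> in DataFusion).
struct PlanSketch;

/// Sketch only, not the real DataFusion trait: it just illustrates how a new
/// `scan_deep` with a default body that delegates to `scan` avoids breaking
/// existing `TableProvider` implementations.
trait TableProviderSketch {
    fn scan(&self, projection: Option<&Vec<usize>>) -> PlanSketch;

    /// Providers that understand deep projections override this; everyone
    /// else silently falls back to the flat projection.
    fn scan_deep(
        &self,
        projection: Option<&Vec<usize>>,
        _projection_deep: Option<&HashMap<usize, Vec<String>>>,
    ) -> PlanSketch {
        self.scan(projection)
    }
}
```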