GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/11646
[SPARK-XXXX][SQL] Add a strategy for planning partitioned and bucketed scans of files
**I'll add the correct JIRA when JIRA comes back up...**
This PR adds a new strategy, `FileSourceStrategy`, that can be used for
planning scans of collections of files that might be partitioned or bucketed.
Compared with the existing planning logic in `DataSourceStrategy`, this
version has the following desirable properties:
- It removes the need to have `RDD`, `broadcastedHadoopConf`, and other
distributed concerns in the public API of
`org.apache.spark.sql.sources.FileFormat`
- Appending partition columns is delegated to the format, avoiding an extra
copy / devectorization step
- It minimizes the amount of data shipped to each executor (i.e. it does
not send the whole list of files to every worker in the form of a Hadoop
conf)
- It natively supports bucketing files into partitions, and thus does not
require coalescing or creating a `UnionRDD` with the correct partitioning.
- Small files are automatically coalesced into fewer tasks using an
approximate bin-packing algorithm.
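The bin-packing idea in the last bullet can be sketched as a greedy first-fit-decreasing pass. This is only an illustrative sketch (Spark's actual implementation is in Scala and its function names differ; `pack_files_into_tasks` and `max_bytes` are invented here): sort file sizes in descending order, then place each file into the first task whose running total still fits, opening a new task otherwise.

```python
# Hypothetical sketch of approximate bin-packing of files into scan tasks.
# Not Spark's actual code; names here are invented for illustration.
def pack_files_into_tasks(file_sizes, max_bytes):
    """Greedily group file sizes into tasks, keeping each task near max_bytes.

    Files larger than max_bytes still get a task of their own.
    """
    tasks = []  # each entry is [total_bytes, [file sizes assigned to this task]]
    for size in sorted(file_sizes, reverse=True):  # largest files first
        for task in tasks:
            if task[0] + size <= max_bytes:  # first task with room wins
                task[0] += size
                task[1].append(size)
                break
        else:
            tasks.append([size, [size]])  # no task had room: open a new one
    return [sizes for _, sizes in tasks]

# e.g. five files of mixed sizes, a 128 MB-style target per task
tasks = pack_files_into_tasks([10, 200, 30, 50, 100], 128)
```

First-fit-decreasing is a classic approximation for bin packing; it is not optimal, but it is cheap and keeps small files from each spawning their own task.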
Currently only a testing source is planned / tested using this strategy.
In follow-up PRs we will port the existing formats to this API.
A stub for `FileScanRDD` is also added, but most methods remain
unimplemented.
Other minor cleanups:
- Partition pruning is pushed into `FileCatalog` so both the new and old
code paths can use this logic. This will also allow future implementations to
use indexes or other tricks (e.g. a MySQL metastore)
- The partitions from the `FileCatalog` now propagate information about
file sizes all the way up to the planner so we can intelligently spread files
out.
- `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray`
calls
- Rename `Partition` to `PartitionDirectory` to differentiate the partitions
used earlier in pruning from those where the files and their sizes have
already been enumerated.
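The pruning cleanup above amounts to filtering partition directories by their partition-column values before any files inside them are listed. A minimal sketch, with invented names (`prune_partitions` and the `(values, path)` shape are illustrative only; the real `FileCatalog` is Scala and richer):

```python
# Hypothetical illustration of catalog-side partition pruning: directories
# whose partition-column values fail the predicate are dropped before any
# file listing happens. Names and shapes here are invented.
def prune_partitions(partitions, predicate):
    """Keep only partition directories whose column values satisfy predicate.

    partitions: list of (partition_values_dict, directory_path) pairs
    predicate:  function over the partition-column values
    """
    return [(values, path) for values, path in partitions
            if predicate(values)]

# A two-partition catalog laid out Hive-style as date=... directories
catalog = [
    ({"date": "2016-03-10"}, "/data/date=2016-03-10"),
    ({"date": "2016-03-11"}, "/data/date=2016-03-11"),
]
pruned = prune_partitions(catalog, lambda v: v["date"] == "2016-03-11")
```

Doing this in the catalog, rather than in each planning strategy, is what lets both code paths share the logic, and it is also the seam where an index or external metastore could answer the predicate instead of a directory scan.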
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark fileStrategy
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11646.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11646
----
commit 94cede673339bddc083b77a1d8e95e31015dcae3
Author: Michael Armbrust <[email protected]>
Date: 2016-03-09T18:28:42Z
WIP
commit 09df4c8f4ad9f9e83fbd9524afd8d49007ba12da
Author: Michael Armbrust <[email protected]>
Date: 2016-03-11T03:29:51Z
WIP
commit ef6ddaf1a8e0d44e10f3a04d8357f945639075cd
Author: Michael Armbrust <[email protected]>
Date: 2016-03-11T03:51:16Z
Merge remote-tracking branch 'apache/master' into fileStrategy
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
commit 9990d41a59cbe23b93a592b6248aed259513aac5
Author: Michael Armbrust <[email protected]>
Date: 2016-03-11T03:52:27Z
cleanup
commit 4f298458fee9e0ebe59b4b569921bc32d3e1d4b1
Author: Michael Armbrust <[email protected]>
Date: 2016-03-11T04:28:07Z
more cleanup
----