GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/11646

    [SPARK-XXXX][SQL] Add a strategy for planning partitioned and bucketed 
scans of files

    **I'll add the correct JIRA when JIRA comes back up...**
    
    This PR adds a new strategy, `FileSourceStrategy`, that can be used for 
planning scans of collections of files that might be partitioned or bucketed.
    
    Compared with the existing planning logic in `DataSourceStrategy`, this 
version has the following desirable properties:
     - It removes the need to have `RDD`, `broadcastedHadoopConf`, and other 
distributed concerns in the public API of 
`org.apache.spark.sql.sources.FileFormat`.
     - Partition column appending is delegated to the format to avoid an extra 
copy / devectorization when appending partition columns.
     - It minimizes the amount of data that is shipped to each executor (i.e. 
it does not send the whole list of files to every worker in the form of a 
Hadoop conf).
     - It natively supports bucketing files into partitions, and thus does not 
require coalescing / creating a `UnionRDD` with the correct partitioning.
     - Small files are automatically coalesced into fewer tasks using an 
approximate bin-packing algorithm.
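    To illustrate the last point, here is a rough, hypothetical sketch of the 
kind of greedy bin-packing involved. The names `FileSlice`, `binPack`, and 
`maxBytesPerTask` are illustrative stand-ins, not the actual API added in this 
PR:

```scala
// Hypothetical sketch of first-fit-decreasing bin-packing of files into tasks.
case class FileSlice(path: String, sizeInBytes: Long)

def binPack(files: Seq[FileSlice], maxBytesPerTask: Long): Seq[Seq[FileSlice]] = {
  val tasks = scala.collection.mutable.ArrayBuffer[Seq[FileSlice]]()
  val current = scala.collection.mutable.ArrayBuffer[FileSlice]()
  var currentSize = 0L
  // Sort descending so large files seed their own tasks (first-fit decreasing).
  files.sortBy(-_.sizeInBytes).foreach { file =>
    if (currentSize + file.sizeInBytes > maxBytesPerTask && current.nonEmpty) {
      // Current bucket is full: close it out and start a new task.
      tasks += current.toSeq
      current.clear()
      currentSize = 0L
    }
    current += file
    currentSize += file.sizeInBytes
  }
  if (current.nonEmpty) tasks += current.toSeq
  tasks.toSeq
}
```

    A greedy pass like this keeps each task near the target byte budget, so 
many small files collapse into a single task while a large file gets a task of 
its own.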
    
    Currently, only a testing source is planned / tested using this strategy. 
In follow-up PRs we will port the existing formats to this API.
    
    A stub for `FileScanRDD` is also added, but most methods remain 
unimplemented.
    
    Other minor cleanups:
     - Partition pruning is pushed into `FileCatalog` so both the new and old 
code paths can use this logic. This will also allow future implementations to 
use indexes or other tricks (e.g. a MySQL metastore).
     - The partitions from the `FileCatalog` now propagate information about 
file sizes all the way up to the planner so we can intelligently spread files 
out.
     - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` 
calls
     - Rename `Partition` to `PartitionDirectory` to differentiate partitions 
used earlier in pruning from those where we have already enumerated the files 
and their sizes.
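
    As a rough illustration of the pruning cleanup above, pruning filters 
partition directories by their partition values before any files are 
enumerated. This is a hypothetical sketch: `PartitionDirectory` here is a 
stand-in for the renamed type, and `prunePartitions` is an illustrative 
helper, not the real `FileCatalog` API:

```scala
// Illustrative only: a partition directory carries its partition-column values
// and its location; pruning keeps only directories whose values pass the predicate.
case class PartitionDirectory(values: Map[String, String], path: String)

def prunePartitions(
    partitions: Seq[PartitionDirectory],
    predicate: Map[String, String] => Boolean): Seq[PartitionDirectory] =
  partitions.filter(p => predicate(p.values))
```

    Because pruning happens on directory metadata alone, directories that fail 
the predicate never have their files listed or sized.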

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark fileStrategy

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11646.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11646
    
----
commit 94cede673339bddc083b77a1d8e95e31015dcae3
Author: Michael Armbrust <[email protected]>
Date:   2016-03-09T18:28:42Z

    WIP

commit 09df4c8f4ad9f9e83fbd9524afd8d49007ba12da
Author: Michael Armbrust <[email protected]>
Date:   2016-03-11T03:29:51Z

    WIP

commit ef6ddaf1a8e0d44e10f3a04d8357f945639075cd
Author: Michael Armbrust <[email protected]>
Date:   2016-03-11T03:51:16Z

    Merge remote-tracking branch 'apache/master' into fileStrategy
    
    Conflicts:
        sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

commit 9990d41a59cbe23b93a592b6248aed259513aac5
Author: Michael Armbrust <[email protected]>
Date:   2016-03-11T03:52:27Z

    cleanup

commit 4f298458fee9e0ebe59b4b569921bc32d3e1d4b1
Author: Michael Armbrust <[email protected]>
Date:   2016-03-11T04:28:07Z

    more cleanup

----

