Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8346#issuecomment-133547106
Tested it on a cluster using
```
val count = sqlContext.table("store_sales")
  .groupBy().count()
  .queryExecution.executedPlan(3)
  .execute().count
```
Basically, it reads 0 columns of table `store_sales`. My table has 1824
Parquet files with sizes from 80MB to 280MB (1 to 3 row groups each). Without
this patch, on a 16-worker cluster, the job had 5023 tasks and took 102s. With
this patch, the job had 2893 tasks and took 64s. It is still not as good as
using one mapper per file (1824 tasks and 42s), but it is much better than
current master.
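
For reference, here is a sketch of the same probe written out in full. It assumes a `SQLContext` named `sqlContext` and a registered table `store_sales`, and instead of hardcoding the scan node's position with `executedPlan(3)` it collects the leaf of the physical plan tree (the Parquet scan) and executes that node directly:

```scala
// Build the physical plan for a zero-column aggregate over the table.
val plan = sqlContext.table("store_sales")
  .groupBy().count()
  .queryExecution.executedPlan

// Leaves of the physical plan tree are the scan operators; picking the
// leaf avoids depending on the plan's exact shape (the (3) index above).
val scan = plan.collect { case p if p.children.isEmpty => p }.head

// Executing only the scan reads no columns; the number of tasks the job
// launches (visible in the Spark UI) shows how files were split/coalesced.
val rows = scan.execute().count()
```

The task counts quoted above come from running exactly this kind of scan-only job and reading the task total off the resulting stage.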