Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8346#issuecomment-133547106
Tested it on a cluster using
```
val count = sqlContext.table("store_sales")
  .groupBy().count()
  .queryExecution.executedPlan(3)
  .execute().count
```
Basically, it reads 0 columns of table `store_sales`. My table has 1824
Parquet files with sizes from 80MB to 280MB (1 to 3 row groups each). Without
this patch, on a 16-worker cluster, the job had 5023 tasks and took 102s. With
this patch, the job had 2893 tasks and took 64s. It is still not as good as
using one mapper per file (1824 tasks and 42s), but it is much better than
current master.
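
For reference, here is a sketch of the same probe written out in full. It assumes a `SQLContext` named `sqlContext` and a registered table `store_sales`, and instead of hardcoding the scan node's position with `executedPlan(3)` it collects the leaf of the physical plan tree (the Parquet scan) and executes that node directly:

```scala
// Build the physical plan for a zero-column aggregate over the table.
val plan = sqlContext.table("store_sales")
  .groupBy().count()
  .queryExecution.executedPlan

// Leaves of the physical plan tree are the scan operators; picking the
// leaf avoids depending on the plan's exact shape (the (3) index above).
val scan = plan.collect { case p if p.children.isEmpty => p }.head

// Executing only the scan reads no columns; the number of tasks the job
// launches (visible in the Spark UI) shows how files were split/coalesced.
val rows = scan.execute().count()
```

The task counts quoted above come from running exactly this kind of scan-only job and reading the task total off the resulting stage.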