@Cheng, Hao: Physical plans show that it got stuck scanning S3! (The table is partitioned by date_prefix and hour.)

explain select count(*) from test_table where date_prefix='20150819' and hour='00';

TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], value=[(count(1),mode=Partial,isDistinct=false)])
   Scan ParquetRelation[ .. <about 1000 partition paths go here> ]

Why does Spark have to scan all partitions when the query only concerns one partition? Doesn't that defeat the purpose of partitioning? Thanks!

On Thu, Aug 20, 2015 at 4:12 PM, Philip Weaver <philip.wea...@gmail.com> wrote:

> I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and
> I couldn't find much information about it online. What does it mean exactly
> to disable it? Are there any negative consequences to disabling it?
>
> On Wed, Aug 19, 2015 at 10:53 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
>> Can you do some more profiling? I am wondering if the driver is busy
>> scanning the HDFS / S3, e.g.:
>>
>> jstack <pid of driver process>
>>
>> Also, it would be great if you could paste the physical plan for the
>> simple query.
>>
>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>> *Sent:* Thursday, August 20, 2015 1:46 PM
>> *To:* Cheng, Hao
>> *Cc:* Philip Weaver; user
>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions
>>
>> I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple
>> of CLs trying to speed up Spark SQL on tables with a huge number of
>> partitions. I've made sure those CLs are included, but it's still very
>> slow.
>>
>> On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>> Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to
>> false.
>>
>> BTW, which version are you using?
>>
>> Hao
>>
>> *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]
>> *Sent:* Thursday, August 20, 2015 12:16 PM
>> *To:* Philip Weaver
>> *Cc:* user
>> *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions
>>
>> I guess the question is: why does Spark have to do partition discovery
>> over all partitions when the query only needs to look at one partition?
>> Is there a conf flag to turn this off?
>>
>> On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver <philip.wea...@gmail.com> wrote:
>>
>> I've had the same problem. It turns out that Spark (specifically Parquet)
>> is very slow at partition discovery. It got better in 1.5 (not yet
>> released), but was still unacceptably slow. Sadly, we ended up reading
>> Parquet files manually in Python (via C++) and had to abandon Spark SQL
>> because of this problem.
>>
>> On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I did a simple experiment with Spark SQL. I created a partitioned Parquet
>> table with only one partition (date=20140701). A simple `select count(*)
>> from table where date=20140701` would run very fast (0.1 seconds).
>> However, as I added more partitions, the query took longer and longer.
>> When I added about 10,000 partitions, the query took way too long. I feel
>> like querying a single partition should not be affected by having more
>> partitions. Is this a known behaviour? What does Spark try to do here?
>>
>> Thanks,
>> Jerrick
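To illustrate what the thread is asking for, here is a minimal, Spark-free Python sketch of partition pruning over Hive-style `key=value` directory names (the bucket name, function names, and partition values are all hypothetical). The point is that when the filter is evaluated against values parsed from the path itself, only the matching directory ever needs to be listed or scanned, so query cost should not grow with the total partition count:

```python
import re


def prune_partitions(paths, predicate):
    """Keep only partition paths whose key=value components satisfy the predicate.

    A toy model of partition pruning: the filter is checked against
    partition-column values parsed from directory names, before any
    file listing or scanning happens.
    """
    pruned = []
    for path in paths:
        # Parse Hive-style "key=value" path components, e.g.
        # ".../date_prefix=20150819/hour=00" -> {"date_prefix": "20150819", "hour": "00"}
        values = dict(re.findall(r"([^/=]+)=([^/]+)", path))
        if predicate(values):
            pruned.append(path)
    return pruned


# ~10,000 hourly partitions, as in the experiment above (synthetic values,
# not real dates -- just enough to make the path count realistic).
paths = [
    f"s3://bucket/test_table/date_prefix={d}/hour={h:02d}"
    for d in range(20150101, 20150518)
    for h in range(24)
]

# WHERE date_prefix='20150819' AND hour='00' should touch exactly one path.
hit = prune_partitions(
    paths + ["s3://bucket/test_table/date_prefix=20150819/hour=00"],
    lambda v: v.get("date_prefix") == "20150819" and v.get("hour") == "00",
)
print(hit)
```

The slowness discussed above is the step this sketch skips: before pruning can happen, partition *discovery* walks every directory to infer the partition columns, which is why cost grows with total partitions even for a single-partition query.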