Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at Databricks to investigate.
https://issues.apache.org/jira/browse/SPARK-2443

On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Michael,
>
> Yes, the table is partitioned on one column. There are 11 columns in the
> table, and they are all of String type.
>
> I understand that SerDes contributes some overhead, but using pure Hive we
> could run the query about 5 times faster than with Spark SQL. Given that
> Hive has the same SerDes overhead, there must be something additional that
> Spark SQL adds to the overall overhead that Hive doesn't have.
>
> Best Regards,
>
> Jerry
>
>
> On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> For the curious mind, the dataset is about 200-300 GB and we are using
>>> 10 machines for this benchmark. Given that the environment is the same
>>> in both experiments, why is pure Spark faster than Spark SQL?
>>
>> There is going to be some overhead to parsing data using the Hive SerDes
>> instead of the native Spark code; however, the slowdown you are seeing
>> here is much larger than I would expect. Can you tell me more about the
>> table? What does the schema look like? Is it partitioned?
>>
>>> By the way, I also tried hql("select * from m").count. It is terribly
>>> slow too.
>>
>> FYI, this query is actually identical to the one where you write out
>> COUNT(*).
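For context, the equivalence Michael mentions can be sketched as follows. This is a minimal sketch assuming the Spark 1.0-era HiveContext API from the thread and the table `m` that Jerry refers to; the SparkContext settings are placeholders:

```scala
// Assumes a Spark 1.0-style deployment with Hive support compiled in,
// and an existing Hive table "m" (from the thread above).
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "count-equivalence")
val hiveContext = new HiveContext(sc)
import hiveContext._

// Counting the rows of a full scan on the Spark side...
val n1: Long = hql("SELECT * FROM m").count()

// ...and expressing the count in the query itself. Spark SQL plans both
// as a count over the table, so they should perform the same.
val n2 = hql("SELECT COUNT(*) FROM m").collect().head
```

In other words, calling `.count` on the result of `select *` does not fetch every column to the driver first; the optimizer reduces both forms to the same counting plan, which is why Michael says the two queries are identical.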