On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote:

> For the curious mind, the dataset is about 200-300GB and we are using 10
> machines for this benchmark. Given the env is equal between the two
> experiments, why pure spark is faster than SparkSQL?
There is going to be some overhead to parsing data using the Hive SerDes instead of the native Spark code; however, the slowdown you are seeing here is much larger than I would expect. Can you tell me more about the table? What does the schema look like? Is it partitioned?

> By the way, I also try hql("select * from m").count. It is terribly slow
> too.

FYI, this query is actually identical to the one where you write out COUNT(*).
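
For reference, the two forms being compared could be written roughly as below. This is only a sketch, assuming a Spark 1.0-era `HiveContext` (where `hql` is available) and an existing Hive table named `m`; the object name and application name are made up for illustration, and the job is assumed to be launched with spark-submit so that the master is set externally.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver program; assumes a Hive table `m` already exists.
object CountComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CountComparison"))
    val hiveContext = new HiveContext(sc)

    // Form 1: call count() on the result of SELECT *.
    val viaCount: Long = hiveContext.hql("SELECT * FROM m").count()

    // Form 2: write out COUNT(*) and read the single bigint value
    // from the one row that comes back.
    val viaSql: Long =
      hiveContext.hql("SELECT COUNT(*) FROM m").collect().head.getLong(0)

    println(s"count() = $viaCount, COUNT(*) = $viaSql")
    sc.stop()
  }
}
```

Both forms should report the same number, so timing them separately is not expected to isolate anything beyond the cost of scanning and deserializing the table.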