Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry, When you ran these queries using different methods, did you see any discrepancy in the returned results (i.e. the counts)? On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust wrote: > Yeah, sorry. I think you are seeing some weirdness with partitioned tables > that I have also seen els

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at Databricks to investigate. https://issues.apache.org/jira/browse/SPARK-2443 On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam wrote: > Hi Michael, >

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Michael, Yes, the table is partitioned on 1 column. There are 11 columns in the table and they are all String type. I understand that SerDes contribute some overhead, but using pure Hive, we could run the query about 5 times faster than SparkSQL. Given that Hive also has the same SerDes ove

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam wrote: > > For the curious mind, the dataset is about 200-300GB and we are using 10 > machines for this benchmark. Given the env is equal between the two > experiments, why is pure Spark faster than SparkSQL? > There is going to be some overhead to parsi

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam wrote: > By the way, I also tried hql("select * from m").count. It is terribly slow > too. > > >

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql("select * from m").count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam wrote: > Hi Spark users and developers, > > I'm doing some simple benchmarks with my team and we found a potential > performance issue using Hive via SparkSQL. It is very