Hey Jerry,
When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?
On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust
wrote:
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that I have also seen elsewhere. I've created a JIRA and assigned someone
at databricks to investigate.
https://issues.apache.org/jira/browse/SPARK-2443
On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam wrote:
Hi Michael,
Yes, the table is partitioned on one column. There are 11 columns in the table, all of String type.
I understand that SerDes contribute some overhead, but using pure Hive we could run the query about 5 times faster than SparkSQL. Given that Hive also has the same SerDes overhead.
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam wrote:
>
> For the curious mind, the dataset is about 200-300GB and we are using 10
> machines for this benchmark. Given the env is equal between the two
> experiments, why pure spark is faster than SparkSQL?
>
There is going to be some overhead to parsing.
Hi Spark users,
To put the performance issue into perspective, we also ran the query directly on Hive. It took about 5 minutes to run.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam wrote:
By the way, I also tried hql("select * from m").count. It is terribly slow too.
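For anyone reproducing this, the two code paths being compared in this thread look roughly like the sketch below. It assumes a Spark 1.0-era spark-shell built with Hive support (`sc` is the shell's SparkContext); the table name `m` comes from the thread, while the warehouse path is hypothetical and would need to match your metastore's location.

```scala
// Sketch of the two approaches compared in this benchmark thread.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// SparkSQL path: resolves the table via the Hive metastore and
// deserializes every row through the Hive SerDe before counting.
val sqlCount = hiveContext.hql("select * from m").count()

// "Pure Spark" path: read the partition files directly, skipping
// the metastore and SerDe layer (path is hypothetical).
val rawCount = sc.textFile("hdfs:///user/hive/warehouse/m").count()

println(s"SparkSQL count: $sqlCount, raw file count: $rawCount")
```

Note the comparison is only apples-to-apples if each record is one line in the underlying text files; for a plain text-format Hive table that holds, which is what makes the 5x gap reported above surprising.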
On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam wrote:
> Hi Spark users and developers,
>
> I'm doing some simple benchmarks with my team and we found out a potential
> performance issue using Hive via SparkSQL. It is very