Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that I have also seen elsewhere. I've created a JIRA and assigned someone
at Databricks to investigate.

https://issues.apache.org/jira/browse/SPARK-2443


On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Michael,
>
> Yes, the table is partitioned on one column. There are 11 columns in the
> table, and they are all String type.
>
> I understand that the SerDe contributes some overhead, but using pure
> Hive we could run the query about 5 times faster than Spark SQL. Since
> Hive incurs the same SerDe overhead, there must be something additional
> that Spark SQL adds to the overall overhead that Hive doesn't have.
>
> Best Regards,
>
> Jerry
>
>
>
> On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> For the curious mind: the dataset is about 200-300 GB and we are using 10
>>> machines for this benchmark. Given that the environment is identical
>>> between the two experiments, why is pure Spark faster than Spark SQL?
>>>
>>
>> There is going to be some overhead to parsing data using the Hive SerDes
>> instead of the native Spark code; however, the slowdown you are seeing
>> here is much larger than I would expect. Can you tell me more about the
>> table?  What does the schema look like?  Is it partitioned?
>>
>> By the way, I also tried hql("select * from m").count. It is terribly slow
>>> too.
>>
>>
>> FYI, this query is actually identical to the one where you write out
>> COUNT(*).
>>
>
>
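A minimal sketch of the equivalence Michael describes, using the Spark
1.0-era HiveContext API. The table name `m` comes from the thread; the
local context setup is an assumption for illustration, and actually
running this requires a Spark deployment built with Hive support:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Assumed local setup for illustration; in the benchmark above this
// would be a 10-machine cluster.
val sc = new SparkContext("local[*]", "count-equivalence-sketch")
val hiveContext = new HiveContext(sc)

// Counting the rows of a full-table query...
val viaRddCount = hiveContext.hql("SELECT * FROM m").count()

// ...should give the same result as an explicit COUNT(*) aggregate:
val viaCountStar =
  hiveContext.hql("SELECT COUNT(*) FROM m").collect().head.getLong(0)

// Both compute the same row count, and neither avoids the SerDe cost
// of deserializing the table's partitions.
```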
