Re: Spark performance for small queries

Saumitra Shahapure (Vizury) Fri, 23 Jan 2015 02:30:07 -0800

Hi Gopal,

Thanks for the informative answer, but my question was about the difference
in how Spark SQL and Hive process this query. Right now I am not trying to
optimize either. I totally agree that Hive can perform much better than
the numbers I got.

I was just wondering, given that both systems should generate quite similar
execution plans for this query, what exactly is making the difference. My
question comes from wanting to understand both systems.

My answers to your questions are inline.

--
Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 5:01 AM, Gopal V <gop...@apache.org> wrote:

> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>
>> We were comparing performance of some of our production hive queries
>> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
>> Spark 0.9 and 1.1. We could see that the performance gains have been good
>> in Spark.
>>
>
> Is there any particular reason you are using an ancient & slow Hadoop-1.x
> version instead of a modern YARN 2.0 cluster?

The cluster I was experimenting on is a legacy cluster in our system. We
are already in the process of migrating everything from it to Hadoop 2.

>
>
>> We tried a very simple query,
>> select count(*) from T where col3=123
>> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
>> performance was 2x better than Hive (60sec for Spark vs 120sec for Hive).
>> Table T is stored in S3 as a single 600MB GZIP file.
>>
>
> Not sure if you understand that what you're doing is one of the worst
> cases for both the platforms.
>
> Using a big single gzip file is like a massive anti-pattern.
>
> I'm assuming what you want is fast SQL in Hive (since this is the hive
> list) along with all the other lead/lag functions there.
>
> You need a SQL oriented columnar format like ORC, mix with YARN and add
> Tez, that is going to be somewhere near 10-12 seconds.
>
> Oh, and that's a ball-park figure for a single node.
>

Agreed on that point as well. Smaller gzipped files or uncompressed files
will give better performance. This specific query is just a fairly rare case
that one of our jobs encounters sometimes.
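
For illustration, here is a minimal Python sketch of the "smaller gzipped
files" option, assuming line-oriented text data on a local disk. The function
name split_gzip, the part-NNNNN.gz naming and the lines_per_part default are
hypothetical choices for this example, not something either system provides.

```python
import gzip
import os


def split_gzip(src_path, out_dir, lines_per_part=100_000):
    """Rewrite one gzip file as several smaller gzip files.

    A gzip stream can only be decompressed from its start, so a single
    large .gz file is read by one task; several smaller files can each
    be handed to a separate task.
    (Hypothetical helper: names and defaults are assumptions.)
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    # Binary mode keeps the original bytes and line endings untouched.
    with gzip.open(src_path, "rb") as src:
        for i, line in enumerate(src):
            if i % lines_per_part == 0:
                # Start a new part every lines_per_part lines.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(part_paths):05d}.gz")
                part_paths.append(path)
                out = gzip.open(path, "wb")
            out.write(line)
    if out is not None:
        out.close()
    return part_paths
```

Calling split_gzip("T.gz", "T_parts") would write T_parts/part-00000.gz,
part-00001.gz and so on; the line count per part would need tuning so that
each part is a reasonable size for one task.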

>
> Cheers,
> Gopal
>
