Re: Low Performance of Shark over Spark.

Xiangrui Meng Thu, 07 Aug 2014 10:37:19 -0700

Did you cache the table? There are a couple of ways to cache a table in
Shark; see https://github.com/amplab/shark/wiki/Shark-User-Guide
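
As a rough sketch of the two approaches described in that guide (the table
names month_data_cached and month_data are hypothetical, and the exact
property values accepted may differ between Shark versions, so please check
the guide for your release):

```sql
-- Option 1: a table whose name ends in "_cached" is kept in memory by Shark.
-- month_data_cached is a hypothetical name chosen for this illustration.
CREATE TABLE month_data_cached AS
SELECT field1, field2, field3, field10
FROM raw_table
WHERE year=2000 AND month=01;

-- Option 2: request caching explicitly through the shark.cache table property.
-- month_data is likewise a hypothetical name.
CREATE TABLE month_data TBLPROPERTIES ("shark.cache" = "true") AS
SELECT field1, field2, field3, field10
FROM raw_table
WHERE year=2000 AND month=01;

-- Later ad hoc queries then read from the in-memory copy instead of HDFS.
SELECT field1, field2, field3
FROM month_data_cached
WHERE field10 > <some_value>;
```

The first load still has to scan the month's data from HDFS, so any speedup
would show up only on the queries that run against the cached table
afterwards.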


On Thu, Aug 7, 2014 at 6:51 AM,  <vinay.kash...@socialinfra.net> wrote:
> Dear all,
>
> I am using Spark 0.9.2 in standalone mode, with Hive and HDFS from CDH 5.1.0.
>
> There are 6 worker nodes, each with 96 GB of memory and 32 cores.
>
> I am using Shark Shell to execute queries on Spark.
>
> I have a raw_table (3 TB in size, with a replication factor of 3) which is
> partitioned by year, month and day. I am running an ad hoc query on one month
> of data with some condition.
>
> For example:
>
> CREATE TABLE temp_table AS SELECT field1,field2,field3 FROM raw_table WHERE
> year=2000 AND month=01 AND field10 > <some_value>;
>
> It is claimed that the same Hive queries can run 100x faster with Shark, but
> I don't see such a significant improvement when running the above query.
>
> I am getting almost the same performance as when it is run in Hive, which is
> around 45 seconds.
>
> The same query with Impala takes much less time: around 6 seconds, which is
> roughly 7 times faster than Shark. I have tried altering the parameters
> below for the Spark jobs but did not see any difference.
>
> spark.local.dir
> spark.serializer
> spark.kryoserializer.buffer.mb
> spark.storage.memoryFraction
> spark.io.compression.codec
> spark.default.parallelism
>
> Any suggestions on how I can improve the performance of the query with
> Shark over Spark and make it comparable to Impala?
>
>
>
> Thanks and regards
>
> Vinay Kashyap
