Did you cache the table? There are a couple of ways to cache a table in Shark; see the user guide: https://github.com/amplab/shark/wiki/Shark-User-Guide
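For example, here is a rough sketch of the two approaches as I remember them from the user guide, applied to your query. The table names temp_table_cached and temp_table_mem are only illustrative, and <some_value> is the placeholder from your mail, so please check the guide for the exact syntax on your Shark version:

```sql
-- Option 1: give the table a name ending in "_cached";
-- Shark keeps such tables in memory.
-- (temp_table_cached is a hypothetical name.)
CREATE TABLE temp_table_cached AS
SELECT field1, field2, field3
FROM raw_table
WHERE year = 2000 AND month = 01 AND field10 > <some_value>;

-- Option 2: set the shark.cache table property explicitly.
-- (temp_table_mem is a hypothetical name.)
CREATE TABLE temp_table_mem TBLPROPERTIES ("shark.cache" = "true") AS
SELECT field1, field2, field3
FROM raw_table
WHERE year = 2000 AND month = 01 AND field10 > <some_value>;
```

Note that the CREATE TABLE statement itself still has to read the month of data from HDFS, so I would expect the benefit to show up on the queries you run against the cached table afterwards rather than on this first scan.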
On Thu, Aug 7, 2014 at 6:51 AM, <vinay.kash...@socialinfra.net> wrote:
> Dear all,
>
> I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0.
>
> 6 worker nodes each with memory 96GB and 32 cores.
>
> I am using Shark Shell to execute queries on Spark.
>
> I have a raw_table ( of size 3TB with replication 3 ) which is partitioned
> by year, month and day. I am running an adhoc query on one month data with
> some condition.
>
> For eg:
>
> CREATE TABLE temp_table AS SELECT field1,field2,field3 FROM raw_table WHERE
> year=2000 AND month=01 AND field10 > <some_value>;
>
> It is claimed that the same Hive queries can run 100x faster with shark, but
> I don't see such a significant improvement when running the above query,
>
> I am getting almost same performance as when run in Hive which is around 45
> seconds.
>
> The same query with Impala, takes very less time, almost 7 times less time
> than shark which is around 6 seconds. I have tried altering the below
> parameters for the spark jobs but did not see any difference.
>
> spark.local.dir
> spark.serializer
> spark.kryoserializer.buffer.mb
> spark.storage.memoryFraction
> spark.io.compression.codec
> spark.default.parallelism
>
> Any suggestions so that I can improve the performance of the query with
> Shark over Spark and make it comparable to Impala..??
>
> Thanks and regards
>
> Vinay Kashyap