Hi Vinay,

First of all, you should probably migrate to Spark SQL, as Shark is no longer actively maintained. The claimed 100x speedup comes from in-memory caching and the DAG execution engine; since you are not able to cache your data, performance can be quite low. Two alternatives you can explore:

1. Use Parquet as the storage format, which pushes down predicates and can give much better scan performance (similar to Impala).
2. Cache data from Hive at the partition level and run your queries against the cached partitions instead.
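For option 2, a rough HiveQL sketch (the table and field names are taken from the query in your mail below; `raw_month_cached` is a hypothetical name, and Shark caches tables whose names end in "_cached" per the Shark User Guide):

```sql
-- Sketch only: materialize one month of data as a Shark in-memory table.
-- Shark treats tables whose names end in "_cached" as cached tables.
CREATE TABLE raw_month_cached AS
SELECT field1, field2, field3, field10
FROM raw_table
WHERE year = 2000 AND month = 01;

-- Subsequent ad hoc queries on that month then hit the in-memory copy:
SELECT field1, field2, field3
FROM raw_month_cached
WHERE field10 > <some_value>;
```

Whether a full month fits in memory depends on its size after column pruning versus your cluster memory (6 x 96GB here), so you may need to cache at a finer grain, e.g. per day.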
Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Fri, Aug 8, 2014 at 10:44 AM, <vinay.kash...@socialinfra.net> wrote:
> Hi Meng,
>
> I cannot use a cached table in this case as the data size is quite huge.
>
> Also, as I am trying to run ad hoc queries, I cannot keep the table cached.
> I can cache the table only when the types of queries are fixed and run
> against a specific set of data.
>
> Thanks and regards
>
> Vinay Kashyap
>
> ________________________________________________
> From: "Xiangrui Meng" <men...@gmail.com>
> Sent: vinay.kash...@socialinfra.net
> Cc: "user@spark.apache.org"
> Date: Thu, August 7, 2014 11:06 pm
> Subject: Re: Low Performance of Shark over Spark.
>
> > Did you cache the table? There are a couple of ways of caching a table in
> > Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide
> >
> > On Thu, Aug 7, 2014 at 6:51 AM, <vinay.kash...@socialinfra.net> wrote:
> >> Dear all,
> >>
> >> I am using Spark 0.9.2 in Standalone mode, with Hive and HDFS from CDH 5.1.0.
> >>
> >> There are 6 worker nodes, each with 96GB of memory and 32 cores.
> >>
> >> I am using the Shark shell to execute queries on Spark.
> >>
> >> I have a raw_table (of size 3TB with replication 3) which is partitioned
> >> by year, month and day. I am running an ad hoc query on one month of data
> >> with some condition.
> >>
> >> For example:
> >>
> >> CREATE TABLE temp_table AS SELECT field1,field2,field3 FROM raw_table
> >> WHERE year=2000 AND month=01 AND field10 > <some_value>;
> >>
> >> It is claimed that the same Hive queries can run 100x faster with Shark,
> >> but I don't see such a significant improvement when running the above
> >> query. I am getting almost the same performance as in Hive, which is
> >> around 45 seconds.
> >>
> >> The same query with Impala takes much less time, around 6 seconds, which
> >> is almost 7 times faster than Shark.
> >> I have tried altering the below parameters for the Spark jobs but did
> >> not see any difference:
> >>
> >> spark.local.dir
> >> spark.serializer
> >> spark.kryoserializer.buffer.mb
> >> spark.storage.memoryFraction
> >> spark.io.compression.codec
> >> spark.default.parallelism
> >>
> >> Any suggestions so that I can improve the performance of the query with
> >> Shark over Spark and make it comparable to Impala?
> >>
> >> Thanks and regards
> >>
> >> Vinay Kashyap
>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org