Hi Yana,
I noticed GC happening in every executor, taking around 400 ms on
average. Do you think this has a major impact on the overall query time?
And regarding the memory for a single worker,
I have tried distributing the memory by increasing the number of workers
per node and dividing the total memory among them.
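A minimal sketch for quantifying this, assuming GC logging via executor JVM
flags (spark.executor.extraJavaOptions is a Spark 1.0+ property; on 0.9.x the
same flags would go into SPARK_JAVA_OPTS in spark-env.sh):

    // turn on verbose GC logging in every executor JVM so pause times
    // show up in the executor stderr logs
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("gc-check")
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)

Comparing the logged pauses against task durations in the web UI shows
whether the 400 ms average actually dominates the query time.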
Hi Mayur,
I cannot use Spark SQL in this case because many of the aggregation
functions are not supported yet. Hence I migrated back to Shark, where all
of those aggregation functions are supported.
apache-spark-user-list.1001560.n3.nabble.com/Support-for-Percentile-and-Variance-Aggregation-functions-in-Spar
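One possible workaround, assuming an upgrade to Spark 1.0+ is an option:
HiveContext executes Hive's own UDAFs, which may cover some of the missing
aggregates. A minimal sketch (hql was the Spark 1.0 entry point; the table
and column names are placeholders, and coverage of every needed function is
untested):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-udaf-check"))
    // percentile/variance here are resolved by Hive itself, so Spark SQL's
    // native aggregate coverage is not the limiting factor
    val hiveContext = new HiveContext(sc)
    val result = hiveContext.hql(
      "SELECT percentile(latency, 0.95), variance(latency) FROM metrics")
    result.collect().foreach(println)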
Hi Vinay,
First of all, you should probably migrate to Spark SQL, as Shark is no
longer actively maintained.
The 100x benefit comes from in-memory caching and DAG-based execution;
since you are not able to cache, performance can be quite low.
Alternatives you can explore:
1. Use Parquet as the storage format, which will push down column pruning
and predicate filters to the storage layer; see the sketch below.
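A minimal sketch of the Parquet route (Spark 1.0-era API: parquetFile and
registerAsTable were the SchemaRDD names at the time; the table and path
names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-route"))
    val sqlContext = new SQLContext(sc)

    // Parquet is columnar: a query touching two columns reads only those
    // two column chunks from disk, which helps when caching is not viable
    val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
    events.registerAsTable("events")
    val counts = sqlContext.sql(
      "SELECT col1, COUNT(*) FROM events GROUP BY col1")
    counts.collect().foreach(println)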
Hi Meng,
I cannot use a cached table in this case, as the data size is quite huge.
Also, since I am running ad-hoc queries, I cannot keep the table cached.
Caching only helps when the set of queries is fixed and runs over a
specific subset of the data.
Thanks and regards
Vin
Did you cache the table? There are a couple of ways to cache a table in
Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide
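For example, a minimal sketch assuming Shark's programmatic API (SharkEnv /
runSql, as shown in the Shark README; exact method names vary across Shark
versions, and the table names are placeholders):

    import shark.SharkEnv

    // a table whose name ends in _cached is kept in memory by convention;
    // the shark.cache table property is the explicit equivalent
    val sharkCtx = SharkEnv.initWithSharkContext("caching-example")
    sharkCtx.runSql("CREATE TABLE events_cached AS SELECT * FROM events")
    sharkCtx.runSql(
      "CREATE TABLE events_mem TBLPROPERTIES ('shark.cache' = 'true') " +
      "AS SELECT * FROM events")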
On Thu, Aug 7, 2014 at 6:51 AM, wrote:
> Dear all,
>
> I am using Spark 0.9.2 in Standalone mode, with Hive and HDFS from CDH 5.1.0.
>
> 6 worker nodes, each with 96 GB of memory