I think this is a rather simplistic view. All these tools do their computation in memory in the end. For certain types of computation and usage patterns it makes sense to keep the data in memory. For example, most machine-learning approaches need to feed the same data into several iterative calculations; this is what Spark has been designed for. A second pattern is aggregations/precalculations that just pass over the data in memory once; here Hive+Tez, and to a limited extent Spark, can help. The third pattern, where users interactively query the data, i.e. many concurrent users querying the same or similar data very frequently, is addressed by Hive on Tez + LLAP, Hive on Tez + Ignite, or Spark + Ignite (and there are other tools).
So it is important to understand what your users want to do. Then, there is a lot of benchmark data on the web to compare. However, I always recommend generating or using data yourself that matches the data your company is actually using. Keep in mind also that time is needed to convert this data into an efficient format.

> On 10 Feb 2017, at 20:36, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>
> Folks,
>
> I'm embarking on a project to build a POC around Spark SQL. I was wondering
> if anyone has experience comparing Spark SQL with Hive or interactive Hive,
> and has data points on the types of queries suited for each. I am naively
> assuming that Spark SQL will beat Hive in all queries, given that computations
> are mostly done in memory, but I want to hear some more data points on
> queries that may be problematic in Spark SQL. Also, are there debugging tools
> people ordinarily use with Spark SQL to troubleshoot perf-related issues?
>
> I look forward to hearing from the community.
>
> Regards