Re: Spark performance for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hi Gopal, Thanks for the informative answer, but my question was about the difference in processing between Spark SQL and Hive. Right now I am not trying to optimize either. I totally agree that Hive can perform much better than the number I got. I was just wondering, even though both systems would

Re: Spark performance for small queries

2015-01-22 Thread Gopal V
On 1/22/15, 4:36 PM, chandra Reddy Bogala wrote: My question is related to GZIP files. I am sure a single GZIP file is an anti-pattern. Are small gzip files (20 to 50 MB) also an anti-pattern? The reason I am asking this question is that my application collectors generate gzip files of that size. So

Re: Spark performance for small queries

2015-01-22 Thread chandra Reddy Bogala
Hi Gopal, My question is related to GZIP files. I am sure a single GZIP file is an anti-pattern. Are small gzip files (20 to 50 MB) also an anti-pattern? The reason I am asking this question is that my application collectors generate gzip files of that size. So I copy those to HDFS and add as a partition

Re: Spark performance for small queries

2015-01-22 Thread Gopal V
On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote: We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. Is there an

Re: Spark performance for small queries

2015-01-22 Thread sjayatheertha
I'm not answering your question, but could you give me more insight into where and how you use Spark? I know that Spark has in-memory capabilities. Also, I have a similar question on ways to optimize Hive queries and file storage. Which is better, ORC or Parquet, along with when to use compressi

Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where col