Hi Gopal,

My question is about GZIP files. I understand that a single large GZIP file is an anti-pattern, but are small gzip files (20 to 50 MB) also an anti-pattern? I ask because my application collectors generate gzip files of that size. I copy them to HDFS, add them as partitions to Hive tables, and run queries every 15 minutes. In hourly jobs I convert the data to ORC with aggregations (rough sketch below). Two reasons I continue to use gzip files:

1) I don't know of a way (and there may not be one) to convert my CSV files to ORC on the client (collector) side; I can only do the conversion with MR/Hive.
2) Because these are small gzips, each file is assigned to one mapper, so the data per mapper is close to the split size.
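Concretely, the 15-minute and hourly steps look roughly like this (table, column, and path names below are only placeholders, not my real schema):

    -- every 15 minutes: register the newly copied gzip files as a partition
    ALTER TABLE raw_events ADD IF NOT EXISTS
      PARTITION (dt='2015-01-23', hr='05', qtr='15')
      LOCATION '/data/raw_events/dt=2015-01-23/hr=05/qtr=15';

    -- hourly: rewrite/aggregate the raw gzipped text data into an ORC table
    INSERT OVERWRITE TABLE events_orc PARTITION (dt='2015-01-23', hr='05')
    SELECT col1, col2, count(*) AS cnt
    FROM raw_events
    WHERE dt='2015-01-23' AND hr='05'
    GROUP BY col1, col2;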
Thanks,
Chandra

On Fri, Jan 23, 2015 at 5:01 AM, Gopal V <gop...@apache.org> wrote:
> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>
>> We were comparing performance of some of our production hive queries
>> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
>> Spark 0.9 and 1.1. We could see that the performance gains have been good
>> in Spark.
>>
>
> Is there any particular reason you are using an ancient & slow Hadoop-1.x
> version instead of a modern YARN 2.0 cluster?
>
>> We tried a very simple query,
>> select count(*) from T where col3=123
>> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
>> performance had been 2x better than Hive (120sec vs 60sec). Table T is
>> stored in S3 and contains 600MB single GZIP file.
>>
>
> Not sure if you understand that what you're doing is one of the worst
> cases for both the platforms.
>
> Using a big single gzip file is like a massive anti-pattern.
>
> I'm assuming what you want is fast SQL in Hive (since this is the hive
> list) along with all the other lead/lag functions there.
>
> You need a SQL oriented columnar format like ORC, mix with YARN and add
> Tez, that is going to be somewhere near 10-12 seconds.
>
> Oh, and that's a ball-park figure for a single node.
>
> Cheers,
> Gopal
>
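A minimal sketch of the ORC-plus-Tez setup Gopal describes might look like this in Hive (the table names and columns are illustrative assumptions, not taken from the thread):

    -- run the query on Tez instead of plain MapReduce
    -- (assumes Tez is installed on the YARN cluster)
    SET hive.execution.engine=tez;

    -- keep the data in a columnar ORC table instead of one big gzipped text file
    CREATE TABLE t_orc (col1 string, col2 string, col3 int)
    STORED AS ORC;

    INSERT OVERWRITE TABLE t_orc
    SELECT col1, col2, col3 FROM t_gzip_text;

    -- the same filter query now reads only the columns it needs from ORC
    SELECT count(*) FROM t_orc WHERE col3 = 123;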