Hi Gopal,
      My question is related to GZIP files. I am sure a single large GZIP
file is an anti-pattern. Are small gzip files (20 to 50 MB) also an
anti-pattern? The reason I am asking is that my application collectors
generate gzip files of that size. I copy those to HDFS, add them as
partitions to Hive tables, and run queries every 15 minutes. In hourly
jobs, I convert to ORC with aggregations.
There are two reasons I continue to use gzip files: 1) I don't know of a
way (or there may be no way) to convert my CSV files to ORC on the client
(collector) side; I can only do the conversion with MR/Hive. 2) Because
these are small gzips, each file is allocated to one mapper, so the data
per mapper is close to the split size.
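
For context, the hourly conversion job is roughly the shape below. This is
only a sketch; the table, column, and path names (events_csv, events_orc,
ts, col3) are simplified placeholders, not my real schema:

  -- external table over the small gzip CSV files, partitioned by hour
  -- (Hive reads .gz text files transparently)
  CREATE EXTERNAL TABLE events_csv (ts STRING, col3 INT, payload STRING)
    PARTITIONED BY (dt STRING, hr STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

  -- every 15 min: new gzips land in HDFS and get registered as a partition
  ALTER TABLE events_csv ADD PARTITION (dt='2015-01-23', hr='10')
    LOCATION '/data/events/2015-01-23/10';

  -- hourly: rewrite the hour into ORC with aggregation
  CREATE TABLE events_orc (ts STRING, col3 INT, cnt BIGINT)
    PARTITIONED BY (dt STRING, hr STRING)
    STORED AS ORC;

  INSERT OVERWRITE TABLE events_orc PARTITION (dt='2015-01-23', hr='10')
  SELECT ts, col3, count(*)
  FROM events_csv
  WHERE dt='2015-01-23' AND hr='10'
  GROUP BY ts, col3;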

Thanks,
Chandra

On Fri, Jan 23, 2015 at 5:01 AM, Gopal V <gop...@apache.org> wrote:

> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>
>> We were comparing the performance of some of our production Hive queries
>> between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against
>> both Spark 0.9 and 1.1. We could see that the performance gains in Spark
>> have been good.
>>
>
> Is there any particular reason you are using an ancient & slow Hadoop-1.x
> version instead of a modern YARN 2.0 cluster?
>
>> We tried a very simple query,
>> select count(*) from T where col3=123
>> in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
>> performance had been 2x better than Hive (Hive: 120 sec, Spark: 60 sec).
>> Table T is stored in S3 and contains a single 600MB GZIP file.
>>
>
> Not sure if you realize it, but what you're doing is one of the worst
> cases for both platforms.
>
> Using a single big gzip file is a massive anti-pattern: gzip is not a
> splittable format, so that whole 600MB file has to be read by a single
> task.
>
> I'm assuming what you want is fast SQL in Hive (since this is the hive
> list), along with the lead/lag and other windowing functions there.
>
> You need a SQL-oriented columnar format like ORC; mix in YARN and add
> Tez, and that query is going to land somewhere near 10-12 seconds.
>
> Oh, and that's a ball-park figure for a single node.
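>
> A minimal sketch of that combination, assuming a YARN 2.x cluster with
> Tez installed (the table and query are from the example above; t_orc is
> a made-up name here):
>
>   -- run Hive on Tez instead of plain MapReduce
>   set hive.execution.engine=tez;
>
>   -- one-time rewrite of the gzip'd text table into ORC
>   CREATE TABLE t_orc STORED AS ORC AS SELECT * FROM T;
>
>   -- the same count now scans columnar ORC data under Tez
>   SELECT count(*) FROM t_orc WHERE col3=123;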
>
> Cheers,
> Gopal
>
