On 1/22/15, 4:36 PM, chandra Reddy Bogala wrote:

       My question is related to GZIP files. I am sure a single large GZIP file is an
anti-pattern. Are small gzip files (20 to 50 MB) also an anti-pattern? The
reason I am asking is that my application collectors generate gzip files of
that size, so I copy those to HDFS, add them as partitions to Hive tables, and
run queries every 15 minutes. In hourly jobs, I convert to ORC with
aggregations.

That is exactly the best practice for hive-13 and earlier. Small files, compressed and converted to columnar storage as part of a periodic compaction.
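
To make that concrete, here is a minimal HiveQL sketch of the stage-insert-compact pattern - the table, column and path names are made up for illustration, not taken from your setup:

  -- Staging table over the small gzip CSV files the collectors produce:
  CREATE EXTERNAL TABLE raw_events (
    event_time STRING,
    user_id    BIGINT,
    metric     DOUBLE
  )
  PARTITIONED BY (dt STRING, batch STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/raw_events';

  -- Every 15 minutes: register the freshly copied gzip files as a new partition.
  ALTER TABLE raw_events ADD IF NOT EXISTS
    PARTITION (dt='2015-01-22', batch='1615')
    LOCATION '/data/raw_events/dt=2015-01-22/batch=1615';

  -- Hourly compaction: aggregate the small gzip partitions into ORC.
  CREATE TABLE events_orc (
    user_id     BIGINT,
    event_count BIGINT,
    metric_sum  DOUBLE
  )
  PARTITIONED BY (dt STRING, hr STRING)
  STORED AS ORC;

  INSERT OVERWRITE TABLE events_orc PARTITION (dt='2015-01-22', hr='16')
  SELECT user_id, count(*), sum(metric)
  FROM raw_events
  WHERE dt='2015-01-22' AND batch LIKE '16%'
  GROUP BY user_id;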

Your approach works very well and the reasons below are valid.

And for 2015, 15 minutes is a lot of latency - I assume you really want something like 15 seconds. Plus, it has moving external parts (the hourly and 15-minute crons etc).

There's a more native implementation of that stage-insert-compact idea in Hive-14.

Hive-14 has a different "streaming ingest" which allows you to do inserts into ORC at sub-minute intervals.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest#StreamingDataIngest-StreamingRequirements

You can connect a stream ingestion tool like Flume into that directly, to get sub-minute data availability in ORC.
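
Per the requirements on that wiki page, the target table for streaming has to be bucketed, stored as ORC and marked transactional, with the transaction manager and compactor enabled. A minimal sketch (names and values here are illustrative, not prescriptive):

  -- hive-site.xml, roughly:
  --   hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  --   hive.compactor.initiator.on = true
  --   hive.compactor.worker.threads = 1

  -- Destination table the streaming writers (e.g. a Flume Hive sink) append to:
  CREATE TABLE events_stream (
    event_time STRING,
    user_id    BIGINT,
    metric     DOUBLE
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');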

After those bits are in place, you get to literally pick up the best of the whole Hadoop/YARN ecosystem and see how all of it works with Hive.

Once you go down that path, you can just move the raw data over Kafka, pump it through a Storm topology which accesses HBase via Trident, and persist the results into a Hive Streaming sink.

That is roughly the state of the art for Hive - 1-2 seconds from raw data to query.

You should be able to find the "hive hbase storm bolt" example in the Hortonworks trucking demo.

Cheers,
Gopal

Two reasons I continue to use gzip files: 1) I don't know of (or there is no) way
to convert my CSV files to ORC at the client (collector) side; I can only use
MR/Hive to convert. 2) Because these are small gzips, each file is allocated
to one mapper, so the data per mapper is close to the split size.

Thanks,
Chandra

On Fri, Jan 23, 2015 at 5:01 AM, Gopal V <gop...@apache.org> wrote:

On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:

We were comparing performance of some of our production Hive queries
between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains were good
in Spark.


Is there any particular reason you are using an ancient & slow Hadoop-1.x
version instead of a modern Hadoop 2.x YARN cluster?

 We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark's
performance was 2x better than Hive's (120 sec in Hive vs. 60 sec in Spark).
Table T is stored in S3 and consists of a single 600 MB GZIP file.


Not sure if you realise it, but what you're doing is one of the worst
cases for both platforms.

Using a single big gzip file is a massive anti-pattern - gzip isn't splittable, so that entire 600 MB has to be decompressed and scanned by a single task.

I'm assuming what you want is fast SQL in Hive (since this is the Hive
list), along with all the lead/lag windowing functions there.

You need a SQL-oriented columnar format like ORC; mix in YARN, add Tez,
and that query is going to land somewhere near 10-12 seconds.

Oh, and that's a ball-park figure for a single node.
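
A rough sketch of that path, assuming the table T from your query (the name of the ORC copy is made up):

  SET hive.execution.engine=tez;

  -- One-time conversion of the single-gzip text table into a columnar copy:
  CREATE TABLE t_orc STORED AS ORC AS SELECT * FROM T;

  -- The same query, now against ORC on Tez:
  SELECT count(*) FROM t_orc WHERE col3 = 123;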

Cheers,
Gopal



