Performance Issues in Hive with S3 and Partitions

richin.jain Mon, 23 Jul 2012 20:34:18 -0700

Hi,

Sorry, this is an AWS-specific Hive question. I have two external Hive tables 
for my custom logs:


1. A flat directory structure on AWS S3, with no partitions and files in 
bz2-compressed format (a few big files)

2. Three levels of partitions on AWS S3 (a lot of small uncompressed files)
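
For illustration, the two layouts might be declared roughly as follows. This 
is only a sketch: the table names, partition keys, field delimiter, and bucket 
paths are placeholders, not the actual schema. Only the id and ts columns are 
taken from the example query.

```sql
-- Table 1 (hypothetical name): flat directory, a few large .bz2 text files.
CREATE EXTERNAL TABLE logs_flat (
  id STRING,
  ts STRING   -- timestamp text in the form MM/dd/yyyy HH:mm
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- assumed delimiter
LOCATION 's3://example-bucket/logs-flat/';       -- placeholder path

-- Table 2 (hypothetical name): three partition levels, many small
-- uncompressed files under each partition directory.
CREATE EXTERNAL TABLE logs_partitioned (
  id STRING,
  ts STRING
)
PARTITIONED BY (log_year STRING, log_month STRING, log_day STRING)  -- assumed keys
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/logs-partitioned/';  -- placeholder path
```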

I noticed that my queries on the partitioned table take forever to run. 
The same queries run fine and finish quickly on the table with no partitions.
Am I missing something? I suspect this has something to do with the way S3 
behaves.

An example query is:

select id,
       (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
        min(unix_timestamp(ts, "MM/dd/yyyy HH:mm"))) / (60*60)
from logs
group by id;

Thanks,
Richin
