RE: Performance Issues in Hive with S3 and Partitions

richin.jain Fri, 27 Jul 2012 12:10:02 -0700

Thanks, guys. I am changing my partitioning so that each partition holds a
day's worth of data, which should be large enough for Hive to operate on.


Thanks,
Richin

From: ext Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Friday, July 27, 2012 3:06 PM
To: user@hive.apache.org
Subject: Re: Performance Issues in Hive with S3 and Partitions

Hi Richin

I agree with Edward on this. You have to design your partitions in such a way 
that each partition holds at least one HDFS block's worth of data.

Regards,
Bejoy KS

________________________________
From: Edward Capriolo <edlinuxg...@gmail.com>
To: user@hive.apache.org
Sent: Saturday, July 28, 2012 12:32 AM
Subject: Re: Performance Issues in Hive with S3 and Partitions

Use a different partitioning scheme or consider using clustered /
bucketed tables.
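
As a rough sketch of that suggestion (not a tested recipe), the small files
could be rewritten into a table partitioned by day and bucketed on id. The
columns id and ts and the source table logs come from the query quoted below;
the table name logs_daily, the partition column dt, the STRING column types,
the bucket count, and the S3 path are all assumptions made for illustration.

```sql
-- Hypothetical target table: one partition per day, bucketed on id.
-- logs_daily, dt, the column types, 4 buckets, and the S3 path are illustrative only.
CREATE EXTERNAL TABLE logs_daily (
  id STRING,
  ts STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
LOCATION 's3://example-bucket/logs_daily/';

-- Let Hive create the day partitions from the data and honor the bucket spec.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

-- Rewrite the small files into day partitions; the dynamic partition
-- column (dt) must come last in the SELECT list.
INSERT OVERWRITE TABLE logs_daily PARTITION (dt)
SELECT id,
       ts,
       from_unixtime(unix_timestamp(ts, 'MM/dd/yyyy HH:mm'), 'yyyy-MM-dd') AS dt
FROM logs;
```

The bucket count is worth keeping low: each bucket becomes a separate file in
every partition, so too many buckets would bring the small-file problem back.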

On 7/27/12, richin.j...@nokia.com <richin.j...@nokia.com> wrote:
> Igor,
>
> I did not see any major improvement in performance even after setting
> "hive.optimize.s3.query=true", although the AWS team suggested the same.
>
> My problem is that I have too many small files: 3 levels of partitions,
> 6500+ files, and each file is < 1 MB.
> I know Hadoop and HDFS are not meant to deal with lots of small files,
> but if that is the way I have to go, is there any workaround?
>
> Thanks,
> Richin
>
> From: Jain Richin (Nokia-LC/Boston)
> Sent: Tuesday, July 24, 2012 11:49 AM
> To: user@hive.apache.org
> Subject: RE: Performance Issues in Hive with S3 and Partitions
>
> Hi Igor,
>
> Thanks for the response. Yes I am using EMR.
> I will make changes and let you know if that helps.
>
> Richin
>
> From: ext Igor Tatarinov [mailto:i...@decide.com]
> Sent: Tuesday, July 24, 2012 12:38 AM
> To: user@hive.apache.org
> Subject: Re: Performance Issues in Hive with S3 and Partitions
>
> Are you using EMR?
> Have you tried setting
> hive.optimize.s3.query=true
>
> as mentioned in
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html
>
> I haven't tried using that option myself. I am curious if it helps in your
> scenario. The above page also mentions another fix that's supposed to help
> with partitioned tables. Optimizing queries with thousands of input files
> used to take a lot of time. But it looks like that fix is enabled by default
> now.
>
> Just in case, also check your jvm reuse option. If it's too low, performance
> will suffer. I had it set to 3 to avoid running out of memory. Using the
> default value of 20 really helps when reading lots of small files.
>
> igor
> decide.com
> On Mon, Jul 23, 2012 at 8:33 PM, <richin.j...@nokia.com> wrote:
> Hi,
>
> Sorry, this is an AWS-specific Hive question. I have two external Hive
> tables for my custom logs.
>
> 1. A flat directory structure on AWS S3, with no partitions and files in
> bz2 compressed format (a few big files)
>
> 2. 3 levels of partitions on AWS S3 (lots of small uncompressed files)
>
> I noticed that my queries on the partitioned table take forever to
> run. The same queries run fine and finish quickly on the table with no
> partitions.
> Am I missing something? I suspect this has something to do with the way S3
> behaves.
>
> An example query is:
>
> select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
> min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
> from logs
> group by id;
>
> Thanks,
> Richin
>
>
