Thanks, guys. I am changing my partitions to hold a day's worth of data each, which should be good enough for Hive to operate on.
Thanks,
Richin

From: ext Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Friday, July 27, 2012 3:06 PM
To: user@hive.apache.org
Subject: Re: Performance Issues in Hive with S3 and Partitions

Hi Richin

I agree with Edward on this. You have to design your partitions in such a way that each partition holds at least an HDFS block's worth of data.

Regards,
Bejoy KS

________________________________
From: Edward Capriolo <edlinuxg...@gmail.com>
To: user@hive.apache.org
Sent: Saturday, July 28, 2012 12:32 AM
Subject: Re: Performance Issues in Hive with S3 and Partitions

Use a different partitioning scheme or consider using clustered / bucketed tables.

On 7/27/12, richin.j...@nokia.com <richin.j...@nokia.com> wrote:
> Igor,
>
> I did not see any major improvement in performance even after setting
> "hive.optimize.s3.query=true", although the same was suggested by the AWS team.
>
> My problem is that I have too many small files: 3 levels of partitions, 6500+
> files, and a single file is < 1 MB.
> Now I know Hadoop and HDFS are not meant to deal with lots of small files,
> but if that is the way to go, is there any workaround?
>
> Thanks,
> Richin
>
> From: Jain Richin (Nokia-LC/Boston)
> Sent: Tuesday, July 24, 2012 11:49 AM
> To: user@hive.apache.org
> Subject: RE: Performance Issues in Hive with S3 and Partitions
>
> Hi Igor,
>
> Thanks for the response. Yes, I am using EMR.
> I will make the changes and let you know if that helps.
>
> Richin
>
> From: ext Igor Tatarinov [mailto:i...@decide.com]
> Sent: Tuesday, July 24, 2012 12:38 AM
> To: user@hive.apache.org
> Subject: Re: Performance Issues in Hive with S3 and Partitions
>
> Are you using EMR?
> Have you tried setting
> hive.optimize.s3.query=true
>
> as mentioned in
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html
>
> I haven't tried using that option myself. I am curious whether it helps in your
> scenario. The above page also mentions another fix that's supposed to help
> with partitioned tables. Optimizing queries with thousands of input files
> used to take a lot of time. But it looks like that fix is enabled by default
> now.
>
> Just in case, also check your JVM reuse option. If it's too low, performance
> will suffer. I had it set to 3 to avoid running out of memory. Using the
> default value of 20 really helps when reading lots of small files.
>
> igor
> decide.com
>
> On Mon, Jul 23, 2012 at 8:33 PM, <richin.j...@nokia.com> wrote:
> Hi,
>
> Sorry, this is an AWS Hive-specific question. I have two external Hive
> tables for my custom logs:
>
> 1. A flat directory structure on AWS S3, no partitions, and files in bz2
> compressed format (a few big files)
>
> 2. 3 levels of partitions on AWS S3 (lots of small uncompressed files)
>
> I noticed that my queries on the partitioned table are taking forever to
> run. The same queries run fine and finish quickly on the table with no
> partitions.
> Am I missing something? I suspect this has something to do with the way S3
> behaves.
>
> A query example is:
>
> select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
> min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
> from logs
> group by id;
>
> Thanks,
> Richin
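
For readers landing on this thread, here is a minimal HiveQL sketch of the suggestions above: the EMR S3 option, JVM reuse, and repartitioning by day so that each partition holds fewer, larger files. It is an illustration under stated assumptions, not what anyone in the thread actually ran. The table name logs_daily, the partition column dt, and the S3 path are hypothetical; only the columns id and ts and the source table logs come from the example query. It also assumes the "JVM reuse option" refers to Hadoop's mapred.job.reuse.jvm.num.tasks property.

```sql
-- Session settings discussed in the thread.
SET hive.optimize.s3.query=true;          -- EMR-specific option suggested by Igor
SET mapred.job.reuse.jvm.num.tasks=20;    -- assumed to be the "JVM reuse option"; 20 is the value Igor recommends

-- Hypothetical target table with a single day-level partition.
-- Only id and ts are known from the thread; real logs would have more columns.
CREATE EXTERNAL TABLE logs_daily (
  id STRING,
  ts STRING
)
PARTITIONED BY (dt STRING)
LOCATION 's3://example-bucket/logs_daily/';   -- hypothetical path

-- Let Hive derive the partition value from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite the existing small files into one partition per day.
-- DISTRIBUTE BY sends all rows for a given day to the same reducer,
-- so each day's partition is written by a single task.
INSERT OVERWRITE TABLE logs_daily PARTITION (dt)
SELECT t.id, t.ts, t.dt
FROM (
  SELECT id,
         ts,
         from_unixtime(unix_timestamp(ts, 'MM/dd/yyyy HH:mm'), 'yyyy-MM-dd') AS dt
  FROM logs
) t
DISTRIBUTE BY t.dt;
```

The example query from the original post can then be run against logs_daily unchanged apart from the table name. Whether a day of data reaches the block-size target Bejoy mentions depends on the log volume, which the thread does not state.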