Use a different partitioning scheme or consider using clustered / bucketed tables.
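
For what it's worth, here is a rough sketch of what that could look like. It reuses the `logs` table and the `id` / `ts` columns from the query quoted below; everything else (the `logs_bucketed` name, the single `dt` partition column, the bucket count of 32, and using a managed table instead of an external one on S3) is an assumption for illustration, not something taken from your setup. Note that `hive.enforce.bucketing` applies to older Hive releases; newer ones always enforce bucketing.

```sql
-- Hypothetical target table: one coarse partition level (dt) instead of three,
-- with rows hashed on id into a fixed number of bucket files per partition.
CREATE TABLE logs_bucketed (
  id STRING,
  ts STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Make inserts honor the bucket definition (needed on older Hive versions).
SET hive.enforce.bucketing = true;

-- Allow the dt partition value to be derived from the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Rewrite the small-file table into the bucketed layout.
-- ts is assumed to use the "MM/dd/yyyy HH:mm" format from the original query.
INSERT OVERWRITE TABLE logs_bucketed PARTITION (dt)
SELECT
  id,
  ts,
  to_date(from_unixtime(unix_timestamp(ts, 'MM/dd/yyyy HH:mm'))) AS dt
FROM logs;
```

The idea is that each `dt` partition ends up with at most 32 files rather than many files under 1 MB, so the `group by id` query has far fewer inputs to open. The right partition granularity and bucket count depend on your data volume, so treat these numbers as placeholders.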

On 7/27/12, richin.j...@nokia.com <richin.j...@nokia.com> wrote:
> Igor,
>
> I did not see any major improvement in the performance even after setting
> "Hive.optimize.s3.query=true", although the same was suggested by AWS Team.
>
> My problem is I have too many small files - 3 level of partition, 6500+
> files and a single file is < 1 MB.
> Now I know Hadoop and HDFS are not meant to deal with lot of small files,
> but if that is the way to go is there any work around?
>
> Thanks,
> Richin
>
> From: Jain Richin (Nokia-LC/Boston)
> Sent: Tuesday, July 24, 2012 11:49 AM
> To: user@hive.apache.org
> Subject: RE: Performance Issues in Hive with S3 and Partitions
>
> Hi Igor,
>
> Thanks for the response. Yes I am using EMR.
> I will make changes and let you know if that helps.
>
> Richin
>
> From: ext Igor Tatarinov [mailto:i...@decide.com]
> Sent: Tuesday, July 24, 2012 12:38 AM
> To: user@hive.apache.org
> Subject: Re: Performance Issues in Hive with S3 and Partitions
>
> Are you using EMR?
> Have you tried setting
> Hive.optimize.s3.query=true
>
> as mentioned in
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-version-details.html
>
> I haven't tried using that option myself. I am curious if it helps in your
> scenario. The above page also mentions another fix that's supposed to help
> with partitioned tables. Optimizing queries with thousands of input files
> used to take a lot of time. But it looks like that fix is enabled by default
> now.
>
> Just in case, also check your jvm reuse option. If it's too low, performance
> will suffer. I had it set to 3 to avoid running out of memory. Using the
> default value of 20 really helps when reading lots of small files.
>
> igor
> decide.com
>
> On Mon, Jul 23, 2012 at 8:33 PM, <richin.j...@nokia.com> wrote:
> Hi,
>
> Sorry this is an AWS Hive Specific question. I have two External Hive
> tables for my custom logs.
>
> 1. flat directory structure on AWS S3, no partition and files in bz2
> compressed format (few big files)
>
> 2. With 3 level of partitions on AWS S3 (lot of small uncompressed files)
>
> I noticed that my queries on the table with Partition is taking forever to
> run. The same queries run fine and finish up quickly on table with no
> partition.
> Am I missing something, I suspect this has something to do with the way S3
> behaves.
>
> A query example is :
>
> select id, (max(unix_timestamp(ts, "MM/dd/yyyy HH:mm")) -
> min(unix_timestamp(ts, "MM/dd/yyyy HH:mm")))/(60*60)
> from logs
> group by id;
>
> Thanks,
> Richin