Thanks, I'm going to try explain extended and see what I get. I'm using HF.hr > '07' and HF.hr <= '07', because I actually want P*S*T (in order to avoid the spring forward / fallback problem you mentioned).
Marc On Wed, Oct 6, 2010 at 5:05 PM, Steven Wong <sw...@netflix.com> wrote: > What Hive version are you running? Try an “explain extended” on your > insert query and see if unneeded partitions are included. > > > > Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT) > is UTC-07:00. To convert UTC to PDT, the condition should be: > > (HF.dt = '2010-09-29' AND HF.hr >= '07' ) OR (HF.dt = '2010-09-30' AND > HF.hr < '07' ) > > instead of: > > (HF.dt = '2010-09-29' AND HF.hr > '07' ) OR (HF.dt = '2010-09-30' AND HF.hr > <= '07' ) > > > > Good luck on the days we spring forward or fall back. J/L > > > > > > *From:* Marc Limotte [mailto:mslimo...@gmail.com] > *Sent:* Wednesday, October 06, 2010 11:12 AM > > *To:* hive-u...@hadoop.apache.org > *Subject:* Re: RE: hive query doesn't seem to limit itself to partitions > based on the WHERE clause > > > > Thanks for the response, Edward. > > The source table (hourly_fact) is partitioned on dt (date) and hr (hour), > and I've confirmed that they are both String fields (CREATE stmt is below). > > The hourly_fact table contains 'number of requests' for each hour by a few > dimensions. The query is just trying to get a daily aggregation across > those same dimensions. The only trick is that the hourly_fact table has dt > and hour in UTC time. And the daily aggregation is being done for a PST > (pacific std) day, hence the 7 hour offset. > > CREATE TABLE IF NOT EXISTS hourly_fact ( > tagtype STRING, > country STRING, > company INT, > request_keyword STRING, > receiver_code STRING, > referrer_domain STRING, > num_requests INT, > num_new_user_requests INT > ) > PARTITIONED BY (dt STRING, hr STRING) > ROW FORMAT DELIMITED > STORED AS SEQUENCEFILE > LOCATION "..."; > > > Marc > > On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo <edlinuxg...@gmail.com> > wrote: > > On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte <mslimo...@gmail.com> wrote: > > Hi Namit, > > > > Hourly_fact is partitioned on dt and hr. > > > > Marc > > > > On Oct 3, 2010 10:00 PM, "Namit Jain" <nj...@facebook.com> wrote: > >> What is your table hourly_fact partitioned on ? > >> > >> ________________________________________ > >> From: Marc Limotte [mslimo...@gmail.com] > >> Sent: Friday, October 01, 2010 2:10 PM > >> To: hive-u...@hadoop.apache.org > >> Subject: hive query doesn't seem to limit itself to partitions based on > >> the WHERE clause > >> > >> Hi, > >> > >> From looking at the hive log output, it seems that my job is accessing > >> many more partitions than it needs to? For example, my query is > something > >> like: > >> > >> INSERT OVERWRITE TABLE daily_fact > >> PARTITION (dt='2010-09-29') > >> SELECT > >> 20100929 as stamp, > >> tagtype, > >> country, > >> sum(num_requests) AS num_requests > >> FROM > >> hourly_fact HF > >> WHERE > >> (HF.dt = '2010-09-29' AND HF.hr > '07' ) > >> OR (HF.dt = '2010-09-30' AND HF.hr <= '07' ) > >> GROUP BY > >> 20100929, tagtype, country > >> > >> Based on the WHERE clause, I would expect it to look only at partitions > in > >> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But, the > log > >> contains entries like: > >> > >> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file > >> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10 > >> > >> And many other hours outside my WHERE constraint. I assume this means > that > >> it's processing those directories. The answer still comes out right, but > I'm > >> concerned about the performance. > >> > >> Would appreciate some help understanding what this means and how to fix > >> it. > >> > >> Thanks, > >> Marc > >> > >> > > > > Possibly you defined HF.hr <= '07' as an int column and comparing it > as a string is resulting in a full table scan. Can you explain the > query? > > >