@kulkarni,
When I did explain on my query, I got these things, I am not sure how to understand these thing. Any help will be appreciated whether my approach is right or not?- hive> EXPLAIN SELECT * FROM PDS_ATTRIBUTE_DATA_REALTIME where dt=yesterdaydate('yyyyMMdd', 2) LIMIT 5; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF PDS_ATTRIBUTE_DATA_REALTIME)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt) (TOK_FUNCTION yesterdaydate 'yyyyMMdd' 2))) (TOK_LIMIT 5))) STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: 5 Time taken: 12.126 seconds *Raihan Jamal* On Tue, Aug 7, 2012 at 10:56 AM, Jan Dolinár <dolik....@gmail.com> wrote: > Oops, sorry I made a copy&paste mistake :) The annotation should read > @*UDFType(deterministic=true*) > > Jan > > On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <dolik....@gmail.com> wrote: > >> I'm afraid that he query >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> will scan entire table, because the functions is evaluated at runtime, so >> Hive doesn't know what the value is when it decides which files to scan. I >> am not 100% sure though, you should try it. >> >> Also, you might want to try to add annotation to your UDF saying that the >> function is deterministic: >> @*UDFType(deterministic=false*) >> >> I think Hive might be clever enough to evaluate it early enough to use >> the partition pruning correctly, since it operates on constant expression. >> But again, I'm not really sure, maybe someone with deeper knowledge of Hive >> optimizations will tell us more. It is actually quite interesting question. >> >> Another way to help Hive with the optimizations might be to skip passing >> the format string argument, if you have all dates in same format, you can >> call the function just like 'yesterdaydate()' and hardcode the format in >> the function. >> >> Jan >> >> >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalrai...@gmail.com>wrote: >> >>> Hi Jan, >>> >>> >>> I figured that out, it is working fine for me now. The only question I >>> have is, if I am doing like this- >>> >>> >>> >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >>> >>> >>> >>> Then the above query will be evaluated as below right? >>> >>> >>> >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >>> >>> >>> >>> So that means it will look for data in the corresponding dt partition >>> *(20120806) >>> *only right as above table is partitioned on dt column ? And it will >>> not scan the whole table right?** >>> >>> >>> >>> *Raihan Jamal* >>> >>> >>> >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik....@gmail.com>wrote: >>> >>>> Hi Jamal, >>>> >>>> Check if the function really returns what it should and that your data >>>> are really in yyyyMMdd format. You can do this by simple query like this: >>>> >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>>> >>>> I don't see anything wrong with the function itself, it works well for >>>> me (although I tested it in hive 0.7.1). The only thing I would change >>>> about it would be to optimize it by calling 'new' only at the time of >>>> construction and reusing the object when the function is called, but that >>>> should not affect the functionality at all. >>>> >>>> Best regards, >>>> Jan >>>> >>>> >>>> >>>> >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalrai...@gmail.com>wrote: >>>> >>>>> *Problem* >>>>> >>>>> I created the below UserDefinedFunction to get the yesterday's day in >>>>> the format I wanted as I will be passing the format into this below method >>>>> from the query. >>>>> >>>>> >>>>> >>>>> *public final class YesterdayDate extends UDF {* >>>>> >>>>> * * >>>>> >>>>> * public String evaluate(final String format) { * >>>>> >>>>> * DateFormat dateFormat = new >>>>> SimpleDateFormat(format); * >>>>> >>>>> * Calendar cal = >>>>> Calendar.getInstance();* >>>>> >>>>> * cal.add(Calendar.DATE, -1); * >>>>> >>>>> * return >>>>> dateFormat.format(cal.getTime()).toString(); * >>>>> >>>>> * } * >>>>> >>>>> *}* >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> So whenever I try to run the query like below by adding the jar to >>>>> classpath and creating the temporary function yesterdaydate, I always get >>>>> zero result back- >>>>> >>>>> >>>>> >>>>> hive> create temporary function *yesterdaydate* as >>>>> 'com.example.hive.udf.YesterdayDate'; >>>>> >>>>> OK >>>>> >>>>> Time taken: 0.512 seconds >>>>> >>>>> >>>>> >>>>> Below is the query I am running- >>>>> >>>>> >>>>> >>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') >>>>> LIMIT 10;* >>>>> >>>>> *OK* >>>>> >>>>> * * >>>>> >>>>> And I always get zero result back but the data is there in that table >>>>> for Aug 5th.** >>>>> >>>>> >>>>> >>>>> What wrong I am doing? Any suggestions will be appreciated. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable >>>>> substitution thing, so I cannot use hiveconf here and the above table has >>>>> been partitioned on dt(date) column.** >>>>> >>>> >>>> >>> >> >