Oops, sorry I made a copy&paste mistake :) The annotation should read @*UDFType(deterministic=true*)
Jan On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <dolik....@gmail.com> wrote: > I'm afraid that he query > > SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > > will scan entire table, because the functions is evaluated at runtime, so > Hive doesn't know what the value is when it decides which files to scan. I > am not 100% sure though, you should try it. > > Also, you might want to try to add annotation to your UDF saying that the > function is deterministic: > @*UDFType(deterministic=false*) > > I think Hive might be clever enough to evaluate it early enough to use the > partition pruning correctly, since it operates on constant expression. But > again, I'm not really sure, maybe someone with deeper knowledge of Hive > optimizations will tell us more. It is actually quite interesting question. > > Another way to help Hive with the optimizations might be to skip passing > the format string argument, if you have all dates in same format, you can > call the function just like 'yesterdaydate()' and hardcode the format in > the function. > > Jan > > > On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalrai...@gmail.com>wrote: > >> Hi Jan, >> >> >> I figured that out, it is working fine for me now. The only question I >> have is, if I am doing like this- >> >> >> >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; >> >> >> >> Then the above query will be evaluated as below right? >> >> >> >> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; >> >> >> >> So that means it will look for data in the corresponding dt partition >> *(20120806) >> *only right as above table is partitioned on dt column ? And it will not >> scan the whole table right?** >> >> >> >> *Raihan Jamal* >> >> >> >> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik....@gmail.com> wrote: >> >>> Hi Jamal, >>> >>> Check if the function really returns what it should and that your data >>> are really in yyyyMMdd format. You can do this by simple query like this: >>> >>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; >>> >>> I don't see anything wrong with the function itself, it works well for >>> me (although I tested it in hive 0.7.1). The only thing I would change >>> about it would be to optimize it by calling 'new' only at the time of >>> construction and reusing the object when the function is called, but that >>> should not affect the functionality at all. >>> >>> Best regards, >>> Jan >>> >>> >>> >>> >>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalrai...@gmail.com>wrote: >>> >>>> *Problem* >>>> >>>> I created the below UserDefinedFunction to get the yesterday's day in >>>> the format I wanted as I will be passing the format into this below method >>>> from the query. >>>> >>>> >>>> >>>> *public final class YesterdayDate extends UDF {* >>>> >>>> * * >>>> >>>> * public String evaluate(final String format) { * >>>> >>>> * DateFormat dateFormat = new >>>> SimpleDateFormat(format); * >>>> >>>> * Calendar cal = Calendar.getInstance(); >>>> * >>>> >>>> * cal.add(Calendar.DATE, -1); * >>>> >>>> * return >>>> dateFormat.format(cal.getTime()).toString(); * >>>> >>>> * } * >>>> >>>> *}* >>>> >>>> >>>> >>>> >>>> >>>> So whenever I try to run the query like below by adding the jar to >>>> classpath and creating the temporary function yesterdaydate, I always get >>>> zero result back- >>>> >>>> >>>> >>>> hive> create temporary function *yesterdaydate* as >>>> 'com.example.hive.udf.YesterdayDate'; >>>> >>>> OK >>>> >>>> Time taken: 0.512 seconds >>>> >>>> >>>> >>>> Below is the query I am running- >>>> >>>> >>>> >>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') >>>> LIMIT 10;* >>>> >>>> *OK* >>>> >>>> * * >>>> >>>> And I always get zero result back but the data is there in that table >>>> for Aug 5th.** >>>> >>>> >>>> >>>> What wrong I am doing? Any suggestions will be appreciated. >>>> >>>> >>>> >>>> >>>> >>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable >>>> substitution thing, so I cannot use hiveconf here and the above table has >>>> been partitioned on dt(date) column.** >>>> >>> >>> >> >