Hi Vijay, Thanks for the suggestion, If upgrading to Hive was under my control then I would have done for sure, but I am working in a company and they are running Hive 0.6 on all the cluster, And I told them to upgrade the Hive version but they said it will take few months for them to do this. And I don't know why they are saying like this, so that is the reason I was doing like this.
Any suggestions will be appreciated to make this thing work *Raihan Jamal* On Tue, Aug 7, 2012 at 11:11 AM, Vijay <tec...@gmail.com> wrote: > Given the implementation of the UDF, I don't think hive would be able > to use partition pruning. Especially the version you're using. I'd > really recommend upgrading to a later version that has the hiveconf > support. That can save a lot of trouble rather than trying to get > things working on 0.6 > > On Tue, Aug 7, 2012 at 11:07 AM, Raihan Jamal <jamalrai...@gmail.com> > wrote: > > Hi Jan, > > > > I have date in different format also, so that is the reason I was > thinking > > to do by this approach. How can I make sure this will work on the > selected > > partition only and it will not scan the entire table. I will add your > > suggestion in my UDF as deterministic thing. > > > > My simple question here is- How to get the Yesterdays date which I can > use > > on the Date Partition I cannot use hiveconf here as I am working with > Hive > > 0.6 > > > > > > > > > > Raihan Jamal > > > > > > > > On Tue, Aug 7, 2012 at 10:37 AM, Jan Dolinár <dolik....@gmail.com> > wrote: > >> > >> I'm afraid that he query > >> > >> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > >> > >> will scan entire table, because the functions is evaluated at runtime, > so > >> Hive doesn't know what the value is when it decides which files to > scan. I > >> am not 100% sure though, you should try it. > >> > >> Also, you might want to try to add annotation to your UDF saying that > the > >> function is deterministic: > >> @UDFType(deterministic=false) > >> > >> I think Hive might be clever enough to evaluate it early enough to use > the > >> partition pruning correctly, since it operates on constant expression. > But > >> again, I'm not really sure, maybe someone with deeper knowledge of Hive > >> optimizations will tell us more. It is actually quite interesting > question. > >> > >> Another way to help Hive with the optimizations might be to skip passing > >> the format string argument, if you have all dates in same format, you > can > >> call the function just like 'yesterdaydate()' and hardcode the format > in the > >> function. > >> > >> Jan > >> > >> > >> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalrai...@gmail.com> > >> wrote: > >>> > >>> Hi Jan, > >>> > >>> > >>> > >>> I figured that out, it is working fine for me now. The only question I > >>> have is, if I am doing like this- > >>> > >>> > >>> > >>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10; > >>> > >>> > >>> > >>> Then the above query will be evaluated as below right? > >>> > >>> > >>> > >>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10; > >>> > >>> > >>> > >>> So that means it will look for data in the corresponding dt partition > >>> (20120806) only right as above table is partitioned on dt column ? And > it > >>> will not scan the whole table right? > >>> > >>> > >>> > >>> > >>> Raihan Jamal > >>> > >>> > >>> > >>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik....@gmail.com> > wrote: > >>>> > >>>> Hi Jamal, > >>>> > >>>> Check if the function really returns what it should and that your data > >>>> are really in yyyyMMdd format. You can do this by simple query like > this: > >>>> > >>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1; > >>>> > >>>> I don't see anything wrong with the function itself, it works well for > >>>> me (although I tested it in hive 0.7.1). The only thing I would > change about > >>>> it would be to optimize it by calling 'new' only at the time of > construction > >>>> and reusing the object when the function is called, but that should > not > >>>> affect the functionality at all. > >>>> > >>>> Best regards, > >>>> Jan > >>>> > >>>> > >>>> > >>>> > >>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalrai...@gmail.com> > >>>> wrote: > >>>>> > >>>>> Problem > >>>>> > >>>>> I created the below UserDefinedFunction to get the yesterday's day in > >>>>> the format I wanted as I will be passing the format into this below > method > >>>>> from the query. > >>>>> > >>>>> > >>>>> > >>>>> public final class YesterdayDate extends UDF { > >>>>> > >>>>> > >>>>> > >>>>> public String evaluate(final String format) { > >>>>> > >>>>> DateFormat dateFormat = new > >>>>> SimpleDateFormat(format); > >>>>> > >>>>> Calendar cal = > Calendar.getInstance(); > >>>>> > >>>>> cal.add(Calendar.DATE, -1); > >>>>> > >>>>> return > >>>>> dateFormat.format(cal.getTime()).toString(); > >>>>> > >>>>> } > >>>>> > >>>>> } > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> So whenever I try to run the query like below by adding the jar to > >>>>> classpath and creating the temporary function yesterdaydate, I > always get > >>>>> zero result back- > >>>>> > >>>>> > >>>>> > >>>>> hive> create temporary function yesterdaydate as > >>>>> 'com.example.hive.udf.YesterdayDate'; > >>>>> > >>>>> OK > >>>>> > >>>>> Time taken: 0.512 seconds > >>>>> > >>>>> > >>>>> > >>>>> Below is the query I am running- > >>>>> > >>>>> > >>>>> > >>>>> hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') > LIMIT > >>>>> 10; > >>>>> > >>>>> OK > >>>>> > >>>>> > >>>>> > >>>>> And I always get zero result back but the data is there in that table > >>>>> for Aug 5th. > >>>>> > >>>>> > >>>>> > >>>>> What wrong I am doing? Any suggestions will be appreciated. > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable > >>>>> substitution thing, so I cannot use hiveconf here and the above > table has > >>>>> been partitioned on dt(date) column. > >>>> > >>>> > >>> > >> > > >