@kulkarni,


When I did explain on my query, I got these things, I am not sure how to
understand these thing. Any help will be appreciated whether my approach is
right or not?-



hive> EXPLAIN SELECT * FROM PDS_ATTRIBUTE_DATA_REALTIME where
dt=yesterdaydate('yyyyMMdd', 2) LIMIT 5;

OK

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_TABREF PDS_ATTRIBUTE_DATA_REALTIME))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL dt)
(TOK_FUNCTION yesterdaydate 'yyyyMMdd' 2))) (TOK_LIMIT 5)))



STAGE DEPENDENCIES:

  Stage-0 is a root stage



STAGE PLANS:

  Stage: Stage-0

    Fetch Operator

      limit: 5





Time taken: 12.126 seconds




*Raihan Jamal*



On Tue, Aug 7, 2012 at 10:56 AM, Jan Dolinár <dolik....@gmail.com> wrote:

> Oops, sorry I made a copy&paste mistake :) The annotation should read
> @*UDFType(deterministic=true*)
>
> Jan
>
> On Tue, Aug 7, 2012 at 7:37 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>
>> I'm afraid that  he query
>>
>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>
>> will scan entire table, because the functions is evaluated at runtime, so
>> Hive doesn't know what the value is when it decides which files to scan. I
>> am not 100% sure though, you should try it.
>>
>> Also, you might want to try to add annotation to your UDF saying that the
>> function is deterministic:
>> @*UDFType(deterministic=false*)
>>
>> I think Hive might be clever enough to evaluate it early enough to use
>> the partition pruning correctly, since it operates on constant expression.
>> But again, I'm not really sure, maybe someone with deeper knowledge of Hive
>> optimizations will tell us more. It is actually quite interesting question.
>>
>> Another way to help Hive with the optimizations might be to skip passing
>> the format string argument, if you have all dates in same format, you can
>> call the function just like 'yesterdaydate()' and hardcode the format in
>> the function.
>>
>> Jan
>>
>>
>> On Tue, Aug 7, 2012 at 7:20 PM, Raihan Jamal <jamalrai...@gmail.com>wrote:
>>
>>> Hi Jan,
>>>
>>>
>>> I figured that out, it is working fine for me now. The only question I
>>> have is, if I am doing like this-
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd') LIMIT 10;
>>>
>>>
>>>
>>> Then the above query will be evaluated as below right?
>>>
>>>
>>>
>>> SELECT * FROM REALTIME where dt= ‘20120806’ LIMIT 10;
>>>
>>>
>>>
>>> So that means it will look for data in the corresponding dt partition 
>>> *(20120806)
>>> *only right as above table is partitioned on dt column ? And it will
>>> not scan the whole table right?**
>>>
>>>
>>>
>>> *Raihan Jamal*
>>>
>>>
>>>
>>> On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár <dolik....@gmail.com>wrote:
>>>
>>>> Hi Jamal,
>>>>
>>>> Check if the function really returns what it should and that your data
>>>> are really in yyyyMMdd format. You can do this by simple query like this:
>>>>
>>>> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>>>>
>>>> I don't see anything wrong with the function itself, it works well for
>>>> me (although I tested it in hive 0.7.1). The only thing I would change
>>>> about it would be to optimize it by calling 'new' only at the time of
>>>> construction and reusing the object when the function is called, but that
>>>> should not affect the functionality at all.
>>>>
>>>> Best regards,
>>>> Jan
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal <jamalrai...@gmail.com>wrote:
>>>>
>>>>> *Problem*
>>>>>
>>>>> I created the below UserDefinedFunction to get the yesterday's day in
>>>>> the format I wanted as I will be passing the format into this below method
>>>>> from the query.
>>>>>
>>>>>
>>>>>
>>>>> *public final class YesterdayDate extends UDF {*
>>>>>
>>>>> * *
>>>>>
>>>>> *                public String evaluate(final String format) { *
>>>>>
>>>>> *                                DateFormat dateFormat = new
>>>>> SimpleDateFormat(format); *
>>>>>
>>>>> *                                Calendar cal =
>>>>> Calendar.getInstance();*
>>>>>
>>>>> *                                cal.add(Calendar.DATE, -1);     *
>>>>>
>>>>> *                                return
>>>>> dateFormat.format(cal.getTime()).toString(); *
>>>>>
>>>>> *                } *
>>>>>
>>>>> *}*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So whenever I try to run the query like below by adding the jar to
>>>>> classpath and creating the temporary function yesterdaydate, I always get
>>>>> zero result back-
>>>>>
>>>>>
>>>>>
>>>>> hive> create temporary function *yesterdaydate* as
>>>>> 'com.example.hive.udf.YesterdayDate';
>>>>>
>>>>> OK
>>>>>
>>>>> Time taken: 0.512 seconds
>>>>>
>>>>>
>>>>>
>>>>> Below is the query I am running-
>>>>>
>>>>>
>>>>>
>>>>> *hive> SELECT * FROM REALTIME where dt= yesterdaydate('yyyyMMdd')
>>>>> LIMIT 10;*
>>>>>
>>>>> *OK*
>>>>>
>>>>> * *
>>>>>
>>>>> And I always get zero result back but the data is there in that table
>>>>> for Aug 5th.**
>>>>>
>>>>>
>>>>>
>>>>> What wrong I am doing? Any suggestions will be appreciated.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> NOTE:- As I am working with Hive 0.6 so it doesn’t support variable
>>>>> substitution thing, so I cannot use hiveconf here and the above table has
>>>>> been partitioned on dt(date) column.**
>>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to