Well, this is the UDF:

package com.ebuddy.dwhhive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

import java.text.SimpleDateFormat;
import java.util.Calendar;

@Description(
        name = "currentisodate",
        value = "currentisodate() - Get the current date. Incorrectly made"
                + " deterministic to get partition pruning to work."
)

@UDFType(deterministic = true)
public class CurrentIsoDate extends UDF {

    /**
     * Returns the current date formatted as yyyy-MM-dd. Because the UDF is
     * flagged deterministic, the planner evaluates it at plan time and treats
     * the result as a constant, which is what makes partition pruning work.
     */
    public static Text evaluate() {
        String pattern = "yyyy-MM-dd";
        SimpleDateFormat timeFormat = new SimpleDateFormat(pattern);
        return new Text(timeFormat.format(Calendar.getInstance().getTime()));
    }
}


And this is how we use it to query the last 30 days:
ADD jar /opt/hive/udf/udf-2.1.2-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION currentisodate AS 'com.ebuddy.dwhhive.udf.CurrentIsoDate';
select count(*) from test where record_date_iso >= DATE_SUB(currentisodate(), 30);
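
A quick way to double-check that the pruning really kicks in is to look at the plan; EXPLAIN EXTENDED on the same query should list only the matching record_date_iso partitions rather than the whole table:

EXPLAIN EXTENDED
select count(*) from test where record_date_iso >= DATE_SUB(currentisodate(), 30);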

I’ve always had a preference for ISO dates since they sort nicely (2012-11-23),
but you can obviously pick your own pattern.


From: Dima Datsenko [mailto:di...@microsoft.com]
Sent: Thursday, November 22, 2012 4:07 PM
To: Bennie Schut; user@hive.apache.org
Subject: RE: Efficient partition usage in join

Hi Bennie,

The UDF solution sounds like a plan. Much better than generating a Hive query
with hardcoded partitions taken out of table B. Can you please provide a sample
of what you’re doing there?

Thanks,
Dima

From: Bennie Schut [mailto:bsc...@ebuddy.com]
Sent: Thursday, 22 November 2012 16:28
To: user@hive.apache.org
Cc: Dima Datsenko
Subject: RE: Efficient partition usage in join

Unfortunately, partition pruning is a bit limited in Hive at the moment. When
Hive creates the query plan it decides which partitions to read. So if you put a
hardcoded list of partition_id values in the where clause, it knows what to do.
In the case of a join (or a subquery) it would have to run the query before it
can know what it can prune. There are obvious solutions to this, but they are
simply not implemented at the moment.
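
In other words (a quick sketch reusing the table and column names from your mail; the literal ids are made up):

-- This the planner can prune: the partition_id values are known at plan time.
select count(*) from A where partition_id in (1, 2, 3);

-- This it cannot prune: which partitions of A match is only known once the join runs.
select * from A join B on (A.partition_id = B.partition_id);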
Generally speaking, people try to work around this by not normalizing the data.
So if you plan on building a clean star schema with a calendar table, do
yourself a favor and put the actual date in the fact table rather than a
meaningless key.
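
For example (just a sketch, with made-up table and column names):

CREATE TABLE fact_events (user_id BIGINT, event_type STRING)
PARTITIONED BY (record_date_iso STRING);

-- The date literal sits directly in the WHERE clause, so the planner can prune
-- partitions without having to run a join against a calendar table first.
select count(*) from fact_events where record_date_iso >= '2012-10-23';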
It’s also good to realize you can (in some special cases) work around it by
using UDFs. I’ve used it once by creating a UDF which produces the current date
and which I flagged as deterministic (ugly, I know). This causes the planner to
run the UDF during planning and use the result as if it were a constant, so
partition pruning works again. It’s currently the only way I know to select the
last x days of data with partition pruning still working.


From: Dima Datsenko [mailto:di...@microsoft.com]
Sent: Thursday, November 22, 2012 2:56 PM
To: user@hive.apache.org
Subject: Efficient partition usage in join

Hi Guys,

I wonder if you could help me.

I have a huge Hive table partitioned by some field. It has thousands of
partitions. Now I have another small table containing a few tens of partition
ids. I’d like to get the data only from those partitions.

However, when I run
Select * from A join B on (A.partition_id = B.partition_id),
it reads all the data from A, then from B, and performs the join in the reduce
stage.

I tried /*+ MAPJOIN */; it ran faster by sparing the reduce stage, but it still
read the whole A table.

Is there a more efficient way to perform the query without reading the whole of
table A?


Thanks
Dima
