Hi,
I know this has been asked before. I have googled around this topic and tried to
understand as much as possible, but I got different answers in different places.
So I'd like to describe what I have run into and ask if someone can help me
with this topic.
I created a table in Hive with one column family containing 20+ columns, and
populated it with around 150M records from a 20 GB CSV file. What I want to
check is how fast a full scan of the HBase table can go in an MR job.
It is running on a 10-node Hadoop cluster (Hadoop 1.1.1 + HBase 0.94.3 + Hive
0.9): 8 of them are data + task nodes, one is the NN and HBase master, and
another one runs the 2nd NN.
4 of the 8 data nodes also run HBase region servers.
I used the following code example to get a row count from an MR job:
http://hbase.apache.org/book/mapreduce.example.html
At first, the mapper tasks ran very slowly, because I commented out the
following 2 lines on purpose:
scan.setCaching(1000);      // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
Then I added the above 2 lines back, and the job ran almost 10X faster than the
first run. That's good; it proved to me that those 2 lines are important for an
HBase full scan.
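For intuition on why the caching line matters so much (my own rough arithmetic, not a measurement): with the default caching of 1, the scanner makes one RPC round trip per row, while caching=1000 fetches 1000 rows per round trip:

```java
// Rough round-trip arithmetic for a full scan (illustrative numbers only).
public class ScanRpcEstimate {
    public static void main(String[] args) {
        long rows = 150_000_000L;        // ~150M records in the table
        long rpcsDefault = rows / 1;     // scanner caching = 1 (the default)
        long rpcsCached = rows / 1000;   // scanner caching = 1000
        System.out.println(rpcsDefault); // 150000000 round trips
        System.out.println(rpcsCached);  // 150000 round trips
    }
}
```

That is a 1000X reduction in round trips to the region servers, which is why I expected (and saw) a big difference in my own MR job.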
Now to my question about Hive.
I already created a table in Hive linked to the HBase table, then I started my
Hive session like this:
hive --auxpath
$HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar
-hiveconf hbase.master=Hbase_master:port
If I run the query "select count(*) from table", I can see that the mapper
performance is very bad, almost as bad as my 1st run above.
I searched this mailing list, and it looks like there is a setting in the Hive
session to change the scan caching size, the same as the 1st line of the code
above, from here:
http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spwf4qmu...@mail.gmail.com%3E
So I added the following setting in my Hive session:
set hbase.client.scanner.caching=1000;
To my surprise, after this setting in the Hive session, the new MR job
generated from the Hive query was still very slow, the same as before the
setting.
Here is what I found so far:
1) In my own MR code, both before and after the 2-line code change, I saw this
setting in the job.xml of the MR job:
hbase.client.scanner.caching=1
So this setting is the same in both runs, but performance improved greatly
after the code change.
2) In the Hive run, I saw the setting "hbase.client.scanner.caching" change
from 1 to 1000 in job.xml, which is what I set in the Hive session, but
performance did not change much. So the setting was applied, but it didn't
help performance as I expected.
My questions are the following:
1) Is there any setting in Hive (0.9) that does the same as the 1st line of the
code change? From Google and the HBase documentation, it looks like the above
configuration is the one, but it didn't help me.
2) Even assuming the above setting is correct, why do we have this Hive JIRA to
fix the HBase scan cache, marked as fixed ONLY in Hive 0.12? The JIRA ticket
is here:
https://issues.apache.org/jira/browse/HIVE-3603
3) Is there any Hive setting that does the same as the 2nd line of the code
change above? If so, what is it? I googled around and cannot find one.
Thanks
Yong