Hi,
I recently tried Hive-645 feature and save query results directly to Mysql
table. The feature can be found here:
https://issues.apache.org/jira/browse/HIVE-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel.
The query I tried looks like this:
hive>CREATE TEMPORARY FUNCTION dboutput AS
'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';
hive>SELECT dboutput('jdbc:mysql://localhost/hdfs_db','uid','pwd','INSERT INTO
dc(t,c) VALUES (?,?)',requestbegintime,count(1)) FROM impressions2 GROUP BY
requestbegintime;
It works, but the reduce tasks are very slow:
Task
Complete
Status
Start Time
Finish Time
Errors
Counters
task_201203081246_0001_r_000000<http://wlu-hadoop01:50030/taskdetails.jsp?jobid=job_201203081246_0001&tipid=task_201203081246_0001_r_000000>
100.00%
reduce > reduce
8-Mar-2012 12:47:06
8-Mar-2012 13:33:54 (46mins, 47sec)
11<http://wlu-hadoop01:50030/taskstats.jsp?jobid=job_201203081246_0001&tipid=task_201203081246_0001_r_000000>
I set #reduce to be 4 but is still very slow (finally 171, 667 rows are
inserted to Mysql).
I guess the reduce process didn't insert data to MySql in batch mode, can
anyone give me some suggestions to improve the performance??
PS: I think it might be better to first save results to HDFS and then use Sqoop
to load data to Mysql, right??
Regards,
Wei