Hi Wei,
+1 to Viral's comments.
We have implemented quite a few production systems that use Sqoop to transfer 
data between HDFS and JDBC-compliant databases. Go for it; it is an awesome 
tool. You can find usage instructions in the Sqoop Developer Guide at
http://archive.cloudera.com/cdh/3/sqoop/SqoopDevGuide.html
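For example, a minimal export command could look something like the following 
(just a sketch; the export directory is a placeholder for wherever your query 
results land, and Hive's default Ctrl-A field delimiter is assumed):

    sqoop export \
      --connect jdbc:mysql://localhost/hdfs_db \
      --username uid --password pwd \
      --table dc \
      --export-dir /user/hive/warehouse/dc_results \
      --input-fields-terminated-by '\001'

Sqoop runs the export as a map-only job and batches rows into multi-row INSERT 
statements, which is much faster than inserting one row at a time.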


Regards
Bejoy.K.S


________________________________
 From: Viral Bajaria <viral.baja...@gmail.com>
To: "Lu, Wei" <w...@microstrategy.com>; "user@hive.apache.org" 
<user@hive.apache.org> 
Sent: Friday, March 9, 2012 11:19 PM
Subject: RE: Hive-645 is slow to insert query results to mysql
 

Hey Wei,

I have used the UDF before and found it is only useful for summary results, 
not for big datasets, because of the fault-tolerant nature of map/reduce: if a 
task attempt fails or is speculatively re-executed, its rows get inserted 
again, so unless you have a well-defined primary key on the target table you 
will end up with more rows than your query results. And you are correct in 
saying that this is not a bulk insert, since the UDF executes in the SELECT 
clause and hence processes each returned row individually.
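If you do want to experiment with it, one way to guard against re-executed 
task attempts (a sketch, assuming MySQL and that t alone identifies a row) is 
to put a primary key on the target table:

    CREATE TABLE dc (
      t BIGINT PRIMARY KEY,  -- assumes one output row per requestbegintime
      c BIGINT
    );

and then use an upsert in the dboutput call instead of a plain insert:

    'INSERT INTO dc(t,c) VALUES (?,?) ON DUPLICATE KEY UPDATE c = VALUES(c)'

so a retried attempt overwrites the row rather than duplicating it.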

You can try your solution using Sqoop; it seems to be the most common way of 
getting data out of HDFS into databases, though I have not used it personally.

Viral

________________________________
From: Lu, Wei
Sent: 3/8/2012 6:58 PM
To: user@hive.apache.org
Subject: Hive-645 is slow to insert query results to mysql

 
Hi,
 
I recently tried the HIVE-645 feature to save query results directly to a 
MySQL table. The feature can be found here:
https://issues.apache.org/jira/browse/HIVE-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
The query I tried looks like this:
 
hive> CREATE TEMPORARY FUNCTION dboutput AS
    > 'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';
hive> SELECT dboutput('jdbc:mysql://localhost/hdfs_db', 'uid', 'pwd',
    >   'INSERT INTO dc(t,c) VALUES (?,?)', requestbegintime, count(1))
    > FROM impressions2
    > GROUP BY requestbegintime;
 
It works, but the reduce tasks are very slow:
 
Task: task_201203081246_0001_r_000000
Complete: 100.00% (reduce > reduce)
Start Time: 8-Mar-2012 12:47:06
Finish Time: 8-Mar-2012 13:33:54 (46mins, 47sec)
Counters: 11
 
I set the number of reducers to 4, but it is still very slow (in the end, 
171,667 rows were inserted into MySQL).
 
I guess the reduce process didn’t insert data into MySQL in batch mode. Can 
anyone give me some suggestions to improve the performance?
PS: I think it might be better to first save the results to HDFS and then use 
Sqoop to load the data into MySQL, right?
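Something like this two-step flow is what I have in mind (the output 
directory is just an example):

hive> INSERT OVERWRITE DIRECTORY '/tmp/dc_results'
    > SELECT requestbegintime, count(1)
    > FROM impressions2
    > GROUP BY requestbegintime;

and then a sqoop export of /tmp/dc_results into the dc table.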
 
Regards,
Wei
