Hi Viral and Bejoy.K.S

I can see that both of you are suggesting Sqoop, and I will give it a try ☺.

Thanks,
Wei

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Saturday, March 10, 2012 2:11 AM
To: user@hive.apache.org; Lu, Wei
Subject: Re: Hive-645 is slow to insert query results to mysql

Hi Wei
     +1 to Viral's comments.
We have implemented quite a few production systems that use Sqoop to do the 
data transfer between HDFS and JDBC-compliant databases. Go for it, as it is an 
awesome tool. You can find the usage instructions in the Sqoop developer guide at
http://archive.cloudera.com/cdh/3/sqoop/SqoopDevGuide.html
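
For reference, a typical export from HDFS into MySQL looks roughly like the 
sketch below. The export directory and the field delimiter are assumptions 
(Hive writes '\001'-delimited text by default), so adjust them to your layout:

# minimal sketch; --export-dir and the delimiter are assumptions
$ sqoop export \
    --connect jdbc:mysql://localhost/hdfs_db \
    --username uid --password pwd \
    --table dc \
    --export-dir /user/hive/warehouse/dc_staging \
    --input-fields-terminated-by '\001'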

Regards
Bejoy.K.S

________________________________
From: Viral Bajaria <viral.baja...@gmail.com>
To: "Lu, Wei" <w...@microstrategy.com>; "user@hive.apache.org" 
<user@hive.apache.org>
Sent: Friday, March 9, 2012 11:19 PM
Subject: RE: Hive-645 is slow to insert query results to mysql


Hey Wei,

I have used the UDF before and figured it is only useful for summary results 
and not for big datasets, due to the fault-tolerant nature of map/reduce: 
failed or speculatively re-executed tasks re-run their inserts, so if you 
don't have a well-defined primary key you will end up with more rows than your 
query results. And you are correct in saying that this is not a bulk insert, 
since the UDF executes inside the select statement and hence issues one INSERT 
per returned row.
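
If you do go the UDF route anyway, you can at least avoid the duplicates by 
making the statement idempotent. This is only a sketch and assumes dc has a 
PRIMARY KEY on t, which your schema doesn't show:

hive> SELECT dboutput('jdbc:mysql://localhost/hdfs_db','uid','pwd',
          -- retried task attempts now overwrite the row instead of duplicating it
          'INSERT INTO dc(t,c) VALUES (?,?) ON DUPLICATE KEY UPDATE c=VALUES(c)',
          requestbegintime, count(1))
      FROM impressions2 GROUP BY requestbegintime;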

You can try your solution using Sqoop; it seems to be the most common way of 
getting data out of HDFS into databases, though I have not used it personally.

Viral
________________________________
From: Lu, Wei
Sent: 3/8/2012 6:58 PM
To: user@hive.apache.org
Subject: Hive-645 is slow to insert query results to mysql
Hi,

I recently tried the HIVE-645 feature to save query results directly to a MySQL 
table. The feature can be found here: 
https://issues.apache.org/jira/browse/HIVE-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel.
The query I tried looks like this:

hive> CREATE TEMPORARY FUNCTION dboutput AS
      'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';
hive> SELECT dboutput('jdbc:mysql://localhost/hdfs_db','uid','pwd',
          'INSERT INTO dc(t,c) VALUES (?,?)', requestbegintime, count(1))
      FROM impressions2 GROUP BY requestbegintime;

It works, but the reduce tasks are very slow:

Task:        task_201203081246_0001_r_000000
Complete:    100.00%
Status:      reduce > reduce
Start Time:  8-Mar-2012 12:47:06
Finish Time: 8-Mar-2012 13:33:54 (46mins, 47sec)
Errors:      -
Counters:    11

I set the number of reducers to 4, but it is still very slow (in the end 
171,667 rows were inserted into MySQL).

I guess the reduce process didn't insert data into MySQL in batch mode. Can 
anyone give me some suggestions to improve the performance?
PS: I think it might be better to first save the results to HDFS and then use 
Sqoop to load the data into MySQL, right? Roughly, I imagine something like 
the sketch below.
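
This is only a rough sketch, and the output directory is just a placeholder:

hive> INSERT OVERWRITE DIRECTORY '/tmp/dc_export'
      SELECT requestbegintime, count(1)
      FROM impressions2 GROUP BY requestbegintime;

and then export that directory into the dc table with sqoop export.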

Regards,
Wei
