Re: CDH3 U1 Hive Job-commit very slow

2011-08-10 Thread air
Hi Aggarwal, I am using the newest version (CDH3 Update 1, Hive 0.7). After submitting several jobs through Hive, job submission becomes very slow (about 2-5 minutes). The following is some error information from hive.log (it seems the metastore has a problem; I upgraded the metastore from 0.5 to 0.6 and then f

Re: why need to copy when run a sql with a single map

2011-08-10 Thread Kai Ju Liu
Hi Daniel. The Hive query uses a reduce step to group by retailer_key and calculate count(*). The "copy" step is a copy of data from the mapper to the reducer.
Kai Ju
2011/8/10 Daniel,Wu
> I run a single query like
>
> select retailer_key,count(*) from records group by retailer_key;
>
> it uses

Re: why need to copy when run a sql with a single map

2011-08-10 Thread bejoy_ks
Hi, Hive queries are compiled into Hadoop MapReduce jobs. In a MapReduce job, between the map and reduce tasks there are two phases, the copy phase and the sort phase, together known as the shuffle and sort phase. So the copy task shown for the Hive job here should be the copy phase of MapReduce. It does the co
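As a rough illustration of the copy phase described above, the following Python sketch simulates the three stages for a query like select retailer_key,count(*) from records group by retailer_key. It is a single-process toy, not how Hadoop is implemented; the sample rows, the number of reducers, and all function names are made up for illustration.

```python
from collections import defaultdict

def map_phase(rows):
    """Each mapper emits a (key, 1) pair for every input row."""
    return [(retailer_key, 1) for retailer_key in rows]

def copy_phase(map_output, num_reducers):
    """Copy (shuffle) each map output pair to the reducer that owns its key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partitions):
    """Each reducer walks its keys in sorted order and sums the values: count(*)."""
    result = {}
    for partition in partitions:
        for key in sorted(partition):
            result[key] = sum(partition[key])
    return result

rows = [101, 102, 101, 103, 101, 102]  # hypothetical retailer_key values
counts = reduce_phase(copy_phase(map_phase(rows), num_reducers=2))
print(counts)
```

Even with a single map task, the map output still has to reach the task that runs the reduce step, which is why a copy step shows up for a GROUP BY query.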

RE: CDH3 U1 Hive Job-commit very slow

2011-08-10 Thread Aggarwal, Vaibhav
How much time is the query startup taking? In earlier versions of Hive (before HIVE-2299), query startup used an algorithm that took O(n^2) operations in the number of partitions. With about 10,000 partitions, this means roughly 100M operations before the MapReduce job is submitted.
From: air [mailto:cnwe...@gmail.com] Se
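To make the O(n^2) estimate concrete, here is a small Python calculation. It assumes exactly one operation per pair of partitions, which is a simplification: the real constant factor is not given in the thread, so only the order of magnitude is meaningful. The value 10186 is the partition count reported elsewhere in this thread; the function name is hypothetical.

```python
def startup_operations(num_partitions):
    """Operation count for a startup step that is quadratic in partitions.

    Assumes one operation per partition pair (illustrative only).
    """
    return num_partitions ** 2

for n in (100, 1_000, 10_186):
    print(f"{n:>6} partitions -> {startup_operations(n):>12,} operations")
```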

Re: Fw:why hive has such a high latency?

2011-08-10 Thread Guy Bayes
There is a lot of overhead in the Hadoop distributed computing infrastructure. If you are not designing for data that is much bigger than this example, you might consider other alternatives.
Guy
2011/8/10 Daniel,Wu
> > Anyone know why hive has such a high latency? scan a table with
> 16,522,43

why need to copy when run a sql with a single map

2011-08-10 Thread Daniel,Wu
I ran a single query like
select retailer_key,count(*) from records group by retailer_key;
It uses a single map as shown below. Since the file is already on HDFS, I think Hadoop/Hive doesn't need to copy anything.
Kind | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempt

Re: CDH3 U1 Hive Job-commit very slow

2011-08-10 Thread air
There are only 10186 partitions in the metastore (select count(1) from PARTITIONS; in MySQL), so I don't think that is the problem.
2011/8/10 Aggarwal, Vaibhav
> Do you have a lot of partitions in your table?
>
> Time taken to process the partitions before submitting the job is
> proportional t

Fw:why hive has such a high latency?

2011-08-10 Thread Daniel,Wu
Does anyone know why Hive has such high latency? Scanning a table with 16,522,439 rows takes more than 85 seconds. Reading this data off disk should take only about 10 seconds (even without considering caching, which would read the data from memory). So where do the other 75 seconds go? Will Deserialize & Serialize t
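The arithmetic behind this question can be written out in a few lines of Python. The numbers are the ones quoted in the message; since the scan took "more than" 85 seconds, the throughput figure is an upper bound. The variable names are only for illustration.

```python
rows = 16_522_439      # rows scanned, from the message
total_seconds = 85     # reported lower bound on query time
disk_seconds = 10      # poster's estimate for the raw disk read

unexplained_seconds = total_seconds - disk_seconds
rows_per_second = rows / total_seconds
overhead_share = unexplained_seconds / total_seconds

print(f"unexplained time: {unexplained_seconds} s")
print(f"throughput: at most {rows_per_second:,.0f} rows/s")
print(f"share of time not spent on the raw disk read: {overhead_share:.0%}")
```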

hadoop conf "fs.default.name" can't be setted ip:port format directly?

2011-08-10 Thread Jander g
Hi all, in order to use Hive, the Hadoop conf "fs.default.name" must be set to hostname:port; otherwise Hive throws a Wrong FS exception. In my opinion, an IP address and a hostname are equivalent. Is there something wrong in my Hive conf? Any help will be greatly appreciated. -- Thanks, Jander

Adding UDF permanently to my Hive install

2011-08-10 Thread Ayon Sinha
Hi, I know I have asked this question before, but the answers didn't quite satisfy my requirements. Our problem stems from the fact that there is no built-in UDF equivalent to MySQL's yearweek. This gives the year + week concatenated, with the week starting on Sunday. So we wrote our UDF but no
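For readers unfamiliar with the function, here is a minimal Python sketch of the date arithmetic described above: year and week concatenated, with weeks starting on Sunday. It is not the poster's UDF (a Hive UDF would normally be written in Java). It assumes that a week belongs to the year in which its Sunday falls, which is intended to approximate MySQL's default yearweek behaviour and should be checked against MySQL before relying on it. The function name is hypothetical.

```python
from datetime import date, timedelta

def yearweek_sunday(d: date) -> int:
    """Return year * 100 + week number, with weeks starting on Sunday.

    Assumption: a week is assigned to the year containing its Sunday,
    so early-January dates can fall into the previous year's last week.
    """
    # date.weekday(): Monday=0 ... Sunday=6, so this steps back to Sunday.
    week_start = d - timedelta(days=(d.weekday() + 1) % 7)
    # Week 1 starts on the first Sunday of the year.
    week = (week_start.timetuple().tm_yday - 1) // 7 + 1
    return week_start.year * 100 + week

print(yearweek_sunday(date(2011, 8, 10)))
```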