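Hi Daniel,

The Hive query needs a reduce step to group by retailer_key and calculate count(*). The "copy" step in the job report is not a copy of the input file on HDFS: it is the reducer fetching the map output from the mapper. That transfer happens even when the job has only one map task and one reduce task. The reducer then merges and sorts what it fetched ("sort") and runs the aggregation ("reduce").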
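To make the three phases concrete, here is a toy in-memory sketch in Python. It is not Hadoop or Hive code, only a model of the data flow; the function names and the sample rows are made up for illustration, and it ignores any map-side aggregation Hive may apply.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(rows):
    """Map: emit one (retailer_key, 1) pair per input row."""
    return [(row["retailer_key"], 1) for row in rows]


def copy_phase(outputs_per_map_task):
    """Copy: the reducer gathers the output of every map task."""
    fetched = []
    for output in outputs_per_map_task:
        fetched.extend(output)
    return fetched


def sort_phase(pairs):
    """Sort: order the fetched pairs by key so equal keys are adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Reduce: sum the 1s for each key, which gives count(*) per group."""
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def count_by_retailer(rows):
    map_output = map_phase(rows)        # a single map task, as in your job
    fetched = copy_phase([map_output])  # its output still has to reach the reducer
    return reduce_phase(sort_phase(fetched))


# Hypothetical sample rows; the real table contents are not shown in this thread.
sample_rows = [
    {"retailer_key": "r1"},
    {"retailer_key": "r2"},
    {"retailer_key": "r1"},
]
print(count_by_retailer(sample_rows))
```

Even with a single entry in the list passed to copy_phase, the reducer has to receive the map output before it can sort and aggregate it, which is why the phase shows up in the report.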
Kai Ju

2011/8/10 Daniel,Wu <hadoop...@163.com>

> I run a single query like
>
> select retailer_key,count(*) from records group by retailer_key;
>
> it uses a single map as shown below, since the file is already on HDFS, so
> I think hadoop/hive doesn't need to copy anything.
>
> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
> map     100.00%     1          0        0        1         0       0 / 0
> reduce  100.00%     1          0        0        1         0       0 / 0
>
> but the final chart in the job report shows "copy" takes about 33% of the
> total time, and the rest are "sort", and "reduce". So why it should copy
> here, or copy means something else?
>
> oracle@oracle-MS-7623:~/test$ hadoop fs -lsr /
>
> drwxr-xr-x - oracle supergroup 0 2011-08-10 19:46 /user
> drwxr-xr-x - oracle supergroup 0 2011-08-10 19:46 /user/hive
> drwxr-xr-x - oracle supergroup 0 2011-08-10 19:59 /user/hive/warehouse
> drwxr-xr-x - oracle supergroup 0 2011-08-10 19:59 /user/hive/warehouse/records
> -rw-r--r-- 1 oracle supergroup 41600256 2011-08-10 19:59 /user/hive/warehouse/records/test.txt