Re: why need to copy when run a sql with a single map

Kai Ju Liu Wed, 10 Aug 2011 12:02:38 -0700

Hi Daniel. The Hive query uses a reduce step to group by retailer_key and
calculate count(*). The "copy" step is a copy of data from the mapper to the
reducer.
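
As a rough illustration, here is a toy model of those phases in plain Python. It is only a sketch, not Hadoop or Hive code: the function names (map_phase, copy_phase, sort_phase, reduce_phase, run_query) and the sample rows are made up for this example, and it ignores details such as partitioning and any map-side aggregation. It shows that the reducer still has to fetch the map output before it can sort and reduce, even when there is only one map task.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(rows):
    """Mapper: emit a (retailer_key, 1) pair for every input row."""
    return [(row["retailer_key"], 1) for row in rows]


def copy_phase(map_outputs):
    """"Copy": the reducer fetches the output of every map task.

    This happens even with a single map task, because the map output
    lives with the mapper, not with the reducer.
    """
    fetched = []
    for output in map_outputs:
        fetched.extend(output)
    return fetched


def sort_phase(pairs):
    """"Sort": order the fetched pairs by key so equal keys are adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """"Reduce": sum the values for each key, i.e. count(*) per retailer_key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def run_query(rows):
    map_outputs = [map_phase(rows)]  # one list per map task; here a single map
    return reduce_phase(sort_phase(copy_phase(map_outputs)))


# Hypothetical sample data standing in for the "records" table.
rows = [{"retailer_key": k} for k in ("a", "b", "a", "c", "a")]
result = run_query(rows)
print(result)  # {'a': 3, 'b': 1, 'c': 1}
```

In this model the input file being on HDFS only helps the map phase. The "copy" time in the job chart is the equivalent of copy_phase moving map output to the reducer.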


Kai Ju

2011/8/10 Daniel,Wu <hadoop...@163.com>

> I ran a single query:
>
> select retailer_key,count(*) from records group by retailer_key;
>
> It uses a single map task, as shown below. Since the file is already on
> HDFS, I think Hadoop/Hive doesn't need to copy anything.
>
> Job job_201108101943_0001 (JobTracker at http://localhost:50030):
>
> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
> map        100.00%          1        0        0         1       0  0 / 0
> reduce     100.00%          1        0        0         1       0  0 / 0
>
> But the final chart in the job report shows that "copy" takes about 33% of
> the total time, and the rest is "sort" and "reduce". So why does it need to
> copy here, or does "copy" mean something else?
>
> oracle@oracle-MS-7623:~/test$ hadoop fs -lsr /
> drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:46 /user
> drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:46 /user/hive
> drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:59 /user/hive/warehouse
> drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:59 /user/hive/warehouse/records
> -rw-r--r--   1 oracle supergroup   41600256 2011-08-10 19:59 /user/hive/warehouse/records/test.txt
