No, I didn't use zip. It's just a plain CSV file, which I loaded into Hive with
the command
       load data local inpath '/home/oracle/sales.csv' into table test;
I am not sure whether this command alone distributes the file evenly across the
cluster (3 nodes), so I ran the following command in the hope of spreading the
data across the cluster:
     create table sales as select * from test;

But when I check the map tasks, the job shows 8 splits, and all of them are on
node test1.  If I run the SQL
   select period_key, count(*) from sales group by period_key;
it kicks off ONE map task and 3 reduce tasks, so it looks like it always uses
one map task.  I have 2 questions:
1: Why doesn't Hadoop distribute the input splits evenly across the nodes?
Shouldn't it put 3 splits on each of 2 nodes and 2 splits on the third node
(3*2 + 2 = 8 splits)?
2: How can I get multiple map tasks?
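
For background on question 2, here is a rough Python sketch of how the number of
input splits for one file is usually derived. It is a simplified model of the
rule in Hadoop's FileInputFormat (split size = max(min size, min(max size, block
size)), with a 10% slop on the last split), not Hive's actual code. The 500 MB
file size and 64 MB block size are made-up illustration values, not the real
size of sales.csv. If Hive combines small splits (for example through
CombineHiveInputFormat), the number of map tasks can be lower than the count
computed here.

```python
# Simplified model of how Hadoop's FileInputFormat cuts one file into splits.
# The file size and block size used at the bottom are made-up values.

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule: max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Return how many input splits one file yields under this model."""
    split_size = compute_split_size(block_size, min_size, max_size)
    remaining = file_size
    splits = 0
    # Cut full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final (possibly slightly larger) split.
    if remaining != 0:
        splits += 1
    return splits


print(count_splits(500 * MB, 64 * MB))  # 8
print(count_splits(70 * MB, 64 * MB))   # 1, because 70/64 is below the slop
```

In this model a smaller maximum split size gives more splits, for example
count_splits(500 * MB, 64 * MB, max_size=32 * MB) gives 16.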



Input Split Locations
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
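
The arithmetic behind question 1 can be checked with a small Python sketch that
tallies the listing above per node and compares it with an even spread. This is
only an illustration: the node names test2 and test3 are placeholders, since
only test1 appears in the job output.

```python
from collections import Counter

# Split locations copied from the "Input Split Locations" listing above.
split_locations = ["/default-rack/test1"] * 8

# test2 and test3 are hypothetical names for the other two nodes.
nodes = ["test1", "test2", "test3"]


def splits_per_node(locations, nodes):
    """Count how many splits report each node as their location."""
    counts = Counter(loc.rsplit("/", 1)[-1] for loc in locations)
    return {node: counts.get(node, 0) for node in nodes}


def even_share(num_splits, num_nodes):
    """Split counts per node if splits were spread as evenly as possible."""
    base, extra = divmod(num_splits, num_nodes)
    return [base + 1] * extra + [base] * (num_nodes - extra)


print(splits_per_node(split_locations, nodes))
# {'test1': 8, 'test2': 0, 'test3': 0}
print(even_share(len(split_locations), len(nodes)))
# [3, 3, 2]
```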

At 2011-08-23 21:58:04,"Vikas Srivastava" <vikas.srivast...@one97.net> wrote:
Hey, are you storing the data in a zipped format?

If yes, that is why it gets only a single map task.


2011/8/23 Daniel,Wu<hadoop...@163.com>

  I ran the following simple SQL:
select count(*) from sales;
and the job information shows that it uses only one map task.

The underlying Hadoop cluster has 3 data nodes, so I expected Hive to kick off 3
map tasks, one on each node. What can make Hive run only one map task? Do I
need to set something to kick off multiple map tasks? I didn't change the Hive
config.







--
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !
