No, I didn't use zip; it's just a plain CSV file. I loaded it into Hive with this command:

    load data local inpath '/home/oracle/sales.csv' into table test;

I am not sure whether this command alone distributes the file evenly across the cluster (3 nodes), so I ran the following command in the hope of spreading the data across the cluster:

    create table sales as select * from test;

But when I check the map tasks, the job shows 8 splits, all of them on node test1. If I run the SQL

    select period_key,count(*) from sales group by period_key

it kicks off ONE map task and 3 reduce tasks. So it looks like it always uses one map task.

I have 2 questions:

1: Why doesn't Hadoop distribute the input splits evenly across the nodes? Shouldn't it put 3 splits on each of 2 nodes and 2 splits on the remaining node (3*2 + 2 = 8 splits)?
2: How do I get multiple map tasks?

Input Split Locations
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1
/default-rack/test1

At 2011-08-23 21:58:04, "Vikas Srivastava" <vikas.srivast...@one97.net> wrote:

> Hey, are you storing the data in zipped format? If yes, that is why it is
> only split into a single map.
>
> 2011/8/23 Daniel,Wu <hadoop...@163.com>
>
>> I ran the following simple SQL:
>>
>>     select count(*) from sales;
>>
>> The job information shows that it uses only one map task. The underlying
>> Hadoop cluster has 3 data nodes, so I expected Hive to kick off 3 map
>> tasks, one on each node. What can make Hive run only one map task? Do I
>> need to set something to kick off multiple map tasks? I didn't change
>> the Hive config.
>
> --
> With Regards
> Vikas Srivastava
> DWH & Analytics Team
> Mob:+91 9560885900
> One97 | Let's get talking !
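
For reference, the even spread asked about in question 1 can be written out as a small Python sketch. It only illustrates the 3*2 + 2 = 8 arithmetic, under the assumption that splits would be dealt to the nodes in turn; it is not a description of how Hadoop assigns splits, whose locations follow where the HDFS blocks are stored. The node names test2 and test3 are hypothetical, since only test1 appears in the job output.

```python
from collections import Counter


def round_robin(num_splits, nodes):
    """Deal split indices over the nodes in turn and count splits per node."""
    return Counter(nodes[i % len(nodes)] for i in range(num_splits))


# test1 comes from the job output; test2 and test3 are hypothetical names.
nodes = ["test1", "test2", "test3"]
counts = round_robin(8, nodes)
print(dict(counts))
```

With 8 splits and 3 nodes this gives two nodes 3 splits each and the third node 2, which is the layout the question expected instead of 8 splits on test1.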