Hi all, I am trying to load Pig output into Hive as an external table, but Hive always sets the number of mappers to 1, even though the data has more than 10 million records and is spread across multiple files. Does anyone have any idea why?
To be more specific, the output is in Parquet format, generated by a Pig script without any compression:

    STORE rows INTO '/table-data/test' USING parquet.pig.ParquetStorer;

The directory contains 16 part-m-00xx.parquet files plus a _metadata file, and the external table points to that directory. Here is the CREATE TABLE statement I used:

    CREATE EXTERNAL TABLE `t_main_wop`(
      `id` string,
      `f1` string,
      ...
    )
    STORED AS PARQUET
    LOCATION '/table-data/test';

Hive seems to read the Parquet files themselves properly, since SELECT * FROM t_main_wop; returns the expected result. However, every time I run a query that requires a MapReduce job, it uses only a single mapper and takes forever:

    hive> select count(*) from t_main_wop;
    Query ID = xxx
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_yyy, Tracking URL = zzz
    Kill Command = hadoop_job -kill job_yyy
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2014-12-24 02:49:46,912 Stage-1 map = 0%, reduce = 0%
    2014-12-24 02:50:45,847 Stage-1 map = 0%, reduce = 0%

Why is that? I have set mapred.map.tasks=100, but to no avail. Since the directory contains 16 part files, I would expect Hive to be able to use at least 16 mappers.

I would really appreciate any suggestions.

Thanks,
Akira
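P.S. In case the input format is relevant: as far as I understand, Hive's default hive.input.format is CombineHiveInputFormat, which can merge many files into a single split, and mapred.map.tasks is only a hint for some input formats. This is what I am planning to try next (the split-size value is an arbitrary guess, and I am not sure these settings behave the same for Parquet on my Hive version):

```sql
-- Switch to the non-combining input format so each file gets its own split
-- (assumption: this is the right knob for my Hive version).
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Cap the split size so large files are broken into multiple splits
-- (256 MB here is just an illustrative value).
set mapred.max.split.size=268435456;

select count(*) from t_main_wop;
```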