Hi, I am using a Java reducer to read from one table and write to another:
FROM (
  FROM (
    SELECT column1,... FROM table1
    WHERE ( partition > 6 AND partition < 12 )
  ) A
  MAP A.column1,A....
  USING 'java -cp .my.jar mymapper.mymapper'
  AS key, value
  CLUSTER BY key
) map_output
INSERT OVERWRITE TABLE target_table PARTITION(partition)
REDUCE map_output.key, map_output.value
USING 'java -cp .:myjar.jar myreducer.myreducer'
AS column1, column2;

It's all working fine, except that many (20-30) small files are generated under each partition. I am setting

SET hive.exec.reducers.bytes.per.reducer=1280,000,000;

hoping to get one big enough file for each partition, but it does not seem to have any effect. I still get 20-30 small files under each folder, and each file is only around 7 KB.

How can I force Hive to generate only one big file per partition? Does this have anything to do with the streaming? I recall that in the past, when I read directly from a table with a UDF and wrote to another table, it generated only one big file for the target partition. Not sure why that is.

Any help appreciated.

Thanks,
Chen
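
P.S. In case it helps, this is roughly what I run before the INSERT. I assume the commas in the number above are just a slip in this mail and that Hive wants a plain integer; I have also been wondering whether simply pinning the reducer count would do it, but I have not verified that:

SET hive.exec.reducers.bytes.per.reducer=1280000000;
-- my guess, not yet tested: force a single reducer so each partition gets one output file
SET mapred.reduce.tasks=1;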