Chandra,

You don't necessarily need Java to implement the mapper/reducer. Check out the answer in this post:
http://stackoverflow.com/questions/6178614/custom-map-reduce-program-on-hive-whats-the-rulehow-about-input-and-output

Also, in my sample, A.column1, A.column2 ==> mymapper ==> key, value: mymapper simply reads each row from stdin and converts it to a key/value pair.
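To make the streaming contract concrete, here is a minimal mapper sketch in Java. The class name and the two-column layout are illustrative placeholders, not my exact code: Hive pipes the MAP clause's columns to the process's stdin as tab-separated rows, and whatever the process writes to stdout as tab-separated fields becomes the "AS key, value" output.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class MyMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // Each input line is one row: the MAP clause's columns, tab-separated.
            String[] cols = line.split("\t", -1);
            // Illustrative logic: first column becomes the key, second the value.
            String key = cols[0];
            String value = cols.length > 1 ? cols[1] : "";
            // key<TAB>value on stdout maps to the "AS key, value" columns.
            System.out.println(key + "\t" + value);
        }
    }
}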
Chen

On Mon, Feb 3, 2014 at 5:51 AM, Bogala, Chandra Reddy <chandra.bog...@gs.com> wrote:

> Hi Wang,
>
> This is my first time trying MAP and REDUCE inside a Hive query. Is it
> possible to share your mymapper and myreducer code, so that I can
> understand how the columns (A.column1, A....) are converted to key/value?
> Also, can you point me to some documents where I can read more about it?
>
> Thanks,
> Chandra
>
> *From:* Chen Wang [mailto:chen.apache.s...@gmail.com]
> *Sent:* Monday, February 03, 2014 12:26 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Hadoop streaming with insert dynamic partition generate many small files
>
> It seems that hive.exec.reducers.bytes.per.reducer was still not big
> enough: I added another 0, and now I get only one file under each
> partition.
>
> On Sun, Feb 2, 2014 at 10:14 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> Hi,
>
> I am using a Java mapper and reducer to read from one table and write to
> another:
>
> FROM (
>   FROM (
>     SELECT column1, ...
>     FROM table1
>     WHERE ( partition > 6 AND partition < 12 )
>   ) A
>   MAP A.column1, A....
>   USING 'java -cp .:my.jar mymapper.mymapper'
>   AS key, value
>   CLUSTER BY key
> ) map_output
> INSERT OVERWRITE TABLE target_table PARTITION(partition)
> REDUCE
>   map_output.key,
>   map_output.value
> USING 'java -cp .:myjar.jar myreducer.myreducer'
> AS column1, column2;
>
> It is all working fine, except that many (20-30) small files are generated
> under each partition. I am setting
> SET hive.exec.reducers.bytes.per.reducer=1280000000;
> hoping to get one file big enough for each partition, but it does not seem
> to have any effect. I still get 20-30 small files under each folder, each
> around 7 KB.
>
> How can I force it to generate only one big file per partition? Does this
> have anything to do with the streaming? I recall that in the past, when I
> read directly from a table with a UDF and wrote to another table, it
> generated only one big file for the target partition. Not sure why that is.
>
> Any help appreciated.
>
> Thanks,
> Chen
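For the reducer side, a minimal sketch under the same assumptions (the per-key row count is a placeholder aggregate, not the actual logic): because of CLUSTER BY key, all rows for a given key arrive at the same reducer consecutively as key<TAB>value lines on stdin, so the process can detect key boundaries itself and emit one output row per group. The tab-separated output fields map to the "AS column1, column2" columns of the REDUCE clause.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class MyReducer {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String currentKey = null;
        long count = 0; // placeholder aggregate: number of rows per key
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t", -1);
            String key = fields[0];
            if (currentKey != null && !currentKey.equals(key)) {
                // Key boundary reached: emit the finished group as column1<TAB>column2.
                System.out.println(currentKey + "\t" + count);
                count = 0;
            }
            currentKey = key;
            count++;
        }
        // Flush the last group, if any input was seen.
        if (currentKey != null) {
            System.out.println(currentKey + "\t" + count);
        }
    }
}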