It has to do with the HDFS block size.
I had many small files and the performance became much better when I
merged them.
The default block size is 64 MB, so merge your files into chunks of <= 64 MB (what I did
and recommend),
or reconfigure your Hadoop:
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>The default block size for new files.</description>
</property>
To merge, do something like
cat * | rotatelogs ./merged/m 64M
and it will merge and chop the data for you (rotatelogs ships with Apache httpd).
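If rotatelogs isn't handy, here's a minimal sketch of the same merge-and-chop step using GNU split instead; the demo file names and the tiny 150-byte chunk size are just for illustration (in real use you'd pass -b 64m to match the 64 MB block size):

```shell
#!/bin/sh
# Sketch: merge many small files into fixed-size chunks with split.
# Demo inputs (hypothetical names); real use: cat * | split -b 64m - merged/m
mkdir -p demo merged
printf 'a%.0s' $(seq 1 100) > demo/part1   # 100-byte file
printf 'b%.0s' $(seq 1 100) > demo/part2   # 100-byte file
# Concatenate everything and cut the stream into 150-byte pieces,
# written as merged/maa, merged/mab, ...
cat demo/part1 demo/part2 | split -b 150 - merged/m
ls merged/
```

split writes sequentially named output files (maa, mab, ...), so each chunk except possibly the last comes out at exactly the requested size, just as rotatelogs caps each output file at 64M.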
yoav.morag wrote:
hi all -
can anyone comment on the performance cost of merging many small files into
an increasingly large MapFile? Will that cost depend on the size of
the larger MapFile (since I have to rewrite it), or is there a built-in
strategy to split it into smaller parts, affecting only those which were
touched?
thanks -
Yoav.