>Is there anyway to avoid creating sub-directories while running in tez?
>Or this is by design and can not be changed?

Yes, this is by design. Tez executes the branches of a UNION entirely in
parallel and their task IDs overlap, so the files they create have to have
unique names.

The task IDs can't simply be offset to avoid the clash, because the total
task counts for "Map 1" and "Map 2" are only known once the job is running.
So each branch writes to its own sub-directory instead.
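
To illustrate the clash, here is a minimal Python sketch (not Hive or Tez
code). The file-name pattern "000000_0" and the sub-directory names "1" and
"2" are assumptions made for the example; the point is only that two
vertices which both number their tasks from 0 produce the same names in a
shared directory, and distinct paths once each branch has its own.

```python
from collections import Counter


def output_paths(task_counts, use_subdirs):
    """Build the output paths written by each branch of a UNION.

    task_counts maps a vertex name to its number of tasks. Every vertex
    numbers its own tasks from 0, which is why the names can overlap.
    """
    paths = []
    for branch, vertex in enumerate(sorted(task_counts), start=1):
        for task_id in range(task_counts[vertex]):
            name = f"{task_id:06d}_0"  # assumed naming pattern, illustration only
            # With sub-directories, the branch number becomes part of the path.
            paths.append(f"{branch}/{name}" if use_subdirs else name)
    return paths


def collisions(paths):
    """Return the paths that more than one task would write to."""
    return sorted(p for p, count in Counter(paths).items() if count > 1)


counts = {"Map 1": 3, "Map 2": 2}
flat = collisions(output_paths(counts, use_subdirs=False))
nested = collisions(output_paths(counts, use_subdirs=True))
print("shared directory:", flat)
print("one sub-directory per branch:", nested)
```

With three tasks in "Map 1" and two in "Map 2", the shared directory has two
clashing names and the per-branch layout has none.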

Here's a comparison of MapReduce vs Tez (from 2014, some slides are out of
date now).

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/15


This UNION method is faster because it does fewer intermediate HDFS writes.
Reading the sub-directories back works because
mapreduce.input.fileinputformat.input.dir.recursive=true kicks in as long
as your cluster runs YARN (which it does, because otherwise Tez wouldn't
work).
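
As a rough local-filesystem analogy for what recursive input listing
changes, here is a small Python sketch. It is not how Hive or Hadoop list
HDFS paths; the directory layout and file names are made up for the example.
A flat listing of the table directory finds no data files, while a recursive
one finds the files under each branch's sub-directory.

```python
import os
import tempfile


def list_files(root, recursive):
    """List data files under root, as paths relative to root."""
    if not recursive:
        # Flat listing: only files directly inside root are seen.
        return sorted(
            entry for entry in os.listdir(root)
            if os.path.isfile(os.path.join(root, entry))
        )
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


with tempfile.TemporaryDirectory() as table_dir:
    # Hypothetical layout: one sub-directory per UNION branch.
    for sub, name in [("1", "000000_0"), ("1", "000001_0"), ("2", "000000_0")]:
        os.makedirs(os.path.join(table_dir, sub), exist_ok=True)
        open(os.path.join(table_dir, sub, name), "w").close()

    flat_listing = list_files(table_dir, recursive=False)
    recursive_listing = list_files(table_dir, recursive=True)

print("flat:", flat_listing)
print("recursive:", recursive_listing)
```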

Cheers,
Gopal

