It should be easy to control the number of map tasks; see
http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you
might run into a directory with 10,000 small files, and you do not want
10,000 map tasks. This is what the combined input formats (e.g.
CombineFileInputFormat) do: they help you control the number of map tasks
a job will generate. For example, imagine I have a multi-tenant cluster.
If a job kicks off 10,000 map tasks, all those tasks can starve out other
jobs. Being able to say "I only want 4 map tasks per C* node regardless of
the number of vnodes" would be a meaningful and useful feature.
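
To put rough numbers on it, here is a minimal sketch using the figures
quoted below (10 nodes, 256 vnodes per node, 5 map slots per node). The
function name map_waves is made up for illustration, and the model assumes
one map task per vnode and ignores scheduling overhead and data locality:

```python
import math

def map_waves(num_tasks: int, nodes: int, slots_per_node: int) -> int:
    """Number of sequential 'waves' of mappers needed when only
    nodes * slots_per_node map tasks can run at the same time."""
    total_slots = nodes * slots_per_node
    return math.ceil(num_tasks / total_slots)

nodes = 10
vnodes_per_node = 256
slots_per_node = 5

# One map task per vnode: 10 * 256 = 2,560 tasks.
tasks_per_vnode = nodes * vnodes_per_node
# Hypothetical cap of 4 map tasks per node: 10 * 4 = 40 tasks.
tasks_capped = nodes * 4

print(map_waves(tasks_per_vnode, nodes, slots_per_node))  # 52
print(map_waves(tasks_capped, nodes, slots_per_node))     # 1
```

With the cap, the whole job fits in the 50 available slots at once and
still leaves slots free for other tenants.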


On Fri, Mar 29, 2013 at 2:17 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Yes, but my point is that with 50 map slots you can only be processing 50
> at once. So it will take 1000/50 "waves" of mappers to complete the job.
>
>
> On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> My point is that if you have over 16MB of data per node, you're going
>> to get thousands of map tasks (that is: hundreds per node) with or
>> without vnodes.
>>
>> On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>> > Every map reduce task typically has a minimum Xmx of 256MB memory. See
>> > mapred.child.java.opts...
>> > So if you have a 10 node cluster with 256 vnodes... You will need to spawn
>> > 2,560 map tasks to complete a job.
>> > And a 10 node hadoop cluster with 5 map slots a node... You have 50 map
>> > slots.
>> >
>> > Wouldn't it be better if the input format spawned 10 map tasks instead of
>> > 2,560?
>> >
>> >
>> > On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis <jbel...@gmail.com>
>> > wrote:
>> >>
>> >> I still don't see the hole in the following reasoning:
>> >>
>> >> - Input splits are 64k by default.  At this size, map processing time
>> >> dominates job creation.
>> >> - Therefore, if job creation time dominates, you have a toy data set
>> >> (< 64K * 256 vnodes = 16 MB)
>> >>
>> >> Adding complexity to our inputformat to improve performance for this
>> >> niche does not sound like a good idea to me.
>> >>
>> >> On Thu, Mar 28, 2013 at 8:40 AM, cem <cayiro...@gmail.com> wrote:
>> >> > Hi Alicia ,
>> >> >
>> >> > The Cassandra input format creates as many mappers as there are
>> >> > vnodes. It is a known issue. You need to lower the number of vnodes :(
>> >> >
>> >> > I have a simple solution for that and am ready to write a patch.
>> >> > Should I create a ticket for it? I don't know the procedure for that.
>> >> >
>> >> >  Regards,
>> >> > Cem
>> >> >
>> >> >
>> >> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong <lccali...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
>> >> >> vnodes.
>> >> >>
>> >> >> When I execute an M/R job, the console shows HUNDREDS of map tasks.
>> >> >>
>> >> >> May I know, is this normal since I am using vnodes? If yes, this
>> >> >> has slowed down the M/R job's completion.
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder, http://www.datastax.com
>> >> @spyced
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>>
>
>
