Hi cem <cayiro...@gmail.com>,

In your previous reply, you mentioned that you have a simple solution. Could you share it with us? :)
Thanks in advance.

On Sat, Mar 30, 2013 at 2:33 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> It should be easy to control the number of map tasks.
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you
> might run into a directory with 10,000 small files and you do not want
> 10,000 map tasks. This is what the CombinedInputFormat's do, they help you
> control the number of map tasks a job will generate. For example, imagine I
> have a multi-tenant cluster. If a job kicks up 10,000 map tasks, all those
> tasks can starve out other jobs. Being able to say "I only want 4 map tasks
> per c* node regardless of the number of vnodes" would be a meaningful and
> useful feature.
>
> On Fri, Mar 29, 2013 at 2:17 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> Yes, but my point is that with 50 map slots you can only be processing 50 at
>> once. So it will take 1000/50 "waves" of mappers to complete the job.
>>
>> On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>>> My point is that if you have over 16MB of data per node, you're going
>>> to get thousands of map tasks (that is: hundreds per node) with or
>>> without vnodes.
>>>
>>> On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>> > Every map reduce task typically has a minimum Xmx of 256MB memory. See
>>> > mapred.child.java.opts...
>>> > So if you have a 10 node cluster with 256 vnodes... You will need to spawn
>>> > 2,560 map tasks to complete a job.
>>> > And a 10 node hadoop cluster with 5 map slots a node... You have 50 map
>>> > slots.
>>> >
>>> > Wouldn't it be better if the input format spawned 10 map tasks instead of
>>> > 2,560?
>>> >
>>> > On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> >>
>>> >> I still don't see the hole in the following reasoning:
>>> >>
>>> >> - Input splits are 64k by default. At this size, map processing time
>>> >> dominates job creation.
>>> >> - Therefore, if job creation time dominates, you have a toy data set
>>> >> (< 64K * 256 vnodes = 16 MB)
>>> >>
>>> >> Adding complexity to our inputformat to improve performance for this
>>> >> niche does not sound like a good idea to me.
>>> >>
>>> >> On Thu, Mar 28, 2013 at 8:40 AM, cem <cayiro...@gmail.com> wrote:
>>> >> > Hi Alicia,
>>> >> >
>>> >> > Cassandra input format creates as many mappers as vnodes. It is a known
>>> >> > issue. You need to lower the number of vnodes :(
>>> >> >
>>> >> > I have a simple solution for that and am ready to write a patch. Should I
>>> >> > create a ticket about that? I don't know the procedure for that.
>>> >> >
>>> >> > Regards,
>>> >> > Cem
>>> >> >
>>> >> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong <lccali...@gmail.com> wrote:
>>> >> >>
>>> >> >> Hi All,
>>> >> >>
>>> >> >> I have 3 nodes of Cassandra 1.2.3 & edited the cassandra.yaml for
>>> >> >> vnodes.
>>> >> >>
>>> >> >> When I execute a M/R job, the console shows HUNDREDS of map tasks.
>>> >> >>
>>> >> >> May I know, is this normal since it is using vnodes? If yes, this has
>>> >> >> slowed the M/R job to finish/complete.
>>> >> >>
>>> >> >> Thanks
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder, http://www.datastax.com
>>> >> @spyced
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder, http://www.datastax.com
>>> @spyced
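
For anyone following the numbers in the thread, here is a rough sketch of the arithmetic in Edward's 10-node example. It assumes exactly one map task per vnode (the small-data case discussed above) and that every map slot is free for this job. The function and variable names are made up for illustration, and the "4 map tasks per node" cap is the hypothetical feature Edward describes, not an existing setting.

```python
import math


def map_waves(num_tasks: int, map_slots: int) -> int:
    """Sequential 'waves' needed to run num_tasks when only map_slots run at once."""
    if map_slots <= 0:
        raise ValueError("map_slots must be positive")
    return math.ceil(num_tasks / map_slots)


# Figures quoted in the thread.
nodes = 10
vnodes_per_node = 256
slots_per_node = 5

# Assumption: one map task per vnode.
tasks_per_vnode = nodes * vnodes_per_node      # 2,560 map tasks
total_slots = nodes * slots_per_node           # 50 map slots

waves_per_vnode = map_waves(tasks_per_vnode, total_slots)

# Hypothetical cap of 4 map tasks per Cassandra node, regardless of vnodes.
capped_tasks = nodes * 4                       # 40 map tasks
waves_capped = map_waves(capped_tasks, total_slots)

print(tasks_per_vnode, total_slots, waves_per_vnode)
print(capped_tasks, waves_capped)
```

With these inputs the per-vnode layout needs 52 waves of mappers, while the capped layout fits in a single wave. Whether that difference matters in practice depends on how much data each task processes, which is Jonathan's point.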