Re: Tasks are not starting

Ritesh Sinha Thu, 03 Sep 2015 06:09:01 -0700

If your workers are not starting try executing the command which supervisor
executes to start the workers.You can get those commands from the
supervisor logs. Execute the command to manually start the workers and see
if you are getting any error.


On Thu, Sep 3, 2015 at 6:30 PM, Abhishek Agarwal <abhishc...@gmail.com>
wrote:

> When you say that tasks do not start, do you mean that worker process
> itself is not starting?
>
> On Thu, Sep 3, 2015 at 5:20 PM, John Yost <soozandjohny...@gmail.com>
> wrote:
>
>> Hi Nick,
>>
>> What do the nimbus and supervisor logs say? One or both may contain clues
>> as to why your workers are not starting up.
>>
>> --John
>>
>> On Thu, Sep 3, 2015 at 4:44 AM, Matthias J. Sax <mj...@apache.org> wrote:
>>
>>> I am currently working with version 0.11.0-SNAPSHOT and cannot observe
>>> the behavior you describe. If I submit a sample topology with 1 spout
>>> (dop=1) and 1 bolt (dop=10) connected via shuffle grouping and have 12
>>> supervisor available (each with 12 worker slots), each of the 11
>>> executors is running on a single worker of a single supervisor (host).
>>>
>>> I am not idea why you observe a different behavior...
>>>
>>> -Matthias
>>>
>>> On 09/03/2015 12:20 AM, Nick R. Katsipoulakis wrote:
>>> > When I say co-locate, what I have seen in my experiments is the
>>> following:
>>> >
>>> > If the executor's number can be served by workers on one node, the
>>> > scheduler spawns all the executors in the workers of one node. I have
>>> > also seen that behavior in that the default scheduler tries to fill up
>>> > one node before provisioning an additional one for the topology.
>>> >
>>> > Going back to your following sentence "and the executors should be
>>> > evenly distributed over all available workers." I have to say that I do
>>> > not see that often in my experiments. Actually, I often come across
>>> with
>>> > workers handling 2 - 3 executors/tasks, and other doing nothing. Am I
>>> > missing something? Is it just a coincidence that happened in my
>>> experiments?
>>> >
>>> > Thank you,
>>> > Nick
>>> >
>>> >
>>> >
>>> > 2015-09-02 17:38 GMT-04:00 Matthias J. Sax <mj...@apache.org
>>> > <mailto:mj...@apache.org>>:
>>> >
>>> >     I agree. The load is not high.
>>> >
>>> >     About higher latencies. How many ackers did you configure? As a
>>> rule of
>>> >     thumb there should be one acker per executor. If you have less
>>> ackers,
>>> >     and an increasing number of executors, this might cause the
>>> increased
>>> >     latency as the ackers could become a bottleneck.
>>> >
>>> >     What do you mean by "trying to co-locate tasks and executors as
>>> much as
>>> >     possible"? Tasks a logical units of works that are processed by
>>> >     executors (which are threads). Furthermore (as far as I know), the
>>> >     default scheduler does a evenly distributed assignment for tasks
>>> and
>>> >     executor to the available workers. In you case, as you set the
>>> number of
>>> >     task equal to the number of executors, each executors processes a
>>> single
>>> >     task, and the executors should be evenly distributed over all
>>> available
>>> >     workers.
>>> >
>>> >     However, you are right: intra-worker channels are "cheaper" than
>>> >     inter-worker channels. In order to exploit this, you should use
>>> >     shuffle-or-local grouping instead of shuffle. The disadvantage of
>>> >     shuffle-or-local might be missing load-balancing. Shuffle always
>>> ensures
>>> >     good load balancing.
>>> >
>>> >
>>> >     -Matthias
>>> >
>>> >
>>> >
>>> >     On 09/02/2015 10:31 PM, Nick R. Katsipoulakis wrote:
>>> >     > Well, my input load is 4 streams at 4000 tuples per second, and
>>> each
>>> >     > tuple is about 128 bytes long. Therefore, I do not think my load
>>> is too
>>> >     > much for my hardware.
>>> >     >
>>> >     > No, I am running only this topology in my cluster.
>>> >     >
>>> >     > For some reason, when I set the task to executor ratio to 1, my
>>> topology
>>> >     > does not hang at all. The strange thing now is that I see higher
>>> latency
>>> >     > with more executors and I am trying to figure this out. Also, I
>>> see that
>>> >     > the default scheduler is trying to co-locate tasks and executors
>>> as much
>>> >     > as possible. Is this true? If yes, is it because the intra-worker
>>> >     > latencies are much lower than the inter-worker latencies?
>>> >     >
>>> >     > Thanks,
>>> >     > Nick
>>> >     >
>>> >     > 2015-09-02 16:27 GMT-04:00 Matthias J. Sax <mj...@apache.org
>>> <mailto:mj...@apache.org>
>>> >     > <mailto:mj...@apache.org <mailto:mj...@apache.org>>>:
>>> >     >
>>> >     >     So (for each node) you have 4 cores available for 1
>>> supervisor JVM, 2
>>> >     >     worker JVMs that execute up to 5 thread each (if 40
>>> executors are
>>> >     >     distributed evenly over all workers. Thus, about 12 threads
>>> for 4 cores.
>>> >     >     Or course, Storm starts a few more threads within each
>>> >     >     worker/supervisor.
>>> >     >
>>> >     >     If your load is not huge, this might be sufficient. However,
>>> having high
>>> >     >     data rate, it might be problematic.
>>> >     >
>>> >     >     One more question: do you run a single topology in your
>>> cluster or
>>> >     >     multiple? Storm isolates topologies for fault-tolerance
>>> reasons. Thus, a
>>> >     >     single worker cannot process executors from different
>>> topologies. If you
>>> >     >     run out of workers, a topology might not start up completely.
>>> >     >
>>> >     >     -Matthias
>>> >     >
>>> >     >
>>> >     >
>>> >     >     On 09/02/2015 09:54 PM, Nick R. Katsipoulakis wrote:
>>> >     >     > Hello Matthias and thank you for your reply. See my
>>> answers below:
>>> >     >     >
>>> >     >     > - I have a 4 supervisor nodes in my AWS cluster of
>>> m4.xlarge instances
>>> >     >     > (4 cores per node). On top of that I have 3 more nodes for
>>> zookeeper and
>>> >     >     > nimbus.
>>> >     >     > - 2 worker nodes per supervisor node
>>> >     >     > - The task number for each bolt ranges from 1 to 4 and I
>>> use 1:1 task to
>>> >     >     > executor assignment.
>>> >     >     > - The number of executors in total for the topology ranges
>>> from 14 to 41
>>> >     >     >
>>> >     >     > Thanks,
>>> >     >     > Nick
>>> >     >     >
>>> >     >     > 2015-09-02 15:42 GMT-04:00 Matthias J. Sax <
>>> mj...@apache.org <mailto:mj...@apache.org> <mailto:mj...@apache.org
>>> >     <mailto:mj...@apache.org>>
>>> >     >     > <mailto:mj...@apache.org <mailto:mj...@apache.org>
>>> >     <mailto:mj...@apache.org <mailto:mj...@apache.org>>>>:
>>> >     >     >
>>> >     >     >     Without any exception/error message it is hard to tell.
>>> >     >     >
>>> >     >     >     What is your cluster setup
>>> >     >     >       - Hardware, ie, number of cores per node?
>>> >     >     >       - How many node/supervisor are available?
>>> >     >     >       - Configured number of workers for the topology?
>>> >     >     >       - What is the number of task for each spout/bolt?
>>> >     >     >       - What is the number of executors for each
>>> spout/bolt?
>>> >     >     >
>>> >     >     >     -Matthias
>>> >     >     >
>>> >     >     >     On 09/02/2015 08:02 PM, Nick R. Katsipoulakis wrote:
>>> >     >     >     > Hello all,
>>> >     >     >     >
>>> >     >     >     > I am working on a project in which I submit a
>>> topology
>>> >     to my
>>> >     >     Storm
>>> >     >     >     > cluster, but for some reason, some of my tasks do not
>>> >     start
>>> >     >     executing.
>>> >     >     >     >
>>> >     >     >     > I can see that the above is happening because every
>>> >     bolt I have
>>> >     >     >     needs to
>>> >     >     >     > connect to an external server and do a registration
>>> to a
>>> >     >     service.
>>> >     >     >     > However, some of the bolts do not seem to connect.
>>> >     >     >     >
>>> >     >     >     > I have to say that the number of tasks I have is
>>> >     larger than the
>>> >     >     >     number
>>> >     >     >     > of workers of my cluster. Also, I check my worker log
>>> >     files,
>>> >     >     and I see
>>> >     >     >     > that the workers that do not register, are also not
>>> >     writing some
>>> >     >     >     > initialization messages I have them print in the
>>> >     beginning.
>>> >     >     >     >
>>> >     >     >     > Any idea why this is happening? Can it be because my
>>> >     >     resources are not
>>> >     >     >     > enough to start off all of the tasks?
>>> >     >     >     >
>>> >     >     >     > Thank you,
>>> >     >     >     > Nick
>>> >     >     >
>>> >     >     >
>>> >     >     >
>>> >     >     >
>>> >     >     > --
>>> >     >     > Nikolaos Romanos Katsipoulakis,
>>> >     >     > University of Pittsburgh, PhD candidate
>>> >     >
>>> >     >
>>> >     >
>>> >     >
>>> >     > --
>>> >     > Nikolaos Romanos Katsipoulakis,
>>> >     > University of Pittsburgh, PhD candidate
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Nikolaos Romanos Katsipoulakis,
>>> > University of Pittsburgh, PhD candidate
>>>
>>>
>>
>
>
> --
> Regards,
> Abhishek Agarwal
>
>

Re: Tasks are not starting

Reply via email to