Brandon DeVries created NIFI-3564:
-------------------------------------
Summary: Deadlock on startup
Key: NIFI-3564
URL: https://issues.apache.org/jira/browse/NIFI-3564
Project: Apache NiFi
Issue Type: Bug
Affects Versions: 1.1.1, 0.7.1
Reporter: Brandon DeVries
We have uncovered an issue in the way that ControllerServices and Processors
are started that can result in a deadlock. Basically, a ControllerService that
is reported by the framework as ENABLING might not actually be enabling yet.
This is because of how ControllerServices are scheduled to be started in
StandardControllerServiceNode.enable()[1]. This method changes the state from
DISABLED to ENABLING, and *then* actually schedules the OnEnabled method to be
called. However, it is scheduled with a ScheduledExecutorService that is
limited to 8 threads[2], and is *also used to start Processors*[3].
The situation that exposed the bug was a Processor that attempted to wait for a
ControllerService to become ENABLED in its customValidate() method. The
ControllerService must be at least in the ENABLING state to pass framework
validation, and since the ControllerService was necessary to do the custom
validation, waiting for it to become ENABLED seemed reasonable. However, there
were several (more than 8) instances of this custom Processor on the graph, and
the ControllerService being waited on was one of dozens. This led to the
situation where all 8 of the executor threads were held by our Processor's
customValidate() method waiting for a service that will never transition from
ENABLING to ENABLED because to do so it needs one of those same 8 threads.
This deadlocks the instance, preventing startup.
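The starvation described above can be reproduced outside NiFi with a small sketch. This is not NiFi code: the class name PoolStarvationSketch, the method enablesWithin, and the latch standing in for the ENABLED state are all hypothetical, and the submission order is chosen to force the situation. The waiting tasks play the role of customValidate(), and the last task plays the role of the queued OnEnabled call. A timeout is used so the demonstration returns instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone illustration; none of these names exist in NiFi.
class PoolStarvationSketch {

    /** Returns true if the simulated service reached ENABLED within timeoutMs. */
    static boolean enablesWithin(int poolSize, int waiters, long timeoutMs) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(poolSize);
        CountDownLatch enabled = new CountDownLatch(1); // stands in for the ENABLED state
        try {
            // Stand-ins for customValidate(): each one blocks a pool thread
            // until the service is ENABLED.
            for (int i = 0; i < waiters; i++) {
                pool.execute(() -> {
                    try {
                        enabled.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Stand-in for OnEnabled: the state already reads ENABLING, but the
            // work that completes the transition is only queued on the same pool.
            pool.execute(enabled::countDown);
            return enabled.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupts any waiters that are still blocked
        }
    }

    public static void main(String[] args) {
        System.out.println("8 threads, 8 waiters: enabled=" + enablesWithin(8, 8, 500));
        System.out.println("8 threads, 7 waiters: enabled=" + enablesWithin(8, 7, 500));
    }
}
```

With 8 waiters on 8 threads the enabling task never gets a thread, so the first call should time out; with 7 waiters one thread stays free and the second call should succeed.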
My first thought as to a fix was to not set the ENABLING state until the
OnEnabled method was actually being called (as opposed to scheduled to be
called). However, this could result in a Processor attempting to start with a
dependent ControllerService in a DISABLED state (even though the
ControllerService will eventually be ENABLED), which would cause the processor
to not start[4] (as opposed to being retried, as is the case when OnScheduled
throws an Exception). My feeling is that ultimately we're going to need to
wait for all ControllerServices to be ENABLED before moving on to Processors,
possibly using schedule(Callable) instead of execute(Runnable).
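One way the "enable everything first" idea might look is sketched below. It is only an illustration under stated assumptions, not a proposed patch: EnableThenStartSketch, startAll, and the event strings are hypothetical, and the real framework would need to handle failures and retries. The point is that schedule(Callable) returns a future the caller can block on, whereas execute(Runnable) returns nothing. The sketch also assumes the blocking happens on the calling thread, not on one of the pool's own threads, so the wait cannot consume a worker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-alone illustration; none of these names exist in NiFi.
class EnableThenStartSketch {

    /** Enables all services, waits for them, then starts processors. Returns the event log. */
    static List<String> startAll(int poolSize, int services, int processors) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(poolSize);
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        try {
            // Phase 1: schedule every enable as a Callable and keep its future.
            List<Future<String>> enabling = new ArrayList<>();
            for (int i = 0; i < services; i++) {
                final String name = "service-" + i;
                Callable<String> enable = () -> {
                    events.add("enabled " + name);
                    return name;
                };
                enabling.add(pool.schedule(enable, 0, TimeUnit.MILLISECONDS));
            }
            // Block on the caller's thread until every service is enabled.
            for (Future<String> f : enabling) {
                f.get(5, TimeUnit.SECONDS);
            }
            // Phase 2: only now hand processors to the same pool.
            List<Future<String>> starting = new ArrayList<>();
            for (int i = 0; i < processors; i++) {
                final String name = "processor-" + i;
                Callable<String> start = () -> {
                    events.add("started " + name);
                    return name;
                };
                starting.add(pool.schedule(start, 0, TimeUnit.MILLISECONDS));
            }
            for (Future<String> f : starting) {
                f.get(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return events;
    }

    public static void main(String[] args) {
        // More services and processors than threads, as in the reported flow.
        startAll(8, 20, 12).forEach(System.out::println);
    }
}
```

Because no processor task is submitted until every enable future has completed, a processor that waits on a service in this sketch would find it already enabled, regardless of how small the pool is.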
[1]
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/service/StandardControllerServiceNode.java#L299-L304
[2]
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/scheduling/StandardProcessScheduler.java#L83
[3]
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/StandardProcessorNode.java#L1219-L1228
[4]
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/StandardProcessorNode.java#L1221-L1223