Brandon DeVries created NIFI-3564:
-------------------------------------

             Summary: Deadlock on startup
                 Key: NIFI-3564
                 URL: https://issues.apache.org/jira/browse/NIFI-3564
             Project: Apache NiFi
          Issue Type: Bug
    Affects Versions: 1.1.1, 0.7.1
            Reporter: Brandon DeVries


We have uncovered an issue in the way that ControllerServices and Processors 
are started that can result in a deadlock.  Basically, a ControllerService that 
is reported by the framework as ENABLING might not actually have begun 
enabling.  This is because of how services are scheduled to be started in 
StandardControllerServiceNode.enable()[1].  That method changes the state from 
DISABLED to ENABLING, and *then* schedules the OnEnabled method to be called.  
However, it is scheduled with a ScheduledExecutorService that is limited to 8 
threads[2], and is *also used to start Processors*[3].

The situation that exposed the bug was a Processor that attempted to wait for a 
ControllerService to become ENABLED in its customValidate() method.  The 
ControllerService must be at least in the ENABLING state to pass framework 
validation, and since the ControllerService was necessary to do the custom 
validation, waiting for it to become ENABLED seemed reasonable.  However, there 
were several (more than 8) instances of this custom Processor on the graph, and 
the ControllerService being waited on was one of dozens.  This led to the 
situation where all 8 of the executor threads were held by our Processor's 
customValidate() method waiting for a service that will never transition from 
ENABLING to ENABLED because to do so it needs one of those same 8 threads.  
This deadlocks the instance, preventing startup.
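
The starvation described above can be reproduced outside NiFi with a small 
sketch.  This is not NiFi code: the class and method names are hypothetical, 
the pool is scaled down from 8 threads to 2, and the waiters use a bounded 
wait so that the demo terminates instead of hanging as the real instance did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the shared executor; not taken from NiFi.
public class PoolStarvationSketch {

    // Returns how many waiters observed the "service" as ENABLED before
    // their bounded wait expired.
    public static int waitersThatSawEnabled(int poolSize, int waiters, long waitMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch enabled = new CountDownLatch(1); // opens once the service is ENABLED
        AtomicInteger sawEnabled = new AtomicInteger();

        // Stand-ins for Processors whose validation blocks on the service.
        for (int i = 0; i < waiters; i++) {
            pool.execute(() -> {
                try {
                    if (enabled.await(waitMillis, TimeUnit.MILLISECONDS)) {
                        sawEnabled.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Stand-in for the scheduled OnEnabled call.  It needs a thread from
        // the same pool, so it sits in the queue while every thread is blocked.
        pool.execute(enabled::countDown);

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return sawEnabled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Waiters fill the pool: the enable task cannot run until they give up.
        System.out.println(waitersThatSawEnabled(2, 2, 300));
        // One thread stays free: the enable task runs and the waiter proceeds.
        System.out.println(waitersThatSawEnabled(2, 1, 300));
    }
}
```

With an unbounded wait in place of the timeout, the first call would never 
return, which is the behavior we saw at startup.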

My first thought as to a fix was to not set the ENABLING state until the 
OnEnabled method was actually being called (as opposed to scheduled to be 
called).  However, this could result in a Processor attempting to start with a 
dependent ControllerService in a DISABLED state (even though the 
ControllerService will eventually be ENABLED), which would cause the Processor 
to not start[4] (as opposed to being retried, as is the case when OnScheduled 
throws an Exception).  My feeling is that ultimately we're going to need to 
wait for all ControllerServices to be ENABLED before moving on to Processors, 
possibly using schedule(Callable) instead of execute(Runnable).  
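
A rough sketch of that ordering, under stated assumptions: the names below are 
hypothetical and none of this is the framework's actual code.  It only shows 
that schedule(Callable, delay, unit) returns a Future the caller can block on 
from its own thread, so that no Processor is submitted until every service has 
finished enabling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of "enable all services, then start processors".
public class EnableThenStartSketch {

    public static List<String> startup(int poolSize, int services, int processors)
            throws InterruptedException, ExecutionException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(poolSize);
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        try {
            // Phase 1: schedule every enable as a Callable and keep its Future.
            List<Future<String>> enabling = new ArrayList<>();
            for (int i = 0; i < services; i++) {
                final String name = "service-" + i;
                Callable<String> enable = () -> {
                    events.add("ENABLED " + name);
                    return name;
                };
                enabling.add(pool.schedule(enable, 0, TimeUnit.MILLISECONDS));
            }
            // Block on the caller's thread, not on a pool thread, so the
            // wait itself cannot starve the pool.
            for (Future<String> f : enabling) {
                f.get();
            }

            // Phase 2: only now hand Processors to the same pool.
            List<Future<String>> starting = new ArrayList<>();
            for (int i = 0; i < processors; i++) {
                final String name = "processor-" + i;
                Callable<String> start = () -> {
                    events.add("STARTED " + name);
                    return name;
                };
                starting.add(pool.schedule(start, 0, TimeUnit.MILLISECONDS));
            }
            for (Future<String> f : starting) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        startup(2, 3, 4).forEach(System.out::println);
    }
}
```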


[1] 
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/service/StandardControllerServiceNode.java#L299-L304
[2] 
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/scheduling/StandardProcessScheduler.java#L83
[3] 
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/StandardProcessorNode.java#L1219-L1228
[4] 
https://github.com/apache/nifi/blob/rel/nifi-0.7.1/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/StandardProcessorNode.java#L1221-L1223



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)