I want to start a discussion about a new prioritization mechanism that 
addresses some of the issues that I believe exist in the current solution. 
These issues are:

 - Scheduling: No consideration is given to data priority when determining 
which component is given the next available thread with which to work
 - Constant sorting: Because all flowfiles in a given connection share the same 
PriorityQueue they must be sorted every time they move. While this sort is 
efficient it can add up as queues grow deep.
 - Administration: There is a costly human element to managing the value used 
as a priority ranking as priorities change. You must also ensure every 
connection in the appropriate flow has the proper prioritizer assigned to it to 
make use of the property.

We have developed a prototype of a new FlowFileQueue implementation that 
addresses these issues. Use of this implementation is controlled via 
nifi.properties so you can opt-in or out system-wide without doing a lot of 
configuration. Its design goals are:

  - Instead of using the value of a FlowFile attribute as a ranking, maintain a 
set of expression language rules to define your priorities. The highest ranked 
rule that a given FlowFile satisfies will be that FlowFile's priority
  - Because we have a finite set of priority rules we can utilize a bucket sort 
in our connections. One bucket per priority rule. The bucket/rule with which a 
FlowFile is associated with will be maintained so that as it moves through the 
system we do not have to re-evaluate that Flowfile against our ruleset unless 
we have reason to do so.
  - Control where in your flow FlowFiles are evaluated against the ruleset with 
a new Prioritizer implementation: BucketPrioritizer.
  - When this queue implementation is polled it will be able to check state to 
see if any data of a higher priority than what it currently contains recently 
(within 5s) moved elsewhere in the system. If higher priority data has recently 
moved elsewhere, the connection will only provide a FlowFile X% of the time 
where X is defined along with the rule. This allows higher priority data to 
have more frequent access to threads without thread-starving lower priority 
data.
  - Rules will be managed via a menu option for the flow and changes to them 
take effect instantly. This allows you to change your priorities without 
stopping/editing/restarting various components on the graph.

I intend to contribute this solution but first want to solicit input and 
opinions.

  - Jon Kessler

Reply via email to