Eric Secules created NIFI-8107:
----------------------------------

             Summary: ExtractText Should Search Entire FlowFile Using Streaming
                 Key: NIFI-8107
                 URL: https://issues.apache.org/jira/browse/NIFI-8107
             Project: Apache NiFi
          Issue Type: New Feature
            Reporter: Eric Secules
            Assignee: Eric Secules


ExtractText should be improved so that it scans the entire content of the 
flowfile for matches, reading it in chunks of MAX_BUFFER_SIZE that overlap by 
MAX_CAPTURE_GROUP_LENGTH. That way pattern extraction works on files of 
arbitrary size while memory consumption stays bounded.

Consider the use case where I want to extract a small pattern of maybe 
100 bytes from files that could be anywhere from 1MB to 500MB. Looking at the 
ExtractText source code, it always allocates a byte array of the maximum 
buffer size, so setting that parameter high enough to cover the largest file 
is probably not appropriate. The chunks must overlap by the maximum length of 
the capture group because a match may straddle two chunks. For the same 
reason, it is not advisable to split the flowfile into chunks of 
MAX_BUFFER_SIZE using existing processors.
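
The overlapping-chunk scan described above could look roughly like the sketch 
below. It is not the ExtractText implementation: the class name 
ChunkedRegexScanner, the method findAll, and the parameters maxBufferSize and 
maxMatchLength are hypothetical stand-ins for MAX_BUFFER_SIZE and 
MAX_CAPTURE_GROUP_LENGTH. It assumes that no match is longer than 
maxMatchLength bytes, that maxBufferSize is larger than maxMatchLength, and 
that the content can be decoded one byte per character (ISO-8859-1) so that 
string indices equal byte offsets.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkedRegexScanner {

    /**
     * Returns every match of pattern in the stream while holding at most
     * maxBufferSize bytes of content in the buffer. Hypothetical helper:
     * assumes no match exceeds maxMatchLength bytes.
     */
    public static List<String> findAll(InputStream in, Pattern pattern,
                                       int maxBufferSize, int maxMatchLength)
            throws IOException {
        if (maxMatchLength < 1 || maxBufferSize <= maxMatchLength) {
            throw new IllegalArgumentException(
                    "need maxBufferSize > maxMatchLength >= 1");
        }
        List<String> matches = new ArrayList<>();
        byte[] buffer = new byte[maxBufferSize];
        int carried = 0;       // bytes kept from the tail of the previous chunk
        boolean eof = false;

        while (!eof) {
            // Fill the buffer behind the carried-over tail.
            int len = carried;
            while (len < buffer.length) {
                int n = in.read(buffer, len, buffer.length - len);
                if (n < 0) {
                    eof = true;
                    break;
                }
                len += n;
            }

            // Assumption: ISO-8859-1 maps each byte to exactly one char.
            String text = new String(buffer, 0, len, StandardCharsets.ISO_8859_1);

            // A match starting in the last maxMatchLength bytes may be cut off
            // by the chunk boundary, so defer it unless this is the last chunk.
            int limit = eof ? len : len - maxMatchLength;
            int keepFrom = Math.max(limit, 0);

            Matcher m = pattern.matcher(text);
            while (m.find() && (eof || m.start() < limit)) {
                matches.add(m.group());
                // Never re-scan bytes that belong to an accepted match.
                keepFrom = Math.max(keepFrom, m.end());
            }

            // Move the unscanned tail to the front; it becomes the overlap.
            carried = len - keepFrom;
            System.arraycopy(buffer, keepFrom, buffer, 0, carried);
        }
        return matches;
    }
}
```

For example, calling findAll on a ByteArrayInputStream with 
Pattern.compile("ID-\\d{4}"), a maxBufferSize of 16 and a maxMatchLength of 7 
still finds a match that straddles a 16-byte boundary, because the tail of 
each chunk is carried into the next one. An unbounded pattern such as \d+ 
would break the stated length assumption and could be reported truncated.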



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
