Sergei Zhirikov created NIFI-5689:
-------------------------------------

             Summary: ReplaceText does not handle end of line correctly on buffer boundary
                 Key: NIFI-5689
                 URL: https://issues.apache.org/jira/browse/NIFI-5689
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 1.7.1
            Reporter: Sergei Zhirikov
         Attachments: Text_Parsing_Bug.xml

ReplaceText appears to misbehave under the following conditions:
 * The input flow file contains text with Windows-style line endings (CR-LF).
 * ReplaceText is configured to perform "Regex Replace" in "Line-by-Line" mode.
 * The "Maximum Buffer Size" is set to a value smaller than the whole file
   content, but large enough to fit any single line of text in the file.
 * A CR-LF pair in one of the lines happens to be split across two buffers;
   that is, CR is the last character in one buffer and LF is the first
   character in the following one.
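
To make the last condition concrete, here is a minimal Python sketch (not
NiFi code) that reports where a CR-LF pair straddles a buffer boundary. It
assumes the input is consumed as consecutive fixed-size chunks, which is a
simplification of how "Maximum Buffer Size" may actually be applied. The
function name and the sample text are hypothetical; only the "GHI" line is
taken from the attached example.

```python
def split_crlf_offsets(data: bytes, buffer_size: int) -> list:
    """Return offsets of CR bytes that end one buffer while their LF starts the next."""
    hits = []
    # Visit every boundary between consecutive buffer_size-byte chunks.
    for boundary in range(buffer_size, len(data), buffer_size):
        last_of_buffer = data[boundary - 1:boundary]
        first_of_next = data[boundary:boundary + 1]
        if last_of_buffer == b"\r" and first_of_next == b"\n":
            hits.append(boundary - 1)
    return hits


# Hypothetical sample: four 6-byte lines with trailing space and CR-LF endings.
sample = b"ABC \r\nDEF \r\nGHI \r\nJKL \r\n"

print(split_crlf_offsets(sample, 17))  # the CR of the "GHI" line is at offset 16
print(split_crlf_offsets(sample, 6))   # boundaries fall between lines, so no split
```

With a 17-byte buffer the first chunk ends right after the CR of the third
line, which is the situation described in the last bullet.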

An example flow template is attached to illustrate the problem.

In the example, the regular expression is intended to remove whitespace at the
end of each line. It operates as expected on all lines except the third one
(containing "GHI"). That line satisfies the conditions described above. As a
result, the CR character at the end of the line is removed, which does not
happen on the other lines.
In some more complicated cases both CR and LF end up being removed, effectively
joining two lines into one, although I haven't managed to create a simple test
case to reproduce that.
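
For illustration only, the following Python sketch shows one way a chunked
line reader could keep a CR-LF pair together when it is split across
buffers: a trailing CR (or any unterminated tail) is held back and joined
with the next chunk before lines are emitted. This is an assumption about a
possible approach, not a description of how ReplaceText reads lines or how
the issue should be fixed in NiFi. The name `iter_lines` and the sample
text are hypothetical.

```python
from typing import Iterable, Iterator


def iter_lines(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield lines (terminators included) from chunks without splitting CR-LF."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # bytes.splitlines treats \n, \r and \r\n as line boundaries.
        parts = pending.splitlines(keepends=True)
        pending = b""
        # A last piece that does not end in LF is either unterminated or ends
        # in a lone CR that may be the first half of a CR-LF pair, so hold it.
        if parts and not parts[-1].endswith(b"\n"):
            pending = parts.pop()
        yield from parts
    if pending:
        # Whatever is left at end of input is emitted as the final line.
        yield pending


# Hypothetical sample, cut so that CR ends the first chunk and LF starts the second.
sample = b"ABC \r\nDEF \r\nGHI \r\nJKL \r\n"
chunks = [sample[:17], sample[17:]]

for line in iter_lines(chunks):
    print(repr(line))
```

Each emitted line keeps its full CR-LF terminator, including the "GHI" line
whose terminator was split between the two chunks.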



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)