DO NOT REPLY [Bug 19187] New: - ReplaceRegExp cannot handle multi-byte encodings

bugzilla 21 Apr 2003 07:45:43 -0000

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19187>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19187

           Summary: ReplaceRegExp cannot handle multi-byte encodings
           Product: Ant
           Version: 1.5.1
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Optional Tasks
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


The ReplaceRegExp task throws IndexOutOfBoundsException when it processes a
file stored in a multi-byte encoding.

java.lang.IndexOutOfBoundsException
        at java.io.BufferedReader.read(BufferedReader.java:256)
        at org.apache.tools.ant.taskdefs.optional.ReplaceRegExp.doReplace(ReplaceRegExp.java:404)
        at org.apache.tools.ant.taskdefs.optional.ReplaceRegExp.execute(ReplaceRegExp.java:491)
        at org.apache.tools.ant.Task.perform(Task.java:319)
        at org.apache.tools.ant.Target.execute(Target.java:309)
        at org.apache.tools.ant.Target.performTasks(Target.java:336)
        at org.apache.tools.ant.Project.executeTarget(Project.java:1306)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1250)
        at org.apache.tools.ant.Main.runBuild(Main.java:610)
        at org.apache.tools.ant.Main.start(Main.java:196)
        at org.apache.tools.ant.Main.main(Main.java:235)

The task was:
    <replaceregexp flags="g" file="regtst">
        <regexp pattern="((Header:\s+\S+|Revision)\s+\S+\s+\S+\s+\S+)\s+(\w+)"/>
        <substitution expression="\1"/>
    </replaceregexp>

The root cause seems to be the assumption that the length of the file is the
same as the number of characters in the file.  This assumption fails for
multi-byte encodings.  ReplaceRegExp.java lines 398 to 406 are: 
                int flen = (int) f.length();
                char tmpBuf[] = new char[flen]; 
                int numread = 0;
                int totread = 0; 

                while (numread != -1 && totread < flen) {
                    numread = br.read(tmpBuf, totread, flen);
                    totread += numread; 
                }
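Here flen is the number of bytes in the file, but it is misused as the number
of characters.  When the file decodes to fewer characters than bytes, the first
read returns fewer than flen characters, so the loop runs again.  The second
call passes a non-zero offset (totread) together with a length of flen, which
runs past the end of tmpBuf, and BufferedReader.read throws
IndexOutOfBoundsException.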
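
One way to avoid the assumption is to stop sizing the buffer from the file
length and to read until end of stream, decoding with an explicit charset.
The following is only a minimal sketch, not the actual Ant fix: the class
name ReadAllChars and the helper readFully are hypothetical, and the encoding
argument stands in for the task parameter that does not exist in 1.5.1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadAllChars {
    // Hypothetical helper, not part of Ant: decodes the whole stream with the
    // given charset without assuming one byte per character.
    public static String readFully(InputStream in, Charset encoding) throws IOException {
        Reader reader = new InputStreamReader(in, encoding);
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        int numread;
        // Always read into the start of a fixed-size buffer, so offset + length
        // never exceeds buf.length, and stop only when read() reports -1.
        while ((numread = reader.read(buf, 0, buf.length)) != -1) {
            text.append(buf, 0, numread);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two accented letters take two bytes each in UTF-8.
        byte[] bytes = "Revision \u00e9\u00e8 1.2".getBytes(StandardCharsets.UTF_8);
        String text = readFully(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, " + text.length() + " chars");
        // prints "17 bytes, 15 chars"
    }
}
```

The sample input shows the mismatch behind the bug: the byte count (what
f.length() reports) is larger than the character count the reader delivers.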

Related symptom: if you use a fileset, you don't get the full stacktrace, only a
summary:
[replaceregexp] An error occurred processing file:
'/home/jdb/projects/foo/regtst': java.lang.IndexOutOfBoundsException

Workaround:
Setting byline="true" uses a different block of code and avoids the exception.
(But it's still apt to munge your encoding.)

Suggested enhancement:
add a file encoding parameter to the task.

Sorry I don't have time to fix this right now.
