[lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Lawrence Angrave Wed, 06 Mar 2013 06:09:26 -0800

Hi,

Some comments that are relevant to Apache3 UnicodeEscaper and Apache2'sStringEscapeUtils.java

Summary-


 * I noticed the current Apache code creates three String objects each
   time it writes a unicode hexadecimal  value.
 * Apache3 can also create a char[] array per character translation
   (but I do include a fix for that)
 * This is a easy-to-fix performance bottleneck when writing many
   non-ascii characters.
 * The logic to test for unicode values of different magnitudes can
   also be simplified.
 * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
   not have time to become an Apache maintainer. use or ignore at your
   choice.
 * I'm not interested in being a developer for Commons Lang  Use it or
   not  - that's a choice for Commons Lang developers.

A simple fix more than doubles the string escape speed (40 ms v 100ms totranslate all unicode characters) for Apache3.The older Apache2-style implementation can now translate all unicodecharacters in 8ms.


The existing Apache3/Apache2 write unicode hex values like this-
        if (codepoint > 0xfff) {
                out.write("\\u" + hex(codepoint));
            } else if (codepoint > 0xff) {
                out.write("\\u0" + hex(codepoint));
            } else if (codepoint > 0xf) {
                out.write("\\u00" + hex(codepoint));
            } else {
                out.write("\\u000" + hex(codepoint));
            }
The hex() function,
//hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
also creates two string objects, so we have 3 objects per unicode hex value.

FIX:

The padding logic can be simplified and per-character object creationcan be eliminated by writing hex digits directly

            out.write("\\u");
            out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
            out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
            out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
            out.write(HEX_DIGIT[(codepoint) & 15]);


where  HEX_DIGIT  is
public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
I believe this is safe for all Locales.

When benchmarked it was disconcerting that Apache3 is still five timesslower (40ms instead of 8ms) than my rewritten Apache2 version (includedbelow).My guess is that there are other unnecessary per-character objectcreation issues still lurking Here's one example -

CharSequenceTranslator.translate(CharSequence input, Writer out) :
       char[] c = *Character.toChars*(Character.codePointAt(input, pos))

For better performance this should use toChars(int codePoint, char[]dst, int dstIndex) , which can re-use the dst char array

The benchmark, my version of a Apache2-style escapeJavaStyleStringimplementation and the code fix for UnicodeEscaper.java are included below.If you do find this useful, some source-code attribution would beappreciated.Legal: I, Lawrence Angrave (email address, [email protected]<mailto:[email protected]>), am the author and copyright holder ofthe new code provided in this email, and share and re-license this codeunder the current Apache2 license.

I hope this email does not go into a blackhole... Feel free to forwardit to the relevant maintainers.


Regards,
Lawrence.

public static final char[] HEX_DIGIT ="0123456789ABCDEF".toCharArray();public static final char[] CONTROL_CHARS; // non-zero entries forspecial case control characters

    static {
        CONTROL_CHARS = new char[32];
        CONTROL_CHARS['\b'] = 'b';
        CONTROL_CHARS['\n'] = 'n';
        CONTROL_CHARS['\t'] = 't';
        CONTROL_CHARS['\f'] = 'f';
        CONTROL_CHARS['\r'] = 'r';
    }

public static void escapeJavaStyleString(Writer out, String s,boolean escapeSingleQuote) throws IOException {

// Apache2 makes the following checks, so we will too-

if(out==null) throw new IllegalArgumentException("The Writer mustnot be null");

        if(s == null) return;

        final int len = s.length();
        for(int i =0; i < len;i++)
            escapeChar(out,s.charAt(i), escapeSingleQuote);
    }

public static void escapeChar(Writer out, char c, booleanescapeSingleQuote)

            throws IOException {
        // Most common case
        if (c >= 32 && c < 127) {
            if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
                out.write('\\');
            out.write(c);
            return;
        }
        out.write('\\');
        if (c < 32 && CONTROL_CHARS[c] != 0) {
            out.write(CONTROL_CHARS[c]);
            return;
        }
        // Fast 4 digit uppercase hexadecimal without object creation
        out.write('u');
        out.write(HEX_DIGIT[(c >> 12) & 15]);
        out.write(HEX_DIGIT[(c >> 8) & 15]);
        out.write(HEX_DIGIT[(c >> 4) & 15]);
        out.write(HEX_DIGIT[(c) & 15]);
    }

FYI The benchmark test just writes all possible unicode characters intoa null writer:

            Writer nullWriter = new Writer() {
            public void write(String s) {
            };

            public void write(int c) {
            }

            public void close() throws IOException {
            }

            public void flush() throws IOException {
            }

public void write(char[] cbuf, int off, int len) throwsIOException {

            }
        };
        StringBuilder sb = new StringBuilder(0x10000);
        for (int i = 0; i <= 0xffff; i++)
            sb.append((char) i);
        String allChars = sb.toString();

        long t1 = System.currentTimeMillis();
        StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
        long t2 = System.currentTimeMillis();
        System.out.println(t2 - t1);

        long t3 = System.currentTimeMillis();
        CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
        translator.translate(allChars, nullWriter);
        long t4 = System.currentTimeMillis();
        System.out.println(t4 - t3);



The modification to Apache3 UnicodeEscaper :

        if (codepoint > 0xffff) {
            // TODO: Figure out what to do. Output as two Unicodes?
            // Does this make this a Java-specific output class?
            out.write("\\u" + hex(codepoint));
        } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
*if (codepoint > 0xfff) {
                out.write("\\u" + hex(codepoint));
            } else if (codepoint > 0xff) {
                out.write("\\u0" + hex(codepoint));
            } else if (codepoint > 0xf) {
                out.write("\\u00" + hex(codepoint));
            } else {
                out.write("\\u000" + hex(codepoint));
            }*
        } else { // *NEW FAST CODE*
*            out.write("\\u");
            out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
            out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
            out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
            out.write(HEX_DIGIT[(codepoint) & 15]);*
        }

*and add public static final char[] HEX_DIGIT ="0123456789ABCDEF".toCharArray();**

*

ps.
The home page for Commons lang
http://commons.apache.org/proper/commons-lang/)
 has the following 404 link on the left
    Contributing Patches
http://commons.apache.org/proper/patches.html
(I believe the 'proper' is unnecessary)

[lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Reply via email to