On 5 March 2013 22:28, Lawrence Angrave <angr...@illinois.edu> wrote: > Hi,
Thanks for the feedback. > Some comments that are relevant to Apache3 UnicodeEscaper and Apache2's > StringEscapeUtils.java > Summary- > > * I noticed the current Apache code creates three String objects each > time it writes a unicode hexadecimal value. > * Apache3 can also create a char[] array per character translation > (but I do include a fix for that) > * This is a easy-to-fix performance bottleneck when writing many > non-ascii characters. > * The logic to test for unicode values of different magnitudes can > also be simplified. > * Benchmark and code fixes for Apache2 and Apache 3 are included. I do > not have time to become an Apache maintainer. use or ignore at your > choice. > * I'm not interested in being a developer for Commons Lang Use it or > not - that's a choice for Commons Lang developers. > > A simple fix more than doubles the string escape speed (40 ms v 100ms to > translate all unicode characters) for Apache3. > The older Apache2-style implementation can now translate all unicode > characters in 8ms. > > The existing Apache3/Apache2 write unicode hex values like this- > if (codepoint > 0xfff) { > out.write("\\u" + hex(codepoint)); > } else if (codepoint > 0xff) { > out.write("\\u0" + hex(codepoint)); > } else if (codepoint > 0xf) { > out.write("\\u00" + hex(codepoint)); > } else { > out.write("\\u000" + hex(codepoint)); > } > The hex() function, > //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH); > also creates two string objects, so we have 3 objects per unicode hex value. > > FIX: > > The padding logic can be simplified and per-character object creation can be > eliminated by writing hex digits directly > out.write("\\u"); > out.write(HEX_DIGIT[(codepoint >> 12) & 15]); > out.write(HEX_DIGIT[(codepoint >> 8) & 15]); > out.write(HEX_DIGIT[(codepoint >> 4) & 15]); > out.write(HEX_DIGIT[(codepoint) & 15]); > > > where HEX_DIGIT is > public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray(); > I believe this is safe for all Locales. > > > When benchmarked it was disconcerting that Apache3 is still five times > slower (40ms instead of 8ms) than my rewritten Apache2 version (included > below). > My guess is that there are other unnecessary per-character object creation > issues still lurking Here's one example - > CharSequenceTranslator.translate(CharSequence input, Writer out) : > char[] c = *Character.toChars*(Character.codePointAt(input, pos)) > > For better performance this should use toChars(int codePoint, char[] dst, > int dstIndex) , which can re-use the dst char array > > > > The benchmark, my version of a Apache2-style escapeJavaStyleString > implementation and the code fix for UnicodeEscaper.java are included below. > If you do find this useful, some source-code attribution would be > appreciated. > Legal: I, Lawrence Angrave (email address, angr...@illinois.edu > <mailto:angr...@illinois.edu>), am the author and copyright holder of the > new code provided in this email, and share and re-license this code under > the current Apache2 license. > > I hope this email does not go into a blackhole... Feel free to forward it to > the relevant maintainers. This is the correct list for the maintainers, however bug reports and patches are normally handled via JIRA, in this case https://issues.apache.org/jira/browse/LANG It's quite difficult to keep track of patches on mailing lists, and easy for them to get forgotten or overlooked. > Regards, > Lawrence. > > public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray(); > public static final char[] CONTROL_CHARS; // non-zero entries for > special case control characters > static { > CONTROL_CHARS = new char[32]; > CONTROL_CHARS['\b'] = 'b'; > CONTROL_CHARS['\n'] = 'n'; > CONTROL_CHARS['\t'] = 't'; > CONTROL_CHARS['\f'] = 'f'; > CONTROL_CHARS['\r'] = 'r'; > } > public static void escapeJavaStyleString(Writer out, String s, boolean > escapeSingleQuote) throws IOException { > // Apache2 makes the following checks, so we will too- > if(out==null) throw new IllegalArgumentException("The Writer must not be > null"); > if(s == null) return; > > final int len = s.length(); > for(int i =0; i < len;i++) > escapeChar(out,s.charAt(i), escapeSingleQuote); > } > public static void escapeChar(Writer out, char c, boolean > escapeSingleQuote) > throws IOException { > // Most common case > if (c >= 32 && c < 127) { > if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote)) > out.write('\\'); > out.write(c); > return; > } > out.write('\\'); > if (c < 32 && CONTROL_CHARS[c] != 0) { > out.write(CONTROL_CHARS[c]); > return; > } > // Fast 4 digit uppercase hexadecimal without object creation > out.write('u'); > out.write(HEX_DIGIT[(c >> 12) & 15]); > out.write(HEX_DIGIT[(c >> 8) & 15]); > out.write(HEX_DIGIT[(c >> 4) & 15]); > out.write(HEX_DIGIT[(c) & 15]); > } > > > > > FYI The benchmark test just writes all possible unicode characters into a > null writer: > Writer nullWriter = new Writer() { > public void write(String s) { > }; > > public void write(int c) { > } > > public void close() throws IOException { > } > > public void flush() throws IOException { > } > > public void write(char[] cbuf, int off, int len) throws > IOException { > } > }; > StringBuilder sb = new StringBuilder(0x10000); > for (int i = 0; i <= 0xffff; i++) > sb.append((char) i); > String allChars = sb.toString(); > > long t1 = System.currentTimeMillis(); > StringEscaper.escapeJavaStyleString(nullWriter, allChars, true); > long t2 = System.currentTimeMillis(); > System.out.println(t2 - t1); > > long t3 = System.currentTimeMillis(); > CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA; > translator.translate(allChars, nullWriter); > long t4 = System.currentTimeMillis(); > System.out.println(t4 - t3); > > > > The modification to Apache3 UnicodeEscaper : > > if (codepoint > 0xffff) { > // TODO: Figure out what to do. Output as two Unicodes? > // Does this make this a Java-specific output class? > out.write("\\u" + hex(codepoint)); > } else if (1 == 0) { //*OLD SLOW CODE* (can be removed) > *if (codepoint > 0xfff) { > out.write("\\u" + hex(codepoint)); > } else if (codepoint > 0xff) { > out.write("\\u0" + hex(codepoint)); > } else if (codepoint > 0xf) { > out.write("\\u00" + hex(codepoint)); > } else { > out.write("\\u000" + hex(codepoint)); > }* > } else { // *NEW FAST CODE* > * out.write("\\u"); > out.write(HEX_DIGIT[(codepoint >> 12) & 15]); > out.write(HEX_DIGIT[(codepoint >> 8) & 15]); > out.write(HEX_DIGIT[(codepoint >> 4) & 15]); > out.write(HEX_DIGIT[(codepoint) & 15]);* > } > > *and add public static final char[] HEX_DIGIT = > "0123456789ABCDEF".toCharArray();** > * > > ps. > The home page for Commons lang > http://commons.apache.org/proper/commons-lang/) > has the following 404 link on the left > Contributing Patches > http://commons.apache.org/proper/patches.html > (I believe the 'proper' is unnecessary) > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org