RE: [lang] Suggested alternatives for escape functions

Gary Gregory Tue, 10 Dec 2013 05:39:59 -0800

Peter,

Thank you for sharing with us.


Can you provide your code as a patch in a JIRA issue? Hopefully unit tests 
still pass and the 7F conversion will not break compatibility. You should also 
provide your benchmark code so others can duplicate your findings. Make sure 
you work off of trunk in SVN.

Thank you!
Gary

-------- Original message --------
From: Peter Wall <pw...@pwall.net> 
Date: 12/09/2013 23:14 (GMT-05:00)
To: dev@commons.apache.org 
Subject: [lang] Suggested alternatives for escape functions 

Hi, I'm new here, so please forgive me if I'm duplicating a previous 
discussion (I looked back through several months of archives for 
something related, before suffering a near-fatal attack of tl;dr).

I have a toolbox of functions that I have accumulated over the years 
and among them are "escape" functions for converting, for example, XML 
"&" to "&amp;" etc.  When I showed these to a colleague he asked why I 
didn't use the Apache Commons utilities, so I benchmarked my functions 
against the Commons versions and found that mine were approximately 10 
times faster.  At which point the same colleague suggested submitting my 
versions to Apache, so here goes.

The code in org.apache.commons.lang3.text.translate is very elegant in 
the way it uses the same code and the same initialisation character 
arrays for both the escape and the unescape functions, but this elegance 
comes at a cost.  The unescape will need to look up multi-character 
sequences, but the escape code will ALWAYS be looking up single 
characters, and this can be made much simpler than a string match.  And 
in my view the function should never allocate a new object until it 
finds that it needs to do so - in many cases the string will not need to 
be modified at all so the original string should be returned.

The escape function is:

     public static final String escape(String s, CharMapper mapper) {
         for (int i = 0, n = s.length(); i < n; ) {
             char ch = s.charAt(i++);
             String mapped = mapper.map(ch);
             if (mapped != null) {
                 // First character that needs escaping: only now allocate,
                 // and copy the unmodified prefix s[0 .. i-2].
                 StringBuilder sb = new StringBuilder();
                 for (int j = 0, k = i - 1; j < k; ++j)
                     sb.append(s.charAt(j));
                 sb.append(mapped);
                 // Map or copy the rest of the string.
                 while (i < n) {
                     ch = s.charAt(i++);
                     mapped = mapper.map(ch);
                     if (mapped != null)
                         sb.append(mapped);
                     else
                         sb.append(ch);
                 }
                 return sb.toString();
             }
         }
         // Nothing needed escaping, so return the original string.
         return s;
     }

Where CharMapper is:

     public interface CharMapper {
         String map(int codePoint);
     }

and the implementation for XML is:

     private static final CharMapper allCharMapper = new CharMapper() {
         @Override
         public String map(int codePoint) {
             if (codePoint == '<')
                 return "&lt;";
             if (codePoint == '>')
                 return "&gt;";
             if (codePoint == '&')
                 return "&amp;";
             if (codePoint == '"')
                 return "&quot;";
             if (codePoint == '\'')
                 return "&apos;";
             if ((codePoint < ' ' && !isWhiteSpace(codePoint)) ||
                     codePoint >= 0x7F) {
                 // isWhiteSpace checks for XML whitespace characters,
                 // \n \r etc.
                 StringBuilder sb = new StringBuilder(10);
                 sb.append("&#");
                 sb.append(codePoint);
                 sb.append(';');
                 return sb.toString();
             }
             return null;
         }
     };

The whole thing can be wrapped in a simple function like:

     public static String escapeAll(String s) {
         return escape(s, allCharMapper);
     }
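
For anyone who wants to try the pieces above together, here is a minimal self-contained sketch. The class name EscapeDemo and the body of isWhiteSpace are assumptions (the post does not show that helper); it is taken here to accept space, tab, LF and CR. The prefix copy uses StringBuilder.append(CharSequence, int, int) in place of the char-by-char loop, which should behave the same.

```java
final class EscapeDemo {

    interface CharMapper {
        String map(int codePoint);
    }

    // Assumption: this helper is not shown in the post. This version accepts
    // the XML whitespace characters space, tab, LF and CR.
    static boolean isWhiteSpace(int ch) {
        return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r';
    }

    static final CharMapper ALL_CHAR_MAPPER = new CharMapper() {
        @Override
        public String map(int codePoint) {
            switch (codePoint) {
                case '<':  return "&lt;";
                case '>':  return "&gt;";
                case '&':  return "&amp;";
                case '"':  return "&quot;";
                case '\'': return "&apos;";
                default:   break;
            }
            if ((codePoint < ' ' && !isWhiteSpace(codePoint)) || codePoint >= 0x7F)
                return "&#" + codePoint + ';';
            return null; // null means "leave this character alone"
        }
    };

    static String escape(String s, CharMapper mapper) {
        for (int i = 0, n = s.length(); i < n; i++) {
            String mapped = mapper.map(s.charAt(i));
            if (mapped == null)
                continue;
            // First character that needs escaping: allocate only now.
            StringBuilder sb = new StringBuilder(n + 16);
            sb.append(s, 0, i).append(mapped);
            for (int j = i + 1; j < n; j++) {
                char ch = s.charAt(j);
                mapped = mapper.map(ch);
                if (mapped != null)
                    sb.append(mapped);
                else
                    sb.append(ch);
            }
            return sb.toString();
        }
        return s; // nothing to escape, so no allocation at all
    }

    static String escapeAll(String s) {
        return escape(s, ALL_CHAR_MAPPER);
    }

    public static void main(String[] args) {
        String plain = "nothing to escape";
        System.out.println(escapeAll(plain) == plain);   // true: same instance
        System.out.println(escapeAll("a < b && \"c\"")); // a &lt; b &amp;&amp; &quot;c&quot;
        System.out.println(escapeAll("caf\u00e9"));      // caf&#233;
    }
}
```

The first line of output is the point of the design: a string with nothing to escape comes back as the same object, without any copying.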

I have versions for Java string escapes, XML, HTML (including the full 
range of entity names) and URI percent encoding, and I have versions 
that handle UTF-16 surrogate codes.  They all perform approximately an 
order of magnitude better than the existing Apache Commons functions.  
They are currently under LGPL and I have JUnit tests for all of them.
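
The surrogate-aware variants are not shown in this message. As a rough sketch of what one could look like (an assumption, not the code being offered), the loop can step by code point using String.codePointAt and Character.charCount, so that a character outside the BMP becomes one numeric reference instead of two. The class name and the inlined mapping are hypothetical, and the control-character check is left out for brevity.

```java
final class CodePointEscapeSketch {

    // Hypothetical mapping: the five XML specials plus everything >= 0x7F.
    static String map(int cp) {
        switch (cp) {
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '&':  return "&amp;";
            case '"':  return "&quot;";
            case '\'': return "&apos;";
            default:   return cp >= 0x7F ? "&#" + cp + ';' : null;
        }
    }

    static String escape(String s) {
        StringBuilder sb = null; // allocated only when first needed
        int n = s.length();
        for (int i = 0; i < n; ) {
            int cp = s.codePointAt(i);          // joins a valid surrogate pair
            int len = Character.charCount(cp);  // 1 or 2 chars consumed
            String mapped = map(cp);
            if (mapped != null) {
                if (sb == null)
                    sb = new StringBuilder(n + 16).append(s, 0, i);
                sb.append(mapped);
            } else if (sb != null) {
                sb.append(s, i, i + len);
            }
            i += len;
        }
        return sb == null ? s : sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a String as the surrogate pair D83D DE00.
        System.out.println(escape("x\uD83D\uDE00y")); // x&#128512;y
    }
}
```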

One thing to note is that my versions convert all characters at or above 
0x7F to numeric character references, thus sidestepping any concerns over 
UTF-8 or ISO-8859-1 character set encoding.
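
The encoding point can be checked with a small sketch. Here toNumericRefs is a hypothetical stand-in that only replaces characters at or above 0x7F, not the real escape function. Once those characters are gone the string is pure ASCII, so its bytes come out the same in UTF-8 and ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingCheck {

    // Hypothetical stand-in: replaces only chars >= 0x7F with numeric references.
    static String toNumericRefs(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= 0x7F)
                sb.append("&#").append((int) ch).append(';');
            else
                sb.append(ch);
        }
        return sb.toString();
    }

    // True when the string encodes to identical bytes in both charsets.
    static boolean sameBytes(String s) {
        return Arrays.equals(s.getBytes(StandardCharsets.UTF_8),
                             s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String raw = "na\u00efve";                         // contains U+00EF
        System.out.println(sameBytes(raw));                // false
        System.out.println(toNumericRefs(raw));            // na&#239;ve
        System.out.println(sameBytes(toNumericRefs(raw))); // true
    }
}
```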

Is anyone interested?

Regards,
Peter Wall


