[LANG] Rewrite StringEscapeUtils [Was: [jira] Commented: (LANG-505) Rewrite StringEscapeUtils]

Henri Yandell Sun, 17 May 2009 23:42:05 -0700

So this is more noticeable. I'm starting to play with a rewritten
StringEscapeUtils system. escapeSql is going to get deleted, the
others will sit on top of a modular system. At least that's the
proposal - more in the ticket.


First pass at the code was educational - second pass should be
discussable but thoughts welcome now if anyone has the urge.

Hen

On Sun, May 17, 2009 at 11:36 PM, Henri Yandell (JIRA) <j...@apache.org> wrote:
>
>    [ 
> https://issues.apache.org/jira/browse/LANG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710287#action_12710287
>  ]
>
> Henri Yandell commented on LANG-505:
> ------------------------------------
>
> A generic solution that's pluggable gets a bit interesting - do you assume 
> that a single unicode codepoint maps to 1..n unicode codepoints, or should 
> you be able to escape multiple codepoints? Unescaping is n codepoints to 1 
> codepoint. This makes it harder to have N plugins all working together. It 
> becomes a translator, unescaping and escaping the same algorithm.
>
> My first pass was on the 1..n. API is:
>
> {code:java}
>    // Note it's a FilterWriter, so 'out' is part of the class. Probably 
> change that and put Writer in the API.
>    public abstract boolean escape(int codepoint) throws IOException;
> {code}
>
> The boolean is returned to say whether that codepoint was successfully 
> translated or not. This is needed for AggregateEscapers to work and so 
> codepoint skipping can occur.
>
> It leads to a top level use of:
>
> {code:java}
>    public void escapeJava(String input, Writer out) throws IOException {
>        AggregateEscaper escapers = new AggregateEscaper(
>            new EscapeBasedOnLookup(out,
>                      new String[][] {
>                            {"\"", "\\\""},
>                            {"\\", "\\\\"}
>                      }),
>            new EscapeLowAsciiAsUnicode(out),
>            new EscapeNonAsciiAsUnicode(out)
>        );
>
>        escape(input, escapers);
>    }
> {code}
>
> It's easy to see from there how a user might modify what they view this to 
> be. They could write an escapeC quite easily etc. However - it's 1..n, and 
> unescaping isn't handled. The 1..n escape algorithm looks like:
>
> {code:java}
>    public void escape(String str, CharEscaper escaper) throws IOException {
>        if (str == null) {
>            return;
>        }
>        if (escaper == null) {
>            throw new IllegalArgumentException("The CharEscaper must not be 
> null. " +
>                                               "Use NullEscaper if you 
> expected this to mean a no-operation");
>        }
>        int sz = str.length();
>        for (int i = 0; i < sz; i++) {
>            int c = Character.codePointAt(str, i);
>            boolean success = escaper.escape(c);
>
>            // contract with escapers is that they have to understand 
> codepoints and they just took care of a surrogate pair
>            if(success && c >= 0x010000 && i < sz - 1) {
>                i++;
>            }
>        }
>    }
> {code}
>
> As I said - 1..n is the problem. Parsing needs to happen a character at a 
> time, and each Escaper needs to be offered the choice to take control of the 
> flow. Some options - thinking out loud:
>
> * Escapers need to be passed in the equivalent of a C pointer String. A 
> CharSequence would be good, but calling the subSequence method all the time 
> might not be performant. Probably best to pass down the index and the 
> CharSequence down. Making the whole thing String rather than char based - 
> probably better anyway as that saves having to think in terms of codepoints.
> * Escapers are asked whether they want to make a change first, then if they 
> return true they are called again to make the change and they return an index 
> increment for the driving loop to make based on the number of characters they 
> consumed.
>
> So:
>
> {code:java}
>    public abstract boolean isEscapable(int index, CharSequence input);
>    public abstract int escape(int index, CharSequence input, Writer out) 
> throws IOException;
> {code}
>
> Needs another code pass. Anyway - I think our current code is screwed up 
> enough to warrant a deeper implementation.
>
>> Rewrite StringEscapeUtils
>> -------------------------
>>
>>                 Key: LANG-505
>>                 URL: https://issues.apache.org/jira/browse/LANG-505
>>             Project: Commons Lang
>>          Issue Type: Task
>>            Reporter: Henri Yandell
>>             Fix For: 3.0
>>
>>
>> I think StringEscapeUtils needs a strong rewrite. For each escape method 
>> (and unescape) there tend to be three or four types of escaping happening. 
>> So not being able to define which set of three or four apply is a pain point 
>> (and cause of bug reports due to different desired features).
>> We should be offering basic functionality, but also allowing people to say 
>> "escape(Escapers.BASIC_XML, Escapers.LOW_UNICODE, Escapers.HIGH_UNICODE)".
>> Also should delete escapeSql; it's a bad one imo. Dangerous in that it will 
>> lead people to not use PreparedStatement and given it only escapes ', it's 
>> not much use. Especially as different dialects escape that in different ways.
>> Opening this ticket for discussion.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

[LANG] Rewrite StringEscapeUtils [Was: [jira] Commented: (LANG-505) Rewrite StringEscapeUtils]

Reply via email to