So this is more noticeable. I'm starting to play with a rewritten StringEscapeUtils system. escapeSql is going to be deleted; the others will sit on top of a modular system. At least that's the proposal - more in the ticket.
First pass at the code was educational - second pass should be discussable but thoughts welcome now if anyone has the urge.

Hen

On Sun, May 17, 2009 at 11:36 PM, Henri Yandell (JIRA) <j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/LANG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710287#action_12710287 ]
>
> Henri Yandell commented on LANG-505:
> ------------------------------------
>
> A generic solution that's pluggable gets a bit interesting - do you assume
> that a single unicode codepoint maps to 1..n unicode codepoints, or should
> you be able to escape multiple codepoints? Unescaping is n codepoints to 1
> codepoint. This makes it harder to have N plugins all working together. It
> becomes a translator, unescaping and escaping the same algorithm.
>
> My first pass was on the 1..n. API is:
>
> {code:java}
>     // Note it's a FilterWriter, so 'out' is part of the class. Probably
>     // change that and put Writer in the API.
>     public abstract boolean escape(int codepoint) throws IOException;
> {code}
>
> The boolean is returned to say whether that codepoint was successfully
> translated or not. This is needed for AggregateEscapers to work and so
> codepoint skipping can occur.
>
> It leads to a top level use of:
>
> {code:java}
>     public void escapeJava(String input, Writer out) throws IOException {
>         AggregateEscaper escapers = new AggregateEscaper(
>             new EscapeBasedOnLookup(out,
>                       new String[][] {
>                             {"\"", "\\\""},
>                             {"\\", "\\\\"}
>                       }),
>             new EscapeLowAsciiAsUnicode(out),
>             new EscapeNonAsciiAsUnicode(out)
>         );
>
>         escape(input, escapers);
>     }
> {code}
>
> It's easy to see from there how a user might modify what they view this to
> be. They could write an escapeC quite easily etc. However - it's 1..n, and
> unescaping isn't handled.
> The 1..n escape algorithm looks like:
>
> {code:java}
>     public void escape(String str, CharEscaper escaper) throws IOException {
>         if (str == null) {
>             return;
>         }
>         if (escaper == null) {
>             throw new IllegalArgumentException("The CharEscaper must not be null. " +
>                     "Use NullEscaper if you expected this to mean a no-operation");
>         }
>         int sz = str.length();
>         for (int i = 0; i < sz; i++) {
>             int c = Character.codePointAt(str, i);
>             boolean success = escaper.escape(c);
>
>             // contract with escapers is that they have to understand codepoints
>             // and they just took care of a surrogate pair
>             if (success && c >= 0x010000 && i < sz - 1) {
>                 i++;
>             }
>         }
>     }
> {code}
>
> As I said - 1..n is the problem. Parsing needs to happen a character at a
> time, and each Escaper needs to be offered the choice to take control of the
> flow. Some options - thinking out loud:
>
> * Escapers need to be passed in the equivalent of a C pointer String. A
>   CharSequence would be good, but calling the subSequence method all the time
>   might not be performant. Probably best to pass down the index and the
>   CharSequence. Making the whole thing String rather than char based -
>   probably better anyway as that saves having to think in terms of codepoints.
> * Escapers are asked whether they want to make a change first, then if they
>   return true they are called again to make the change and they return an
>   index increment for the driving loop to make based on the number of
>   characters they consumed.
>
> So:
>
> {code:java}
>     public abstract boolean isEscapable(int index, CharSequence input);
>     public abstract int escape(int index, CharSequence input, Writer out)
>             throws IOException;
> {code}
>
> Needs another code pass. Anyway - I think our current code is screwed up
> enough to warrant a deeper implementation.
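
For discussion, here is a rough sketch of how the two-method proposal (isEscapable plus an escape that returns an index increment) might drive a loop. Everything named here is made up for illustration and is not Commons Lang code: `IndexedEscaper`, `LookupEscaper`, `NonAsciiEscaper` and `TranslatorSketch` are hypothetical, and copying a char through untouched when no escaper claims it is my assumption. Because the lookup keys can be longer than one char, the same loop also covers the n-to-1 unescape direction.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical base type mirroring the two abstract methods proposed above. */
abstract class IndexedEscaper {
    /** True if this escaper wants to handle the text starting at index. */
    public abstract boolean isEscapable(int index, CharSequence input);

    /** Writes the translation and returns the number of chars consumed. */
    public abstract int escape(int index, CharSequence input, Writer out) throws IOException;
}

/** Hypothetical table-driven escaper; keys may be longer than one char. */
class LookupEscaper extends IndexedEscaper {
    private final String[][] table;

    LookupEscaper(String[][] table) {
        this.table = table;
    }

    /** Returns the first {key, value} pair whose key starts at index, or null. */
    private String[] match(int index, CharSequence input) {
        for (String[] pair : table) {
            int end = index + pair[0].length();
            // end > index skips empty keys, which would never advance the loop
            if (end > index && end <= input.length()
                    && pair[0].contentEquals(input.subSequence(index, end))) {
                return pair;
            }
        }
        return null;
    }

    @Override
    public boolean isEscapable(int index, CharSequence input) {
        return match(index, input) != null;
    }

    /** Only valid after isEscapable returned true for the same index. */
    @Override
    public int escape(int index, CharSequence input, Writer out) throws IOException {
        String[] pair = match(index, input);
        out.write(pair[1]);
        return pair[0].length();
    }
}

/** Hypothetical escaper: writes each non-ASCII char as a 4-digit hex escape. */
class NonAsciiEscaper extends IndexedEscaper {
    @Override
    public boolean isEscapable(int index, CharSequence input) {
        return input.charAt(index) > 0x7f;
    }

    @Override
    public int escape(int index, CharSequence input, Writer out) throws IOException {
        out.write(String.format("\\u%04X", (int) input.charAt(index)));
        return 1;
    }
}

final class TranslatorSketch {
    /** Driving loop: the first escaper that claims a position decides how far to skip. */
    static void translate(CharSequence input, Writer out, IndexedEscaper... escapers)
            throws IOException {
        if (input == null) {
            return;
        }
        int i = 0;
        while (i < input.length()) {
            int consumed = 0;
            for (IndexedEscaper escaper : escapers) {
                if (escaper.isEscapable(i, input)) {
                    consumed = escaper.escape(i, input, out);
                    break;
                }
            }
            if (consumed <= 0) {
                // Assumption: nothing was consumed, so copy one char and move on.
                out.write(input.charAt(i));
                consumed = 1;
            }
            i += consumed;
        }
    }

    /** Convenience wrapper that collects the output in a String. */
    static String translateToString(CharSequence input, IndexedEscaper... escapers) {
        StringWriter out = new StringWriter();
        try {
            translate(input, out, escapers);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        IndexedEscaper javaLookup = new LookupEscaper(new String[][] {
            {"\"", "\\\""},
            {"\\", "\\\\"}
        });
        System.out.println(translateToString("say \"hi\"", javaLookup, new NonAsciiEscaper()));
    }
}
```

Swapping the two columns of the lookup table gives an unescaper with no change to the loop, which is the "translator" idea from the comment. Calling match twice per position (once in isEscapable, once in escape) is wasteful; a single method returning 0 for "not mine" would avoid that, but it is a different API from the one proposed.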
>
>> Rewrite StringEscapeUtils
>> -------------------------
>>
>>                 Key: LANG-505
>>                 URL: https://issues.apache.org/jira/browse/LANG-505
>>             Project: Commons Lang
>>          Issue Type: Task
>>            Reporter: Henri Yandell
>>             Fix For: 3.0
>>
>>
>> I think StringEscapeUtils needs a strong rewrite. For each escape method
>> (and unescape) there tend to be three or four types of escaping happening.
>> So not being able to define which set of three or four apply is a pain point
>> (and cause of bug reports due to different desired features).
>> We should be offering basic functionality, but also allowing people to say
>> "escape(Escapers.BASIC_XML, Escapers.LOW_UNICODE, Escapers.HIGH_UNICODE)".
>> Also should delete escapeSql; it's a bad one imo. Dangerous in that it will
>> lead people to not use PreparedStatement and given it only escapes ', it's
>> not much use. Especially as different dialects escape that in different ways.
>> Opening this ticket for discussion.
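
To make the "escape(Escapers.BASIC_XML, Escapers.LOW_UNICODE, Escapers.HIGH_UNICODE)" idea from the ticket a bit more concrete, here is one possible shape for it. This is a guess, not a design: the constant names come from the ticket, but what each one covers is my assumption (the five XML entities, code points below 0x20, and code points above 0x7f written as decimal character references), the input string as first argument is added by me, and `EscapeComposition` is a made-up holder class.

```java
/** Constant names are from the ticket; the ranges each one covers are assumptions. */
enum Escapers {
    BASIC_XML {
        @Override
        String apply(int codePoint) {
            switch (codePoint) {
                case '&': return "&amp;";
                case '<': return "&lt;";
                case '>': return "&gt;";
                case '"': return "&quot;";
                case '\'': return "&apos;";
                default: return null;
            }
        }
    },
    LOW_UNICODE {
        @Override
        String apply(int codePoint) {
            return codePoint < 0x20 ? "&#" + codePoint + ";" : null;
        }
    },
    HIGH_UNICODE {
        @Override
        String apply(int codePoint) {
            return codePoint > 0x7f ? "&#" + codePoint + ";" : null;
        }
    };

    /** Returns the replacement text, or null if this escaper does not handle the code point. */
    abstract String apply(int codePoint);
}

/** Hypothetical holder for the varargs entry point. */
final class EscapeComposition {
    /** Applies the first escaper that handles each code point; others pass through unchanged. */
    static String escape(String input, Escapers... escapers) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            String replacement = null;
            for (Escapers escaper : escapers) {
                replacement = escaper.apply(codePoint);
                if (replacement != null) {
                    break;
                }
            }
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(codePoint);
            }
            // Step by one or two chars so a surrogate pair is never split.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b & c", Escapers.BASIC_XML, Escapers.HIGH_UNICODE));
    }
}
```

Leaving HIGH_UNICODE out of the argument list keeps non-ASCII text as-is, which is the kind of per-caller choice the ticket asks for. An enum is closed to user extension, so a real version would more likely take an interface; the enum is only here to match the constant-style names in the ticket.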