2013/5/2 Gary Gregory <garydgreg...@gmail.com> > (top-posting cause it's getting hard to find stuff) > > >Should CSV unconditionally recognise \r \n etc as CR LF? > >If so, then they don't need a ! escape beforehand. > > There is ONE escape character, so if it is '!', then "\r " is just "\r".
> If we want >1 escape character, then that's a different story. Right now, > we have one. > Sorry guys, I got the whole story wrong, then. I made the assumption that we always represent CR as \r no matter what escape character we use. thanks for clarifying. > > Gary > > > On Thu, May 2, 2013 at 4:13 PM, sebb <seb...@gmail.com> wrote: > > > On 2 May 2013 18:26, Benedikt Ritter <brit...@apache.org> wrote: > > > > > 2013/5/2 sebb <seb...@gmail.com> > > > > > > > On 2 May 2013 14:56, sebb <seb...@gmail.com> wrote: > > > > > > > > > On 2 May 2013 09:48, Benedikt Ritter <brit...@apache.org> wrote: > > > > > > > > > >> 2013/5/1 sebb <seb...@gmail.com> > > > > >> > > > > >> > On 1 May 2013 08:53, Benedikt Ritter <brit...@apache.org> > wrote: > > > > >> > > > > > >> > > Hi, > > > > >> > > > > > > >> > > I have tried to solve CSV-58 - Escape handling needs > rethinking > > > [1] > > > > a > > > > >> few > > > > >> > > times over the past weeks now. But I can not see a way to get > > this > > > > >> > working. > > > > >> > > There are two problems with our current implementation: > > > > >> > > > > > > >> > > 1. a sequence of " test \a test" will be parsed to "test a > > test", > > > > >> > although > > > > >> > > there is no reason to escape 'a'. The expected behavior is to > > let > > > > the > > > > >> > > sequence \a unchanged. > > > > >> > > > > > > >> > > > > > >> > An alternative would be to throw an Exception. > > > > >> > > > > > >> > > > > >> Haven't thought about that alternative so far. It is was the java > > > > compiler > > > > >> would do if it finds "\a" in a String. > > > > >> If we decide to through an exception, the message should state > very > > > > >> clearly > > > > >> what has gone wrong, including the position of the misplaced > escape > > > > char. > > > > >> I'd like to hear more opinions on this. > > > > >> > > > > >> > > > > > Throwing an exception stops the parsing, and obviously only catches > > the > > > > > first error. > > > > > Leaving the sequences as is means potentially having to scan a > large > > > > > output file to find them all. > > > > > > > > > > There's another option: invoke a callback when any invalid > sequences > > > are > > > > > detected. > > > > > > > > > > That would allow for various possibilities, depending on what > > callback > > > > was > > > > > provided. > > > > > - throw Exception with details > > > > > - ignore (with optional log) > > > > > - replace with some other sequence (with optional log) > > > > > > > > > > I don't think CSV should do any logging of its own, but the caller > > can > > > do > > > > > so in the callback > > > > > > > > > > However we would need to decide what the default callback should > do. > > > > > If we throw an Exception, then it is obvious that something is > wrong. > > > > > If we ignore the sequence (leave as is) then there is no easy way > for > > > the > > > > > caller to know if there have been any errors. > > > > > > > > > > > > > > >> > > > > > >> > > > > > >> > > 2. If the escape char is different from backslash (for example > > > > '!'), a > > > > >> > > sequence of "test !r test" will be parsed to "test CR test" > > > > >> > > > > > > >> > > > > > > >> > Is that how other CSV parsers behave? i.e. do they always use \ > > for > > > > >> > escaping EOL? > > > > >> > > > > > >> > > > > >> Haven't tested this. I'll try to have a look this weekend. > > > > >> > > > > >> The problem with the current impl is a backslash as part of an > > escape > > > > >> sequence. > > > > >> This is what readEscape() currently does: It checks if the lexer > has > > > > >> reached a character combination of (for example) \r and replaces > > this > > > > with > > > > >> CR. Otherwise it swallows the escape character and returns the > next > > > > >> character. > > > > >> > > > > >> There are two problems with this: > > > > >> 1. it does so if it recognizes the escape character (which doesn't > > > have > > > > to > > > > >> be a backslash). This is why !r gets replaced by CR. > > > > >> 2. it swallows the escape and returns the next character (default > > > branch > > > > >> in > > > > >> the switch statement). This should only happen if the next > character > > > is > > > > >> one > > > > >> of the special characters from the format (e.g. the delimiter). > This > > > is > > > > >> why > > > > >> '\a' gets replaced by 'a' > > > > >> > > > > > > > > > > I think that's wrong, because one cannot distinguish "\a" from a". > > > > > > > > > > The readEscape() method currently returns an int; this cannot be > used > > > to > > > > > distinguish between a valid escape and some other character. > > > > > If we only want to support throwing an Exception for invalid escape > > > > > sequence, then it's trivial to fix this - just throw the Exception. > > > > > > > > > > However, to support "as is" passthrough of invalid sequences, the > > > caller > > > > > needs to know when to restore the escape character to the output. > > > > > > > > > > One way to do this would be to change the return type to a char[] > > > array. > > > > > This would either be a single char or the escape followed by a > single > > > > char. > > > > > The return array could just be appended to the buffer. This should > > work > > > > > quite well with a call-back, as the callback could also return a > char > > > > array. > > > > > > > > > > The only downside I can see is that append(char[]) may be slightly > > > slower > > > > > than append(char). > > > > > > > > > > > > > > Found another way to do it: return -1 (EOF) which cannot otherwise be > > > > returned by the method. > > > > The last character is then accessible via in.getLastChar(). > > > > i.e. if -1 is returned, output escape followed by lastChar. > > > > > > > > > > > > Needs some tweaks to readEscape, because \LF etc. stand for > themselves, > > > > i.e. both \r and \CR => CR. > > > > > > > > What needs to be decided is whether both !r and !CR => CR if the > escape > > > > character is ! rather than \, and if so, whether \CR still stands for > > CR. > > > > > > > > But I think the parsing can be tweaked without too much rewriting. > > > > > > > > > > I think there are four cases to cover (I'm using ! as escape in this > > > example): > > > 1. !r => nothing to escape, append '!r' > > > > > > > Not sure I agree with that; currently \r => CR if \ is the escape. > > > > > > > 2. !\r => the CR character in its character representation. Replace > '\r' > > by > > > CR and escape it. continue parsing of the token > > > > > > > Not sure I agree here either. I think that's an invalid sequence unless \ > > is one of the meta-characters. > > > > Should CSV unconditionally recognise \r \n etc as CR LF? > > If so, then they don't need a ! escape beforehand. > > > > > > > 3. !CR => esacpe CR and contine parsing of the token > > > > > > > !CR should be replaced by CR, i.e. *un*escaped > > > > 4. !\CR => escape '\' if it is one of the characters used in the format, > in > > > other words: remove the '!' and append the '\'. Then EOL => next > record, > > > but only if CR is the record separator (?). > > > > > > > > > > I've tried several times and failed. Do you have the time to implement > > your > > > proposal? I'll probably have some time this weekend to try once again > :-) > > > > > > > > Yes, I think I can make a start on amending readEscape. > > > > However that won't get us anywhere until the rules are clear; I don't > think > > they are currently. > > > > It should be possible to round-trip escaping then unescaping. > > > > It may be easier to consider all possible escaped output to decide how to > > unescape, and what is not a valid sequence. > > > > > > > > > > > > > > > > > > > > An alternative is to throw an exception. > > > > >> > > > > >> I think it may be necessary to separate backslash handling from > > escape > > > > >> handling. > > > > >> > > > > >> > > > > >> > > > > > Possibly. However that may make parsing harder. > > > > > > > > > > > > > > >> > > > > > >> > > > > > >> > > I have tried different approaches to handle this issue but > there > > > are > > > > >> > always > > > > >> > > corner cases that I'm unable to cover. I'm getting the feeling > > > that > > > > >> I'm > > > > >> > > just to dump to solve this, so any comments or suggestions > would > > > be > > > > >> > really > > > > >> > > appreciated! > > > > >> > > > > > > >> > > > > > > >> > Perhaps it would help to commit the unit tests you are working > to > > > > solve. > > > > >> > The tests can be disabled. > > > > >> > > > > > >> > Or perhaps better use a class whose name is not one of the ones > > > picked > > > > >> up > > > > >> > by Surefire. > > > > >> > This allows the test to be run independently by Eclipse (and > > Maven) > > > > but > > > > >> > won't mess with CI builds. > > > > >> > > > > > >> > > > > >> I have already committed three tests to CSVLexerTest and set them > to > > > be > > > > >> ignored. Maybe you can review those and tell me whether my > > assumptions > > > > are > > > > >> correct. > > > > >> > > > > >> > > > > >> > > > > > >> > Benedikt > > > > >> > > [1] https://issues.apache.org/jira/browse/CSV-58 > > > > >> > > > > > > >> > > -- > > > > >> > > http://people.apache.org/~britter/ > > > > >> > > http://www.systemoutprintln.de/ > > > > >> > > http://twitter.com/BenediktRitter > > > > >> > > http://github.com/britter > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > >> > > > > >> -- > > > > >> http://people.apache.org/~britter/ > > > > >> http://www.systemoutprintln.de/ > > > > >> http://twitter.com/BenediktRitter > > > > >> http://github.com/britter > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > http://people.apache.org/~britter/ > > > http://www.systemoutprintln.de/ > > > http://twitter.com/BenediktRitter > > > http://github.com/britter > > > > > > > > > -- > E-Mail: garydgreg...@gmail.com | ggreg...@apache.org > Java Persistence with Hibernate, Second Edition< > http://www.manning.com/bauer3/> > JUnit in Action, Second Edition <http://www.manning.com/tahchiev/> > Spring Batch in Action <http://www.manning.com/templier/> > Blog: http://garygregory.wordpress.com > Home: http://garygregory.com/ > Tweet! http://twitter.com/GaryGregory > -- http://people.apache.org/~britter/ http://www.systemoutprintln.de/ http://twitter.com/BenediktRitter http://github.com/britter