On 2 May 2013 14:56, sebb <seb...@gmail.com> wrote: > On 2 May 2013 09:48, Benedikt Ritter <brit...@apache.org> wrote: > >> 2013/5/1 sebb <seb...@gmail.com> >> >> > On 1 May 2013 08:53, Benedikt Ritter <brit...@apache.org> wrote: >> > >> > > Hi, >> > > >> > > I have tried to solve CSV-58 - Escape handling needs rethinking [1] a >> few >> > > times over the past weeks now. But I can not see a way to get this >> > working. >> > > There are two problems with our current implementation: >> > > >> > > 1. a sequence of " test \a test" will be parsed to "test a test", >> > although >> > > there is no reason to escape 'a'. The expected behavior is to let the >> > > sequence \a unchanged. >> > > >> > >> > An alternative would be to throw an Exception. >> > >> >> Haven't thought about that alternative so far. It is was the java compiler >> would do if it finds "\a" in a String. >> If we decide to through an exception, the message should state very >> clearly >> what has gone wrong, including the position of the misplaced escape char. >> I'd like to hear more opinions on this. >> >> > Throwing an exception stops the parsing, and obviously only catches the > first error. > Leaving the sequences as is means potentially having to scan a large > output file to find them all. > > There's another option: invoke a callback when any invalid sequences are > detected. > > That would allow for various possibilities, depending on what callback was > provided. > - throw Exception with details > - ignore (with optional log) > - replace with some other sequence (with optional log) > > I don't think CSV should do any logging of its own, but the caller can do > so in the callback > > However we would need to decide what the default callback should do. > If we throw an Exception, then it is obvious that something is wrong. > If we ignore the sequence (leave as is) then there is no easy way for the > caller to know if there have been any errors. > > >> > >> > >> > > 2. If the escape char is different from backslash (for example '!'), a >> > > sequence of "test !r test" will be parsed to "test CR test" >> > > >> > > >> > Is that how other CSV parsers behave? i.e. do they always use \ for >> > escaping EOL? >> > >> >> Haven't tested this. I'll try to have a look this weekend. >> >> The problem with the current impl is a backslash as part of an escape >> sequence. >> This is what readEscape() currently does: It checks if the lexer has >> reached a character combination of (for example) \r and replaces this with >> CR. Otherwise it swallows the escape character and returns the next >> character. >> >> There are two problems with this: >> 1. it does so if it recognizes the escape character (which doesn't have to >> be a backslash). This is why !r gets replaced by CR. >> 2. it swallows the escape and returns the next character (default branch >> in >> the switch statement). This should only happen if the next character is >> one >> of the special characters from the format (e.g. the delimiter). This is >> why >> '\a' gets replaced by 'a' >> > > I think that's wrong, because one cannot distinguish "\a" from a". > > The readEscape() method currently returns an int; this cannot be used to > distinguish between a valid escape and some other character. > If we only want to support throwing an Exception for invalid escape > sequence, then it's trivial to fix this - just throw the Exception. > > However, to support "as is" passthrough of invalid sequences, the caller > needs to know when to restore the escape character to the output. > > One way to do this would be to change the return type to a char[] array. > This would either be a single char or the escape followed by a single char. > The return array could just be appended to the buffer. This should work > quite well with a call-back, as the callback could also return a char array. > > The only downside I can see is that append(char[]) may be slightly slower > than append(char). > > Found another way to do it: return -1 (EOF) which cannot otherwise be returned by the method. The last character is then accessible via in.getLastChar(). i.e. if -1 is returned, output escape followed by lastChar.
Needs some tweaks to readEscape, because \LF etc. stand for themselves, i.e. both \r and \CR => CR. What needs to be decided is whether both !r and !CR => CR if the escape character is ! rather than \, and if so, whether \CR still stands for CR. But I think the parsing can be tweaked without too much rewriting. > An alternative is to throw an exception. >> >> I think it may be necessary to separate backslash handling from escape >> handling. >> >> >> > Possibly. However that may make parsing harder. > > >> > >> > >> > > I have tried different approaches to handle this issue but there are >> > always >> > > corner cases that I'm unable to cover. I'm getting the feeling that >> I'm >> > > just to dump to solve this, so any comments or suggestions would be >> > really >> > > appreciated! >> > > >> > > >> > Perhaps it would help to commit the unit tests you are working to solve. >> > The tests can be disabled. >> > >> > Or perhaps better use a class whose name is not one of the ones picked >> up >> > by Surefire. >> > This allows the test to be run independently by Eclipse (and Maven) but >> > won't mess with CI builds. >> > >> >> I have already committed three tests to CSVLexerTest and set them to be >> ignored. Maybe you can review those and tell me whether my assumptions are >> correct. >> >> >> > >> > Benedikt >> > > [1] https://issues.apache.org/jira/browse/CSV-58 >> > > >> > > -- >> > > http://people.apache.org/~britter/ >> > > http://www.systemoutprintln.de/ >> > > http://twitter.com/BenediktRitter >> > > http://github.com/britter >> > > >> > >> >> >> >> -- >> http://people.apache.org/~britter/ >> http://www.systemoutprintln.de/ >> http://twitter.com/BenediktRitter >> http://github.com/britter >> > >