Dave Angel wrote:

> On 28/9/2013 02:26, Daniel Stojanov wrote:
>
>> Can somebody explain this. The line number reported by shlex depends
>> on the previous token. I want to be able to tell if I have just popped
>> the last token on a line.
>>
>
> I agree that it seems weird. However, I don't think you have made
> clear why it's not what you (and I) expect.
>
> import shlex
>
> def parseit(string):
>     print
>     print "Parsing -", string
>     first = shlex.shlex(string)
>     token = "dummy"
>     while token:
>         token = first.get_token()
>         print token, " -- line", first.lineno
>
> parseit("word1 word2\nword3")       #first
> parseit("word1 word2,\nword3")      #second
> parseit("word1 word2,word3\nword4")
> parseit("word1 word2+,?\nword3")
>
> This will display the lineno attribute for every token.
>
> shlex is documented at:
>
> http://docs.python.org/2/library/shlex.html
>
> And lineno is documented on that page as:
>
> """shlex.lineno
> Source line number (count of newlines seen so far plus one).
> """
>
> It's not at all clear what "seen so far" is intended to mean, but in
> practice, the line number is incremented for the last token on the
> line. Thus your first example
>
> Parsing - word1 word2
> word3
> word1  -- line 1
> word2  -- line 2
> word3  -- line 2
>   -- line 2
>
> word2 has the incremented line number.
>
> But when the token is neither whitespace nor ASCII letters, then it
> doesn't increment lineno. Thus second example:
>
> Parsing - word1 word2,
> word3
> word1  -- line 1
> word2  -- line 1
> ,  -- line 1    #we would expect this to be "line 2"
> word3  -- line 2
>   -- line 2
>
> Anybody else have some explanation

The explanation seems obvious: a word may be continued by the next
character if that is in wordchars, so the parser has to look at that
character. If it happens to be '\n' the lineno is immediately
incremented. Non-wordchars are returned as single characters, so there
is no need to peek ahead and the lineno is not altered.

In short: this looks like an implementation accident.

OP: I don't see a use case for the current behaviour -- I suggest that
you file a bug report.

> or advice for Daniel, other than
> preprocessing the string by stripping any non letters off the end of the
> line?

The following gives the tokens' starting line for your examples

import shlex

def shlexiter(s):
    p = shlex.shlex(s)
    # Stop treating "\n" as whitespace so it comes back as its own token.
    p.whitespace = p.whitespace.replace("\n", "")
    while True:
        # Read lineno *before* fetching the token: this is the line
        # the token starts on.
        lineno = p.lineno
        token = p.get_token()
        if not token:
            break
        if token == "\n":
            continue
        yield lineno, token

def parseit(string):
    print("Parsing - {!r}".format(string))
    for lineno, token in shlexiter(string):
        print("{:3} {!r}".format(lineno, token))
    print("")

but I have no idea about the implications for more complex input.
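
For anyone who wants to see the look-ahead effect in isolation, here is a minimal sketch, assuming Python 3 and the default non-POSIX shlex.shlex settings. The helper name tokens_with_lineno is made up for this illustration and is not part of shlex. It reads lineno right after each get_token() call, as in Dave's parseit, and returns the pairs instead of printing them.

```python
import shlex


def tokens_with_lineno(text):
    """Return (token, lineno) pairs, sampling lineno right after get_token().

    Hypothetical helper for illustration only; assumes the default
    non-POSIX mode, where get_token() returns '' at end of input.
    """
    lexer = shlex.shlex(text)
    pairs = []
    while True:
        token = lexer.get_token()
        if not token:  # '' signals end of input in non-POSIX mode
            break
        pairs.append((token, lexer.lineno))
    return pairs


if __name__ == "__main__":
    samples = [
        "word1 word2\nword3",    # word ended by the newline itself
        "word1 word2,\nword3",   # punctuation sits before the newline
        "word1 word2 \nword3",   # a space ends the word before the newline
    ]
    for sample in samples:
        print(repr(sample), tokens_with_lineno(sample))
```

In the first sample the lexer has to read the '\n' to find out that word2 has ended, so word2 is already reported on line 2. In the other two the word is ended by a comma or a space, the newline has not been read yet, and word2 stays on line 1.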