New submission from Piotr Tokarski <pt12...@gmail.com>:
Consider the following CSV content: "a|b\nc| 'd\ne|' f". The real delimiter in this case is the '|' character, but ' ' is sniffed instead. A verbose example is attached.

The problem lies in csv.py, in the following code:

```
matches = []
for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',   #  ".*?",
              r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',   # ,".*?"
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
    matches = regexp.findall(data)
    if matches:
        break
```

This makes matches non-empty, and further processing happens with the delimiter falsely set to ' '.

----------
components: Library (Lib)
messages: 397821
nosy: pt12lol
priority: normal
severity: normal
status: open
title: CSV sniffing falsely detects space as a delimiter
type: behavior
versions: Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue44677>
_______________________________________
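
A minimal sketch of how the false match in this report can be reproduced outside csv.py. It applies the four patterns quoted in the report to the sample content and returns the first match. The names DATA, PATTERNS and first_match are hypothetical helpers introduced for illustration only; they are not part of the csv module. The final call to csv.Sniffer().sniff() is included for comparison, and its result may differ between Python versions.

```python
import csv
import re

# Sample from the report: '|' is the intended delimiter.
DATA = "a|b\nc| 'd\ne|' f"

# The four patterns quoted in the report, in the order they are tried.
PATTERNS = (
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
)


def first_match(data):
    """Return (pattern index, named groups, matched text) for the first
    pattern that matches, or None if no pattern matches.

    Hypothetical helper: it mirrors the loop quoted in the report but
    uses search() so the matched text can be inspected.
    """
    for index, restr in enumerate(PATTERNS):
        regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
        match = regexp.search(data)
        if match:
            return index, match.groupdict(), match.group(0)
    return None


if __name__ == "__main__":
    index, groups, text = first_match(DATA)
    # The two single quotes sit on different lines, yet with re.DOTALL
    # the lazy '.*?' spans the newline and pairs them up.
    print("pattern index:", index)
    print("groups:", groups)
    print("matched text:", repr(text))

    # For comparison: what the Sniffer itself reports on this interpreter.
    try:
        print("sniffed delimiter:", repr(csv.Sniffer().sniff(DATA).delimiter))
    except csv.Error as exc:
        print("sniff failed:", exc)
```

With the sample above, the first pattern matches the text from the space before 'd to the space after the second quote, so the space is captured as the delim group even though the two quotes belong to different rows.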