On Tue, Apr 14, 2015 at 7:48 AM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> with open(dfile, 'rb') as f:
>     for line in f:
>         try:
>             s = line.decode('utf-8', 'strict')
>         except UnicodeDecodeError as err:
>             print(err)
>
> If you need help deciphering the errors, please copy and paste them here
> and we'll see what we can do.

Below are the errors. I knew about these, and I think the correct encoding is windows-1252. At the end of this email I will paste some code and output that prints the offending column in each line. These characters are very likely errors, so I want to remove them. I am reading this csv into a django sqlite3 db. What is strange to me is that using

    with open(dfile, 'r', encoding='utf-8', errors='ignore', newline='')

does not seem to remove these; it seems to save them correctly to the db, which I don't understand.

'utf-8' codec can't decode byte 0xa6 in position 368: invalid start byte
'utf-8' codec can't decode byte 0xac in position 223: invalid start byte
'utf-8' codec can't decode byte 0xa6 in position 1203: invalid start byte
'utf-8' codec can't decode byte 0xa2 in position 44: invalid start byte
'utf-8' codec can't decode byte 0xac in position 396: invalid start byte

import chardet

with open("DATA/ATSDTA_ATSP600.csv", 'rb') as f:
    for line in f:
        code = chardet.detect(line)
        #if code == {'confidence': 0.5, 'encoding': 'windows-1252'}:
        if code != {'encoding': 'ascii', 'confidence': 1.0}:
            print(code)
            win = line.decode('windows-1252').split(',')  #windows-1252
            norm = line.decode('utf-8', 'ignore').split(',')
            ascii = line.decode('ascii', "ignore").split(',')
            ascii2 = line.decode('ISO-8859-1').split(',')
            for w, n, a, a2 in zip(win, norm, ascii, ascii2):
                if w != n:
                    print(w)
                    print(n)
                    print(a, a2)
            print(win[0])

## Output

{'encoding': 'windows-1252', 'confidence': 0.5}
"¦ "
" "
" " "¦ "
"040543"
{'encoding': 'windows-1252', 'confidence': 0.5}
"LEASE GREGPRU D ¬ETERSPM "
"LEASE GREGPRU D ETERSPM "
"LEASE GREGPRU D ETERSPM " "LEASE GREGPRU D ¬ETERSPM "
"979643"
{'encoding': 'windows-1252', 'confidence': 0.5}
"¦ "
" "
" " "¦ "
"986979"
{'encoding': 'windows-1252', 'confidence': 0.5}
"WELLS FARGO &¢ COMPANY "
"WELLS FARGO & COMPANY "
"WELLS FARGO & COMPANY " "WELLS FARGO &¢ COMPANY "
"994946"
{'encoding': 'windows-1252', 'confidence': 0.5}
OSSOSSO¬¬O "
OSSOSSOO "
OSSOSSOO " OSSOSSO¬¬O "
"996535"

Vincent Davis
720-301-3003
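P.S. For anyone who wants to reproduce the comparison without the csv file, here is a minimal sketch of the same four decodes applied to a single made-up line. The sample bytes and the helper names `decode_variants` and `non_ascii_bytes` are assumptions for illustration only; they are not taken from ATSDTA_ATSP600.csv. It only shows what each codec does to one line in isolation and does not explain what happens on the way into the db.

```python
# Hypothetical sample row: 0xa6 and 0xa2 stand in for the stray bytes
# reported in the UnicodeDecodeError messages; this is not real data.
raw = b'"040543","\xa6   ","WELLS FARGO &\xa2 COMPANY "\r\n'


def decode_variants(line):
    """Decode one raw line with the four codec/error-handler pairs compared in this post."""
    return {
        "windows-1252": line.decode("windows-1252"),
        "utf-8-ignore": line.decode("utf-8", "ignore"),
        "ascii-ignore": line.decode("ascii", "ignore"),
        "iso-8859-1": line.decode("ISO-8859-1"),
    }


def non_ascii_bytes(line):
    """Return (position, byte value) for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(line) if b >= 0x80]


if __name__ == "__main__":
    for name, text in decode_variants(raw).items():
        print(name, repr(text))
    print([(pos, hex(b)) for pos, b in non_ascii_bytes(raw)])
```

On this sample the two single-byte codecs keep the stray characters, while the two 'ignore' decodes drop them, which matches the pattern in the output pasted in this email.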
-- https://mail.python.org/mailman/listinfo/python-list