On Tue, 06 Oct 2015 20:20:40 +0200, Peter Otten wrote:

> Jaydip Chakrabarty wrote:
>
>> On Tue, 06 Oct 2015 14:33:51 +0200, Peter Otten wrote:
>>
>>> Jaydip Chakrabarty wrote:
>>>
>>>> On Tue, 06 Oct 2015 01:34:17 +1100, Chris Angelico wrote:
>>>>
>>>>> On Tue, Oct 6, 2015 at 1:06 AM, Tim Chase
>>>>> <python.l...@tim.thechases.com> wrote:
>>>>>> That way, if you determine by line 3 that your million-row CSV file
>>>>>> has no blank columns, you can get away with not processing all
>>>>>> million rows.
>>>>>
>>>>> Sure, although that effectively means the entire job is moot. I
>>>>> kinda assume that the OP knows that there are some blank columns
>>>>> (maybe lots of them). The extra check is unnecessary unless it's
>>>>> actually plausible that there'll be no blanks whatsoever.
>>>>>
>>>>> Incidentally, you have an ordered_headers list which is the blank
>>>>> columns in order; I think the OP was looking for a list of the
>>>>> _non_blank columns. But that's a trivial difference, easy to tweak.
>>>>>
>>>>> ChrisA
>>>>
>>>> Thanks to you all. I got it this far. But while writing back to
>>>> another csv file, I got this error - "ValueError: dict contains
>>>> fields not in fieldnames: None". Here is my code.
>>>>
>>>> rdr = csv.DictReader(fin, delimiter=',')
>>>> header_set = set(rdr.fieldnames)
>>>> for r in rdr:
>>>>     header_set = set(h for h in header_set if not r[h])
>>>>     if not header_set:
>>>>         break
>>>>
>>>> for r in rdr:
>>>>     data = list(r[i] for i in header_set)
>>>>
>>>> dw = csv.DictWriter(fout, header_set)
>>>> dw.writeheader()
>>>> dw.writerows(data)
>>>
>>> Sorry, this is not the code you ran. I could guess what the missing
>>> parts might be, but it is easier for both sides if you provide a small
>>> script that actually can be executed and a small dataset that shows
>>> the behaviour you describe. Then post the session and especially the
>>> traceback. Example:
>>>
>>> $ cat my_data.csv
>>> 0
>>> $ cat my_code.py
>>> print 1/int(open("my_data.csv").read())
>>> $ python my_code.py
>>> Traceback (most recent call last):
>>>   File "my_code.py", line 1, in <module>
>>>     print 1/int(open("my_data.csv").read())
>>> ZeroDivisionError: integer division or modulo by zero
>>>
>>> Don't retype, use cut and paste. Thank you.
>>
>> I downloaded gmail contacts in google csv format. There are so many
>> columns. So I was trying to create another csv with the required
>> columns.
>> Now when I tried to open the gmail csv file with csv DictReader, it
>> said the file contained NULL characters.
>> So first I did -
>>
>> data = open(fn, 'rb').read()
>> fout = open(ofn, 'wb')
>> fout.write(data.replace('\x00', ''))
>> fout.close()
>> shutil.move(ofn, fn)
>>
>> Then I found, there were some special characters in the file. So, once
>> again I opened the file and did -
>>
>> data = open(fn, 'rb').read()
>> fout = open(ofn, 'wb')
>> fout.write(data.replace('\xff\xfe', ''))
>> fout.close()
>> shutil.move(ofn, fn)
>
> Uh this looks like the file is in UTF-16. Use
>
> import codecs
> fn = ...
> ofn = ...
> with codecs.open(fn, encoding="utf-16") as f:
>     with codecs.open(ofn, "w", encoding="utf-8") as g:
>         g.writelines(f)
> ...
>
> to convert it to UTF-8 which is compatible with the csv module of Python
> 2.
>
>> Now it seemed right.
>
> Only if all characters are encodable as iso-8859-1.
>
>> So I started to remove empty columns.
>>
>> fin = open(fn, 'rb')
>> fout = open(ofn, 'wb')
>>
>> rdr = csv.DictReader(fin, delimiter=',')
>> flds = rdr.fieldnames
>> header_set = set(rdr.fieldnames)
>> for r in rdr:
>>     header_set = set(h for h in header_set if not r[h])
>>     if not header_set:
>>         break
>> for r in rdr:
>>     data = list(r[i] for i in header_set)
>>
>> dw = csv.DictWriter(fout, data[0].keys())
>> dw.writeheader()
>> dw.writerows(data)
>>
>> fin.close()
>> fout.close()
>>
>> But, I am getting error at dw.writerows(data). I put the whole code
>> here.
>> Please help.
>
> I really meant it when I asked you to post the code you actually ran,
> and the traceback it produces.
>
> When I fill in the blanks by guessing
>
> $ cat in.csv
> one,two,three
> foo,,
> bar,,baz
> $ cat remove_empty_colums.py
> import csv
> fn = "in.csv"
> ofn = "out.csv"
>
> fin = open(fn, 'rb')
> fout = open(ofn, 'wb')
>
> rdr = csv.DictReader(fin, delimiter=',')
> flds = rdr.fieldnames
> header_set = set(rdr.fieldnames)
> for r in rdr:
>     header_set = set(h for h in header_set if not r[h])
>     if not header_set:
>         break
> for r in rdr:
>     data = list(r[i] for i in header_set)
>
> dw = csv.DictWriter(fout, data[0].keys())
> dw.writeheader()
> dw.writerows(data)
>
> fin.close()
> fout.close()
>
> and then run the resulting script I get
>
> $ python remove_empty_colums.py
> Traceback (most recent call last):
>   File "remove_empty_colums.py", line 18, in <module>
>     dw = csv.DictWriter(fout, data[0].keys())
> NameError: name 'data' is not defined
>
> So this is my traceback, and while the NameError is trivial to fix
> (reopen the file or do a seek) but not sufficient to make the script do
> what you want it doesn't seem to be the problem you ran into.
> So you have a different script. I'd really like to see it, and the
> traceback it produces.

Yes, that is so. Thank you all for the help.

Thanks.
-- 
https://mail.python.org/mailman/listinfo/python-list
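
For anyone finding this thread later, here is a minimal sketch of the two-pass approach discussed above. It is not the script anyone in the thread actually ran. It assumes Python 3 (the thread uses Python 2), and the function name drop_blank_columns is made up for illustration. The rows are read into a list once, because a DictReader cannot be iterated a second time without reopening or seeking the file. The writer is given extrasaction="ignore", which is one way to avoid "ValueError: dict contains fields not in fieldnames": DictReader stores surplus fields of an over-long row under the key None, and DictWriter rejects unknown keys by default.

```python
import csv
import io


def drop_blank_columns(src, dst):
    """Copy CSV data from src to dst, omitting columns blank in every row.

    src and dst are text-mode file objects. Returns the kept column names.
    (Hypothetical helper, not part of the csv module.)
    """
    reader = csv.DictReader(src)
    rows = list(reader)  # read everything once; the reader is now exhausted
    fieldnames = reader.fieldnames or []

    # Pass 1: start with every header and drop those that hold a value
    # in some row. Whatever remains is blank in all rows.
    blank = set(fieldnames)
    for row in rows:
        blank = {name for name in blank if not row[name]}
        if not blank:
            break  # no blank columns left, no need to look further

    # Pass 2: write the remaining columns in their original order.
    kept = [name for name in fieldnames if name not in blank]
    writer = csv.DictWriter(dst, kept, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return kept


if __name__ == "__main__":
    # Same sample data as in.csv from the message above.
    sample = io.StringIO("one,two,three\nfoo,,\nbar,,baz\n")
    out = io.StringIO()
    print(drop_blank_columns(sample, out))
    print(out.getvalue())
```

With real files under Python 3, something like open(fn, newline="", encoding="utf-16") for the input and open(ofn, "w", newline="", encoding="utf-8") for the output should make the separate NUL and BOM stripping steps unnecessary, assuming the export really is UTF-16 as suggested above.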