Working with Huge Text Files
Hi there,

I'm a Python newbie hoping for some direction in working with text files that range from 100MB to 1GB in size. Basically, certain rows, sorted by the first (primary) field and possibly the second (date), need to be copied and written to their own file, and some string manipulations need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this, where the first field would be written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at simpleParse, but it's a bit intense at first glance... not sure where to start, or even if I need to go that route. Any help from you guys in what direction to go or how to approach this would be hugely appreciated.

Best regards,
Lorn
--
http://mail.python.org/mailman/listinfo/python-list
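For readers following along, here is a minimal sketch of one way the task described above could be approached without any parsing library: read the file one line at a time, so memory use stays small, and keep one open output file per primary field. It is written for Python 3. The names convert_row, split_by_symbol, MONTHS and out_dir are made up for this illustration, and the sketch assumes every row has the same layout as the sample rows in the post.

```python
import os

# Month abbreviations as they appear in the sample rows (e.g. 04JAN1993).
MONTHS = {name: "%02d" % number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def convert_row(line):
    """Return (symbol, reformatted_row) for one comma-separated input line."""
    fields = line.strip().split(",")
    symbol, date, time, price = fields[:4]
    date = date[5:] + MONTHS[date[2:5]] + date[:2]   # 04JAN1993 -> 19930104
    time = time.replace(":", "")                     # 9:30:27   -> 93027
    price = price.replace(".", "")                   # 28.87     -> 2887
    return symbol, ", ".join([date, time, price] + fields[4:])


def split_by_symbol(lines, out_dir="."):
    """Stream rows one at a time, writing each to <symbol>.txt in out_dir."""
    outputs = {}
    try:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            symbol, row = convert_row(line)
            if symbol not in outputs:
                path = os.path.join(out_dir, symbol + ".txt")
                outputs[symbol] = open(path, "w")
            outputs[symbol].write(row + "\n")
    finally:
        # Close every output file, even if a malformed row raised an error.
        for handle in outputs.values():
            handle.close()
    return sorted(outputs)
```

Passing an open file object as lines (for example, with open("input.txt") as f: split_by_symbol(f), where input.txt is a placeholder name) iterates over it line by line, so a 1GB input is never loaded into memory in one piece.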
Re: Working with Huge Text Files
Thank you all very much for your suggestions and input... they've been very helpful. I found the easiest approach, as a beginner to this, was working with Chirag's code. Thanks Chirag, I was actually able to read and make some edits to the code and then use it... woohooo! My changes are annotated with ##:

    data_file = open('G:\pythonRead.txt', 'r')
    data_file.readline()  ## this was to skip the first line
    months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
              'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
              'NOV':'11', 'DEC':'12'}
    output_files = {}
    for line in data_file:
        fields = line.strip().split(',')
        length = len(fields[3])  ## check how long the field is
        N = 'P','N'
        filename = fields[0]
        if filename not in output_files:
            output_files[filename] = open(filename+'.txt', 'w')
        if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
            ## This line above doesn't work, can't figure out how to struct?
            fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
            fields[2] = fields[2].replace(':', '')
            if length == 6:  ## check for 6 if not add a 0
                fields[3] = fields[3].replace('.', '')
            else:
                fields[3] = fields[3].replace('.', '') + '0'
            print >>output_files[filename], ', '.join(fields[1:5])
    for filename in output_files:
        output_files[filename].close()
    data_file.close()

The main changes were to create a check for the length of fields[3]; I wanted to normalize it at 6 digits... the problem I can see with it potentially is if I come across lengths < 5, but I have some ideas to fix that. The other change I attempted was a criterion for what to print based on the values of fields[8] and fields[6]. It didn't work so well. I'm a little confused about how to structure booleans like that... I come from a little experience in a Pascal-type scripting language where "x and y" would entail both having to be true before continuing and "x or y" would mean either could be true before continuing.
Python, unless I'm misunderstanding (very possible), doesn't organize it as such. I thought of perhaps using a set of if, elif, else statements for processing the fields, but didn't think that would be the most elegant/efficient solution.

Anyway, any critiques/ideas are welcome... they'll most definitely help me understand this language a bit better. Thank you all again for your great replies and thank you Chirag for getting me up and going.

Lorn
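On the boolean question raised in this post: Python's "and" and "or" do behave the way the Pascal-style description suggests, but each side has to be a complete comparison. The expression fields[8] == 'N' or 'P' is read as (fields[8] == 'N') or 'P', and a non-empty string such as 'P' always counts as true, so the test can never fail. A minimal sketch of one possible fix follows, using membership tests instead. The helper names keep_row and normalize_price are invented for this illustration, and the padding helper assumes prices carry at most three decimal places, which is a guess based on the sample rows (28.87 and 28.875).

```python
def keep_row(fields):
    """True when field 8 is 'N' or 'P' and field 6 is '0' or '1'."""
    return fields[8] in ("N", "P") and fields[6] in ("0", "1")


def normalize_price(price, decimals=3):
    """Drop the decimal point and right-pad the fraction with zeros.

    Assumes at most `decimals` digits after the point; a longer
    fraction is passed through unchanged rather than truncated.
    """
    whole, _, fraction = price.partition(".")
    return whole + fraction.ljust(decimals, "0")


# Sample row taken from the original question.
fields = "XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N".split(",")

if keep_row(fields):
    fields[3] = normalize_price(fields[3])  # "28.87" becomes "28870"
```

Padding the fractional part, rather than checking the overall string length, also covers shorter prices such as 9.5 or a whole number like 28, which is the "lengths < 5" case mentioned above.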