Re: Working with Huge Text Files

2005-03-18 Thread cwazir
Hi,

Lorn Davies wrote:

> . working with text files that range from 100MB to 1G in size.
> .
> XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
> .

I've found that for working with simple large text files like this,
nothing beats the plain old built-in string operations. Using a parsing
library is convenient if the data format is complex, but otherwise it's
overkill.
In this particular case, even the csv module isn't much of an
advantage. I'd just use split.

The following code should do the job:

data_file = open('data.txt', 'r')
months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
          'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
          'NOV':'11', 'DEC':'12'}
output_files = {}   # one open output file per symbol (the first field)
for line in data_file:
    fields = line.strip().split(',')
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename+'.txt', 'w')
    # 04JAN1993 -> 19930104
    fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
    fields[2] = fields[2].replace(':', '')   # 9:30:27 -> 93027
    fields[3] = fields[3].replace('.', '')   # 28.87 -> 2887
    # Write everything except the symbol, which is already in the file name
    output_files[filename].write(', '.join(fields[1:]) + '\n')
for filename in output_files:
    output_files[filename].close()
data_file.close()

Note that it works with unsorted data, at the minor cost of keeping
every output file open until the whole input has been processed.
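
For anyone running this on a current Python 3, here is a rough sketch of
the same approach wrapped in functions. The names convert_fields and
split_by_symbol, the output_dir parameter, and the skipping of short lines
are my own assumptions for illustration; they are not part of the script
above.

```python
import os

MONTHS = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}


def convert_fields(fields):
    """Return the reformatted fields that follow the leading symbol."""
    out = list(fields[1:])
    date = fields[1]
    out[0] = date[5:] + MONTHS[date[2:5]] + date[:2]  # 04JAN1993 -> 19930104
    out[1] = fields[2].replace(':', '')               # 9:30:27 -> 93027
    out[2] = fields[3].replace('.', '')               # 28.87 -> 2887
    return out


def split_by_symbol(input_path, output_dir):
    """Write one <symbol>.txt per distinct first field; return the symbols."""
    output_files = {}
    try:
        with open(input_path) as data_file:
            for line in data_file:
                fields = line.strip().split(',')
                if len(fields) < 4:
                    continue  # assumption: silently skip blank or short lines
                symbol = fields[0]
                if symbol not in output_files:
                    path = os.path.join(output_dir, symbol + '.txt')
                    output_files[symbol] = open(path, 'w')
                output_files[symbol].write(
                    ', '.join(convert_fields(fields)) + '\n')
    finally:
        # Output files stay open for the whole run, so close them all here.
        for handle in output_files.values():
            handle.close()
    return sorted(output_files)
```

Calling split_by_symbol('data.txt', '.') would mirror the original script.
The try/finally only makes sure the output files get closed if a malformed
line raises an exception part-way through.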

Chirag Wazir
http://chirag.freeshell.org

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Working with Huge Text Files

2005-03-19 Thread cwazir
Lorn Davies wrote:

> if  (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
> ## This line above doesn't work, can't figure out how to struct?

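In Python, each comparison has to be written out in full:
  if (fields[8] == 'N' or fields[8] == 'P') and (fields[6] == '0'
          or fields[6] == '1'):
or alternatively:
  if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):
Your version parses as (fields[8] == 'N') or 'P', and a non-empty string
such as 'P' always counts as true, so the test can never fail.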
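
A quick way to convince yourself is to run both forms against a value
that should be rejected. This is only a small demonstration; the function
names buggy_check and fixed_check are made up for the example.

```python
def buggy_check(flag):
    # Evaluated as (flag == 'N') or 'P'. The non-empty string 'P' is
    # truthy, so the result is truthy whatever flag holds.
    return bool(flag == 'N' or 'P')


def fixed_check(flag):
    # Membership test: true only when flag is one of the listed values.
    return flag in ('N', 'P')


for flag in ('N', 'P', 'X'):
    print(flag, buggy_check(flag), fixed_check(flag))
# Prints:
# N True True
# P True True
# X True False
```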

> The main changes were to create a check for the length of fields[3],
> I wanted to normalize it at 6 digits...

Well, you needn't really check the length - you could directly do this:
fields[3] = (fields[3].replace('.', '') + '00')[:6]
(of course if there are more than 6 digits originally, they'd get
truncated in this case)
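
To make the behaviour concrete, here is a small sketch comparing that
one-liner with a variant based on str.ljust. The function names and the
width parameter are hypothetical. Note that appending '00' only yields 6
digits when the value already has at least 4 of them; ljust pads however
short the value is.

```python
def pad_with_two_zeros(text):
    # The expression from the post: strip the dot, append '00', cut at 6.
    return (text.replace('.', '') + '00')[:6]


def normalize_price(text, width=6):
    # Variant: right-pad with zeros up to width, then cut at width.
    return text.replace('.', '').ljust(width, '0')[:width]


for price in ('28.87', '128.875', '1234.5678', '5.5'):
    print(price, pad_with_two_zeros(price), normalize_price(price))
# Prints:
# 28.87 288700 288700
# 128.875 128875 128875
# 1234.5678 123456 123456
# 5.5 5500 550000
```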

Chirag Wazir 
http://chirag.freeshell.org



Re: Working with Huge Text Files

2005-03-19 Thread cwazir
John Machin wrote:

> More meaningful names wouldn't go astray either :-)

I heartily concur!

Instead of starting with:
  fields = line.strip().split(',')
you could use something like:
  (f_name, f_date, f_time, ...) = line.strip().split(',')

Of course then you won't be able to use ', '.join(fields[1:])
for the output, but the rest of the program will be
MUCH more readable/maintainable.
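
As a rough illustration of the named-field version, assuming the
nine-column layout of the sample rows: only f_name, f_date and f_time come
from the suggestion above. The other names (f_price, f_4 through f_8) and
the reformat_line function are placeholders, since the thread does not say
what the remaining columns mean.

```python
def reformat_line(line):
    """Return (symbol, output_text) for one comma-separated input line."""
    # Unpacking raises ValueError unless the line has exactly nine fields.
    (f_name, f_date, f_time, f_price,
     f_4, f_5, f_6, f_7, f_8) = line.strip().split(',')
    f_time = f_time.replace(':', '')     # 9:30:27 -> 93027
    f_price = f_price.replace('.', '')   # 28.87 -> 2887
    # The output fields now have to be listed explicitly.
    output = ', '.join([f_date, f_time, f_price, f_4, f_5, f_6, f_7, f_8])
    return f_name, output


print(reformat_line('XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N\n'))
# Prints: ('XYZ', '04JAN1993, 93027, 2887, 7600, 40, 0, Z, N')
```

A side effect of unpacking is that a line with the wrong number of fields
fails immediately with a ValueError, instead of producing a short or
shifted record further down.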

Chirag Wazir 
http://chirag.freeshell.org
