On Dec 12, 11:21 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
> On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd
> <javiervan...@gmail.com> declaimed the following in
> gmane.comp.python.general:
>
> > > f = open(r'c:\somefile.txt', 'w')
> > > f.write('0123456789\n0123456789\n0123456789')
>
>         Not the most explanatory sample data... It would be better if the
> records had different contents.
>
> > > f.close()
> > > f = open(r'c:\somefile.txt', 'r')
> > > for line in f:
>
>         Here you extract one "line" from the file
>
> > >     f.seek(3,0)
> >     print f.read(1) #just to know if its printing the right column
>
>         And here you ignored the entire line you read, seeking to the fourth
> byte from the beginning of the file, and reading just one byte from it.
>
>         I have no idea of how seek()/read() behaves relative to line
> iteration in the for loop... Given the small size of the test data set
> it is quite likely that the first "for line in f" resulted in the entire
> file being read into a buffer, and that buffer scanned to find the line
> ending and return the data preceding it; then the buffer position is set
> to after that line ending so the next "for line" continues from that
> point.
>
>         But in a situation with a large data set, or an unbuffered I/O
> system, the seek()/read() could easily result in resetting the file
> position used by the "for line", so that the second call returns
> "456789\n"... And all subsequent calls too, resulting in an infinite
> loop.
>
>         Presuming the assignment requires pulling multiple selected fields
> from individual records, where each record is of the same
> format/spacing, AND that the field selection can not be preprogrammed...
>
> Sample data file (use fixed width font to view):
> -=-=-=-=-=-
> Wulfraed       09Ranger  1915
> Bask Euren     13Cleric  1511
> Aethelwulf     07Mage    0908
> Cwiculf        08Mage    1008
> -=-=-=-=-=-
>
> Sample format definition file (tab separated):
> -=-=-=-=-=-
> Name    0-14
> Level   15-16
> Class   17-24
> THAC0   25-26
> Armor   27-28
> -=-=-=-=-=-
>
> Code to process (Python 2.5, with minimal error handling):
> -=-=-=-=-=-
>
> class Extractor(object):
>     def __init__(self, formatFile):
>         ff = open(formatFile, "r")
>         self._format = {}
>         self._length = 0
>         for line in ff:
>             form = line.split("\t") #file must be tab separated
>             if len(form) != 2:
>                 print "Invalid file format definition: %s" % line
>                 continue
>             name = form[0]
>             columns = form[1].split("-")
>             if len(columns) == 1: #single column definition
>                 start = int(columns[0])
>                 end = start
>             elif len(columns) == 2:
>                 start = int(columns[0])
>                 end = int(columns[1])
>             else:
>                 print "Invalid column definition: %s" % form[1]
>                 continue
>             self._format[name] = (start, end)
>             self._length = max(self._length, end)
>         ff.close()
>
>     def __call__(self, line):
>         data = {}
>         if len(line) < self._length:
>             print "Data line is too short for required format: ignored"
>         else:
>             for (name, (start, end)) in self._format.items():
>                 data[name] = line[start:end+1]
>         return data
>
> if __name__ == "__main__":
>     FORMATFILE = "SampleFormat.tsv"
>     DATAFILE = "SampleData.txt"
>
>     characterExtractor = Extractor(FORMATFILE)
>
>     df = open(DATAFILE, "r")
>     for line in df:
>         fields = characterExtractor(line)
>         for (name, value) in fields.items():
>             print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
>         print
>
>     df.close()
> -=-=-=-=-=-
>
> Output from running above code:
> -=-=-=-=-=-
> Field name: 'Armor'             value: '15'
> Field name: 'THAC0'             value: '19'
> Field name: 'Level'             value: '09'
> Field name: 'Class'             value: 'Ranger  '
> Field name: 'Name'              value: 'Wulfraed       '
>
> Field name: 'Armor'             value: '11'
> Field name: 'THAC0'             value: '15'
> Field name: 'Level'             value: '13'
> Field name: 'Class'             value: 'Cleric  '
> Field name: 'Name'              value: 'Bask Euren     '
>
> Field name: 'Armor'             value: '08'
> Field name: 'THAC0'             value: '09'
> Field name: 'Level'             value: '07'
> Field name: 'Class'             value: 'Mage    '
> Field name: 'Name'              value: 'Aethelwulf     '
>
> Field name: 'Armor'             value: '08'
> Field name: 'THAC0'             value: '10'
> Field name: 'Level'             value: '08'
> Field name: 'Class'             value: 'Mage    '
> Field name: 'Name'              value: 'Cwiculf        '
> -=-=-=-=-=-
>
>         Note that string fields have not been trimmed, also numeric fields
> are still in text format... The format definition file would need to be
> expanded to include a "string", "integer", "float" (and "Boolean"?) code
> in order for the extractor to do proper type conversions.
>
> --
>         Wulfraed                 Dennis Lee Bieber         AF6VN
>         wlfr...@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

Clearly it's working. Although this code is beyond my Python knowledge
(I don't get along with classes; maybe it's a good moment to learn about
them...), I'll dig into it.

Thanks a lot! It really helps...

J
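
For anyone following the thread without classes: the quoted suggestion of adding a type code to the format definition might look roughly like the sketch below. It is only an illustration under some assumptions that are not in the original post: the third tab-separated column and its codes ("str", "int", "float"), the names FORMAT, CONVERTERS, parse_format and extract, and the use of in-memory strings instead of the SampleFormat.tsv and SampleData.txt files. It is written for Python 3 rather than the Python 2.5 of the quoted code.

```python
# Hypothetical three-column format: name, column range, type code.
# The type codes are an assumption for illustration only.
FORMAT = (
    "Name\t0-14\tstr\n"
    "Level\t15-16\tint\n"
    "Class\t17-24\tstr\n"
    "THAC0\t25-26\tint\n"
    "Armor\t27-28\tint\n"
)

# Map each type code to a function that converts the raw text slice.
CONVERTERS = {
    "str": lambda text: text.strip(),  # trim the fixed-width padding
    "int": int,
    "float": float,
}


def parse_format(text):
    """Return a list of (name, start, end, convert) tuples."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, columns, code = line.split("\t")
        start, _, end = columns.partition("-")
        start = int(start)
        end = int(end) if end else start  # "5" means a single column
        fields.append((name, start, end, CONVERTERS[code]))
    return fields


def extract(line, fields):
    """Slice one fixed-width record and convert each field."""
    return {name: convert(line[start:end + 1])
            for name, start, end, convert in fields}


fields = parse_format(FORMAT)

# Build the first sample record with explicit padding so the column
# positions match the format definition (ends are inclusive).
sample = "Wulfraed".ljust(15) + "09" + "Ranger".ljust(8) + "19" + "15" + "\n"
record = extract(sample, fields)
print(record)
```

Note that each field is taken by slicing the line already read, so no seek()/read() on the file is needed inside the loop.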