"Paul McGuire" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > "manstey" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] > > Hi, > > > > I have a text file with about 450,000 lines. Each line has 4-5 fields, > > separated by various delimiters (spaces, @, etc). > > > > I want to load in the text file and then run routines on it to produce > > 2-3 additional fields. > > > > <snip> > > Matthew - > > If you find re's to be a bit cryptic, here is a pyparsing version that may > be a bit more readable, and will easily scan through your input file: > <snip>
Lest I be accused of pushing pyparsing where it isn't appropriate, here is a
non-pyparsing version of the same program.  The biggest hangup with your
sample data is that you can't predict what the separator is going to be -
sometimes it's '[', sometimes it's '^'.  If the separator character were more
predictable, you could use simple split() calls, as in:

    data = "blah blah blah^more blah".split("^")
    elements = data[0].split() + [data[1]]
    print elements

which prints:

    ['blah', 'blah', 'blah', 'more blah']

Note that this also discards the separator.  Since you have something that
goes beyond simple string split()s, I thought you might find pyparsing to be
a simpler alternative to re's.

Here is a version that tries the different separators, then builds the
appropriate list of pieces, including the matching separator.  I've also
shown an example of a generator, since you are likely to want one when
parsing hundreds of thousands of lines.

-- Paul

=================
data = """gee fre asd[234
ger dsf asd[243
gwer af as.:^25a"""

# generator to process each line of data
# call it as processData(listOfLines)
def processData(d):
    separators = "[^"   # expand this string if you need other separators
    for line in d:
        for s in separators:
            if s in line:
                parts = line.split(s)
                # yield the first element of parts, split on whitespace,
                # followed by the separator,
                # followed by whatever came after the separator
                yield parts[0].split() + [ s, parts[1] ]
                break
        else:
            # no known separator found - pass the line through unchanged
            yield line

# to run this over a text file, use something like
#   for lineParts in processData( file("xyzzy.txt").readlines() ):
for lineParts in processData( data.split("\n") ):
    print lineParts

print

# rerun processData, augmenting the extracted values with additional
# computed values
for lineParts in processData( data.split("\n") ):
    toks = lineParts
    tokens = toks[:]
    tokens.append( toks[0] + toks[1] )
    tokens.append( toks[-1] + toks[-1][-1] )
    #~ tokens.append( str( lineno(start, data) ) )
    print tokens
====================

prints:

['gee', 'fre', 'asd', '[', '234']
['ger', 'dsf', 'asd', '[', '243']
['gwer', 'af', 'as.:', '^', '25a']

['gee', 'fre', 'asd', '[', '234', 'geefre', '2344']
['ger', 'dsf', 'asd', '[', '243', 'gerdsf', '2433']
['gwer', 'af', 'as.:', '^', '25a', 'gweraf', '25aa']
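For a 450,000-line file you probably don't want to scan the data twice as the
demo above does - a single streaming pass that augments each record as it is
yielded should be enough.  Here is a minimal sketch along those lines,
reusing processData from above; the file names "xyzzy.txt" and
"xyzzy_out.txt" and the augmentData helper are only placeholders, and the two
computed fields simply repeat the toy computations from the example, so
substitute whatever your real fields need.

=================
# minimal single-pass sketch - assumes processData from above is in scope;
# "xyzzy.txt", "xyzzy_out.txt" and augmentData are placeholders
def augmentData(toks):
    # derive the extra fields from the parsed tokens
    # (same toy computations as in the example output above)
    return toks + [ toks[0] + toks[1], toks[-1] + toks[-1][-1] ]

def main():
    infile = open("xyzzy.txt")
    outfile = open("xyzzy_out.txt", "w")
    try:
        # strip trailing newlines so they don't end up inside the last field
        lines = ( line.rstrip("\n") for line in infile )
        for lineParts in processData(lines):
            if isinstance(lineParts, list):
                outfile.write(" ".join(augmentData(lineParts)) + "\n")
            else:
                # lines with no recognized separator pass through unchanged
                outfile.write(lineParts + "\n")
    finally:
        infile.close()
        outfile.close()

if __name__ == "__main__":
    main()
=================

Since the generator never holds more than one line at a time, this reads the
file only once and keeps memory use flat no matter how many lines there are.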