You have a lot of choices with this sort of thing. What you'd use depends largely on what sorts of files/input you'll be parsing.
For example, a common machine-friendly data format is the comma-separated file. These, or really any file that uses a character-based field separator (including newline characters), are usually best read with something like split([separator]), which returns a list of the fields in the string you call it on. Example:

    >>> line = 'Brian,student,California,555-0127'
    >>> tokens = line.split(',')
    >>> print(tokens)
    ['Brian', 'student', 'California', '555-0127']

More helpful is if the file has a header line:

    >>> data = ('Name,Occupation,Location,Phone\n'
    ...         'Brian,student,California,555-0127\n'
    ...         'Ann,carpenter,Georgia,555-3825')
    >>> entries = data.split('\n')
    >>> print(entries)
    ['Name,Occupation,Location,Phone', 'Brian,student,California,555-0127', 'Ann,carpenter,Georgia,555-3825']
    >>> header = entries[0]
    >>> entries = entries[1:]
    >>> for entry in entries:
    ...     tokens = entry.split(',')
    ...     # do whatever you need with tokens here
    ...

A more powerful tool is the regular expression engine, which is something you'll be using quite a lot if you get into heavy text parsing. Some people have described it as its own mini-language, but it is by no means Python-specific: Perl, Java, various Unix shells, and others all have a roughly equivalent setup.

Python's regular expression engine is very object-oriented. As a simple primer, you first import the re module, make a Pattern object with re.compile(), and then call one of the Pattern's several methods for parsing a line.
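To make that primer concrete, here's a rough sketch of the compile-then-call workflow. The sample string and the phone-number pattern are my own made-up illustration (reusing the phone numbers from the header example), not part of any real file format:

```python
import re

# Hypothetical input line and pattern, invented for this sketch.
line = "Brian: 555-0127, Ann: 555-3825"
phone = re.compile(r"\d{3}-\d{4}")   # compile once, reuse as often as needed

first = phone.search(line)      # first match anywhere in the string, or None
at_start = phone.match(line)    # succeeds only if the match begins at index 0
numbers = phone.findall(line)   # all non-overlapping matches, as strings

print(first.group())   # 555-0127
print(at_start)        # None, because the line starts with "Brian"
print(numbers)         # ['555-0127', '555-3825']
```

search() is usually what you want when the interesting bit can sit anywhere in the line; match() only succeeds if the pattern matches starting at the very first character.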
A common example, where you want to know the version of the Globus software installed from reading a filename:

    >>> import re
    >>> path = '/users/username/globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so'
    >>> version = re.compile(r"\d\.\d\.\d")
    >>> print(version)
    <_sre.SRE_Pattern object at 0x400329b0>
    >>> version.search(path)
    <_sre.SRE_Match object at 0x40075218>
    >>> version.search(path).group()
    '4.0.2'

(The exact object reprs vary between Python versions; the result of group() is the part that matters.)

This seems a bit overblown just to find the version number. After all, we could have just split path on '/' to make a token list, grabbed token 4, split again on '-', and taken token 1. The advantage of regular expressions is that they're very flexible. The same pattern would work on any of the following:

    /users/username/globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so
    ../../globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so
    ftp ftp.server.com -e 'get globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so

and so on.

The moral is that when the string you're parsing is fairly regular, use something like split(); when it can vary a lot, use regular expressions. split() is, as you may expect, quite a bit faster.

I should stress that this is a very barebones example, and doesn't even begin to scratch the surface of regular expressions' power. It's also a little too general, as any string fragment of the form [digit].[digit].[digit] will match here.

An excellent resource on regular expressions in Python (I believe lifted from the original documentation, but I digress):

    http://www.amk.ca/python/howto/regex/

XML is another common format to have to go through, but I don't have much experience in this area. If memory serves, Python's standard library comes with XML parsers (the xml package, xml.dom.minidom for example) that build a tree of nodes from any XML file you give them. Hopefully others can fill in on that part.

Also, don't be afraid of having the interpreter open next to your editor of choice, and of running test patterns through any parsing code you're writing.
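As a rough sketch of what "running test patterns" can look like, here is the version pattern from the Globus example tried against a handful of inputs. I've written it as a raw string with the dots escaped so that they match only a literal '.'. The first two samples come from the list above; the last two are made up by me to probe where the pattern breaks down:

```python
import re

# Version pattern: digit, literal dot, digit, literal dot, digit.
version = re.compile(r"\d\.\d\.\d")

samples = [
    "/users/username/globus/globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so",
    "../../globus-4.0.2/lib/libglobus_gridftp_server_gcc32.so",
    "backup-192.168.1.10/globus-4.0.2/lib",          # hypothetical: an IP address comes first
    "globus/lib/libglobus_gridftp_server_gcc32.so",  # hypothetical: no version at all
]

results = []
for s in samples:
    m = version.search(s)                     # search() returns None when nothing matches
    results.append(m.group() if m else None)

for s, r in zip(samples, results):
    print("%r -> %r" % (s, r))
```

The third sample comes back as '8.1.1', a fragment of the IP address rather than the Globus version, which is exactly the "too general" problem mentioned earlier. The fourth comes back as None, so any real code needs to check the result of search() before calling group().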
Regular expressions in particular are very easy to screw up, no matter how long you've been using them.

bio_enthusiast wrote:
> I was wondering exactly how you create a parser. I'm learning
> Python and I recently have come across this material. I'm interested
> in the method or art of writing a parser.
>
> If anyone has some python code to post for an abstract parser, or links
> to some informative tutorials, that would be great.

--
http://mail.python.org/mailman/listinfo/python-list