> Don't forget to write test cases. If you have a series of addresses,
> and confirm they are parsed correctly, you are in a good position to
> refine the pattern. You will instantly know if a change in pattern has
> broken another pattern.
>
> The reason I'm saying this, is because I think your pattern is
> incomplete. I suggest you add a test case for the following street
> address:
>
> 221B Baker Street

There are a number of weird street names and addresses that one may
need to handle. In the police applications I've worked with, an
address is usually broken into the BLOCK, DIRECTION, STREET, SUFFIX and
APARTMENT/SUITE. However, there are complications...

Block can include things like

  1234 1/2  (an actual street format from one of our test cases,
             where two block numbers were divided to make room)
  221B      (though this might be a block + apartment)

Directions can include not only your cardinal N/S/E/W directions
(written out or abbreviated, with or without punctuation), but also
8-point directions or more, such as NW, Northwest, north-west, etc.
It wouldn't even surprise me if locations with 16-point directions
exist (NNW).

The Street portion is often whatever is left over once the rest has
been parsed out.

The Suffix would be "Rd", "Road", "St", "Ave", "Cir", "Bvd", "Blvd",
"Row", "Hwy", "Highway", etc. There are about 30 of them that we used
by default, but I'm sure there are some unusual ones as well.

There are wrinkles in even the above: here in the Dallas area, we have
a "Northwest Highway" where Northwest is the street name of the road,
not the Direction portion.

I second Goldfish's suggestion for making a suite of both normal and
abnormal addresses along with their expected breakdowns. Depending on
how normalized you want them to be, you may have to deal with
punctuation and spacing abnormalities as well.

-tkc

--
http://mail.python.org/mailman/listinfo/python-list
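
To make the "suite of addresses with expected breakdowns" idea concrete, here is a rough sketch of how it could look. Everything in it is an assumption for illustration: the parse_address() function, the Address tuple, and the short DIRECTIONS/SUFFIXES/UNIT_WORDS tables are hypothetical names, not anyone's production code, and the lists are deliberately tiny. It splits on whitespace rather than using one big regex, and it only peels off a direction or suffix if a street name would still be left, which is one possible way to cope with the "Northwest Highway" case.

```python
import re
from typing import NamedTuple, Optional

# Illustrative subsets only (assumed); real lists would be much longer.
DIRECTIONS = {
    "n": "N", "s": "S", "e": "E", "w": "W",
    "north": "N", "south": "S", "east": "E", "west": "W",
    "ne": "NE", "nw": "NW", "se": "SE", "sw": "SW",
    "northeast": "NE", "northwest": "NW",
    "southeast": "SE", "southwest": "SW",
}
SUFFIXES = {"rd", "road", "st", "street", "ave", "avenue", "cir",
            "circle", "blvd", "boulevard", "row", "hwy", "highway"}
UNIT_WORDS = {"apt", "apartment", "suite", "ste", "unit"}


class Address(NamedTuple):
    block: Optional[str]
    direction: Optional[str]
    street: str
    suffix: Optional[str]
    unit: Optional[str]


def _key(token):
    # Normalize for table lookup: "N." -> "n", "north-west" -> "northwest".
    return re.sub(r"[.\-]", "", token).lower()


def parse_address(text):
    tokens = text.replace(",", " ").split()
    block = direction = suffix = unit = None

    # Block: leading number with an optional letter, then an optional fraction.
    # "221B" is kept whole here; it might really be block + apartment.
    if tokens and re.fullmatch(r"\d+[A-Za-z]?", tokens[0]):
        block = tokens.pop(0)
        if tokens and re.fullmatch(r"\d+/\d+", tokens[0]):
            block += " " + tokens.pop(0)

    # Unit: a trailing "Apt 4", "Suite 200", and so on.
    if len(tokens) >= 3 and _key(tokens[-2]) in UNIT_WORDS:
        unit = tokens.pop()
        tokens.pop()  # discard the unit keyword itself

    # Suffix: the last token, but only if a street name would remain.
    if len(tokens) >= 2 and _key(tokens[-1]) in SUFFIXES:
        suffix = tokens.pop()

    # Direction: the first token, again only if a street name would remain.
    if len(tokens) >= 2 and _key(tokens[0]) in DIRECTIONS:
        direction = DIRECTIONS[_key(tokens.pop(0))]

    # Street: whatever is left over.
    return Address(block, direction, " ".join(tokens), suffix, unit)


# The test suite: each raw address paired with its expected breakdown.
CASES = [
    ("221B Baker Street",
     Address("221B", None, "Baker", "Street", None)),
    ("1234 1/2 N. Main St",
     Address("1234 1/2", "N", "Main", "St", None)),
    ("1234 Northwest Highway",
     Address("1234", None, "Northwest", "Highway", None)),
    ("500 north-west Elm Rd Apt 4",
     Address("500", "NW", "Elm", "Rd", "4")),
]

for raw, expected in CASES:
    assert parse_address(raw) == expected, (raw, parse_address(raw))
```

The point is less the parser than the CASES table: every odd address you run into becomes one more row, and any change to the parsing rules that breaks an earlier row shows up immediately.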