On Jun 23, 1:43 am, Eric <[EMAIL PROTECTED]> wrote:
> On Jun 21, 6:03 pm, John Machin <[EMAIL PROTECTED]> wrote:
> > On Jun 22, 4:43 am, Eric <[EMAIL PROTECTED]> wrote:
> > > On Jun 21, 9:47 am, cjl <[EMAIL PROTECTED]> wrote:
> > > > P:
> > > >
> > > > I am working on a project that requires geocoding, and have written a
> > > > very simple geocoder that uses the Google service.
> > > >
> > > > I would like to be able to extract the name of the street from the
> > > > addresses in my data, however they vary significantly. Here are some
> > > > examples:
> > > >
> > > > 25 Main St
> > > > 2500 14th St
> > > > 12 Bennet Pkwy
> > > > Pearl St
> > > > Bennet Rd and Main st
> > > > 19th St
> > > >
> > > > As you can see, sometimes I have the house number, and sometimes I do
> > > > not. Sometimes the street name is a number. Sometimes I simply have
> > > > the names of intersecting streets.
> > > >
> > > > I would like to be able to parse the above into the following:
> > > >
> > > > Main St
> > > > 14th St
> > > > Bennet Pkwy
> > > > Pearl St
> > > > Bennet Rd
> > > > Main St
> > > > 19th St
> > > >
> > > > How might I approach this complex parsing problem?
> > > >
> > > > -CJL
> > >
> > > You might be able to use consistencies in your data to make this
> > > simpler. If the examples you have there are representative, it looks
> > > like what you should do is look for a word like 'St' or 'Rd' and then
> > > return that word and the previous word.
> >
> > The OP's data already contains
> > [corner|cnr [of]] Foo Rd and|& Bar St
> > and real world data will contain things like
> > 1234 John F Kennedy Memorial Drive
> > 456 Broadway
> >
> > As Paul wrote, "Parsing street addresses is a very complex parsing
> > problem", even when you restrict yourself to one mostly-English-
> > speaking country. Software written under such restrictions rapidly
> > breaks down elsewhere (Rue de la Paix, Wilhelmstrasse, Avenida 9 de
> > Julio, etc) and blows up altogether when street names aren't used in
> > postal addresses (e.g. Japan).
>
> No doubt that address parsing is, in general, a very difficult
> problem. However, it may not be necessary for him to solve the
> general problem. If his dataset is more limited in formats then his
> problem is much simpler.

Ignore the last sentence of my post. Restrict the application to
[sub]urban addresses in the USA. If the OP's dataset is real-world
data, it will contain cases of street addresses that don't fit "look
for a word like 'St' or 'Rd' and then return that word and the
previous word."

To expect an OP in a newsgroup to provide representative examples is
charmingly naive :-) and in any case the OP had already provided a
corner case [pun intended] that busted your rule.

Cheers,
John
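
For anyone who wants to see the disagreement concretely, here is a minimal sketch of the "find a suffix word, return it plus the previous word" rule, run against addresses quoted in this thread. The function name naive_street_name and the SUFFIXES set are made up for illustration; nobody in the thread posted code, and a real suffix list would be far longer.

```python
# Hypothetical, deliberately short suffix list (an assumption for this sketch).
SUFFIXES = {"st", "rd", "pkwy", "drive"}


def naive_street_name(address):
    """Return '<previous word> <suffix word>' for the first suffix found.

    Returns None when no suffix word with a preceding word is present.
    """
    words = address.split()
    # Start at 1 so there is always a previous word to pair with the suffix.
    for i in range(1, len(words)):
        if words[i].lower() in SUFFIXES:
            return words[i - 1] + " " + words[i]
    return None


# Cases the rule handles:
print(naive_street_name("25 Main St"))      # Main St
print(naive_street_name("2500 14th St"))    # 14th St
print(naive_street_name("Pearl St"))        # Pearl St

# Cases from the thread where it falls short:
print(naive_street_name("Bennet Rd and Main st"))               # Bennet Rd
print(naive_street_name("1234 John F Kennedy Memorial Drive"))  # Memorial Drive
print(naive_street_name("456 Broadway"))                        # None
```

The intersection loses its second street, the multi-word name is cut down to its last two words, and a street with no suffix at all is not found. Collecting every match instead of the first would rescue the intersection, but not the other two.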