DFS wrote:

> On 5/5/2016 1:39 AM, Stephen Hansen wrote:
>
>> Given:
>>
>>>>> input = [u'Espa\xf1ol', 'Health & Fitness Clubs (36)', 'Health Clubs &
>>>>> Gymnasiums (42)', 'Health Fitness Clubs', 'Name', 'Atlanta city
>>>>> guide', 'edit address', 'Tweet', 'PHYSICAL FITNESS CONSULTANTS &
>>>>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS',
>>>>> 'www.custombuiltpt.com/', 'RACQUETBALL COURTS PRIVATE',
>>>>> 'www.lafitness.com', 'GYMNASIUMS', 'HEALTH & FITNESS CLUBS',
>>>>> 'www.lafitness.com', 'HEALTH & FITNESS CLUBS', 'www.lafitness.com',
>>>>> 'PERSONAL FITNESS TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE &
>>>>> PHYSICAL FITNESS PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS &
>>>>> GYMNASIUMS', 'HEALTH CLUBS & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS',
>>>>> '5', '4', '3', '2', '1', 'Yellow Pages', 'About Us', 'Contact Us',
>>>>> 'Support', 'Terms of Use', 'Privacy Policy', 'Advertise With Us', 'Add
> /Update Listing', 'Business Profile Login', 'F.A.Q.']
>>
>> Then:
>>
>>>>> pattern = re.compile(r"^[A-Z\s&]+$")
>>>>> output = [x for x in list if pattern.match(x)]
>>>>> output
>
>> ['PHYSICAL FITNESS CONSULTANTS & TRAINERS', 'HEALTH CLUBS & GYMNASIUMS',
>> 'HEALTH CLUBS & GYMNASIUMS', 'RACQUETBALL COURTS PRIVATE', 'GYMNASIUMS',
>> 'HEALTH & FITNESS CLUBS', 'HEALTH & FITNESS CLUBS', 'PERSONAL FITNESS
>> TRAINERS', 'HEALTH CLUBS & GYMNASIUMS', 'EXERCISE & PHYSICAL FITNESS
>> PROGRAMS', 'FITNESS CENTERS', 'HEALTH CLUBS & GYMNASIUMS', 'HEALTH CLUBS
>> & GYMNASIUMS', 'PERSONAL FITNESS TRAINERS']
>
>
> Should've looked earlier. Their master list of categories
> http://www.usdirectory.com/cat/g0 shows a few commas, a bunch of dashes,
> and the ampersands we talked about.
>
> "OFFICE SERVICES, SUPPLIES & EQUIPMENT" gets removed because of the comma.
>
> "AUTOMOBILE - DEALERS" gets removed because of the dash.
>
> I updated your regex and it seems to have fixed it.
>
> orig: (r"^[A-Z\s&]+$")
> new : (r"^[A-Z\s&,-]+$")
>
>
> Thanks again.

If there is a "master list", compare your candidates against it instead of
using a heuristic, i.e.

categories = set(master_list)
output = [category for category in input if category in categories]

You can find the categories with

>>> import urllib.request
>>> import bs4
>>> soup = bs4.BeautifulSoup(urllib.request.urlopen("http://www.usdirectory.com/cat/g0").read())
>>> categories = set()
>>> for li in soup.find_all("li"):
...     assert li.parent.parent["class"][0].startswith("category_items")
...     categories.add(li.text)
...
>>> print("\n".join(sorted(categories)[:10]))
Accounting & Bookkeeping Services
Adoption Services
Adult Entertainment
Advertising
Agricultural Equipment & Supplies
Agricultural Production
Agricultural Services
Aids Resources
Aircraft Charters & Rentals
Aircraft Dealers & Services
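
One caveat: the scraped names above print in title case, while the candidates in your list are upper case, so a plain "category in categories" test may reject everything. Below is a minimal sketch of a case-insensitive comparison. The helper name filter_by_master_list and the sample data are made up for illustration, and I am assuming that case and surrounding whitespace are the only differences between the two lists.

```python
def filter_by_master_list(candidates, master_list):
    """Keep the candidates that occur in master_list.

    The comparison ignores case and leading/trailing whitespace;
    the candidates are returned unchanged and in their original order.
    """
    known = {name.strip().casefold() for name in master_list}
    return [c for c in candidates if c.strip().casefold() in known]


# Hypothetical sample data, not scraped from the real site.
master_list = [
    "Health Clubs & Gymnasiums",
    "Automobile - Dealers",
    "Office Services, Supplies & Equipment",
]
candidates = [
    "HEALTH CLUBS & GYMNASIUMS",
    "www.lafitness.com",
    "AUTOMOBILE - DEALERS",
    "Tweet",
    "Yellow Pages",
]

print(filter_by_master_list(candidates, master_list))
# ['HEALTH CLUBS & GYMNASIUMS', 'AUTOMOBILE - DEALERS']
```

Unlike the regex, this needs no special handling for commas, dashes or ampersands, and it drops all-caps strings that are not categories at all.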