On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
> First up, thank you for a well described problem! Remarks inline below.
>
> On 24May2018 03:13, wrote:
> >I have a text as,
> >
> >"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption of
> >Kilauea volcano in Hawaii sparked new safety warnings about toxic gas on the
> >Big Island's southern coastline after lava began flowing into the ocean and
> >setting off a chemical reaction. Lava haze is made of dense white clouds of
> >steam, toxic gas and tiny shards of volcanic glass. Janet Babb, a geologist
> >with the Hawaiian Volcano Observatory, says the plume "looks innocuous, but
> >it's not." "Just like if you drop a glass on your kitchen floor, there's
> >some large pieces and there are some very, very tiny pieces," Babb said.
> >"These little tiny pieces are the ones that can get wafted up in that steam
> >plume." Scientists call the glass Limu O Pele, or Pele's seaweed, named
> >after the Hawaiian goddess of volcano and fire"
> >
> >and I want to see its tagged output as,
> >
> >"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The
> >eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety
> >warnings about toxic gas on the Big Island's southern coastline after lava
> >began flowing into the ocean and setting off a chemical reaction. Lava haze
> >is made of dense white clouds of steam, toxic gas and tiny shards of
> >volcanic glass. Janet/TAG Babb/TAG, a geologist with the Hawaiian/TAG
> >Volcano/TAG Observatory/TAG, says the plume "looks innocuous, but it's not."
> >"Just like if you drop a glass on your kitchen floor, there's some large
> >pieces and there are some very, very tiny pieces," Babb/TAG said. "These
> >little tiny pieces are the ones that can get wafted up in that steam plume."
> >Scientists call the glass Limu/TAG O/TAG Pele/TAG, or Pele's seaweed, named
> >after the Hawaiian goddess of volcano and fire"
> >
> >To do this I generally try to take a list at the back end as,
> >
> >Hawaii
> >PAHOA
> >Kilauea
> >volcano
> >Janet
> >Babb
> >Hawaiian
> >Volcano
> >Observatory
> >Babb
> >Limu
> >O
> >Pele
> >
> >and do a simple code as follows,
> >
> >def tag_text():
> >    corpus=open("/python27/volcanotxt.txt","r").read().split()
> >    wordlist=open("/python27/taglist.txt","r").read().split()
>
> You might want use this to compose "wordlist":
>
>     wordlist=set(open("/python27/taglist.txt","r").read().split())
>
> because it will make your "if word in wordlist" test O(1) instead of O(n),
> which will matter later if your wordlist grows.
>
> >    list1=[]
> >    for word in corpus:
> >        if word in wordlist:
> >            word_new=word+"/TAG"
> >            list1.append(word_new)
> >        else:
> >            list1.append(word)
> >    lst1=list1
> >    tagged_text=" ".join(lst1)
> >    print tagged_text
> >
> >get the results and hand repair unwanted tags Hawaiian/TAG goddess of
> >volcano/TAG.
> >I am looking for a better approach of coding so that I need not spend time on
> >hand repairing.
>
> It isn't entirely clear to me why these two taggings are unwanted. Intuitively,
> they seem to be either because "Hawaiian goddess" is a compound term where you
> don't want "Hawaiian" to get a tag, or because "Hawaiian" has already received
> a tag earlier in the list. Or are there other criteria.
>
> If you want to solve this problem with a programme you must first clearly
> define what makes an unwanted tag "unwanted".
>
> For example, "Hawaiian" is an adjective, and therefore will always be part of a
> compound term.
>
> Can you clarify what makes these taggings you mention "unwanted"?
>
> Cheers,
>

Sir,
Thank you for your kind time to write such a nice reply.
By unwanted I did not mean anything so intricate; I simply meant tags that I did not want.

For example, suppose my target phrases include a term like

government of Mexico

My list would then contain the words

government
of
Mexico

With these words in the list, the code would tag government/TAG of/TAG Mexico/TAG, but it would also tag every other "of" anywhere in the text, as in "haze is made of/TAG dense white" or "clouds of/TAG steam". Cleaning up these unwanted tags becomes a daunting task for me.

I have been experimenting with

wordlist=[("Kilauea volcano","Kilauea/TAG volcano/TAG"),("Hawaii","Hawaii/TAG"),...]
tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)

which is giving me reasonably good results, but the size of the wordlist is a slight concern.
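
One possible way to tag whole phrases without also tagging every stray "of" is to match the phrases themselves instead of their individual words. The following is only a minimal sketch, written for Python 3 and not taken from the thread: the function name tag_phrases, its parameters, and the sample sentence are hypothetical. It assumes the words of a phrase are separated by single spaces in the text, and it builds one compiled regular expression from the phrase list, so the text is scanned once instead of once per (phrase, replacement) pair.

```python
import re


def tag_phrases(text, phrases, tag="/TAG"):
    """Append `tag` to each word of every listed phrase found in `text`."""
    # Try longer phrases first so "Kilauea volcano" wins over "Kilauea".
    ordered = sorted(set(phrases), key=len, reverse=True)
    if not ordered:
        return text
    # (?<!\w) and (?!\w) restrict matches to whole words, so the phrase
    # "Hawaii" does not match inside "Hawaiian".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in ordered) + r")(?!\w)"
    )

    def add_tags(match):
        # Tag each word of the matched phrase separately.
        return " ".join(word + tag for word in match.group(0).split())

    return pattern.sub(add_tags, text)


# Hypothetical sample data, for illustration only.
phrases = ["government of Mexico", "Kilauea volcano", "Hawaii"]
text = ("The government of Mexico said the Kilauea volcano in Hawaii "
        "made clouds of steam near the Hawaiian coast.")
print(tag_phrases(text, phrases))
```

Here the "of" in "clouds of steam" stays untagged because it is not part of a listed phrase, and "Hawaiian" is left alone because only the whole word "Hawaii" is in the list. For a very large phrase list a single alternation may still be slow, so this should be treated as a starting point and not a tuned solution.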