On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
> First up, thank you for a well described problem! Remarks inline below.
> 
> On 24May2018 03:13, wrote:
> >I have a text as,
> >
> >"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption of 
> >Kilauea volcano in Hawaii sparked new safety warnings about toxic gas on the 
> >Big Island's southern coastline after lava began flowing into the ocean and 
> >setting off a chemical reaction. Lava haze is made of dense white clouds of 
> >steam, toxic gas and tiny shards of volcanic glass. Janet Babb, a geologist 
> >with the Hawaiian Volcano Observatory, says the plume "looks innocuous, but 
> >it's not." "Just like if you drop a glass on your kitchen floor, there's 
> >some large pieces and there are some very, very tiny pieces," Babb said. 
> >"These little tiny pieces are the ones that can get wafted up in that steam 
> >plume." Scientists call the glass Limu O Pele, or Pele's seaweed, named 
> >after the Hawaiian goddess of volcano and fire"
> >
> >and I want to see its tagged output as,
> >
> >"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The 
> >eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety 
> >warnings about toxic gas on the Big Island's southern coastline after lava 
> >began flowing into the ocean and setting off a chemical reaction. Lava haze 
> >is made of dense white clouds of steam, toxic gas and tiny shards of 
> >volcanic glass. Janet/TAG Babb/TAG, a geologist with the Hawaiian/TAG 
> >Volcano/TAG Observatory/TAG, says the plume "looks innocuous, but it's not." 
> >"Just like if you drop a glass on your kitchen floor, there's some large 
> >pieces and there are some very, very tiny pieces," Babb/TAG said. "These 
> >little tiny pieces are the ones that can get wafted up in that steam plume." 
> >Scientists call the glass Limu/TAG O/TAG Pele/TAG, or Pele's seaweed, named 
> >after the Hawaiian goddess of volcano and fire"
> >
> >To do this I generally try to take a list at the back end as,
> >
> >Hawaii
> >PAHOA
> >Kilauea
> >volcano
> >Janet
> >Babb
> >Hawaiian
> >Volcano
> >Observatory
> >Babb
> >Limu
> >O
> >Pele
> >
> >and do a simple code as follows,
> >
> >def tag_text():
> >    corpus=open("/python27/volcanotxt.txt","r").read().split()
> >    wordlist=open("/python27/taglist.txt","r").read().split()
> 
> You might want to use this to compose "wordlist":
> 
>      wordlist=set(open("/python27/taglist.txt","r").read().split())
> 
> because it will make your "if word in wordlist" test O(1) instead of O(n), 
> which will matter later if your wordlist grows.
> 
> >    list1=[]
> >    for word in corpus:
> >        if word in wordlist:
> >            word_new=word+"/TAG"
> >            list1.append(word_new)
> >        else:
> >            list1.append(word)
> >    lst1=list1
> >    tagged_text=" ".join(lst1)
> >    print tagged_text
> >
> >get the results and hand repair unwanted tags Hawaiian/TAG goddess of 
> >volcano/TAG.
> >I am looking for a better approach of coding so that I need not spend time 
> >on hand repairing.
> 
> It isn't entirely clear to me why these two taggings are unwanted. 
> Intuitively, they seem to be either because "Hawaiian goddess" is a compound 
> term where you don't want "Hawaiian" to get a tag, or because "Hawaiian" has 
> already received a tag earlier in the list. Or are there other criteria?
> 
> If you want to solve this problem with a programme you must first clearly 
> define what makes an unwanted tag "unwanted".
> 
> For example, "Hawaiian" is an adjective, and therefore will always be part 
> of a compound term.
> 
> Can you clarify what makes these taggings you mention "unwanted"?
> 
> Cheers,
> 
Sir, thank you for taking the time to write such a helpful reply.

By "unwanted" I did not mean anything so intricate; I simply meant tags
that I did not want.
For example, if my target phrases included a term like
government of Mexico,

then my list would contain the words
government
of
Mexico

With these words in the list, the code would tag
government/TAG of/TAG Mexico/TAG

but it would also tag every other "of" anywhere in the text, as in
haze is made of/TAG dense white clouds of/TAG steam, etc.

Cleaning up these unwanted tags becomes a daunting task for me.

I have been experimenting with a list of (phrase, replacement) pairs,
applied to the whole text held as a single string in corpus:

wordlist=[("Kilauea volcano","Kilauea/TAG volcano/TAG"),("Hawaii","Hawaii/TAG"),...]
tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)

This gives me reasonably good results, but the size of the wordlist is a
slight concern.
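
One possible way to keep a phrase-level list without writing out every
replacement pair by hand is sketched below. It is only an illustration under
some assumptions: it is written for Python 3, the names build_tagger and
phrases are hypothetical, and it assumes each phrase should be tagged only
when it appears as a whole phrase. It compiles all phrases into one regular
expression, longest first, and requires a non-word character (or the text
boundary) on both sides. That also avoids a pitfall of plain str.replace,
which matches substrings, so "Hawaii" would be replaced inside "Hawaiian".

```python
import re

def build_tagger(phrases, tag="TAG"):
    """Return a function that tags whole-phrase occurrences in a text.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Longest phrases first, so "Kilauea volcano" is tried before "volcano".
    ordered = sorted(set(phrases), key=len, reverse=True)
    # (?<!\w) and (?!\w) stop a phrase from matching inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in ordered) + r")(?!\w)"
    )

    def tag_match(match):
        # Append "/TAG" to every word of the matched phrase.
        return " ".join(word + "/" + tag for word in match.group(0).split())

    return lambda text: pattern.sub(tag_match, text)

phrases = ["government of Mexico", "Kilauea volcano", "Hawaii"]
tagger = build_tagger(phrases)
print(tagger("The government of Mexico said haze is made of steam "
             "near Kilauea volcano in Hawaii."))
```

With these three phrases the call prints

The government/TAG of/TAG Mexico/TAG said haze is made of steam near Kilauea/TAG volcano/TAG in Hawaii/TAG.

so the standalone "of" is left alone, and punctuation directly after a phrase
does not prevent the match. The list then holds only the phrases themselves,
one per entry, rather than a pair for each.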

-- 
https://mail.python.org/mailman/listinfo/python-list
