Re: Finding keywords

Matt Chaput Tue, 08 Mar 2011 11:14:34 -0800

On 08/03/2011 8:58 AM, Cross wrote:

I know meta tags contain keywords but they are not always reliable. I
can parse xhtml to obtain keywords from meta tags; but how do I verify
them. To obtain reliable keywords, I have to parse the plain text
obtained from the URL.

I think maybe what the OP is asking about is extracting key words from a text, i.e. a short list of words that characterize the text. This is an information retrieval problem, not really a Python problem.

One simple way to do this is to calculate word frequency histograms for each document in your corpus, and then for a given document, select words that are frequent in that document but infrequent in the corpus as a whole. Whoosh does this. There are different ways of calculating the importance of words, and stemming and conflating synonyms can give you better results as well.
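To make the frequency idea concrete, here is a rough sketch using only the standard library. It is not how Whoosh implements it; the function names (tokenize, keywords), the regex tokenizer, and the smoothed log weighting are all assumptions chosen for illustration. It scores each word by its count in the document multiplied by a factor that shrinks as the word appears in more documents of the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def keywords(doc, corpus, top_n=5):
    """Return the top_n words of `doc`, ranked by frequency in `doc`
    weighted against how many documents in `corpus` contain the word."""
    # Document frequency: in how many corpus documents does each word occur?
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))

    n_docs = len(corpus)
    term_freq = Counter(tokenize(doc))

    scores = {}
    for word, count in term_freq.items():
        # Smoothed inverse document frequency (an illustrative choice):
        # a word found in every document gets a weight of zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        scores[word] = count * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the whoosh index stores the index terms",
]
top = keywords(corpus[2], corpus, top_n=3)
```

With a corpus this small the ranking is only a toy, but it shows the effect: "the" is frequent in every document and so scores zero, while "index" is frequent in one document only and ranks first. Stemming or synonym conflation would be applied in tokenize before counting.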

A more sophisticated method uses "part of speech" tagging. See the Python Natural Language Toolkit (NLTK) and topia.termextract for more information.


http://pypi.python.org/pypi/topia.termextract/

Yahoo has a web service for key word extraction:

http://developer.yahoo.com/search/content/V1/termExtraction.html

You might want to investigate these resources and try google searches for e.g. "extracting key terms from documents" and then come back if you have a question about the Python implementation.


Cheers,

Matt
--
http://mail.python.org/mailman/listinfo/python-list
