Hi, if you got the solution, please let me know as well. I have to implement this ASAP.

On Wednesday, 9 March 2011 23:43:26 UTC+5:30, Cross wrote:
> On 03/09/2011 01:21 AM, Vlastimil Brom wrote:
> > 2011/3/8 Cross <x...@x.tv>:
> >> On 03/08/2011 06:09 PM, Heather Brown wrote:
> >>>
> >>> The keywords are an attribute in a tag called <meta>, in the section
> >>> called <head>. Are you having trouble parsing the xhtml to that point?
> >>>
> >>> Be more specific in your question, and somebody is likely to chime in.
> >>> Although I'm not the one, if it's a question of parsing the xhtml.
> >>>
> >>> DaveA
> >>
> >> I know meta tags contain keywords, but they are not always reliable. I can
> >> parse xhtml to obtain keywords from meta tags, but how do I verify them? To
> >> obtain reliable keywords, I have to parse the plain text obtained from the
> >> URL.
> >>
> >> Cross
> >
> > Hi,
> > if you need to extract meaningful keywords in terms of data mining
> > using natural language processing, it might become quite a complex
> > task, depending on the requirements; the NLTK toolkit may help with
> > some approaches [ http://www.nltk.org/ ].
> > One possibility would be to filter out the more frequent and less
> > meaningful words ("stopwords") and extract the more frequent words
> > from the remainder, e.g. (with some simplifications/hacks in the
> > interactive mode):
> >
> >>>> import re, urllib2, nltk
> >>>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
> >>>> page_plain = nltk.clean_html(page_src).lower()
> >>>> txt_filtered = nltk.Text((word for word in re.findall(r"(?u)\w+", page_plain)
> >>>>                           if word not in set(nltk.corpus.stopwords.words("english"))))
> >>>> frequency_dist = nltk.FreqDist(txt_filtered)
> >>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]
> > [(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
> > (u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
> > (u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
> > (u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
> > (u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
> > (u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help', 3),
> > (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
> > (u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability', 3),
> > (u'readable', 3), (u'write', 3)]
> >
> > Another possibility would be to extract parts of speech (e.g. nouns,
> > adjectives, verbs) using e.g. nltk.pos_tag(input_txt) etc.;
> > for more convoluted html code, e.g., BeautifulSoup might be used, and
> > there are likely many other options.
> >
> > hth,
> > vbr
>
> I had considered nltk; that is why I said that a straightforward frequency
> count of words would be naive. I have to look into this BeautifulSoup thing.
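
For what it's worth, the interactive example above targets Python 2 and an older NLTK (nltk.clean_html has since been removed), so here is a rough sketch of the same stopword-filtering idea using BeautifulSoup, which vbr also mentions, to strip the markup. It assumes Python 3 with the requests, beautifulsoup4 and nltk packages installed and the NLTK stopword corpus downloaded via nltk.download("stopwords"); treat it as a starting point, not a tested solution.

import re
from collections import Counter

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

url = "http://www.python.org/doc/essays/foreword/"
page_src = requests.get(url).text

# Drop script/style blocks, then take the visible text of the page.
soup = BeautifulSoup(page_src, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
page_plain = soup.get_text(" ").lower()

# Keep alphabetic tokens that are not English stopwords and count them.
stop = set(stopwords.words("english"))
words = (w for w in re.findall(r"[a-z]+", page_plain) if w not in stop)
frequency_dist = Counter(words)

# Candidate keywords: words occurring more than twice, most frequent first.
print([(word, freq) for word, freq in frequency_dist.most_common() if freq > 2])

The freq > 2 cutoff just mirrors vbr's example; as Cross notes, a raw frequency count is a naive measure, so the output is best treated as candidate keywords to refine further rather than a final answer.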