Hi, if you found a solution, please let me know as well. I need to implement this as soon as possible.
On Wednesday, 9 March 2011 23:43:26 UTC+5:30, Cross  wrote:
> On 03/09/2011 01:21 AM, Vlastimil Brom wrote:
> > 2011/3/8 Cross<x...@x.tv>:
> >> On 03/08/2011 06:09 PM, Heather Brown wrote:
> >>>
> >>> The keywords are an attribute in a tag called <meta>, in the section
> >>> called <head>. Are you having trouble parsing the xhtml to that point?
> >>>
> >>> Be more specific in your question, and somebody is likely to chime in,
> >>> although I'm not the one if it's a question of parsing the xhtml.
> >>>
> >>> DaveA
> >>
> >> I know meta tags contain keywords but they are not always reliable. I can
> >> parse xhtml to obtain keywords from meta tags; but how do I verify them? To
> >> obtain reliable keywords, I have to parse the plain text obtained from the
> >> URL.
> >>
> >> Cross
> >>
> >> --- news://freenews.netfront.net/ - complaints: n...@netfront.net ---
> >> --
> >> http://mail.python.org/mailman/listinfo/python-list
> >>
> >
> > Hi,
> > if you need to extract meaningful keywords in terms of data mining
> > using natural language processing, it might become quite a complex
> > task, depending on the requirements; the NLTK toolkit may help with
> > some approaches [ http://www.nltk.org/ ].
> > One possibility would be to filter out the frequent but less
> > meaningful words ("stopwords") and extract the most frequent words
> > from the remainder, e.g. (with some simplifications/hacks in
> > interactive mode):
> >
> >>>> import re, urllib2, nltk
> >>>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
> >>>> page_plain = nltk.clean_html(page_src).lower()
> >>>> txt_filtered = nltk.Text((word for word in re.findall(r"(?u)\w+", page_plain) if word not in set(nltk.corpus.stopwords.words("english"))))
> >>>> frequency_dist = nltk.FreqDist(txt_filtered)
> >>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]
> > [(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
> > (u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
> > (u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
> > (u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
> > (u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
> > (u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help',
> > 3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
> > (u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability',
> > 3), (u'readable', 3), (u'write', 3)]
> >>>>
> >
> > Another possibility would be to extract particular parts of speech
> > (e.g. nouns, adjectives, verbs) with nltk.pos_tag(input_txt);
> > for more convoluted HTML, BeautifulSoup might be used, and
> > there are likely many other options.
> >
> > hth,
> >    vbr
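
The session quoted above is Python 2 (urllib2), and nltk.clean_html was
removed in NLTK 3, so it will not run unchanged on a current setup. Below is
a rough Python 3 sketch of the same stopword-and-frequency idea using only
the standard library. It is not vbr's code: the names STOPWORDS,
TextExtractor, html_to_text and keyword_counts are made up for illustration,
the stopword list is a tiny placeholder (nltk.corpus.stopwords has a much
fuller one), and the function takes an HTML string rather than fetching a
URL.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = frozenset(
    "a an and are as at be but by for from had has have he her his i if in "
    "is it its not of on or she that the their them they this to was we "
    "were which with you".split()
)


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def keyword_counts(html, min_freq=3):
    """Return (word, count) pairs for non-stopwords seen at least min_freq times."""
    words = re.findall(r"\w+", html_to_text(html).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [(w, n) for w, n in counts.most_common() if n >= min_freq]
```

With min_freq=3 this matches the "freq > 2" filter in the quoted session. As
Cross notes, raw frequency is a naive signal, so treat the result as a
starting point only.
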
> I had considered nltk. That is why I said that straightforward frequency 
> calculation of words would be naive. I have to look into this BeautifulSoup 
> thing.
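
On the original question of verifying meta keywords: one simple check,
sketched below under my own assumptions, is to count how often each declared
keyword actually occurs in the page text and treat keywords that never occur
as suspect. This is only a heuristic, not something proposed in the thread.
The names KeywordPageParser and check_meta_keywords are hypothetical, and the
parsing uses the standard library's html.parser rather than BeautifulSoup.

```python
import re
from html.parser import HTMLParser


class KeywordPageParser(HTMLParser):
    """Collect <meta name="keywords"> entries and the page's text nodes."""

    def __init__(self):
        super().__init__()
        self.meta_keywords = []
        self.text_chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Keywords are conventionally comma-separated.
            self.meta_keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chunks.append(data)


def check_meta_keywords(html):
    """Map each declared meta keyword to its number of occurrences in the text."""
    parser = KeywordPageParser()
    parser.feed(html)
    parser.close()
    # Reduce the text to lowercase words separated by single spaces.
    text = " ".join(re.findall(r"\w+", " ".join(parser.text_chunks).lower()))
    result = {}
    for keyword in parser.meta_keywords:
        phrase = " ".join(re.findall(r"\w+", keyword))
        if phrase:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            result[keyword] = len(re.findall(pattern, text))
        else:
            result[keyword] = 0
    return result
```

A keyword that maps to 0 is declared in the meta tag but never appears in the
text. Exact matching will miss plurals and synonyms, so stemming or the
part-of-speech approach vbr mentions may be needed on top of this.
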
